Wednesday 6 July 2011

Crawler-Commons 0.1 released

As announced on various mailing lists:

The initial release of crawler-commons is available from http://code.google.com/p/crawler-commons/downloads/list


The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components should benefit from collaboration among the various existing web crawler projects and reduce duplication of effort. 

The current version contains resources for:
- parsing robots.txt (see the sketch below)
- parsing sitemaps
- a URL analyzer which returns top-level domains
- a simple HttpFetcher
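
For robots.txt, here is a minimal sketch of how the parser could be used. It assumes the SimpleRobotRulesParser / BaseRobotRules classes from the robots package; the method names are based on later versions of the library and may differ slightly in 0.1.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical robots.txt content; in practice it would come from
        // fetching http://example.com/robots.txt
        byte[] robotsTxt = "User-agent: *\nDisallow: /private/".getBytes("UTF-8");

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Assumed signature: parseContent(url, content, contentType, robotName)
        BaseRobotRules rules = parser.parseContent("http://example.com/robots.txt",
                robotsTxt, "text/plain", "mycrawler");

        System.out.println(rules.isAllowed("http://example.com/private/page.html")); // false
        System.out.println(rules.isAllowed("http://example.com/index.html"));        // true
    }
}
```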

This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.

Please send your questions, comments or suggestions to http://groups.google.com/group/crawler-commons

Doing the release was quite an interesting experience as I'd never done one before. It was an opportunity to have a closer look at Ant + Maven, how to publish artefacts, how to use Nexus, etc., which I am sure will be useful at some point (Behemoth? GORA? Nutch?).

Now that crawler-commons is released we can start using it from Nutch and Bixo [see https://issues.apache.org/jira/browse/NUTCH-1031].

Sunday 12 June 2011

Nutch 1.3 released + BerlinBuzzwords presentation

Nutch 1.3 has been released and contains quite a few changes, some of which have been retrofitted from Nutch 2.0 in trunk.

The main change is that Nutch now relies entirely on SOLR for indexing and searching: the Lucene-based indexer and the search webapps have been removed (NUTCH-837). The dependencies are now managed with Apache Ivy (NUTCH-821) and we've upgraded SOLR to 3.1 and Tika to 0.9. Another important change is that there are now two separate runtime environments for the local and deployed configurations (NUTCH-843). Nutch 1.3 contains many more improvements and bugfixes, so if you use Nutch you should probably migrate to it.

The presentation I gave this week at BerlinBuzzwords is now available online; it gives an overview of Nutch and covers both 1.3 and 2.0. The conference itself was great and I met quite a few Nutch users and people planning to use it, as well as Doug Cutting, the creator of Nutch himself!

There are quite a few things planned for the next release(s) and also a large amount of work to do on the documentation which is a bit dated and patchy. Luckily some new committers have recently joined the project and seem keen to help with this.

Friday 27 May 2011

Parsing the Enron email dataset using Tika and Hadoop

In order to parse a large collection of emails, such as the Enron Email Dataset, we might choose to use Apache Hadoop, a scalable computing framework, and Apache Tika, a content analysis toolkit. This can be done easily with Behemoth, an open source platform for large scale document analysis developed by DigitalPebble. For more details of Behemoth, see the Behemoth Tutorial.

Using the August 21, 2009 version of the dataset, the first step is to use Behemoth's CorpusGenerator to create a corpus of BehemothDocuments from the Enron Dataset in HDFS. A BehemothDocument is the native object used by Behemoth. At ingest, it contains the original document, its content type and URL. After processing by a Behemoth module, it also contains the extracted text, additional metadata and annotations created about the document.
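
As a rough illustration, here is how such a corpus could be inspected once it has been written to HDFS. This is only a sketch: it assumes the corpus is a SequenceFile keyed by URL (Text) with BehemothDocument values, and that BehemothDocument exposes the getters shown.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import com.digitalpebble.behemoth.BehemothDocument;

public class CorpusDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path(args[0]); // e.g. a part file of the generated corpus

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Text key = new Text();
        BehemothDocument doc = new BehemothDocument();
        while (reader.next(key, doc)) {
            // Before the Tika step only the URL, content type and raw bytes are set;
            // after it, the extracted text, metadata and annotations are populated too.
            System.out.println(doc.getUrl() + "\t" + doc.getContentType());
        }
        reader.close();
    }
}
```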

Once the dataset has been ingested, the next step is to use the Behemoth Tika module to run a Hadoop Map/Reduce job that extracts the content of the emails and metadata about them. Using Apache Tika 0.9, 5% of the documents fail to parse correctly; with the latest version of Tika (a 1.0 snapshot, revision 825923) only 0.2% of the documents fail.
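
Under the hood this boils down to running Tika on each document, along the lines of the sketch below. This uses the standard Tika API rather than the actual Behemoth mapper code, and the sample file name is made up.

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class EmailParse {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();

        // Hypothetical sample message on the classpath
        InputStream stream = EmailParse.class.getResourceAsStream("/sample.eml");
        parser.parse(stream, handler, metadata, new ParseContext());

        System.out.println(metadata);           // e.g. subject, sender, date headers
        System.out.println(handler.toString()); // extracted text of the message
    }
}
```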

One way to investigate why parsing fails is to look at the user logs generated by Hadoop, which contain the exceptions thrown for the failing documents. An alternative is to write a custom reducer that groups the exceptions thrown by Tika, using the exception stack trace as the key and the document URLs as the values. With Tika revision 825923, four distinct exceptions are thrown, caused by two underlying problems: lines longer than 10,000 characters, the current default limit in the Tika mail parser, and malformed dates. The first problem can be solved by increasing the maximum line length in a MimeEntityConfig object and modifying TikaProcessor to pass it into the ParseContext.
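
Such a reducer could look roughly like this. It is only a sketch using the old mapred API, and it assumes the map step emits a (stack trace, document URL) pair for every failed parse.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ExceptionGroupingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text stackTrace, Iterator<Text> urls,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // All URLs sharing the same stack trace arrive grouped under one key,
        // which makes it easy to see which exceptions dominate.
        while (urls.hasNext()) {
            output.collect(stackTrace, urls.next());
        }
    }
}
```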

As for the second problem, the mail parser in Tika currently performs strict parsing, i.e. parsing of a document fails as soon as parsing of one of its fields fails. TIKA-667 contains a contribution that makes it possible to turn off strict parsing, so that some data can still be extracted from the emails with malformed dates. This can also be configured via MimeEntityConfig. With these changes incorporated, all documents are processed correctly.
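
Putting the two fixes together, the configuration could look something like the sketch below. The setter names (setMaxLineLen, setStrictParsing) and the chosen values are assumptions based on mime4j 0.6 and TIKA-667, and the import path for MimeEntityConfig differs between mime4j versions.

```java
import org.apache.james.mime4j.parser.MimeEntityConfig; // package may differ across mime4j versions
import org.apache.tika.parser.ParseContext;

public class MailParseContext {
    // Builds a ParseContext that relaxes the mail parser's limits.
    public static ParseContext create() {
        MimeEntityConfig config = new MimeEntityConfig();
        config.setMaxLineLen(50000);     // raise the 10,000 character default
        config.setStrictParsing(false);  // keep parsing despite e.g. malformed dates
        ParseContext context = new ParseContext();
        context.set(MimeEntityConfig.class, config);
        return context; // passed to the parser, e.g. by TikaProcessor
    }
}
```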

Saturday 7 May 2011

Nutch talk at Berlin Buzzwords 2011

I'll be giving a talk on Apache Nutch at Berlin Buzzwords.

This talk will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, Lucene, SOLR, Tika or HBase. The presentation will contain examples of real-world use cases.

The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the forthcoming version 2.0.

Tuesday 22 March 2011

Search for US properties with SOLR and Maptimize

Our clients 5k50 have recently opened a preview of their real-estate search system, which is based on Apache SOLR and Maptimize. Maptimize is a very nice tool which manages the display of data on Google Maps by merging markers that are geographically close together.

We initially audited the existing SOLR setup, then redesigned it to add more functionality and optimise the search speed. The search itself is an interesting mix of map-driven filtering with SOLR queries and faceting. Any change to the map (clicking on a cluster, zooming in or out) is reflected in the search results and facets, and vice-versa.
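
With SolrJ, a map-driven query of this kind might look roughly like the sketch below. The field names and the geofilt parameters are made up for illustration; only the SolrJ calls themselves are standard (geofilt is available from SOLR 3.1 onwards).

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MapSearch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        // Restrict results to the area currently shown on the map
        // (hypothetical 'location' field, centre point and radius in km)
        query.addFilterQuery("{!geofilt sfield=location pt=40.75,-73.99 d=5}");
        // Facets drive the counts shown next to the search filters
        query.setFacet(true);
        query.addFacetField("property_type", "price_band");

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound() + " properties found");
    }
}
```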

Navia, the resulting site, is a nice showcase for some of the most commonly used features of SOLR (faceting, more-like-this, autocompletion) and has a great identity thanks to its mix of geo and text search. It is currently in beta, so we can expect a few more improvements over the next few weeks.

And please feel free to give it a try so that we can get plenty of data on the performance :-)

Saturday 19 March 2011

DigitalPebble is hiring!

We are looking for a candidate with the following skills and expertise:
  • strong background in NLP and Java
  • GATE, experience of writing plugins and PRs, excellent knowledge of JAPE
  • IE, Linked Data, Ontologies
  • statistical approaches and machine learning
  • large scale computing with Hadoop
  • knowledge of the following technologies / tools: Lucene, SOLR, NoSQL, Tika, UIMA, Mahout
  • good social and presentation skills
  • good spoken and written English, knowledge of other languages would be a plus
  • taste for challenges and problem solving

DigitalPebble is located in Bristol (UK) and specialises in open source solutions for text engineering.

More details on our activities can be found on our website. We would consider candidates working remotely with occasional travel to Bristol and to our clients in the UK and Europe. Being located in or near Bristol would be a plus.

This job is an opportunity to get involved in the growth of a small company, work on interesting projects and take part in various Apache related projects and events. Bristol is also a great place to live.


Please send your CV and cover letter before the 15th of April 2011 to job@digitalpebble.com


Best regards,

Julien Nioche

Monday 21 February 2011

Watson, the computer Behemoth in Jeopardy!

Alex Popescu's excellent blog mentioned the DeepQA project and IBM's supercomputer Watson, following Watson's recent appearance on the US TV show Jeopardy!. Interestingly, DeepQA uses both Apache Hadoop and UIMA to analyse large volumes of documents and build its knowledge base.

As explained in https://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf:
"To preprocess the corpus and create fast run-time indices we used Hadoop. UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process."
which is exactly what Behemoth does (how very reassuring!).

The article also mentions UIMA-AS, and it is not entirely clear which part of the system uses which: is UIMA-AS used for the runtime analysis of the questions and Hadoop for the background learning?

It would be interesting to know what sort of UIMA annotators were used internally for the analysis of the text and, more importantly from Behemoth's point of view, whether Behemoth could have been used for this project and/or what features would have been required to get it to work on DeepQA.