DigitalPebble's Blog: What's new in Storm-Crawler 0.6

We have just released version 0.6 of Storm-Crawler, an open source web crawling SDK based on Apache Storm. Storm-Crawler provides resources for building scalable, low-latency web crawlers and is used in production at various companies.

We have added loads of improvements and bug fixes since our previous release last June, thanks to the efforts of the community. The activity around the project has been very steady and a new committer (Jorge Luis Betancourt) has joined our ranks. We also had contributions from various users, which is great.

Here are the main features of version 0.6.

Dependency upgrades

Storm 0.9.5
crawler-commons 0.6
Tika 1.10

Code reorganisation

Organise external content as separate sub-modules #145
Removed external/metrics #160

API changes

ParseFilter changed from an interface to an abstract class #159
Parse can output more than one document #135
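For anyone with custom filters, a change from interface to abstract class usually means replacing `implements` with `extends`. The sketch below illustrates the pattern only: `SketchParseFilter`, its method signatures and `TextLengthFilter` are hypothetical names made up for this example and are not the actual Storm-Crawler classes. It also assumes one common motivation for such a change, namely that the base class can ship default behaviour (here a no-op `configure`).

```java
import java.util.HashMap;
import java.util.Map;

public class ParseFilterSketch {

    // Hypothetical stand-in for an abstract filter base class; NOT the real Storm-Crawler API.
    static abstract class SketchParseFilter {
        // Default no-op, so subclasses only override it when they need configuration.
        public void configure(Map<String, String> params) {
        }

        // Subclasses must supply the actual filtering logic.
        public abstract void filter(String url, String text, Map<String, String> metadata);
    }

    // User code extends the base class instead of implementing an interface.
    static class TextLengthFilter extends SketchParseFilter {
        @Override
        public void filter(String url, String text, Map<String, String> metadata) {
            metadata.put("text.length", String.valueOf(text.length()));
        }
    }

    // Runs the hypothetical filter on one document and returns the metadata it produced.
    public static Map<String, String> run(String url, String text) {
        Map<String, String> metadata = new HashMap<>();
        SketchParseFilter filter = new TextLengthFilter();
        filter.configure(new HashMap<String, String>());
        filter.filter(url, text, metadata);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(run("http://example.com/", "hello")); // prints {text.length=5}
    }
}
```

Check the ParseFilter class in the release itself for the real method names before porting existing filters.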

New features and resources

SimpleFetcherBolt enforces politeness #181
New RobotsURLFilter #178
New ContentFilter to restrict text of document to XPath match #150
Adding support for using the canonical URL in the IndexerBolts #161
Improvement to SitemapParserBolt #143
Enforce robots meta instructions #148
Expand XPathFilter to accept a list of expressions as an argument #153
JSoupParserBolt does a basic check of the content type #151
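To give an idea of what accepting a list of XPath expressions makes possible, here is a minimal sketch that evaluates several expressions against one document and collects the text of every match. It uses only the standard `javax.xml.xpath` API; the class `MultiXPathSketch` and its `extract` method are hypothetical and do not reflect how XPathFilter is implemented or configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MultiXPathSketch {

    // Evaluates each expression in order and collects the non-empty text of all matching nodes.
    public static List<String> extract(String xml, List<String> expressions) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> values = new ArrayList<>();
        for (String expression : expressions) {
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // The input must be well-formed XML for the standard DOM parser used here.
        String page = "<html><head><title>Release notes</title></head>"
                + "<body><h1>Storm-Crawler 0.6</h1></body></html>";
        // prints [Release notes, Storm-Crawler 0.6]
        System.out.println(extract(page, Arrays.asList("//title", "//h1")));
    }
}
```

Listing several expressions for the same field is handy when sites mark up the same information in different ways.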

External resources

The external (non-core) resources have been separated into discrete sub-modules as their number was getting larger.

SOLR

Our brand new module for Apache SOLR (see #152) is comparable to the existing ElasticSearch equivalent and provides an IndexerBolt, a MetricsConsumer, a SOLRSpout and a StatusUpdaterBolt.

SQL

Not all web crawls require scalable big data solutions. I conducted a survey of Apache Nutch users some time ago which showed that most people used it on a single machine and on fewer than a million URLs. These are often people crawling a single website. With that in mind, we added spout and StatusUpdaterBolt implementations that use MySQL as storage for the URL status, which is useful for small recursive crawls. See #172 for details.
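The division of labour between a spout and a status updater in a recursive crawl can be sketched with an in-memory map standing in for the database table. This is only an illustration of the pattern: `StatusStoreSketch`, the `Status` values and both methods are hypothetical, and the actual module stores the status in MySQL rather than in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StatusStoreSketch {

    // Hypothetical status values, chosen for this example only.
    enum Status { DISCOVERED, FETCHED, ERROR }

    // In-memory stand-in for a status table keyed by URL (the real module uses MySQL).
    private final Map<String, Status> table = new LinkedHashMap<>();

    // Status-updater role: a DISCOVERED URL is only added if unknown, other statuses overwrite.
    public void update(String url, Status status) {
        if (status == Status.DISCOVERED) {
            table.putIfAbsent(url, status);
        } else {
            table.put(url, status);
        }
    }

    // Spout role: return up to max URLs that are still waiting to be fetched.
    public List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Status> entry : table.entrySet()) {
            if (batch.size() >= max) {
                break;
            }
            if (entry.getValue() == Status.DISCOVERED) {
                batch.add(entry.getKey());
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        StatusStoreSketch store = new StatusStoreSketch();
        store.update("http://example.com/", Status.DISCOVERED);
        store.update("http://example.com/", Status.FETCHED);
        store.update("http://example.com/about", Status.DISCOVERED);
        // An outlink pointing back to an already fetched page must not reset its status.
        store.update("http://example.com/", Status.DISCOVERED);
        System.out.println(store.nextBatch(10)); // prints [http://example.com/about]
    }
}
```

Ignoring rediscovered URLs is what stops a small recursive crawl from fetching the same pages over and over.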

AWS CloudSearch

There is also a new AWS module containing an IndexerBolt for Amazon CloudSearch (see #174).

We hope that people find these improvements useful and would like to thank all users and contributors.

Friday 4 September 2015
