DigitalPebble's Blog: March 2016

Version 0.9 of Storm-Crawler has just been released. It contains many improvements and bugfixes, and we recommend that all existing users upgrade to it.

Here are the main changes:

Core

Moved to Storm 0.10.0 #229
FetcherBolt can dump content of its queues to log #45
PluggableSchedulers #245
Bugfix: HTTP protocol setConnectionRequestTimeout
Sitemap: option to filter out URLs older than a certain threshold #249
New URLFilter: remove links to self #252
HTTP protocol can limit amount of content fetched #206
AbstractStatusUpdaterBolt allows proper acking of tuples #241
Improvements to URL normalization #264 #205 #120
Improvements to Robots caching #265

Elasticsearch

Spout: one instance per shard #198
AggregationSpout #237
ES-based Spouts use a static Client instance #258
ES: use NodeClient only if no address is specified #260
StatusUpdaterBolts must be able to ack/fail explicitly #216

The AggregationSpout in particular is a very useful feature. The existing ElasticSearchSpout offers very few guarantees regarding the diversity of hosts retrieved by its queries. Randomizing the results improves this, but it increases the memory Elasticsearch uses for field caching; limiting the field cache leads to poor performance, and the eviction mechanism slows the queries considerably. The AggregationSpout, on the other hand, guarantees a good diversity of URLs by bucketing the search results per hostname (or TLD or IP, depending on the value of es.status.routing.fieldname). It is likely to be improved further once we move to Elasticsearch 2.x (see below).
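To illustrate why bucketing helps with diversity, here is a minimal sketch in plain Python. It is not the AggregationSpout code and does not talk to Elasticsearch: the function name and the two limits (max_buckets, max_urls_per_bucket) are hypothetical and only mimic the idea of capping how many URLs any single host can contribute to a batch.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def bucket_by_host(urls, max_buckets, max_urls_per_bucket):
    """Group candidate URLs by hostname, capping the number of buckets
    and the number of URLs kept in each bucket (hypothetical limits)."""
    buckets = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # Skip hosts we have not seen yet once enough buckets exist.
        if host not in buckets and len(buckets) >= max_buckets:
            continue
        # Keep at most max_urls_per_bucket URLs for any one host.
        if len(buckets[host]) < max_urls_per_bucket:
            buckets[host].append(url)
    return dict(buckets)


candidates = [
    "http://a.com/1",
    "http://a.com/2",
    "http://a.com/3",
    "http://b.org/1",
    "http://c.net/1",
]
batch = bucket_by_host(candidates, max_buckets=2, max_urls_per_bucket=2)
```

With these inputs a.com contributes only two of its three URLs, so a host with many pending URLs cannot crowd the others out of the batch.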

Both ES spout implementations benefit from the sharding mechanism introduced in #198: if the configuration specifies es.status.routing: true, the StatusUpdaterBolt directs the URLs to specific shards based on the value of partition.url.mode, i.e. all the URLs for a particular host or TLD are colocated on the same Elasticsearch shard. One consequence is that shard sizes can become uneven, depending on the distribution of URLs in your crawl. Another is that it is now possible to have one Spout instance per shard and parallelise the reads from ES while preserving politeness.
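The colocation property can be sketched as follows. This is only an illustration under stated assumptions: routing_key and shard_for are hypothetical helpers, the key is always the hostname, and zlib.crc32 stands in for whatever hash Elasticsearch applies to the routing value internally.

```python
import zlib
from urllib.parse import urlsplit


def routing_key(url):
    """Return the hostname, standing in for a per-host partition mode."""
    return urlsplit(url).hostname or ""


def shard_for(url, num_shards):
    """Map a URL to a shard index from a stable hash of its routing key."""
    # crc32 is a stand-in only; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_key(url).encode("utf-8")) % num_shards


first = shard_for("http://example.com/a", 5)
second = shard_for("http://example.com/b?x=1", 5)
```

Because the shard depends only on the routing key, first and second are always equal: every URL of a given host lands on the same shard, which is what lets a single spout instance per shard enforce politeness for its hosts.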

Finally, you should see a noticeable improvement in performance now that the StatusUpdaterBolt acks/fails the tuples explicitly (#216). Previously, all tuples were automatically acked as soon as they were buffered for indexing to ES, i.e. they could be acked in the spout and removed from its internal cache even though the updates were not yet committed to the ES index. This meant that the spout could resend the same URL down the topology after querying ES for new URLs. This was obviously quite wasteful, but it is now fixed!
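The difference between acking on buffering and acking on commit can be sketched like this. The class below is a hypothetical stand-in, not the Storm or Storm-Crawler API: commit, ack and fail are plain callables supplied by the caller, and batch_size is an illustrative parameter.

```python
class BufferedStatusUpdater:
    """Buffers status updates and acks them only after a successful commit."""

    def __init__(self, commit, ack, fail, batch_size=2):
        self.commit = commit        # callable that writes a batch to the index
        self.ack = ack              # callable invoked per tuple on success
        self.fail = fail            # callable invoked per tuple on failure
        self.batch_size = batch_size
        self.pending = []

    def store(self, tuple_id, url, status):
        # Buffering alone does NOT ack: the update is not committed yet.
        self.pending.append((tuple_id, url, status))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        batch, self.pending = self.pending, []
        if not batch:
            return
        try:
            self.commit([(url, status) for _, url, status in batch])
        except Exception:
            # The commit failed, so every tuple in the batch is failed.
            for tuple_id, _, _ in batch:
                self.fail(tuple_id)
            return
        # Only now is it safe to ack the tuples.
        for tuple_id, _, _ in batch:
            self.ack(tuple_id)


acked, failed, committed = [], [], []
updater = BufferedStatusUpdater(committed.extend, acked.append, failed.append)
updater.store(1, "http://a.com/", "FETCHED")
acked_after_first_store = list(acked)
updater.store(2, "http://b.org/", "FETCHED")
```

After the first store nothing has been acked; the acks happen only once the second store fills the batch and the commit succeeds, so a tuple is never reported as done while its update is still sitting in the buffer.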

What's next?

We should see plenty of further improvements in the coming months, in particular an upgrade to Apache Storm 1.0 (which is due very soon) and a move to Elasticsearch 2.x (#257). Thanks to all users and contributors. Happy crawling!

Wednesday 16 March 2016

What's new in Storm-Crawler 0.9