DigitalPebble's Blog: What's new in StormCrawler 1.3

StormCrawler 1.3 has just been released! As usual, all users are advised to upgrade: this version fixes several bugs, adds quite a few new features and improves performance.

Dependency upgrades

Jsoup 1.10.1
Crawler-Commons 0.7
RomeTools 1.7.0
ICU4J 58.2

Core module

Hardcoded limit to the max # connections allowed by protocol #388
LangID module #364
JsoupParserBolt can use first N bytes for charset detection (or not at all) #391
SimpleFetcherBolt uses allowRedir from super class #394 (bugfix)
URLNormalizer : Decode non-standard percent encoding prior to re-encoding
MaxDepthFilter: the default is now -1, a value of 0 removes all outlinks, and a custom max depth can be set per URL with max.depth. Implements #399 and #400

The MaxDepthFilter change breaks compatibility with previous versions: 0 used to deactivate the filtering by depth, whereas it now prevents any outlinks from being processed. Please change your config to -1 if you want to deactivate the filtering.
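To make the new MaxDepthFilter semantics concrete, here is a minimal Java sketch of the decision described above. It is not the actual StormCrawler code: the class and method names are hypothetical, and it assumes that a per-URL max.depth value takes precedence over the default and that an outlink sits one level deeper than the page it was found on.

```java
// Illustrative sketch only; MaxDepthSketch and keepOutlink are hypothetical
// names, not part of the StormCrawler API.
final class MaxDepthSketch {

    /**
     * Decides whether an outlink should be kept.
     *
     * @param defaultMax  filter-wide max depth (-1 deactivates the filtering)
     * @param perUrlMax   per-URL max.depth value, or -1 when none is set
     * @param parentDepth depth of the page on which the outlink was found
     */
    static boolean keepOutlink(int defaultMax, int perUrlMax, int parentDepth) {
        // Assumption: a per-URL value, when present, overrides the default.
        int max = perUrlMax >= 0 ? perUrlMax : defaultMax;
        if (max < 0) {
            return true; // -1: no filtering by depth
        }
        if (max == 0) {
            return false; // 0: every outlink is removed
        }
        // Assumption: the outlink would be at parentDepth + 1.
        return parentDepth < max;
    }

    public static void main(String[] args) {
        System.out.println(keepOutlink(-1, -1, 50)); // filtering deactivated
        System.out.println(keepOutlink(0, -1, 0));   // all outlinks removed
        System.out.println(keepOutlink(0, 3, 2));    // per-URL value applies
    }
}
```

The point to take away is the difference between -1 and 0: a config that relied on 0 to switch the filter off would now silently drop every outlink.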

Elasticsearch

Flux for crawl and injection topologies #372
Use min delay for all types of Spouts #370
Remove Node client #377
ESSpout deals with deep paging before building query
Topology status updater triaged by URL to hit cache
Settings done via configuration #376
Add plugin to the clients via configuration #378
Spouts: load results with a non-blocking call #371
Concurrent requests in config #382
StatusUpdaterBolt - do not add URL already in buffer for ES if status is DISCOVERED
Allow fieldNameForRoutingKey to be outside metadata and use a different key for spouts #384
Use SHA256 as doc_id #385
Separate Kibana schema for status and metrics + put all schemas in a separate folder
Improvements to ES_IndexInit
ES crawl topology uses FetcherBolt

Please note that the cluster name is now defined alongside the other settings:

  es.status.settings:
    cluster.name: "elasticsearch"

One of the benefits of #376 and #378 is that you can now use StormCrawler with Elastic Cloud protected with Shield.

We are fast approaching our 1,000th commit! Thanks to all users and contributors for their help with StormCrawler. Happy crawling!

PS: I will be running a 1-day workshop in Berlin on the 2nd of February. Announcements will be made on our Twitter account.

Tuesday 10 January 2017
