StormCrawler 1.3 has just been released! As usual, all users are advised to upgrade to this version, as it fixes some bugs, adds quite a few new features and improves performance.
- Jsoup 1.10.1
- Crawler-Commons 0.7
- RomeTools 1.7.0
- ICU4J 58.2
- Hardcoded limit to the max # connections allowed by protocol #388
- LangID module #364
- JsoupParserBolt can use first N bytes for charset detection (or not at all) #391
- SimpleFetcherBolt uses allowRedir from super class #394 (bugfix)
- URLNormalizer: decode non-standard percent encoding prior to re-encoding
- MaxDepthFilter: the max depth now defaults to -1 (no filtering), 0 removes all outlinks, and a custom max depth can be set per URL with max.depth. Implements #399 and #400
The MaxDepthFilter change breaks compatibility with previous versions: 0 used to deactivate the filtering by depth, whereas it now prevents any outlinks from being processed. Please change your config to -1 if you want to deactivate the filtering.
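To make the new depth semantics concrete, here is a minimal sketch of the decision described above. It is not the actual MaxDepthFilter source: the class and method names are hypothetical, the metadata is modelled as a plain `Map`, and it assumes that seeds have depth 0, that an outlink's depth is its parent's depth plus one, and that only a non-negative per-URL `max.depth` overrides the configured value.

```java
import java.util.Map;

/** Illustrative sketch only; names are hypothetical, not StormCrawler's API. */
final class MaxDepthSketch {

    /**
     * Decides whether an outlink should be kept.
     *
     * @param sourceDepth        depth of the page the outlink was found on (assumed 0 for seeds)
     * @param sourceMetadata     metadata of that page; may carry a per-URL "max.depth"
     * @param configuredMaxDepth global limit from the filter config (-1 by default)
     */
    static boolean keepOutlink(int sourceDepth, Map<String, String> sourceMetadata,
                               int configuredMaxDepth) {
        int max = configuredMaxDepth;

        // Assumption: a valid, non-negative per-URL value takes precedence.
        String custom = sourceMetadata.get("max.depth");
        if (custom != null) {
            try {
                int parsed = Integer.parseInt(custom.trim());
                if (parsed >= 0) {
                    max = parsed;
                }
            } catch (NumberFormatException e) {
                // Unparseable value: fall back to the configured limit.
            }
        }

        if (max < 0) {
            return true;   // -1: filtering by depth is deactivated
        }
        if (max == 0) {
            return false;  // 0: every outlink is removed
        }
        // Keep the outlink only if its own depth stays within the limit.
        return sourceDepth + 1 <= max;
    }
}
```

Under these assumptions, a config that previously used 0 to mean "no limit" would now drop every outlink, which is why the value needs to be changed to -1.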
- Flux for crawl and injection topologies #372
- Use min delay for all types of Spouts #370
- Remove Node client #377
- ESSpout deals with deep paging before building query
- Topology status updater triaged by URL to hit cache
- Settings done via configuration #376
- Add plugin to the clients via configuration #378
- Spouts: load results with a non-blocking call #371
- Concurrent requests in config #382
- StatusUpdaterBolt - do not add URL already in buffer for ES if status is DISCOVERED
- Allow fieldNameForRoutingKey to be outside metadata and use a different key for spouts #384
- Use SHA256 as doc_id #385
- Separate Kibana schema for status and metrics + put all schemas in a separate folder
- Improvements to ES_IndexInit
- ES crawl topology uses FetcherBolt
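For the SHA256 doc_id item in the list above, the following sketch shows one way to derive such an identifier from a URL with the JDK alone. The class and method names are hypothetical, and hashing the UTF-8 bytes of the URL and rendering the digest as lowercase hex are assumptions rather than a description of the module's exact behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch only; not taken from the Elasticsearch module. */
final class DocIdSketch {

    /** Returns the SHA-256 digest of the URL as a 64-character lowercase hex string. */
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to treat the byte as unsigned before formatting.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Whatever the length of the URL, the resulting identifier is always 64 characters long.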
Please note that the cluster name is now defined alongside the other settings:
We are fast approaching our 1,000th commit! Thanks to all users and contributors for their help with StormCrawler. Happy crawling!
PS: I will be running a 1-day workshop in Berlin on the 2nd of February. Announcements will be made on our Twitter account.