Friday, 8 January 2016

What's new in Storm-Crawler 0.8

Storm Crawler

There has been quite a bit happening with Storm-Crawler recently. We got a proper logo (see above) and the project now has a website. We also just released version 0.8, which contains the following changes:
  • the groupId for the Maven artefacts is now com.digitalpebble.stormcrawler (#218)
  • [ES] Deactivate _all field for status and metrics indices (#228)
  • [SOLR] Check that not more than one instance of the Spout exists (#213)
  • Upgraded to Storm 0.9.6
  • Discover sitemap files automatically from robots.txt (#211)
  • Use Travis-CI to check commits and PRs
  • Replaced RandomURLSpout with MemorySpout + added MemoryStatusUpdater (#224)
  • Maven archetype (#225)

The Maven archetype is the main attraction of this new release. It allows users to bootstrap a new storm-crawler-based project using the following command:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=0.8 

The command runs in interactive mode, where users can then specify the groupId, artifactId, version and package name to use for their new project.

This results in a fully formed project, complete with a Maven pom file, the default CrawlTopology currently in the core module as well as a README, a crawler-conf.yaml and a set of resources (url and parse filters). This project can then be compiled and run in the usual ways.
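For scripted setups, the interactive prompts can probably be skipped altogether. The sketch below wraps the same archetype call in a small shell function and passes the project coordinates on the command line, relying on Maven's batch mode (-B) and the standard groupId, artifactId, version and package properties of archetype:generate. The function name, the default version and the example coordinates are assumptions made for illustration and are not part of storm-crawler.

```shell
#!/usr/bin/env bash
# Sketch: generate a storm-crawler 0.8 project without interactive prompts.
# The function name and the default version below are hypothetical choices.
generate_crawler_project() {
  local group_id="$1"                  # groupId of the NEW project, e.g. com.example
  local artifact_id="$2"               # artifactId of the NEW project, e.g. my-crawler
  local version="${3:-1.0-SNAPSHOT}"   # version of the NEW project

  # -B runs Maven in batch (non-interactive) mode.
  # The package name is assumed to be the same as the groupId here.
  mvn archetype:generate -B \
    -DarchetypeGroupId=com.digitalpebble.stormcrawler \
    -DarchetypeArtifactId=storm-crawler-archetype \
    -DarchetypeVersion=0.8 \
    -DgroupId="$group_id" \
    -DartifactId="$artifact_id" \
    -Dversion="$version" \
    -Dpackage="$group_id"
}
```

Calling generate_crawler_project com.example my-crawler should create a my-crawler directory, which can then be built with mvn clean package from inside that directory.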

Having this archetype should help new users get a better understanding of how storm-crawler works, and it will also simplify the code if we remove the resources used in the archetype from the core module (see #227). It also illustrates nicely how lightweight storm-crawler is: most people will probably stop cloning the repository and just use the dependencies instead.

We should see plenty of further improvements in the coming months, in particular an upgrade to Storm 0.10 (#229) and of course more content on our brand new website.

Thanks to all users and contributors. Happy crawling!