Monday, 15 May 2017

Avoid these common pitfalls when upgrading StormCrawler with Elasticsearch 5.x

The next (and probably imminent) release of StormCrawler will contain an update of Elasticsearch to version 5.3. This is definitely a good thing, as we want to keep up with the latest versions of Elasticsearch but has a few pitfalls when upgrading your existing application. Some of the changes are documented in the README but I will reiterate them here, just in case.

LOG4J dependencies
ES5 requires an upgrade in the logging dependencies of Apache Storm. You can update the dependencies in your existing Storm cluster by hand but since my patch is part of Storm 1.1.0, you should probably upgrade Storm altogether. StormCrawler 1.5 will depend on Storm 1.1.0 (but probably works with older versions as well).

Maven Shade Configuration
The pom file of your StormCrawler-based project needs modifying as well, you'll need to specify the Maven Shade Configuration and include:
<manifestEntries>
 <Change></Change>
 <Build-Date></Build-Date>
</manifestEntries>

See https://github.com/elastic/elasticsearch/issues/21627; this wasn't an issue with the previous versions of Elasticsearch.

Update es-conf.yaml
In particular, the value of es.status.bucket.field used to be _routing, which is an automatically generated field, however this is not available for the spouts anymore. Instead, use the same value as es.status.routing.fieldname e.g. metadata.hostname

Mapping
ES5 should be able to read your existing indices, however, if you create a new set of indices from scratch, make sure you use the latest version of the script.

I hope this will help you for a successful upgrade, I will cover the new functionalities and improvements coming with StormCrawler 1.5 when it is released.

Happy crawling