This talk will give an overview of Apache Nutch, its main components, how it fits with other Apache projects and its latest developments.
Apache Nutch was started exactly 10 years ago and was the starting point for what later became Apache Hadoop and Apache Tika. Today Nutch is the reference tool for large-scale web crawling.
In this talk I will give an overview of Apache Nutch, describe its main components, and show how Nutch fits with other Apache projects such as Hadoop, Solr, and Tika.
The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the 2.x branch, which uses Apache Gora as a front end to various NoSQL datastores.