Crawling and Data Extraction 
with Apache Nutch 
Yewint Ko 
yewintko@bindez.com
Agenda 
• Crawlers 
• Use cases 
• Data extraction & content scraping 
• Apache Nutch 
• Nutch lifecycle 
• Nutch plugins 
• Scaling Nutch 
• Opportunities
I am … 
• Cofounder and architect at Bindez.com 
• More than 6 years in the IT industry 
• 3 years in the web archiving & IR field 
• Myanmar and SEA
Crawlers
How they work… 
• Download HTML pages 
• Start from a seed list 
• Follow links to a given page depth 
• Multithreading 
• Respect robots.txt
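
The seed list and page depth ideas can be sketched as a breadth-first traversal. This is only a minimal illustration, not how Nutch implements crawling: the `FAKE_WEB` dictionary, the `crawl` function and the example.com URLs are hypothetical, and the in-memory dictionary stands in for real HTTP downloads so the sketch runs offline.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outlinks (stands in for real HTTP fetches).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": ["http://example.com/d"],
    "http://example.com/d": [],
}


def crawl(seeds, max_depth, fetch_outlinks):
    """Breadth-first crawl from the seed list, following links up to max_depth hops."""
    seen = set(seeds)                      # URLs already queued, to avoid refetching
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue                       # do not follow links beyond the depth limit
        for link in fetch_outlinks(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl(["http://example.com/"], max_depth=2,
              fetch_outlinks=lambda u: FAKE_WEB.get(u, []))
for url, depth in pages:
    print(depth, url)
```

A production crawler would fetch many URLs in parallel (the multithreading bullet) and add per-host politeness delays; both are left out here to keep the traversal logic visible.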
www.fb.com/robots.txt
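
A rough sketch of a robots.txt check, using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `make_checker` helper and the `my-crawler` user agent are made up for illustration; a real crawler would download the file from the site itself before fetching any page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally downloaded from http://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def make_checker(robots_txt, user_agent):
    """Return a function that tells whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = make_checker(ROBOTS_TXT, "my-crawler")
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```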
Use cases 
• Information retrieval (search) 
• Market analysis (brand watching) 
• Social media analysis (hate speech monitoring) 
• Recommendation systems (tickets, movies, etc.) 
• NLP and ML
Data Extraction 
The task of automatically extracting structured 
information from unstructured and/or semi-structured 
machine-readable documents.
Data Extraction 
Web is Raw 
• Raw text in different languages 
• Raw text on different subjects 
• Raw text in different formats 
Examples: 
• Date time 
• Author 
• Tags 
• Images, videos, pdf, pptx, doc, odt
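
As an illustration of turning raw HTML into structured fields such as date, author and tags, here is a small sketch built on Python's standard `html.parser`. It assumes the page exposes these values in `<meta>` tags, which many news sites do but which is by no means guaranteed; the `SAMPLE` page, the `MetaExtractor` class and the `extract_metadata` helper are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name=...> / <meta property=...> values into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key and attrs.get("content") is not None:
            self.meta[key.lower()] = attrs["content"]


# Hypothetical page; real sites use many different conventions for these fields.
SAMPLE = """<html><head>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2014-11-01T09:30:00+06:30">
<meta name="keywords" content="nutch, crawler, myanmar">
</head><body></body></html>"""


def extract_metadata(html):
    """Return author, publication time and tags found in the page's meta tags."""
    parser = MetaExtractor()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "author": parser.meta.get("author"),
        "published": parser.meta.get("article:published_time"),
        "tags": [t.strip() for t in keywords.split(",") if t.strip()],
    }


record = extract_metadata(SAMPLE)
print(record)
```

When a site does not publish such meta tags, the same fields usually have to be located with site-specific rules, which is where most of the extraction effort goes.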
Html parsing / Content Scraping 
• Title 
• Main content 
• Banners 
• Ads 
• Header / footer
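
One simple way to separate the title and main content from header and footer text is to skip known boilerplate containers while walking the HTML. The sketch below is a naive heuristic under that assumption, not the approach used by Nutch or Tika; the `SKIP_TAGS` set, the `ContentExtractor` class and the `PAGE` sample are illustrative only.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these containers. Ads and banners usually
# need extra class/id heuristics that are not shown here.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Keep the <title> and any text that is outside the SKIP_TAGS containers."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0   # > 0 while inside a boilerplate container

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.text_parts.append(text)


PAGE = """<html><head><title>Crawling with Nutch</title></head><body>
<header>Site banner</header>
<nav>Home | About</nav>
<article><h1>Crawling with Nutch</h1><p>Nutch fetches pages in segments.</p></article>
<footer>Copyright notice</footer>
</body></html>"""

extractor = ContentExtractor()
extractor.feed(PAGE)
print(extractor.title)
print(extractor.text_parts)
```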
Apache Nutch 
• 2002/2003 – started by Doug Cutting & Mike Cafarella 
• Pure Java 
• 2005 – MapReduce implementation in Nutch 
• 2006 – Hadoop support 
• 2006-07 – Tika integration 
• May 2010 – became an Apache top-level project (TLP)
Releases & Community 
Apache Nutch 1.9 2014-08-16 
Apache Nutch 1.8 2014-03-17 
Apache Nutch 2.2.1 2013-07-02 
Apache Nutch 1.7 2013-06-24 
Apache Nutch 2.2 2013-06-05 
Apache Nutch 1.6 2012-12-06 
Apache Nutch 2.1 2012-10-05 
Apache Nutch 1.5.1 2012-07-10 
Apache Nutch 2.0 2012-07-07 
Apache Nutch 1.5 2012-06-07 
Apache Nutch 1.4 2011-04-11 
Apache Nutch 1.3 2011-06-07 
nutch-1.0 2009-03-23 
nutch-0.9 2007-04-01 
nutch-0.8.1 2006-09-24 
nutch-0.8 2006-06-25 
nutch-0.7.2 2006-03-31
Releases & Community 
• Apache License 2.0 (business-friendly) 
• Mature (10 years old) 
• Tested on very large-scale Hadoop clusters 
• Active committers 
• New contributions and bug reports 
• Many mailing list subscribers
Nutch Lifecycle 
• Inject URLs 
- load the seed list into the crawldb; the crawldb and linkdb start out empty 
• Generate 
- prepare for fetch: create a segment containing the fetch list 
• Fetch - download raw HTML 
• Parse - parse content, discover outlinks 
• Update - update the crawldb and linkdb
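
The lifecycle can be pictured as a loop over these five steps. The following is a toy in-memory simulation, not Nutch code: the function names mirror the step names above, but the data structures (plain dictionaries for the crawldb and linkdb), the status strings and the `FAKE_WEB` pages are assumptions made for illustration.

```python
# Hypothetical in-memory web: URL -> outlinks (stands in for real HTTP fetches).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}


def inject(crawldb, seeds):
    """Add seed URLs to the crawldb as not-yet-fetched entries."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs as the fetch list of a new segment."""
    return [u for u, status in crawldb.items() if status == "unfetched"][:top_n]


def fetch(segment):
    """'Download' each URL; None means the page could not be retrieved."""
    return {url: FAKE_WEB.get(url) for url in segment}


def parse(fetched):
    """Discover the outlinks of every fetched page."""
    return {url: links or [] for url, links in fetched.items()}


def update(crawldb, linkdb, fetched, parsed):
    """Record fetch status, add newly found URLs, and store inverted links."""
    for url, content in fetched.items():
        crawldb[url] = "fetched" if content is not None else "gone"
    for url, outlinks in parsed.items():
        for target in outlinks:
            crawldb.setdefault(target, "unfetched")
            linkdb.setdefault(target, set()).add(url)   # target <- linking page


crawldb, linkdb = {}, {}
inject(crawldb, ["http://a.example/"])
rounds = 0
while True:
    segment = generate(crawldb, top_n=10)
    if not segment:
        break                      # nothing left to fetch
    fetched = fetch(segment)
    update(crawldb, linkdb, fetched, parse(fetched))
    rounds += 1
print(rounds, crawldb)
```

Each pass through the loop corresponds to one crawl round, so the number of rounds effectively limits the crawl depth.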
Plugins 
• Extensibility 
• Flexibility 
• Maintainability
Plugins 
• IndexWriter – indexing integration 
• IndexingFilter – add additional index fields 
• Parser – base parser for fetched content 
• HtmlParseFilter – additional parsing steps chained after the HTML parser 
• Protocol – ftp, http, etc. 
• URLFilter – limit which URLs are crawled 
• URLNormalizer – convert URLs to a normalized form 
• ScoringFilter – page scoring 
• SegmentMergeFilter – filter data while merging segments
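
To make the URLNormalizer and URLFilter roles concrete, here is a conceptual sketch of the two steps chained together. Real Nutch plugins are Java classes registered through the plugin system; this Python version only mirrors the idea. The `RULES` list is loosely modeled on +/- regex rules where the first matching rule decides, and every name and URL in it is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop default ports and fragments, ensure a path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


# Hypothetical rules: "-" rejects, "+" accepts, first matching rule wins.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),  # skip static assets
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")),  # stay on one site
    ("-", re.compile(r".")),                                       # reject the rest
]


def accept(url):
    """Apply the rules in order and return the verdict of the first match."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


def process(urls):
    """Normalize, filter and de-duplicate a list of discovered URLs."""
    kept = []
    for url in urls:
        url = normalize(url)
        if accept(url) and url not in kept:
            kept.append(url)
    return kept


result = process([
    "HTTP://Example.COM:80/News/#top",
    "http://example.com/News/",
    "http://example.com/logo.PNG",
    "http://other.org/page",
])
print(result)
```

Normalizing before filtering means that two spellings of the same address collapse into one entry, which keeps the crawl from fetching the same page twice.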
Scalability
Opportunities 
• The Myanmar web is very raw (rich news media content) 
• The Myanmar web needs analytics solutions
Resources 
Website : http://nutch.apache.org/ 
Wiki: http://wiki.apache.org/general/ 
Plugins : https://wiki.apache.org/nutch/PluginCentral 
Browse: http://svn.apache.org/viewvc/nutch/ 
SVN : https://svn.apache.org/repos/asf/nutch/
Thank You! 
?
Follow me on … 
twitter.com/yewintko 
linkedin.com/yewintko 
Email me… 
yewintko@bindez.com

Dev Con 2014

Editor's Notes

  • #27 Explain indexing Interesting plugins
  • #28 Explain indexing Interesting plugins
  • #30 Explain indexing Interesting plugins