Dev Con 2014

Crawling and Data Extraction
with Apache Nutch
Yewint Ko
yewintko@bindez.com

Agenda
• Crawlers
• Use cases
• Data extraction & content scraping
• Apache Nutch
• Nutch lifecycle
• Nutch plugins
• Scaling Nutch
• Opportunities

I am …
• Cofounder and architect at Bindez.com
• More than 6 years in the IT industry
• 3 years in the web archiving & IR field
• Myanmar and SEA

How crawlers work
• Download HTML pages
• Start from a seed list
• Follow links up to a set page depth
• Multithreading
• Respect robots.txt
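
The points above can be sketched as a small breadth-first crawler. This is a minimal illustration, not how Nutch itself is implemented: the in-memory `PAGES` site, the `demo-bot` agent name, and the `crawl` function are all hypothetical, and the network is replaced by a dictionary lookup so the example runs offline. Multithreading is left out to keep it short.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, robots_txt, max_depth=2, agent="demo-bot"):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    queue = deque((url, 0) for url in seeds)  # start from the seed list
    seen = set(seeds)
    fetched = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):  # respect robots.txt
            continue
        html = fetch(url)
        if html is None:
            continue
        fetched.append(url)
        if depth >= max_depth:  # do not follow links past the depth limit
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


# Hypothetical site held in memory instead of being downloaded.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/private/x">X</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/private/x": "secret",
}
ROBOTS = "User-agent: *\nDisallow: /private/"

result = crawl(["http://example.com/"], PAGES.get, ROBOTS, max_depth=1)
print(result)
```

With `max_depth=1` the crawler fetches the seed and the pages it links to, skips `/private/x` because of the robots.txt rule, and never reaches `/b`.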

Use cases
• Information retrieval (search)
• Market analysis (brand watching)
• Social media analysis (hate speech monitoring)
• Recommendation systems (tickets, movies, etc.)
• NLP and ML

Data Extraction
The task of automatically extracting structured
information from unstructured and/or semi-structured
machine-readable documents.

Data Extraction
The web is raw
• Raw text in different languages
• Raw text on different subjects
• Raw text in different formats
Examples:
• Date time
• Author
• Tags
• Images, videos, pdf, pptx, doc, odt
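
As one possible illustration of turning raw pages into structured fields such as date, author, and tags, the sketch below reads `<meta>` tags with Python's standard `html.parser`. The mapping in `FIELDS` is an assumption made for this example; real sites expose these values in many different ways, so any production extractor needs per-site or per-format rules.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Pulls a few structured fields out of <meta> tags."""

    # Assumed mapping from meta names to output fields; real sites vary.
    FIELDS = {
        "author": "author",
        "article:published_time": "published",
        "keywords": "tags",
    }

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        field = self.FIELDS.get(key)
        content = attrs.get("content")
        if field and content:
            if field == "tags":
                # Split the comma-separated keyword list into clean tags.
                self.record[field] = [t.strip() for t in content.split(",") if t.strip()]
            else:
                self.record[field] = content.strip()


def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    return parser.record


# Hypothetical page used only for this example.
SAMPLE = """
<html><head>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2014-11-15T09:30:00+06:30">
<meta name="keywords" content="nutch, crawler , search">
</head><body><p>Body text</p></body></html>
"""

record = extract_metadata(SAMPLE)
print(record)
```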

HTML parsing / Content scraping
• Title
• Main content
• Banners
• Ads
• Header / footer
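
A rough sketch of separating the title and main content from page furniture, assuming the page uses semantic tags such as `<header>`, `<footer>`, `<nav>`, and `<aside>`. The `SKIP` set is an assumption for this example; banners and ads usually sit in plain `<div>` elements and need site-specific rules that are not shown here.

```python
from html.parser import HTMLParser


class MainContentParser(HTMLParser):
    """Keeps the title and body text, dropping text inside SKIP elements."""

    # Assumed set of boilerplate containers; adjust per site.
    SKIP = {"header", "footer", "nav", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside any SKIP element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.chunks.append(text)


def extract_main(html):
    parser = MainContentParser()
    parser.feed(html)
    return parser.title, " ".join(parser.chunks)


# Hypothetical page used only for this example.
SAMPLE = """
<html><head><title>Dev Con recap</title></head>
<body>
<header>Site name | Login</header>
<nav>Home News</nav>
<div><p>Nutch is a crawler.</p><p>It runs on Hadoop.</p></div>
<aside>Buy now!</aside>
<footer>Copyright</footer>
</body></html>
"""

title, content = extract_main(SAMPLE)
print(title)
print(content)
```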

Apache Nutch
• 2002/2003 – started by Doug Cutting & Mike Cafarella
• Pure Java
• 2005 – MapReduce implementation in Nutch
• 2006 – Hadoop support
• 2006-07 – Tika integration
• May 2010 – top-level project (TLP) at Apache

Releases & Community
Apache Nutch 1.9 2014-08-16
Apache Nutch 2.2.1 2013-07-02
Apache Nutch 1.5.1 2012-07-10
nutch-1.0 2009-03-23
nutch-0.9 2007-04-01
nutch-0.8.1 2006-09-24
nutch-0.8 2006-06-25
nutch-0.7.2 2006-03-31

Releases & Community
• Apache License 2.0 (business-friendly)
• Mature (10 years old)
• Tested on very large-scale clusters (Hadoop)
• Active committers
• New contributions and bug reports
• Tons of mailing list subscribers

Nutch Lifecycle
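• Inject URLs
- load the seed list into the crawldb (initially empty)
• Generate
- prepare for fetch: create a new segment holding the fetch list
• Fetch - download raw HTML
• Parse - parse contents, discover outlinks
• Update - update the crawldb with fetch results and new outlinks; invert links into the linkdb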
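
The cycle above can be modelled in a few lines. This is a toy model, not Nutch code: the crawldb is a plain dictionary, a "page" in the hypothetical `WEB` is just its list of outlinks, and the function names only mirror the lifecycle steps. The status strings `db_unfetched`, `db_fetched`, and `db_gone` are borrowed from Nutch's crawldb status names; the linkdb step is omitted.

```python
def inject(crawldb, seeds):
    """Add seed URLs to the crawldb as unfetched entries."""
    for url in seeds:
        crawldb.setdefault(url, "db_unfetched")


def generate(crawldb, top_n):
    """Build a fetch list (one 'segment') from unfetched URLs."""
    return sorted(u for u, s in crawldb.items() if s == "db_unfetched")[:top_n]


def fetch(segment, web):
    """Download each URL in the segment; None means the page is missing."""
    return {url: web.get(url) for url in segment}


def parse(fetched):
    """Extract outlinks from every page that was actually fetched."""
    return {url: list(links) for url, links in fetched.items() if links is not None}


def updatedb(crawldb, fetched, parsed):
    """Record fetch results and add newly discovered URLs."""
    for url, content in fetched.items():
        crawldb[url] = "db_fetched" if content is not None else "db_gone"
    for outlinks in parsed.values():
        for link in outlinks:
            crawldb.setdefault(link, "db_unfetched")


# Hypothetical web: each page is represented by its outlinks.
WEB = {
    "http://a/": ["http://b/", "http://c/"],
    "http://b/": ["http://a/"],
    "http://c/": ["http://missing/"],
}

crawldb = {}
inject(crawldb, ["http://a/"])
for _ in range(3):  # three generate/fetch/parse/update rounds
    segment = generate(crawldb, top_n=10)
    fetched = fetch(segment, WEB)
    parsed = parse(fetched)
    updatedb(crawldb, fetched, parsed)

print(crawldb)
```

Each round only fetches URLs discovered in earlier rounds, which is why crawl depth in this model equals the number of rounds run.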

Plugins
• Extensibility
• Flexibility
• Maintainability

Plugins
• IndexWriter – indexing integration
• IndexingFilter – add additional index fields
• Parser – base content parser
• HtmlParseFilter – additional parser chains
• Protocol – ftp, http, etc.
• URLFilter – limit which URLs are crawled
• URLNormalizer – convert URLs to a normalized form
• ScoringFilter – page scoring
• SegmentMergeFilter – filter entries when merging segments
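
To make the URLFilter and URLNormalizer roles concrete, here is a small Python sketch of the two ideas. It is not a Nutch plugin (those are written in Java against Nutch's extension points); the rule text loosely follows the `+`/`-` prefix convention of Nutch's regex URL filter file, where the first matching rule decides, and the rules, `example.com` host, and function names are assumptions for this example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rule set: '+' accepts, '-' rejects, first match wins.
RULES = r"""
# skip common image and asset suffixes
-\.(gif|jpg|png|css|js|zip)$
# stay inside one hypothetical site
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def load_rules(text):
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def url_filter(url, rules):
    """Return the URL if accepted, otherwise None."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None


def normalize(url):
    """Lowercase scheme and host, drop default ports and fragments."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


rules = load_rules(RULES)
for raw in ["HTTP://Example.COM:80/News#top",
            "http://example.com/logo.png",
            "http://other.org/"]:
    print(raw, "->", url_filter(normalize(raw), rules))
```

Normalizing before filtering means the filter rules only have to match one canonical spelling of each URL.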

Opportunities
• The Myanmar web is very raw (rich in news media content)
• The Myanmar web needs analytics solutions

Resources
Website : http://nutch.apache.org/
Wiki : http://wiki.apache.org/general/
Plugins : https://wiki.apache.org/nutch/PluginCentral
Browse: http://svn.apache.org/viewvc/nutch/
SVN : https://svn.apache.org/repos/asf/nutch/

Follow me on …
twitter.com/yewintko
linkedin.com/yewintko
Email me…
yewintko@bindez.com
