Crawling and Data Extraction 
with Apache Nutch 
Yewint Ko 
yewintko@bindez.com
Agenda 
• Crawlers 
• Use cases 
• Data extraction & content scraping 
• Apache Nutch 
• Nutch lifecycle 
• Nutch plugins 
• Scaling Nutch 
• Opportunities
I am … 
• Cofounder and architect at Bindez.com 
• More than 6 years in the IT industry
• 3 years in the web archiving & IR field
• Myanmar and SEA
Crawlers
How they work…
• Download HTML pages
• Start from a seed list
• Follow links up to a page depth
• Multithreading
• Respect robots.txt
www.fb.com/robots.txt
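
The crawler behaviour listed above (seed list, page depth, robots.txt) can be sketched roughly as follows. This is a minimal illustration, not how Nutch is implemented: `FAKE_WEB`, `ROBOTS_TXT`, the `demo-bot` agent name and the `example.com` URLs are all made up so the sketch runs without network access, and multithreading is left out for brevity.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> outlinks (stands in for real HTTP fetches).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
    "http://example.com/private/x": [],
}

# Hypothetical robots.txt content for the fake site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(seeds, max_depth, agent="demo-bot"):
    """Breadth-first crawl from the seeds, honouring robots.txt and a depth limit."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    fetched = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # disallowed by robots.txt
        fetched.append(url)  # a real crawler would download the page here
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


if __name__ == "__main__":
    print(crawl(["http://example.com/"], max_depth=2))
```

With a depth limit of 2 the page `/c` is never reached, and `/private/x` is skipped because of the `Disallow` rule.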
Use cases 
• Information retrieval (search)
• Market analysis (brand watching)
• Social media analysis (hate speech monitoring)
• Recommendation systems (tickets, movies, etc.)
• NLP and ML
Data Extraction 
the task of automatically extracting structured 
information from unstructured and/or semi-structured 
machine-readable documents.
Data Extraction 
The web is raw
• Raw text in different languages
• Raw text on different subjects
• Raw text in different formats
Example : 
• Date time 
• Author 
• Tags 
• Images, videos, pdf, pptx, doc, odt
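
As a rough illustration of turning a raw page into a structured record (date time, author, tags), the sketch below reads a few `<meta>` tags with Python's standard `html.parser`. The tag names it looks for (`author`, `keywords`, `article:published_time`) and the sample page are assumptions; real sites vary widely and usually need per-site rules.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects a few <meta> fields into a dict."""

    # Assumed field names; many sites use different ones.
    WANTED = {"author", "keywords", "article:published_time"}

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key in self.WANTED and attrs.get("content"):
            self.fields[key] = attrs["content"]


def extract_record(html):
    """Return a structured record built from the page's meta tags."""
    parser = MetaExtractor()
    parser.feed(html)
    f = parser.fields
    return {
        "author": f.get("author"),
        "published": f.get("article:published_time"),
        "tags": [t.strip() for t in f.get("keywords", "").split(",") if t.strip()],
    }


# Hypothetical sample page.
SAMPLE = """
<html><head>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2014-11-15T09:30:00+06:30">
<meta name="keywords" content="nutch, crawling , myanmar">
</head><body><p>Hello</p></body></html>
"""

if __name__ == "__main__":
    print(extract_record(SAMPLE))
```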
Html parsing / Content Scraping 
• Title 
• Main content 
• Banners 
• Ads 
• Header / footer
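
One simple way to separate the title and main content from page furniture is to drop text inside tags that usually hold headers, footers and navigation. The sketch below does that with the standard `html.parser`; the `SKIP` tag set and the sample page are assumptions, and real banners and ads are often plain `<div>` elements that need class-based or statistical rules instead.

```python
from html.parser import HTMLParser


class ContentScraper(HTMLParser):
    """Keeps the <title> and body text, ignoring text inside SKIP tags."""

    # Assumed set of "page furniture" tags.
    SKIP = {"header", "footer", "nav", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._skip_depth = 0   # > 0 while inside any SKIP tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title = text
        elif self._skip_depth == 0:
            self.text_parts.append(text)


def scrape(html):
    """Return (title, main_text) for an HTML string."""
    scraper = ContentScraper()
    scraper.feed(html)
    return scraper.title, " ".join(scraper.text_parts)


# Hypothetical sample page.
PAGE = """
<html><head><title>Nutch news</title></head><body>
<header>Site banner</header>
<nav>Home | About</nav>
<div><p>Nutch 1.9 was released.</p><p>It runs on Hadoop.</p></div>
<aside>Buy now!</aside>
<footer>Copyright</footer>
</body></html>
"""

if __name__ == "__main__":
    print(scrape(PAGE))
```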
Apache Nutch 
• 2002/2003 – started by Doug Cutting & Mike Cafarella
• Pure Java
• 2005 – MapReduce implementation in Nutch
• 2006 – Hadoop support
• 2006-07 – Tika integration
• May 2010 – top-level project (TLP) at Apache
Releases & Community 
Apache Nutch 1.9 2014-08-16 
Apache Nutch 1.8 2014-03-17 
Apache Nutch 2.2.1 2013-07-02 
Apache Nutch 1.7 2013-06-24 
Apache Nutch 2.2 2013-06-05 
Apache Nutch 1.6 2012-12-06 
Apache Nutch 2.1 2012-10-05 
Apache Nutch 1.5.1 2012-07-10 
Apache Nutch 2.0 2012-07-07 
Apache Nutch 1.5 2012-06-07 
Apache Nutch 1.4 2011-04-11 
Apache Nutch 1.3 2011-06-07 
nutch-1.0 2009-03-23 
nutch-0.9 2007-04-01 
nutch-0.8.1 2006-09-24 
nutch-0.8 2006-06-25 
nutch-0.7.2 2006-03-31
Releases & Community 
• Apache License 2.0 (business friendly)
• Mature (10 years old)
• Tested on very large-scale clusters (Hadoop)
• Active committers
• New contributions and bug reports
• Tons of mailing list subscribers
Nutch Lifecycle 
• Inject URLs
- seed list goes into the crawldb; the initial linkdb is empty
• Generate
- prepare for fetch: create a segment with the list of URLs to fetch
• Fetch - download raw HTML
• Parse - parse content, discover outlinks
• Update - crawldb, linkdb
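
The lifecycle steps above can be sketched as a toy in-memory loop. The function names mirror the slide, not Nutch's Java classes or command-line tools; `FAKE_WEB`, the status strings (`unfetched`, `fetched`, `gone`) and the `top_n` parameter are illustrative assumptions, and a real segment lives on disk or HDFS rather than in a dict.

```python
# Toy model of the crawl rounds; not Nutch code.
FAKE_WEB = {  # hypothetical pages: URL -> outlinks
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}


def inject(crawldb, seeds):
    """Add seed URLs to the crawldb as unfetched entries."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Pick up to top_n unfetched URLs as the fetch list of a new segment."""
    todo = [url for url, status in crawldb.items() if status == "unfetched"]
    return {"fetch_list": todo[:top_n]}


def fetch(segment):
    """Download step; here the 'content' is just the page's outlink list."""
    segment["content"] = {url: FAKE_WEB.get(url) for url in segment["fetch_list"]}


def parse(segment):
    """Parse step: keep outlinks for every page that was actually found."""
    segment["outlinks"] = {
        url: links for url, links in segment["content"].items() if links is not None
    }


def update(crawldb, linkdb, segment):
    """Write fetch results and newly discovered URLs back to crawldb and linkdb."""
    for url in segment["fetch_list"]:
        crawldb[url] = "fetched" if url in segment["outlinks"] else "gone"
    for src, links in segment["outlinks"].items():
        for dst in links:
            crawldb.setdefault(dst, "unfetched")    # newly discovered URL
            linkdb.setdefault(dst, set()).add(src)  # inverted link: dst <- src


def crawl(seeds, rounds, top_n=10):
    crawldb, linkdb = {}, {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment["fetch_list"]:
            break  # nothing left to fetch
        fetch(segment)
        parse(segment)
        update(crawldb, linkdb, segment)
    return crawldb, linkdb


if __name__ == "__main__":
    db, links = crawl(["http://a.example/"], rounds=2)
    print(db)
    print(links)
```

Each pass through the loop corresponds to one generate/fetch/parse/update round; URLs discovered in one round only become fetchable in the next.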
Plugins 
• Extensibility 
• Flexibility 
• Maintainability
Plugins 
• IndexWriter – indexing integration
• IndexingFilter – add extra index fields
• Parser – base parser
• HtmlParseFilter – additional parser chains
• Protocol – ftp, http, etc.
• URLFilter – limit the URLs to crawl
• URLNormalizer – convert URLs to a normalized form
• ScoringFilter – page score
• SegmentMergeFilter – merge segments
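
To make two of these extension points concrete, here is a language-neutral sketch of a URL normalizer and a URL filter chained together. Nutch plugins are written in Java against Nutch's own interfaces; the Python below only imitates the idea, and `normalize`, `make_regex_filter`, `run_chain` and the sample rules are hypothetical names. The `+`/`-` rule style, where the first matching rule wins and an unmatched URL is rejected, is loosely modelled on regex-based URL filtering.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def make_regex_filter(rules):
    """rules: list of ('+' or '-', pattern). First match wins; no match rejects."""
    compiled = [(sign, re.compile(pattern)) for sign, pattern in rules]

    def url_filter(url):
        for sign, pattern in compiled:
            if pattern.search(url):
                return url if sign == "+" else None
        return None

    return url_filter


def run_chain(url, normalizers, filters):
    """Apply every normalizer, then every filter; None means the URL is dropped."""
    for normalizer in normalizers:
        url = normalizer(url)
    for url_filter in filters:
        url = url_filter(url)
        if url is None:
            return None
    return url


# Hypothetical rules: skip images, accept everything else on example.com.
RULES = [
    ("-", r"\.(jpg|png|gif)$"),
    ("+", r"^http://example\.com/"),
]

if __name__ == "__main__":
    url_filter = make_regex_filter(RULES)
    for candidate in ["HTTP://Example.COM/news#top",
                      "http://example.com/logo.png",
                      "http://other.org/"]:
        print(candidate, "->", run_chain(candidate, [normalize], [url_filter]))
```

Keeping each step a small function with the same shape is what lets new behaviour be added without touching the crawl loop, which is the extensibility point the plugin slides make.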
Scalability
Opportunities 
• Myanmar web is super raw (rich news media content)
• Myanmar web needs analytics solutions
Resources 
Website : http://nutch.apache.org/ 
Wiki : http://wiki.apache.org/general/
Plugins : https://wiki.apache.org/nutch/PluginCentral 
Browse: http://svn.apache.org/viewvc/nutch/ 
SVN : https://svn.apache.org/repos/asf/nutch/
Thank You! 
?
Follow me on … 
twitter.com/yewintko 
linkedin.com/yewintko 
Email me… 
yewintko@bindez.com

Dev Con 2014


Editor's Notes

  1. Explain indexing Interesting plugins