RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

Jazz WangJazz Wang
Yao-Tsung WangYao-Tsung Wang
jazz@nchc.org.twjazz@nchc.org.tw
Jazz WangJazz Wang
Build Your PersonalBuild Your Personal
Search Engine using CrawlzillaSearch Engine using Crawlzilla
Build Your PersonalBuild Your Personal
Search Engine using CrawlzillaSearch Engine using Crawlzilla

2/50
WHO AM I ? JAZZWHO AM I ? JAZZ ？？WHO AM I ? JAZZWHO AM I ? JAZZ ？？
• Speaker ：
– Jazz Yao-Tsung Wang / ECE Master
– Associate Researcher , NCHC, NARL, Taiwan
– Co-Founder of Hadoop.TW
– jazz@nchc.org.tw
• My slides are available on the website
– http://trac.nchc.org.tw/cloud , http://www.slideshare.net/jazzwang
FLOSS UserFLOSS User
Debian/Ubutnu
Access Grid
Motion/VLC
Red5
Debian Router
DRBL/Clonezilla
Hadoop
FLOSS EvalistFLOSS Evalist
DRBL/Clonezilla
Partclone/Tuxboot
Hadoop Ecosystem
FLOSS DeveloperFLOSS Developer
DRBL/Clonezilla
Hadoop Ecosystem

4/50
Data Explosion!!Data Explosion!!
Source ： The Expanding Digital Universe,
A Forecast of Worldwide Information Growth Through 2010,
March 2007, An IDC White Paper - sponsored by EMC
http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf

5/50
Source ： Extracting Value from Chaos,
June 2011, An IDC White Paper - sponsored by EMC
http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
According to IDC reports:
2006 161 EB
2007 281 EB
2008 487 EB
2009 800 EB (0.8 ZB)
2010 988 EB (predict)
2010 1200 EB (1.2 ZB)
2011 1773 EB (predict)
2011 1800 EB (1.8 ZB)
Data expanded 1.6x each year !!Data expanded 1.6x each year !!

6/50
Commonly used software tools can't
capture, manage and process.
'Big Data' = few dozen TeraBytes to PetaBytes in single data set.
What is Big Data?!What is Big Data?!
Multiple files,
totally 20TB
Single Database,
totally 20TB
Single file,
totally 20TB
Source ： http://en.wikipedia.org/wiki/Big_data

7/50
Gartner Big Data Model ?Gartner Big Data Model ?
Challenge of 'Big Data' is to manage Volum, Variety and Velocity.Challenge of 'Big Data' is to manage Volum, Variety and Velocity.
Volume
(amount of data)
Velocity
(speed of data in/out)
Variety
(data types, sources)
Batch Job
Realtime
TB
EB
Unstructured
Semi-structured
Structured
PB
Source ：
[1] Laney, Douglas. "3D Data Management: Controlling
Data Volume, Velocity and Variety" (6 February 2001)
[2] Gartner Says Solving 'Big Data' Challenge Involves
More Than Just Managing Volumes of Data, June 2011

8/50
12D of Information Management?12D of Information Management?
Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme
Information Management, 7 April 2011, http://www.gartner.com/id=1622715
Big Data
is just the
beginning
Of Extrem
Information
Management

9/50
Possible Applications of Big Data?Possible Applications of Big Data?
Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
Source: http://lib.stanford.edu/files/see_pasig_dic.pdf
Top 1 : Human Genomics – 7000 PB / Year
Top 2 : Digital Photos – 1000 PB+/ Year
Top 3 : E-mail (no Spam) – 300 PB+ / Year

10/50
HOW to
deal with
Big Data?

11/50
The SMAQ stack for big dataThe SMAQ stack for big data
Source ： The SMAQ stack for big data ， Edd Dumbill ， 22 September 2010 ，　　　
　　　　　 http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

12/50

13/50

14/50
Three Core Technologies of Google ....Three Core Technologies of Google ....
• Google shared their design of web-search engine
– SOSP 2003 :
– “The Google File System”
– http://labs.google.com/papers/gfs.html
– OSDI 2004 :
– “MapReduce : Simplifed Data Processing on Large Cluster”
– http://labs.google.com/papers/mapreduce.html
– OSDI 2006 :
– “Bigtable: A Distributed Storage System for Structured Data”
– http://labs.google.com/papers/bigtable-osdi06.pdf

15/50
Open Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core Technologies
Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS)
Sector Distributed File SystemSector Distributed File System
Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS)
Sector Distributed File SystemSector Distributed File System
Hadoop MapReduce APIHadoop MapReduce API
Sphere MapReduce API, ...Sphere MapReduce API, ...
Hadoop MapReduce APIHadoop MapReduce API
Sphere MapReduce API, ...Sphere MapReduce API, ...
HBase,HBase, HypertableHypertable
Cassandra, ....Cassandra, ....
HBase,HBase, HypertableHypertable
Cassandra, ....Cassandra, ....
S = StorageS = Storage
Google File SystemGoogle File System
To store petabytes of dataTo store petabytes of data
S = StorageS = Storage
Google File SystemGoogle File System
To store petabytes of dataTo store petabytes of data
MMapapRReduceeduce
To parallel process dataTo parallel process data
MMapapRReduceeduce
To parallel process dataTo parallel process data
Q = QueryQ = Query
BigTableBigTable
A huge key-value datastoreA huge key-value datastore
Q = QueryQ = Query
BigTableBigTable
A huge key-value datastoreA huge key-value datastore
Google's Stack Open Source Projects

16/50
HadoopHadoopHadoopHadoop
• http://hadoop.apache.org
• Apache Top Level Project
• Major sponsor is Yahoo!
• Developed by Doug Cutting,
Reference from Google Filesystem
• Written by Java, it provides HDFS and
MapReduce API
• Used in Yahoo since year 2006
• It had been deploy to 4000+ nodes in Yahoo
• Design to process dataset in Petabyte
• Facebook 、 Last.fm 、 Joost are also powered by
Hadoop

17/50
WHO will
needs Hadoop?
Let's take
'Search Engine' as
an example

18/50
Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !!
FileFile
MailMail PidgenPidgen
DatabaseDatabase
Web PagesWeb Pages

19/50
To speed up search, We need “Index”To speed up search, We need “Index”
Keyword
Page Number

20/50
History of Hadoop …History of Hadoop … 2001~20052001~2005
• Lucene
– http://lucene.apache.org/
– a high-performance, full-featured text search
engine library written entirely in Java.
– Lucene create an inverse index of every word in
different documents. It enhance performance of
text searching.

21/50
History of Hadoop …History of Hadoop … 2005~20062005~2006
• Nutch
– http://nutch.apache.org/
– Nutch is open source web-search software.
– It builds on Lucene and Solr, adding web-
specifics, such as a crawler, a link-graph
database, parsers for HTML and other
document formats, etc.

22/50
History of Hadoop …History of Hadoop … 2006 ~ Now2006 ~ Now
• Nutch encounter performance issue.
• Reference from Google's papers.
• Added DFS & MapReduce implement to Nutch
• According to user feedback on the mail list of Nutch ....
• Hadoop became separated project since Nutch 0.8
• Nutch DFS → Hadoop Distributed File System (HDFS)
• Yahoo hire Dong Cutting to build a team of web search
engine at year 2006.
– Only 14 team members (engineers, clusters, users, etc.)
• Doung Cutting joined Cloudera at year 2009.

23/50
Do you like to write notes?Do you like to write notes?

24/50
Tools that I used to write notesTools that I used to write notes
Oddmuse WikiOddmuse Wiki
http://www.oddmuse.org/
2005~2008

25/50
PmWikiPmWiki
http://www.pmwiki.org/

26/50
ScrapBookScrapBook
https://addons.mozilla.org/zh-TW/firefox/addon/scrapbook/
2005~NOW

27/50
Trac = Wiki + Version Control (SVN or GIT)Trac = Wiki + Version Control (SVN or GIT)
http://trac.edgewall.org/
2006~NOW

28/50
ReadItLaterReadItLater
http://readitlaterlist.com/
2010~NOW

29/50
It's
painful to
search all
my notes!

30/50
How do I search all my notesHow do I search all my notes
from different websites?from different websites?

31/50
Feature of CrawlzillaFeature of Crawlzilla
●
Cluster-based
●
Chinese Word
Segmentation
●
Multiple Users
with Multiple
Index Pools
●
Re-Crawl
●
Schedule /
Crontab
●
Display Index
Database (Top 50
sites, keywords)
●
Support UTF-8
Chinese Search
Results
●
Web-based User
Interface

32/50
Hadoop
Tomcat
Crawlzilla System Management
Lucene
Nutch
JSP + Servlet +
JavaBean
PC1 PC2 PC3
Web UI ( Crawlzilla Website + Search Engine)
System Architecture of CrawlzillaSystem Architecture of Crawlzilla

33/50
Crawlzilla Web UI
Users

Index DB

34/50
Comparison with other projectsComparison with other projects
Spidr Larbin Jcrawl Nutch Crawlzilla
Install
Rube
Package
Install
Gmake
Compiler
and Install
Java
Compiler
and Install
Deploy
Configure
Files
Provide Auto
Installation
Crawl website
pages
O O O O O
Parser Content X X X O O
Cluster Computing X X X O O
Interface Command Command Command Command Web-UI
Support Chinese
Segmentation
X X X X O

35/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 1 ： Registration
http://demo.crawlzilla.info (1)
(2)
(3)

36/50
▲ Step 2 ： Acceptance Notification
Wait for notification from Administrator ！
(1)
(2)

37/50
▲ Step 3 ： Login with your account
Login http://demo.crawlzilla.info (1)
(2)

38/50
▲ Step 4 ： Setup Name, URLs and depth
Setup new Search Pool
(1)
(2)
(3)
(4)
(5)

39/50
▲ Step 5 ： wait for crawlzilla to generate Index DB
Wait for Crawlzilla to generate Index for you
(1)

40/50
▲ You could know how long to generate Index DB
Index Pools Management
(1)
(2)

41/50
▲ Manually Re-Crawl or Delete Index Database
You can manually 'Re-Crawl' and generate new Index DB
(1) (2)

42/50
Crawlzilla 1.0Crawlzilla 1.0 多人版雲端服務多人版雲端服務 (8)(8)
▲ 您可以在索引庫管理看到目前爬取已使用的時間
可以於「系統排程」處進行排程重新爬取（ schedule ）
(1)
(2)
(3)
(4)

43/50
▲ Display the basic information of Index Database
Display the Index Database Basic Informations
(1)
(2)

44/50
▲ Add these syntax to your website
HTML Syntax for Your Personal Search Engine
(1) (2)

45/50
Total Web pages had been indexed,
And Top 50 URLs
(1)
(2)

46/50
It also shows indexed Document Types
and Top 50 Keywords
(1) (2)

47/50
Crawlzilla Web UI
User
Index DB
ftp:// file:// Skydrive

48/50
Start from Here!Start from Here!
●
Crawlzilla Demo Cloud Service
● http://demo.crawlzilla.info
●
Crawlzilla @ Google Code Project Hosting
● http://code.google.com/p/crawlzilla/
●
Crawlzilla @ Source Forge (Toturial in English)
● http://sourceforge.net/p/crawlzilla/home/
●
Crawlzilla User Group @ Google
● http://groups.google.com/group/crawlzilla-user
●
NCHC Cloud Computing Research Group
● http://trac.nchc.org.tw/cloud

49/50
Authors of CrawlzillaAuthors of Crawlzilla
Waue Chen (Left)
waue0920@gmail.com
Rock Kuo (Center)
goldjay1231@gmail.com
Shunfa Yang (Right)
shunfa@nchc.org.tw
shunfa@gmail.com

50/50
Questions?Questions?
Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud
Questions?Questions?
Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud
Jazz WangJazz Wang
Jazz WangJazz Wang

RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (20)

Similar to RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

Similar to RMLL 2013 : Build Your Personal Search Engine using Crawlzilla (20)

Recently uploaded

Recently uploaded (20)

RMLL 2013 : Build Your Personal Search Engine using Crawlzilla