SlideShare a Scribd company logo
1 of 50
Download to read offline
Jazz WangJazz Wang
Yao-Tsung WangYao-Tsung Wang
jazz@nchc.org.twjazz@nchc.org.tw
Jazz WangJazz Wang
Yao-Tsung WangYao-Tsung Wang
jazz@nchc.org.twjazz@nchc.org.tw
Build Your PersonalBuild Your Personal
Search Engine using CrawlzillaSearch Engine using Crawlzilla
Build Your PersonalBuild Your Personal
Search Engine using CrawlzillaSearch Engine using Crawlzilla
2/50
WHO AM I ? JAZZWHO AM I ? JAZZ ??WHO AM I ? JAZZWHO AM I ? JAZZ ??
• Speaker :
– Jazz Yao-Tsung Wang / ECE Master
– Associate Researcher , NCHC, NARL, Taiwan
– Co-Founder of Hadoop.TW
– jazz@nchc.org.tw
• My slides are available on the website
– http://trac.nchc.org.tw/cloud , http://www.slideshare.net/jazzwang
FLOSS UserFLOSS User
Debian/Ubutnu
Access Grid
Motion/VLC
Red5
Debian Router
DRBL/Clonezilla
Hadoop
FLOSS EvalistFLOSS Evalist
DRBL/Clonezilla
Partclone/Tuxboot
Hadoop Ecosystem
FLOSS DeveloperFLOSS Developer
DRBL/Clonezilla
Hadoop Ecosystem
3/50
WHAT is
Big Data?
4/50
Data Explosion!!Data Explosion!!
Source : The Expanding Digital Universe,
A Forecast of Worldwide Information Growth Through 2010,
March 2007, An IDC White Paper - sponsored by EMC
http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
5/50
Source : Extracting Value from Chaos,
June 2011, An IDC White Paper - sponsored by EMC
http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
According to IDC reports:
2006 161 EB
2007 281 EB
2008 487 EB
2009 800 EB (0.8 ZB)
2010 988 EB (predict)
2010 1200 EB (1.2 ZB)
2011 1773 EB (predict)
2011 1800 EB (1.8 ZB)
Data expanded 1.6x each year !!Data expanded 1.6x each year !!
6/50
Commonly used software tools can't
capture, manage and process.
'Big Data' = few dozen TeraBytes to PetaBytes in single data set.
What is Big Data?!What is Big Data?!
Multiple files,
totally 20TB
Single Database,
totally 20TB
Single file,
totally 20TB
Source : http://en.wikipedia.org/wiki/Big_data
7/50
Gartner Big Data Model ?Gartner Big Data Model ?
Challenge of 'Big Data' is to manage Volum, Variety and Velocity.Challenge of 'Big Data' is to manage Volum, Variety and Velocity.
Volume
(amount of data)
Velocity
(speed of data in/out)
Variety
(data types, sources)
Batch Job
Realtime
TB
EB
Unstructured
Semi-structured
Structured
PB
Source :
[1] Laney, Douglas. "3D Data Management: Controlling
Data Volume, Velocity and Variety" (6 February 2001)
[2] Gartner Says Solving 'Big Data' Challenge Involves
More Than Just Managing Volumes of Data, June 2011
8/50
12D of Information Management?12D of Information Management?
Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme
Information Management, 7 April 2011, http://www.gartner.com/id=1622715
Big Data
is just the
beginning
Of Extrem
Information
Management
9/50
Possible Applications of Big Data?Possible Applications of Big Data?
Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
Source: http://lib.stanford.edu/files/see_pasig_dic.pdf
Top 1 : Human Genomics – 7000 PB / Year
Top 2 : Digital Photos – 1000 PB+/ Year
Top 3 : E-mail (no Spam) – 300 PB+ / Year
10/50
HOW to
deal with
Big Data?
11/50
The SMAQ stack for big dataThe SMAQ stack for big data
Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,   
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
12/50
The SMAQ stack for big dataThe SMAQ stack for big data
Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,   
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
13/50
Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,   
      http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
The SMAQ stack for big dataThe SMAQ stack for big data
14/50
Three Core Technologies of Google ....Three Core Technologies of Google ....
• Google shared their design of web-search engine
– SOSP 2003 :
– “The Google File System”
– http://labs.google.com/papers/gfs.html
– OSDI 2004 :
– “MapReduce : Simplifed Data Processing on Large Cluster”
– http://labs.google.com/papers/mapreduce.html
– OSDI 2006 :
– “Bigtable: A Distributed Storage System for Structured Data”
– http://labs.google.com/papers/bigtable-osdi06.pdf
15/50
Open Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core Technologies
Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS)
Sector Distributed File SystemSector Distributed File System
Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS)
Sector Distributed File SystemSector Distributed File System
Hadoop MapReduce APIHadoop MapReduce API
Sphere MapReduce API, ...Sphere MapReduce API, ...
Hadoop MapReduce APIHadoop MapReduce API
Sphere MapReduce API, ...Sphere MapReduce API, ...
HBase,HBase, HypertableHypertable
Cassandra, ....Cassandra, ....
HBase,HBase, HypertableHypertable
Cassandra, ....Cassandra, ....
S = StorageS = Storage
Google File SystemGoogle File System
To store petabytes of dataTo store petabytes of data
S = StorageS = Storage
Google File SystemGoogle File System
To store petabytes of dataTo store petabytes of data
MMapapRReduceeduce
To parallel process dataTo parallel process data
MMapapRReduceeduce
To parallel process dataTo parallel process data
Q = QueryQ = Query
BigTableBigTable
A huge key-value datastoreA huge key-value datastore
Q = QueryQ = Query
BigTableBigTable
A huge key-value datastoreA huge key-value datastore
Google's Stack Open Source Projects
16/50
HadoopHadoopHadoopHadoop
• http://hadoop.apache.org
• Apache Top Level Project
• Major sponsor is Yahoo!
• Developed by Doug Cutting,
Reference from Google Filesystem
• Written by Java, it provides HDFS and
MapReduce API
• Used in Yahoo since year 2006
• It had been deploy to 4000+ nodes in Yahoo
• Design to process dataset in Petabyte
• Facebook 、 Last.fm 、 Joost are also powered by
Hadoop
17/50
WHO will
needs Hadoop?
Let's take
'Search Engine' as
an example
18/50
Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !!
FileFile
MailMail PidgenPidgen
DatabaseDatabase
Web PagesWeb Pages
19/50
To speed up search, We need “Index”To speed up search, We need “Index”
Keyword
Page Number
20/50
History of Hadoop …History of Hadoop … 2001~20052001~2005
• Lucene
– http://lucene.apache.org/
– a high-performance, full-featured text search
engine library written entirely in Java.
– Lucene create an inverse index of every word in
different documents. It enhance performance of
text searching.
21/50
History of Hadoop …History of Hadoop … 2005~20062005~2006
• Nutch
– http://nutch.apache.org/
– Nutch is open source web-search software.
– It builds on Lucene and Solr, adding web-
specifics, such as a crawler, a link-graph
database, parsers for HTML and other
document formats, etc.
22/50
History of Hadoop …History of Hadoop … 2006 ~ Now2006 ~ Now
• Nutch encounter performance issue.
• Reference from Google's papers.
• Added DFS & MapReduce implement to Nutch
• According to user feedback on the mail list of Nutch ....
• Hadoop became separated project since Nutch 0.8
• Nutch DFS → Hadoop Distributed File System (HDFS)
• Yahoo hire Dong Cutting to build a team of web search
engine at year 2006.
– Only 14 team members (engineers, clusters, users, etc.)
• Doung Cutting joined Cloudera at year 2009.
23/50
Do you like to write notes?Do you like to write notes?
24/50
Tools that I used to write notesTools that I used to write notes
Oddmuse WikiOddmuse Wiki
http://www.oddmuse.org/
2005~2008
25/50
Tools that I used to write notesTools that I used to write notes
PmWikiPmWiki
http://www.pmwiki.org/
26/50
Tools that I used to write notesTools that I used to write notes
ScrapBookScrapBook
https://addons.mozilla.org/zh-TW/firefox/addon/scrapbook/
2005~NOW
27/50
Tools that I used to write notesTools that I used to write notes
Trac = Wiki + Version Control (SVN or GIT)Trac = Wiki + Version Control (SVN or GIT)
http://trac.edgewall.org/
2006~NOW
28/50
Tools that I used to write notesTools that I used to write notes
ReadItLaterReadItLater
http://readitlaterlist.com/
2010~NOW
29/50
It's
painful to
search all
my notes!
30/50
How do I search all my notesHow do I search all my notes
from different websites?from different websites?
31/50
Feature of CrawlzillaFeature of Crawlzilla
●
Cluster-based
●
Chinese Word
Segmentation
●
Multiple Users
with Multiple
Index Pools
●
Re-Crawl
●
Schedule /
Crontab
●
Display Index
Database (Top 50
sites, keywords)
●
Support UTF-8
Chinese Search
Results
●
Web-based User
Interface
32/50
Hadoop
Tomcat
Crawlzilla System Management
Lucene
Nutch
JSP + Servlet +
JavaBean
PC1 PC2 PC3
Web UI ( Crawlzilla Website + Search Engine)
System Architecture of CrawlzillaSystem Architecture of Crawlzilla
33/50
Crawlzilla Web UI
Users

Index DB
34/50
Comparison with other projectsComparison with other projects
Spidr Larbin Jcrawl Nutch Crawlzilla
Install
Rube
Package
Install
Gmake
Compiler
and Install
Java
Compiler
and Install
Deploy
Configure
Files
Provide Auto
Installation
Crawl website
pages
O O O O O
Parser Content X X X O O
Cluster Computing X X X O O
Interface Command Command Command Command Web-UI
Support Chinese
Segmentation
X X X X O
35/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 1 : Registration
http://demo.crawlzilla.info (1)
(2)
(3)
36/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 2 : Acceptance Notification
Wait for notification from Administrator !
(1)
(2)
37/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 3 : Login with your account
Login http://demo.crawlzilla.info (1)
(2)
38/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 4 : Setup Name, URLs and depth
Setup new Search Pool
(1)
(2)
(3)
(4)
(5)
39/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Step 5 : wait for crawlzilla to generate Index DB
Wait for Crawlzilla to generate Index for you
(1)
40/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ You could know how long to generate Index DB
Index Pools Management
(1)
(2)
41/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Manually Re-Crawl or Delete Index Database
You can manually 'Re-Crawl' and generate new Index DB
(1) (2)
42/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
Crawlzilla 1.0Crawlzilla 1.0 多人版雲端服務多人版雲端服務 (8)(8)
▲ 您可以在索引庫管理看到目前爬取已使用的時間
可以於「系統排程」處進行排程重新爬取( schedule )
(1)
(2)
(3)
(4)
43/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Display the basic information of Index Database
Display the Index Database Basic Informations
(1)
(2)
44/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
▲ Add these syntax to your website
HTML Syntax for Your Personal Search Engine
(1) (2)
45/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
Total Web pages had been indexed,
And Top 50 URLs
(1)
(2)
46/50
Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0
It also shows indexed Document Types
and Top 50 Keywords
(1) (2)
47/50
Crawlzilla Web UI
User
Index DB
ftp:// file:// Skydrive
48/50
Start from Here!Start from Here!
●
Crawlzilla Demo Cloud Service
● http://demo.crawlzilla.info
●
Crawlzilla @ Google Code Project Hosting
● http://code.google.com/p/crawlzilla/
●
Crawlzilla @ Source Forge (Toturial in English)
● http://sourceforge.net/p/crawlzilla/home/
●
Crawlzilla User Group @ Google
● http://groups.google.com/group/crawlzilla-user
●
NCHC Cloud Computing Research Group
● http://trac.nchc.org.tw/cloud
49/50
Authors of CrawlzillaAuthors of Crawlzilla
Waue Chen (Left)
waue0920@gmail.com
Rock Kuo (Center)
goldjay1231@gmail.com
Shunfa Yang (Right)
shunfa@nchc.org.tw
shunfa@gmail.com
50/50
Questions?Questions?
Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud
Questions?Questions?
Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud
Jazz WangJazz Wang
Yao-Tsung WangYao-Tsung Wang
jazz@nchc.org.twjazz@nchc.org.tw
Jazz WangJazz Wang
Yao-Tsung WangYao-Tsung Wang
jazz@nchc.org.twjazz@nchc.org.tw

More Related Content

What's hot

SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
Korea Sdec
 
MySQL & NoSQL from a PHP Perspective
MySQL & NoSQL from a PHP PerspectiveMySQL & NoSQL from a PHP Perspective
MySQL & NoSQL from a PHP Perspective
Tim Juravich
 

What's hot (19)

ElasticSearch: Найдется все... и быстро!
ElasticSearch: Найдется все... и быстро!ElasticSearch: Найдется все... и быстро!
ElasticSearch: Найдется все... и быстро!
 
Webinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionWebinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with Fusion
 
RDFa
RDFaRDFa
RDFa
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)
 
Dizajn REST API-ja
Dizajn REST API-jaDizajn REST API-ja
Dizajn REST API-ja
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQL
 
3.15.17 DSpace: How to Contribute Webinar Slides
3.15.17 DSpace: How to Contribute Webinar Slides3.15.17 DSpace: How to Contribute Webinar Slides
3.15.17 DSpace: How to Contribute Webinar Slides
 
Search all the things
Search all the thingsSearch all the things
Search all the things
 
The Cultural Linked Data Backbone
The Cultural Linked Data BackboneThe Cultural Linked Data Backbone
The Cultural Linked Data Backbone
 
OCLC Linked Data Progress
OCLC Linked Data ProgressOCLC Linked Data Progress
OCLC Linked Data Progress
 
Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013
 
Cassandra 3 new features @ Geecon Krakow 2016
Cassandra 3 new features  @ Geecon Krakow 2016Cassandra 3 new features  @ Geecon Krakow 2016
Cassandra 3 new features @ Geecon Krakow 2016
 
Data Exploration with Elasticsearch
Data Exploration with ElasticsearchData Exploration with Elasticsearch
Data Exploration with Elasticsearch
 
Where is my data (in the cloud) tamir dresher
Where is my data (in the cloud)   tamir dresherWhere is my data (in the cloud)   tamir dresher
Where is my data (in the cloud) tamir dresher
 
Open Source Community Metrics for FOSDEM
Open Source Community Metrics for FOSDEMOpen Source Community Metrics for FOSDEM
Open Source Community Metrics for FOSDEM
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
 
Jena Programming
Jena ProgrammingJena Programming
Jena Programming
 
MySQL & NoSQL from a PHP Perspective
MySQL & NoSQL from a PHP PerspectiveMySQL & NoSQL from a PHP Perspective
MySQL & NoSQL from a PHP Perspective
 

Viewers also liked

Cloud Computing for Bioinformatics
Cloud Computing for BioinformaticsCloud Computing for Bioinformatics
Cloud Computing for Bioinformatics
Jazz Yao-Tsung Wang
 
Big Data : The Missing Puzzle of Mobile Computing
Big Data : The Missing Puzzle of Mobile ComputingBig Data : The Missing Puzzle of Mobile Computing
Big Data : The Missing Puzzle of Mobile Computing
Jazz Yao-Tsung Wang
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
Jazz Yao-Tsung Wang
 
The Trend Of Cloud Computing And How Should Public Sectors Adjust
The Trend Of Cloud Computing And How Should Public Sectors AdjustThe Trend Of Cloud Computing And How Should Public Sectors Adjust
The Trend Of Cloud Computing And How Should Public Sectors Adjust
Jazz Yao-Tsung Wang
 
Big Data Taiwan : Supply Chain and Communities
Big Data Taiwan : Supply Chain and CommunitiesBig Data Taiwan : Supply Chain and Communities
Big Data Taiwan : Supply Chain and Communities
Jazz Yao-Tsung Wang
 
Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
Jazz Yao-Tsung Wang
 
Introduction to Diskless Remote Boot in Linux
Introduction to Diskless Remote Boot in LinuxIntroduction to Diskless Remote Boot in Linux
Introduction to Diskless Remote Boot in Linux
Jazz Yao-Tsung Wang
 
Unattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseedUnattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseed
Jazz Yao-Tsung Wang
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
Jazz Yao-Tsung Wang
 
淺談台灣巨量資料產業供應鏈串聯現況
淺談台灣巨量資料產業供應鏈串聯現況淺談台灣巨量資料產業供應鏈串聯現況
淺談台灣巨量資料產業供應鏈串聯現況
Jazz Yao-Tsung Wang
 
13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening
Jazz Yao-Tsung Wang
 
Big data taiwan_supply_chain_and_communities_20130912
Big data taiwan_supply_chain_and_communities_20130912Big data taiwan_supply_chain_and_communities_20130912
Big data taiwan_supply_chain_and_communities_20130912
Jazz Yao-Tsung Wang
 
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
Jazz Yao-Tsung Wang
 

Viewers also liked (20)

Cloud Computing for Bioinformatics
Cloud Computing for BioinformaticsCloud Computing for Bioinformatics
Cloud Computing for Bioinformatics
 
Big Data : The Missing Puzzle of Mobile Computing
Big Data : The Missing Puzzle of Mobile ComputingBig Data : The Missing Puzzle of Mobile Computing
Big Data : The Missing Puzzle of Mobile Computing
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
 
The Trend Of Cloud Computing And How Should Public Sectors Adjust
The Trend Of Cloud Computing And How Should Public Sectors AdjustThe Trend Of Cloud Computing And How Should Public Sectors Adjust
The Trend Of Cloud Computing And How Should Public Sectors Adjust
 
Build Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and HaduzillaBuild Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and Haduzilla
 
Big Data Taiwan : Supply Chain and Communities
Big Data Taiwan : Supply Chain and CommunitiesBig Data Taiwan : Supply Chain and Communities
Big Data Taiwan : Supply Chain and Communities
 
Hadoop.TW : Now and Future
Hadoop.TW : Now and FutureHadoop.TW : Now and Future
Hadoop.TW : Now and Future
 
HPC For Bioinformatics
HPC For BioinformaticsHPC For Bioinformatics
HPC For Bioinformatics
 
Hadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TWHadoop Deployment Model @ OSDC.TW
Hadoop Deployment Model @ OSDC.TW
 
Introduction to Diskless Remote Boot in Linux
Introduction to Diskless Remote Boot in LinuxIntroduction to Diskless Remote Boot in Linux
Introduction to Diskless Remote Boot in Linux
 
Unattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseedUnattended Apache BigTop installer CD using preseed
Unattended Apache BigTop installer CD using preseed
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
 
Enterprise Data Lake in Action
Enterprise Data Lake in ActionEnterprise Data Lake in Action
Enterprise Data Lake in Action
 
淺談台灣巨量資料產業供應鏈串聯現況
淺談台灣巨量資料產業供應鏈串聯現況淺談台灣巨量資料產業供應鏈串聯現況
淺談台灣巨量資料產業供應鏈串聯現況
 
13 11-08
13 11-0813 11-08
13 11-08
 
淺談台灣巨量資料產業發展現況
淺談台灣巨量資料產業發展現況淺談台灣巨量資料產業發展現況
淺談台灣巨量資料產業發展現況
 
Big Data Communities in Taiwan
Big Data Communities in TaiwanBig Data Communities in Taiwan
Big Data Communities in Taiwan
 
13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening13 09-28 hadoop-in_taiwan_2013_opening
13 09-28 hadoop-in_taiwan_2013_opening
 
Big data taiwan_supply_chain_and_communities_20130912
Big data taiwan_supply_chain_and_communities_20130912Big data taiwan_supply_chain_and_communities_20130912
Big data taiwan_supply_chain_and_communities_20130912
 
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
2014-10-17 探析台灣巨量資料產業供應鏈串聯現況
 

Similar to RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

Webdevcon Keynote hh-2012-09-18
Webdevcon Keynote hh-2012-09-18Webdevcon Keynote hh-2012-09-18
Webdevcon Keynote hh-2012-09-18
Pierre Joye
 
ApacheCon NA 2011 report
ApacheCon NA 2011 reportApacheCon NA 2011 report
ApacheCon NA 2011 report
Koji Kawamura
 
Dallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: HadoopDallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: Hadoop
lamont_lockwood
 
Devops kc meetup_5_20_2013
Devops kc meetup_5_20_2013Devops kc meetup_5_20_2013
Devops kc meetup_5_20_2013
Aaron Blythe
 
Developing With Openbravo Rl Eppt
Developing With Openbravo Rl EpptDeveloping With Openbravo Rl Eppt
Developing With Openbravo Rl Eppt
vobree
 

Similar to RMLL 2013 : Build Your Personal Search Engine using Crawlzilla (20)

Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Webdevcon Keynote hh-2012-09-18
Webdevcon Keynote hh-2012-09-18Webdevcon Keynote hh-2012-09-18
Webdevcon Keynote hh-2012-09-18
 
Niatalk24jan10
Niatalk24jan10Niatalk24jan10
Niatalk24jan10
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
ApacheCon NA 2011 report
ApacheCon NA 2011 reportApacheCon NA 2011 report
ApacheCon NA 2011 report
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 
Guidance, Code and Education: ScalaCenter and the Scala Community, Heather Mi...
Guidance, Code and Education: ScalaCenter and the Scala Community, Heather Mi...Guidance, Code and Education: ScalaCenter and the Scala Community, Heather Mi...
Guidance, Code and Education: ScalaCenter and the Scala Community, Heather Mi...
 
Dallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: HadoopDallas TDWI Meeting Dec. 2012: Hadoop
Dallas TDWI Meeting Dec. 2012: Hadoop
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Devops kc meetup_5_20_2013
Devops kc meetup_5_20_2013Devops kc meetup_5_20_2013
Devops kc meetup_5_20_2013
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Developing With Openbravo Rl Eppt
Developing With Openbravo Rl EpptDeveloping With Openbravo Rl Eppt
Developing With Openbravo Rl Eppt
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
DataHub
DataHubDataHub
DataHub
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

  • 1. Jazz WangJazz Wang Yao-Tsung WangYao-Tsung Wang jazz@nchc.org.twjazz@nchc.org.tw Jazz WangJazz Wang Yao-Tsung WangYao-Tsung Wang jazz@nchc.org.twjazz@nchc.org.tw Build Your PersonalBuild Your Personal Search Engine using CrawlzillaSearch Engine using Crawlzilla Build Your PersonalBuild Your Personal Search Engine using CrawlzillaSearch Engine using Crawlzilla
  • 2. 2/50 WHO AM I ? JAZZWHO AM I ? JAZZ ??WHO AM I ? JAZZWHO AM I ? JAZZ ?? • Speaker : – Jazz Yao-Tsung Wang / ECE Master – Associate Researcher , NCHC, NARL, Taiwan – Co-Founder of Hadoop.TW – jazz@nchc.org.tw • My slides are available on the website – http://trac.nchc.org.tw/cloud , http://www.slideshare.net/jazzwang FLOSS UserFLOSS User Debian/Ubutnu Access Grid Motion/VLC Red5 Debian Router DRBL/Clonezilla Hadoop FLOSS EvalistFLOSS Evalist DRBL/Clonezilla Partclone/Tuxboot Hadoop Ecosystem FLOSS DeveloperFLOSS Developer DRBL/Clonezilla Hadoop Ecosystem
  • 4. 4/50 Data Explosion!!Data Explosion!! Source : The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, March 2007, An IDC White Paper - sponsored by EMC http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
  • 5. 5/50 Source : Extracting Value from Chaos, June 2011, An IDC White Paper - sponsored by EMC http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf According to IDC reports: 2006 161 EB 2007 281 EB 2008 487 EB 2009 800 EB (0.8 ZB) 2010 988 EB (predict) 2010 1200 EB (1.2 ZB) 2011 1773 EB (predict) 2011 1800 EB (1.8 ZB) Data expanded 1.6x each year !!Data expanded 1.6x each year !!
  • 6. 6/50 Commonly used software tools can't capture, manage and process. 'Big Data' = few dozen TeraBytes to PetaBytes in single data set. What is Big Data?!What is Big Data?! Multiple files, totally 20TB Single Database, totally 20TB Single file, totally 20TB Source : http://en.wikipedia.org/wiki/Big_data
  • 7. 7/50 Gartner Big Data Model ?Gartner Big Data Model ? Challenge of 'Big Data' is to manage Volum, Variety and Velocity.Challenge of 'Big Data' is to manage Volum, Variety and Velocity. Volume (amount of data) Velocity (speed of data in/out) Variety (data types, sources) Batch Job Realtime TB EB Unstructured Semi-structured Structured PB Source : [1] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (6 February 2001) [2] Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data, June 2011
  • 8. 8/50 12D of Information Management?12D of Information Management? Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme Information Management, 7 April 2011, http://www.gartner.com/id=1622715 Big Data is just the beginning Of Extrem Information Management
  • 9. 9/50 Possible Applications of Big Data?Possible Applications of Big Data? Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf Source: http://lib.stanford.edu/files/see_pasig_dic.pdf Top 1 : Human Genomics – 7000 PB / Year Top 2 : Digital Photos – 1000 PB+/ Year Top 3 : E-mail (no Spam) – 300 PB+ / Year
  • 11. 11/50 The SMAQ stack for big dataThe SMAQ stack for big data Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,          http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
  • 12. 12/50 The SMAQ stack for big dataThe SMAQ stack for big data Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,          http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
  • 13. 13/50 Source : The SMAQ stack for big data , Edd Dumbill , 22 September 2010 ,          http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html The SMAQ stack for big dataThe SMAQ stack for big data
  • 14. 14/50 Three Core Technologies of Google ....Three Core Technologies of Google .... • Google shared their design of web-search engine – SOSP 2003 : – “The Google File System” – http://labs.google.com/papers/gfs.html – OSDI 2004 : – “MapReduce : Simplifed Data Processing on Large Cluster” – http://labs.google.com/papers/mapreduce.html – OSDI 2006 : – “Bigtable: A Distributed Storage System for Structured Data” – http://labs.google.com/papers/bigtable-osdi06.pdf
  • 15. 15/50 Open Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core TechnologiesOpen Source Mapping of Google Core Technologies Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS) Sector Distributed File SystemSector Distributed File System Hadoop Distributed File System (HDFS)Hadoop Distributed File System (HDFS) Sector Distributed File SystemSector Distributed File System Hadoop MapReduce APIHadoop MapReduce API Sphere MapReduce API, ...Sphere MapReduce API, ... Hadoop MapReduce APIHadoop MapReduce API Sphere MapReduce API, ...Sphere MapReduce API, ... HBase,HBase, HypertableHypertable Cassandra, ....Cassandra, .... HBase,HBase, HypertableHypertable Cassandra, ....Cassandra, .... S = StorageS = Storage Google File SystemGoogle File System To store petabytes of dataTo store petabytes of data S = StorageS = Storage Google File SystemGoogle File System To store petabytes of dataTo store petabytes of data MMapapRReduceeduce To parallel process dataTo parallel process data MMapapRReduceeduce To parallel process dataTo parallel process data Q = QueryQ = Query BigTableBigTable A huge key-value datastoreA huge key-value datastore Q = QueryQ = Query BigTableBigTable A huge key-value datastoreA huge key-value datastore Google's Stack Open Source Projects
  • 16. 16/50 HadoopHadoopHadoopHadoop • http://hadoop.apache.org • Apache Top Level Project • Major sponsor is Yahoo! • Developed by Doug Cutting, Reference from Google Filesystem • Written by Java, it provides HDFS and MapReduce API • Used in Yahoo since year 2006 • It had been deploy to 4000+ nodes in Yahoo • Design to process dataset in Petabyte • Facebook 、 Last.fm 、 Joost are also powered by Hadoop
  • 17. 17/50 WHO will needs Hadoop? Let's take 'Search Engine' as an example
  • 18. 18/50 Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !!Search is everywhere in our daily life !! FileFile MailMail PidgenPidgen DatabaseDatabase Web PagesWeb Pages
  • 19. 19/50 To speed up search, We need “Index”To speed up search, We need “Index” Keyword Page Number
  • 20. 20/50 History of Hadoop …History of Hadoop … 2001~20052001~2005 • Lucene – http://lucene.apache.org/ – a high-performance, full-featured text search engine library written entirely in Java. – Lucene create an inverse index of every word in different documents. It enhance performance of text searching.
  • 21. 21/50 History of Hadoop …History of Hadoop … 2005~20062005~2006 • Nutch – http://nutch.apache.org/ – Nutch is open source web-search software. – It builds on Lucene and Solr, adding web- specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
  • 22. 22/50 History of Hadoop …History of Hadoop … 2006 ~ Now2006 ~ Now • Nutch encounter performance issue. • Reference from Google's papers. • Added DFS & MapReduce implement to Nutch • According to user feedback on the mail list of Nutch .... • Hadoop became separated project since Nutch 0.8 • Nutch DFS → Hadoop Distributed File System (HDFS) • Yahoo hire Dong Cutting to build a team of web search engine at year 2006. – Only 14 team members (engineers, clusters, users, etc.) • Doung Cutting joined Cloudera at year 2009.
  • 23. 23/50 Do you like to write notes?Do you like to write notes?
  • 24. 24/50 Tools that I used to write notesTools that I used to write notes Oddmuse WikiOddmuse Wiki http://www.oddmuse.org/ 2005~2008
  • 25. 25/50 Tools that I used to write notesTools that I used to write notes PmWikiPmWiki http://www.pmwiki.org/
  • 26. 26/50 Tools that I used to write notesTools that I used to write notes ScrapBookScrapBook https://addons.mozilla.org/zh-TW/firefox/addon/scrapbook/ 2005~NOW
  • 27. 27/50 Tools that I used to write notesTools that I used to write notes Trac = Wiki + Version Control (SVN or GIT)Trac = Wiki + Version Control (SVN or GIT) http://trac.edgewall.org/ 2006~NOW
  • 28. 28/50 Tools that I used to write notesTools that I used to write notes ReadItLaterReadItLater http://readitlaterlist.com/ 2010~NOW
  • 30. 30/50 How do I search all my notesHow do I search all my notes from different websites?from different websites?
  • 31. 31/50 Feature of CrawlzillaFeature of Crawlzilla ● Cluster-based ● Chinese Word Segmentation ● Multiple Users with Multiple Index Pools ● Re-Crawl ● Schedule / Crontab ● Display Index Database (Top 50 sites, keywords) ● Support UTF-8 Chinese Search Results ● Web-based User Interface
  • 32. 32/50 Hadoop Tomcat Crawlzilla System Management Lucene Nutch JSP + Servlet + JavaBean PC1 PC2 PC3 Web UI ( Crawlzilla Website + Search Engine) System Architecture of CrawlzillaSystem Architecture of Crawlzilla
  • 34. 34/50 Comparison with other projectsComparison with other projects Spidr Larbin Jcrawl Nutch Crawlzilla Install Rube Package Install Gmake Compiler and Install Java Compiler and Install Deploy Configure Files Provide Auto Installation Crawl website pages O O O O O Parser Content X X X O O Cluster Computing X X X O O Interface Command Command Command Command Web-UI Support Chinese Segmentation X X X X O
  • 35. 35/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 1 : Registration http://demo.crawlzilla.info (1) (2) (3)
  • 36. 36/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 2 : Acceptance Notification Wait for notification from Administrator ! (1) (2)
  • 37. 37/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 3 : Login with your account Login http://demo.crawlzilla.info (1) (2)
  • 38. 38/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 4 : Setup Name, URLs and depth Setup new Search Pool (1) (2) (3) (4) (5)
  • 39. 39/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 5 : wait for crawlzilla to generate Index DB Wait for Crawlzilla to generate Index for you (1)
  • 40. 40/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ You could know how long to generate Index DB Index Pools Management (1) (2)
  • 41. 41/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Manually Re-Crawl or Delete Index Database You can manually 'Re-Crawl' and generate new Index DB (1) (2)
  • 42. 42/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 Crawlzilla 1.0Crawlzilla 1.0 多人版雲端服務多人版雲端服務 (8)(8) ▲ 您可以在索引庫管理看到目前爬取已使用的時間 可以於「系統排程」處進行排程重新爬取( schedule ) (1) (2) (3) (4)
  • 43. 43/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Display the basic information of Index Database Display the Index Database Basic Informations (1) (2)
  • 44. 44/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Add these syntax to your website HTML Syntax for Your Personal Search Engine (1) (2)
  • 45. 45/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 Total Web pages had been indexed, And Top 50 URLs (1) (2)
  • 46. 46/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0Multi-user Web Search Cloud Service : Crawlzilla 1.0 It also shows indexed Document Types and Top 50 Keywords (1) (2)
  • 47. 47/50 Crawlzilla Web UI User Index DB ftp:// file:// Skydrive
  • 48. 48/50 Start from Here!Start from Here! ● Crawlzilla Demo Cloud Service ● http://demo.crawlzilla.info ● Crawlzilla @ Google Code Project Hosting ● http://code.google.com/p/crawlzilla/ ● Crawlzilla @ Source Forge (Toturial in English) ● http://sourceforge.net/p/crawlzilla/home/ ● Crawlzilla User Group @ Google ● http://groups.google.com/group/crawlzilla-user ● NCHC Cloud Computing Research Group ● http://trac.nchc.org.tw/cloud
  • 49. 49/50 Authors of CrawlzillaAuthors of Crawlzilla Waue Chen (Left) waue0920@gmail.com Rock Kuo (Center) goldjay1231@gmail.com Shunfa Yang (Right) shunfa@nchc.org.tw shunfa@gmail.com
  • 50. 50/50 Questions?Questions? Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud Questions?Questions? Slides - http://trac.nchc.org.tw/cloudSlides - http://trac.nchc.org.tw/cloud Jazz WangJazz Wang Yao-Tsung WangYao-Tsung Wang jazz@nchc.org.twjazz@nchc.org.tw Jazz WangJazz Wang Yao-Tsung WangYao-Tsung Wang jazz@nchc.org.twjazz@nchc.org.tw