RMLL 2013 : Build Your Personal Search Engine using Crawlzilla

  1. 1. Jazz Wang (Yao-Tsung Wang) jazz@nchc.org.tw | Build Your Personal Search Engine using Crawlzilla
  2. 2. 2/50 WHO AM I ? JAZZ ?? • Speaker : – Jazz Yao-Tsung Wang / ECE Master – Associate Researcher, NCHC, NARL, Taiwan – Co-Founder of Hadoop.TW – jazz@nchc.org.tw • My slides are available on the website – http://trac.nchc.org.tw/cloud , http://www.slideshare.net/jazzwang • FLOSS User : Debian/Ubuntu, Access Grid, Motion/VLC, Red5, Debian Router, DRBL/Clonezilla, Hadoop • FLOSS Evangelist : DRBL/Clonezilla, Partclone/Tuxboot, Hadoop Ecosystem • FLOSS Developer : DRBL/Clonezilla, Hadoop Ecosystem
  3. 3. 3/50 WHAT is Big Data?
  4. 4. 4/50 Data Explosion!! Source : The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, March 2007, An IDC White Paper - sponsored by EMC http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
  5. 5. 5/50 According to IDC reports: 2006 161 EB | 2007 281 EB | 2008 487 EB | 2009 800 EB (0.8 ZB) | 2010 988 EB predicted vs. 1200 EB (1.2 ZB) actual | 2011 1773 EB predicted vs. 1800 EB (1.8 ZB) actual. Data expanded 1.6x each year !! Source : Extracting Value from Chaos, June 2011, An IDC White Paper - sponsored by EMC http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
  6. 6. 6/50 What is Big Data?! 'Big Data' = data sets, from a few dozen terabytes to petabytes in a single data set, that commonly used software tools can't capture, manage and process. Examples: multiple files totalling 20TB, a single database totalling 20TB, a single file totalling 20TB. Source : http://en.wikipedia.org/wiki/Big_data
  7. 7. 7/50 Gartner Big Data Model ? The challenge of 'Big Data' is to manage Volume, Variety and Velocity. Volume (amount of data): TB, PB, EB. Velocity (speed of data in/out): batch job, realtime. Variety (data types, sources): structured, semi-structured, unstructured. Source : [1] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (6 February 2001) [2] Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data, June 2011
  8. 8. 8/50 12D of Information Management? Big Data is just the beginning of Extreme Information Management. Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme Information Management, 7 April 2011, http://www.gartner.com/id=1622715
  9. 9. 9/50 Possible Applications of Big Data? Top 1 : Human Genomics – 7000 PB / Year. Top 2 : Digital Photos – 1000 PB+ / Year. Top 3 : E-mail (no spam) – 300 PB+ / Year. Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf Source: http://lib.stanford.edu/files/see_pasig_dic.pdf
  10. 10. 10/50 HOW to deal with Big Data?
  11. 11. 11/50 The SMAQ stack for big data Source : The SMAQ stack for big data, Edd Dumbill, 22 September 2010, http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
  12. 12. 12/50 The SMAQ stack for big data Source : The SMAQ stack for big data, Edd Dumbill, 22 September 2010, http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
  13. 13. 13/50 The SMAQ stack for big data Source : The SMAQ stack for big data, Edd Dumbill, 22 September 2010, http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
  14. 14. 14/50 Three Core Technologies of Google .... • Google shared the design of their web-search engine – SOSP 2003 : "The Google File System" – http://labs.google.com/papers/gfs.html – OSDI 2004 : "MapReduce : Simplified Data Processing on Large Clusters" – http://labs.google.com/papers/mapreduce.html – OSDI 2006 : "Bigtable: A Distributed Storage System for Structured Data" – http://labs.google.com/papers/bigtable-osdi06.pdf
  15. 15. 15/50 Open Source Mapping of Google Core Technologies (Google's Stack → Open Source Projects): S = Storage : Google File System (to store petabytes of data) → Hadoop Distributed File System (HDFS), Sector Distributed File System. M = MapReduce (to parallel-process data) → Hadoop MapReduce API, Sphere MapReduce API, ... Q = Query : BigTable (a huge key-value datastore) → HBase, Hypertable, Cassandra, ...
  16. 16. 16/50 Hadoop • http://hadoop.apache.org • Apache Top Level Project • Major sponsor is Yahoo! • Developed by Doug Cutting, referencing Google's File System paper • Written in Java; provides HDFS and a MapReduce API • Used at Yahoo! since 2006 • Has been deployed on 4000+ nodes at Yahoo! • Designed to process petabyte-scale datasets • Facebook, Last.fm and Joost are also powered by Hadoop
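
  The slide mentions the MapReduce API only in passing, so here is a minimal, hedged sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API (Hadoop 2.x style; details vary by release). It is not taken from the talk; the class names and the command-line input/output paths are illustrative assumptions.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

  Packaged into a jar, a job like this would typically be submitted with something along the lines of "hadoop jar wordcount.jar WordCount <input dir> <output dir>", where both directories are HDFS paths.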
  17. 17. 17/50 WHO will need Hadoop? Let's take a 'Search Engine' as an example
  18. 18. 18/50 Search is everywhere in our daily life !! Files, Mail, Pidgin, Databases, Web Pages
  19. 19. 19/50 To speed up search, we need an "Index" : Keyword → Page Number
  20. 20. 20/50 History of Hadoop … 2001~2005 • Lucene – http://lucene.apache.org/ – a high-performance, full-featured text search engine library written entirely in Java. – Lucene creates an inverted index of every word across different documents, which improves the performance of text searching.
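
  The inverted index is the core idea behind Lucene: instead of scanning every document per query, you precompute a map from each word to the documents that contain it. Below is a hedged, toy plain-Java sketch of that concept only; it is not Lucene's API or implementation, and the class and method names are illustrative assumptions.

    import java.util.*;

    // A toy inverted index: word -> set of document IDs containing that word.
    public class TinyInvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        // Tokenize a document and record which words appear in it.
        public void addDocument(int docId, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }

        // Look up the documents that contain a keyword.
        public Set<Integer> search(String keyword) {
            return postings.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.addDocument(1, "Crawlzilla builds a personal search engine");
            index.addDocument(2, "Nutch builds on Lucene and adds a web crawler");
            System.out.println(index.search("builds"));  // prints [1, 2]
            System.out.println(index.search("crawler")); // prints [2]
        }
    }

  Lucene's real index additionally stores term positions, term frequencies and scoring information, and keeps its postings in compressed on-disk segments.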
  21. 21. 21/50 History of Hadoop … 2005~2006 • Nutch – http://nutch.apache.org/ – Nutch is open source web-search software. – It builds on Lucene and Solr, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.
  22. 22. 22/50 History of Hadoop … 2006 ~ Now • Nutch ran into performance issues. • Referencing Google's papers, DFS & MapReduce implementations were added to Nutch. • Following user feedback on the Nutch mailing list, Hadoop became a separate project as of Nutch 0.8. • Nutch DFS → Hadoop Distributed File System (HDFS) • Yahoo! hired Doug Cutting in 2006 to build a web search engine team – only 14 team members (engineers, clusters, users, etc.) • Doug Cutting joined Cloudera in 2009.
  23. 23. 23/50 Do you like to write notes?
  24. 24. 24/50 Tools that I used to write notes : Oddmuse Wiki – http://www.oddmuse.org/ (2005~2008)
  25. 25. 25/50 Tools that I used to write notes : PmWiki – http://www.pmwiki.org/
  26. 26. 26/50 Tools that I used to write notes : ScrapBook – https://addons.mozilla.org/zh-TW/firefox/addon/scrapbook/ (2005~NOW)
  27. 27. 27/50 Tools that I used to write notes : Trac = Wiki + Version Control (SVN or GIT) – http://trac.edgewall.org/ (2006~NOW)
  28. 28. 28/50 Tools that I used to write notes : ReadItLater – http://readitlaterlist.com/ (2010~NOW)
  29. 29. 29/50 It's painful to search all my notes!
  30. 30. 30/50 How do I search all my notes from different websites?
  31. 31. 31/50 Features of Crawlzilla ● Cluster-based ● Chinese Word Segmentation ● Multiple Users with Multiple Index Pools ● Re-Crawl ● Schedule / Crontab ● Display Index Database statistics (Top 50 sites, keywords) ● Support UTF-8 Chinese Search Results ● Web-based User Interface
  32. 32. 32/50 System Architecture of Crawlzilla – [Architecture diagram] A Web UI (Crawlzilla Website + Search Engine) built with JSP + Servlet + JavaBean on Tomcat, plus Crawlzilla system management on top of Nutch, Lucene and Hadoop, running across a cluster (PC1, PC2, PC3).
  33. 33. 33/50 [Diagram] The Crawlzilla Web UI connects Users to their Index DBs.
  34. 34. 34/50 Comparison with other projects
      Feature               | Spidr                | Larbin                    | Jcrawl                   | Nutch                  | Crawlzilla
      Install               | Ruby package install | gmake compile and install | Java compile and install | Deploy configure files | Provides auto installation
      Crawl website pages   | O                    | O                         | O                        | O                      | O
      Parse content         | X                    | X                         | X                        | O                      | O
      Cluster computing     | X                    | X                         | X                        | O                      | O
      Interface             | Command              | Command                   | Command                  | Command                | Web UI
      Chinese segmentation  | X                    | X                         | X                        | X                      | O
  35. 35. 35/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 1 : Registration at http://demo.crawlzilla.info
  36. 36. 36/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 2 : Acceptance notification – wait for notification from the administrator!
  37. 37. 37/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 3 : Log in with your account at http://demo.crawlzilla.info
  38. 38. 38/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 4 : Set up a new Search Pool – name, URLs and crawl depth
  39. 39. 39/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Step 5 : Wait for Crawlzilla to generate the Index DB for you
  40. 40. 40/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Index Pool Management – shows how long it took to generate each Index DB
  41. 41. 41/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Manually 'Re-Crawl' to generate a new Index DB, or delete an Index Database
  42. 42. 42/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Index Pool Management also shows the elapsed time of the current crawl; under 'System Schedule' you can schedule periodic re-crawls
  43. 43. 43/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Display the basic information of the Index Database
  44. 44. 44/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Add the generated HTML syntax to your website to embed your personal search engine
  45. 45. 45/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ Total web pages indexed, and the Top 50 URLs
  46. 46. 46/50 Multi-user Web Search Cloud Service : Crawlzilla 1.0 ▲ It also shows indexed document types and the Top 50 keywords
  47. 47. 47/50 [Diagram] Crawlzilla Web UI and per-user Index DBs, with additional data sources: ftp://, file://, SkyDrive
  48. 48. 48/50 Start from Here! ● Crawlzilla Demo Cloud Service ● http://demo.crawlzilla.info ● Crawlzilla @ Google Code Project Hosting ● http://code.google.com/p/crawlzilla/ ● Crawlzilla @ SourceForge (Tutorial in English) ● http://sourceforge.net/p/crawlzilla/home/ ● Crawlzilla User Group @ Google ● http://groups.google.com/group/crawlzilla-user ● NCHC Cloud Computing Research Group ● http://trac.nchc.org.tw/cloud
  49. 49. 49/50 Authors of Crawlzilla – Waue Chen (left) waue0920@gmail.com – Rock Kuo (center) goldjay1231@gmail.com – Shunfa Yang (right) shunfa@nchc.org.tw , shunfa@gmail.com
  50. 50. 50/50 Questions? Slides - http://trac.nchc.org.tw/cloud – Jazz Wang (Yao-Tsung Wang) jazz@nchc.org.tw
