How to build a search engine

How to build a small search engine using Hadoop (HDFS) and Lucene.

Published in: Technology, Design

  1. 1. How to build a small distributed search engine using open source software
  2. 2. Building a distributed search engine
     Search engine subsystems:
       ● Page database
       ● List of the pages to retrieve
       ● Page retrieval and storage
       ● Page content parsing
       ● Full-text indexing of the contents
       ● Graph database of the links for ranking
  3. 3. Building a distributed search engine
     Open Source Software
       ● Apache Hadoop
           ● MapReduce
           ● HDFS
           ● HBase
       ● Apache Lucene
  4. 4. Building a distributed search engine
     HDFS: Hadoop Distributed File System
  5. 5. Building a distributed search engine
     HDFS – Assumptions and goals
       ● Hardware failure
       ● Big data
       ● Write once / read many
       ● Moving computation, not data
  6. 6. Building a distributed search engine
  7. 7. Building a distributed search engine
  8. 8. Building a distributed search engine Lucene
  9. 9. Building a distributed search engine
     Lucene - Inverted indexing
       Term     Doc Id   Weight
       JUG      301      0.97
                198      0.65
                120      0.43
       Lugano   301      0.94
                278      0.15
                451      0.87
                103      0.45
                763      0.77
  10. 10. Building a distributed search engine
      Lucene - Indexing main classes
        ● IndexWriter
        ● Directory
        ● Analyzer
        ● Document
        ● Field
  11. 11. Building a distributed search engine
      Lucene - Searching main classes
        ● IndexSearcher
        ● Collector
        ● Query
        ● TopDocs
        ● ScoreDoc
  12. 12. Building a distributed search engine
      Lucene - Analyzers
        ● Stop words: "the book is on the table" → [book, table]
        ● Stemming: [paint, paints, painted, …] → paint
        ● Synonyms: [cat, feline] → cat
  13. 13. Building a distributed search engine
      Lucene - Search options
        ● Fields
            ● Title: JUG
            ● body: "JUG Lugano"
        ● Wildcards
            ● J?G → [JUG, JAG, ...]
            ● J*G → [JUG, JEEG, JUNG, …]
        ● Fuzzy (vocabulary-based)
            ● JUG~[n] → [MUG, JAG, …]
  14. 14. Building a distributed search engine
      Lucene - Search options
        ● Range
            ● Year: [2002 TO 2012]
            ● Name: {Alberto TO Andrea}
        ● Boost
            ● JUG^5 Lugano
            ● "JUG Lugano"^5
        ● Proximity
            ● "JUG Lugano"~5
        ● Boolean and existence
            ● AND, OR, NOT, (), +, -
  15. 15. Building a distributed search engine
      HDFS - Lucene Integration
        ● File copy from/to HDFS
        ● Patch IndexWriter/Directory
        ● Rewrite of IndexWriter on RAM
        ● Lucene 4
  16. 16. Building a distributed search engine And now... Hands on!
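
As a warm-up for the hands-on part, the inverted index and analyzer slides can be sketched in a few lines of plain Python. This is a toy illustration, not Lucene's API: `ToyIndex`, `analyze`, `stem`, the stop-word list, the synonym map and the term-frequency weight are all assumptions made for this example, and Lucene's real analyzers and scoring are considerably more sophisticated.

```python
from collections import defaultdict

# Assumed, deliberately tiny word lists (hypothetical, for illustration only).
STOP_WORDS = {"the", "is", "on", "a", "an", "of", "and"}
SYNONYMS = {"feline": "cat"}  # slide example: [cat, feline] -> cat


def stem(token):
    """Naive suffix stripping; real stemmers (e.g. Porter) are far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, map synonyms, stem."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip('.,;:!?"()')
        if not token or token in STOP_WORDS:
            continue
        token = SYNONYMS.get(token, token)
        tokens.append(stem(token))
    return tokens


class ToyIndex:
    """Maps each term to {doc_id: weight}, like the Term / Doc Id / Weight table."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        tokens = analyze(text)
        for token in tokens:
            # Weight = normalized term frequency (an assumption, not Lucene's formula).
            previous = self.postings[token].get(doc_id, 0.0)
            self.postings[token][doc_id] = previous + 1 / len(tokens)

    def search(self, query):
        """Return (doc_id, score) pairs for docs containing every query term, best first."""
        terms = analyze(query)
        if not terms:
            return []
        doc_sets = [set(self.postings.get(term, {})) for term in terms]
        hits = set.intersection(*doc_sets)
        scored = [(doc, sum(self.postings[t][doc] for t in terms)) for doc in hits]
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = ToyIndex()
index.add(301, "JUG Lugano meeting")
index.add(198, "The JUG is a Java user group")
index.add(278, "Lugano is on the lake")
print(index.search("JUG Lugano"))
```

The query goes through the same `analyze` function as the documents, which is why a search for "feline" or "painted" would look up the terms "cat" and "paint". Using one analysis chain for both indexing and searching is the design point worth carrying over to a real Lucene setup.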

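The wildcard and fuzzy search options from slide 13 can be illustrated the same way. The sketch below matches patterns against a small, made-up vocabulary using the standard library's `fnmatch` for `?` and `*`, and a textbook Levenshtein distance for fuzzy matching. The names `VOCABULARY`, `wildcard`, `fuzzy` and the `max_edits` parameter are hypothetical; this only mimics the behaviour shown on the slide and is not how Lucene implements these queries.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary of indexed terms, taken from the slide's examples.
VOCABULARY = ["JUG", "JAG", "JEEG", "JUNG", "MUG", "LUGANO"]


def wildcard(pattern, vocabulary=VOCABULARY):
    """Terms matching a pattern where ? is one character and * is any run."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy(term, max_edits=1, vocabulary=VOCABULARY):
    """Vocabulary terms within max_edits edits of the given term."""
    return [t for t in vocabulary if edit_distance(term, t) <= max_edits]


print(wildcard("J?G"))
print(wildcard("J*G"))
print(fuzzy("JUG"))
```

Both helpers scan the whole vocabulary, which is fine for a demonstration but would not scale to a real term dictionary.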