Search Engine - How to Make it

A technical presentation on how to build a search engine with open source technologies.

Published in: Technology

  1. 1. Search Engine: How To Make It. Wednesday, December 12, 12
  2. 2. Search Engine
       Search Quality Measurement
       [Diagram: within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL.]
       • Database search: low recall, high precision
       • Web search: high recall, low precision
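The slide names the sets RET and REL but does not spell out how recall and precision follow from them. Assuming the standard definitions (the slide shows only the diagram), precision is |RET ∩ REL| / |RET| and recall is |RET ∩ REL| / |REL|. A minimal Python sketch, with made-up document IDs and a hypothetical helper name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    ret, rel = set(retrieved), set(relevant)
    hits = len(ret & rel)  # size of RET ∩ REL
    precision = hits / len(ret) if ret else 0.0
    recall = hits / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant, 2 in common.
ret = {"a", "b", "c", "d"}
rel = {"a", "b", "e", "f", "g"}
precision, recall = precision_recall(ret, rel)
print(precision, recall)
```

Here half of the retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).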
  3. 3. Search Engine
       [Architecture diagram: a file system crawler, a crawler API for 3rd-party apps, and a database crawler collect raw content (text, HTML, PDF, documents, images); text, HTML, and PDF parsers turn it into documents (title, summary, author, datetime); document enhancing produces categorized, taxonomized documents; the indexer, with a language analyzer and a stop analyzer, builds the index; the index searcher and the document landing page serve the web and mobile clients.]
  4. 4. Search Engine
       • Process in Search Engine
           • Crawling
           • Parsing
           • Indexing
           • Searching
  5. 5. Search Engine
       • Process in Search Engine
           • Crawling
           • Parsing
           • Duplicate Content Detection
           • Document Enhancement
           • Indexing
           • Searching
           • Document Serving
  6. 6. Search Engine
       • Crawling
           • Collecting data
           • Input: the data content to be searched
           • Output: raw content data in its original format
  7. 7. Search Engine
       • Crawling
         [Diagram: a file system crawler reads the file system, a crawler API receives content from 3rd-party apps, and a database crawler reads the database; together they collect raw content such as text, HTML, PDF, documents, and images.]
  8. 8. Search Engine
       • Parsing
           • Extracts elements from crawled documents
           • Input: raw content
           • Output: structured textual documents
  9. 9. Search Engine
       • Parsing
         [Diagram: raw content (text, HTML, PDF, documents, images) passes through a text parser, an HTML parser, and a PDF parser, producing documents with title, summary, author, and datetime fields.]
  10. 10. Search Engine
       • Content Duplication Detection
           • More data means more duplicated data
           • The search engine implements similar-document detection
  11. 11. Search Engine
       • Document Representation Model: Term Frequency (Tf)
         Example:
         Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
         Document 2 (d2) = "andi also likes to watch soccer game."
         Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}
         Document representation in the Tf model:
         d1 = {1, 2, 1, 1, 1, 1, 0}
         d2 = {1, 1, 1, 0, 0, 0, 1}
  12. 12. Search Engine
       • Document Similarity
         Similarity between documents d1 and d2: S(d1, d2)
         S(d1, d2) = |d1 - d2|, the sum of the absolute differences between corresponding term frequencies
         Example:
         d1 = {1, 2, 1, 1, 1, 1, 0}
         d2 = {1, 1, 1, 0, 0, 0, 1}
         S(d1, d2) = |1-1| + |2-1| + |1-1| + |1-0| + |1-0| + |1-0| + |0-1|
         S(d1, d2) = 5
         With this definition, a smaller value means the two documents are more similar.
  13. 13. Search Engine
       • Algorithm
           1. Count Tf for every document.
           2. For a document d, find the smallest value of S(d, di) over the whole document collection; di is then the document most similar to d.
           3. If S(d, di) < threshold, compare the creation dates of d and di and delete the older document.
           4. Repeat steps 2 and 3 until no value of S is less than the threshold.
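The term-frequency model, the distance S, and the duplicate-removal loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the presenter's implementation: the function names, the regex tokenizer, the creation dates, and the threshold value are all hypothetical, and the loop examines the closest pair in the whole collection rather than starting from one chosen document d.

```python
import re
from collections import Counter
from datetime import date

# Dictionary from the slide example.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in dictionary]


def distance(v1, v2):
    """S(d1, d2): sum of absolute differences of term frequencies."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


def deduplicate(docs, threshold):
    """docs is a list of (text, created) pairs.

    Repeatedly finds the closest pair and deletes the older of the two,
    until no pair is closer than the threshold.
    """
    docs = list(docs)
    while len(docs) > 1:
        vectors = [tf_vector(text) for text, _ in docs]
        s, i, j = min(
            (distance(vectors[i], vectors[j]), i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
        )
        if s >= threshold:
            break
        older = i if docs[i][1] <= docs[j][1] else j
        del docs[older]
    return docs


d1 = "andi likes to watch movie. His wife likes it too"
d2 = "andi also likes to watch soccer game."
print(tf_vector(d1), tf_vector(d2), distance(tf_vector(d1), tf_vector(d2)))

# Hypothetical creation dates; the third document is a newer copy of d1.
collection = [(d1, date(2012, 12, 1)), (d2, date(2012, 12, 10)), (d1, date(2012, 12, 12))]
kept = deduplicate(collection, threshold=3)
```

Counting each dictionary word directly gives a distance of 5 between the two example sentences. With a threshold of 3, only the exact copy is treated as a duplicate, so the older copy of d1 is removed and two documents remain. Comparing every pair is quadratic in the collection size, which is acceptable for a sketch but not for a large crawl.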
  14. 14. Search Engine
       • Document Enhancement
           • Tags documents based on a taxonomy
  15. 15. Search Engine
       • Document Enhancement
         [Diagram: documents (title, summary, author, datetime) pass through document enhancing and come out as categorized, taxonomized documents.]
  16. 16. Search Engine
       • Indexing
           • Indexes all the information gathered for each document
           • Makes searching faster
           • Allows searching on specific fields
  17. 17. Search Engine
       • Indexing
         [Diagram: categorized, taxonomized documents go to the indexer, which uses a language analyzer and a stop analyzer to build the index.]
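To illustrate why an index makes searching faster and how field-based search can work, here is a toy inverted index in Python. It is only a sketch of the idea, not Lucene's API: the class name, the stop-word list, and the sample documents are all hypothetical, and the "analyzer" is a plain lowercase-and-split step followed by stop-word removal.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list standing in for the stop analyzer.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}


def analyze(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class InvertedIndex:
    def __init__(self):
        # Maps (field, term) to the set of document IDs containing it.
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index every field of one document."""
        for field, text in fields.items():
            for term in analyze(text):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return IDs of documents whose field contains every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result


index = InvertedIndex()
index.add(1, {"title": "How to build a search engine", "author": "Andi"})
index.add(2, {"title": "The soccer game", "author": "Andi"})
index.add(3, {"title": "Search quality measurement", "author": "Budi"})

print(index.search("title", "search"))
print(index.search("author", "andi"))
```

A lookup touches only the posting sets for the query terms instead of scanning every document, and keying the postings by field is what allows a query to be restricted to the title or the author.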
  18. 18. Search Engine
       • Searching
         [Diagram: web and mobile clients query the index searcher, which reads the index.]
  19. 19. Search Engine
       • Document Serving
           • The search engine is also responsible for displaying results
  20. 20. Search Engine
       [Diagram: web and mobile clients, the index, index searchers, and the document landing page.]
  21. 21. Search Engine
       • Recommended Open Source Technology
           • Search engine: Lucene, Nutch
           • Programming library: Hadoop, Scala Actors
           • Database: MongoDB, PostgreSQL
           • Programming language: Java, Scala, PHP
  22. 22. Thank You
