How To Measure Search Quality
1. Search Engine
How To Make it
Wednesday, December 12, 12
2. Search Engine
Search Quality Measurement
[Venn diagram: within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL]
database search: low recall, high precision
web search: high recall, low precision
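The two measures named on this slide can be computed directly from the sets in the diagram: precision is |RET ∩ REL| / |RET| and recall is |RET ∩ REL| / |REL|. The sketch below is a minimal illustration; the function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of document ids."""
    hits = retrieved & relevant  # RET ∩ REL
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
RET = {1, 2, 3, 4}           # documents the engine returned
REL = {3, 4, 5, 6, 7, 8}     # documents that are actually relevant

precision, recall = precision_recall(RET, REL)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning more documents can only keep or raise recall, but it usually lowers precision, which is the trade-off the slide contrasts between database search and web search.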
3. Search Engine
[Architecture diagram:
Crawlers (File System Crawler, Crawler API for 3rd party apps, Database Crawler) collect raw content (PDF, text, HTML, document, image, ...).
Parsers (Text Parser, HTML Parser, PDF Parser) turn raw content into Documents (title, summary, author, datetime).
Enhancing turns them into Documents (Categorized, Taxonomized).
The Indexer, with a Language Analyzer and a Stop Analyzer, writes the Index.
The Index Searcher serves the Web Client and the Mobile Client; the Document Index Searcher serves the Landing Page.]
4. Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Indexing
• Searching
5. Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Duplicate Content Detection
• Document Enhancement
• Indexing
• Searching
• Document Serving
6. Search Engine
• Crawling
• Collects the data
• Input: the content sources to be searched
• Output: raw content in its original format
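As a rough illustration of the input and output described above, here is a minimal file system crawler sketch. It walks a directory and yields each file's bytes unchanged, in its original format. The function name `crawl_file_system` is hypothetical, and a real crawler would also need scheduling, filtering, and error reporting.

```python
import os
import tempfile


def crawl_file_system(root):
    """Yield (path, raw_bytes) for every readable file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    yield path, fh.read()  # raw content, original format
            except OSError:
                continue  # unreadable file: skip it in this sketch


# Small demo on a temporary directory (contents are made up).
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "wb") as fh:
        fh.write(b"hello")
    crawled = list(crawl_file_system(root))

print(len(crawled), "file(s) crawled")
```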
7. Search Engine
• Crawling
[Diagram: the File System Crawler, the Crawler API (for 3rd party apps), and the Database Crawler collect raw content: PDF, text, HTML, document, image, ...]
8. Search Engine
• Parsing
• Extracts elements from crawled documents
• Input: raw content
• Output: structured text documents
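To make the parsing step concrete, the sketch below turns raw HTML into a small structured document using Python's standard `html.parser` module. It extracts only a title and the visible text. The class and function names are hypothetical, and other fields such as author or datetime would need format-specific rules that are not shown here.

```python
from html.parser import HTMLParser


class SimpleHTMLParser(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw_html):
    """Return a structured document (dict) extracted from raw HTML."""
    parser = SimpleHTMLParser()
    parser.feed(raw_html)
    parser.close()
    return {"title": parser.title.strip(), "text": " ".join(parser.text_parts)}


doc = parse_html(
    "<html><head><title>Hello</title><style>p{}</style></head>"
    "<body><p>First paragraph.</p><p>Second one.</p></body></html>"
)
print(doc)
```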
9. Search Engine
• Parsing
[Diagram: the Text Parser, HTML Parser, and PDF Parser turn raw content (PDF, text, HTML, document, image, ...) into Documents (title, summary, author, datetime)]
10. Search Engine
• Content Duplication Detection
• More data means more duplicated data
• Search engines implement similar-document
detection
11. Search Engine
• Document Representation
Model: Term Frequency (Tf)
Example:
Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
Document 2 (d2) = "andi also likes to watch soccer game."
Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}
Document representation in the Tf model:
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
12. Search Engine
• Document Similarity
Similarity between documents d1 and d2: S(d1, d2)
S(d1, d2) = |d1 - d2|, the sum of the absolute differences of the term frequencies
Example:
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|
S(d1, d2) = 5
With this definition, a smaller value of S means the two documents
are more similar.
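The Tf representation and the score S can be checked mechanically. The sketch below counts the dictionary terms in each example sentence and sums the absolute differences. The tokenizer (lowercase alphabetic words) and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Dictionary from the slide, in the same order.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def s_score(v1, v2):
    """S(d1, d2): sum of absolute differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print("d1 =", d1)
print("d2 =", d2)
print("S(d1, d2) =", s_score(d1, d2))
```

Because S is a distance (the L1 or Manhattan distance between the two vectors), identical documents score 0.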
13. Search Engine
• Algorithm
1. Compute the Tf vector for every document
2. Find the smallest value of S(d, di) over the
document collection to get the document di most
similar to document d
3. If S(d, di) < threshold, compare the creation
dates of d and di, then erase the older
document
4. Repeat steps 2 and 3 until no value of S is
less than the threshold
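The steps above can be sketched as follows. This is one possible reading, not the author's implementation: it searches for the closest pair across the whole collection instead of starting from a single document d, it counts every word rather than a fixed dictionary, and the sample documents, dates, and threshold are invented.

```python
import re
from collections import Counter
from datetime import date
from itertools import combinations


def tf(text):
    """Step 1: term frequencies of a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def s_score(tf_a, tf_b):
    """Sum of absolute term-count differences (smaller = more similar)."""
    return sum(abs(tf_a[t] - tf_b[t]) for t in tf_a.keys() | tf_b.keys())


def deduplicate(docs, threshold):
    """docs maps id -> (text, created). Return the set of ids that are kept."""
    vectors = {doc_id: tf(text) for doc_id, (text, _created) in docs.items()}
    kept = set(docs)
    while True:
        # Step 2: find the most similar remaining pair.
        pairs = [
            (s_score(vectors[a], vectors[b]), a, b)
            for a, b in combinations(sorted(kept), 2)
        ]
        if not pairs:
            break
        best, a, b = min(pairs)
        if best >= threshold:  # Step 4: stop when nothing is below the threshold.
            break
        # Step 3: erase the older of the two documents.
        older = a if docs[a][1] <= docs[b][1] else b
        kept.discard(older)
    return kept


# Hypothetical collection, for illustration only.
docs = {
    "a": ("andi likes to watch movie", date(2012, 12, 1)),
    "b": ("andi likes to watch a movie", date(2012, 12, 10)),
    "c": ("soccer game tonight", date(2012, 12, 5)),
}
kept = deduplicate(docs, threshold=2)
print(sorted(kept))
```

Comparing every pair is quadratic in the number of documents, so this form only suits small collections.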
14. Search Engine
• Document Enhancement
• Tags documents based on a taxonomy
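One simple way to picture taxonomy-based tagging is a keyword lookup: a document receives a category when it contains any keyword listed under that category. The taxonomy below is entirely hypothetical, and production systems typically use richer hierarchies or trained classifiers.

```python
import re

# Hypothetical two-category taxonomy, for illustration only.
TAXONOMY = {
    "entertainment": {"movie", "film", "cinema"},
    "sports": {"soccer", "game", "match"},
}


def tag_document(text, taxonomy=TAXONOMY):
    """Return the sorted list of categories whose keywords occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, keywords in taxonomy.items() if words & keywords)


tags = tag_document("andi also likes to watch soccer game.")
print(tags)
```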
16. Search Engine
• Indexing
• Indexes all the information gathered
for each document
• Makes searching faster
• Enables searching by a specific field
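A minimal sketch of why an index makes searching faster and allows field-based search: each field gets its own inverted index that maps a term to the set of documents containing it, so a query is a few dictionary lookups instead of a scan over every document. The field names and sample documents are assumptions for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """documents maps doc_id -> {field: text}. Return index[field][term] -> set of ids."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[field][term].add(doc_id)
    return index


def search(index, field, query):
    """Return ids of documents whose given field contains every query term."""
    field_index = index.get(field, {})
    postings = [field_index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, for illustration only.
documents = {
    1: {"title": "Watching movies", "author": "andi"},
    2: {"title": "Soccer game", "author": "budi"},
}
index = build_index(documents)
print(search(index, "title", "soccer"))
print(search(index, "author", "andi"))
```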
17. Search Engine
• Indexing
[Diagram: Documents (Categorized, Taxonomized) → Indexer (with Language Analyzer and Stop Analyzer) → Index]
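As a rough picture of what a stop analyzer does before terms reach the index, the sketch below lowercases the text, splits it into words, and drops common words that carry little meaning. The stop-word list is a made-up sample, and a language analyzer would add language-specific steps such as stemming that are not shown.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "also", "his", "it", "the", "to"}


def stop_analyze(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


terms = stop_analyze("andi likes to watch movie. His wife likes it too")
print(terms)
```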
18. Search Engine
• Searching
[Diagram: Web Client and Mobile Client → Index Searcher → Index]
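The searching step can be sketched as a small searcher that looks up query terms in the index and ranks documents by how often those terms occur. The class name echoes the diagram's Index Searcher but is otherwise hypothetical, and summing raw term counts is a deliberately simple scoring rule rather than what any particular engine uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


class IndexSearcher:
    """Toy searcher: builds term -> {doc_id: count} and ranks by summed counts."""

    def __init__(self, documents):
        self.postings = {}
        for doc_id, text in documents.items():
            for term, count in Counter(tokenize(text)).items():
                self.postings.setdefault(term, {})[doc_id] = count

    def search(self, query, limit=10):
        scores = Counter()
        for term in tokenize(query):
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties broken by document id for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _score in ranked[:limit]]


searcher = IndexSearcher({
    "d1": "andi likes to watch movie. His wife likes it too",
    "d2": "andi also likes to watch soccer game.",
})
results = searcher.search("likes movie")
print(results)
```

A web client and a mobile client could both call the same `search` method, which matches the diagram's single searcher serving two clients.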
19. Search Engine
• Document Serving
• The search engine is also responsible for
displaying the results
20. Search Engine
[Diagram: the Web Client and the Mobile Client query the Index Searcher, which reads the Index; the Document Index Searcher serves the Landing Page]
21. Search Engine
• Recommended Open Source
Technology
• Search Engine : Lucene, Nutch
• Programming Library : Hadoop, Scala Actor
• Database : MongoDB, PostgreSQL
• Programming Language : Java, Scala, PHP