Search Engine
                             How To Make it




Wednesday, December 12, 12
                      Search Quality Measurement

                        • [Diagram: within the set of all documents, the relevant documents (REL) and the retrieved documents (RET) overlap in the region RET ∩ REL]
                        • Database search: low recall, high precision
                        • Web search: high recall, low precision
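The two measures named above can be sketched in a few lines, using the standard definitions: precision is the share of retrieved documents that are relevant, |RET ∩ REL| / |RET|, and recall is the share of relevant documents that were retrieved, |RET ∩ REL| / |REL|. The function name and the document IDs below are made up for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = retrieved & relevant  # RET ∩ REL
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
ret = {"d1", "d2", "d3", "d4"}
rel = {"d2", "d4", "d7"}
p, r = precision_recall(ret, rel)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off between the two search styles listed above.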
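                   • [Diagram: overall architecture. Sources (File System, 3rd party apps, Database) are read by the File System Crawler, Crawler API, and Database Crawler. The crawled content (PDF, Text, HTML, Document, Image, ...) passes through the Text Parser, HTML Parser, and PDF Parser to become Documents (title, summary, author, datetime). Document Enhancing turns these into Documents (Categorized, Taxonomized). The Indexer, with the Language Analyzer and Stop Analyzer, builds the Index. The Index Searcher serves the Web Client and Mobile Client, which lead to the Document Landing Page.]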
                   • Process in Search Engine
                        • Crawling
                        • Parsing
                        • Duplicate Content Detection
                        • Document Enhancement
                        • Indexing
                        • Searching
                        • Document Serving
                   • Crawling
                        • Collects the data to be searched
                        • Input: the data sources to search
                        • Output: raw content in its original format
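As a rough illustration of the crawling stage, here is a minimal sketch of a file-system crawler. It assumes the source is a local directory tree; the function name `crawl_file_system` is hypothetical. It only collects bytes and does no parsing, matching the "raw content in its original format" output described above.

```python
from pathlib import Path
from typing import Iterator


def crawl_file_system(root: str) -> Iterator[tuple[str, bytes]]:
    """Yield (path, raw bytes) for every regular file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # No decoding or parsing here: the crawler only collects
            # content in its original format.
            yield str(path), path.read_bytes()
```

Calling `list(crawl_file_system("some/dir"))` would give one entry per file. A database crawler or an API-based crawler would have the same shape: read from the source and hand raw content to the parsing stage.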



                        • [Diagram: File System → File System Crawler; 3rd party apps → Crawler API; Database → Database Crawler. All three produce raw content: PDF, Text, HTML, Document, Image, ...]
                   • Parsing
                        • Extracts elements from the crawled documents
                        • Input: raw content
                        • Output: structured text documents
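The parsing stage can be sketched for the HTML case with Python's standard `html.parser` module. This is an illustration only: the class name, the `parse_html` helper, the choice of fields, and the sample page are assumptions, and a real parser would also need to skip script and style content and handle other encodings.

```python
from html.parser import HTMLParser


class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title>, the author <meta> tag, and the text content."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.author = ""
        self.text_parts: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == "author":
                self.author = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw: bytes) -> dict[str, str]:
    """Turn raw HTML bytes into a structured text document."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return {
        "title": parser.title.strip(),
        "author": parser.author,
        "text": " ".join(parser.text_parts),
    }


# Made-up sample page, for illustration only.
raw = (b"<html><head><title>Search 101</title>"
       b"<meta name='author' content='andi'></head>"
       b"<body><p>Crawl, parse, index.</p></body></html>")
print(parse_html(raw))
```

A text parser or PDF parser would return the same kind of dictionary, so that later stages do not need to know the original format.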


                        • [Diagram: raw content (PDF, Text, HTML, Document, Image, ...) → Text Parser, HTML Parser, PDF Parser → Documents (title, summary, author, datetime)]
                   • Content Duplication Detection
                        • More data means more duplicated data
                        • Search engines implement similar-document detection



                   • Document Representation
                             Model: Term Frequency (Tf)
                             Example:
                              Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
                              Document 2 (d2) = "andi also likes to watch soccer game."
                              Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}

                              Document representation in the Tf model (count of each dictionary term):
                              d1 = {1, 2, 1, 1, 1, 1, 0}
                              d2 = {1, 1, 1, 0, 0, 0, 1}

                   • Document Similarity
                             Similarity between documents d1 and d2: S(d1, d2)

                             S(d1, d2) = |d1 - d2|, the sum of the absolute differences of the term counts
                             Example:
                             d1 = {1, 2, 1, 1, 1, 1, 0}
                             d2 = {1, 1, 1, 0, 0, 0, 1}

                             S(d1, d2) = |1-1| + |2-1| + |1-1| + |1-0| + |1-0| + |1-0| + |0-1|
                             S(d1, d2) = 5

                             With this definition, a smaller value means the two documents are more similar.
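A minimal sketch of the Tf model and the measure S, computed directly from the two example sentences and the seven-term dictionary. The function names `tf_vector` and `distance` are illustrative, and the tokenization (lowercase, letters only) is an assumption the slides do not specify.

```python
import re
from collections import Counter

DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text: str, dictionary: list[str]) -> list[int]:
    """Count how often each dictionary term occurs in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def distance(a: list[int], b: list[int]) -> int:
    """S(a, b): sum of absolute differences (smaller means more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))


d1 = tf_vector("andi likes to watch movie. His wife likes it too", DICTIONARY)
d2 = tf_vector("andi also likes to watch soccer game.", DICTIONARY)
print(d1)
print(d2)
print(distance(d1, d2))
```

Words outside the dictionary ("to", "his", "it", "also", "game") are ignored, so they do not affect S.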

                   • Algorithm
                             1. Compute the Tf vector for every document
                             2. For a document d, find the smallest value of S(d, di) over all
                             other documents di in the collection; that di is the document most similar to d
                             3. If S(d, di) < threshold, treat d and di as duplicates: compare
                             their creation dates and erase the older document
                             4. Repeat steps 2 and 3 until no value of S is less than the threshold
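The steps above can be sketched as follows. This is one possible reading of the algorithm, not a definitive implementation: it repeatedly takes the closest pair in the whole collection, and the `Doc` class, the sample vectors, the dates, and the threshold value are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass
class Doc:
    doc_id: str
    tf: list[int]   # term-frequency vector (step 1, computed beforehand)
    created: date


def distance(a: list[int], b: list[int]) -> int:
    """S(a, b): sum of absolute differences of term counts."""
    return sum(abs(x - y) for x, y in zip(a, b))


def remove_duplicates(docs: list[Doc], threshold: int) -> list[Doc]:
    """Drop the older document of each near-duplicate pair."""
    kept = list(docs)
    while True:
        pairs = list(combinations(kept, 2))
        if not pairs:
            return kept
        # Step 2: the most similar pair has the smallest S.
        a, b = min(pairs, key=lambda p: distance(p[0].tf, p[1].tf))
        # Step 4: stop once no pair is below the threshold.
        if distance(a.tf, b.tf) >= threshold:
            return kept
        # Step 3: erase the older one (on equal dates, the second is dropped).
        older = a if a.created < b.created else b
        kept = [d for d in kept if d is not older]


docs = [
    Doc("a", [1, 2, 1, 1, 1, 1, 0], date(2012, 1, 1)),
    Doc("b", [1, 2, 1, 1, 1, 0, 0], date(2012, 6, 1)),
    Doc("c", [1, 1, 1, 0, 0, 0, 1], date(2012, 3, 1)),
]
print([d.doc_id for d in remove_duplicates(docs, threshold=2)])
```

Comparing every pair on every pass is quadratic in the number of documents, so this only suits small collections; it is meant to show the logic rather than to scale.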


                   • Document Enhancement
                        • Tag each document based on a taxonomy
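One simple way to tag documents is keyword matching against a taxonomy. The slides do not say how tagging is done, so the sketch below is an assumption throughout: the taxonomy, the `enhance` function, and the sample document are hypothetical.

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "sports": {"soccer", "game", "match"},
    "entertainment": {"movie", "film", "watch"},
}


def enhance(document: dict, taxonomy: dict[str, set[str]]) -> dict:
    """Return a copy of document with a sorted 'categories' list added."""
    words = set(re.findall(r"[a-z]+", document["text"].lower()))
    categories = sorted(cat for cat, keywords in taxonomy.items() if words & keywords)
    return {**document, "categories": categories}


doc = {"title": "d2", "text": "andi also likes to watch soccer game."}
print(enhance(doc, TAXONOMY))
```

The original document is left unchanged and the tags are added as a new field, so the indexing stage can treat the category like any other field.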




                        • [Diagram: Documents (title, summary, author, datetime) → Document Enhancing → Documents (Categorized, Taxonomized)]
                   • Indexing
                        • Indexes all the information gathered in each document
                             • Faster searching
                             • Able to search on a specific field
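Both benefits can be illustrated with a small field-aware inverted index: terms are looked up directly instead of scanning every document, and each term is stored per field. This is a sketch under assumptions, not how Lucene or any particular engine works internally. The stop-word list, the function names, the author "budi", and the sample documents are made up.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "it", "is", "of"}  # illustrative only


def analyze(text: str) -> list[str]:
    """Lowercase, split into words, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs: dict[str, dict[str, str]]) -> dict[tuple[str, str], set[str]]:
    """Map (field, term) -> IDs of the documents containing that term."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in analyze(value):
                index[(field, term)].add(doc_id)
    return index


def search(index: dict[tuple[str, str], set[str]], field: str, query: str) -> set[str]:
    """Return IDs of documents whose field contains every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get((field, terms[0]), set()))
    for term in terms[1:]:
        result &= index.get((field, term), set())
    return result


docs = {
    "d1": {"author": "andi", "text": "andi likes to watch movie"},
    "d2": {"author": "budi", "text": "andi also likes to watch soccer game"},
}
index = build_index(docs)
print(search(index, "text", "watch soccer"))
print(search(index, "author", "andi"))
```

Searching the `author` field for "andi" matches only d1, even though the word also appears in the text of d2, which is what field-based search is for.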


                        • [Diagram: Documents (Categorized, Taxonomized) → Indexer → Index; the Indexer uses a Language Analyzer and a Stop Analyzer]
                   • Searching
                        • [Diagram: Index → Index Searcher → Web Client, Mobile Client]
                   • Document Serving
                        • The search engine is also responsible for displaying the results




                        • [Diagram: Index → Index Searcher → Web Client, Mobile Client → Index Searcher → Document Landing Page]
                   • Recommended Open Source Technology
                             • Search Engine: Lucene, Nutch
                             • Programming Library: Hadoop, Scala Actor
                             • Database: MongoDB, PostgreSQL
                             • Programming Language: Java, Scala, PHP




Thank You



The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

How To Measure Search Quality

  • 1. Search Engine How To Make it Wednesday, December 12, 12
  • 2. Search Quality Measurement • [Venn diagram] Within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL • Database search: low recall, high precision • Web search: high recall, low precision
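The two measures on this slide can be sketched with plain sets, using the standard definitions precision = |RET ∩ REL| / |RET| and recall = |RET ∩ REL| / |REL|. The function name and the document ids below are made up for illustration; they are not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set RET and a relevant set REL."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # RET ∩ REL
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow, exact-match result (database-like): few results, all relevant.
narrow = {"d1"}

# A broad result (web-like): every relevant document, plus noise.
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

print(precision_recall(narrow, relevant))
print(precision_recall(broad, relevant))
```

The narrow result set has high precision and low recall, and the broad one has the opposite profile, which matches the database/web contrast on the slide.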
  • 3. [Architecture diagram] Crawling (File System → File System Crawler; 3rd party apps → Crawler API; Database → Database Crawler) yields raw content (PDF, Text, HTML, Document, Image, ...) • Parsing (Text Parser, HTML Parser, PDF Parser) yields Documents (title, summary, author, datetime) • Document Enhancing yields Documents (Categorized, Taxonomized) • Indexing (Language Analyzer, Stop Analyzer, Indexer) yields the Index • Searching: the Index and its searchers serve the Web Client, Mobile Client and Landing Page
  • 4. Process in Search Engine • Crawling • Parsing • Indexing • Searching
  • 5. Process in Search Engine • Crawling • Parsing • Duplicate Content Detection • Document Enhancement • Indexing • Searching • Document Serving
  • 6. Crawling • Collecting data • Input: the data content to search • Output: raw content data in its original format
  • 7. Crawling [diagram] • File System → File System Crawler • 3rd party apps → Crawler API • Database → Database Crawler • Output: raw content (PDF, Text, HTML, Document, Image, ...)
  • 8. Parsing • Process of extracting elements from crawled documents • Input: raw contents • Output: structured textual documents
  • 9. Parsing [diagram] • Raw content (PDF, Text, HTML, Document, Image, ...) → Text Parser, HTML Parser, PDF Parser → Documents (title, summary, author, datetime)
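As a rough illustration of the parsing step, the sketch below turns raw HTML into a small structured document using only Python's standard `html.parser` module. The class name, the `parse_html` helper and the choice to put all body text into `summary` are assumptions made for this example; the slides do not say how the HTML Parser fills its fields.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and all other visible text of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw):
    """Input: raw HTML content. Output: a structured textual document."""
    parser = TitleAndTextParser()
    parser.feed(raw)
    parser.close()
    # Assumption: the body text stands in for the "summary" field.
    return {"title": parser.title.strip(), "summary": " ".join(parser.text_parts)}


raw = "<html><head><title>Search Engine</title></head><body><p>How to make it.</p></body></html>"
document = parse_html(raw)
print(document)
```

A Text Parser or PDF Parser would produce the same kind of dictionary from its own input format, so the later stages only ever see structured documents.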
  • 10. Content Duplication Detection • Bigger data means more duplicated data • Search engines implement similar-document detection
  • 11. Document Representation Model: Term Frequency (Tf) • Example: Document 1 (d1) = "andi likes to watch movie. His wife likes it too" • Document 2 (d2) = "andi also likes to watch soccer game." • Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer} • Document representation in the Tf model: d1 = {1, 2, 1, 1, 1, 1, 0}, d2 = {1, 1, 1, 0, 0, 0, 1}
  • 12. Document Similarity • The similarity between documents d1 and d2 is S(d1, d2) = |d1 - d2|, the sum of the absolute differences of their term counts • Example: d1 = {1, 2, 1, 1, 1, 1, 0}, d2 = {1, 1, 1, 0, 0, 0, 1} • S(d1, d2) = |1-1| + |2-1| + |1-1| + |1-0| + |1-0| + |1-0| + |0-1| = 5 • With this definition, a smaller value means the two documents are more similar
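The Tf representation and the measure S can be reproduced in a few lines. This is a minimal sketch: the tokenizer (lowercase, letters only) and the function names are assumptions, while the dictionary and the two sentences are the ones from the slides. Counting is done directly from the example sentences.

```python
import re
from collections import Counter

# Dictionary from the slide, in the same order (positions 1..7).
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def s_distance(a, b):
    """S(d1, d2) = |d1 - d2|: sum of absolute differences of term counts."""
    return sum(abs(x - y) for x, y in zip(a, b))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")

print(d1)
print(d2)
print(s_distance(d1, d2))
```

Words outside the dictionary ("to", "his", "it", "also", "game") are simply ignored, so the vector length always equals the dictionary size.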
  • 13. Algorithm • 1. Compute the Tf vector for every document • 2. For a document d, find the smallest value of S(d, di) over the whole document collection to get the document most similar to d • 3. If S(d, di) < threshold, compare the creation dates of d and di and erase the older document • 4. Repeat steps 2 and 3 until no value of S is less than the threshold
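One possible reading of these steps is sketched below. It assumes the Tf vectors are already computed (step 1), searches all pairs for the closest match rather than starting from a single document d, and breaks ties on equal dates arbitrarily. The function names, the sample vectors and the threshold value are hypothetical.

```python
from datetime import date
from itertools import combinations


def s_distance(a, b):
    """S(d, di): sum of absolute differences of term counts."""
    return sum(abs(x - y) for x, y in zip(a, b))


def deduplicate(docs, threshold):
    """docs maps id -> (tf_vector, created_date). Returns the surviving docs."""
    kept = dict(docs)
    while True:
        # Step 2: find the closest pair that is still below the threshold.
        best = None
        for a, b in combinations(kept, 2):
            s = s_distance(kept[a][0], kept[b][0])
            if s < threshold and (best is None or s < best[0]):
                best = (s, a, b)
        # Step 4: stop once no value of S is below the threshold.
        if best is None:
            return kept
        # Step 3: erase the older of the two documents
        # (on equal dates the second one is erased, an arbitrary choice).
        _, a, b = best
        older = a if kept[a][1] < kept[b][1] else b
        del kept[older]


docs = {
    "a": ([1, 2, 1, 1, 1, 1, 0], date(2012, 1, 1)),
    "b": ([1, 2, 1, 1, 1, 0, 0], date(2012, 6, 1)),  # near-duplicate of "a"
    "c": ([1, 1, 1, 0, 0, 0, 1], date(2012, 3, 1)),
}
survivors = deduplicate(docs, threshold=2)
print(sorted(survivors))
```

The all-pairs search is quadratic in the number of documents, so this form only suits small collections; it is meant to show the control flow, not a scalable design.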
  • 14. Document Enhancement • Tag documents based on a taxonomy
  • 15. Document Enhancement [diagram] • Documents (title, summary, author, datetime) → Document Enhancing → Documents (Categorized, Taxonomized)
  • 16. Indexing • Index all the information that has been gathered in a document • Faster searching • Able to search on a specific field
  • 17. Indexing [diagram] • Documents (Categorized, Taxonomized) → Language Analyzer, Stop Analyzer, Indexer → Index
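To make the indexing idea concrete, here is a small inverted index keyed by (field, term), which gives both fast lookup and search on a specific field. It is a sketch under stated assumptions: the `analyze` function stands in for the analyzers by lowercasing and dropping stop words, and the stop-word list, function names and sample documents are all invented for the example. It does not use Lucene.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real analyzer would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "is", "of", "and"}


def analyze(text):
    """Lowercase, split into terms, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Map every (field, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in analyze(text):
                index[(field, term)].add(doc_id)
    return index


def search(index, field, query):
    """Return ids of documents whose field contains every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get((field, terms[0]), set()))
    for term in terms[1:]:
        result &= index.get((field, term), set())
    return result


docs = {
    1: {"title": "How to build a search engine", "author": "andi"},
    2: {"title": "Search quality measurement", "author": "budi"},
}
index = build_index(docs)
print(search(index, "title", "search engine"))
print(search(index, "author", "andi"))
```

Because a query only touches the posting sets of its own terms, lookup cost depends on the query rather than on scanning every document.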
  • 18. Searching [diagram] • Index → Index Searcher → Web Client, Mobile Client
  • 19. Document Serving • The search engine also has a function to display results
  • 20. Document Serving [diagram] • The Index and its searchers serve the Web Client, Mobile Client and Landing Page
  • 21. Recommended Open Source Technology • Search Engine: Lucene, Nutch • Programming Library: Hadoop, Scala Actor • Database: MongoDB, PostgreSQL • Programming Language: Java, Scala, PHP