Search engines in the
industry
a use case
Different interests
● researchers / engineers look for high
precision and recall
● editors / writers are concerned about
matching of queries and results
● marketers want to change / adapt results
Designing a search engine
● functional requirements
○ search
■ keywords, boolean retrieval, natural language
○ indexing
■ data sources
■ data types
○ administration
■ manage scoring / boosting functions
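
As an illustration of the "boolean retrieval" requirement above, here is a minimal sketch of AND / OR queries over a toy inverted index. The function names and sample documents are assumptions made for this example, not part of the project described in the slides, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Boolean AND: ids of documents that contain every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Boolean OR: ids of documents that contain at least one term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


# Hypothetical sample documents.
docs = {1: "solr scales search", 2: "lucene powers solr", 3: "nfs slows lucene"}
index = build_index(docs)
print(search_and(index, "solr", "lucene"))
print(search_or(index, "solr", "nfs"))
```

A real engine would add analysis (stemming, stop words) and scoring on top of this structure; the sketch only shows the set logic behind boolean queries.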
Designing a search engine
● architectural requirements
○ resiliency
○ scalability
○ no downtime
○ work with existing infrastructure
○ platforms
○ migrating from legacy systems
○ talk to other systems
Designing a search engine
● performance requirements
○ search
■ queries per second
■ time per search request
○ index
■ documents per second
■ time per indexing request
○ SLA?
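
One way such requirements could be checked is sketched below: throughput and a latency percentile are computed from measured request times and compared against SLA targets. The thresholds, the sample latencies and the choice of the 95th percentile are assumptions for illustration only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, window_s, min_qps, max_p95_ms):
    """Return (ok, qps, p95) for requests observed during window_s seconds."""
    qps = len(latencies_ms) / window_s
    p95 = percentile(latencies_ms, 95)
    return qps >= min_qps and p95 <= max_p95_ms, qps, p95


# Hypothetical latencies (ms) for ten search requests observed in two seconds.
latencies = [12, 15, 11, 40, 13, 14, 250, 16, 12, 18]
ok, qps, p95 = meets_sla(latencies, window_s=2, min_qps=4, max_p95_ms=200)
print(ok, qps, p95)
```

The same check applies to indexing by feeding in per-request indexing times and a documents-per-second target.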
Designing a search engine
● search engine performance requirements
○ recall percentiles threshold
○ precision percentiles threshold
○ minimize empty results
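
The quality measures above can be sketched as follows, assuming a set of relevance judgments is available for each test query. The function names and the sample data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one result list against judged-relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def empty_result_rate(result_counts):
    """Fraction of queries that returned no results at all."""
    if not result_counts:
        return 0.0
    return sum(1 for n in result_counts if n == 0) / len(result_counts)


# Hypothetical judged query: 4 documents returned, 3 known to be relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(precision, recall)
print(empty_result_rate([3, 0, 5, 0]))
```

Collecting these values over many queries gives the distribution against which a threshold can be set.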
Data
● often mostly unknown
○ published vs unpublished / to be written documents
● almost always unmanageable
○ cannot decide when
■ it’ll be ready
■ it’ll have to be indexed
■ it’ll have to be searchable
● heterogeneous
○ different writers, languages, topics, styles, etc.
Process
Project
● ~50M heterogeneous documents
● Migrating from old commercial solution to
Apache Solr
● Google-like search
● Targeted search for different types of
contents
Advanced capabilities
● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents
Responsibilities
● architecture analysis and design
○ scaling under high load
● continuous definition of algorithms for
indexing and searching
● system maintenance
Skills required
● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning
Architecture analysis and design
● Shape up a prototype architecture
○ separate machines for indexing and search
○ multiple load-balanced machines for searching
○ define indexing and search algorithms
● Evaluate architecture
○ stress tests (performance)
○ quality tests (accuracy)
● Iterate
Architecture analysis and design
● analyze existing documents
○ avg size
○ language
○ topics, style, etc.
● analyze existing query logs
○ avg response time
○ avg length (how long does it take to specify a query?)
○ avg queries per second
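
The query log statistics listed above could be computed along these lines. The log record layout (`timestamp_s`, `query`, `response_ms`) is an assumption; real logs from the legacy system would need their own parser.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    timestamp_s: float   # when the query arrived, in seconds
    query: str           # raw user query
    response_ms: float   # time taken to answer it


def summarize(entries):
    """Average response time, query length in terms, and queries per second."""
    n = len(entries)
    times = [e.timestamp_s for e in entries]
    span = max(times) - min(times)
    return {
        "avg_response_ms": sum(e.response_ms for e in entries) / n,
        "avg_query_terms": sum(len(e.query.split()) for e in entries) / n,
        "avg_qps": n / span if span > 0 else float(n),
    }


# Hypothetical log excerpt.
log = [
    LogEntry(0.0, "solr", 20),
    LogEntry(1.0, "apache solr nfs", 40),
    LogEntry(2.0, "lucene index", 30),
    LogEntry(4.0, "search engine", 30),
]
stats = summarize(log)
print(stats)
```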
Most time spent on
● testing how documents get indexed
● testing how user queries get transformed into
platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability
Challenges
● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search
Sample architecture constraint #1
● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error-prone
Sample architecture constraint #2
● Change search engine
● Systems talking to the SE need to switch
API
● Only in the long run
● In the short run an adapter layer for old APIs
on new APIs has to be developed
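
A short-term adapter layer of this kind might look like the sketch below. Both API shapes are invented for illustration: `find(query_string, max_hits)` stands in for the old commercial API and `NewSearchClient.search(q, rows)` for the new engine's client; neither is the real interface of either product.

```python
class NewSearchClient:
    """Stand-in for the new engine's client (hypothetical, in-memory)."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, q, rows=10):
        hits = [d for d in self._docs if q.lower() in d["title"].lower()]
        return {"numFound": len(hits), "docs": hits[:rows]}


class LegacySearchAdapter:
    """Exposes the old call shape on top of the new client."""

    def __init__(self, client):
        self._client = client

    def find(self, query_string, max_hits):
        # Translate the old parameters into the new ones, then convert the
        # new response back into the tuples that old callers expect.
        response = self._client.search(q=query_string, rows=max_hits)
        return [(d["id"], d["title"]) for d in response["docs"]]


client = NewSearchClient([
    {"id": "a", "title": "Solr basics"},
    {"id": "b", "title": "Scaling Solr"},
    {"id": "c", "title": "NFS tuning"},
])
adapter = LegacySearchAdapter(client)
print(adapter.find("solr", 1))
```

Existing systems keep calling `find` unchanged, while new code can use the client directly. Every call through the adapter pays for two conversions, which is the cost the next slide runs into.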
Indexing performance
● Most of the indexing time is spent converting
data from the old (indexing) format to the new
(indexing) format
● The adapter layer between old and new API
becomes the bottleneck
● Time to switch to the new API natively
Diverging concerns
● Article authors check that the search engine
handles their writings exactly, wanting perfect
recall and precision
○ so a lot of time is spent on adjusting ranking
● Marketers want to be able to override
ranking and promote something they want to sell
○ ranking algorithm gets breached
● Need flexible algorithms
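
One possible reading of "flexible algorithms" is a ranking step that keeps relevance scoring intact but accepts explicit overrides. The sketch below assumes two such hooks, multiplicative boosts and pinned documents; both the hooks and the sample scores are illustrative assumptions, not the approach used in the project.

```python
def rank(results, boosts=None, pinned=None):
    """Order document ids by score, applying optional boosts and pins.

    results: dict of doc id -> relevance score from the engine
    boosts:  dict of doc id -> multiplier applied to the score
    pinned:  doc ids forced to the top, in the given order
    """
    boosts = boosts or {}
    pinned = pinned or []
    scored = sorted(
        results.items(),
        key=lambda kv: kv[1] * boosts.get(kv[0], 1.0),
        reverse=True,
    )
    organic = [doc_id for doc_id, _ in scored if doc_id not in pinned]
    # Only pin documents that actually matched the query.
    return [p for p in pinned if p in results] + organic


# Hypothetical relevance scores for one query.
results = {"a": 2.0, "b": 1.5, "c": 0.5}
print(rank(results))
print(rank(results, boosts={"b": 2.0}))
print(rank(results, pinned=["c"]))
```

Keeping overrides as data rather than code lets authors' relevance tuning and marketers' promotions change independently.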
Scale dynamically
● Search engine must not break, even
under high peaks of load
● Such peaks are often unpredictable
● Need a fast way to add more computing
power
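
A rough capacity rule for adding search nodes under load is sketched below. The per-node throughput, the headroom factor and the minimum node count are assumed inputs that would have to come from stress tests; the function name is hypothetical.

```python
import math


def replicas_needed(observed_qps, qps_per_node, headroom=0.75, min_nodes=2):
    """Search nodes required to serve observed_qps.

    headroom is the fraction of a node's measured capacity that is
    planned for, leaving the rest as a buffer for sudden peaks.
    """
    usable_qps = qps_per_node * headroom
    return max(min_nodes, math.ceil(observed_qps / usable_qps))


# Hypothetical numbers: each node sustains 200 queries per second.
print(replicas_needed(900, 200))
print(replicas_needed(100, 200))
```

Because peaks are unpredictable, such a calculation only helps if new search nodes can be brought up and given a copy of the index quickly.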
Takeaways
● small iterations (no waterfalls!)
○ analyze a portion of the data / queries
○ change search / index algorithms
○ test, involve stakeholders
○ forces ability to reindex quickly
● look at data (documents, query logs)
