Search engines in the industry

1,005 views

Published on

Published in: Technology, Design
  1. Search engines in the industry: a use case
  2. Different interests
     ● researchers / engineers look for high precision and recall
     ● editors / writers are concerned about matching of queries and results
     ● marketers want to change / adapt results
  3. Designing a search engine
     ● functional requirements
        ○ search
           ■ keywords, boolean retrieval, natural language
        ○ indexing
           ■ data sources
           ■ data types
        ○ administration
           ■ manage scoring / boosting functions
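As a rough illustration of the "keywords, boolean retrieval" requirement, the sketch below builds a tiny in-memory inverted index and answers AND / OR queries over it. The function names (`build_index`, `search_and`, `search_or`) and the sample documents are made up for this example; a real engine such as Solr does far more (analysis, scoring, persistence).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, terms):
    """Boolean AND: ids of documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, terms):
    """Boolean OR: ids of documents containing at least one term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


# Hypothetical sample documents, for illustration only.
docs = {
    1: "solr scales search",
    2: "lucene is io intensive",
    3: "solr builds on lucene",
}
index = build_index(docs)
```

With these sample documents, `search_and(index, ["solr", "lucene"])` matches only document 3, while `search_or(index, ["scales", "io"])` matches documents 1 and 2.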
  4. Designing a search engine
     ● architectural requirements
        ○ resiliency
        ○ scalability
        ○ no downtime
        ○ work with existing infrastructure
        ○ platforms
        ○ migrating from legacy systems
        ○ talk to other systems
  5. Designing a search engine
     ● performance requirements
        ○ search
           ■ queries per second
           ■ time per search request
        ○ index
           ■ documents per second
           ■ time per indexing request
        ○ SLA?
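One possible way to turn the search-side requirements into a check is sketched below: it computes queries per second and a nearest-rank latency percentile from measured request times, then compares them with target values. The 95th percentile, the function names, and the thresholds are assumptions for illustration; the slides do not specify an SLA.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, duration_s, max_p95_ms, min_qps):
    """True if throughput and 95th-percentile latency both meet the (hypothetical) targets."""
    qps = len(latencies_ms) / duration_s
    return percentile(latencies_ms, 95) <= max_p95_ms and qps >= min_qps


# Hypothetical measurements: ten requests served within a two-second window.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ok = meets_sla(latencies_ms, duration_s=2, max_p95_ms=100, min_qps=5)
```

The same shape works for the indexing side by feeding it per-request indexing times and a documents-per-second target.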
  6. Designing a search engine
     ● search engine performance requirements
        ○ recall percentiles threshold
        ○ precision percentiles threshold
        ○ minimize empty results
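The quality measures named on this slide can be computed per query from a set of judged results. The sketch below uses the standard set-based definitions of precision and recall plus a simple empty-result rate; the function names and the sample judgments are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def empty_result_rate(results_per_query):
    """Fraction of queries that returned no results at all."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)


# Hypothetical judgments: four documents returned, three known to be relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
empty_rate = empty_result_rate([[1], [], [2, 3], []])
```

Collecting these values over a sample of real queries gives the distribution against which percentile thresholds could be set.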
  7. Data
     ● often mostly unknown
        ○ published vs unpublished / to be written documents
     ● almost always unmanageable
        ○ cannot decide when
           ■ it’ll be ready
           ■ it’ll have to be indexed
           ■ it’ll have to be searchable
     ● heterogeneous
        ○ different writers, languages, topics, styles, etc.
  8. Process
  9. Project
     ● ~50M heterogeneous documents
     ● Migrating from old commercial solution to Apache Solr
     ● Google-like search
     ● Targeted search for different types of contents
  10. Advanced capabilities
     ● Smart understanding of queries
     ● Smart suggestion of queries
     ● Suggestion of similar / important contents
     ● Automatic classification of contents
  11. Responsibilities
     ● architecture analysis and design
        ○ scaling under high load
     ● continuous definition of algorithms for indexing and searching
     ● system maintenance
  12. Skills required
     ● basics of information retrieval
     ● a bit of distributed systems
     ● some natural language processing
     ● some machine learning
  13. Architecture analysis and design
     ● Shape up a prototype architecture
        ○ separate machines for indexing and search
        ○ multiple load-balanced machines for searching
        ○ define indexing and search algorithms
     ● Evaluate architecture
        ○ stress tests (performance)
        ○ quality tests (accuracy)
     ● Iterate
  14. Architecture analysis and design
     ● analyze existing documents
        ○ avg size
        ○ language
        ○ topics, style, etc.
     ● analyze existing query logs
        ○ avg response time
        ○ avg length (how much does it take to specify a query?)
        ○ avg queries per second
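A minimal sketch of the query-log analysis, assuming each log entry carries a timestamp, the query text, and the response time. The `QueryLogEntry` record and its field names are hypothetical, and "avg length" is interpreted here as the number of whitespace-separated terms, which is only one possible reading of the slide.

```python
from dataclasses import dataclass


@dataclass
class QueryLogEntry:
    timestamp_s: float   # when the query arrived, in seconds
    query: str           # raw query text as typed by the user
    response_ms: float   # time taken to answer, in milliseconds


def summarize(entries):
    """Average response time, average query length in terms, and average queries per second."""
    if not entries:
        raise ValueError("empty query log")
    n = len(entries)
    span_s = max(e.timestamp_s for e in entries) - min(e.timestamp_s for e in entries)
    return {
        "avg_response_ms": sum(e.response_ms for e in entries) / n,
        "avg_terms": sum(len(e.query.split()) for e in entries) / n,
        # With a zero-length window, fall back to the raw count.
        "avg_qps": n / span_s if span_s > 0 else float(n),
    }


# Hypothetical log sample, for illustration only.
log = [
    QueryLogEntry(0.0, "solr", 10.0),
    QueryLogEntry(1.0, "apache solr", 20.0),
    QueryLogEntry(2.0, "nfs lucene io", 30.0),
    QueryLogEntry(4.0, "search engine scaling tips", 40.0),
]
stats = summarize(log)
```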
  15. Most time spent on
     ● testing how documents get indexed
     ● testing how user queries get transformed into platform-specific queries
     ● tweaking indexing algorithms
     ● tweaking search algorithms
     ● tweaking ranking
     ● platform optimization for scalability
  16. Challenges
     ● Architecture constraints
     ● Performance
     ● Diverging stakeholder concerns
     ● Dynamically scaling search
  17. Sample architecture constraint #1
     ● Data storage has to be on NFS
     ● Lucene is I/O-intensive
     ● NFS makes it slower
     ● Concurrent reads and writes make it error-prone
  18. Sample architecture constraint #2
     ● Change search engine
     ● Systems talking to the search engine need to switch API
     ● Only in the long run
     ● In the short run an adapter layer for old APIs on new APIs has to be developed
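The short-run adapter layer could look roughly like the sketch below: callers keep using the old method signature while the adapter translates requests and responses to the new engine. Both classes are hypothetical stand-ins; `NewSearchClient` is a toy in-memory substitute rather than Solr's real client API, and `find(keywords, max_hits)` is an invented example of a legacy call.

```python
class NewSearchClient:
    """Toy stand-in for the new engine's client (hypothetical, in-memory)."""

    def __init__(self, docs):
        self._docs = docs  # maps document id to title

    def query(self, q, rows=10):
        matches = [
            {"id": doc_id, "title": title}
            for doc_id, title in self._docs.items()
            if q.lower() in title.lower()
        ]
        return {"numFound": len(matches), "docs": matches[:rows]}


class LegacySearchAdapter:
    """Exposes the old call shape on top of the new client."""

    def __init__(self, client):
        self._client = client

    def find(self, keywords, max_hits=10):
        # Old API: list of keywords in, list of (id, title) tuples out.
        response = self._client.query(" ".join(keywords), rows=max_hits)
        return [(doc["id"], doc["title"]) for doc in response["docs"]]


# Existing callers keep calling find() unchanged.
adapter = LegacySearchAdapter(NewSearchClient({1: "Apache Solr guide", 2: "NFS tuning"}))
hits = adapter.find(["solr"])
```

Every request pays for the translation in both directions, which is why such a layer is a short-run measure rather than a permanent design.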
  19. Indexing performance
     ● Most of the indexing time is spent converting data from the old (indexing) format to the new (indexing) format
     ● The adaptation layer between old and new API becomes the bottleneck
     ● Time to switch to the new API natively
  20. Diverging concerns
     ● Article authors check that the search engine handles their writings exactly, wanting perfect recall and precision
        ○ so a lot of time is spent on adjusting ranking
     ● Marketers want to be able to override ranking and put something they want to sell
        ○ ranking algorithm gets bypassed
     ● Need flexible algorithms
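One way to reconcile the two concerns is to keep the organic ranking untouched and apply promotions as a separate, bounded step on top of it. The sketch below is an assumption about how that could be done, not the approach used in the project; `apply_promotions` and `max_promoted` are hypothetical names.

```python
def apply_promotions(ranked_ids, promoted_ids, max_promoted=1):
    """Place up to max_promoted promoted ids first, then the organic ranking without duplicates."""
    # dict.fromkeys de-duplicates while preserving the order given by marketing.
    pinned = list(dict.fromkeys(promoted_ids))[:max_promoted]
    pinned_set = set(pinned)
    return pinned + [doc_id for doc_id in ranked_ids if doc_id not in pinned_set]


# Hypothetical organic ranking and promotion list.
results = apply_promotions(["a", "b", "c"], ["c", "x"], max_promoted=1)
```

Capping the number of pinned results limits how far marketing overrides can push down the results that authors expect to see.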
  21. Scale dynamically
     ● Search engine must not break even under high peaks of load
     ● Such peaks are often unpredictable
     ● Need a fast way to add more computing power
  22. Takeaways
     ● small iterations (no waterfalls!)
        ○ analyze portion of data / queries
        ○ change search / index algorithms
        ○ test, involve stakeholders
        ○ requires the ability to reindex quickly
     ● look at data (documents, query logs)