Search engines in the industry

Published in: Technology, Design

  1. 1. Search engines in the industry: a use case
  2. 2. Different interests ● researchers / engineers look for high precision and recall ● editors / writers care about how well results match queries ● marketers want to change / adapt results
  3. 3. Designing a search engine ● functional requirements ○ search ■ keywords, boolean retrieval, natural language ○ indexing ■ data sources ■ data types ○ administration ■ manage scoring / boosting functions
  4. 4. Designing a search engine ● architectural requirements ○ resiliency ○ scalability ○ no downtime ○ work with existing infrastructure ○ platforms ○ migrating from legacy systems ○ talk to other systems
  5. 5. Designing a search engine ● performance requirements ○ search ■ queries per second ■ time per search request ○ index ■ documents per second ■ time per indexing request ○ SLA?
  6. 6. Designing a search engine ● search engine performance requirements ○ recall percentiles threshold ○ precision percentiles threshold ○ minimize empty results
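
The quality requirements on the slide above can be made measurable. The following is a minimal sketch, assuming each test query comes with a hand-labelled set of relevant document ids and that returned ids are unique. The function names and the "share of queries meeting a threshold" reading of the percentile thresholds are illustrative assumptions, not something the slides specify.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(returned: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall for a single query.

    returned: document ids the engine returned (assumed unique).
    relevant: document ids a human judged relevant for the query.
    """
    hits = sum(1 for doc_id in returned if doc_id in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def share_meeting_threshold(values: List[float], threshold: float) -> float:
    """Fraction of queries whose metric is at or above the threshold."""
    if not values:
        return 0.0
    return sum(1 for v in values if v >= threshold) / len(values)


def empty_result_rate(results_by_query: Dict[str, List[str]]) -> float:
    """Fraction of queries that returned no documents at all."""
    if not results_by_query:
        return 0.0
    empty = sum(1 for docs in results_by_query.values() if not docs)
    return empty / len(results_by_query)
```

Tracking these numbers per iteration gives stakeholders a shared view of whether a ranking or indexing change helped or hurt.
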
  7. 7. Data ● often mostly unknown ○ published vs unpublished / to be written documents ● almost always unmanageable ○ cannot decide when ■ it’ll be ready ■ it’ll have to be indexed ■ it’ll have to be searchable ● heterogeneous ○ different writers, languages, topics, styles, etc.
  8. 8. Process
  9. 9. Project ● ~50M heterogeneous documents ● Migrating from old commercial solution to Apache Solr ● Google-like search ● Targeted search for different types of content
  10. 10. Advanced capabilities ● Smart understanding of queries ● Smart suggestion of queries ● Suggestion of similar / important content ● Automatic classification of content
  11. 11. Responsibilities ● architecture analysis and design ○ scaling under high load ● continuous definition of algorithms for indexing and searching ● system maintenance
  12. 12. Skills required ● basics of information retrieval ● a bit of distributed systems ● some natural language processing ● some machine learning
  13. 13. Architecture analysis and design ● Shape up a prototype architecture ○ separate machines for indexing and search ○ multiple load balanced machines for searching ○ define indexing and search algorithms ● Evaluate architecture ○ stress tests (performance) ○ quality tests (accuracy) ● Iterate
  14. 14. Architecture analysis and design ● analyze existing documents ○ avg size ○ language ○ topics, style, etc. ● analyze existing query logs ○ avg response time ○ avg length (how long does it take to specify a query?) ○ avg queries per second
  15. 15. Most time spent on ● testing how documents get indexed ● testing how user queries get transformed into platform-specific queries ● tweaking indexing algorithms ● tweaking search algorithms ● tweaking ranking ● platform optimization for scalability
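
The query log analysis on the slide above could be sketched as follows. The `LogEntry` record and its fields are hypothetical, since the slides do not describe the log format, and query length is measured here as the number of whitespace-separated terms, which is only one possible reading of "avg length".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogEntry:
    timestamp: float    # seconds since some epoch (assumed field)
    query: str          # raw user query (assumed field)
    response_ms: float  # time taken to answer, in milliseconds (assumed field)


def summarize(entries: List[LogEntry]) -> Dict[str, float]:
    """Average response time, average query length and average queries per second."""
    if not entries:
        raise ValueError("no log entries to summarize")
    n = len(entries)
    avg_response_ms = sum(e.response_ms for e in entries) / n
    avg_terms = sum(len(e.query.split()) for e in entries) / n
    timestamps = [e.timestamp for e in entries]
    span = max(timestamps) - min(timestamps)
    # With a zero-length window, fall back to treating it as one second.
    avg_qps = n / span if span > 0 else float(n)
    return {
        "avg_response_ms": avg_response_ms,
        "avg_terms": avg_terms,
        "avg_qps": avg_qps,
    }
```

Averages hide peaks, so in practice the same log would likely also be bucketed per second or per minute to see the load spikes the stress tests need to reproduce.
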
  16. 16. Challenges ● Architecture constraints ● Performance ● Diverging stakeholder concerns ● Dynamically scaling search
  17. 17. Sample architecture constraint #1 ● Data storage has to be on NFS ● Lucene is IO intensive ● NFS makes it slower ● Concurrent reads and writes make it error-prone
  18. 18. Sample architecture constraint #2 ● Change search engine ● Systems talking to the SE need to switch API ● Only in the long run ● In the short run an adapter layer for old APIs on new APIs has to be developed
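
The short-run adapter layer from the slide above could look roughly like this. It is a minimal sketch under stated assumptions: `NewSearchClient`, `LegacySearchAdapter`, `legacy_search` and the result field names are all hypothetical stand-ins, not the actual old API or Solr's client API.

```python
from typing import Dict, List


class NewSearchClient:
    """Hypothetical in-memory stand-in for the new engine's client."""

    def __init__(self, docs: List[Dict[str, str]]):
        self._docs = docs

    def query(self, q: str, rows: int = 10) -> List[Dict[str, str]]:
        # Naive matching: every query term must appear in the title.
        terms = q.lower().split()
        matches = [
            d for d in self._docs
            if all(t in d["title"].lower() for t in terms)
        ]
        return matches[:rows]


class LegacySearchAdapter:
    """Exposes the (assumed) old call signature on top of the new client."""

    def __init__(self, client: NewSearchClient):
        self._client = client

    def legacy_search(self, keywords: str, max_hits: int) -> List[Dict[str, str]]:
        results = self._client.query(q=keywords, rows=max_hits)
        # Translate new field names back to the (assumed) old ones.
        return [{"DOC_ID": r["id"], "TITLE": r["title"]} for r in results]
```

Callers keep using `legacy_search` unchanged while the engine behind it is swapped. The translation step on every call is also where such a layer can turn into a bottleneck.
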
  19. 19. Indexing performance ● Most of the indexing time is spent converting data from the old indexing format to the new indexing format ● The adapter layer between old and new API becomes the bottleneck ● Time to switch to the new API natively
  20. 20. Diverging concerns ● Article authors check that the search engine handles their writing exactly, wanting perfect recall and precision ○ so a lot of time is spent on adjusting ranking ● Marketers want to be able to override ranking and promote something they want to sell ○ ranking algorithm gets bypassed ● Need flexible algorithms
  21. 21. Scale dynamically ● Search engine must not break even under high peaks of load ● Such peaks are often unpredictable ● Need a fast way to add more computing power
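
One way to read "flexible algorithms" on the slide above is to keep the relevance ranking intact and layer marketing rules on top of it. The sketch below is an assumption about how that might be done, not the approach used in the project: `rerank`, `boosts` and `pinned` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def rerank(
    scored: List[Tuple[str, float]],
    boosts: Optional[Dict[str, float]] = None,
    pinned: Optional[List[str]] = None,
) -> List[str]:
    """Apply per-document boosts, then move pinned documents to the top.

    scored: (doc_id, relevance_score) pairs from the engine.
    boosts: multiplier per doc_id; documents without an entry keep their score.
    pinned: doc_ids to promote, in order; ids that did not match are ignored.
    """
    boosts = boosts or {}
    pinned = pinned or []
    adjusted = {doc_id: score * boosts.get(doc_id, 1.0) for doc_id, score in scored}
    ranked = sorted(adjusted, key=lambda d: adjusted[d], reverse=True)
    pinned_present = [d for d in pinned if d in adjusted]
    pinned_set = set(pinned_present)
    rest = [d for d in ranked if d not in pinned_set]
    return pinned_present + rest
```

Because pinned documents are only promoted when they already matched the query, authors still get results that reflect the query, while marketers get an explicit, auditable override instead of ad hoc changes to the scoring function.
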
  22. 22. Takeaways ● small iterations (no waterfalls!) ○ analyze a portion of data / queries ○ change search / index algorithms ○ test, involve stakeholders ○ requires the ability to reindex quickly ● look at data (documents, query logs)
