Architecture of a search engine

Slides from my Paris Tech Talks #7 talk, April 2014: the architecture of a search engine, and full-text search from my technical point of view.

Published in: Technology, Design

  1. Architecture of a Search Engine
     Paris Tech Talks #7 - April ’14
     @sylvainutard - @algolia
  2. Search
     • Today, search means Google
     • Search is a daily activity
     • Search is complex
     • Databases are (probably) not built for text queries
     • Speed and relevance are key
     • Fuzzy matching: typos!
  3. Why search engines?
     • Databases
       • Optimized for INSERT/UPDATE/DELETE/SELECT (that's a lot)
       • Strong query syntax (mostly SQL)
       • Some operations scan all your documents (missing index?)
  4. Why search engines?
     • Search engines
       • HIGHLY optimized for “SELECT” (only)
       • Full-text queries: they understand what a word is
       • Query execution time is driven by the number of matching documents
       • And obviously, “LIKE '%foo bar%'” is not full-text search
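
To illustrate the last two bullets, here is a minimal sketch contrasting a raw substring test (roughly what LIKE '%...%' does) with a word-level match. The helper names and the sample text are made up for the example, and the tokenizer is deliberately naive; real engines do much more (normalization, stemming, and so on).

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive notion of a word)."""
    return re.findall(r"\w+", text.lower())

def substring_match(query, text):
    """Roughly what LIKE '%query%' does: a raw substring test."""
    return query.lower() in text.lower()

def word_match(query, text):
    """Word-level match: every query word must appear as a whole word in the text."""
    words = set(tokenize(text))
    return all(token in words for token in tokenize(query))

# Hypothetical sample document, chosen only to show the difference.
doc = "Bar stools and foo fighters"

print(substring_match("foo bar", doc))  # False: the words are present but not adjacent
print(word_match("foo bar", doc))       # True
print(substring_match("tool", doc))     # True: "tool" is hidden inside "stools"
print(word_match("tool", doc))          # False
```

The substring test both misses a document that contains the two words and matches a fragment that is not a word at all, which is why the slide says LIKE is not full-text search.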
  5. Why search engines?
     [Diagram with three components: Application, Primary storage (DB, files, ...) and Search engine. Arrows are labelled “Search”, “Full-text search” and “Push data periodically or in real time”.]
  6. How it works
     • Input = documents
       • Composed of multiple attributes (textual, numerical, geo)
     • Output = documents
       • Full-text query and/or numerical filters
       • Understandable results: match score (ranking) + highlighting
  7. Implementation
     • 2 distinct processes
       • Indexing: storing documents in a highly optimized way to answer queries
       • Query
         • Matching documents
         • Ranking matched documents
  8. Implementation: Indexing process
     • Indexing means building an “index”, or “inverted lists”
     • A dedicated data structure optimized for search
     • Input = a set of documents containing words
     • Output = a set of words associated with documents
  9. Implementation: Indexing process
     Documents:
       Doc 1: foo bar baz
       Doc 2: bar foo
       Doc 3: baz baz qux
     Index (inverted lists), produced by indexing:
       foo: Doc 1, Doc 2
       bar: Doc 1, Doc 2
       baz: Doc 1, Doc 3
       qux: Doc 3
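
The indexing step can be sketched in a few lines. This is a minimal illustration, not how any particular engine does it: it assumes the three documents on the slide read "foo bar baz", "bar foo" and "baz baz qux", splits on whitespace only, and uses names (DOCUMENTS, build_index) invented for the example.

```python
from collections import defaultdict

# Assumed reading of the slide's three documents.
DOCUMENTS = {
    1: "foo bar baz",
    2: "bar foo",
    3: "baz baz qux",
}

def build_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so a word repeated inside one document is listed only once
        for word in set(documents[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)

index = build_index(DOCUMENTS)
for word in sorted(index):
    print(word, "->", index[word])
# bar -> [1, 2]
# baz -> [1, 3]
# foo -> [1, 2]
# qux -> [3]
```

Keeping each list sorted by document id is a design choice that makes the later intersection step a simple linear merge.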
  10. Implementation: Query process
      • Queries
        • Goal = retrieve all documents matching a user query
        • Order results from the highest ranked to the lowest
  11. Implementation: Query process
      • 1-word query = a single inverted list lookup
      [Diagram: user query "baz" → look up its inverted list in the index → sort matching documents → pagination. Inverted lists: foo: Doc 1, Doc 2; bar: Doc 1, Doc 2; baz: Doc 1, Doc 3; qux: Doc 3.]
  12. Implementation: Query process
      • N-word query = inverted lists intersection
      [Diagram: user query "baz qux" → intersect inverted lists → sort matching documents → pagination. Inverted lists: foo: Doc 1, Doc 2; bar: Doc 1, Doc 2; baz: Doc 1, Doc 3; qux: Doc 3.]
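
The query pipeline from the last two slides (lookup, intersection, sort, pagination) could look roughly like this. It is a sketch under assumptions: the INDEX literal is my reading of the slide's inverted lists, the function names and the page/per_page parameters are invented, and ranking is left as a placeholder that keeps document-id order.

```python
# Assumed inverted lists, each sorted by document id.
INDEX = {
    "foo": [1, 2],
    "bar": [1, 2],
    "baz": [1, 3],
    "qux": [3],
}

def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def search(query, index, page=0, per_page=10):
    """Return one page of ids of documents containing every query word."""
    postings = [index.get(word, []) for word in query.split()]
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list to shrink work early
    matches = postings[0]
    for other in postings[1:]:
        matches = intersect(matches, other)
    # Ranking placeholder: results stay in document-id order.
    start = page * per_page
    return matches[start:start + per_page]

print(search("baz", INDEX))      # [1, 3]
print(search("baz qux", INDEX))  # [3]
print(search("foo qux", INDEX))  # []
```

A 1-word query never enters the intersection loop, so it costs a single lookup; the cost of an N-word query is driven by the lengths of the lists involved.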
  13. Implementation: Query process
      • But how do you handle typing mistakes?
      • Edit-distance algorithms (e.g. Levenshtein)
        • levenshtein(bar, baz) = 1 (substitution)
        • levenshtein(bar, br) = 1 (deletion)
        • levenshtein(bar, foobar) = 3 (additions)
      • Comparing a word with all known words would be too costly
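
For reference, the Levenshtein distance used in the examples above can be computed with the classic dynamic-programming recurrence. This is the textbook two-row version, shown as an illustration rather than as what any specific engine runs.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    # previous[j] = distance between the first i-1 characters of a and the first j of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("bar", "baz"))     # 1
print(levenshtein("bar", "br"))      # 1
print(levenshtein("bar", "foobar"))  # 3
```

Each call costs time proportional to the product of the two word lengths, which is why running it against every word in the dictionary is too expensive.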
  14. Implementation: Query process
      • The word dictionary is stored in a trie to enable Levenshtein-based lookups (recursive traversal)
      [Diagram: a trie with nodes b, c, a, o, r, z, o, f, linked to index entries that store word positions: “Doc 1 (pos=1, 3), Doc 2 (pos=3)”, “Doc 1 (pos=2), Doc 3 (pos=1)” and “Doc 1 (pos=4), Doc 3 (pos=2)”.]
  15. Implementation: Query process
      Example: faz
      [Diagram: the same trie and index entries, with the nodes annotated with a distance for the query "faz": distance = 1, 0, 1, 1, 2, 1, 2, 3.]
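
One common way to do a Levenshtein-based lookup in a trie is to carry one row of the edit-distance table down each branch and to prune a branch as soon as every value in its row exceeds the allowed number of typos. The sketch below follows that approach; it is not necessarily what the slides implement. It assumes the dictionary holds bar, baz and foo, and the class, function names and posting lists are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.postings = None  # set only on nodes that end a word

def insert(root, word, postings):
    """Add a word to the trie and attach its posting list to the final node."""
    node = root
    for char in word:
        node = node.children.setdefault(char, TrieNode())
    node.postings = postings

def fuzzy_lookup(root, query, max_distance):
    """Return {word: (distance, postings)} for words within max_distance of query."""
    results = {}
    first_row = list(range(len(query) + 1))  # distances from the empty prefix
    for char, child in root.children.items():
        _walk(child, char, char, query, first_row, max_distance, results)
    return results

def _walk(node, char, prefix, query, previous_row, max_distance, results):
    # Compute the edit-distance row for this prefix from its parent's row.
    row = [previous_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(
            row[j - 1] + 1,                                # insertion
            previous_row[j] + 1,                           # deletion
            previous_row[j - 1] + (query[j - 1] != char),  # substitution or match
        ))
    if node.postings is not None and row[-1] <= max_distance:
        results[prefix] = (row[-1], node.postings)
    # Prune: if every entry is too large, no longer word on this branch can match.
    if min(row) <= max_distance:
        for next_char, child in node.children.items():
            _walk(child, next_char, prefix + next_char, query, row, max_distance, results)

# Hypothetical dictionary and posting lists (document ids only).
root = TrieNode()
insert(root, "bar", [1, 2])
insert(root, "baz", [1, 3])
insert(root, "foo", [1, 2])

print(fuzzy_lookup(root, "faz", max_distance=1))  # {'baz': (1, [1, 3])}
```

Because words sharing a prefix share the rows computed for that prefix, the traversal does far less work than comparing the query against every word separately.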
  16. Implementation: Query process
      • How are the matching documents ranked?
        • Number of match occurrences? TF-IDF?
        • Numerical value reflecting popularity?
        • Number of typing mistakes?
        • Proximity between matched words?
        • …
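
One possible way to combine several of these criteria is a tie-breaking sort: compare on the first criterion and fall back to the next only on ties. The sketch below is purely illustrative. The order of the criteria (typos, then proximity, then popularity), the Match fields and the sample values are all assumptions, not something the slides prescribe.

```python
from typing import NamedTuple

class Match(NamedTuple):
    doc_id: int
    typos: int       # number of typing mistakes needed to match (lower is better)
    proximity: int   # distance between matched words (lower is better)
    popularity: int  # business metric (higher is better)

def ranking_key(match):
    """Sort key: fewest typos first, then closest words, then most popular."""
    return (match.typos, match.proximity, -match.popularity)

# Made-up matches for the example.
matches = [
    Match(doc_id=1, typos=1, proximity=1, popularity=50),
    Match(doc_id=2, typos=0, proximity=3, popularity=10),
    Match(doc_id=3, typos=0, proximity=1, popularity=5),
    Match(doc_id=4, typos=0, proximity=1, popularity=80),
]

ranked = sorted(matches, key=ranking_key)
print([m.doc_id for m in ranked])  # [4, 3, 2, 1]
```

A weighted score such as TF-IDF is the other common option; the tuple approach trades tunable weights for a ranking that is easy to explain.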
  17. Several implementations
  18. Missing subjects
      • What I didn’t talk about:
        • Numerical/geo queries (including operators)
        • Advanced query syntax (boolean operators, proximity operators)
        • Faceting & aggregations (categorization)
        • Sharding (horizontal scalability)
        • Incremental indexing (generational data structures)
        • … (see you next time)
  19. Q/A
      Now or later
      sylvain@algolia.com
