
Searching for The Matrix in haystack (with Elasticsearch)

A case study of the technical solution behind a search box for movies and TV shows, built with Elasticsearch.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,660
On SlideShare
0
From Embeds
0
Number of Embeds
85
Actions
Shares
0
Downloads
14
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Searching for The Matrix in haystack (with Elasticsearch)

  1. Searching for The Matrix in haystack (with Elasticsearch)
     Synopsi.TV case study
     Tomáš Sirný, @junckritter
     Pyvo/Rubyslava, November 2012
  2. The Environment
     ● Recommendation service for movies and TV shows
     ● People mark titles they have watched (check-in) and rate them
     ● Get recommendations
     ● Make "Watch Later" or other-purpose lists
     ● …
     ● Search (to check in, add to a list, share, etc.)
  3. The Problem
     ● Input box for search at the top of the web page
     ● Many movies and TV shows in the database
     ● A lot of them have similar titles or use similar words
     ● Some are more likely to be searched for than others
     ● Very little input information: 3 or 4 letters
     ● Autocomplete, not only exact match
  4. The Red Pill
  5. The Blue Pill
  6. The Tool
     ● Elasticsearch: designed for searching in documents
     ● Based on Lucene, the de facto standard
     ● Young yet feature-rich
     ● Quick development (despite 1 core developer)
     ● A company was recently founded around it
     ● 10M funding in the A round
  7. The (Wannabe) Solution
     ● Differentiate titles
     ● Have cover, plot, cast, directors
     ● Year
     ● Popularity (whatever it means)
     ● Prefer titles with more data and more popular ones
  8. The Text – First Attempt
     ● Text Query (now Match Query)
     ● phrase_prefix type: all words of the input must match, the last ones as prefixes ("m", "ma", "mat", …), in the same order
     ● operator: and
     ● not_analyzed "name" field (not broken down into words)
  9. The Text – First Attempt
     ● slop parameter: allows a changed word order and skipped words
       "matrix revolutions"
       "revolutions matrix"
       "matrix first revolutions"
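The first-attempt query from slides 8 and 9 can be sketched as a request body. This is a minimal sketch, not the actual Synopsi.TV code: the function name `build_title_query` and the default `slop=2` are assumptions, and the body uses the legacy form in which `phrase_prefix` is a `type` of the match query (later Elasticsearch versions expose it as a separate `match_phrase_prefix` query).

```python
def build_title_query(user_input, slop=2):
    """Build a legacy match query of type phrase_prefix on the `name` field.

    `slop=2` is an illustrative value; the slides do not state the one used.
    """
    return {
        "query": {
            "match": {
                "name": {
                    "query": user_input,
                    "type": "phrase_prefix",  # last word is matched as a prefix
                    "operator": "and",        # every word of the input must match
                    "slop": slop,             # tolerate reordered or skipped words
                }
            }
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_title_query("matrix rev"), indent=2))
```

A higher slop makes "revolutions matrix" or "matrix first revolutions" acceptable matches for the input "matrix revolutions", at the cost of more loosely related hits.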
  10. The Sorting – First Attempt
     ● Default scoring considers only the occurrence of the text in documents
     ● We also want other properties of the document to count
     ● Custom Score Query
     ● Define a script for scoring
       "script": "_score * doc['rating'].value"
  11. The Rating
     ● Allows preferring more "popular" titles
     ● External: top lists, links, etc.
     ● Internal: usage data from the system
     ● Problem for newly added titles: lack of data of both types
  12. The Tuning of Rating
     ● Get rid of external data
     ● Only score the "completeness" of each document
     ● Release year
       "script": "3 * log(_score) + 1 * log(doc['year'].date.year - 1880) + 0.75 * log(doc['watched_count'].value + 1)"
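To see how the three terms of the tuned script trade off, the formula can be evaluated outside Elasticsearch. The sketch below assumes that `log` in the script is the natural logarithm and uses hypothetical argument names; it only mirrors the arithmetic of the slide, not the scripting engine.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Evaluate the slide's scoring formula for one document.

    text_score    -- the text relevance (_score), must be > 0
    year          -- release year, must be > 1880
    watched_count -- number of check-ins, >= 0
    """
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


if __name__ == "__main__":
    # Same text relevance, different year and popularity.
    print(tuned_score(2.0, 1999, 500))
    print(tuned_score(2.0, 1950, 0))
```

Because every term is logarithmic, a title with no check-ins still gets a usable score (the `+ 1` keeps the last term at zero instead of undefined), and text relevance carries the largest weight.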
  13. The Tuning of Query
     ● Name field analyzed, with an edgeNGram filter
       index:
         analysis:
           filter:
             my_ngram:
               type: edgeNGram
               min_gram: 1
               max_gram: 11
               side: front
           analyzer:
             my_analyzer:
               type: custom
               tokenizer: standard
               filter: [lowercase, asciifolding, my_ngram]
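What this analyzer puts into the index can be approximated in plain Python. The sketch below is only an approximation under stated assumptions: a regular expression stands in for the standard tokenizer, Unicode decomposition stands in for asciifolding, and the helper names `edge_ngrams` and `analyze` are made up for illustration.

```python
import re
import unicodedata


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Return the front-anchored n-grams of a token, as with side: front."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Approximate my_analyzer: tokenize, lowercase, fold accents, edge n-grams."""
    # Lowercase and strip combining accents (rough stand-in for asciifolding).
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Split into word tokens (rough stand-in for the standard tokenizer).
    tokens = re.findall(r"\w+", folded)
    return [gram for token in tokens for gram in edge_ngrams(token)]


if __name__ == "__main__":
    print(analyze("The Matrix"))
```

With the prefixes stored at index time, an input such as "mat" matches "Matrix" as an ordinary term lookup, which is why the field no longer needs to be not_analyzed.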
  14. The AKAs
     ● "Also known as": names of a title in different countries
     ● A lot of additional data, sometimes only "noise"
     ● The "original" name is still the most important
  15. The AKAs
     ● Array of AKAs: problems with scoring of short names
     ● Nested AKA documents: the query does not return the nested document that matched
     ● AKA document as a child of the title: has its own information (original, country, slug)
     ● Top Children Query: tells which AKA matched
     ● Another query with an Ids Filter: gets the titles
  16. The Sorting – Second Attempt
     ● Custom Filters Score Query: apply a set of filters, each filter boosts the documents that pass its condition
     ● boost parameter of a filter: sets the importance of that filter
     ● score_mode: sum or product of the boost values
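The effect of filter-based boosting can be illustrated with a small in-memory model. This is a sketch of the idea only, not the Elasticsearch implementation: filters are modelled as Python predicates, the boost values and document fields are invented, and returning a neutral factor of 1.0 when no filter matches is an assumption.

```python
def filter_boost(doc, filters, score_mode="sum"):
    """Combine the boosts of all filters that a document passes.

    filters    -- list of (predicate, boost) pairs
    score_mode -- "sum" or "product" of the matching boost values
    """
    boosts = [boost for predicate, boost in filters if predicate(doc)]
    if not boosts:
        return 1.0  # assumption: no matching filter leaves the score unchanged
    if score_mode == "sum":
        return float(sum(boosts))
    if score_mode == "product":
        result = 1.0
        for boost in boosts:
            result *= boost
        return result
    raise ValueError("unsupported score_mode: %r" % score_mode)


# Hypothetical filters in the spirit of the deck (values are illustrative).
EXAMPLE_FILTERS = [
    (lambda d: d.get("original", False), 2.0),           # "original" AKA
    (lambda d: "Short" not in d.get("genres", []), 1.5),  # not Short genre
]

if __name__ == "__main__":
    doc = {"original": True, "genres": ["Action"]}
    print(filter_boost(doc, EXAMPLE_FILTERS, "sum"))
    print(filter_boost(doc, EXAMPLE_FILTERS, "product"))
```

The resulting factor would then be multiplied with the text score, so each filter acts as an independent knob that can be tuned without rewriting a scoring script.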
  17. The Sorting – Used Score Filters
     ● Release date (for a TV show, the last episode) in the last 6 months
     ● Release date in the next 3 months
     ● "original" AKA
     ● Has all important categories filled in
     ● Not in the Short genre
     ● Not a TV movie
  18. The Sorting – Short Input
     ● Special case: 1 to 3 letters
     ● An exact match is very rare
     ● Should work after typing the first letter
     ● Only titles from this year
     ● 3 letters: also titles from the near future and the previous year
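One possible reading of the short-input rule is a year window that widens with the input length. The sketch below is an interpretation, not the deck's code: the function name `year_window` is hypothetical, and "near future" is assumed to mean the next calendar year.

```python
def year_window(user_input, current_year):
    """Return an inclusive (from_year, to_year) restriction for short inputs.

    Returns None when the input is empty or longer than 3 characters,
    meaning no year restriction is applied.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None
    if length < 3:
        # 1 or 2 letters: only titles from the current year.
        return (current_year, current_year)
    # 3 letters: previous year up to the (assumed) next year.
    return (current_year - 1, current_year + 1)


if __name__ == "__main__":
    for text in ("m", "ma", "mat", "matr"):
        print(repr(text), year_window(text, 2012))
```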
  19. The Year in Input
     ● Matrix 1999
     ● Matrix Reloaded (2003)
     ● Matrix 2000- : released up to 2000
     ● Matrix 2000+ : released since 2000
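The four input forms above can be recognised with a single regular expression. This is a minimal sketch under assumptions: the function name `parse_year_input` and the accepted year range (1800 to 2099) are not from the slides, and an input consisting of a year alone is treated as a plain title.

```python
import re

# Title text, whitespace, an optionally parenthesised year, optional +/- suffix.
_YEAR_SUFFIX = re.compile(
    r"^(?P<text>.+?)\s+\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-]?)\s*$"
)


def parse_year_input(user_input):
    """Split input into (text, from_year, to_year); None means unbounded."""
    match = _YEAR_SUFFIX.match(user_input.strip())
    if match is None:
        return (user_input.strip(), None, None)
    text = match.group("text")
    year = int(match.group("year"))
    modifier = match.group("mod")
    if modifier == "+":
        return (text, year, None)   # released since that year
    if modifier == "-":
        return (text, None, year)   # released up to that year
    return (text, year, year)       # released in exactly that year


if __name__ == "__main__":
    for query in ("Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"):
        print(query, "->", parse_year_input(query))
```

The remaining text goes to the title query, while the year bounds can become a range restriction on the release year.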
  20. One More Thing – Advanced Search
     ● Titles also carry data about their usage
     ● "Watched by Friends" filter: shows titles that have the IDs of your "friends" in the proper field (TermsFilter([IDS]))
     ● "Not Watched" filter: shows titles in which your ID is absent (NotFilter(TermFilter(ID)))
     ● Combination: titles to watch to catch up with friends
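The combination of the two filters can be sketched as a request fragment. This assumes the legacy Elasticsearch filter DSL (`terms`, `term`, `not`, `and`) that was current at the time of the talk; the field name `watched_by` and the function name `catch_up_filter` are hypothetical, since the slides only speak of a "proper field".

```python
def catch_up_filter(my_id, friend_ids, field="watched_by"):
    """Build a filter for titles watched by friends but not by the user."""
    # "Watched by Friends": the field contains at least one friend ID.
    watched_by_friends = {"terms": {field: list(friend_ids)}}
    # "Not Watched": the field does not contain the user's own ID.
    not_watched_by_me = {"not": {"filter": {"term": {field: my_id}}}}
    # Both conditions must hold.
    return {"and": [watched_by_friends, not_watched_by_me]}


if __name__ == "__main__":
    import json
    print(json.dumps(catch_up_filter(42, [7, 13, 99]), indent=2))
```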
  21. The End
     Thanks
     Tomáš Sirný, @junckritter
