Searching for The Matrix in haystack
        (with Elasticsearch)
         Synopsi.TV case study



           Tomáš Sirný
           @junckritter

 Pyvo/Rubyslava November 2012
The Environment
●   Recommendation service for movies, TV shows
●   People mark titles they have watched (check-in) and
    rate them
●   Get recommendations
●   Make „Watch Later“ or other-purpose lists
●   …
●   Search (to check-in, add to list, share, etc.)
The Problem
●   Input box for search on top of web page
●   Many movies, TV shows in database
●   Many of them have similar titles or use similar
    words
●   Some are more likely to be searched for
●   Little input to go on – 3 or 4 letters
●   Autocomplete, not only exact match
The Red Pill
The Blue Pill
The Tool
●   Elasticsearch – designed for searching in
    documents
●   Based on Lucene – de facto standard
●   Young yet feature-rich
●   Rapid development (despite 1 core developer)
●   Commercial company recently founded
●   $10M funding in Series A round
The (Wannabe) Solution
●   Differentiate titles
●   Have cover, plot, cast, directors
●   Year
●   Popularity (whatever it means)
●   Prefer titles that have more data and are more popular
The Text – First Attempt

●   Text Query (now Match Query)
●   phrase_prefix type – all words in input with
    matching of prefixes („m“, „ma“, „mat“, …), same
    order of words
●   operator: and
●   not_analyzed „name“ field (not broken down into
    words)
The Text – First Attempt

●   slop parameter – allows words to change order or be
    skipped
                 „matrix revolutions“

                 „revolutions matrix“

              „matrix first revolutions“
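The first attempt above can be sketched as a request body. This is a minimal illustration, assuming the 2012-era query DSL in which the match query accepts a `type` of `phrase_prefix`; the function name and the default slop value are hypothetical, and newer Elasticsearch versions expose this as a separate query type.

```python
def phrase_prefix_query(user_input, slop=0):
    """Build a match query of type phrase_prefix on the "name" field.

    Assumption: 2012-era DSL; "name" is the field named on the slide.
    """
    return {
        "query": {
            "match": {
                "name": {
                    "query": user_input,
                    "type": "phrase_prefix",  # last word is matched as a prefix
                    "operator": "and",        # all words must be present
                    "slop": slop,             # tolerated reordering / skipped words
                }
            }
        }
    }


# Hypothetical usage: the user has typed "matrix rev" so far.
body = phrase_prefix_query("matrix rev", slop=2)
```

A larger slop makes the match more forgiving, at the price of more loosely related hits.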
The Sorting – First Attempt
●   Default scoring considers only occurrences of the text in
    documents
●   We also want other properties of the document to
    count
●   Custom Score Query
●   Define script for scoring

        "script": "_score * doc['rating'].value"
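A rough sketch of how such a script wraps an existing text query, assuming the 2012-era `custom_score` query. The helper name is hypothetical; the script string is the one from the slide.

```python
def with_rating_score(inner_query):
    """Wrap a text query so its score is multiplied by the "rating" field."""
    return {
        "query": {
            "custom_score": {
                "query": inner_query,
                # Script from the slide: text relevance times stored rating.
                "script": "_score * doc['rating'].value",
            }
        }
    }


# Hypothetical usage with a simple inner query on the "name" field.
body = with_rating_score({"match": {"name": "matrix"}})
```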
The Rating
●   Lets us prefer more „popular“ titles
●   External – top lists, links, etc.
●   Internal – usage data from system
●   Problem for newly added titles – they lack data of
    both types
The Tuning of Rating
●   Get rid of external data
●   Only score „completeness“ of each document
●   Release year


               "script": "3 * log(_score) +
       1 * log(doc['year'].date.year - 1880) +
    0.75 * log(doc['watched_count'].value + 1)"
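To see how the three terms trade off, the script can be mirrored in plain Python. This is only a sketch: it assumes `log` in the script is the natural logarithm, and the function and argument names are made up for illustration.

```python
import math


def tuned_score(text_score, release_year, watched_count):
    """Mirror of the scoring script above.

    Assumes text_score > 0 and release_year > 1880, otherwise log() fails.
    """
    return (
        3 * math.log(text_score)
        + 1 * math.log(release_year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


# Hypothetical comparison: same text score and year, different usage counts.
unwatched = tuned_score(1.0, 1999, 0)
popular = tuned_score(1.0, 1999, 100)
```

The logarithms dampen each factor, so a heavily watched title gets a boost without drowning out the text relevance, and a title with `watched_count` of 0 is not penalised below its year term.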
The Tuning of Query
●    Name field analyzed with an edgeNGram filter

index:
  analysis:
    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 11
        side: front
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: standard
        filter: [lowercase, asciifolding, my_ngram]
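What this analyzer produces can be approximated in a few lines of Python. This is a simplified sketch, not Lucene's implementation: the tokenizer is a plain word regex, and the folding step only strips combining accents rather than doing full ASCII folding.

```python
import re
import unicodedata


def ascii_fold(text):
    """Approximate asciifolding by dropping combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Front-anchored n-grams, like side: front with min_gram/max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    """Tokenize, lowercase, fold accents, then expand into edge n-grams."""
    grams = []
    for token in re.findall(r"\w+", text):
        grams.extend(edge_ngrams(ascii_fold(token.lower())))
    return grams


print(analyze("The Matrix"))
# ['t', 'th', 'the', 'm', 'ma', 'mat', 'matr', 'matri', 'matrix']
```

Because every prefix of every word is indexed as its own term, the input „mat“ matches as an ordinary term lookup, with no prefix query needed at search time.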
The AKA's

●   Also known as – names of a title in different
    countries
●   A lot of additional data, sometimes only „noise“
●   The „original“ name is still the most important
The AKA's
●   Array of AKAs – problems with scoring of short
    names
●   Nested AKA documents – the query does not return
    the nested document which matched

●   AKA document is a child of the title – it has its own
    information (original, country, slug)
●   Top Children Query – which AKA matched
●   Another query with Ids Filter – get titles
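The two request bodies named above might look roughly like this, assuming the 2012-era `top_children` query and `ids` filter. The child type name „aka“, the „name“ field and the helper names are assumptions. How the matched AKA is read out of the first response is not described in the slides and is not modelled here.

```python
def aka_top_children_query(user_input, child_type="aka"):
    """First request: run the text query against child AKA documents."""
    return {
        "query": {
            "top_children": {
                "type": child_type,  # assumed name of the child type
                "query": {"match": {"name": user_input}},
                "score": "max",      # best-matching AKA decides the score
            }
        }
    }


def titles_by_ids_query(title_ids):
    """Second request: fetch the title documents for the collected IDs."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": {"values": list(title_ids)}},
            }
        }
    }


# Hypothetical usage with made-up title IDs.
first = aka_top_children_query("matrix")
second = titles_by_ids_query(["603", "604"])
```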
The Sorting – Second Attempt
●   Custom Filter Score Query – apply a set of filters,
    each filter boosts documents which pass its
    condition
●   boost parameter of a filter – sets the importance
    of that filter
●   score_mode – sum or product of boost values
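A sketch of such a request body, assuming the 2012-era `custom_filters_score` query, where the sum mode is spelled `total` and the product mode `multiply`. The field names and boost values in the usage example are invented.

```python
def boosted_by_filters(inner_query, boosted_filters, score_mode="total"):
    """Wrap a query so that each passing filter contributes its boost.

    boosted_filters is a list of (filter_dict, boost) pairs.
    """
    return {
        "query": {
            "custom_filters_score": {
                "query": inner_query,
                "filters": [
                    {"filter": flt, "boost": boost}
                    for flt, boost in boosted_filters
                ],
                "score_mode": score_mode,
            }
        }
    }


# Hypothetical filters and boosts; field names are assumptions.
body = boosted_by_filters(
    {"match": {"name": "matrix"}},
    [
        ({"term": {"is_original": True}}, 2.0),
        ({"not": {"term": {"genre": "short"}}}, 1.5),
    ],
)
```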
The Sorting – Used Score Filters
●   Release date (for a TV show, of its last episode)
    in the last 6 months
●   Release date in the next 3 months
●   „original“ AKA
●   Has all important categories filled in
●   Not Short genre
●   Not TV movie
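How a few of these filters could combine into one boost is simulated below in plain Python. The boost values, the dictionary keys and the day counts standing in for „6 months“ and „3 months“ are all assumptions; the slides do not give the real numbers.

```python
from datetime import date, timedelta


def filter_boost(title, today, score_mode="sum"):
    """Combine the boosts of all filters the title passes."""
    released = title["released"]
    checks = [
        (3.0, today - timedelta(days=183) <= released <= today),      # last ~6 months
        (2.0, today < released <= today + timedelta(days=92)),        # next ~3 months
        (1.5, "Short" not in title["genres"]),                        # not Short genre
        (1.5, not title["tv_movie"]),                                 # not TV movie
    ]
    passed = [boost for boost, ok in checks if ok]
    if score_mode == "sum":
        return sum(passed)
    product = 1.0
    for boost in passed:
        product *= boost
    return product


# Hypothetical title released two and a half months before "today".
recent = {"released": date(2012, 9, 1), "genres": ["Action"], "tv_movie": False}
today = date(2012, 11, 15)
total = filter_boost(recent, today)                        # 3.0 + 1.5 + 1.5
multiplied = filter_boost(recent, today, score_mode="product")  # 3.0 * 1.5 * 1.5
```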
The Sorting – Short Input
●   Special case: 1–3 letters
●   An exact match is very rare
●   Should work after the first letter is typed
●   Only titles from this year
●   3 letters – also titles from the near future and the
    previous year
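One way to read the rules above is as a release-year window that depends on input length. This sketch assumes „near future“ means next year, which the slides do not spell out; the function name is hypothetical.

```python
def year_window(user_input, current_year):
    """Return an inclusive (from_year, to_year) restriction, or None.

    None means no short-input restriction applies.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None
    if length == 3:
        # Assumption: "near future" is taken to be next year.
        return (current_year - 1, current_year + 1)
    return (current_year, current_year)
```

For example, `year_window("m", 2012)` restricts results to 2012 only, while `year_window("mat", 2012)` widens the window to 2011 through 2013.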
The Year in Input
●   Matrix 1999
●   Matrix Reloaded (2003)
●   Matrix 2000- (released up to 2000)
●   Matrix 2000+ (released since 2000)
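Splitting the year off the input could be done with a regular expression, as in this sketch. The pattern, the function name and the choice to treat the bounds as inclusive are assumptions. A title that itself ends in a year-like number would be misread, which a real implementation would need to handle.

```python
import re

# Optional parentheses around a four-digit year, optional trailing + or -.
YEAR_SUFFIX = re.compile(
    r"^(?P<text>.*?)\s*\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-]?)\s*$"
)


def split_year(user_input):
    """Return (title_text, year_from, year_to); None means unbounded."""
    raw = user_input.strip()
    match = YEAR_SUFFIX.match(raw)
    if not match or not match.group("text"):
        # No year found, or the whole input is a year: leave it untouched.
        return raw, None, None
    text = match.group("text")
    year = int(match.group("year"))
    modifier = match.group("mod")
    if modifier == "+":
        return text, year, None
    if modifier == "-":
        return text, None, year
    return text, year, year


print(split_year("Matrix Reloaded (2003)"))
# ('Matrix Reloaded', 2003, 2003)
```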
One More Thing – Advanced Search
●   Titles also carry data about their usage
●   „Watched by Friends“ filter
    Shows titles that have the IDs of your „friends“ in the
    relevant field (TermsFilter([IDS]))
●   „Not Watched“ filter
    Shows titles in which your ID is absent
    (NotFilter(TermFilter(ID)))
●   Combination – titles to watch to catch up with
    friends
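The combination can be sketched as a raw filter body plus a small in-memory check of the same logic. The field name `watched_by` and both function names are assumptions; the slides only say the IDs live in a „proper field“.

```python
def catch_up_filter(my_id, friend_ids, field="watched_by"):
    """Filter body: watched by at least one friend, but not by me."""
    return {
        "and": [
            {"terms": {field: list(friend_ids)}},
            {"not": {"term": {field: my_id}}},
        ]
    }


def matches(title, my_id, friend_ids, field="watched_by"):
    """Same condition evaluated locally, to make the semantics explicit."""
    watchers = set(title[field])
    return bool(watchers & set(friend_ids)) and my_id not in watchers


# Hypothetical user 1 with friends 2 and 3.
body = catch_up_filter(1, [2, 3])
to_watch = matches({"watched_by": [2, 7]}, 1, [2, 3])      # a friend saw it, I did not
already_seen = matches({"watched_by": [1, 2]}, 1, [2, 3])  # I have seen it too
```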
The End




  Thanks


Tomáš Sirný
@junckritter
