Searching for The Matrix in haystack (with Elasticsearch)

Synopsi.TV case study

Tomáš Sirný
@junckritter

Pyvo/Rubyslava November 2012

The Environment
● Recommendation service for movies and TV shows
● People mark titles they have watched (check-in) and
rate them
● Get recommendations
● Make „Watch Later“ or other-purpose lists
● …
● Search (to check-in, add to list, share, etc.)

The Problem
● Input box for search on top of web page
● Many movies and TV shows in the database
● Many of them have similar titles or use similar
words
● Some are more likely to be searched for
● Little input to go on – often just 3 or 4 letters
● Autocomplete, not only exact match

The Tool
● Elasticsearch – designed for searching in
documents
● Based on Lucene – de facto standard
● Young yet feature-rich
● Quick development (despite having 1 core developer)
● A company was recently founded around it
● $10M in Series A funding

The (Wannabe) Solution
● Differentiate titles
● Have cover, plot, cast, directors
● Year
● Popularity (whatever that means)
● Prefer titles with more data and higher popularity

The Text – First Attempt

● Text Query (now Match Query)
● phrase_prefix type – all words of the input, in the
same order, with prefix matching („m“, „ma“, „mat“, …)
● operator: and
● not_analyzed „name“ field (not broken down into
words)

The Text – First Attempt

● slop parameter – allows words to change order or be
skipped
„matrix revolutions“

„revolutions matrix“

„matrix first revolutions“
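
A minimal sketch of what such a request body could look like, written as a Python dict. It assumes the legacy Match Query syntax with a "type" parameter (later Elasticsearch versions use match_phrase_prefix instead). The helper name build_title_query, the default slop of 2 and the field default are illustrative and not taken from the talk.

```python
import json


def build_title_query(user_input, slop=2, field="name"):
    """Build a legacy-style match query of type phrase_prefix.

    The last word of the input is treated as a prefix; `slop` controls
    how far words may move or how many may be skipped.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": user_input,
                    "type": "phrase_prefix",
                    "slop": slop,
                }
            }
        }
    }


body = build_title_query("matrix rev")
print(json.dumps(body, indent=2))
```

With slop set to 0 the words have to appear adjacent and in the given order; raising it is what lets a reordered or gapped input still match.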

The Sorting – First Attempt
● Default scoring considers only occurrences of the
text in documents
● We also want other properties of the document to
count
● Custom Score Query
● Define script for scoring

"script": "_score * doc['rating'].value"
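
As a rough illustration, the script could be attached to a text query like this. The sketch assumes the legacy custom_score query (the one named on this slide); the helper name with_rating_score and the inner query are hypothetical.

```python
def with_rating_score(inner_query, script="_score * doc['rating'].value"):
    """Wrap a query in a legacy custom_score query (hypothetical helper)."""
    return {
        "query": {
            "custom_score": {
                "query": inner_query,  # the text query that produces _score
                "script": script,      # evaluated for every matching document
            }
        }
    }


text_query = {"match": {"name": {"query": "matrix", "type": "phrase_prefix"}}}
request_body = with_rating_score(text_query)
print(request_body)
```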

The Rating
● Lets us prefer more „popular“ titles
● External – top lists, links, etc.
● Internal – usage data from system
● Problem for newly added titles – they lack data of
both types

The Tuning of Rating
● Get rid of external data
● Only score „completeness“ of each document
● Release year

"script": "3 * log(_score) +
1 * log(doc['year'].date.year - 1880) +
0.75 * log(doc['watched_count'].value + 1)"
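
To get a feel for how the three terms trade off, the same formula can be evaluated outside Elasticsearch. This is only a sketch: it assumes log is the natural logarithm, and the sample titles and their numbers are made up for illustration.

```python
import math


def tuned_score(text_score, release_year, watched_count):
    """Weighted sum of logarithms, mirroring the scoring script.

    Requires text_score > 0 and release_year > 1880, otherwise the
    logarithm is undefined.
    """
    return (
        3.0 * math.log(text_score)
        + 1.0 * math.log(release_year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


# Same text relevance, different release year and popularity (made-up values).
old_obscure = tuned_score(text_score=2.0, release_year=1950, watched_count=10)
new_popular = tuned_score(text_score=2.0, release_year=2012, watched_count=5000)
print(old_obscure, new_popular)
```

Because every term is a logarithm, a title watched 5000 times does not score 500 times higher than one watched 10 times; the gap is compressed, and the text match keeps the largest weight.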

The Tuning of Query
● Name field is now analyzed, with an edgeNGram filter

index:
  analysis:
    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 11
        side: front
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: standard
        filter: [lowercase, asciifolding, my_ngram]
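
The effect of this analyzer on a title can be approximated in plain Python. The sketch below is not Elasticsearch code: the regex tokenizer and the Unicode-based folding are simplified stand-ins for the standard tokenizer and the asciifolding filter, and all function names are hypothetical.

```python
import re
import unicodedata


def fold_ascii(text):
    """Rough stand-in for asciifolding: strip diacritics, drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Front edge n-grams of one token, capped at max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Approximate my_analyzer: lowercase, fold, tokenize, edge n-gram.

    For simplicity, lowercasing and folding run before tokenizing here.
    """
    normalized = fold_ascii(text.lower())
    terms = []
    for token in re.findall(r"\w+", normalized):
        terms.extend(edge_ngrams(token))
    return terms


print(analyze("The Matrix"))
# ['t', 'th', 'the', 'm', 'ma', 'mat', 'matr', 'matri', 'matrix']
```

Since every prefix of every word is indexed as its own term, a plain term match on „mat“ already finds the title, without a prefix query at search time.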

The AKA's

● Also known as – names of a title in different
countries
● A lot of additional data, sometimes only „noise“
● The „original“ name is still the most important

The AKA's
● Array of AKAs – problems with scoring of short
names
● Nested AKA documents – the query does not return
which nested document matched

● AKA document as a child of the title – has its own
information (original, country, slug)
● Top Children Query – which AKA matched
● Another query with an Ids Filter – get the titles
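
The second step of this two-query approach might look roughly like the sketch below. The shape of the AKA hits (a parent_id key per hit), the type name "title" and both helper names are assumptions made for illustration. The first query is not shown, and the request body uses the legacy filtered query with an ids filter.

```python
def parent_ids_from_aka_hits(aka_hits):
    """Collect unique parent title IDs, keeping the order of the AKA hits."""
    seen = []
    for hit in aka_hits:
        parent = hit["parent_id"]  # hypothetical field on each AKA hit
        if parent not in seen:
            seen.append(parent)
    return seen


def titles_by_ids_query(title_ids, doc_type="title"):
    """Legacy filtered query with an ids filter to fetch the parent titles."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": {"type": doc_type, "values": title_ids}},
            }
        }
    }


# Made-up AKA hits: two names of the same title, one of another.
aka_hits = [
    {"name": "The Matrix", "parent_id": "t1"},
    {"name": "Matrix", "parent_id": "t1"},
    {"name": "The Matrix Reloaded", "parent_id": "t2"},
]
title_ids = parent_ids_from_aka_hits(aka_hits)
print(titles_by_ids_query(title_ids))
```

An ids filter does not preserve the order of the IDs passed in, so the ranking from the first query would have to be reapplied to the returned titles on the application side.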

The Sorting – Second Attempt
● Custom Filter Score Query – apply set of filters,
each filter boosts documents which pass its
condition
● boost parameter of filter – differentiate
importance of that filter
● score_mode – how boost values are combined (sum,
product)

The Sorting – Used Score Filters
● Release date (in case of TV show last episode)
in last 6 months
● Release date in next 3 months
● „original“ AKA
● Has all important categories filled in
● Not Short genre
● Not TV movie
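
How boosts from several filters add up can be simulated in a few lines. This is a plain-Python sketch, not the Elasticsearch implementation. It assumes the combined boost multiplies the query score and that a document passing no filter keeps its score; the mode names "total" and "multiply" stand for sum and product. The boost values, field names and the fixed "today" are made up.

```python
from datetime import date, timedelta

TODAY = date(2012, 11, 1)  # fixed date so the example is reproducible

# (predicate, boost) pairs modelled on three of the filters listed above.
SCORE_FILTERS = [
    (lambda d: TODAY - timedelta(days=183) <= d["released"] <= TODAY, 2.0),
    (lambda d: "Short" not in d["genres"], 1.5),
    (lambda d: not d["tv_movie"], 1.2),
]


def boosted_score(query_score, doc, score_filters, score_mode="total"):
    """Apply the boosts of all filters the document passes."""
    boosts = [boost for predicate, boost in score_filters if predicate(doc)]
    if not boosts:
        return query_score  # no filter matched: score stays as it is
    if score_mode == "total":
        combined = sum(boosts)
    elif score_mode == "multiply":
        combined = 1.0
        for boost in boosts:
            combined *= boost
    else:
        raise ValueError("unsupported score_mode: %s" % score_mode)
    return query_score * combined


recent = {"released": date(2012, 9, 1), "genres": ["Action"], "tv_movie": False}
old_short = {"released": date(1999, 3, 31), "genres": ["Short"], "tv_movie": True}
print(boosted_score(1.0, recent, SCORE_FILTERS))     # all three filters pass
print(boosted_score(1.0, old_short, SCORE_FILTERS))  # none pass
```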

The Sorting – Short Input
● Special case: input of 1 – 3 letters
● An exact match is very rare
● Should work after the first letter is typed
● Only titles from this year
● 3 letters – also titles from the near future and the
previous year
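
One way to express this rule is as a release-year window that depends on the input length. The sketch below is an interpretation, not the production logic: it assumes „near future“ means the following year, and the function name is hypothetical.

```python
def release_year_window(user_input, current_year):
    """Return (first_year, last_year) allowed for a short input.

    Returns None when the input is empty or longer than 3 characters,
    meaning no special-case restriction applies.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None
    if length < 3:
        # 1 or 2 letters: only titles from the current year
        return (current_year, current_year)
    # exactly 3 letters: previous year up to the (assumed) near future
    return (current_year - 1, current_year + 1)


print(release_year_window("m", 2012))
print(release_year_window("mat", 2012))
print(release_year_window("matrix", 2012))
```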

The Year in Input
● Matrix 1999
● Matrix Reloaded (2003)
● Matrix 2000- (released up to 2000)
● Matrix 2000+ (released since 2000)
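
The four input forms could be recognised with a single regular expression. This is a sketch under assumptions: the year has to come last, be separated from the title by whitespace and fall between 1800 and 2099. The function name and the returned tuple layout are made up for illustration.

```python
import re

# Title, whitespace, optional "(", four-digit year, optional ")", optional +/-.
YEAR_SUFFIX = re.compile(
    r"^(?P<title>.+?)\s+\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-])?\s*$"
)


def parse_year_input(text):
    """Split input into (title, min_year, max_year); open bounds are None."""
    text = text.strip()
    match = YEAR_SUFFIX.match(text)
    if match is None:
        return (text, None, None)  # no trailing year: plain title search
    title = match.group("title")
    year = int(match.group("year"))
    modifier = match.group("mod")
    if modifier == "+":
        return (title, year, None)  # released since that year
    if modifier == "-":
        return (title, None, year)  # released up to that year
    return (title, year, year)      # exactly that year


for example in ["Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"]:
    print(parse_year_input(example))
```

Requiring whitespace before the year keeps inputs that consist only of a number, such as a title that is itself a year, from being read as a filter with an empty title.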

One More Thing – Advanced Search
● Titles also carry data about their usage
● „Watched by Friends“ filter
Shows titles with IDs of your „friends“ in the proper
field (TermsFilter([IDS]))
● „Not Watched“ filter
Shows titles in which your ID is absent
(NotFilter(TermFilter(ID)))
● Combination – titles to watch to catch up with
friends
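
Combined, the two filters might translate to a request fragment like the one below. The sketch uses the legacy filter DSL (and, terms, not, term). The field name watched_by and the helper name are hypothetical, since the slides do not name the field.

```python
def catch_up_filter(my_id, friend_ids, field="watched_by"):
    """Titles watched by at least one friend but not by me."""
    return {
        "and": [
            # "Watched by Friends": any friend ID present in the field
            {"terms": {field: list(friend_ids)}},
            # "Not Watched": my own ID absent from the same field
            {"not": {"filter": {"term": {field: my_id}}}},
        ]
    }


print(catch_up_filter(my_id=42, friend_ids=[7, 8, 9]))
```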

The End

Thanks

Tomáš Sirný
@junckritter
