Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama
Guillaume Cabanac
guillaume.cabanac@univ-tlse3.fr
Toulouse: A Picture is Worth a Thousand Words
Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
Toulouse
population: 437 000
students: 97 000
Capbreton: 3h ride
Ax-les-Thermes: 1h40 ride
Collioure: 2h30 ride
en.wikipedia.org
Telly Addicts Need Help to Find TV Series
• Main Topics of Grey’s Anatomy?
  → Text mining, Visualization
• Series about ‘plane crash island’
  → Search engine
• What should I watch next?
  → Recommender system
amazon.com →
Text Mining: Let’s Crunch Subtitles
• Main Topics of Grey’s Anatomy?
  → Text mining, Visualization
• Series about ‘plane crash island’
  → Search engine
• What should I watch next?
  → Recommender system
Cold Case
Grey’s Anatomy
What’s in a Subtitle File?
• Title – Season – Episode – Language.srt
• 1 episode = 1 plain text file
• Synchronization
  → start --> stop
• Dialogue
• We can easily extract words
[ a, again*2, and, but, com, cuban,
different, favorite, food, for*2, forum,
going, great, happen*2, has, hungry, i*2,
is, it, love, m, my, nice, night*2, miami,
now, pork, s*2, sandwiches, something, the,
to*2, tonight, town, www ]
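The word extraction can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual SRT layout (a cue number, a `start --> stop` line, then dialogue lines); the sample cues and the function name `extract_words` are made up for the example and are not taken from the original system, which is written in Java and Oracle.

```python
import re
from collections import Counter

# Hypothetical two-cue subtitle file, for illustration only.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I'm hungry. I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
Something great is going to happen tonight.
"""

def extract_words(srt_text):
    """Count lowercase words found in the dialogue lines of an SRT file."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and synchronization lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        # Keep runs of letters only, so "I'm" yields "i" and "m".
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

Splitting on non-letters is why single-letter tokens such as `m` and `s` show up in the word list above.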
DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle
DB technology at Work! [Search engine]
Ranked list of results
DB technology at Work! [Infos]
Most popular terms
Most related series
DB technology at Work! [Recommendations]
DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
DB technology at Work! [Recommendations]
Ranked list of recommendations
How Does this Work?
Architecture and Data Model
Offline: subtitles → indexing → DB
Online: DB → searching, browsing, recommending → GUI

Dict = { idT, term }
  idT  term
  8    plane
  27   killer
  29   crash

Posting = { idT*, idS*, nb }
  idT  idS  nb
  27   45   89
  8    45   3
  8    12   90

Posting.idT ⊆ Dict.idT
Theory − Text Indexing Pipeline
Tokenization + lowercase → [the, plane, crashed, ..., planes, ..., is]
Stopwords removal → [plane, crashed, ..., planes, ...]
Stemming → [plane, crash, ..., plane, ...]
  Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html
Counting → {(plane, 48), (crash, 15) ...}

Sample text:
In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
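The four pipeline steps can be chained as in the Python sketch below. It rests on two loudly stated assumptions: the stopword list is a tiny illustrative one, and `toy_stem` is a crude suffix stripper used as a stand-in, not an implementation of Porter's algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "on", "and", "of", "to", "in"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, not the real thing."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def index_text(text):
    """Return {stem: count} for one text."""
    tokens = re.findall(r"[a-z]+", text.lower())       # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]   # stopwords removal
    stems = [toy_stem(t) for t in kept]                # stemming
    return Counter(stems)                              # counting

counts = index_text("The plane crashed. Two planes crashed on the island.")
print(counts.most_common())
```

Here `plane` and `planes` collapse to one stem, as do the two occurrences of `crashed`, which is what lets the counting step group word variants.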
Theory − Similarity of Paired Series
• Dice’s Coefficient (1945)
  → Based on set theory
• Example: let us model a series as a set of terms
  House = {hospital, doctor, crazy, psycho}
  Grey’s = {doctor, care, hospital}
A Big Limitation
• The distribution of terms among series is ignored
  It makes no difference whether a term occurs 1 time or 1,000,000 times
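For the House and Grey's example, Dice's coefficient, 2|A ∩ B| / (|A| + |B|), can be worked out with Python sets. The function name `dice` is chosen for this sketch only.

```python
def dice(a, b):
    """Dice's coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Two shared terms out of 4 + 3 terms: 2*2 / 7
print(round(dice(house, greys), 3))
```

Because the inputs are sets, a term said once and a term said a thousand times count the same, which is the limitation of this measure.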
Theory − Vector Space Model, Term Weighting
[Figure: TF over the vocabulary for Dexter and Lost, with each series’ max(TF) marked; term in question: ‘survive’]
Raw TF: dexter > lost
• Normalization: TF / max(TF)
Normalized TF: dexter < lost
Theory − Best Match Retrieval
1 TV series = 1 vector
[Figure: vector with one component per term: 1, …, 45, …, 1467, …, 6790, …, n]
Now, we know how to:
• Find the most popular terms for a TV series
• Compute similarity between TV series
• Find TV series matching a query
Theory − More on Term Weighting
1 TV series = 1 vector
[Figure: vector with one component per term: 1, …, 45, …, 1467, …, 6790, …, n]
• All terms are supposed to be equally representative
  … but ‘survive’ is way more unusual than ‘people’
  ⇒ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
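A common way to compute IDF is log(N / df), where N is the number of series and df the number of series containing the term. The sketch below assumes that formula with a natural logarithm; the slides do not state which variant or log base was used, and the toy collection is hypothetical.

```python
import math

def idf(term, series_terms):
    """Inverse document frequency: log(N / df) (natural log is an assumption)."""
    n = len(series_terms)
    df = sum(1 for terms in series_terms.values() if term in terms)
    if df == 0:
        raise ValueError(f"term {term!r} occurs in no series")
    return math.log(n / df)

# Hypothetical toy collection, not the real index.
series_terms = {
    "Lost": {"survive", "people", "plane"},
    "Dexter": {"people", "killer"},
    "House": {"people", "doctor"},
    "Grey's Anatomy": {"people", "doctor", "survive"},
}

for term in ("people", "survive", "plane"):
    print(term, round(idf(term, series_terms), 3))
```

A term found in every series (`people`) gets a weight of 0, while rarer terms get higher weights.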
Theory − The Big Picture: TF*IDF
An important term for series S is frequent in S and globally unusual.
Theory … and Practice
Series = { idS, name, maxNb }
  idS  name    maxNb
  12   Lost    540
  45   Dexter  125

Dict = { idT, term, idf }
  idT  term    idf
  8    plane   1.25
  27   killer  2.87
  29   crash   3.07

Posting = { idT*, idS*, nb, tf }
  idT  idS  nb  tf
  27   45   89  0.71
  8    45   3   0.02
  8    12   90  0.16

Posting.idT ⊆ Dict.idT
Posting.idS ⊆ Series.idS
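To experiment with this data model without an Oracle instance, the three tables can be recreated in SQLite through Python's `sqlite3` module. This is a sketch under assumptions: SQLite stands in for Oracle, the column types and the composite key on Posting are guesses, and the query is adapted from the first query in the Editor's Notes using explicit join syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE posting (
    idT INTEGER REFERENCES dict(idT),
    idS INTEGER REFERENCES series(idS),
    nb  INTEGER,
    tf  REAL,
    PRIMARY KEY (idT, idS)  -- assumption: one row per (term, series) pair
);
""")

# Sample rows copied from the slide.
conn.executemany("INSERT INTO series VALUES (?, ?, ?)",
                 [(12, "Lost", 540), (45, "Dexter", 125)])
conn.executemany("INSERT INTO dict VALUES (?, ?, ?)",
                 [(8, "plane", 1.25), (27, "killer", 2.87), (29, "crash", 3.07)])
conn.executemany("INSERT INTO posting VALUES (?, ?, ?, ?)",
                 [(27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16)])

# Terms of one series, ranked by tf*idf.
rows = conn.execute("""
    SELECT term, tf * idf AS score
    FROM posting p JOIN dict d ON p.idT = d.idT
    WHERE p.idS = (SELECT idS FROM series WHERE name = ?)
    ORDER BY score DESC, term
""", ("Dexter",)).fetchall()
print(rows)
```

With the sample rows, `killer` ranks ahead of `plane` for Dexter, since it is both more frequent in that series and rarer overall.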
Description of a TV Series
Lost ⋈
• Many surnames need to be filtered out
Retrieval of TV Series − queries with 1 term
survive ⋈
Importance of normalization
• Stargate Atlantis
nb/maxNb = 63/1116 = 0.05645
• Blade
nb/maxNb = 9/163 = 0.05521
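The effect of normalization on these two series can be checked with a short Python sketch; the counts are the ones given above, and the helper name `normalized_tf` is illustrative.

```python
def normalized_tf(nb, max_nb):
    """Term frequency normalized by the count of the series' most frequent term."""
    return nb / max_nb

# (nb, maxNb) for the term 'survive', as given on the slide.
counts = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

tf = {name: normalized_tf(nb, max_nb) for name, (nb, max_nb) in counts.items()}
for name, value in tf.items():
    print(name, round(value, 5))

# Raw counts differ by a factor of 7; normalized values are nearly equal.
raw_ratio = counts["Stargate Atlantis"][0] / counts["Blade"][0]
tf_ratio = tf["Stargate Atlantis"] / tf["Blade"]
print(round(raw_ratio, 2), round(tf_ratio, 2))
```

Dividing by maxNb keeps a long-running series from dominating the ranking merely because it has more dialogue.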
Retrieval of TV Series − queries with n terms
survive mulder ⋈
67 | The Vampire Diaries
  survive | tf = 0.028 | idf = 0.107 | tf * idf = 0.028 * 0.107 = 0.003
  mulder  | tf = 0.007 | idf = 3.977 | tf * idf = 0.007 * 3.977 = 0.028
  sum = 0.031
18 | X-Files
  survive | tf = 0.014 | idf = 0.107 | tf * idf = 0.014 * 0.107 = 0.001
  mulder  | tf = 1.000 | idf = 3.977 | tf * idf = 1.000 * 3.977 = 3.977
  sum = 3.978
⁞
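The score of a multi-term query is the sum of tf * idf over the query terms, as in the worked numbers above. The Python sketch below reproduces that arithmetic; the function name `rsv` borrows the column alias used in the Editor's Notes query, and the dictionaries are just a convenient in-memory layout for the example.

```python
def rsv(query_terms, tf, idf):
    """Sum of tf * idf over the query terms; terms missing from a series add 0."""
    return sum(tf.get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)

# Values copied from the slide.
idf = {"survive": 0.107, "mulder": 3.977}
tf_by_series = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}

query = ["survive", "mulder"]
scores = {name: rsv(query, tf, idf) for name, tf in tf_by_series.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

X-Files wins by a wide margin because `mulder` is both its top term (tf = 1.000) and globally rare (high idf).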
Similar to House?
Computing Similarities Among TV Series 1/2
⋈
$\mathrm{score}(A, B) = \dfrac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$
First, let’s compute the numerator $\sum_i A_i B_i$ where:
$A_i$ = tf*idf weight of term $i$ in House
$B_i$ = tf*idf weight of term $i$ in another TV series
Similar to House?
Computing Similarities Among TV Series 2/2
⋈
⋈
⋈
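The similarity computed by the query in the Editor's Notes divides the sum of weight products over common terms by the product of the two vector norms. A minimal Python sketch of the same computation follows; the tf*idf weights are hypothetical, chosen so the arithmetic is easy to follow.

```python
import math

def similarity(a, b):
    """Sum of products over common terms, divided by the product of the norms."""
    common = a.keys() & b.keys()
    numerator = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical tf*idf weights, for illustration only.
house = {"hospital": 3.0, "doctor": 4.0}
greys = {"hospital": 4.0, "doctor": 3.0}
dexter = {"killer": 5.0}

print(similarity(house, greys))   # (12 + 12) / (5 * 5)
print(similarity(house, dexter))  # no common term
```

Only common terms contribute to the numerator, which matches the self-join on `idT` in the SQL, while both norms run over all terms of each series.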
Thank you
http://www.irit.fr/~Guillaume.Cabanac

Editor's Notes

  • #22
      select term, tf*idf score
      from posting p, dict d
      where p.idT = d.idT
        and idS = (select idS from series where name = 'Lost')
      order by 2 desc, 1 ;
  • #23
      select name, term, nb, tf
      from posting p, series s, dict d
      where p.idS = s.idS
        and p.idT = d.idT
        and term = 'survive'
      order by tf desc, name ;
  • #24
      select name, sum(tf*idf) rsv
      from posting p, series s, dict d
      where p.idS = s.idS
        and p.idT = d.idT
        and term in ('survive', 'mulder')
      group by p.idS, name
      order by 2 desc, 1 ;
  • #25
      with numerator as (
        select pLost.idS idLostS, pOther.idS idOtherS,
               sum(pLost.tf*idf * pOther.tf*idf) numValue
        from posting pLost, posting pOther, dict d
        where pLost.idT = pOther.idT -- common terms
          and pLost.idT = d.idT -- for IDF
          and pLost.idS <> pOther.idS
          and pLost.idS = (select idS from series where name = 'House')
        group by pLost.idS, pOther.idS
      )
      select name,
             numValue / (
               sqrt((select sum(power(tf*idf, 2))
                     from posting p, dict d
                     where p.idT = d.idT and p.idS = n.idLostS))
               * sqrt((select sum(power(tf*idf, 2))
                       from posting p, dict d
                       where p.idT = d.idT and p.idS = n.idOtherS))) score
      from numerator n, series s
      where n.idOtherS = s.idS
      order by 2 desc, 1 ;
  • #26
      with numerator as (
        select pHouse.idS idHouseS, pOther.idS idOtherS,
               sum(pHouse.tf*idf * pOther.tf*idf) numValue
        from posting pHouse, posting pOther, dict d
        where pHouse.idT = pOther.idT -- common terms
          and pHouse.idT = d.idT -- for IDF
          and pHouse.idS <> pOther.idS
          and pHouse.idS = (select idS from series where name = 'House')
        group by pHouse.idS, pOther.idS
      )
      select name,
             numValue / (
               sqrt((select sum(power(tf*idf, 2))
                     from posting p, dict d
                     where p.idT = d.idT and p.idS = n.idHouseS))
               * sqrt((select sum(power(tf*idf, 2))
                       from posting p, dict d
                       where p.idT = d.idT and p.idS = n.idOtherS))) score
      from numerator n, series s
      where n.idOtherS = s.idS
      order by 2 desc, 1 ;