Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama
Guillaume Cabanac
guillaume.cabanac@univ-tlse3.fr
Toulouse: A Picture is Worth a Thousand Words
Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
Toulouse: population 437 000, students 97 000
Capbreton: 3h ride
Ax-les-Thermes: 1h40 ride
Collioure: 2h30 ride
Telly Addicts Need Help to Find TV Series
• Main topics of Grey’s Anatomy?
  • Text mining, visualization
• Series about ‘plane crash island’
  • Search engine
• What should I watch next?
  • Recommender system
Text Mining: Let’s Crunch Subtitles
Cold Case
Grey’s Anatomy
What’s in a Subtitle File?
• File name: Title – Season – Episode – Language.srt
• 1 episode = 1 plain text file
  • Synchronization: start --> stop
  • Dialogue
• We can easily extract words:
[ a, again*2, and, but, com, cuban,
different, favorite, food, for*2, forum,
going, great, happen*2, has, hungry, i*2,
is, it, love, m, my, nice, night*2, miami,
now, pork, s*2, sandwiches, something, the,
to*2, tonight, town, www ]
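The word extraction above can be sketched in a few lines of Python. This is only an illustration: the subtitle excerpt is made up, the helper name extract_words is hypothetical, and the original system is described as Java and Oracle, not Python.

```python
import re
from collections import Counter

# Hypothetical two-cue excerpt in SubRip (.srt) layout, made up for this sketch.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
I'm hungry again!
"""

# Matches the synchronization line "start --> stop".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return the lowercase words of the dialogue, skipping cue numbers and timings."""
    words = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        words.extend(re.findall(r"[a-z]+", line.lower()))
    return words

counts = Counter(extract_words(SAMPLE_SRT))
print(sorted(counts.items()))
```

Splitting on non-letters is why fragments such as "m" and "s" show up in the word list: "I'm" yields "i" and "m".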
DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle
DB technology at Work! [Search engine]
Ranked list of results
DB technology at Work! [Infos]
Most popular terms
Most related series
DB technology at Work! [Recommendations]
I liked / I disliked: what should I watch next?
Ranked list of recommendations
How Does this Work?
Architecture and Data Model
Offline: subtitles → indexing → DB
Online: DB → searching, browsing, recommending → GUI
Dict = { idT, term }
  idT  term
  8    plane
  27   killer
  29   crash
Posting = { idT*, idS*, nb }
  idT  idS  nb
  27   45   89
  8    45   3
  8    12   90
Posting.idT ⊆ Dict.idT, Posting.idS ⊆ Series.idS
Theory − Text Indexing Pipeline
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming: [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15) ...}
Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html
In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
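The four stages of the pipeline can be sketched as follows. The stopword list is a tiny illustrative assumption, and crude_stem is a toy suffix stripper standing in for Porter's stemmer; it is not the Porter algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "on"}

def crude_stem(word):
    """Toy suffix stripper used in place of Porter's stemmer (NOT the Porter algorithm)."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenization + lowercase
    kept = [t for t in tokens if t not in STOPWORDS]  # stopwords removal
    stems = [crude_stem(t) for t in kept]             # stemming
    return Counter(stems)                             # counting

index = index_terms("The plane crashed. The planes crashed on the island.")
print(index)
```

Here "crashed" and "planes" collapse onto "crash" and "plane", so each of these terms is counted twice.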
Theory − Similarity of Paired Series
• Dice’s Coefficient (1945)
  • Based on set theory
• Example: let us model a series as a set of terms
  House = {hospital, doctor, crazy, psycho}
  Grey’s = {doctor, care, hospital}
A Big Limitation
• The distribution of terms among series is ignored: it makes no difference whether a term occurs 1 time or 1,000,000 times
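As a worked example, Dice's coefficient of two sets A and B is 2 × |A ∩ B| / (|A| + |B|). The sketch below applies this standard definition to the two sets above; the function name dice is only illustrative.

```python
def dice(a, b):
    """Dice's coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Two shared terms (hospital, doctor): 2 * 2 / (4 + 3) = 4/7
print(round(dice(house, greys), 3))  # 0.571
```

Because sets only record presence, a term said once weighs as much as a term said in every episode.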
Theory − Vector Space Model, Term Weighting
Which series does the term ‘survive’ represent better?
• Raw TF: dexter > lost
• Normalization, TF / max(TF): dexter < lost
Theory − Best Match Retrieval
1 TV series = 1 vector with components 1, ..., 45, ..., 1467, ..., 6790, ..., n (one per term)
Now, we know how to:
• Find the most popular terms for a TV series
• Compute the similarity between TV series
• Find TV series matching a query
Theory − More on Term Weighting
1 TV series = 1 vector with components 1, ..., 45, ..., 1467, ..., 6790, ..., n
• All terms are supposed to be equally representative
  … but ‘survive’ is way more unusual than ‘people’
  ⇒ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
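The slides do not spell out the IDF formula in their text, so the sketch below assumes one common variant, idf(t) = ln(N / df(t)), where N is the number of series and df(t) the number of series containing t. The per-series vocabularies are hypothetical.

```python
import math

def idf(term, series_terms):
    """ln(N / df): N series in total, df of them containing the term (assumed variant)."""
    n = len(series_terms)
    df = sum(1 for terms in series_terms.values() if term in terms)
    if df == 0:
        raise ValueError(f"unknown term: {term}")
    return math.log(n / df)

# Hypothetical vocabulary per series, for illustration only.
series_terms = {
    "Lost": {"people", "survive", "island"},
    "Dexter": {"people", "killer"},
    "House": {"people", "doctor"},
    "Grey's Anatomy": {"people", "doctor"},
}

print(idf("people", series_terms))   # present in every series: ln(4/4) = 0.0
print(idf("survive", series_terms))  # present in one series: ln(4/1)
```

Under this assumption a term used everywhere gets weight 0, while a term confined to one series gets the highest weight.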
Theory − The Big Picture: TF*IDF
An important term for series S is frequent in S and globally unusual.
Theory … and Practice
Series = { idS, name, maxNb }
  idS  name    maxNb
  12   Lost    540
  45   Dexter  125
Dict = { idT, term, idf }
  idT  term    idf
  8    plane   1.25
  27   killer  2.87
  29   crash   3.07
Posting = { idT*, idS*, nb, tf }
  idT  idS  nb  tf
  27   45   89  0.71
  8    45   3   0.02
  8    12   90  0.16
Posting.idT ⊆ Dict.idT, Posting.idS ⊆ Series.idS
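A minimal sketch of this data model, using SQLite through Python's sqlite3 module rather than the Oracle database named in the deck. The column types and the rule tf = nb / maxNb are assumptions inferred from the sample rows (for example 89 / 125 = 0.71).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER REFERENCES Dict(idT),
                      idS INTEGER REFERENCES Series(idS),
                      nb  INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
INSERT INTO Series VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Dict   VALUES (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# Fill tf as nb / maxNb of the owning series (assumed rule, see lead-in).
con.execute("""
UPDATE Posting
SET tf = 1.0 * nb / (SELECT maxNb FROM Series s WHERE s.idS = Posting.idS)
""")

rows = con.execute("""
SELECT s.name, d.term, p.nb, ROUND(p.tf, 2)
FROM Posting p
JOIN Dict d   ON d.idT = p.idT
JOIN Series s ON s.idS = p.idS
ORDER BY s.name, d.term
""").fetchall()
for row in rows:
    print(row)
```

Rounding gives 0.17 for 90 / 540, whereas the sample row shows 0.16, which suggests the displayed values were truncated rather than rounded.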
Description of a TV Series
Lost ⋈
• Many surnames need to be filtered out
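The query behind this slide is not reproduced in the text. One plausible sketch joins Series, Posting, and Dict and ranks the terms of a series by tf × idf; this ranking rule is an assumption. Every row except 'plane' is hypothetical, and the helper top_terms is illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL);
INSERT INTO Series VALUES (12, 'Lost', 540);
-- Rows other than 'plane' are hypothetical values for illustration.
INSERT INTO Dict VALUES (8, 'plane', 1.25), (100, 'island', 2.0), (101, 'shephard', 4.0);
INSERT INTO Posting VALUES (8, 12, 90, 0.16), (100, 12, 270, 0.5), (101, 12, 162, 0.3);
""")

def top_terms(con, series_name, k=10):
    """Rank the terms of one series by tf * idf, highest first."""
    return con.execute("""
        SELECT d.term, p.tf * d.idf AS weight
        FROM Series s
        JOIN Posting p ON p.idS = s.idS
        JOIN Dict d    ON d.idT = p.idT
        WHERE s.name = ?
        ORDER BY weight DESC
        LIMIT ?
    """, (series_name, k)).fetchall()

ranked = top_terms(con, "Lost")
print(ranked)
```

In this made-up data the character surname ranks first because it is frequent in the series and rare elsewhere, which illustrates why such names need filtering.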
Retrieval of TV Series − queries with 1 term
Query: survive ⋈
Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521
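The arithmetic above can be reproduced directly. The function name normalized_tf is illustrative; the counts are the ones given on the slide.

```python
def normalized_tf(nb, max_nb):
    """Term count scaled by the count of the series' most frequent term."""
    return nb / max_nb

# (nb, maxNb) for the term 'survive', as given on the slide.
series = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

scores = {name: normalized_tf(nb, max_nb) for name, (nb, max_nb) in series.items()}
for name, value in scores.items():
    print(name, round(value, 5))
```

The raw counts differ by a factor of seven (63 against 9), yet the normalized values are nearly equal, so the two series end up close together in the ranking.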
Retrieval of TV Series − queries with n terms
Query: survive mulder ⋈
67 | The Vampire Diaries
  term     tf     idf    tf * idf
  survive  0.028  0.107  0.003
  mulder   0.007  3.977  0.028
  score                  0.031
18 | X-Files
  term     tf     idf    tf * idf
  survive  0.014  0.107  0.001
  mulder   1.000  3.977  3.977
  score                  3.978
⁞
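The scoring in this worked example, a sum of tf × idf over the query terms, can be sketched as below. The dictionaries copy the numbers from the slide; the function name score and the in-memory layout are illustrative and stand in for the SQL join.

```python
# tf per (series, term) and idf per term, copied from the worked example.
TF = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}
IDF = {"survive": 0.107, "mulder": 3.977}

def score(series, query_terms):
    """Sum of tf * idf over the query terms; absent terms contribute 0."""
    return sum(TF[series].get(t, 0.0) * IDF.get(t, 0.0) for t in query_terms)

query = ["survive", "mulder"]
ranking = sorted(TF, key=lambda s: score(s, query), reverse=True)
for s in ranking:
    print(s, round(score(s, query), 3))
```

X-Files comes first because 'mulder' is both maximal in that series (tf = 1.000) and rare overall (idf = 3.977).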
Similar to House?
Computing Similarities Among TV Series 1/2
First, let’s compute the numerator Σ Ai × Bi, where:
Ai = terms from House
Bi = terms from another TV series
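The full similarity formula is not reproduced in the text. A numerator of the form Σ Ai × Bi is what the cosine measure uses, so the sketch below assumes cosine similarity over term-weight vectors. The weights are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of two sparse term-weight vectors stored as dicts."""
    numerator = sum(w * b[t] for t, w in a.items() if t in b)  # sum of Ai * Bi
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical term weights, for illustration only.
house = {"hospital": 3.0, "doctor": 4.0}
greys = {"doctor": 4.0, "hospital": 3.0}
dexter = {"killer": 5.0}

print(cosine(house, greys))   # same weights on the same terms
print(cosine(house, dexter))  # no shared term
```

Only terms present in both series contribute to the numerator, which is why a join between the two series' postings is a natural way to compute it in SQL.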
Similar to House?
Computing Similarities Among TV Series 2/2
Thank you
http://www.irit.fr/~Guillaume.Cabanac

Searching and Recommending TV series with SQL

  • 1. Series-O-RamaSeries-O-Rama Search & Recommend TV series with SQLSearch & Recommend TV series with SQL http://bit.ly/series-o-ramahttp://bit.ly/series-o-rama Guillaume Cabanac guillaume.cabanac@univ-tlse3.fr
  • 2. Toulouse: A Picture is Worth a Thousand Words Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac 2 1 2 3 4 Capbreton 3h ride Toulouse population: 437 000 students: 97 000 Ax-les-Thermes 1h40 ride Collioure 2h30 ride
  • 3. en.wikipedia.org Telly Addicts Need Help to Find TV Series  Main Topics of Grey’s AnatomyGrey’s Anatomy?  Text mining, Visualization  Series about ‘plane crash islandplane crash island’  Search engine  What should I watch next?  Recommender system amazon.com → 3 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
  • 4. Text Mining: Let’s Crunch Subtitles 4  Main Topics of Grey’s AnatomyGrey’s Anatomy?  Text mining, Visualization  Series about ‘plane crash islandplane crash island’  Search engine  What should I watch next?  Recommender system Cold CaseCold Case GreyGrey’s Anatomy’s Anatomy Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
  • 5. What’s in a Subtitle File? 5  Title – Season – Episode – Language.srt  1 episode = 1 plain text file  Synchronization  start --> stop  Dialogue  We can easily extract words [ a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www ] Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
  • 6. 6 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac DB technology at Work! [Home] 7 527 files = 337 MB 100% Java and Oracle
  • 7. DB technology at Work! [Search engine] 7 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac Ranked list of results
  • 8. DB technology at Work! [Infos] 8 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac Most popular terms Most related series
  • 9. DB technology at Work! [Recommendations] 9 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
  • 10. DB technology at Work! [Recommendations] 10 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac I liked I disliked What should I watch next?
  • 11. DB technology at Work! [Recommendations] 11 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac Ranked list of recommendations
  • 12. How Does this Work? 12 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
  • 13. Architecture and Data Model 13 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac DB subtitles indexing searching browsing recommending GUI offline online Dict = { idT, term} 8 plane 27 killer 29 crash Posting = { idT*, idS*, nb} 27 45 89 8 45 3 8 12 90 ⊆ ⊆
  • 14. Theory − Text Indexing Pipeline 14 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac [the, plane, crashed, ..., planes, ..., is] [plane, crashed, ..., planes, ...] [plane, crash, ..., plane, ...] {(plane, 48), (crash, 15) ...} Tokenization + lowercase Stopwords removal Stemming PorterPorter’s Stemmer (1980)’s Stemmer (1980) http://qaa.ath.cx/porter_js_demo.html In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects … Counting
  • 15. Theory − Similarity of Paired Series 15 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac A Big Limitation  The distribution of terms among series is ignored It makes no difference that a term occurs 1 time or 1,000,000 times  Dice’s Coefficient (1945)  Based on the Set Theory  Example: Let us Model a Series as a Set of Terms House = {hospital, doctor, crazy, psycho} Grey’s = {doctor, care, hospital}
  • 16. Vocabulary Theory − Vector Space Model, Term Weighting 16 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac Raw TF dexter > lost max max  Normalization TF / max(TF) survive ? max max dexter < lost
  • 17. Theory − Best Match Retrieval 17 Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac 1 TV series = 1 vector 1 45 1467 6790 n Now, we know how to:  Find most popular termspopular terms for a TV series  Compute similaritysimilarity between TV series  Find TV series matching a querymatching a query
  • 18. Theory − More on Term Weighting
      1 TV series = 1 vector, indexed by terms 1, 45, 1467, 6790, ..., n
      So far, all terms are assumed to be equally representative, but ‘survive’ is far more unusual than ‘people’
      ⇒ ‘survive’ represents Lost better than ‘people’ does
      IDF: Inverse Document Frequency
  • 19. Theory − The Big Picture: TF*IDF
      An important term for series S is frequent in S and globally unusual.
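A minimal sketch of the weighting, under one loud assumption: the slides do not state which IDF formula is used, so the common variant ln(N / df) is assumed here (N = number of series, df = number of series containing the term). The figures in the example are made up for illustration.

```python
import math


def idf(n_series, df):
    """Assumed IDF variant: ln(N / df). The deck's exact formula is not given."""
    return math.log(n_series / df)


def tf_idf(tf, n_series, df):
    """Weight of a term in a series: normalized term frequency times IDF."""
    return tf * idf(n_series, df)


# Hypothetical figures: 100 series; 'people' occurs in all of them, 'survive' in 10.
w_people = tf_idf(0.50, n_series=100, df=100)   # IDF is 0, so the weight is 0
w_survive = tf_idf(0.10, n_series=100, df=10)   # rarer term, positive weight
```

Even with a lower TF, the globally unusual term ends up with the higher weight, which is the behaviour the slide describes.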
  • 20. Theory … and Practice
      Series = { idS, name, maxNb }
        12  Lost    540
        45  Dexter  125
      Dict = { idT, term, idf }
        8   plane   1.25
        27  killer  2.87
        29  crash   3.07
      Posting = { idT*, idS*, nb, tf }
        27  45  89  0.71
        8   45  3   0.02
        8   12  90  0.16
      Inclusion constraints (⊆): Posting.idT ⊆ Dict.idT and Posting.idS ⊆ Series.idS.
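To try the data model without an Oracle instance, the three tables can be loaded into an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite replaces the Oracle database of the talk, the column types and the composite primary key are guesses, and the helper `describe` is hypothetical. The rows are the ones shown on the slide, and the query follows the term-ranking query from the Editor's Notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table series  (idS integer primary key, name text, maxNb integer);
    create table dict    (idT integer primary key, term text, idf real);
    create table posting (idT integer references dict(idT),
                          idS integer references series(idS),
                          nb  integer,
                          tf  real,
                          primary key (idT, idS));  -- assumed key
    insert into series  values (12, 'Lost', 540), (45, 'Dexter', 125);
    insert into dict    values (8, 'plane', 1.25), (27, 'killer', 2.87), (29, 'crash', 3.07);
    insert into posting values (27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16);
""")


def describe(name):
    """Return (term, tf*idf) pairs for one series, best terms first."""
    rows = conn.execute(
        """
        select term, tf * idf as score
        from posting p, dict d
        where p.idT = d.idT
          and p.idS = (select idS from series where name = ?)
        order by score desc, term
        """,
        (name,),
    )
    return rows.fetchall()


dexter_terms = describe("Dexter")
```

With the slide's rows, 'killer' ranks above 'plane' for Dexter because both its tf and its idf are higher.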
  • 21. Description of a TV Series
      Example: Lost (⋈)
      Many surnames need to be filtered out
  • 22. Retrieval of TV Series − queries with 1 term
      Query: survive (⋈)
      Importance of normalization
        - Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
        - Blade: nb/maxNb = 9/163 = 0.05521
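The effect of normalization on these two series can be checked with a few lines of Python; the function name `normalized_tf` is an illustrative choice, and the counts are the ones quoted on the slide.

```python
def normalized_tf(nb, max_nb):
    """Term frequency divided by the count of the series' most frequent term."""
    return nb / max_nb


# (nb for 'survive', maxNb) as quoted on the slide
counts = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

raw_ratio = counts["Stargate Atlantis"][0] / counts["Blade"][0]  # 63 / 9 = 7.0
tf_atlantis = normalized_tf(*counts["Stargate Atlantis"])
tf_blade = normalized_tf(*counts["Blade"])
```

The raw counts differ by a factor of 7, yet the normalized values are almost equal, so the longer series no longer dominates simply because it has more dialogue.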
  • 23. Retrieval of TV Series − queries with n terms
      Query: survive mulder (⋈)
      idS | name                | term    | tf    | idf   | tf * idf
      67  | The Vampire Diaries | survive | 0.028 | 0.107 | 0.003
      67  | The Vampire Diaries | mulder  | 0.007 | 3.977 | 0.028
                                                     sum = 0.031
      18  | X-Files             | survive | 0.014 | 0.107 | 0.001
      18  | X-Files             | mulder  | 1.000 | 3.977 | 3.977
                                                     sum = 3.978
      ⁞
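The sums on this slide can be reproduced in Python by adding tf * idf over the query terms for each series. The dictionaries below simply hold the values printed on the slide; the function name `rsv` mirrors the column alias used in the Editor's Notes, and the dictionary layout is an illustrative assumption.

```python
# Values copied from the slide.
idf = {"survive": 0.107, "mulder": 3.977}
tf = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}


def rsv(series, query_terms):
    """Score of a series for a query: sum of tf * idf over the query terms."""
    return sum(tf[series].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)


query = ["survive", "mulder"]
ranking = sorted(tf, key=lambda s: rsv(s, query), reverse=True)
```

X-Files comes first: 'mulder' is both very frequent there (tf = 1.000) and globally rare (idf = 3.977), so it outweighs the common term 'survive'.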
  • 24. Similar to House? Computing Similarities Among TV Series 1/2
      First, let’s compute the numerator (⋈), the sum of the products Ai Bi over the common terms, where:
        Ai = weights of the terms from House
        Bi = weights of the terms from another TV series
  • 25. Similar to House? Computing Similarities Among TV Series 2/2
      Figure: the complete query, with three joins (⋈).
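The SQL in the Editor's Notes divides the sum of products over common terms by the product of two square roots of sums of squares, which is the cosine of the angle between two weight vectors. A minimal Python sketch of the same computation follows; the term weights in the example are hypothetical, not taken from the database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {term: tf*idf weight} dictionaries."""
    common = a.keys() & b.keys()
    numerator = sum(a[t] * b[t] for t in common)         # sum of Ai * Bi
    norm_a = math.sqrt(sum(w * w for w in a.values()))   # sqrt(sum of Ai^2)
    norm_b = math.sqrt(sum(w * w for w in b.values()))   # sqrt(sum of Bi^2)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no terms, nothing to compare
    return numerator / (norm_a * norm_b)


# Hypothetical weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "psycho": 0.3}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.4}
similarity = cosine(house, greys)
```

Unlike Dice's coefficient on sets, this score takes the weight of each shared term into account, so two series that merely share a few rare words score lower than two that share their dominant vocabulary.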

Editor's Notes

  1. select term, tf*idf score
     from posting p, dict d
     where p.idT = d.idT
       and idS = (select idS from series where name = 'Lost')
     order by 2 desc, 1 ;
  2. select name, term, nb, tf
     from posting p, series s, dict d
     where p.idS = s.idS
       and p.idT = d.idT
       and term = 'survive'
     order by tf desc, name ;
  3. select name, sum(tf*idf) rsv
     from posting p, series s, dict d
     where p.idS = s.idS
       and p.idT = d.idT
       and term in ('survive', 'mulder')
     group by p.idS, name
     order by 2 desc, 1 ;
  4. with numerator as (
       select pLost.idS idLostS, pOther.idS idOtherS,
              sum(pLost.tf*idf * pOther.tf*idf) numValue
       from posting pLost, posting pOther, dict d
       where pLost.idT = pOther.idT -- common terms
         and pLost.idT = d.idT -- for IDF
         and pLost.idS <> pOther.idS
         and pLost.idS = (select idS from series where name = 'House')
       group by pLost.idS, pOther.idS
     )
     select name,
            numValue / (
              sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idLostS))
            * sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idOtherS))) score
     from numerator n, series s
     where n.idOtherS = s.idS
     order by 2 desc, 1 ;
  5. with numerator as (
       select pHouse.idS idHouseS, pOther.idS idOtherS,
              sum(pHouse.tf*idf * pOther.tf*idf) numValue
       from posting pHouse, posting pOther, dict d
       where pHouse.idT = pOther.idT -- common terms
         and pHouse.idT = d.idT -- for IDF
         and pHouse.idS <> pOther.idS
         and pHouse.idS = (select idS from series where name = 'House')
       group by pHouse.idS, pOther.idS
     )
     select name,
            numValue / (
              sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idHouseS))
            * sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idOtherS))) score
     from numerator n, series s
     where n.idOtherS = s.idS
     order by 2 desc, 1 ;