Did you know? Mining Interesting Trivia for Entities from Wikipedia
Abhay Prakash (1), Manoj K. Chinnakotla (2), Dhaval Patel (1), Puneet Garg (2)
(1) Indian Institute of Technology Roorkee, India; (2) Microsoft, India
Did you know?
Dark Knight (2008): To prepare for Joker’s role, Heath Ledger lived alone in a hotel
room for a month, formulating the character’s posture, voice, and personality.
IJCAI-15: IJCAI-15 is the first IJCAI edition in South America, and the southernmost edition ever.
Argentina: In 2001, Argentina had 5 Presidents in 10 days!
Tom Hanks: Tom Hanks has an asteroid named after him: “12818 tomhanks”
What is Trivia?
• Definition: Trivia is any fact about an entity that is interesting because of any of the following characteristics:
  - Unusualness
  - Uniqueness
  - Unexpectedness
  - Weirdness
• But isn't interestingness subjective?
  - Yes!
  - For the current work, we take a majoritarian view of interestingness.
Why Trivia?
[Chart: Trivia Follow-on Engagement, plotted by date from 2/14/2015 to 3/14/2015; y-axis from 0 to 14]
• Helps draw user attention and improves user engagement with the experience
• Appeals to users' sense of novelty, curiosity and inquisitiveness
[Chart: Trivia Click Through, plotted by date from 2/14/2015 to 3/14/2015; y-axis from 0.00% to 0.80%]
Trivia Curation
Manual process: hard to scale across a large number of entities
Wikipedia Trivia Miner (WTM)
• Automatically mines trivia for entities from the unstructured text of Wikipedia
• Why Wikipedia?
  - Reliable for factual correctness
  - Ample number of interesting trivia (56/100 in the experiment)
• Learn a model of interestingness for the target domain
  - Use the interestingness model to rank sentences from Wikipedia
Interestingness Model
• Collect Ratings from Humans, then Train a Model
• Harness Publicly Available Sources, then Train a Model
System Architecture
[Diagram components]
Train Phase: Human Voted Trivia Source, Filtering & Grading, Train Dataset, Feature Extraction, SVMrank, Model
Retrieval Phase: Candidates' Source, Candidate Selection, Feature Extraction, SVMrank, Top-K Interesting Trivia from Candidates
Also shown: Knowledge Base
Training Phase: Learn Interestingness Model
[Diagram: Wikipedia Trivia Miner (WTM) system architecture, repeated, with components Human Voted Trivia Source, Filtering & Grading, Train Dataset, Feature Extraction, SVMrank, Interestingness Ranker, Knowledge Base, Candidates' Source, Candidate Selection, and Top-K Interesting Trivia from Candidates]
Filtering & Grading
• Crawled trivia from IMDB
  - Top 5K movies, 99K trivia in total
• Filtered out facts with lower reliability
  - Number of votes < 5
• Likeness Ratio: $\mathrm{L.R.} = \dfrac{\#\ \text{of Interesting Votes}}{\#\ \text{of Total Votes}}$
• Convert this skewed distribution into grades
[Screenshot: sample trivia for the movie 'Batman Begins', taken from IMDB]
[Chart: % coverage (y-axis, 0 to 40) by Likeness Ratio (x-axis). Bar values in chart order: 39.56, 30.33, 17.08, 4.88, 3.57, 1.74, 1.06, 0.65, 0.6, 0.33, 0.21]
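The filtering and Likeness Ratio steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields (`interesting_votes`, `total_votes`) and the sample records are hypothetical, and only the vote threshold of 5 and the ratio definition come from the slide.

```python
def likeness_ratio(interesting_votes, total_votes):
    """L.R. = (# of interesting votes) / (# of total votes)."""
    return interesting_votes / total_votes


def filter_and_score(trivia, min_votes=5):
    """Drop trivia with fewer than min_votes total votes and attach the L.R."""
    kept = []
    for item in trivia:
        if item["total_votes"] < min_votes:
            continue  # too few votes to be reliable
        scored = dict(item)
        scored["lr"] = likeness_ratio(item["interesting_votes"], item["total_votes"])
        kept.append(scored)
    return kept


# Hypothetical records, for illustration only.
sample = [
    {"text": "trivia A", "interesting_votes": 90, "total_votes": 120},
    {"text": "trivia B", "interesting_votes": 2, "total_votes": 3},
    {"text": "trivia C", "interesting_votes": 1, "total_votes": 10},
]
scored = filter_and_score(sample)
```

Here "trivia B" is dropped because it has only 3 votes, while the other two records keep their computed ratios.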
Filtering & Grading (Contd.)
• High support for high L.R.
  - For L.R. > 0.6, # of votes >= 100
• Graded by percentile cutoff to get 5 grades
  - [90, 100], [75, 90), [25, 75), [10, 25), [0, 10)
• 6163 samples from 846 movies
[Chart: frequency of each trivia grade; y-axis from 0 to 3500]
Trivia Grade | Frequency
4 (Very Interesting) | 706
3 (Interesting) | 1091
2 (Ambiguous) | 2880
1 (Boring) | 945
0 (Very Boring) | 541
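The percentile-cutoff grading can be sketched as below. The cutoffs are the ones listed on the slide; the percentile-rank definition (share of samples with a strictly smaller L.R.) is an assumption, since the slides do not say how ties or ranks were handled.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Percentile rank of each value: % of samples strictly below it (assumed definition)."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_left(ordered, v) / n for v in values]


def grade_from_percentile(pct):
    """Map a percentile rank to the five grades: 4 = Very Interesting ... 0 = Very Boring."""
    if pct >= 90:
        return 4
    if pct >= 75:
        return 3
    if pct >= 25:
        return 2
    if pct >= 10:
        return 1
    return 0


def grade_trivia(likeness_ratios):
    return [grade_from_percentile(p) for p in percentile_ranks(likeness_ratios)]


# Illustrative input: 20 evenly spaced likeness ratios.
grades = grade_trivia([i / 20 for i in range(20)])
```

With evenly spaced inputs, half of the samples land in grade 2, which mirrors the large "Ambiguous" bucket in the frequency chart.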
Feature Engineering
Bucket | Feature | Significance | Sample features | Example Trivia
Unigram (U) Features | Each word's TF-IDF | Identify important words which make the trivia interesting | “stunt”, “award”, “improvise” | “Tom Cruise did all of his own stunt driving.”
Linguistic (L) Features | Superlative Words | Shows the extremeness (uniqueness) | “best”, “longest”, “first” | “The longest animated Disney film since Fantasia (1940).”
Linguistic (L) Features | Contradictory Words | Opposing ideas could spark intrigue and interest | “but”, “although”, “unlike” | “The studios wanted Matthew McConaughey for lead role, but James Cameron insisted on Leonardo DiCaprio.”
Linguistic (L) Features | Root Word (Main Verb) | Captures core activity being discussed in the sentence | root_gross | “Gravity grossed $274 Mn in North America”
Linguistic (L) Features | Subject Word (First Noun) | Captures core thing being discussed in the sentence | subj_actor | “The actors snorted crushed B vitamins for scenes involving cocaine”
Linguistic (L) Features | Readability | Complex and lengthy trivia are hardly interesting | FOG Index binned in 3 bins | ---
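As a rough illustration of two of the linguistic features above, the sketch below flags superlative and contradictory words. It is a simplification under stated assumptions: the word lists contain only the sample words from the slide, and a fixed word list stands in for the POS-based superlative detection that the feature name superPOS suggests. The function and feature names are hypothetical.

```python
import re

# Only the sample words from the slide; real lists would be larger (assumption).
SUPERLATIVE_WORDS = {"best", "longest", "first"}
CONTRADICTORY_WORDS = {"but", "although", "unlike"}


def linguistic_flags(sentence):
    """Count superlative words and detect contradictory words in a sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        "superlative_count": sum(t in SUPERLATIVE_WORDS for t in tokens),
        "has_contradiction": any(t in CONTRADICTORY_WORDS for t in tokens),
    }


disney = linguistic_flags("The longest animated Disney film since Fantasia (1940).")
casting = linguistic_flags(
    "The studios wanted Matthew McConaughey for lead role, "
    "but James Cameron insisted on Leonardo DiCaprio."
)
```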
Feature Engineering (Contd.)
Bucket | Feature | Significance | Sample features | Example Trivia
Entity (E) Features | Generic NEs | Captures general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.” (ORGANIZATION and LOCATION)
Entity (E) Features | Related Entities | Captures specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.” (entity_producer, entity_character)
Entity (E) Features | Entity Linking before (L) Parsing | Captures generalized story of sentence | subj_entity_producer | [The same trivia above] “According to entity_producer, …”; subj_Victoria → subj_entity_producer
Entity (E) Features | Focus Entities | Captures core entities being talked about | underroot_entity_producer | [The same trivia above] underroot_entity_producer, underroot_entity_character
Domain Independence of Features
• All the features are automatically generated and domain-independent
• Entity features are automatically generated using attribute:value pairs in DBpedia
  - For a match of ‘value’ in a sentence, the match is replaced by entity_‘attribute’
• Unigram (U) and Linguistic (L) features are clearly domain-independent
[Figure: DBpedia (attribute: value) pairs for Batman Begins, alongside sample trivia for Batman Begins]
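The replacement rule described above (a matched ‘value’ becomes entity_‘attribute’) might look like the following sketch. The attribute:value pairs are written out by hand to match the labels used in the earlier example trivia; a real pipeline would read them from DBpedia, and the function name `link_entities` is hypothetical.

```python
import re


def link_entities(sentence, pairs):
    """Replace each matched attribute value with the token entity_<attribute>.

    pairs is a list of (attribute, value) tuples. Longer values are replaced
    first so that a short value cannot break up a longer one.
    """
    for attribute, value in sorted(pairs, key=lambda p: len(p[1]), reverse=True):
        pattern = re.compile(re.escape(value), flags=re.IGNORECASE)
        sentence = pattern.sub("entity_" + attribute, sentence)
    return sentence


# Hand-written pairs mirroring the labels in the example trivia (illustrative only).
pairs = [
    ("producer", "Victoria Alonso"),
    ("character", "Rocket Raccoon"),
    ("character", "Groot"),
]
linked = link_entities(
    "According to Victoria Alonso, Rocket Raccoon and Groot were created "
    "through a mix of motion-capture and rotomation VFX.",
    pairs,
)
```

Running the parser after this substitution is what lets a subject feature generalize from a specific name to subj_entity_producer.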
Interestingness Ranking Model
• Given facts (sentences) along with their interestingness grades, learn a model of interestingness that ranks sentences by their interestingness
• Uses a Rank SVM model

INPUT FOR TRAINING
MOVIE_ID | FEATURES | GRADE
1 | 1:1 5:2 … | 4
1 | … | 2
1 | … | 1
2 | … | 4
2 | … | 3
2 | … | 1
2 | … | 1

MODEL BUILT (Hyperplane)
[Figure: image taken and modified from Wikipedia]

INPUT FOR RANKING and OUTPUT OF RANKING MODEL
MOVIE_ID | FEATURES | SCORE
1 | 1:1 5:2 … | 1.7
1 | … | 2.4
2 | … | 1.2
2 | … | 2.7
2 | … | 0.13
3 | … | 3.1
3 | … | 1.3
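The training table above maps naturally onto the text format read by the SVMrank tool, where each line is a target grade, a `qid:` group identifier, and sparse `index:value` features in increasing index order. The sketch below writes such lines and then picks the top-k sentences per movie from model scores. The helper names are hypothetical, the sentence labels are placeholders, and the scores are the illustrative ones from the slide.

```python
from collections import defaultdict


def to_svmrank_line(grade, movie_id, features):
    """Format one training example; features maps feature index -> value."""
    feats = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{grade} qid:{movie_id} {feats}"


def top_k_per_movie(rows, k):
    """rows: (movie_id, sentence, score) tuples. Return the k best sentences per movie."""
    by_movie = defaultdict(list)
    for movie_id, sentence, score in rows:
        by_movie[movie_id].append((score, sentence))
    return {
        movie_id: [s for _, s in sorted(items, key=lambda p: p[0], reverse=True)[:k]]
        for movie_id, items in by_movie.items()
    }


line = to_svmrank_line(4, 1, {5: 2, 1: 1})

# Scores from the slide; sentence labels "a".."g" are placeholders.
rows = [(1, "a", 1.7), (1, "b", 2.4), (2, "c", 1.2), (2, "d", 2.7),
        (2, "e", 0.13), (3, "f", 3.1), (3, "g", 1.3)]
best = top_k_per_movie(rows, 1)
```

Grouping by movie matters because a ranking SVM only compares sentences within the same query group, so scores are not meant to be compared across movies.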
Interestingness Model: Cross Validation Results
[Chart: NDCG@10 by feature group; y-axis from 0.9 to 0.96]
Feature Group | NDCG@10
Unigram (U) | 0.934
Linguistic (L) | 0.919
Entity Features (E) | 0.929
U + L | 0.9419
U + E | 0.944
WTM (U + L + E) | 0.951
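For reference, NDCG@k can be computed as in the sketch below. The slides do not state which gain and discount variant was used, so the common form with gain 2^grade - 1 and a log2 discount is assumed here.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades (assumed 2^g - 1 gain)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k):
    """grades: true grades listed in the order the model ranked the items."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([4, 3, 2, 1, 0], 10)
reversed_order = ndcg_at_k([0, 1, 2, 3, 4], 10)
```

A perfectly ordered list scores 1.0, and any misordering of graded items lowers the score.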
Interestingness Model: Feature Weights
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | superPOS | Linguistic
17 | subj_actor | Linguistic
Note: Entity linking leads to better generalization; without it, these features would have been entity-specific ones such as subj_wolverine.
Retrieval Phase: Get Trivia from Wikipedia Page
[Diagram: Wikipedia Trivia Miner (WTM) system architecture, repeated, with components Human Voted Trivia Source, Filtering & Grading, Train Dataset, Feature Extraction, SVMrank, Interestingness Ranker, Knowledge Base, Candidates' Source, Candidate Selection, and Top-K Interesting Trivia from Candidates]
Candidate Selection
• Sentence extraction
  - Crawled only the text inside paragraph tags <p>…</p>
  - Sentence detection: each sentence is taken for further processing
• Removed sentences with missing context
  - E.g. “It really reminds me of my childhood.”
  - Co-reference resolution finds links to other sentences
  - A sentence is removed if its out-link does not point to the target entity
  - Example: “Hanks revealed that he signed onto the film after an hour and a half of reading the script. He initially ...”
  - The first ‘he’ is not an out-link, and ‘the film’ points to the target entity. The second ‘He’ is an out-link. The first sentence is kept; the second is removed.
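The sentence-extraction step can be sketched with the Python standard library as below. This is only an approximation under stated assumptions: the regex sentence splitter is a naive stand-in for a proper sentence detector, the co-reference filter is not implemented, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p>...</p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = " ".join("".join(self._buf).split())  # normalize whitespace
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_sentences(html):
    """Return candidate sentences from paragraph text (naive punctuation split)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    sentences = []
    for paragraph in parser.paragraphs:
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s)
    return sentences


page = '<h1>Title</h1><p>First <a href="#">linked</a> sentence here. Second one follows.</p><div>Menu</div>'
sentences = extract_sentences(page)
```

Text in headings and navigation blocks is skipped, while link text inside a paragraph is kept as part of its sentence.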
Evaluation Dataset
• 20 new movie pages from Wikipedia
  - No. of sentences: 2928
  - No. of positive sentences: 791
• Judged (crowd-sourced) by 5 judges
  - Two-point voting scale: Boring / Interesting
  - Majority voting for the class rating
• Statistically significant?
  - 100 trivia from IMDB were also judged by 5 judges
  - Mechanism I: majority voting of the IMDB crowd vs. Mechanism II: crowd-sourced by 5 judges
  - Agreement between the two mechanisms = Substantial (Kappa value = 0.618)

Kappa | Agreement
< 0 | Less than chance agreement
0.01-0.20 | Slight agreement
0.21-0.40 | Fair agreement
0.41-0.60 | Moderate agreement
0.61-0.80 | Substantial agreement
0.81-0.99 | Almost perfect agreement
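The majority vote and the agreement check between the two labeling mechanisms can be sketched as follows. The slides do not say which kappa statistic was used; Cohen's kappa for two label sequences is assumed here, and the vote lists are made-up examples.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among an odd number of judge votes."""
    return Counter(votes).most_common(1)[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Made-up votes, for illustration only.
label = majority_vote(["Interesting", "Boring", "Interesting", "Interesting", "Boring"])
same = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])
chance = cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two sequences that match half the time on balanced labels score 0 rather than 0.5.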
Comparative Baselines
I. Random [Baseline I]:
- 10 sentences picked randomly from Wikipedia
II. CS + Random:
- Candidates selected
- Removes sentences like “It really reminds me of my childhood”
III. CS + supPOS (Best) [Baseline II]:
- Candidates selected
- Ranked by number of superlative words

supPOS (Best Case)
Rank | # of sup. words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
Results: Precision@10
[Chart: P@10 by approach; y-axis from 0 to 0.5]
Approach | P@10
Random | 0.25
CS+Random | 0.3
supPOS (Best Case) | 0.34
WTM (U) | 0.34
WTM (U+L+E) | 0.45
Results: Precision@10
• CS+Random > Random
  - Shows the significance of Candidate Selection
• WTM (U+L+E) >> WTM (U)
  - Shows the significance of the engineered Linguistic (L) and Entity (E) features
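For readers unfamiliar with the metric, Precision@k is the fraction of the top k ranked sentences that were judged interesting. A minimal sketch follows, using the class column of the supPOS (Best Case) ranking example from the baselines slide as input (1 = Interesting, 0 = Boring); the function name is hypothetical.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked items labeled interesting (1) rather than boring (0)."""
    top = ranked_labels[:k]
    return sum(top) / k


# Classes of ranks 1..7 in the supPOS (Best Case) example table.
suppos_example = [1, 0, 1, 1, 1, 0, 0]
p_at_5 = precision_at_k(suppos_example, 5)
```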
Results: Recall@K
• supPOS is limited to one kind of trivia
• WTM captures varied types
  - 62% recall by rank 25
• Performance comparison
  - supPOS is better until rank 3
  - Soon after rank 3, WTM beats supPOS
[Chart: % recall (y-axis, 0 to 70) by rank (x-axis, 0 to 25) for SuperPOS (Best Case), WTM, and Random]
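Recall@K, as plotted above, is the share of all interesting sentences for an entity that appear within the top K ranks. A minimal sketch, with a made-up label list (1 = Interesting, 0 = Boring) and a hypothetical function name:

```python
def recall_at_k(ranked_labels, k):
    """Share of all interesting items (label 1) that appear in the top k ranks."""
    total_interesting = sum(ranked_labels)
    if total_interesting == 0:
        return 0.0
    return sum(ranked_labels[:k]) / total_interesting


# Made-up ranking, for illustration only.
ranking = [1, 0, 1, 1, 1, 0, 0]
r_at_3 = recall_at_k(ranking, 3)
```

Unlike precision, recall can only grow as K increases, which is why the curves in the chart rise with rank.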
Qualitative Analysis
Result | Movie | Trivia | Description
WTM Wins (Sup. POS Misses) | Interstellar (2014) | Paramount is providing a virtual reality walkthrough of the Endurance spacecraft using Oculus Rift technology. | Due to Organization being subject, and (U) features (technology, reality, virtual)
WTM Wins (Sup. POS Misses) | Gravity (2013) | When the script was finalized, Cuarón assumed it would take about a year to complete the film, but it took four and a half years. | Due to Entity.Director, Subject (the script), Root word (assume) and (U) features (film, years)
WTM's Bad | Elf (2003) | Stop motion animation was also used. | Candidate Selection failed
WTM's Bad | Rio 2 (2014) | Rio 2 received mixed reviews from critics. | Root verb "receive" has high weightage in model
Qualitative Analysis (Contd.)
Result | Movie | Trivia | Description
Sup. POS Wins (WTM misses) | The Incredibles (2004) | Humans are widely considered to be the most difficult thing to execute in animation. | Presence of ‘most’, absence of any Entity, vague Root word (consider)
Sup. POS's Bad | Lone Survivor (2013) | Most critics praised Berg's direction, as well as the acting, story, visuals and battle sequences. | Here 'most' is not to show degree but instead to show generality.
Our Contributions
• Introduced a novel research problem
  - Mining interesting facts for entities from unstructured text
• Proposed a novel approach, “Wikipedia Trivia Miner (WTM)”
  - For mining the top-k interesting trivia for movie entities, ranked by their interestingness
• For movie entities, we leverage already available user-generated trivia data from IMDB for learning interestingness
All the data and code used in this paper have been made publicly available for research purposes at
https://github.com/abhayprakash/WikipediaTriviaMiner_SharedResources/
Acknowledgements
The first author's travel was supported by travel grants from Xerox Research Centre India, IIT Roorkee, IJCAI and Microsoft Research India.
IJCAI 2015 Presentation: Did you know?- Mining Interesting Trivia for Entities from Wikipedia

  • 1. Did you know?- Mining Interesting Trivia for Entities from Wikipedia Abhay Prakash1, Manoj K. Chinnakotla2, Dhaval Patel1, Puneet Garg2 1Indian Institute of Technology Roorkee, India 2Microsoft, India
  • 2. Did you know? Dark Knight (2008): To prepare for Joker’s role, Heath Ledger lived alone in a hotel room for a month, formulating the character’s posture, voice, and personality. IJCAI-15: IJCAI-15 is the first IJCAI edition in South America, and the southern most edition ever. Argentina: In 2001, Argentina had 5 Presidents in 10 days! Tom Hanks: Tom Hanks has an asteroid named after him: “12818 tomhanks”
  • 3. What is a Trivia?  Definition: Trivia is any fact about an entity which is interesting due to any of the following characteristics  Unusualness  Uniqueness  Unexpectedness  Weirdness  But, Isn’t interestingness subjective?  Yes!  For the current work, we take a majoritarian view for interestingness
  • 4. Why Trivia? 0 2 4 6 8 10 12 14 2/14/2015 2/16/2015 2/18/2015 2/20/2015 2/22/2015 2/24/2015 2/26/2015 2/28/2015 3/2/2015 3/4/2015 3/6/2015 3/8/2015 3/10/2015 3/12/2015 3/14/2015 Trivia Follow-on Engagement • Helps in drawing user attention and improves user engagement with the experience • Appeals to their sense of appreciating novelty, curiosity and inquisitiveness 0.00% 0.10% 0.20% 0.30% 0.40% 0.50% 0.60% 0.70% 0.80% 2/14/2015 2/16/2015 2/18/2015 2/20/2015 2/22/2015 2/24/2015 2/26/2015 2/28/2015 3/2/2015 3/4/2015 3/6/2015 3/8/2015 3/10/2015 3/12/2015 3/14/2015 Trivia Click Through Trivia Curation Manual process – Hard to scale across large number of entities
  • 5. Wikipedia Trivia Miner (WTM)  Automatically mine trivia for entities from unstructured text of Wikipedia  Why Wikipedia?  Reliable for factual correctness  Ample # of interesting trivia (56/100 in expt.)  Learn a model of interestingness for target domain Use the interestingness model to rank sentences from Wikipedia
  • 6. Interestingness Model Collect Ratings from Humans Train a Model Harness Publicly Available Sources Train a Model
  • 7. Candidate Selection Candidates’ Source Top-K Interesting Trivia from Candidates Feature ExtractionSVMrank Knowledge Base Retrieval Phase Human Voted Trivia Source Train Dataset Filtering & Grading Feature Extraction SVMrank Train Phase Model System Architecture
  • 8. Train Phase: Learn Interestingness Model. The Wikipedia Trivia Miner (WTM) architecture diagram is repeated with the Train Phase highlighted: Human Voted Trivia Source → Filtering & Grading → Train Dataset → Feature Extraction → SVMrank (Interestingness Ranker).
  • 9. Filtering & Grading
      Crawled trivia from IMDB: top 5K movies, 99K trivia in total.
      Filter facts with lesser reliability: number of votes < 5.
      Likeness Ratio (L.R.) = (# of interesting votes) / (# of total votes)
      Convert this skewed distribution into grades.
      [Screenshot: sample trivia for the movie 'Batman Begins', taken from IMDB]
      [Figure: % coverage (y-axis, 0 to 40) vs. Likeness Ratio; bar values as extracted: 39.56, 30.33, 17.08, 4.88, 3.57, 1.74, 1.06, 0.65, 0.6, 0.33, 0.21]
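The vote filter and the Likeness Ratio from slide 9 can be sketched as follows. This is a minimal illustration, not the authors' code: the `Trivia` record, its field names, and the sample rows are assumptions made up for the example; only the threshold of 5 votes and the ratio itself come from the slide.

```python
from dataclasses import dataclass

MIN_VOTES = 5  # slide 9: facts with fewer than 5 votes are filtered out


@dataclass
class Trivia:
    # Hypothetical record layout; field names are not from the slides.
    text: str
    interesting_votes: int
    total_votes: int


def likeness_ratio(t: Trivia) -> float:
    """L.R. = (# of interesting votes) / (# of total votes)."""
    return t.interesting_votes / t.total_votes


def filter_reliable(items: list[Trivia]) -> list[Trivia]:
    """Keep only trivia with at least MIN_VOTES total votes."""
    return [t for t in items if t.total_votes >= MIN_VOTES]


# Made-up sample data, for illustration only.
sample = [
    Trivia("made-up fact A", 40, 50),
    Trivia("made-up fact B", 1, 3),   # dropped: fewer than 5 votes
    Trivia("made-up fact C", 2, 10),
]
kept = filter_reliable(sample)
ratios = [likeness_ratio(t) for t in kept]
```

Filtering before computing the ratio also avoids dividing by zero for facts that received no votes.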
  • 10. Filtering & Grading (Contd.)
      High support for high L.R.: for L.R. > 0.6, # of votes >= 100.
      Graded by percentile cutoff to get 5 grades: [90, 100], [75, 90), [25, 75), [10, 25), [0, 10).
      6163 samples from 846 movies.
      [Figure: frequency by trivia grade] 4 (Very Interesting): 706; 3 (Interesting): 1091; 2 (Ambiguous): 2880; 1 (Boring): 945; 0 (Very Boring): 541.
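A rough sketch of the percentile-cutoff grading on slide 10. The slide gives only the five percentile bands; how percentile rank is computed (here: the percentage of samples strictly below a value) is an assumption, and the input ratios are made up.

```python
from bisect import bisect_left

# (lower percentile bound, grade), checked from the top grade down.
# Bands from slide 10: [90,100] -> 4, [75,90) -> 3, [25,75) -> 2, [10,25) -> 1, [0,10) -> 0.
CUTOFFS = [(90, 4), (75, 3), (25, 2), (10, 1), (0, 0)]


def percentile_ranks(values):
    """Percent of samples strictly below each value (one possible convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_left(ordered, v) / n for v in values]


def grade(percentile):
    """Map a percentile rank in [0, 100] to a grade from 0 to 4."""
    for lower, g in CUTOFFS:
        if percentile >= lower:
            return g
    return 0


# Made-up likeness ratios 0.00, 0.05, ..., 0.95, for illustration only.
ratios = [i / 20 for i in range(20)]
grades = [grade(p) for p in percentile_ranks(ratios)]
```

With evenly spread input, half of the samples land in grade 2, which mirrors the shape of the grade histogram on the slide (2880 of 6163 samples are "Ambiguous").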
  • 11. Feature Engineering (per feature: significance; sample features; example trivia)
      Unigram (U) Features
        Each word’s TF-IDF: identifies important words which make the trivia interesting; “stunt”, “award”, “improvise”; “Tom Cruise did all of his own stunt driving.”
      Linguistic (L) Features
        Superlative Words: show the extremeness (uniqueness); “best”, “longest”, “first”; “The longest animated Disney film since Fantasia (1940).”
        Contradictory Words: opposing ideas could spark intrigue and interest; “but”, “although”, “unlike”; “The studios wanted Matthew McConaughey for lead role, but James Cameron insisted on Leonardo DiCaprio.”
        Root Word (Main Verb): captures the core activity being discussed in the sentence; root_gross; “Gravity grossed $274 Mn in North America”
        Subject Word (First Noun): captures the core thing being discussed in the sentence; subj_actor; “The actors snorted crushed B vitamins for scenes involving cocaine”
        Readability: complex and lengthy trivia are hardly interesting; FOG Index binned in 3 bins; (no example)
  • 12. Feature Engineering (Contd.) (per feature: significance; sample features; example trivia)
      Entity (E) Features
        Generic NEs: capture general about-ness; MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION; “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.” (ORGANIZATION and LOCATION)
        Related Entities: capture specific about-ness (entities resolved using DBpedia); entity_producer, entity_director; “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.” (entity_producer, entity_character)
        Entity Linking before (L) Parsing: captures the generalized story of the sentence; subj_entity_producer; [the same trivia as above] “According to entity_producer, …” (subj_Victoria → subj_entity_producer)
        Focus Entities: capture the core entities being talked about; underroot_entity_producer; [the same trivia as above] (underroot_entity_producer, underroot_entity_character)
  • 13. Domain Independence of Features. All the features are automatically generated and domain-independent. Entity features are automatically generated using attribute:value pairs in DBpedia: for a match of ‘value’ in a sentence, the match is replaced by entity_‘attribute’. Unigram (U) and Linguistic (L) features are clearly domain-independent. [Figure: DBpedia (attribute: value) pairs for Batman Begins] [Figure: Sample Trivia (Batman Begins)]
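The value-to-attribute replacement described on slide 13 could look roughly like this. It is a sketch under simplifying assumptions: the `link_entities` helper is hypothetical, each attribute has a single value, matching is exact string matching, and the attribute:value pairs are typed in by hand rather than fetched from DBpedia. The producer/character labels follow the example on slide 12.

```python
import re


def link_entities(sentence, attributes):
    """Replace each matched value with the placeholder entity_<attribute>."""
    # Longest values first, so a longer name is not broken up by a shorter one.
    pairs = sorted(attributes.items(), key=lambda kv: len(kv[1]), reverse=True)
    for attribute, value in pairs:
        pattern = r"\b" + re.escape(value) + r"\b"
        sentence = re.sub(pattern, "entity_" + attribute, sentence)
    return sentence


# Hand-written stand-in for the DBpedia attribute:value pairs of one movie.
attrs = {"producer": "Victoria Alonso", "character": "Rocket Raccoon"}
linked = link_entities(
    "According to Victoria Alonso, Rocket Raccoon and Groot were created.",
    attrs,
)
```

Because the sentence is rewritten before parsing, downstream features see `entity_producer` instead of a specific name, which is what lets a feature such as subj_entity_producer generalize across movies.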
  • 14. Interestingness Ranking Model. Given facts (sentences) along with their interestingness grade, learn a model of interestingness which will rank sentences based on their interestingness. Use a Rank SVM model.
      INPUT FOR TRAINING (MOVIE_ID, FEATURES, GRADE): (1, 1:1 5:2 …, 4); (1, …, 2); (1, …, 1); (2, …, 4); (2, …, 3); (2, …, 1); (2, …, 1)
      MODEL BUILT (Hyperplane) [image taken and modified from Wikipedia]
      INPUT FOR RANKING (MOVIE_ID, FEATURES): (1, 1:1 5:2 …); (1, …); (2, …); (2, …); (2, …); (3, …); (3, …)
      OUTPUT OF RANKING (SCORE, one per input row): 1.7, 2.4, 1.2, 2.7, 0.13, 3.1, 1.3
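The training rows on slide 14 (movie ID, sparse features, grade) resemble the text input format of the SVMrank tool named in the architecture slides, where each line reads `<target> qid:<query id> <index>:<value> ...`. The sketch below formats rows that way, treating the movie ID as the query ID. That mapping, the `to_svmrank_line` helper, and the sample rows are assumptions for illustration; the slides do not show the actual files.

```python
def to_svmrank_line(grade, movie_id, features):
    """Format one sample as '<grade> qid:<movie_id> <index>:<value> ...'."""
    # Feature indices are written in increasing order.
    feats = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{grade} qid:{movie_id} {feats}"


# Made-up rows: (grade, movie_id, {feature index: value}).
rows = [
    (4, 1, {5: 2, 1: 1}),
    (2, 1, {5: 1}),
    (4, 2, {1: 2}),
]
lines = [to_svmrank_line(*r) for r in rows]
```

Grouping by movie ID means grades are only compared between trivia of the same movie, so the ranker learns relative order within a movie rather than an absolute score across movies.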
  • 15. Interestingness Model: Cross Validation Results. [Figure: NDCG@10 by feature group, y-axis 0.9 to 0.96] Unigram (U): 0.934; Linguistic (L): 0.919; Entity Features (E): 0.929; U + L: 0.9419; U + E: 0.944; WTM (U + L + E): 0.951.
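For readers unfamiliar with the metric on slide 15, here is a minimal sketch of NDCG@10 over graded relevance. The slide does not state which gain and discount were used; the common exponential gain (2^grade - 1) with a log2 discount is assumed here.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain of the first k grades, in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades_in_ranked_order, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(grades_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant item at all
    return dcg_at_k(grades_in_ranked_order, k) / ideal


# Grades (0 to 4) of one movie's trivia, listed in the order a model ranked them.
perfect = ndcg_at_k([4, 3, 2, 1, 0])
reversed_order = ndcg_at_k([0, 1, 2, 3, 4])
```

A ranking that places higher grades first scores 1.0, and any other order scores lower, so the metric rewards getting the most interesting trivia into the top positions.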
  • 16. Interestingness Model: Feature Weights (Rank: Feature, Group)
      1: subj_scene, Linguistic
      2: subj_entity_cast, Linguistic + Entity
      3: entity_produced_by, Entity
      4: underroot_unlinked_organization, Linguistic + Entity
      6: root_improvise, Linguistic
      7: entity_character, Entity
      8: MONEY, Entity (NER)
      14: stunt, Unigram
      16: superPOS, Linguistic
      17: subj_actor, Linguistic
      Entity linking leads to better generalization; otherwise these would have been subj_wolverine etc.
  • 17. Retrieval Phase: Get Trivia from Wikipedia Page. The Wikipedia Trivia Miner (WTM) architecture diagram is repeated with the Retrieval Phase highlighted: Candidates’ Source → Candidate Selection → Feature Extraction → SVMrank (Interestingness Ranker) → Top-K Interesting Trivia from Candidates.
  • 18. Candidate Selection
      Sentence extraction: crawled only the text in paragraph tags <p>…</p>; sentence detection, taking each sentence for further processing.
      Removed sentences with missing context, e.g. “It really reminds me of my childhood.”
      Co-reference resolution to find links to a different sentence (out links); a sentence is removed if its out link is not the target entity.
      Example: “Hanks revealed that he signed onto the film after an hour and a half of reading the script. He initially ...” The first ‘he’ is not an out link, and ‘the film’ points to the target entity. The second ‘He’ is an out link. The first sentence is kept, the second removed.
  • 19. Evaluation Dataset
      20 new movie pages from Wikipedia. No. of sentences: 2928. No. of positive sentences: 791.
      Judged (crowd-sourced) by 5 judges on a two-point scale (Boring / Interesting), with majority voting for the class rating.
      Statistically significant? 100 trivia from IMDB were also judged by 5 judges. Mechanism I: majority voting of the IMDB crowd, vs. Mechanism II: crowd-sourced by 5 judges. Agreement between the two mechanisms = Substantial (Kappa value = 0.618).
      Kappa: Agreement
        < 0: less than chance agreement
        0.01-0.20: slight agreement
        0.21-0.40: fair agreement
        0.41-0.60: moderate agreement
        0.61-0.80: substantial agreement
        0.81-0.99: almost perfect agreement
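The majority vote and the agreement check on slide 19 can be illustrated as below. The slide reports a kappa value without naming the variant; Cohen's kappa between the two mechanisms' final labels is assumed here, and the label lists are made up.

```python
from collections import Counter


def majority(votes):
    """Most common label among an odd number of binary votes."""
    return Counter(votes).most_common(1)[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = Interesting, 0 = Boring.
mechanism_1 = [1, 1, 0, 0]   # e.g. majority label of the IMDB crowd
mechanism_2 = [1, 0, 0, 0]   # e.g. majority label of the 5 judges
kappa = cohen_kappa(mechanism_1, mechanism_2)
label = majority(["Interesting", "Boring", "Interesting", "Interesting", "Boring"])
```

In this toy case the two mechanisms agree on 3 of 4 items while chance agreement is 0.5, giving kappa = 0.5, which the table on the slide would call moderate agreement.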
  • 20. Comparative Baselines
      I. Random [Baseline I]: 10 sentences picked randomly from Wikipedia.
      II. CS + Random: candidates selected (removes sentences like “it really reminds me of my childhood”).
      III. CS + supPOS (Best) [Baseline II]: candidates selected, then ranked by number of superlative words.
      supPOS (Best Case) example (Rank: # of sup. words, Class)
        1: 2, Interesting
        2: 2, Boring
        3: 1, Interesting
        4: 1, Interesting
        5: 1, Interesting
        6: 1, Boring
        7: 1, Boring
  • 21. Results: Precision@10. [Figure: P@10 by approach, y-axis 0 to 0.5] Random: 0.25; CS+Random: 0.3; supPOS (Best Case): 0.34; WTM (U): 0.34; WTM (U+L+E): 0.45.
  • 22. Results: Precision@10 (same chart). CS+Random > Random: shows the significance of Candidate Selection.
  • 23. Results: Precision@10 (same chart). CS+Random > Random: shows the significance of Candidate Selection. WTM (U+L+E) >> WTM (U): shows the significance of the engineered Linguistic (L) and Entity (E) features.
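As a reminder of what the P@10 figures measure, here is a minimal sketch. It assumes binary judgments (1 = Interesting, 0 = Boring) listed in ranked order, and that the reported number is a mean over movies; the slides do not spell out the averaging, and the sample rankings are made up.

```python
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked sentences that were judged interesting."""
    return sum(ranked_labels[:k]) / k


def mean_precision_at_k(rankings, k=10):
    """Average P@k over several movies (assumed aggregation)."""
    return sum(precision_at_k(r, k) for r in rankings) / len(rankings)


# Made-up judgments for the top 10 sentences of two movies.
movie_a = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
movie_b = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
p_a = precision_at_k(movie_a)
p_mean = mean_precision_at_k([movie_a, movie_b])
```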
  • 24. Results: Recall@K. supPOS is limited to one kind of trivia; WTM captures varied types, with 62% recall till rank 25. Performance comparison: supPOS is better till rank 3; soon after rank 3, WTM beats supPOS. [Figure: % Recall (y-axis, 0 to 70) vs. Rank (x-axis, 0 to 25) for SuperPOS (Best Case), WTM and Random]
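The recall curve on slide 24 can be computed along these lines. This is an illustrative sketch with made-up judgments; it assumes recall is measured against all interesting sentences among the ranked candidates of one movie.

```python
def recall_at_k(ranked_labels, k):
    """Share of all interesting sentences that appear in the top k."""
    total = sum(ranked_labels)
    if total == 0:
        return 0.0  # nothing to recall
    return sum(ranked_labels[:k]) / total


# Made-up judgments in ranked order: 1 = Interesting, 0 = Boring.
labels = [1, 0, 1, 0, 1, 1]
curve = [recall_at_k(labels, k) for k in range(1, len(labels) + 1)]
```

Unlike precision, recall can only grow as K increases, which is why a ranker that finds only one kind of trivia flattens out early on such a curve.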
  • 25. Qualitative Analysis (Movie: trivia; description)
      WTM Wins (Sup. POS Misses)
        Interstellar (2014): “Paramount is providing a virtual reality walkthrough of the Endurance spacecraft using Oculus Rift technology.” Due to Organization being the subject, and (U) features (technology, reality, virtual).
        Gravity (2013): “When the script was finalized, Cuarón assumed it would take about a year to complete the film, but it took four and a half years.” Due to Entity.Director, Subject (the script), Root word (assume) and (U) features (film, years).
      WTM’s Bad
        Elf (2003): “Stop motion animation was also used.” Candidate Selection failed.
        Rio 2 (2014): “Rio 2 received mixed reviews from critics.” Root verb "receive" has high weightage in the model.
  • 26. Qualitative Analysis (Contd.) (Movie: trivia; description)
      Sup. POS Wins (WTM Misses)
        The Incredibles (2004): “Humans are widely considered to be the most difficult thing to execute in animation.” Presence of ‘most’, absence of any Entity, vague Root word (consider).
      Sup. POS's Bad
        Lone Survivor (2013): “Most critics praised Berg's direction, as well as the acting, story, visuals and battle sequences.” Here 'most' does not show degree but instead shows generality.
  • 27. Our Contributions
      Introduced a novel research problem: mining interesting facts for entities from unstructured text.
      Proposed a novel approach, “Wikipedia Trivia Miner (WTM)”, for mining top-k interesting trivia for movie entities based on their interestingness.
      For movie entities, we leverage already available user-generated trivia data from IMDB for learning interestingness.
      All the data and code used in this paper have been made publicly available for research purposes at https://github.com/abhayprakash/WikipediaTriviaMiner_SharedResources/
  • 28. Acknowledgements. The first author's travel was supported by travel grants from Xerox Research Centre India, IIT Roorkee, IJCAI and Microsoft Research India.