Mining Interesting Trivia for Entities
from Wikipedia
Presented By: Abhay Prakash, En. No. - 10211002, IIT Roorkee
Supervised By: Dr. Dhaval Patel, Assistant Professor, IIT Roorkee
and Dr. Manoj Chinnakotla, Applied Researcher, Microsoft India
Motivation
 Actual consumption by Bing during CWC’15
 User Engagement (rich experience)
 Facts for quiz games (shows like KBC)
 Manual Curation? A professional curator produces, in 1 day, about 50 trivia (spanning 10 entities)
Introduction: Problem Statement
Definition: Trivia is any fact about an entity which is interesting due to any of
the following characteristics: unusualness, uniqueness, unexpectedness or weirdness.
 E.g. “Aamir Khan did not blink his eyes even once in the complete movie.” [Movie: PK (2014)]
 It is unusual for a human to never blink
Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia
page, where a trivia is considered interesting if, when shown to 𝑁 persons, more
than 𝑁/2 of them find it interesting.
 For evaluation on the unseen set, we chose 𝑁 = 5 (statistical significance discussed ahead)
Position w.r.t. Related Works
 Automatic generation of trivia questions (2002) [1]
 Their Work: Trivia questions generated from a structured database.
 Difference: WTM retrieves trivia (facts) from unstructured text.
 Predicting Interesting Things in Text (2014) [2]
 Their Work: Click prediction on anchors (links) within a Wikipedia page.
 Difference: WTM is not limited to links and does not (and cannot) use any click-through data.
 Automatic Prediction of Text Aesthetics and Interestingness (2014) [3]
 Their Work: One-class algorithm for identifying poetically beautiful sentences.
 Difference: Similar in nature, but the domain differs, so the engineered features differ considerably.
 Man bites dog: looking for interesting inconsistencies in structured news reports (2004) [4]
 Their Work: Found unexpected news articles; dependent on ‘structured’ news reports.
 Difference: WTM is not limited to structured data.
Wikipedia Trivia Miner
 Mines Trivia for a Target Entity (Expt: Movie)
 Trains a ranker using trivia of target domain
 Uses Wikipedia as source of Trivia
 Retrieves Top-k interesting trivia from entity’s page
 Why Wikipedia?
 Reliable for factual correctness
 Ample # of interesting trivia (56/100 in expt.)
 Two Phases
 Model Building (Train Phase)
 Retrieval (Test Phase)
[Architecture diagram - Wikipedia Trivia Miner (WTM). Train Phase: Human-Voted Trivia Source (train dataset) → Filtering & Grading → Feature Extraction → SVMrank. Retrieval Phase: Candidates’ Source (Knowledge Base) → Candidate Selection → Feature Extraction → Interestingness Ranker → Top-K Interesting Trivia from Candidates.]
System Architecture
 Filtering & Grading
 Filters out less reliable samples
 Gives a grade to each sample, as required by the ranker
 Interestingness Ranker
 Extracts features from the samples/candidates
 Trains the ranker (SVMrank) in the train phase / ranks candidates in the retrieval phase
 Candidate Selection
 Identifies candidates from Wikipedia
Filtering & Grading
 Crawled Trivia from IMDB
 Top 5K movies, 99K trivia in total
 Filtered on # of votes ≥ 5
 Likeness Ratio: L.R. = (# of Interesting Votes) / (# of Total Votes), see the sketch below
 A (near-)normal distribution over grades is required
[Sample trivia for the movie ‘Batman Begins’ - screenshot taken from IMDB]
[Bar chart - %age coverage per Likeness Ratio bin: 39.56, 30.33, 17.08, 4.88, 3.57, 1.74, 1.06, 0.65, 0.6, 0.33, 0.21]
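A minimal sketch of this filtering step in Python (the dict layout for the crawled trivia is an assumption for illustration):

def likeness_ratio(interesting_votes, total_votes):
    # L.R. = (# of interesting votes) / (# of total votes)
    return interesting_votes / total_votes

def filter_trivia(trivia, min_votes=5):
    # Drop less reliable samples (fewer than `min_votes` total votes)
    # and attach the Likeness Ratio to the ones kept.
    kept = []
    for t in trivia:
        if t["total_votes"] >= min_votes:
            t["lr"] = likeness_ratio(t["interesting_votes"], t["total_votes"])
            kept.append(t)
    return kept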
Filtering & Grading (Contd.)
 High support required for high L.R.
 For L.R. > 0.6, # of votes ≥ 100
 Graded by percentile cutoff to get 5 grades (see the sketch below)
 [90-100], [75-90), [25-75), [10-25), [0-10)
 6163 samples from 846 movies
[Bar chart - grade distribution of the training samples]
Trivia Grade          Frequency
4 (Very Interesting)  706
3 (Interesting)       1091
2 (Ambiguous)         2880
1 (Boring)            945
0 (Very Boring)       541
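A sketch of the percentile-cutoff grading, assuming the `lr` values attached by the filtering sketch above:

import numpy as np

def assign_grades(samples):
    # Percentile cutoffs on L.R.: [0-10) -> 0, [10-25) -> 1,
    # [25-75) -> 2, [75-90) -> 3, [90-100] -> 4.
    lrs = np.array([s["lr"] for s in samples])
    cuts = np.percentile(lrs, [10, 25, 75, 90])
    for s in samples:
        s["grade"] = int(np.searchsorted(cuts, s["lr"], side="right"))
    return samples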
Feature Engineering
 Unigrams (U): Basic Technique in Text Mining
 Linguistic (L): Language Analysis Features
 Superlative Words
 Contradictory Words
 Root Word (Verb)
 Subject Word (First noun)
 Readability
 Entity (E): Understanding/Generalizing the entities present
 Present Entities
 Linking Entities for Linguistic Features
 Focus Entities of sentence
Feature: Unigram Features
 Basic technique in text mining
 Each word (unigram) is a feature column; its TF-IDF is the feature value
 Pre-processing
 Stop-word removal, case conversion, stemming and punctuation removal
 Why this feature?
 Tries to identify important words which make the trivia interesting
 Prominent words that emerged: “stunt”, “award”, “improvise”
 e.g. “Tom Cruise did all of his own stunt driving.” [Movie: Jack Reacher (2012)]
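A sketch of this extractor with scikit-learn and NLTK (the library choice is an assumption; `train_trivia` is an illustrative variable holding the graded samples):

import re
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

_stop = set(stopwords.words("english"))
_stem = PorterStemmer().stem

def preprocess(text):
    # Case conversion, punctuation removal, stop-word removal, stemming.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(_stem(t) for t in tokens if t not in _stop)

vectorizer = TfidfVectorizer(preprocessor=preprocess)
X = vectorizer.fit_transform([t["text"] for t in train_trivia])  # one row per trivia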
Feature: Linguistic Features
 Presence of Superlative Words
 Words like “best”, “longest”, “first” etc.
 Shows extremeness (uniqueness)
 Identified by Part-of-Speech (POS) tags: superlative adjective (JJS) and superlative adverb (RBS)
 E.g. “The longest animated Disney film since Fantasia (1940).” [Movie: Tangled (2010)]
 Presence of Contradictory Words
 Words like “but”, “although”, “unlike” etc.
 Opposing ideas could spark intrigue and interest
 E.g. “The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio.” [Movie: Titanic (1997)]
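A minimal sketch of these two boolean features using NLTK's POS tagger (the tagger and the contradictory-word list are illustrative assumptions):

import nltk  # requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

CONTRADICTORY = {"but", "although", "unlike", "however", "despite"}

def linguistic_flags(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    return {
        "has_superlative": any(tag in ("JJS", "RBS") for _, tag in tags),
        "has_contradictory": any(tok.lower() in CONTRADICTORY for tok in tokens),
    }

linguistic_flags("The longest animated Disney film since Fantasia (1940).")
# -> {'has_superlative': True, 'has_contradictory': False}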
Feature: Linguistic Features (Contd.)
 Root Word of Sentence
 Captures the core activity being discussed in the sentence
 E.g. “Gravity grossed $274 Mn in North America” talks about revenue-related stuff
 Feature column of root_gross
 Subject of Sentence (first noun before the root verb)
 Captures the core thing being discussed in the sentence
 E.g. “The actors snorted crushed B vitamins for scenes involving cocaine.”
 Feature column of subj_actor
 Readability Score
 Complex and lengthy trivia are hardly ever interesting
 FOG Index calculated and binned into three bins
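A sketch of the root-word and subject-word features with spaCy's dependency parser (the tool choice is an assumption; the thesis only specifies the features themselves):

import spacy

nlp = spacy.load("en_core_web_sm")

def root_and_subject_features(sentence):
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    subj = next((tok for tok in root.children if tok.dep_ in ("nsubj", "nsubjpass")), None)
    feats = {"root_" + root.lemma_: 1}
    if subj is not None:
        feats["subj_" + subj.lemma_.lower()] = 1
    return feats

root_and_subject_features("Gravity grossed $274 Mn in North America.")
# -> {'root_gross': 1, 'subj_gravity': 1}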
Feature: Entity Features
 Presence of Generic NEs
 Presence of NEs: MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION
 Feature column for each of the six NEs
 E.g. “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.”
 ORGANIZATION and LOCATION
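A sketch of the six NE-presence features with spaCy NER (the mapping from spaCy's label scheme to the six types above is an assumption; `nlp` comes from the previous sketch):

SPACY_TO_GENERIC = {
    "MONEY": "MONEY", "ORG": "ORGANIZATION", "PERSON": "PERSON",
    "DATE": "DATE", "TIME": "TIME", "GPE": "LOCATION", "LOC": "LOCATION",
}

def ne_presence_features(sentence):
    feats = {name: 0 for name in set(SPACY_TO_GENERIC.values())}
    for ent in nlp(sentence).ents:
        if ent.label_ in SPACY_TO_GENERIC:
            feats[SPACY_TO_GENERIC[ent.label_]] = 1
    return feats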
Feature: Entity Features (Contd.)
Example: “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.”
 Present Entities
 Presence of related entities (resolved using DBpedia)
 E.g. entity_producer and entity_character in the above sample
 Entities Linked before Linguistic Features
 “According to entity_producer, …”
 Linguistic feature Subject Word: subj_Victoria → subj_entity_producer
 Focus Named Entities of Sentence
 Presence of any NE directly under the root
 For the above example, feature columns underroot_entity_producer and underroot_entity_character
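A hedged sketch of the linking step: mentions of related entities (resolved via DBpedia in the thesis; a plain dict here) are rewritten to their role before the linguistic features run, so the features generalize:

def link_entities(sentence, entity_roles):
    # entity_roles: mention -> role, e.g. {"Victoria Alonso": "producer"}
    for mention, role in entity_roles.items():
        sentence = sentence.replace(mention, "entity_" + role)
    return sentence

link_entities(
    "According to Victoria Alonso, Rocket Raccoon and Groot were created ...",
    {"Victoria Alonso": "producer", "Rocket Raccoon": "character", "Groot": "character"},
)
# -> "According to entity_producer, entity_character and entity_character were created ..."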
Model Building: Ranker
 Used Rank-SVM
 Finds a plane such that the projections of the samples onto it follow the given grade order
 Orders samples within a movie
[Illustration - training input rows (MOVIE_ID, FEATURES, GRADE), the model built (hyperplane), ranking input rows (MOVIE_ID, FEATURES), and the output score per candidate. Image taken and modified from Wikipedia.]
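A sketch of serializing the graded samples into SVMrank's input format, where the grade is the target and the movie id is the query id (`train_samples` is illustrative; grouping by movie realizes the within-movie ordering):

def write_svmrank_file(samples, path):
    with open(path, "w") as f:
        for s in sorted(samples, key=lambda s: s["movie_id"]):
            # SVMrank expects sparse features in increasing feature-id order.
            feats = " ".join("%d:%g" % (i, v) for i, v in sorted(s["features"].items()))
            f.write("%d qid:%d %s\n" % (s["grade"], s["movie_id"], feats))

write_svmrank_file(train_samples, "train.dat")
# Then train and apply with Joachims' SVMrank package, e.g.:
#   svm_rank_learn -c 3 train.dat model.dat      (the -c value is illustrative)
#   svm_rank_classify test.dat model.dat predictions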
Model Building: Cross-Validation Results
 Feature increment and model building (NDCG@10 by feature group):
Feature Group        NDCG@10
Unigram (U)          0.934
Linguistic (L)       0.919
Entity Features (E)  0.929
U + L                0.9419
U + E                0.944
WTM (U + L + E)      0.951
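For reference, a minimal NDCG@10 over graded relevance (grades 0-4); variants of the gain/discount exist, and the exact one used in the thesis is not specified here:

import math

def dcg_at_k(grades, k=10):
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(ranked_grades, k=10):
    ideal = dcg_at_k(sorted(ranked_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0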
Model Building: Feature Weights
 Sneak peek inside the model - what is the model learning?
 Top features: our advanced features are useful, and intuitive for humans too
Rank  Feature                          Group
1     subj_scene                       Linguistic
2     subj_entity_cast                 Linguistic + Entity
3     entity_produced_by               Entity
4     underroot_unlinked_organization  Linguistic + Entity
6     root_improvise                   Linguistic
7     entity_character                 Entity
8     MONEY                            Entity (NER)
14    stunt                            Unigram
16    superPOS                         Linguistic
17    subj_actor                       Linguistic
 Entity linking led to better generalization; otherwise these would have been subj_wolverine etc.
Retrieval Phase - Get Trivia from Wikipedia Page
Candidate Selection
 Sentence Extraction
 Crawled only the text in paragraph tags <p>…</p>
 Sentence detection → each sentence goes for further processing
 Removed sentences with missing context
 E.g. “It really reminds me of my childhood.”
 Co-reference resolution to find links into a different sentence
 Remove a sentence if its out-link does not point to the target entity
 e.g. “Hanks revealed that he signed onto the film after an hour and a half of reading the script. He initially ...”
 In the first sentence, ‘he’ is not an out-link and ‘the film’ points to the target entity; the second sentence’s ‘He’ is an out-link
 First sentence kept, second removed
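A simplified sketch of this stage (the thesis uses full co-reference resolution; the pronoun-start heuristic below merely illustrates the missing-context filter):

import requests, nltk
from bs4 import BeautifulSoup

PRONOUNS = {"he", "she", "it", "they", "this", "these", "his", "her", "their"}

def candidate_sentences(wikipedia_url):
    html = requests.get(wikipedia_url).text
    paragraphs = [p.get_text() for p in BeautifulSoup(html, "html.parser").find_all("p")]
    sentences = [s for p in paragraphs for s in nltk.sent_tokenize(p)]
    # Drop sentences whose context likely lives in a previous sentence.
    return [s for s in sentences if s.split() and s.split()[0].lower() not in PRONOUNS]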
Test Set for Model Evaluation
 Generated trivia for 20 movies from Wikipedia
 Judged (crowd-sourced) by 5 judges
 Two-scale voting - Boring / Interesting
 Majority voting for class labeling
 Statistically significant?
 Took 100 trivia from IMDB and got them judged by our 5 judges as well
 Mechanism I: majority voting of the IMDB crowd v/s Mechanism II: majority voting of the 5 judges
 Agreement between the two mechanisms = Substantial (Kappa value = 0.618)
Kappa       Agreement
< 0         Less than chance agreement
0.01-0.20   Slight agreement
0.21-0.40   Fair agreement
0.41-0.60   Moderate agreement
0.61-0.80   Substantial agreement
0.81-0.99   Almost perfect agreement
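A sketch of the agreement computation with scikit-learn's Cohen's kappa (the library choice is an assumption; the label lists are illustrative):

from sklearn.metrics import cohen_kappa_score

# One label per trivia from each mechanism: 1 = Interesting, 0 = Boring.
imdb_crowd_majority = [1, 0, 1, 1, 0]
five_judge_majority = [1, 0, 1, 0, 0]
print(cohen_kappa_score(imdb_crowd_majority, five_judge_majority))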
Results: Metrics on Unseen: P@10
 Comparative Approaches & Baselines
 Random:
- 10 sentences picked randomly from Wikipedia
[P@10 bar chart across the compared models - bar values 0.25, 0.30, 0.32, 0.33, 0.34, 0.34, 0.45; Random (Baseline-I) = 0.25]
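For reference, the two evaluation metrics used on the unseen set, sketched over ranked judge labels (1 = Interesting, 0 = Boring):

def precision_at_k(ranked_labels, k=10):
    top = ranked_labels[:k]
    return sum(top) / len(top)

def recall_at_k(ranked_labels, k):
    # %Recall as used in the Recall@K plot later in the deck.
    total = sum(ranked_labels)
    return 100.0 * sum(ranked_labels[:k]) / total if total else 0.0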
Results: Metrics on Unseen: P@10
 Comparative Approaches & Baselines
 CS + Random:
- Missing context sentences removed by CS
- 10 sent. picked randomly
[P@10 bar chart: CS then Random = 0.30, a 19.61% improvement over Baseline-I]
Results: Metrics on Unseen: P@10
 Comparative Approaches & Baselines
 CS + supPOS(Worst):
 - Ranked by # of superlative words
 - Deliberately taking boring sentences among ties (same # of superlatives)
 CS + supPOS(Rand):
 - Ranked by # of superlative words
 - Shuffled among ties
 CS + supPOS(Best):
 - Ranked by # of superlative words
 - Deliberately taking interesting sentences among ties
[P@10 bar chart: supPOS_W, supPOS_R (29.41% Imp.) and supPOS_B (Baseline-II)]
supPOS Trivia: “Marlon Brando did not memorize most of his lines and read from cue cards during most of the film.”
Results: Metrics on Unseen: P@10
 Comparative Approaches & Baselines
 CS + WTM(U):
- ML Ranking
- With only basic Unigram(U) features
[P@10 bar chart: WTM (U) shown alongside B-I and B-II]
Results: Metrics on Unseen: P@10
 Comparative Approaches & Baselines
 CS + WTM(U): ML Ranking with only (U) features
 CS + WTM(U+L+E):
- ML Ranking
- With advanced (U+L+E) features
[P@10 bar chart: WTM (U+L+E) = 0.45, a 78.43% imp. over B-I and a 33.82% imp. over B-II]
Results: Metrics on Unseen: Recall@K
 supPOS is limited to one kind of trivia
 WTM captures varied types
 62% recall by rank 25
 Performance Comparison
 supPOS better till rank 3
 Soon after rank 3, WTM beats superPOS
[Recall@K line chart: %Recall (0-70) vs. Rank (0-25) for SuperPOS (Best Case), WTM and Random]
Results: Qualitative Discussion
WTM Wins (Sup. POS Misses)
 Interstellar (2014): “Paramount is providing a virtual reality walkthrough of the Endurance spacecraft using Oculus Rift technology.”
 - Due to Organization being the subject, and (U) features (technology, reality, virtual)
 Gravity (2013): “When the script was finalized, Cuarón assumed it would take about a year to complete the film, but it took four and a half years.”
 - Due to Entity.Director, Subject (the script), Root word (assume) and (U) features (film, years)
WTM’s Bad
 Elf (2003): “Stop motion animation was also used.”
 - Candidate Selection failed
 Rio 2 (2014): “Rio 2 received mixed reviews from critics.”
 - Root verb “receive” has high weight in the model
Results: Qualitative Discussion (Contd.)
Sup. POS Wins (WTM Misses)
 The Incredibles (2004): “Humans are widely considered to be the most difficult thing to execute in animation.”
 - Due to presence of ‘most’, absence of any entity, and a vague root word (consider)
Sup. POS’s Bad
 Lone Survivor (2013): “Most critics praised Berg’s direction, as well as the acting, story, visuals and battle sequences.”
 - Here ‘most’ does not express degree but genericity
Dissertation Contribution
 Identified, defined and posed a novel research problem
 not merely providing a solution to an existing problem
 Proposed a system, “Wikipedia Trivia Miner (WTM)”
 To mine the top-k trivia for any given entity, ranked by their interestingness
 Engineered features that capture the ‘about-ness’ of a sentence
 Generalize which ones are interesting
 Showed how publicly available IMDB data can be leveraged for model learning
 Cost effective, as it eliminates the need for crowd annotation of training data
 Proposed a mechanism to prepare ground truth for the test set
 Cost-effective yet statistically significant
Publication Submitted
[1] Abhay Prakash, Manoj Chinnakotla, Dhaval Patel, Puneet Garg (2015): “Did
You Know?: Mining Interesting Trivia for Entities from Wikipedia”. Submitted to the
International Joint Conference on Artificial Intelligence (IJCAI).
Further Work
 Replicate the work in the Celebrities domain
 Verify that the WTM approach is actually domain independent
 Feature engineering to capture deviation from expectation
 Expectation based on the topics in that domain; compare against the topic of the candidate
 Fact Popularity
 Lesser-known trivia could be more interesting to the majority of people
Key References
[1] Matthew Merzbacher, "Automatic generation of trivia questions," Foundations of Intelligent
Systems, Lecture Notes in Computer Science, vol. 2366, pp. 123-130, 2002.
[2] Michael Gamon, Arjun Mukherjee, and Patrick Pantel, "Predicting interesting things in text,"
in COLING, 2014.
[3] Debasis Ganguly, Johannes Leveling, and Gareth Jones, "Automatic prediction of text
aesthetics and interestingness," in COLING, 2014.
[4] Emma Byrne and Anthony Hunter, "Man bites dog: looking for interesting inconsistencies in
structured news reports," Data and Knowledge Engineering, vol. 48, no. 3, pp. 265-295, 2004.