Mining Interesting Trivia for Entities from Wikipedia PART-II

Mining Interesting Trivia for Entities
from Wikipedia
Supervised By:
Dr. Dhaval Patel, Assistant Professor, IIT Roorkee
Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India
Presented By:
Abhay Prakash, En. No. - 10211002, IIT Roorkee

Publication Accepted
[1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: “Did you
know?- Mining Interesting Trivia for Entities from Wikipedia”. In 24th
International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Conference Rating: A*

Introduction: Problem Statement
Definition: Trivia are any facts about an entity which are interesting due to any
of the following characteristics - unusualness, uniqueness, unexpectedness or
weirdness.
 Generally appear in “Did you know?” articles
 E.g. “To prepare for Joker’s role, Heath Ledger secluded himself in a hotel room for a month” [The Dark Knight]
 Unusual for an actor/human to seclude himself for a month
Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia
page. A trivia is considered interesting if, when shown to 𝑁 persons, more than 𝑁/2 of
them find it interesting.
 For evaluation on the unseen set, we chose 𝑁 = 5 (statistical significance discussed in the mid evaluation)
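The majority-vote criterion above can be written down in a few lines. This is only a minimal sketch: the function name `is_interesting` and the boolean vote encoding are assumptions made for illustration, not part of the original work.

```python
def is_interesting(votes):
    """Return True when a strict majority of judges voted 'interesting'.

    votes: sequence of booleans, one per judge (N = len(votes)).
    The name and the encoding are hypothetical, chosen for this sketch.
    """
    n = len(votes)
    if n == 0:
        raise ValueError("need at least one vote")
    # "More than N/2" is a strict majority, so a tie does not count.
    return sum(1 for v in votes if v) > n / 2


# N = 5 judges, as chosen for the unseen test set.
three_of_five = is_interesting([True, True, True, False, False])
two_of_five = is_interesting([True, True, False, False, False])
```

With an odd 𝑁 such as 5, ties cannot occur, which may be one practical reason to prefer it.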

Wikipedia Trivia Miner (WTM)
 Based on ML approach to mine trivia from unstructured text
 Trains a ranker using sample trivia of target domain
 Experiment with Movie entities and Celebrity entities
 Harness trained ranker to mine Trivia from entity’s Wikipedia page
 Retrieves Top-k standalone interesting sentences from entity’s page
 Why Wikipedia?
 Reliable for factual correctness
 Ample # of interesting trivia (56/100 in expt.)

System Architecture
 Filtering & Grading
 Filters out noisy samples
 Gives a grade to each sample, as required by the ranker
 Interestingness Ranker
 Extracts features from the samples/candidates
 Trains the ranker (SVMrank) / ranks candidates
 Candidate Selection
 Identifies candidates from Wikipedia
[Figure: WTM architecture. Components shown: Human Voted Trivia Source, Filtering & Grading, Train Dataset, Candidates’ Source, Candidate Selection, Interestingness Ranker (Feature Extraction, SVMrank), Knowledge Base, Top-K Interesting Trivia from Candidates]

Execution Phases
 Train Phase
 Crawls and prepares train data
 Featurizes the train data
 Trains SVMrank to build a model
 Retrieval Phase
 Crawls entity’s Wikipedia text
 Identifies candidates for trivia
 Featurizes the candidates
 Ranks the candidates using the already built model
[Figure: Train Phase: Human Voted Trivia Source, Filtering & Grading, Train Dataset, Feature Extraction, SVMrank, Model. Retrieval Phase: Candidates’ Source, Candidate Selection, Feature Extraction, Knowledge Base, SVMrank, Top-K Interesting Trivia from Candidates]

Feature Engineering
Bucket | Feature | Significance | Sample features | Example Trivia
Unigram (U) Features | Each word’s TF-IDF | Identify important words which make the trivia interesting | “stunt”, “award”, “improvise” | “Tom Cruise did all of his own stunt driving.”
Linguistic (L) Features | Superlative Words | Shows the extremeness (uniqueness) | “best”, “longest”, “first” | “The longest animated Disney film since Fantasia (1940).”
Linguistic (L) Features | Contradictory Words | Opposing ideas could spark intrigue and interest | “but”, “although”, “unlike” | “The studios wanted Matthew McConaughey for lead role, but James Cameron insisted on Leonardo DiCaprio.”
Linguistic (L) Features | Root Word (Main Verb) | Captures core activity being discussed in the sentence | root_gross | “Gravity grossed $274 Mn in North America”
Linguistic (L) Features | Subject Word (First Noun) | Captures core thing being discussed in the sentence | subj_actor | “The actors snorted crushed B vitamins for scenes involving cocaine”
Linguistic (L) Features | Readability | Complex and lengthy trivia are hardly interesting | FOG Index binned in 3 bins | ---
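Three of the linguistic features (superlative words, contradictory words and the binned FOG index) can be approximated without a parser. The sketch below rests on loud assumptions: the word lists contain only the sample words from the slides, the syllable counter is a crude vowel-run heuristic, and the bin cut points `low` and `high` are hypothetical, since the slides say only that three bins were used. Root-word and subject-word features need a dependency parser and are not covered here.

```python
import re

# Illustrative word lists only: the slides give samples, not the full lists.
SUPERLATIVES = {"best", "longest", "first"}
CONTRADICTORY = {"but", "although", "unlike"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def count_syllables(word):
    # Crude heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def fog_index(text):
    # Gunning fog: 0.4 * (words per sentence + 100 * complex words / words),
    # with "complex" approximated as three or more syllables.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100.0 * complex_words / len(words))


def fog_bin(fog, low=8.0, high=12.0):
    # Hypothetical cut points; the original thresholds are not given.
    if fog < low:
        return 1
    if fog < high:
        return 2
    return 3


def linguistic_features(text):
    words = tokenize(text)
    return {
        "superlative": sum(1 for w in words if w in SUPERLATIVES),
        "contradictory": int(any(w in CONTRADICTORY for w in words)),
        "fog_bin": fog_bin(fog_index(text)),
    }


features = linguistic_features(
    "The longest animated Disney film since Fantasia (1940)."
)
```

A part-of-speech tagger would probably identify superlatives more reliably than a fixed word list; the list is used here only to keep the sketch self-contained.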

Feature Engineering (Contd…)
Bucket | Feature | Significance | Sample features | Example Trivia
Entity (E) Features | Generic NEs | Captures general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.” (ORGANIZATION and LOCATION)
Entity (E) Features | Related Entities | Captures specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.” (entity_producer, entity_character)
Entity (E) Features | Entity Linking before (L) Parsing | Captures generalized story of sentence | subj_entity_producer | [The same trivia above] “According to entity_producer, …” (subj_Victoria → subj_entity_producer)
Entity (E) Features | Focus Entities | Captures core entities being talked about | underroot_entity_producer | [The same trivia above] (underroot_entity_producer, underroot_entity_character)

Feature Engineering: Example
Ex. “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of
motion-capture and rotomation VFX.”
 Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total for all train data
 Rest of the features have value 0.
 entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, ….
Feature | Value
create | 0.25
mix | 0.75
motion | 0.96
capture | 0.4
rotomation | 0.85
VFX | 0.75
root_create | 1
supPOS | 0
subj_entity_producer | 1
FOG | 3
contradictory | 0
entity_producer | 1
entity_character | 1
underroot_entity_producer | 1
underroot_entity_character | 1
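Because most of the columns are 0 for any one sentence, a sparse encoding is a natural fit. The sketch below writes one example in the `<grade> qid:<id> <feature>:<value>` line format that SVMrank reads. The helper names, the grade of 3, the query id of 1 and the use of one query id per entity are all assumptions made for illustration; only a subset of the feature values from the example above is used.

```python
def build_feature_index(feature_names):
    # SVMrank expects positive integer feature ids; assign them in sorted name order.
    return {name: i for i, name in enumerate(sorted(feature_names), start=1)}


def to_svmrank_line(grade, qid, features, index):
    # Zero-valued features are omitted and ids are written in increasing order.
    pairs = sorted(
        (index[name], value) for name, value in features.items() if value != 0
    )
    body = " ".join(f"{fid}:{value:g}" for fid, value in pairs)
    return f"{grade} qid:{qid} {body}"


# A few of the feature values from the example sentence.
example = {
    "create": 0.25,
    "mix": 0.75,
    "root_create": 1,
    "supPOS": 0,
    "entity_producer": 1,
}
index = build_feature_index(example)
# Grade 3 and qid 1 are hypothetical placeholders.
line = to_svmrank_line(3, 1, example, index)
```

Grouping all candidates of one entity under the same `qid` would make the ranker compare sentences only within that entity, which seems to match the per-entity top-k goal.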

Comparative Approaches
I. Random [Baseline I]:
- 10 sentences picked randomly from Wikipedia
II. CS + Random
- Candidates Selected (standalone context independent sentences)
- i.e., remove sentences like “it really reminds me of my childhood”
- 10 sentences picked randomly from candidates
III. CS + supPOS(Best) [Baseline II]:
- Candidates Selected
- Ranked by # of superlative words
- Best case: among sentences with the same # of superlative words, interesting ones are deliberately ranked first
supPOS (Best Case):
Rank | # of sup. words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
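The best-case supPOS ordering shown in the table can be reproduced with a single sort. This is a sketch under assumptions: the tuple layout and the function name `rank_suppos_best_case` are hypothetical, and the tie-break uses the ground-truth label, which is exactly what makes this an optimistic upper bound for the baseline rather than a deployable ranker.

```python
def rank_suppos_best_case(candidates):
    """candidates: list of (sentence, n_superlatives, is_interesting) tuples."""
    # Primary key: more superlative words first.
    # Tie-break: interesting sentences first (False sorts before True).
    return sorted(candidates, key=lambda c: (-c[1], not c[2]))


# Seven placeholder sentences with the counts and classes from the table,
# given in scrambled order.
candidates = [
    ("s6", 1, False),
    ("s3", 1, True),
    ("s2", 2, False),
    ("s7", 1, False),
    ("s1", 2, True),
    ("s4", 1, True),
    ("s5", 1, True),
]
ranked = rank_suppos_best_case(candidates)
classes = ["Interesting" if c[2] else "Boring" for c in ranked]
```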

Variants of WTM
I. WTM (U)
- ML Ranking of candidates using only Unigram Features
II. WTM (U+L+E)
- ML Ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)

Results: P@10
 Metric is Precision at 10 (P@10): the fraction of the top 10 ranked candidates that are actually interesting
[Figure: bar chart of P@10 by approach]
Approach | P@10
Random | 0.25
CS+Random | 0.3
supPOS (Best Case) | 0.34
WTM (U) | 0.34
WTM (U+L+E) | 0.45
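For reference, P@10 for a single entity can be computed as below. This is a minimal sketch; the function name and the list-of-booleans input (one label per ranked candidate, best first) are assumptions made for illustration.

```python
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked candidates that are labeled interesting.

    ranked_labels: booleans in rank order, best-ranked candidate first.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Divide by k even if fewer than k candidates exist, so short lists are not favored.
    return sum(1 for label in ranked_labels[:k] if label) / k


# Four interesting sentences among the top 10; later ranks are ignored.
labels = [True, False, True, False, False, True, False, False, True, False, True, True]
p_at_10 = precision_at_k(labels, k=10)
```

A reported figure would presumably be the average of this per-entity value over all test entities.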

Results: P@10
 CS+Random > Random
 Shows significance of Candidate
Selection
 WTM (U+L+E) >> WTM (U)
 Shows significance of Engineered
Linguistic (L) and Entity (E) Features

Results: Recall@K
 supPOS limited to one kind of trivia
 WTM captures varied types
 62% recall by rank 25
 Performance Comparison
 supPOS better till rank 3
 Soon after rank 3, WTM beats supPOS
[Figure: %Recall (0 to 70) vs. Rank (0 to 25) for supPOS (Best Case), WTM and Random]
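Recall at a given rank can be computed in the same style. The sketch assumes that recall is measured against all interesting candidates of the entity; the function name and input encoding are hypothetical.

```python
def recall_at_k(ranked_labels, k):
    """Share of all interesting candidates that appear within the top k ranks.

    ranked_labels: booleans in rank order, best-ranked candidate first.
    """
    total = sum(1 for label in ranked_labels if label)
    if total == 0:
        return 0.0
    found = sum(1 for label in ranked_labels[:k] if label)
    return found / total


labels = [True, False, True, False, True, True]
r_at_3 = recall_at_k(labels, 3)   # 2 of the 4 interesting items are in the top 3
```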

Sensitivity to Training Size
 Current results are reported with 6163 train trivia
 WTM precision increases with train size
 Desirable property, as precision can be improved by adding more train data

WTM’s Domain Independence
 Experiment on Celebrity Domain to justify claim of domain independence.
 Dataset:
 Crawled trivia for the top 1000 movie celebrities from IMDB and ran a 5-fold test
 Train dataset: 4459 Trivia (106 entities)
 Test dataset: 500 Trivia (10 entities)
 Feature group suspected of being domain dependent: Entity Features
Feature group | Examples
Unigram (U) Features | All words
Linguistic (L) Features | subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2
Entity (E) Features | entity_producer, entity_director, …

WTM’s Domain Independence (Contd…)
 Entity Features are domain independent too
 Entity Features are automatically generated using attribute:value pairs in DBpedia
 For a matching of ‘value’ in sentence, the match is replaced by entity_‘attribute’
 Unigram (U) and Linguistic (L) features clearly domain independent
[Figure: DBpedia (attribute: value) pairs for Batman Begins; Sample Trivia (Batman Begins)]


 Entity Feature Generation from DBpedia
Domain | DBpedia attribute:value | Feature generated
Movie Domain (ex. Batman Begins (2005)) | Director: Christopher Nolan | entity_director
Movie Domain (ex. Batman Begins (2005)) | Producer: Larry J. Franco | entity_producer
Celebrity Domain (ex. Angelina Jolie) | Partner: Brad Pitt | entity_partner
Celebrity Domain (ex. Angelina Jolie) | birthplace: California | entity_birthPlace
 Example of Entity Features in Celebrity Domain
FEATURE | ENTITY | TRIVIA
entity_partner | Johnny Depp | Engaged to Amber Heard [January 17, 2014].**
entity_citizenship | Nicole Kidman | First Australian actress to win the Best Actress Academy Award.
** After Entity Linking, the sentence is parsed as “Engaged to entity_partner”
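The replacement of a matched DBpedia value by entity_‘attribute’ can be sketched with plain string matching. This is only an illustration under assumptions: the function name `link_entities` is hypothetical, the (attribute, value) pairs are passed in by hand instead of being fetched from DBpedia, and exact case-insensitive matching stands in for whatever entity resolution the actual system used.

```python
import re


def link_entities(sentence, attribute_values):
    """Replace each known value found in the sentence by entity_<attribute>.

    attribute_values: list of (attribute, value) pairs for the target entity.
    """
    linked = sentence
    # Longer values first, so a short value cannot clobber part of a longer one.
    for attribute, value in sorted(attribute_values, key=lambda av: -len(av[1])):
        pattern = re.compile(r"\b" + re.escape(value) + r"\b", flags=re.IGNORECASE)
        replacement = "entity_" + attribute
        # A function replacement avoids any backslash handling in the replacement text.
        linked = pattern.sub(lambda match: replacement, linked)
    return linked


# Pair implied by the Johnny Depp example in the table above.
pairs = [("partner", "Amber Heard")]
linked = link_entities("Engaged to Amber Heard [January 17, 2014].", pairs)
```

Since the feature name comes from the attribute and not from the value, the same code should apply unchanged to movie or celebrity entities, which is the point of the domain-independence claim.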

Feature Contribution (Movie v/s Celeb.)
Movie Domain
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | superPOS | Linguistic
17 | subj_actor | Linguistic
Celebrity Domain
Rank | Feature | Group
1 | win | Unigram
3 | magazine | Unigram
4 | superPOS | Linguistic
5 | MONEY | Entity (NER)
6 | entity_alternativenames | Entity
7 | root_engage | Linguistic
14 | subj_earnings | Linguistic
15 | subj_entity_children | Linguistic + Entity
18 | entity_birthplace | Entity
19 | subj_unlinked_location | Linguistic + Entity
 Top Features: Our advanced features are useful and intuitive for humans too
 Entity Linking leads to better generalization (instead of entity_wolverine, model gets entity_cast)

Results: P@10 (Celebrity Domain)
[Figure: bar chart of P@10 by approach (Celebrity Domain)]
Approach | P@10
Random | 0.39
supPOS (Best Case) | 0.54
WTM (U) | 0.58
WTM (U+L+E) | 0.71
 Again WTM (U+L+E) >> WTM (U)
 Significance of advanced (L) and (E)
features
 Hence, Features and Approach are
Domain Independent
 For entities of any domain, just replace
Train Data (Sample Trivia)

Dissertation Contribution
 Identified, defined and provided a novel research problem
 Rather than only providing solutions to an existing problem
 Proposed a Domain Independent system “Wikipedia Trivia Miner (WTM)”
 Mines the top-k trivia for any given entity, ranked by interestingness
 Engineered features that capture ‘about-ness’ of sentence
 Generalize which ones are interesting
 Proposed a mechanism to prepare ground truth for test-set
 Cost-effective but statistically significant

Future Works
 New Features to increase Ranking Quality
 Unusualness: Probability of occurrence of the sentence in considered domain
 Fact Popularity: Lesser known trivia could be more interesting to most people
 Trying Deep Learning
 Could be helpful, as in the case of sarcasm detection
 Generating Questions from mined trivia
 To present Trivia in question form
 Obtaining personalized Interesting Trivia
 In this dissertation work, interestingness was decided by majority voting; a future direction is ranking based on user demographics
