Mining Interesting Trivia for Entities from Wikipedia
Presented By: Abhay Prakash, En. No. - 10211002, IIT Roorkee
Supervised By:
Dr. Dhaval Patel, Assistant Professor, IIT Roorkee
Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India
Publication Accepted
[1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: “Did you
know?- Mining Interesting Trivia for Entities from Wikipedia”. In 24th
International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Conference Rating: A*
Introduction: Problem Statement
Definition: Trivia are facts about an entity that are interesting because of any of the following characteristics: unusualness, uniqueness, unexpectedness or weirdness.
• Generally appear in “Did you know?” articles
• E.g. “To prepare for the Joker’s role, Heath Ledger secluded himself in a hotel room for a month” [The Dark Knight]
- Unusual for an actor (or any person) to seclude himself for a month
Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia page, where a trivia is considered interesting if, when it is shown to 𝑁 persons, more than 𝑁/2 of them find it interesting.
• For evaluation on the unseen set, we chose 𝑁 = 5 (statistical significance discussed in the mid evaluation)
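The majority-vote criterion in the problem statement can be sketched as below. This is a minimal illustration, not the authors' annotation tooling; the function name and the sample votes are hypothetical.

```python
def is_interesting(votes):
    """Return True if more than half of the judges voted 'interesting'.

    `votes` holds one boolean per judge, so N = len(votes).
    """
    n = len(votes)
    return sum(votes) > n / 2


# Hypothetical votes from N = 5 judges for two candidate sentences.
votes_a = [True, True, True, False, False]    # 3 of 5 judges
votes_b = [True, False, False, True, False]   # 2 of 5 judges
label_a = is_interesting(votes_a)
label_b = is_interesting(votes_b)
```

With an odd N such as 5, a tie is impossible, so every sentence receives a definite label.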
Wikipedia Trivia Miner (WTM)
• Based on an ML approach to mine trivia from unstructured text
- Trains a ranker using sample trivia of the target domain
- Experiments with Movie entities and Celebrity entities
- Uses the trained ranker to mine trivia from an entity’s Wikipedia page
- Retrieves the top-k standalone interesting sentences from the entity’s page
• Why Wikipedia?
- Reliable for factual correctness
- Ample number of interesting trivia (56/100 in the experiment)
System Architecture
• Filtering & Grading
- Filters out noisy samples
- Gives a grade to each sample, as required by the ranker
• Interestingness Ranker
- Extracts features from the samples/candidates
- Trains the ranker (SVMrank) / ranks the candidates
• Candidate Selection
- Identifies candidates from Wikipedia
[Figure: Wikipedia Trivia Miner (WTM) architecture. Components shown: Human Voted Trivia Source, Filtering & Grading, Train Dataset, Candidates’ Source, Candidate Selection, Knowledge Base, Interestingness Ranker (Feature Extraction, SVMrank), Top-K Interesting Trivia from Candidates]
Execution Phases
• Train Phase
- Crawls and prepares the train data
- Featurizes the train data
- Trains SVMrank to build a model
• Retrieval Phase
- Crawls the entity’s Wikipedia text
- Identifies candidates for trivia
- Featurizes the candidates
- Ranks the candidates using the already built model
[Figure: execution phases. Train Phase: Human Voted Trivia Source, Filtering & Grading, Train Dataset, Feature Extraction, SVMrank, Model. Retrieval Phase: Candidates’ Source, Candidate Selection, Feature Extraction, Knowledge Base, SVMrank, Top-K Interesting Trivia from Candidates]
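Both phases hand featurized samples to SVMrank, which reads the SVM-light text format extended with a `qid` field. The sketch below shows one way to serialize a graded sample into that format. The helper name is hypothetical, and using one `qid` per entity is an assumption not stated in the slides.

```python
def to_svmrank_line(grade, qid, features, comment=""):
    """Format one sample as an SVMrank input line.

    grade    -- integer grade from Filtering & Grading (higher = more interesting)
    qid      -- query id; assumed here to identify the entity the sample belongs to
    features -- dict mapping 1-based feature column index -> value
    """
    parts = [str(grade), f"qid:{qid}"]
    # Feature indices must appear in increasing order; zero values are omitted.
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    line = " ".join(parts)
    if comment:
        line += f" # {comment}"
    return line


# Hypothetical sample: grade 3, entity 1, three feature columns (one is zero).
example_line = to_svmrank_line(3, 1, {7: 0.25, 2: 1, 10: 0})
```

Files written this way would typically be passed to `svm_rank_learn` in the Train Phase and to `svm_rank_classify` with the saved model in the Retrieval Phase.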
Feature Engineering
Bucket | Feature | Significance | Sample features | Example Trivia
Unigram (U) Features | Each word’s TF-IDF | Identifies important words which make the trivia interesting | “stunt”, “award”, “improvise” | “Tom Cruise did all of his own stunt driving.”
Linguistic (L) Features | Superlative Words | Shows the extremeness (uniqueness) | “best”, “longest”, “first” | “The longest animated Disney film since Fantasia (1940).”
Linguistic (L) Features | Contradictory Words | Opposing ideas could spark intrigue and interest | “but”, “although”, “unlike” | “The studios wanted Matthew McConaughey for lead role, but James Cameron insisted on Leonardo DiCaprio.”
Linguistic (L) Features | Root Word (Main Verb) | Captures the core activity being discussed in the sentence | root_gross | “Gravity grossed $274 Mn in North America”
Linguistic (L) Features | Subject Word (First Noun) | Captures the core thing being discussed in the sentence | subj_actor | “The actors snorted crushed B vitamins for scenes involving cocaine”
Linguistic (L) Features | Readability | Complex and lengthy trivia are hardly interesting | FOG Index binned in 3 bins | ---
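Three of the linguistic features in the table above can be approximated with the standard library alone, as sketched below. Several pieces are assumptions: the word lists are tiny stand-ins (the feature name supPOS suggests the original relies on part-of-speech tags), the syllable counter is a rough heuristic, and the FOG bin thresholds are hypothetical because the slides do not give them. Root and subject words need a dependency parser and are left out.

```python
import re

# Illustrative word lists only; the slides list just a few sample words.
SUPERLATIVES = {"best", "longest", "first", "most", "largest"}
CONTRADICTORY = {"but", "although", "unlike"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def count_syllables(word):
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def fog_index(text):
    """Gunning fog index: 0.4 * (words per sentence + 100 * complex word ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def fog_bin(fog, low=8.0, high=12.0):
    # Hypothetical thresholds; the slides only say "binned in 3 bins".
    if fog < low:
        return 1
    if fog < high:
        return 2
    return 3


def linguistic_features(text):
    words = tokenize(text)
    return {
        "supPOS": sum(w in SUPERLATIVES for w in words),
        "contradictory": int(any(w in CONTRADICTORY for w in words)),
        "FOG": fog_bin(fog_index(text)),
    }


feats_sup = linguistic_features(
    "The longest animated Disney film since Fantasia (1940).")
feats_con = linguistic_features(
    "The studios wanted Matthew McConaughey for lead role, "
    "but James Cameron insisted on Leonardo DiCaprio.")
```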
Feature Engineering (Contd…)
Bucket | Feature | Significance | Sample features | Example Trivia
Entity (E) Features | Generic NEs | Captures general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.” • ORGANIZATION and LOCATION
Entity (E) Features | Related Entities | Captures specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.” • entity_producer, entity_character
Entity (E) Features | Entity Linking before (L) Parsing | Captures the generalized story of the sentence | subj_entity_producer | [The same trivia above] • “According to entity_producer, …” • subj_Victoria → subj_entity_producer
Entity (E) Features | Focus Entities | Captures the core entities being talked about | underroot_entity_producer | [The same trivia above] • underroot_entity_producer, underroot_entity_character
Feature Engineering: Example
Ex. “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.”
• Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total for all train data
• Feature values for this sentence:
Feature | Value
create | 0.25
mix | 0.75
motion | 0.96
capture | 0.4
rotomation | 0.85
VFX | 0.75
root_create | 1
supPOS | 0
subj_entity_producer | 1
FOG | 3
contradictory | 0
entity_producer | 1
entity_character | 1
underroot_entity_producer | 1
underroot_entity_character | 1
• The rest of the features have value 0.
- entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, …
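The per-word weights in the example are TF-IDF values. The slides do not say which TF-IDF variant was used, so the sketch below assumes one common form (term frequency normalized by sentence length, times the log of inverse document frequency). The function name and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict word -> weight) per token list."""
    n_docs = len(docs)
    # Document frequency: number of token lists containing each word.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors


# Hypothetical two-sentence corpus, already tokenized.
toy_docs = [["stunt", "driving"], ["award", "driving"]]
toy_vectors = tfidf_vectors(toy_docs)
```

A word that occurs in every sentence ("driving" here) gets weight 0, while words specific to one sentence get a positive weight, which is how this feature bucket highlights distinctive words.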
Comparative Approaches
I. Random [Baseline I]:
- 10 sentences picked randomly from Wikipedia
II. CS + Random
- Candidates Selected (standalone, context-independent sentences)
- i.e., remove sentences like “it really reminds me of my childhood”
- 10 sentences picked randomly from the candidates
III. CS + supPOS (Best) [Baseline II]:
- Candidates Selected
- Ranked by number of superlative words
- Among sentences with the same number of superlative words, the interesting ones are deliberately ranked first
supPOS (Best Case):
Rank | # of superlative words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
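The best-case ordering for the supPOS baseline can be sketched as a sort on two keys. The tuple layout and sentence ids below are hypothetical. The tie-break uses the ground-truth label, so it gives an upper bound for this baseline and is not a deployable ranker.

```python
def rank_suppos_best_case(candidates):
    """Rank candidates by superlative count, breaking ties optimistically.

    Each candidate is a (sentence_id, superlative_count, is_interesting) tuple.
    Higher superlative counts come first; within the same count, sentences
    labeled interesting are placed before boring ones.
    """
    return sorted(candidates, key=lambda c: (-c[1], not c[2]))


# Hypothetical candidates mirroring the counts and classes in the table above.
candidates = [
    ("s1", 1, False), ("s2", 2, False), ("s3", 1, True), ("s4", 2, True),
    ("s5", 1, True), ("s6", 1, False), ("s7", 1, True),
]
ranked = rank_suppos_best_case(candidates)
```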
Variants of WTM
I. WTM (U)
- Candidates Selected
- ML Ranking of candidates using only Unigram Features
II. WTM (U+L+E)
- Candidates Selected
- ML Ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)
Results: P@10
• Metric is Precision at 10 (P@10): out of the top 10 ranked candidates, the fraction that are actually interesting
[Figure: bar chart of P@10 by approach. Random: 0.25; CS+Random: 0.3; supPOS (Best Case): 0.34; WTM (U): 0.34; WTM (U+L+E): 0.45]
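As a minimal sketch, P@10 for one entity can be computed from the interesting/boring labels of its ranked candidates. The function name and the sample labels are hypothetical.

```python
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked candidates that are labeled interesting.

    `ranked_labels` is a list of booleans ordered from rank 1 downwards.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


# Hypothetical labels for the top 10 candidates of one entity.
sample_labels = [True, False, True, True, False,
                 False, True, False, False, False]
sample_p10 = precision_at_k(sample_labels)
```

A per-entity value like this would then be averaged over the test entities to get a single figure per approach.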
• CS+Random > Random
- Shows the significance of Candidate Selection
• WTM (U+L+E) >> WTM (U)
- Shows the significance of the engineered Linguistic (L) and Entity (E) features
Results: Recall@K
• supPOS is limited to one kind of trivia
• WTM captures varied types
- 62% recall till rank 25
• Performance comparison
- supPOS is better till rank 3
- Soon after rank 3, WTM beats supPOS
[Figure: % Recall (0 to 70) versus Rank (0 to 25) for supPOS (Best Case), WTM and Random]
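The slide does not define Recall@K. The sketch below assumes it is the share of all interesting candidates of an entity that appear in the top K ranks. The function name and the sample labels are hypothetical.

```python
def recall_at_k(ranked_labels, k):
    """Share of all interesting candidates that appear in the top k ranks.

    `ranked_labels` is a list of booleans ordered from rank 1 downwards.
    """
    total_interesting = sum(ranked_labels)
    if total_interesting == 0:
        return 0.0
    return sum(ranked_labels[:k]) / total_interesting


# Hypothetical ranked labels for one entity (4 interesting candidates in total).
sample_ranking = [True, False, True, False, True, False, False, True]
recall_3 = recall_at_k(sample_ranking, 3)
recall_8 = recall_at_k(sample_ranking, 8)
```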
Sensitivity to Training Size
• Current results are reported with 6163 train trivia
• WTM precision increases with train size
- A desirable property, as precision can be improved by taking more train data
WTM’s Domain Independence
• Experiment on the Celebrity domain to justify the claim of domain independence.
• Dataset:
- Crawled trivia for the top 1000 movie celebrities from IMDB and did a 5-fold test
- Train dataset: 4459 trivia (106 entities)
- Test dataset: 500 trivia (10 entities)
• Feature group suspected of being domain dependent: Entity Features
Feature group | Examples
Unigram (U) Features | All words
Linguistic (L) Features | subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2
Entity (E) Features | entity_producer, entity_director, …
WTM’s Domain Independence (Contd…)
• Entity Features are domain independent too
- Entity Features are automatically generated using attribute:value pairs in DBpedia
- When a ‘value’ matches in a sentence, the match is replaced by entity_‘attribute’
• Unigram (U) and Linguistic (L) features are clearly domain independent
[Figure: DBpedia (attribute: value) pairs for Batman Begins; sample trivia (Batman Begins)]
• Entity feature generation from DBpedia
• Example of entity features in the Celebrity domain:
FEATURE | ENTITY | TRIVIA
entity_partner | Johnny Depp | Engaged to Amber Heard [January 17, 2014].**
entity_citizenship | Nicole Kidman | First Australian actress to win the Best Actress Academy Award.
** After entity linking, the sentence is parsed as “Engaged to entity_partner”
WTM’s Domain Independence (Contd…)
Domain | DBpedia attribute:value | Feature generated
Movie (ex. Batman Begins (2005)) | Director: Christopher Nolan | entity_director
Movie (ex. Batman Begins (2005)) | Producer: Larry J. Franco | entity_producer
Celebrity (ex. Angelina Jolie) | Partner: Brad Pitt | entity_partner
Celebrity (ex. Angelina Jolie) | birthplace: California | entity_birthPlace
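The replacement of matched DBpedia values by entity_‘attribute’ tokens can be sketched as follows. This illustrates the idea only and is not the authors' implementation. The function name is hypothetical, the attribute:value pairs are passed in by hand instead of being fetched from DBpedia, and matching is done by exact case-insensitive string search, which is an assumption.

```python
import re


def link_entities(sentence, attribute_values):
    """Replace DBpedia values found in a sentence with entity_<attribute> tokens.

    attribute_values -- list of (attribute, value) pairs for the target entity
    Returns the rewritten sentence and the list of entity features that fired.
    """
    linked = sentence
    fired = []
    # Longer values first, so a short value cannot split a longer match.
    for attribute, value in sorted(attribute_values, key=lambda av: -len(av[1])):
        token = f"entity_{attribute}"
        pattern = re.compile(re.escape(value), flags=re.IGNORECASE)
        linked, n_matches = pattern.subn(lambda m: token, linked)
        if n_matches:
            fired.append(token)
    return linked, fired


# Celebrity-domain example from the slides (entity: Johnny Depp).
linked_sentence, entity_features = link_entities(
    "Engaged to Amber Heard [January 17, 2014].",
    [("partner", "Amber Heard")],
)
```

Because the feature name comes from the attribute and not from the value, the same code applies unchanged to movie attributes such as Director or Producer.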
Feature Contribution (Movie v/s Celeb.)
Movie Domain:
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | supPOS | Linguistic
17 | subj_actor | Linguistic
Celebrity Domain:
Rank | Feature | Group
1 | win | Unigram
3 | magazine | Unigram
4 | supPOS | Linguistic
5 | MONEY | Entity (NER)
6 | entity_alternativenames | Entity
7 | root_engage | Linguistic
14 | subj_earnings | Linguistic
15 | subj_entity_children | Linguistic + Entity
18 | entity_birthplace | Entity
19 | subj_unlinked_location | Linguistic + Entity
• Top features: our advanced features are useful and intuitive for humans too
• Entity linking leads to better generalization (instead of entity_wolverine, the model gets entity_cast)
Results: P@10 (Celebrity Domain)
[Figure: bar chart of P@10 by approach. Random: 0.39; supPOS (Best Case): 0.54; WTM (U): 0.58; WTM (U+L+E): 0.71]
• Again WTM (U+L+E) >> WTM (U)
- Shows the significance of the advanced (L) and (E) features
• Hence, the features and the approach are domain independent
- For entities of any domain, just replace the train data (sample trivia)
Dissertation Contribution
• Identified, defined and presented a novel research problem
- Not only a solution to an existing problem
• Proposed a domain-independent system, “Wikipedia Trivia Miner (WTM)”
- Mines the top-k trivia for any given entity, ranked by their interestingness
• Engineered features that capture the ‘about-ness’ of a sentence
- Generalize which sentences are interesting
• Proposed a mechanism to prepare ground truth for the test set
- Cost-effective but statistically significant
Future Works
• New features to increase ranking quality
- Unusualness: probability of occurrence of the sentence in the considered domain
- Fact popularity: lesser-known trivia could be more interesting to most people
• Trying deep learning
- Could be helpful, as in the case of sarcasm detection
• Generating questions from mined trivia
- To present trivia in question form
• Obtaining personalized interesting trivia
- In this dissertation work, interestingness was decided by majority voting; a future direction is ranking based on user demographics
Mining Interesting Trivia for Entities from Wikipedia PART-II

More Related Content

Similar to Mining Interesting Trivia for Entities from Wikipedia PART-II

GraphQL with Sangria
GraphQL with SangriaGraphQL with Sangria
GraphQL with Sangria
Anthony SSI YAN KAI
 
Kotlin for Android devs
Kotlin for Android devsKotlin for Android devs
Kotlin for Android devs
Adit Lal
 
Solving performance issues in Django ORM
Solving performance issues in Django ORMSolving performance issues in Django ORM
Solving performance issues in Django ORM
Sian Lerk Lau
 
PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...
predictionio
 
Whats New In C# 4 0 - NetPonto
Whats New In C# 4 0 - NetPontoWhats New In C# 4 0 - NetPonto
Whats New In C# 4 0 - NetPonto
Paulo Morgado
 
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
 IronPython and Dynamic Languages on .NET by Mahesh Prakriya IronPython and Dynamic Languages on .NET by Mahesh Prakriya
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
codebits
 
Automated evaluation of crowdsourced annotations in the cultural heritage domain
Automated evaluation of crowdsourced annotations in the cultural heritage domainAutomated evaluation of crowdsourced annotations in the cultural heritage domain
Automated evaluation of crowdsourced annotations in the cultural heritage domain
dreamgirl314
 
Recommending job ads to people
Recommending job ads to peopleRecommending job ads to people
Recommending job ads to people
Fabian Abel
 
Max Koretskyi "Why are Angular and React so fast?"
Max Koretskyi "Why are Angular and React so fast?"Max Koretskyi "Why are Angular and React so fast?"
Max Koretskyi "Why are Angular and React so fast?"
Fwdays
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
Narong Intiruk
 
OOP programming for engineering students
OOP programming for engineering studentsOOP programming for engineering students
OOP programming for engineering students
iaeronlineexm
 
ITU - MDD - XText
ITU - MDD - XTextITU - MDD - XText
ITU - MDD - XText
Tonny Madsen
 
PDC Video on C# 4.0 Futures
PDC Video on C# 4.0 FuturesPDC Video on C# 4.0 Futures
PDC Video on C# 4.0 Futures
nithinmohantk
 
Lerman Vvs14 Ef Tips And Tricks
Lerman Vvs14  Ef Tips And TricksLerman Vvs14  Ef Tips And Tricks
Lerman Vvs14 Ef Tips And Tricks
Julie Lerman
 
Towards JVM Dynamic Languages Toolchain
Towards JVM Dynamic Languages ToolchainTowards JVM Dynamic Languages Toolchain
Towards JVM Dynamic Languages Toolchain
Attila Szegedi
 
COMP111-Week-1_138439.pptx
COMP111-Week-1_138439.pptxCOMP111-Week-1_138439.pptx
COMP111-Week-1_138439.pptx
FarooqTariq8
 
Nltk sentiment analysis
Nltk sentiment analysisNltk sentiment analysis
Nltk sentiment analysis
Janu Jahnavi
 
Python for dummies
Python for dummiesPython for dummies
Python for dummies
Roberto Stefanetti
 
Secrets of Awesome JavaScript API Design
Secrets of Awesome JavaScript API DesignSecrets of Awesome JavaScript API Design
Secrets of Awesome JavaScript API Design
Brandon Satrom
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
krisztianbalog
 

Similar to Mining Interesting Trivia for Entities from Wikipedia PART-II (20)

GraphQL with Sangria
GraphQL with SangriaGraphQL with Sangria
GraphQL with Sangria
 
Kotlin for Android devs
Kotlin for Android devsKotlin for Android devs
Kotlin for Android devs
 
Solving performance issues in Django ORM
Solving performance issues in Django ORMSolving performance issues in Django ORM
Solving performance issues in Django ORM
 
PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...
 
Whats New In C# 4 0 - NetPonto
Whats New In C# 4 0 - NetPontoWhats New In C# 4 0 - NetPonto
Whats New In C# 4 0 - NetPonto
 
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
 IronPython and Dynamic Languages on .NET by Mahesh Prakriya IronPython and Dynamic Languages on .NET by Mahesh Prakriya
IronPython and Dynamic Languages on .NET by Mahesh Prakriya
 
Automated evaluation of crowdsourced annotations in the cultural heritage domain
Automated evaluation of crowdsourced annotations in the cultural heritage domainAutomated evaluation of crowdsourced annotations in the cultural heritage domain
Automated evaluation of crowdsourced annotations in the cultural heritage domain
 
Recommending job ads to people
Recommending job ads to peopleRecommending job ads to people
Recommending job ads to people
 
Max Koretskyi "Why are Angular and React so fast?"
Max Koretskyi "Why are Angular and React so fast?"Max Koretskyi "Why are Angular and React so fast?"
Max Koretskyi "Why are Angular and React so fast?"
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 
OOP programming for engineering students
OOP programming for engineering studentsOOP programming for engineering students
OOP programming for engineering students
 
ITU - MDD - XText
ITU - MDD - XTextITU - MDD - XText
ITU - MDD - XText
 
PDC Video on C# 4.0 Futures
PDC Video on C# 4.0 FuturesPDC Video on C# 4.0 Futures
PDC Video on C# 4.0 Futures
 
Lerman Vvs14 Ef Tips And Tricks
Lerman Vvs14  Ef Tips And TricksLerman Vvs14  Ef Tips And Tricks
Lerman Vvs14 Ef Tips And Tricks
 
Towards JVM Dynamic Languages Toolchain
Towards JVM Dynamic Languages ToolchainTowards JVM Dynamic Languages Toolchain
Towards JVM Dynamic Languages Toolchain
 
COMP111-Week-1_138439.pptx
COMP111-Week-1_138439.pptxCOMP111-Week-1_138439.pptx
COMP111-Week-1_138439.pptx
 
Nltk sentiment analysis
Nltk sentiment analysisNltk sentiment analysis
Nltk sentiment analysis
 
Python for dummies
Python for dummiesPython for dummies
Python for dummies
 
Secrets of Awesome JavaScript API Design
Secrets of Awesome JavaScript API DesignSecrets of Awesome JavaScript API Design
Secrets of Awesome JavaScript API Design
 
Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
 

Recently uploaded

Constructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective CommunicationConstructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective Communication
Chevonnese Chevers Whyte, MBA, B.Sc.
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Leena Ghag-Sakpal
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
eBook.com.bd (প্রয়োজনীয় বাংলা বই)
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
haiqairshad
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
BoudhayanBhattachari
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
 
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
TechSoup
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
MysoreMuleSoftMeetup
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
Nguyen Thanh Tu Collection
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
 
math operations ued in python and all used
math operations ued in python and all usedmath operations ued in python and all used
math operations ued in python and all used
ssuser13ffe4
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
Celine George
 
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptxNEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
iammrhaywood
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Denish Jangid
 

Recently uploaded (20)

Constructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective CommunicationConstructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective Communication
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
 
math operations ued in python and all used
math operations ued in python and all usedmath operations ued in python and all used
math operations ued in python and all used
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptxNEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
 

Mining Interesting Trivia for Entities from Wikipedia PART-II

  • 1. Mining Interesting Trivia for Entities from Wikipedia Supervised By: Presented By: Dr. Dhaval Patel, Assistant Professor, IIT Roorkee Abhay Prakash, En. No. - 10211002, IIT Roorkee Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India
  • 2. Publication Accepted [1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: “Did you know?- Mining Interesting Trivia for Entities from Wikipedia”. In 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015. Conference Rating: A*
  • 3. Introduction: Problem Statement Definition: Trivia are any facts about an entity which are interesting due to any of the following characteristics - unusualness, uniqueness, unexpectedness or weirdness.  Generally appear in “Did you know?” articles  E.g. “To prepare for Joker’s role, Heath Ledger secluded himself in a hotel room for a month” [Batman Begins]  Unusual for an actor/human to seclude himself for a month Problem Statement: For a given entity, mine top-k interesting trivia from its Wikipedia page, where a trivia is considered interesting if when it is shown to 𝑁 persons, more than 𝑁/2 persons find it interesting.  For evaluation of unseen set, we chose 𝑁 = 5 (statistical significance discussed in mid evaluation)
  • 4. Wikipedia Trivia Miner (WTM)  Based on ML approach to mine trivia from unstructured text  Trains a ranker using sample trivia of target domain  Experiment with Movie entities and Celebrity entities  Harness trained ranker to mine Trivia from entity’s Wikipedia page  Retrieves Top-k standalone interesting sentences from entity’s page  Why Wikipedia?  Reliable for factual correctness  Ample # of interesting trivia (56/100 in expt.)
  • 5. System Architecture  Filtering & Grading  Filters out noisy samples  Give a grade to each sample, as reqd. by ranker  Interestingness Ranker  Extracts features from the samples/candidates  Trains ranker(SVMrank)/Ranks candidates  Candidate Selection  Identifies candidates from Wikipedia Candidate Selection Human Voted Trivia Source Train Dataset Candidates’ Source Top-K Interesting Trivia from Candidates Wikipedia Trivia Miner (WTM) Interestingness Ranker Filtering & Grading Feature Extraction Feature ExtractionSVMrank Knowledge Base
  • 6. Candidate Selection Candidates’ Source Top-K Interesting Trivia from Candidates Feature ExtractionSVMrank Knowledge Base Retrieval Phase Human Voted Trivia Source Train Dataset Filtering & Grading Feature Extraction SVMrank Train Phase Model Execution Phases  Train Phase  Crawls and prepares train data  Featurize the train data  Trains SVMrank to build a model  Retrieval Phase  Crawls entity’s Wikipedia text  Identify candidates for trivia  Featurize the candidates  Rank the candidates using already built model
  • 7. Feature Engineering Bucket Feature Significance Sample features Example Trivia Unigram (U) Features Each word’s TF-IDF Identify imp. words which make the trivia interesting “stunt”, “award”, “improvise” “Tom Cruise did all of his own stunt driving.” Linguistic (L) Features Superlative Words Shows the extremeness (uniqueness) “best”, “longest”, “first” “The longest animated Disney film since Fantasia (1940).” Contradictory Words Opposing ideas could spark intrigue and interest “but”, “although”, “unlike” “The studios wanted Matthew McConaughey for lead role, but James Cameron insisted on Leonardo DiCaprio.” Root Word (Main Verb) Captures core activity being discussed in the sentence root_gross “Gravity grossed $274 Mn in North America” Subject Word (First Noun) Captures core thing being discussed in the sentence subj_actor “The actors snorted crushed B vitamins for scenes involving cocaine” Readability Complex and lengthy trivia are hardly interesting FOG Index binned in 3 bins ---
  • 8. Feature Engineering (Contd…)
      (Table columns: Bucket, Feature, Significance, Sample features, Example trivia)
      - Entity (E) Features
          - Generic NEs
              Significance: captures general about-ness
              Sample features: MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION
              Example: “The guns in the film were supplied by Aldo Uberti Inc., a company in Italy.”
              Features fired: ORGANIZATION and LOCATION
          - Related Entities
              Significance: captures specific about-ness (entities resolved using DBpedia)
              Sample features: entity_producer, entity_director
              Example: “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.”
              Features fired: entity_producer, entity_character
          - Entity Linking before (L) Parsing
              Significance: captures the generalized story of the sentence
              Sample features: subj_entity_producer
              Example: [the same trivia as above], parsed as “According to entity_producer, …”
              Features fired: subj_Victoria → subj_entity_producer
          - Focus Entities
              Significance: captures the core entities being talked about
              Sample features: underroot_entity_producer
              Example: [the same trivia as above]
              Features fired: underroot_entity_producer, underroot_entity_character
  • 9. Feature Engineering: Example
      Ex. “According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX.”
      - Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total for all train data
      - Non-zero and selected feature values for this sentence:
          create = 0.25, mix = 0.75, motion = 0.96, capture = 0.4, rotomation = 0.85, VFX = 0.75
          root_create = 1, supPOS = 0, subj_entity_producer = 1, FOG = 3
          contradictory = 0, entity_producer = 1, entity_character = 1, underroot_entity_producer = 1, underroot_entity_character = 1
      - The rest of the features have value 0:
          entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, …
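Since most of the roughly 22,000 columns are 0 for any one sentence, a sparse representation is the natural fit. The sketch below serializes a few of the values from this slide into the `<grade> qid:<id> <index>:<value> ...` line format that SVMrank reads, with feature indices in ascending order and zeros omitted. The helper names, the way indices are assigned (alphabetical over feature names), the grade of 2, and the query id of 1 are all assumptions for illustration; the slides do not say how WTM numbers its features or groups its samples.

```python
def build_index(feature_names):
    """Assign 1-based column indices to feature names (sorted for stability)."""
    return {name: i for i, name in enumerate(sorted(feature_names), start=1)}


def to_svmrank_line(grade, qid, features, index):
    """Format one sample as an SVMrank input line.

    grade    -- target value (higher means more interesting)
    qid      -- query id; only samples sharing a qid are compared by the ranker
    features -- dict of feature name -> value
    index    -- dict of feature name -> 1-based column index
    """
    # Keep only non-zero features and sort them by column index.
    pairs = sorted(
        (index[name], value) for name, value in features.items() if value != 0
    )
    fields = [str(grade), "qid:%d" % qid] + ["%d:%g" % pair for pair in pairs]
    return " ".join(fields)


# A subset of the feature values shown on the slide.
sample = {"create": 0.25, "root_create": 1, "supPOS": 0, "FOG": 3}
index = build_index(sample)
line = to_svmrank_line(grade=2, qid=1, features=sample, index=index)
```

A file of such lines is what `svm_rank_learn` takes as training input.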
  • 10. Comparative Approaches
      I. Random [Baseline I]
          - 10 sentences picked randomly from Wikipedia
      II. CS + Random
          - Candidates selected (standalone, context-independent sentences), i.e., sentences like “it really reminds me of my childhood” are removed
          - 10 sentences picked randomly from the candidates
      III. CS + supPOS (Best) [Baseline II]
          - Candidates selected
          - Ranked by number of superlative words
          - Best case: among sentences with the same number of superlative words, the interesting ones are deliberately ranked first
      supPOS (Best Case) example ranking (Rank: number of superlative words, class):
          1: 2, Interesting
          2: 2, Boring
          3: 1, Interesting
          4: 1, Interesting
          5: 1, Interesting
          6: 1, Boring
          7: 1, Boring
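The supPOS (Best Case) ordering in the table above can be reproduced with a two-part sort key: more superlative words first, and within a tie, interesting sentences before boring ones. This is a sketch of the baseline as described on the slide; the function name and the tuple layout are assumptions.

```python
def suppos_best_case(candidates):
    """Rank (num_superlatives, is_interesting) pairs in best-case supPOS order.

    Primary key: number of superlative words, descending.
    Tie-break:   interesting candidates first (the deliberately optimistic case).
    """
    return sorted(candidates, key=lambda c: (-c[0], not c[1]))


# The seven candidates from the slide's table, in shuffled order.
slide_rows = [
    (1, False), (2, False), (1, True), (2, True),
    (1, True), (1, False), (1, True),
]
ranked = suppos_best_case(slide_rows)
```

Because the tie-break uses the ground-truth label, this is an upper bound on what superlative counting alone can achieve, not a usable ranker.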
  • 11. Variants of WTM
      I. WTM (U)
          - Candidates selected
          - ML ranking of candidates using only Unigram features
      II. WTM (U+L+E)
          - Candidates selected
          - ML ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)
  • 12. Results: P@10
      - Metric is Precision at 10 (P@10): out of the top 10 ranked candidates, how many are actually interesting
      [Bar chart: P@10 by approach]
          Random: 0.25
          CS+Random: 0.30
          supPOS (Best Case): 0.34
          WTM (U): 0.34
          WTM (U+L+E): 0.45
  • 13. Results: P@10 (Contd…)
      - CS+Random > Random
          - Shows the significance of Candidate Selection
  • 14. Results: P@10 (Contd…)
      - WTM (U+L+E) >> WTM (U)
          - Shows the significance of the engineered Linguistic (L) and Entity (E) features
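For reference, the P@10 metric used in these results can be computed as below. This is a generic sketch of precision at k, with a hypothetical function name and a made-up label list; it is not the authors' evaluation script.

```python
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked candidates that are interesting.

    ranked_labels -- list of 0/1 (or False/True) labels, best-ranked first
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Candidates beyond the end of a short list count as misses.
    return sum(1 for label in ranked_labels[:k] if label) / k


# Made-up labels for one entity's top 10: five of them are interesting.
example_labels = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
example_p10 = precision_at_k(example_labels, k=10)
```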
  • 15. Results: Recall@K
      - supPOS is limited to one kind of trivia
      - WTM captures varied types
          - 62% recall by rank 25
      - Performance comparison
          - supPOS is better up to rank 3
          - Soon after rank 3, WTM beats supPOS
      [Line chart: % Recall vs. Rank (0 to 25) for supPOS (Best Case), WTM, and Random]
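Recall@K asks a different question from P@10: of all the interesting trivia available for an entity, what share has been surfaced by rank K? A minimal sketch, with a hypothetical function name; the example labels are the seven rows of the supPOS (Best Case) table from the Comparative Approaches slide (Interesting = 1, Boring = 0), treated here as if they were the full candidate list.

```python
def recall_at_k(ranked_labels, k):
    """Share of all interesting candidates that appear in the top k.

    ranked_labels -- list of 0/1 labels for every candidate, best-ranked first
    """
    total_interesting = sum(1 for label in ranked_labels if label)
    if total_interesting == 0:
        return 0.0
    found = sum(1 for label in ranked_labels[:k] if label)
    return found / total_interesting


suppos_labels = [1, 0, 1, 1, 1, 0, 0]
recall_at_3 = recall_at_k(suppos_labels, 3)
recall_at_7 = recall_at_k(suppos_labels, 7)
```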
  • 16. Sensitivity to Training Size
      - Current results are reported with 6163 train trivia
      - WTM precision increases with train size
          - A desirable property, as precision can be improved by adding more train data
  • 17. WTM’s Domain Independence
      - Experiment on the Celebrity domain to justify the claim of domain independence
      - Dataset:
          - Crawled trivia for the top 1000 movie celebrities from IMDB and ran a 5-fold test
          - Train dataset: 4459 trivia (106 entities)
          - Test dataset: 500 trivia (10 entities)
      - Feature group suspected of being domain dependent: Entity Features
      Feature groups:
          Unigram (U) Features: all words
          Linguistic (L) Features: subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2
          Entity (E) Features: entity_producer, entity_director, …
  • 18. WTM’s Domain Independence (Contd…)
      - Entity features are domain independent too
          - Entity features are automatically generated using attribute:value pairs in DBpedia
          - When a ‘value’ matches in the sentence, the match is replaced by entity_‘attribute’
      - Unigram (U) and Linguistic (L) features are clearly domain independent
      [Figure: DBpedia (attribute: value) pairs for Batman Begins; sample trivia (Batman Begins)]
  • 20. WTM’s Domain Independence (Contd…)
      - Entity feature generation from DBpedia
          Movie Domain (ex. Batman Begins (2005)):
              Director: Christopher Nolan → entity_director
              Producer: Larry J. Franco → entity_producer
          Celebrity Domain (ex. Angelina Jolie):
              Partner: Brad Pitt → entity_partner
              birthplace: California → entity_birthPlace
      - Example of entity features in the Celebrity domain (Feature, Entity, Trivia):
          entity_partner, Johnny Depp: “Engaged to Amber Heard [January 17, 2014].” **
          entity_citizenship, Nicole Kidman: “First Australian actress to win the Best Actress Academy Award.”
          ** After Entity Linking, the sentence is parsed as “Engaged to entity_partner”
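The replacement step described on these slides (a matched DBpedia ‘value’ becomes entity_‘attribute’) can be sketched with plain string substitution. The (attribute, value) pairs below are hand-written stand-ins for what a DBpedia lookup might return, and `link_entities` is a hypothetical helper; how WTM actually matches values (aliases, partial names, disambiguation) is not described here.

```python
import re


def link_entities(sentence, attribute_values):
    """Replace each known value in the sentence with entity_<attribute>.

    attribute_values -- list of (attribute, value) pairs for one entity
    """
    # Replace longer values first so a short value cannot clobber part of a longer one.
    ordered = sorted(attribute_values, key=lambda pair: len(pair[1]), reverse=True)
    for attribute, value in ordered:
        pattern = re.compile(re.escape(value), flags=re.IGNORECASE)
        sentence = pattern.sub("entity_" + attribute, sentence)
    return sentence


# Toy pairs (assumed) for the Johnny Depp example on this slide.
celebrity_pairs = [("partner", "Amber Heard")]
linked_celebrity = link_entities(
    "Engaged to Amber Heard [January 17, 2014].", celebrity_pairs
)

# Toy pairs (assumed) for the movie example used on the feature engineering slides.
movie_pairs = [
    ("producer", "Victoria Alonso"),
    ("character", "Rocket Raccoon"),
    ("character", "Groot"),
]
linked_movie = link_entities(
    "According to Victoria Alonso, Rocket Raccoon and Groot were created "
    "through a mix of motion-capture and rotomation VFX.",
    movie_pairs,
)
```

Only the attribute:value pairs change between domains, which is the basis of the domain-independence claim.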
  • 21. Feature Contribution (Movie v/s Celeb.)
      Movie Domain (Rank: Feature, Group):
          1: subj_scene, Linguistic
          2: subj_entity_cast, Linguistic + Entity
          3: entity_produced_by, Entity
          4: underroot_unlinked_organization, Linguistic + Entity
          6: root_improvise, Linguistic
          7: entity_character, Entity
          8: MONEY, Entity (NER)
          14: stunt, Unigram
          16: superPOS, Linguistic
          17: subj_actor, Linguistic
      Celebrity Domain (Rank: Feature, Group):
          1: win, Unigram
          3: magazine, Unigram
          4: superPOS, Linguistic
          5: MONEY, Entity (NER)
          6: entity_alternativenames, Entity
          7: root_engage, Linguistic
          14: subj_earnings, Linguistic
          15: subj_entity_children, Linguistic + Entity
          18: entity_birthplace, Entity
          19: subj_unlinked_location, Linguistic + Entity
      - Top features: our advanced features are useful and intuitive for humans too
      - Entity Linking leads to better generalization (instead of entity_wolverine, the model gets entity_cast)
  • 22. Results: P@10 (Celebrity Domain)
      [Bar chart: P@10 by approach]
          Random: 0.39
          supPOS (Best Case): 0.54
          WTM (U): 0.58
          WTM (U+L+E): 0.71
      - Again WTM (U+L+E) >> WTM (U)
          - Shows the significance of the advanced (L) and (E) features
      - Hence, the features and the approach are domain independent
          - For entities of any domain, just replace the train data (sample trivia)
  • 23. Dissertation Contribution
      - Identified and defined a novel research problem
          - Rather than only providing a solution to an existing problem
      - Proposed a domain-independent system, “Wikipedia Trivia Miner (WTM)”
          - Mines the top-k trivia for any given entity, ranked by interestingness
      - Engineered features that capture the ‘about-ness’ of a sentence
          - The model generalizes over which kinds of sentences are interesting
      - Proposed a mechanism to prepare ground truth for the test set
          - Cost-effective but statistically significant
  • 24. Future Work
      - New features to increase ranking quality
          - Unusualness: probability of occurrence of the sentence in the considered domain
          - Fact popularity: lesser-known trivia could be more interesting to most people
      - Trying Deep Learning
          - Could be helpful, as in the case of sarcasm detection
      - Generating questions from mined trivia
          - To present trivia in question form
      - Obtaining personalized interesting trivia
          - In this dissertation work, interestingness was decided by majority voting; a future system could rank based on user demographics