Graph Based Facet Selection
NICK YU, LEON ZHANG, ANTON EMELYANOV
Why Facet Selection? 
Obviously, our graph has errors 
This is because our sources have errors 
Ideally, when we have more data from more 
sources, our data correctness should improve (but 
it doesn’t) 
Having more data ≠ Having more knowledge 
If our precision tier needs to have a low error rate, 
we need a way to filter out errors
Source Based Facet Selection 
All facets from one data source share the same confidence
Aggregate the confidence if multiple data sources have the same facet
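The slides do not say how the per-source confidences are combined. As a minimal sketch, assuming the sources err independently, a noisy-OR aggregation could look like the following. The source names and confidence values are hypothetical, and noisy-OR itself is an assumption rather than the authors' stated method.

```python
from math import prod

# Hypothetical per-source confidences; the deck gives no actual values.
SOURCE_CONFIDENCE = {"source_a": 0.9, "source_b": 0.7, "source_c": 0.6}


def aggregate_confidence(sources):
    """Noisy-OR: the facet is wrong only if every asserting source is wrong."""
    return 1.0 - prod(1.0 - SOURCE_CONFIDENCE[s] for s in sources)


# One source alone keeps its own confidence; agreement raises it.
single = aggregate_confidence(["source_b"])
combined = aggregate_confidence(["source_a", "source_b"])
```

Every facet from a given source gets the same score here, which is the limitation the next slide raises.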
However… 
As we add more sources, it is impossible to 
have a score per source per property 
Even if we trust a source highly, its individual 
facts cannot be 100% correct 
The selection process does not take account of 
information we already know 
[Figure: Mari Henmi and Emiri Henmi, linked by two conflicting Children edges, one in each direction]
Taking Advantage of Prior Knowledge 
When seen in isolation, it’s hard to know 
whether Mari is Emiri’s mother or child 
However, if we know the following facts (each 
with some probability of being true), then the 
job is a lot easier: 
◦ Emiri’s other parent, Teruhiko, is Mari’s husband
◦ Emiri’s sibling, Noritaka, is Mari’s child 
[Figure: the same graph with prior knowledge added. Nodes: Mari Henmi, Teruhiko Saigo, Noritaka Henmi, Emiri Henmi. Edge labels: Spouse, Children, Sibling]
Inspired by Google’s Knowledge Vault concept
Graph Based Selection using Prior Knowledge
We can generalize the model as follows. Given a triple $(s, p, o)$, let $r_1, \dots, r_n$ be all possible paths that connect $s$ to $o$, including reverse edges and multiple hops. The probability of that triple being true is:
$$P(s, p, o) = P(p \mid r_1, r_2, \dots, r_n)$$
In particular, for any given triple $(s, p, o)$, here with $s$ = Mari and $o$ = Emiri, we first find all the paths $\mathbf{r} = (r_1, \dots, r_n)$ from $s$ to $o$.
◦ Examples include $r_1 = A \xrightarrow{\text{spouse}} \cdot \xrightarrow{\text{child}} B$, $r_2 = A \xrightarrow{\text{child}} \cdot \xrightarrow{\text{sibling}} B$, etc.
Then we assign the weights $\mathbf{w} = (w_1, \dots, w_n)$ to the paths and calculate the prediction as:
$$F(p \mid \mathbf{r}) = \frac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{r})}$$
Several linear models are possible. We find that logistic regression performs very well
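The prediction formula above can be sketched in a few lines. This is an illustration under assumptions: the path names and weights are made up, each path is treated as a 0/1 indicator feature, and the optional `bias` argument is not part of the slide's formula (it is included because the weight tables later in the deck list a bias term).

```python
import math


def score_triple(path_features, weights, bias=0.0):
    """Sigmoid of the weighted sum of path features (w . r, plus an optional bias)."""
    z = bias + sum(
        weights.get(path, 0.0) * value for path, value in path_features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for three paths; real weights come from training.
weights = {"->spouse->child": 2.0, "->child->sibling": 1.5, "<-child": -3.0}

# Paths observed between s and o, encoded as 0/1 indicators.
features = {"->spouse->child": 1, "->child->sibling": 1}

p = score_triple(features, weights)
```

With no observed paths and no bias the score is exactly 0.5, so the model is only as informative as the paths it can find.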
Implementation 
Steps to calculate this score: 
1. Create training set and test set 
2. Mine all the possible paths from S to O 
3. Treat each path as a feature and train a model 
Simple, right?
Training and Test Set 
Need to contain: 
◦ Positive examples – easy 
◦ Negative examples – how? 
Local Closed World Assumption: 
◦ For a given entity, if we know the values of a property, then we know all values of that property 
◦ More concretely, if we already know that Tom Cruise has three children, then any other entity is unlikely to be his fourth child – this is a possible negative example
◦ However, if we don’t know his children at all, then we cannot say who is not his child – this is not a possible negative example
◦ Remember that we only need LCWA to be true often enough to generate negative examples; it doesn’t have to be 100% true.
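The LCWA rule above can be sketched as a three-way labeling function: positive, negative, or no label. The dictionary layout and the placeholder child names are assumptions made for illustration only.

```python
def lcwa_label(known_values, subject, prop, candidate):
    """Label a candidate under LCWA: True, False, or None when nothing is known."""
    values = known_values.get((subject, prop))
    if not values:
        # No known values for this property: LCWA says nothing, so no label.
        return None
    # Known values exist: anything outside them is treated as a negative.
    return candidate in values


# Placeholder names; only the count of three children comes from the slide.
known = {("Tom Cruise", "children"): {"child_1", "child_2", "child_3"}}

positive = lcwa_label(known, "Tom Cruise", "children", "child_1")
negative = lcwa_label(known, "Tom Cruise", "children", "someone_else")
unknown = lcwa_label(known, "Unknown Person", "children", "someone_else")
```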
What Negative Examples to Choose 
Are all entities that violate LCWA also good candidates for negative examples?
Randomly pick any entity? 
◦ This cannot work because paths between any random pair of entities are very sparse 
◦ The classifier will learn to classify the existence of connection between two entities and not the right 
kind of connections 
The related entities of positive examples? 
◦ Choose A-B to be a negative example if A-C is a positive example and B and C are related entities 
◦ The connection between A-C is still very sparse 
All neighbor entities? 
◦ Choose A-B to be a negative example if A is already connected to B and A-B is not a positive example 
◦ Need to make sure B has the same type as the expected type of the property
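The neighbor-based option can be sketched as follows. The edge list, the type map, and the function name are hypothetical; the logic is only what the bullets state: B must already be connected to A, A-B must not be a positive example, and B must have the property's expected type.

```python
def neighbor_negatives(subject, prop, edges, positives, entity_type, expected_type):
    """Neighbors of `subject` with the expected type that are not positives for `prop`."""
    # A neighbor is any entity one edge away, in either direction.
    neighbors = {o for s, _, o in edges if s == subject}
    neighbors |= {s for s, _, o in edges if o == subject}
    return sorted(
        b
        for b in neighbors
        if (subject, prop, b) not in positives and entity_type.get(b) == expected_type
    )


# Toy graph with placeholder entity names.
edges = [
    ("A", "children", "C"),
    ("A", "spouse", "B"),
    ("A", "place_of_birth", "X"),
    ("D", "sibling", "A"),
]
positives = {("A", "children", "C")}
entity_type = {"B": "person", "C": "person", "D": "person", "X": "location"}

negatives = neighbor_negatives("A", "children", edges, positives, entity_type, "person")
```

Here C is skipped because A-C is a positive example, and X is skipped because a location cannot be someone's child.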
All Paths (Rules) That Connect S to O 
We find the following paths between any two entities: 
◦ 1 hop: All forward and backward edges 
◦ 2 hops: All edges including f/f, f/b, b/f, b/b directions 
◦ Excluding intermediate hub entities 
◦ We use bidirectional search to speed up the job 
◦ The performance breaks down beyond two hops – this can be improved
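A small in-memory sketch of the one-hop and two-hop enumeration is shown below. It is a plain forward expansion, not the bidirectional search the authors use, and the short property names are placeholders. Path strings follow the `->p` / `<-p` notation that appears in the rule tables.

```python
from collections import defaultdict


def build_index(edges):
    """Map each entity to its steps: ('->p', o) for forward and ('<-p', s) for backward edges."""
    index = defaultdict(list)
    for s, p, o in edges:
        index[s].append(("->" + p, o))
        index[o].append(("<-" + p, s))
    return index


def mine_paths(index, s, o, hubs=frozenset()):
    """Return all 1-hop and 2-hop paths from s to o, skipping hub intermediates."""
    paths = set()
    for step1, mid in index[s]:
        if mid == o:
            paths.add(step1)  # 1 hop, forward or backward
        if mid in hubs or mid == s or mid == o:
            continue  # never route a 2-hop path through a hub or an endpoint
        for step2, end in index[mid]:
            if end == o:
                paths.add(step1 + step2)  # 2 hops: f/f, f/b, b/f or b/b
    return paths


edges = [
    ("Mari", "spouse", "Teruhiko"),
    ("Teruhiko", "children", "Emiri"),
    ("Mari", "children", "Noritaka"),
    ("Noritaka", "sibling", "Emiri"),
    ("Mari", "children", "Emiri"),
]
index = build_index(edges)
forward = mine_paths(index, "Mari", "Emiri")
backward = mine_paths(index, "Emiri", "Mari")
```

Passing `hubs={"Teruhiko"}` would drop the spouse path, which is how highly connected intermediate entities can be excluded.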
(Some) Rules Trained for Marriage Property
Rule Precision Recall F1 
->people.person.children<-people.person.children 0.990501 0.254667 0.405163 
->people.person.children->people.person.parent 0.995007 0.25403 0.40473 
<-people.person.parent<-people.person.children 0.991585 0.253872 0.404246 
<-people.person.parent->people.person.parent 0.995389 0.253263 0.403788 
->film.actor.film<-film.actor.film 0.210501 0.044027 0.072822 
<-film.film.actor->film.film.actor 0.210041 0.043981 0.072733 
->film.actor.film<-film.actor.performance--film.performance.film 0.211015 0.043931 0.072722 
->film.actor.film->film.film.performance--film.performance.actor 0.211004 0.043931 0.072721 
<-film.film.actor<-film.actor.film 0.210119 0.043965 0.072715 
->film.actor.film->film.film.actor 0.210183 0.043959 0.072711 
<-film.film.actor<-film.actor.performance--film.performance.film 0.210729 0.043914 0.072681 
<-film.film.actor->film.film.performance--film.performance.actor 0.210711 0.043914 0.07268
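The F1 column is the harmonic mean of precision and recall. The sketch below recomputes it for two rows of the table as a sanity check; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First family rule and first film rule from the table above.
family_f1 = f1_score(0.990501, 0.254667)
film_f1 = f1_score(0.210501, 0.044027)
```

The family rules are precise but cover only about a quarter of marriages, while the co-starring rules are both imprecise and rare, which is why their F1 is so much lower.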
Train Final Models for Each Property 
Given the paths (rules 푟1, … , 푟푛) for each property, we train a logistic regression model for each 
property p: 
푃 푝 = 퐿푅 푟1, 푟2, … , 푟푛 
How to map the rules to a feature vector? 
◦ There are 90,000 distinct possible paths between any two given entities. This maps to a feature vector of 90,000 dimensions.
◦ There could be more paths as we grow our graph. How do we assign dimensions for new paths? 
Our solution – hash kernels 
◦ Project the feature space down to a 1,500 dimensional hash space 
◦ Learn the model on the hashed feature space 
◦ Use L1 regularization to get rid of useless features 
◦ Collisions are handled to some degree by the hash kernel itself. Additional collisions are handled by using multiple hash kernels
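A minimal sketch of the hashing step is given below. The 1,500-bucket size comes from the slide; the use of MD5, the signed-hash trick, and the `seed` argument (standing in for "multiple hash kernels") are assumptions, since the deck does not describe the hash function itself.

```python
import hashlib

NUM_BUCKETS = 1500  # hashed feature space size from the slide


def bucket_and_sign(path, seed=0):
    """Map a path string to a (bucket index, +1/-1 sign) pair, deterministically."""
    digest = hashlib.md5(f"{seed}:{path}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    sign = 1.0 if value & 1 else -1.0
    return (value >> 1) % NUM_BUCKETS, sign


def hash_features(paths, seed=0):
    """Project a set of path strings into a fixed-length signed feature vector."""
    vector = [0.0] * NUM_BUCKETS
    for path in paths:
        index, sign = bucket_and_sign(path, seed)
        vector[index] += sign
    return vector


paths = [
    "->people.person.children<-people.person.children",
    "<-people.person.parent->people.person.parent",
]
vector = hash_features(paths)
```

A path first seen after training still maps to one of the 1,500 buckets, so the feature vector never has to grow with the graph. Vectors built with different seeds could be concatenated so that two paths colliding under one seed are likely separated under another.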
Mari Henmi & Emiri Henmi
[Figure: Mari Henmi and Emiri Henmi, linked by a Children edge in each direction]
Is Mari a child of Emiri's? 
Rule Weight 
Bias -3.38435 
<-people.person.children 
<-people.person.marriage--time.event.person<-people.person.parent -0.03988 
<-people.person.marriage--time.event.person->people.person.children -0.3237 
<-people.person.parent 
<-people.person.parent<-people.person.sibling--people.sibling_relationship.sibling 
<-people.person.parent<-people.person.siblings 
<-people.person.parent->people.person.sibling--people.sibling_relationship.sibling 
<-people.person.parent->people.person.siblings -0.3237 
->people.person.children 
->people.person.children<-people.person.sibling--people.sibling_relationship.sibling -0.03796 
->people.person.children<-people.person.siblings 
->people.person.children->people.person.sibling--people.sibling_relationship.sibling -0.12332 
->people.person.children->people.person.siblings -0.03463 
->people.person.marriage--time.event.person<-people.person.parent -0.4855 
->people.person.marriage--time.event.person->people.person.children -0.06937 
->people.person.parent 
Total -4.8224 
Sigmoid 0.007983 
Is Emiri a child of Mari's? 
Rule Weight 
Bias -3.38435 
<-people.person.children 
<-people.person.children<-people.person.marriage--time.event.person 3.043293 
<-people.person.children->people.person.marriage--time.event.person 1.556977 
<-people.person.parent 
<-people.person.sibling--people.sibling_relationship.sibling<-people.person.children 1.72802 
<-people.person.sibling--people.sibling_relationship.sibling->people.person.parent 1.194149 
<-people.person.siblings<-people.person.children 0.369578 
<-people.person.siblings->people.person.parent 0.436715 
->people.person.children 
->people.person.parent 
->people.person.parent<-people.person.marriage--time.event.person 3.05227 
->people.person.parent->people.person.marriage--time.event.person 1.445125 
->people.person.sibling--people.sibling_relationship.sibling<-people.person.children 1.386518 
->people.person.sibling--people.sibling_relationship.sibling->people.person.parent 0.989205 
->people.person.siblings<-people.person.children 1.365237 
->people.person.siblings->people.person.parent 0.827563 
Total 14.0103 
Sigmoid 0.99999
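The two totals can be reproduced by adding the bias to the weights of the rules that fired (the rows with a weight) and applying the sigmoid. The weights below are copied from the two tables; only the variable names are invented.

```python
import math

BIAS = -3.38435

# Weights of the rules that fired for "Is Mari a child of Emiri's?"
MARI_CHILD_OF_EMIRI = [
    -0.03988, -0.3237, -0.3237, -0.03796,
    -0.12332, -0.03463, -0.4855, -0.06937,
]

# Weights of the rules that fired for "Is Emiri a child of Mari's?"
EMIRI_CHILD_OF_MARI = [
    3.043293, 1.556977, 1.72802, 1.194149, 0.369578, 0.436715,
    3.05227, 1.445125, 1.386518, 0.989205, 1.365237, 0.827563,
]


def sigmoid(z):
    """Logistic function mapping a total score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


total_a = BIAS + sum(MARI_CHILD_OF_EMIRI)
total_b = BIAS + sum(EMIRI_CHILD_OF_MARI)
prob_a = sigmoid(total_a)
prob_b = sigmoid(total_b)
```

The first probability matches the 0.007983 on the slide. The second comes out slightly above the 0.99999 shown, so the slide value appears to be truncated.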
Measurement 
We measure the trained models on a separate 
hold-out set 
Precision = True Positives / Predicted Positives 
Recall = True Positives / Labeled Positives 
Most models have high precision but not such high recall
This is because the model can’t reason about 
shallow entities 
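The two definitions above translate directly into code. The toy labels below are invented to show the typical pattern described here, a confident model that finds few of the labeled positives.

```python
def precision_recall(labels, predictions):
    """Precision = TP / predicted positives; recall = TP / labeled positives."""
    true_positives = sum(1 for y, y_hat in zip(labels, predictions) if y and y_hat)
    predicted_positives = sum(1 for y_hat in predictions if y_hat)
    labeled_positives = sum(1 for y in labels if y)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / labeled_positives if labeled_positives else 0.0
    return precision, recall


# Invented hold-out set: four true facets, two false ones.
labels = [True, True, True, True, False, False]
# The model accepts only the one facet it has enough paths for.
predictions = [True, False, False, False, False, False]

precision, recall = precision_recall(labels, predictions)
```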
Property Precision Recall
automotive.automotive_class.related 0.997171 0.999055 
automotive.trim_level.model_year 1 1 
automotive.trim_level.option_package--automotive.option_package.trim_levels 1 0.924326 
automotive.trim_level.related_trim_level 1 1 
award.nominated_work.nomination--award.nomination.nominee 0.911385 0.327528 
award.nominee.award_nominations--award.nomination.nominated_work 0.852335 0.265277 
award.winner.awards_won--award.honor.winner 0.773061 0.694073 
award.winning_work.honor--award.honor.winner 0.907776 0.186617 
education.school.school_district 0.983673 0.598015 
film.actor.film 0.967154 0.981438 
film.director.film 0.674589 0.81029 
film.film.actor 0.991635 0.981757 
film.film.art_director 0.621622 0.042048 
film.film.country 0.86165 0.776492 
film.film.director 0.705793 0.821246 
film.film.editor 0.886905 0.060105 
film.film.language 1 0.780943 
film.film.music 0.945946 0.01992 
film.film.performance--film.performance.actor 0.620755 0.07921 
film.film.producer 0.492865 0.052533 
film.film.production_company 0.94723 0.187467 
film.film.story 0.976492 0.159292 
film.film.writer 0.795948 0.787734 
film.producer.film 0.829213 0.317553 
film.writer.film 0.704711 0.77448 
music.artist.track_contributions--music.track_contribution.track 0.963513 0.835044 
music.track.artist 0.955354 0.975672 
music.track.producer 0.76477 0.705882 
organization.organization.headquarters--location.address.city_entity 0.865854 0.022955 
organization.organization.headquarters--location.address.subdivision_entity 0.811475 0.037106 
people.deceased_person.place_of_death 0.916096 0.075246 
people.person.children 0.998595 0.917393 
people.person.marriage--time.event.person 0.996065 0.798455 
people.person.nationality 0.95238 0.9496 
people.person.parent 0.998061 0.916419 
people.person.place_of_birth 0.966238 0.344751 
people.person.sibling--people.sibling_relationship.sibling 0.993952 0.909814 
people.person.siblings 0.998787 0.911975 
soccer.player.national_career_roster--sports.sports_team_roster.team 0.997714 0.637226 
sports.pro_athlete.team--soccer.roster_position.team 0.607641 0.919797 
sports.pro_athlete.team--sports.sports_team_roster.team 0.811641 0.633239
Demo 
http://10.123.70.114:8787/
Handling Scalar Values 
Our models only handle entity-entity facets and not entity-value facets 
To handle scalar values, we can bucketize the values and then treat buckets as entities. Then, we 
can apply the same algorithm. 
For example, to score the facet “Tom Cruise was born on 7/3/1962”, we can do the following:
◦ Bucketize 7/3/1962 into entity “1960s” 
◦ Find all possible paths between Tom Cruise and “1960s”
◦ One such path could be: $\text{Tom Cruise} \xrightarrow{\text{sibling}} \text{Marian Mapother} \xrightarrow{\text{born}} \text{1960s}$
◦ Paths like this one can receive high weights; the rest works the same as before
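The bucketizing step for dates can be sketched as below. Decade-sized buckets follow the slide's example; the function name and the choice of `datetime.date` as input are assumptions.

```python
from datetime import date


def decade_bucket(d):
    """Map a date to a decade entity name such as '1960s'."""
    return f"{d.year // 10 * 10}s"


# 7/3/1962 from the slide, read as month/day/year.
bucket = decade_bucket(date(1962, 7, 3))
```

The resulting string can then be treated like any other entity when mining paths.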
Issues and Further Work 
Classifiers work well with rich entities but not shallow entities 
◦ As we add more data, the number of rich entities should increase
The training and test sets are not representative of real world data
◦ Positive examples are often highly connected – this can cause the classifier to be very conservative 
◦ Negative examples are often too random – real world data can be more ambiguous 
The prototype works with two hops but not yet three hops 
◦ When we get to three hops, the intermediate data reaches about 40TB or more. More optimization is needed
Resources 
Aether: aether://experiments/01fe49c5-ae8f-4fc3-b713-9c63f8c68cf8

Graph Based Facet Selection Using Prior Knowledge

  • 1. Graph Based Facet Selection NICK YU, LEON ZHANG, ANTON EMELYANOV
  • 2. Why Facet Selection?
    Obviously, our graph has errors. This is because our sources have errors.
    Ideally, when we have more data from more sources, our data correctness should improve (but it doesn’t).
    Having more data ≠ having more knowledge.
    If our precision tier needs to have a low error rate, we need a way to filter out errors.
  • 3. Source Based Facet Selection
    All facets from one data source share the same confidence.
    Aggregate the confidence if multiple data sources have the same facet.
  • 4. However…
    As we add more sources, it becomes impossible to maintain a score per source per property.
    Even if we trust a source highly, its individual facts cannot all be 100% correct.
    The selection process does not take into account information we already know.
    [Figure: Mari Henmi and Emiri Henmi joined by two Children edges]
  • 5. Taking Advantage of Prior Knowledge
    When seen in isolation, it’s hard to know whether Mari is Emiri’s mother or child.
    However, if we know the following facts (each with some probability of being true), then the job is a lot easier:
      ◦ Emiri’s other parent, Teruhiko, is Mari’s husband
      ◦ Emiri’s sibling, Noritaka, is Mari’s child
    [Figure: graph of Mari Henmi, Teruhiko Saigo, Noritaka Henmi, and Emiri Henmi connected by Spouse, Children, and Sibling edges]
    Inspired by Google’s Knowledge Vault concept
  • 6. Graph Based Selection using Prior Knowledge
    We can generalize the model in the following form. Given a triple $(s, p, o)$, let $r_1, \dots, r_n$ be all possible paths that connect $s$ to $o$, including reverse edges and multiple hops. The probability of that triple being true is:
      $P(s, p, o) = P(p \mid r_1, r_2, \dots, r_n)$
    In particular, for any given triple $(s, p, o)$, we first find all the paths $\mathbf{r} = (r_1, \dots, r_n)$ from $s$ to $o$ (here, from Mari to Emiri).
      ◦ Examples include $r_1 = A \xrightarrow{\text{spouse}} \cdot \xrightarrow{\text{child}} B$, $r_2 = A \xrightarrow{\text{child}} \cdot \xrightarrow{\text{sibling}} B$, etc.
    Then we assign the weights $\mathbf{w} = (w_1, \dots, w_n)$ to the paths and calculate the prediction as:
      $F(p \mid \mathbf{r}) = \dfrac{1}{1 + \exp(-\mathbf{w} \cdot \mathbf{r})}$
    Several linear models are possible. We find logistic regression to perform very well.
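As a rough illustration of the scoring formula on this slide, the sketch below treats each path as a binary feature and applies the logistic function to the weighted sum. The path labels, the weights, and the `score_triple` helper are made up for illustration; they are not taken from the trained models.

```python
import math

def score_triple(paths_found, weights, bias=0.0):
    """Return sigmoid(bias + w . r), where r marks which paths were found."""
    z = bias + sum(weights.get(path, 0.0) for path in paths_found)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the two example paths between A and B.
weights = {"spouse/child": 2.0, "child/sibling": 1.5}
paths_found = {"spouse/child", "child/sibling"}

score = score_triple(paths_found, weights, bias=-3.0)
print(score)
```

Paths that were never seen in training simply contribute nothing to the sum, so a pair with no known paths falls back to the bias alone.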
  • 7. Implementation
    Steps to calculate this score:
      1. Create a training set and a test set
      2. Mine all the possible paths from S to O
      3. Treat each path as a feature and train a model
    Simple, right?
  • 8. Training and Test Set
    Need to contain:
      ◦ Positive examples – easy
      ◦ Negative examples – how?
    Local Closed World Assumption (LCWA):
      ◦ For a given entity, if we know some values of a property, then we assume we know all values of that property
      ◦ More concretely, if we already know that Tom Cruise has three children, then any other entity is unlikely to be his fourth child – this is a possible negative example
      ◦ However, if we don’t know his children at all, then we cannot say who is not his child – this is not a possible negative example
      ◦ Remember that we only need LCWA to be true enough to generate negative examples; it doesn’t have to be 100% true
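The LCWA rule on this slide can be sketched as a three-way labelling function. This is only an illustration under assumed data structures: `known_facts`, `lcwa_label`, and the entity names are placeholders, not part of the original system.

```python
def lcwa_label(known_facts, subject, prop, candidate):
    """Label a candidate value under the Local Closed World Assumption.

    Returns True (positive), False (possible negative), or None (unknown).
    """
    known_values = known_facts.get((subject, prop))
    if not known_values:
        # Nothing is known for this property, so nothing can be ruled out.
        return None
    return candidate in known_values

# Toy graph: placeholder names standing in for real entity IDs.
known_facts = {("tom_cruise", "children"): {"child_a", "child_b", "child_c"}}

print(lcwa_label(known_facts, "tom_cruise", "children", "child_a"))
print(lcwa_label(known_facts, "tom_cruise", "children", "someone_else"))
print(lcwa_label(known_facts, "tom_cruise", "spouse", "someone_else"))
```

Only the `False` case yields a usable negative example; the `None` case is skipped because the assumption gives no information there.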
  • 9. What Negative Examples to Choose
    Are all entities that violate LCWA also good candidates for negative examples?
    Randomly pick any entity?
      ◦ This cannot work because paths between any random pair of entities are very sparse
      ◦ The classifier will learn to detect whether any connection exists between two entities, not whether the right kind of connection exists
    The related entities of positive examples?
      ◦ Choose A-B to be a negative example if A-C is a positive example and B and C are related entities
      ◦ The connection between A-C is still very sparse
    All neighbor entities?
      ◦ Choose A-B to be a negative example if A is already connected to B and A-B is not a positive example
      ◦ Need to make sure B has the same type as the expected type of the property
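A minimal sketch of the last option (neighbor entities with a type check) might look like the following. The function name, the dictionary-based graph, and the toy types are all assumptions made for illustration.

```python
def neighbor_negatives(subject, positives, neighbors, entity_type, expected_type):
    """Pick negative candidates for `subject` from its graph neighbors.

    A neighbor B qualifies if A-B is not a positive example and B has
    the type that the property expects.
    """
    return sorted(
        b for b in neighbors.get(subject, ())
        if b not in positives and entity_type.get(b) == expected_type
    )

# Toy data: A is connected to B, C, and D; A-C is the known positive.
neighbors = {"A": {"B", "C", "D"}}
positives = {"C"}
entity_type = {"B": "person", "C": "person", "D": "film"}

negatives = neighbor_negatives("A", positives, neighbors, entity_type, "person")
print(negatives)
```

Here D is dropped by the type check and C by the positive-example check, leaving B as the only negative candidate.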
  • 10. All Paths (Rules) That Connect S to O
    We find the following paths between any two entities:
      ◦ 1 hop: all forward and backward edges
      ◦ 2 hops: all edges, including f/f, f/b, b/f, and b/b directions
      ◦ Intermediate hub entities are excluded
      ◦ We use bidirectional search to speed up the job
      ◦ The performance breaks down beyond two hops – this can be improved
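The path mining on this slide could be sketched as a small meet-in-the-middle search: expand one hop from S, one hop back from O, and join on the shared intermediate node. This is a simplified stand-in for the bidirectional search mentioned above, not the production job; the `->`/`<-` labels mirror the rule notation used on the later slides, and the hub set is assumed to be supplied by the caller.

```python
from collections import defaultdict

def build_index(triples):
    """Index every edge in both directions: '->p' forward, '<-p' backward."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append(("->" + p, o))
        index[o].append(("<-" + p, s))
    return index

def invert(step):
    """Flip the direction marker of a single step."""
    return ("<-" if step.startswith("->") else "->") + step[2:]

def paths_up_to_two_hops(index, s, o, hubs=frozenset()):
    """Return the set of 1-hop and 2-hop path labels from s to o."""
    paths = set()
    # Expand one hop from o, keyed by the intermediate node, and flip each
    # step so that it reads in the direction mid -> o.
    into_o = defaultdict(list)
    for step, mid in index.get(o, []):
        into_o[mid].append(invert(step))
    for step, node in index.get(s, []):
        if node == o:
            paths.add(step)                    # 1 hop
        if node in hubs or node in (s, o):
            continue                           # skip hubs and trivial loops
        for second in into_o.get(node, []):
            paths.add(step + second)           # 2 hops joined in the middle
    return paths

# Toy graph loosely based on the Mari/Emiri example.
triples = [
    ("mari", "spouse", "teruhiko"),
    ("teruhiko", "children", "emiri"),
    ("mari", "children", "emiri"),
    ("mari", "children", "noritaka"),
    ("emiri", "sibling", "noritaka"),
]
index = build_index(triples)
found = paths_up_to_two_hops(index, "mari", "emiri")
print(sorted(found))
```

Joining on the intermediate node keeps the work proportional to the two one-hop neighborhoods rather than to a full two-hop expansion from S.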
  • 11. (Some) Rules Trained for Marriage Property
    Rule  Precision  Recall  F1
    ->people.person.children<-people.person.children  0.990501  0.254667  0.405163
    ->people.person.children->people.person.parent  0.995007  0.25403  0.40473
    <-people.person.parent<-people.person.children  0.991585  0.253872  0.404246
    <-people.person.parent->people.person.parent  0.995389  0.253263  0.403788
    ->film.actor.film<-film.actor.film  0.210501  0.044027  0.072822
    <-film.film.actor->film.film.actor  0.210041  0.043981  0.072733
    ->film.actor.film<-film.actor.performance--film.performance.film  0.211015  0.043931  0.072722
    ->film.actor.film->film.film.performance--film.performance.actor  0.211004  0.043931  0.072721
    <-film.film.actor<-film.actor.film  0.210119  0.043965  0.072715
    ->film.actor.film->film.film.actor  0.210183  0.043959  0.072711
    <-film.film.actor<-film.actor.performance--film.performance.film  0.210729  0.043914  0.072681
    <-film.film.actor->film.film.performance--film.performance.actor  0.210711  0.043914  0.07268
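The F1 column is consistent with the usual harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for the first rule listed on this slide; the helper name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First rule on the slide: ->people.person.children<-people.person.children
f1 = f1_score(0.990501, 0.254667)
print(f1)
```

The result should land very close to the 0.405163 listed for that rule.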
  • 12. Train Final Models for Each Property
    Given the paths (rules $r_1, \dots, r_n$) for each property, we train a logistic regression model for each property $p$:
      $P(p) = \mathrm{LR}(r_1, r_2, \dots, r_n)$
    How to map the rules to a feature vector?
      ◦ There are 90,000 distinct possible paths between any two given entities. This maps to a feature vector of 90,000 dimensions.
      ◦ There could be more paths as we grow our graph. How do we assign dimensions for new paths?
    Our solution – hash kernels
      ◦ Project the feature space down to a 1,500-dimensional hash space
      ◦ Learn the model on the hashed feature space
      ◦ Use L1 regularization to get rid of useless features
      ◦ Collisions are handled to some degree by the hash kernel itself. Additional collisions are handled by having multiple hash kernels
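One way to picture the hash-kernel step is the sketch below, which maps path strings into a fixed 1,500-slot vector and repeats the mapping with a second seed. The slide does not say how the multiple kernels are combined, so concatenating one block per seed is an assumption, as are the SHA-256 hash and the helper names. Model training with L1 regularization is not shown.

```python
import hashlib

def hash_index(path, seed, dim):
    """Deterministically map a path string to a bucket in [0, dim)."""
    digest = hashlib.sha256(f"{seed}:{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim

def hashed_features(paths, dim=1500, seeds=(0, 1)):
    """Build one count vector per hash seed and concatenate them."""
    vec = [0.0] * (dim * len(seeds))
    for k, seed in enumerate(seeds):
        for path in paths:
            vec[k * dim + hash_index(path, seed, dim)] += 1.0
    return vec

paths = ["->people.person.children", "<-people.person.parent"]
features = hashed_features(paths)
print(len(features), sum(features))
```

Because the bucket depends only on the path string, a path that first appears after training still gets a slot without resizing the feature vector, and two paths that collide under one seed are unlikely to collide under the other.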
  • 13. Mari Henmi & Emiri Henmi
    [Figure: Mari Henmi and Emiri Henmi joined by two Children edges]
    Is Mari a child of Emiri's?
    Rule  Weight
    Bias  -3.38435
    <-people.person.children
    <-people.person.marriage--time.event.person<-people.person.parent  -0.03988
    <-people.person.marriage--time.event.person->people.person.children  -0.3237
    <-people.person.parent
    <-people.person.parent<-people.person.sibling--people.sibling_relationship.sibling
    <-people.person.parent<-people.person.siblings
    <-people.person.parent->people.person.sibling--people.sibling_relationship.sibling
    <-people.person.parent->people.person.siblings  -0.3237
    ->people.person.children
    ->people.person.children<-people.person.sibling--people.sibling_relationship.sibling  -0.03796
    ->people.person.children<-people.person.siblings
    ->people.person.children->people.person.sibling--people.sibling_relationship.sibling  -0.12332
    ->people.person.children->people.person.siblings  -0.03463
    ->people.person.marriage--time.event.person<-people.person.parent  -0.4855
    ->people.person.marriage--time.event.person->people.person.children  -0.06937
    ->people.person.parent
    Total  -4.8224
    Sigmoid  0.007983
    Is Emiri a child of Mari's?
    Rule  Weight
    Bias  -3.38435
    <-people.person.children
    <-people.person.children<-people.person.marriage--time.event.person  3.043293
    <-people.person.children->people.person.marriage--time.event.person  1.556977
    <-people.person.parent
    <-people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.72802
    <-people.person.sibling--people.sibling_relationship.sibling->people.person.parent  1.194149
    <-people.person.siblings<-people.person.children  0.369578
    <-people.person.siblings->people.person.parent  0.436715
    ->people.person.children
    ->people.person.parent
    ->people.person.parent<-people.person.marriage--time.event.person  3.05227
    ->people.person.parent->people.person.marriage--time.event.person  1.445125
    ->people.person.sibling--people.sibling_relationship.sibling<-people.person.children  1.386518
    ->people.person.sibling--people.sibling_relationship.sibling->people.person.parent  0.989205
    ->people.person.siblings<-people.person.children  1.365237
    ->people.person.siblings->people.person.parent  0.827563
    Total  14.0103
    Sigmoid  0.99999
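The Total and Sigmoid rows on this slide can be re-derived by adding the bias to the listed weights and applying the logistic function. The sketch below does that arithmetic with the numbers copied from the slide; the `total_and_sigmoid` helper is ours.

```python
import math

def total_and_sigmoid(bias, weights):
    """Sum the bias and rule weights, then squash with the logistic function."""
    total = bias + sum(weights)
    return total, 1.0 / (1.0 + math.exp(-total))

BIAS = -3.38435

# Weights listed under "Is Mari a child of Emiri's?"
mari_child_of_emiri = [-0.03988, -0.3237, -0.3237, -0.03796,
                       -0.12332, -0.03463, -0.4855, -0.06937]

# Weights listed under "Is Emiri a child of Mari's?"
emiri_child_of_mari = [3.043293, 1.556977, 1.72802, 1.194149, 0.369578,
                       0.436715, 3.05227, 1.445125, 1.386518, 0.989205,
                       1.365237, 0.827563]

total_1, prob_1 = total_and_sigmoid(BIAS, mari_child_of_emiri)
total_2, prob_2 = total_and_sigmoid(BIAS, emiri_child_of_mari)
print(total_1, prob_1)
print(total_2, prob_2)
```

Both totals should agree with the slide to about four decimal places, which confirms that the score is just the bias plus the weights of the paths that were found.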
  • 14. Measurement
    We measure the trained models on a separate hold-out set.
    Precision = True Positives / Predicted Positives
    Recall = True Positives / Labeled Positives
    Most models have high precision but lower recall.
    This is because the model can’t reason about shallow entities.
    Property  Precision  Recall
    automotive.automotive_class.related  0.997171  0.999055
    automotive.trim_level.model_year  1  1
    automotive.trim_level.option_package--automotive.option_package.trim_levels  1  0.924326
    automotive.trim_level.related_trim_level  1  1
    award.nominated_work.nomination--award.nomination.nominee  0.911385  0.327528
    award.nominee.award_nominations--award.nomination.nominated_work  0.852335  0.265277
    award.winner.awards_won--award.honor.winner  0.773061  0.694073
    award.winning_work.honor--award.honor.winner  0.907776  0.186617
    education.school.school_district  0.983673  0.598015
    film.actor.film  0.967154  0.981438
    film.director.film  0.674589  0.81029
    film.film.actor  0.991635  0.981757
    film.film.art_director  0.621622  0.042048
    film.film.country  0.86165  0.776492
    film.film.director  0.705793  0.821246
    film.film.editor  0.886905  0.060105
    film.film.language  1  0.780943
    film.film.music  0.945946  0.01992
    film.film.performance--film.performance.actor  0.620755  0.07921
    film.film.producer  0.492865  0.052533
    film.film.production_company  0.94723  0.187467
    film.film.story  0.976492  0.159292
    film.film.writer  0.795948  0.787734
    film.producer.film  0.829213  0.317553
    film.writer.film  0.704711  0.77448
    music.artist.track_contributions--music.track_contribution.track  0.963513  0.835044
    music.track.artist  0.955354  0.975672
    music.track.producer  0.76477  0.705882
    organization.organization.headquarters--location.address.city_entity  0.865854  0.022955
    organization.organization.headquarters--location.address.subdivision_entity  0.811475  0.037106
    people.deceased_person.place_of_death  0.916096  0.075246
    people.person.children  0.998595  0.917393
    people.person.marriage--time.event.person  0.996065  0.798455
    people.person.nationality  0.95238  0.9496
    people.person.parent  0.998061  0.916419
    people.person.place_of_birth  0.966238  0.344751
    people.person.sibling--people.sibling_relationship.sibling  0.993952  0.909814
    people.person.siblings  0.998787  0.911975
    soccer.player.national_career_roster--sports.sports_team_roster.team  0.997714  0.637226
    sports.pro_athlete.team--soccer.roster_position.team  0.607641  0.919797
    sports.pro_athlete.team--sports.sports_team_roster.team  0.811641  0.633239
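The two definitions on this slide translate directly into set arithmetic. The sketch below assumes the hold-out predictions and labels are available as sets of facet identifiers; the toy IDs are invented for illustration.

```python
def precision_recall(predicted, labeled):
    """Compute precision and recall from predicted and labeled positive sets."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall

# Toy hold-out set: 4 predicted positives, 6 labeled positives, 3 in common.
predicted = {1, 2, 3, 4}
labeled = {2, 3, 4, 5, 6, 7}

precision, recall = precision_recall(predicted, labeled)
print(precision, recall)
```

A model that only fires when it sees strong path evidence will show this same pattern: few wrong predictions, but many labeled positives left unpredicted.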
  • 16. Handling Scalar Values
    Our models only handle entity-entity facets, not entity-value facets.
    To handle scalar values, we can bucketize the values and then treat the buckets as entities. Then we can apply the same algorithm.
    For example, to score the facet “Tom Cruise was born on 7/3/1962”, we can do the following:
      ◦ Bucketize 7/3/1962 into the entity “1960s”
      ◦ Find all possible paths between Tom Cruise and “1960s”
      ◦ One such path could be: $\text{Tom Cruise} \xrightarrow{\text{sibling}} \text{Marian Mapother} \xrightarrow{\text{born}} \text{1960s}$
      ◦ We can assign high weights to paths like this one, and the rest is the same as before
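The bucketizing step for dates can be sketched in a couple of lines. Decade-sized buckets follow the “1960s” example on this slide; the function name and the choice to key on the year alone are assumptions.

```python
from datetime import date

def decade_bucket(d):
    """Map a date to a decade label such as '1960s'."""
    return f"{d.year // 10 * 10}s"

bucket = decade_bucket(date(1962, 7, 3))
print(bucket)
```

The returned label can then be added to the graph as an ordinary entity, so the same path mining and scoring apply to it.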
  • 17. Issues and Further Work
    Classifiers work well with rich entities but not with shallow entities
      ◦ As we gather more data, the number of rich entities should increase
    The training and test sets are not representative of real-world data
      ◦ Positive examples are often highly connected – this can cause the classifier to be very conservative
      ◦ Negative examples are often too random – real-world data can be more ambiguous
    The prototype works with two hops but not yet with three hops
      ◦ When we get to three hops, the intermediate data reaches about 40TB or more. More optimization is needed