This document discusses graph-based facet selection using prior knowledge. It begins by explaining the need for facet selection to filter out errors from multiple data sources. It then describes an initial source-based facet selection method and its limitations. The document proposes a graph-based selection model that takes advantage of prior knowledge by looking at all possible paths connecting two entities. It provides details on implementing this model, including generating training and test data, mining paths, training classification models, and making predictions. Evaluation results for various property models are presented, along with issues and opportunities for further work.
2. Why Facet Selection?
Obviously, our graph has errors
This is because our sources have errors
Ideally, when we have more data from more sources, our data correctness should improve (but it doesn't)
Having more data ≠ Having more knowledge
If our precision tier needs to have a low error rate, we need a way to filter out errors
3. Source Based Facet Selection
All facets from one data source share the same confidence
Aggregate confidences when multiple data sources have the same facet
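The aggregation step above can be sketched in a few lines. The slides don't specify the aggregation rule, so the noisy-OR combination and the name `aggregate_confidence` are assumptions for illustration, not the actual implementation:

```python
def aggregate_confidence(source_confidences):
    """Combine per-source confidences for one facet.

    Noisy-OR (an assumed rule): the facet is false only if every
    source asserting it is wrong, so P(true) = 1 - prod(1 - c_i).
    """
    p_all_wrong = 1.0
    for c in source_confidences:
        p_all_wrong *= 1.0 - c
    return 1.0 - p_all_wrong


# A facet asserted by two independent 0.5-confidence sources
# scores higher than either source alone.
print(aggregate_confidence([0.5, 0.5]))  # 0.75
```

Noisy-OR has the property the slide asks for: agreement across sources raises the score, while a single weak source keeps it low.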
4. However…
As we add more sources, it is impossible to maintain a score per source per property
Even if we trust a source highly, its individual facts cannot all be 100% correct
The selection process does not take into account information we already know
[Diagram: Mari Henmi and Emiri Henmi connected by "Children" edges pointing in both directions]
5. Taking Advantage of Prior Knowledge
When seen in isolation, it's hard to know whether Mari is Emiri's mother or child
However, if we know the following facts (each with some probability of being true), then the job is a lot easier:
◦ Emiri's other parent, Teruhiko, is Mari's husband
◦ Emiri's sibling, Noritaka, is Mari's child
[Diagram: the same Mari Henmi ↔ Emiri Henmi "Children" edges, now embedded in a richer graph with Teruhiko Saigo (Spouse, Children) and Noritaka Henmi (Sibling, Children)]
Inspired by Google’s Knowledge Vault concept
6. Graph-Based Selection Using Prior Knowledge
We can generalize the model in the following form. Given a triple (s, p, o), let r₁, …, rₙ be all possible paths that connect s to o, including reverse edges and multiple hops. The probability of that triple being true is:
P(s, p, o) = P(p | r₁, r₂, …, rₙ)
In particular, for any given triple (s, p, o), we first find all the paths r = (r₁, …, rₙ) from Mari to Emiri.
◦ Examples include r₁ = A −spouse→ ∙ −child→ B, r₂ = A −child→ ∙ −sibling→ B, etc.
Then we assign weights w = (w₁, …, wₙ) to the paths and calculate the prediction as:
F(p | r) = 1 / (1 + exp(−w · r))
Several linear models are possible; we find logistic regression performs very well
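A minimal sketch of the scoring function F(p | r): each path found between s and o turns on one binary feature, and logistic regression applies its learned weight. The weight values and path names below are made up for illustration:

```python
import math

def score_triple(bias, path_weights, observed_paths):
    """F(p | r) = sigmoid(w . r + bias), where r is a binary indicator
    of which paths were found between s and o."""
    z = bias + sum(path_weights.get(p, 0.0) for p in observed_paths)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for a "children" property model.
weights = {"->spouse->children": 3.0, "->siblings->children": 1.4}

# With no supporting paths the bias dominates and the score stays low;
# each matching path pushes the sigmoid toward 1.
low = score_triple(-3.4, weights, [])
high = score_triple(-3.4, weights, ["->spouse->children", "->siblings->children"])
```

This mirrors the worked Mari/Emiri example later in the deck: the bias plus the weights of the observed paths are summed, then squashed through the sigmoid.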
7. Implementation
Steps to calculate this score:
1. Create training set and test set
2. Mine all the possible paths from S to O
3. Treat each path as a feature and train a model
Simple, right?
8. Training and Test Set
Need to contain:
◦ Positive examples – easy
◦ Negative examples – how?
Local Closed World Assumption:
◦ For a given entity, if we know the values of a property, then we know all values of that property
◦ More concretely, if we already know that Tom Cruise has three children, then any other entity is unlikely to be his fourth child – this is a possible negative example
◦ However, if we don't know his children at all, then we cannot say who must not be his child – this is not a possible negative example
◦ Remember we just need LCWA to be true enough to generate negative examples; it doesn't have to be 100% true.
9. What Negative Examples to Choose
Are all entities who violate LCWA also good candidates for negative examples?
Randomly pick any entity?
◦ This cannot work because paths between a random pair of entities are very sparse
◦ The classifier will learn to detect the mere existence of a connection between two entities rather than the right kind of connection
The related entities of positive examples?
◦ Choose A-B to be a negative example if A-C is a positive example and B and C are related entities
◦ The connection between A and B is still very sparse
All neighbor entities?
◦ Choose A-B to be a negative example if A is already connected to B and A-B is not a positive example
◦ Need to make sure B has the same type as the expected type of the property
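The neighbor-entity strategy above can be sketched as follows; the adjacency-map representation and helper names are assumptions for illustration:

```python
def neighbor_negatives(neighbors, positives, entity_type, expected_type):
    """Generate negatives for one property under LCWA: for each positive
    subject s, any neighbor b of s that has the property's expected type
    and is not already a positive object becomes a negative (s, b)."""
    positive_pairs = set(positives)
    negatives = []
    for s, _o in positives:
        for b in neighbors.get(s, []):
            if entity_type.get(b) == expected_type and (s, b) not in positive_pairs:
                negatives.append((s, b))
    return negatives

# Tom is connected to a child, a spouse, and a film he acted in; only the
# person-typed, non-child neighbor qualifies as a negative example for
# the "children" property (all names hypothetical).
neighbors = {"Tom": ["Connor", "Nicole", "Top Gun"]}
entity_type = {"Connor": "person", "Nicole": "person", "Top Gun": "film"}
negs = neighbor_negatives(neighbors, [("Tom", "Connor")], entity_type, "person")
```

Because the negatives are drawn from already-connected entities, they have dense paths to the subject, forcing the classifier to discriminate the kind of connection rather than its existence.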
10. All Paths (Rules) That Connect S to O
We find the following paths between any two entities:
◦ 1 hop: All forward and backward edges
◦ 2 hops: All edges including f/f, f/b, b/f, b/b directions
◦ Excluding intermediate hub entities
◦ We use bidirectional search to speed up the job
◦ The performance breaks down beyond two hops – this can be improved
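The path mining above can be sketched as below. This simple version checks the single intermediate hop directly rather than running a true bidirectional frontier search (which the slides use for speed), but it yields the same 1- and 2-hop paths; the edge-list format is an assumption:

```python
def mine_paths(edges, s, o, hubs=frozenset()):
    """Enumerate 1- and 2-hop paths from s to o over directed triples
    (subject, label, object). A forward traversal is tagged '->' and a
    backward traversal '<-', giving the f/f, f/b, b/f, b/b combinations.
    Hub entities are excluded as intermediates."""
    adj = {}
    for a, label, b in edges:
        adj.setdefault(a, []).append(("->" + label, b))
        adj.setdefault(b, []).append(("<-" + label, a))
    paths = []
    for lab1, mid in adj.get(s, []):
        if mid == o:                    # 1 hop: direct edge, either direction
            paths.append(lab1)
        elif mid not in hubs:           # 2 hops via a non-hub intermediate
            for lab2, end in adj.get(mid, []):
                if end == o:
                    paths.append(lab1 + lab2)
    return paths

edges = [("Mari", "children", "Emiri"),
         ("Mari", "spouse", "Teruhiko"),
         ("Teruhiko", "children", "Emiri")]
found = mine_paths(edges, "Mari", "Emiri")
```

On this toy graph the miner recovers both the direct "children" edge and the spouse-then-children path used in the earlier example.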
12. Train Final Models for Each Property
Given the paths (rules r₁, …, rₙ) for each property, we train a logistic regression model for each property p:
P(p) = LR(r₁, r₂, …, rₙ)
How to map the rules to a feature vector?
◦ There are 90,000 distinct possible paths between any two given entities. This maps to a feature vector of 90,000 dimensions.
◦ There could be more paths as we grow our graph. How do we assign dimensions for new paths?
Our solution – hash kernels
◦ Project the feature space down to a 1,500 dimensional hash space
◦ Learn the model on the hashed feature space
◦ Use L1 regularization to get rid of useless features
◦ Collisions are handled to some degree by the hash kernel itself. Additional collisions are handled by having multiple hash kernels
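The hashing-trick projection might look like this; the 1,500-dimension target matches the slides, but the concrete hash function (MD5 here) and helper names are assumptions:

```python
import hashlib

DIM = 1500  # hashed feature space size, per the slides

def path_index(path, dim=DIM):
    """Map an arbitrary path string to a fixed dimension (hashing trick),
    so paths that appear as the graph grows need no new dimensions."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

def hashed_features(paths, dim=DIM):
    """Binary hashed feature vector; colliding paths share a dimension."""
    vec = [0.0] * dim
    for p in paths:
        vec[path_index(p, dim)] = 1.0
    return vec

vec = hashed_features(["->people.person.children", "<-people.person.parent"])
```

The logistic regression is then trained on these 1,500-dimensional vectors instead of the raw 90,000-dimensional path space.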
13. Mari Henmi & Emiri Henmi
Mari Henmi
Children Children
Emiri Henmi
Is Mari a child of Emiri's?
Rule Weight (rules with no listed weight were driven to zero by L1 regularization)
Bias -3.38435
<-people.person.children
<-people.person.marriage--time.event.person<-people.person.parent -0.03988
<-people.person.marriage--time.event.person->people.person.children -0.3237
<-people.person.parent
<-people.person.parent<-people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent<-people.person.siblings
<-people.person.parent->people.person.sibling--people.sibling_relationship.sibling
<-people.person.parent->people.person.siblings -0.3237
->people.person.children
->people.person.children<-people.person.sibling--people.sibling_relationship.sibling -0.03796
->people.person.children<-people.person.siblings
->people.person.children->people.person.sibling--people.sibling_relationship.sibling -0.12332
->people.person.children->people.person.siblings -0.03463
->people.person.marriage--time.event.person<-people.person.parent -0.4855
->people.person.marriage--time.event.person->people.person.children -0.06937
->people.person.parent
Total -4.8224
Sigmoid 0.007983
Is Emiri a child of Mari's?
Rule Weight
Bias -3.38435
<-people.person.children
<-people.person.children<-people.person.marriage--time.event.person 3.043293
<-people.person.children->people.person.marriage--time.event.person 1.556977
<-people.person.parent
<-people.person.sibling--people.sibling_relationship.sibling<-people.person.children 1.72802
<-people.person.sibling--people.sibling_relationship.sibling->people.person.parent 1.194149
<-people.person.siblings<-people.person.children 0.369578
<-people.person.siblings->people.person.parent 0.436715
->people.person.children
->people.person.parent
->people.person.parent<-people.person.marriage--time.event.person 3.05227
->people.person.parent->people.person.marriage--time.event.person 1.445125
->people.person.sibling--people.sibling_relationship.sibling<-people.person.children 1.386518
->people.person.sibling--people.sibling_relationship.sibling->people.person.parent 0.989205
->people.person.siblings<-people.person.children 1.365237
->people.person.siblings->people.person.parent 0.827563
Total 14.0103
Sigmoid 0.99999
14. Measurement
We measure the trained models on a separate hold-out set
Precision = True Positives / Predicted Positives
Recall = True Positives / Labeled Positives
Most models have high precision but not so high recall
This is because the model can't reason about shallow entities
Property Precision Recall
automotive.automotive_class.related 0.997171 0.999055
automotive.trim_level.model_year 1 1
automotive.trim_level.option_package--automotive.option_package.trim_levels 1 0.924326
automotive.trim_level.related_trim_level 1 1
award.nominated_work.nomination--award.nomination.nominee 0.911385 0.327528
award.nominee.award_nominations--award.nomination.nominated_work 0.852335 0.265277
award.winner.awards_won--award.honor.winner 0.773061 0.694073
award.winning_work.honor--award.honor.winner 0.907776 0.186617
education.school.school_district 0.983673 0.598015
film.actor.film 0.967154 0.981438
film.director.film 0.674589 0.81029
film.film.actor 0.991635 0.981757
film.film.art_director 0.621622 0.042048
film.film.country 0.86165 0.776492
film.film.director 0.705793 0.821246
film.film.editor 0.886905 0.060105
film.film.language 1 0.780943
film.film.music 0.945946 0.01992
film.film.performance--film.performance.actor 0.620755 0.07921
film.film.producer 0.492865 0.052533
film.film.production_company 0.94723 0.187467
film.film.story 0.976492 0.159292
film.film.writer 0.795948 0.787734
film.producer.film 0.829213 0.317553
film.writer.film 0.704711 0.77448
music.artist.track_contributions--music.track_contribution.track 0.963513 0.835044
music.track.artist 0.955354 0.975672
music.track.producer 0.76477 0.705882
organization.organization.headquarters--location.address.city_entity 0.865854 0.022955
organization.organization.headquarters--location.address.subdivision_entity 0.811475 0.037106
people.deceased_person.place_of_death 0.916096 0.075246
people.person.children 0.998595 0.917393
people.person.marriage--time.event.person 0.996065 0.798455
people.person.nationality 0.95238 0.9496
people.person.parent 0.998061 0.916419
people.person.place_of_birth 0.966238 0.344751
people.person.sibling--people.sibling_relationship.sibling 0.993952 0.909814
people.person.siblings 0.998787 0.911975
soccer.player.national_career_roster--sports.sports_team_roster.team 0.997714 0.637226
sports.pro_athlete.team--soccer.roster_position.team 0.607641 0.919797
sports.pro_athlete.team--sports.sports_team_roster.team 0.811641 0.633239
16. Handling Scalar Values
Our models only handle entity-entity facets and not entity-value facets
To handle scalar values, we can bucketize the values and then treat buckets as entities. Then, we
can apply the same algorithm.
For example, to score the facet “Tom Cruise is born on 7/3/1962”, we can do the following:
◦ Bucketize 7/3/1962 into entity “1960s”
◦ Find all possible paths between Tom Cruise and “1960s”
◦ One such path could be: Tom Cruise −sibling→ Marian Mapother −born→ 1960s
◦ We can assign high weights to paths like this one, and the rest works the same as before
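A minimal sketch of the bucketization step, assuming ISO-formatted dates and decade-sized buckets (the actual bucket granularity is not specified in the slides):

```python
def bucketize_date(iso_date):
    """Map a date value to a decade 'entity', e.g. '1962-07-03' -> '1960s',
    so entity-value facets can reuse the entity-entity path model."""
    year = int(iso_date.split("-")[0])
    return f"{year - year % 10}s"

# Tom Cruise's birth date and his sibling's land in the same bucket,
# so paths like Tom Cruise -sibling-> ... -born-> 1960s can fire.
bucket = bucketize_date("1962-07-03")  # "1960s"
```

Once values are replaced by bucket entities, path mining, hashing, and logistic regression all run unchanged.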
17. Issues and Further Work
Classifiers work well with rich entities but not shallow entities
◦ As we grow more data, our rich entities should increase
The training and test set are not representative of real world data
◦ Positive examples are often highly connected – this can cause the classifier to be very conservative
◦ Negative examples are often too random – real world data can be more ambiguous
The prototype works with two hops but not yet three hops
◦ When we get to three hops, the intermediate data reaches about 40 TB or more; more optimization is needed