Quantifying the bias in data links

Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta

The problem
• Linked Data datasets are biased
• Bias = the information is unevenly distributed
• To detect such a bias, the information distribution in the
dataset should be compared to an unbiased one (ground
truth), which is not available
• Our proposal is to use information coming from the
connected datasets to approximate such a comparison

Is bias a problem?
• LMDB is biased towards old movies (i.e., it mostly
contains information about old movies)
• A recommender system would therefore produce
results biased towards old movies
• This bias needs to be identified
• to properly assess the results of Linked Data systems and
• to compensate for the bias.

Motivation
• Dedalo: using Linked Data to explain patterns
• Pattern
• Students of the Open University enroll in Health&Social Care
courses more often around Manchester than in other places
• Explanation
• Health&Social Care courses are popular in Manchester because it is
in the Northern Hemisphere
• In DBpedia, information about the location of places is incomplete, and
this incompleteness is unevenly distributed, i.e. there is a bias

Identifying the bias
• Measure how much a dataset is biased when compared to
another one
[Figure: a dataset linked (owl:sameAs, rdfs:seeAlso, skos:exactMatch, ….) to a connected dataset D; S is the projection of the dataset in D]
• Use the projection of the dataset into the dataset D it is connected to
• compare the distribution of property values for the entities in D
• with the distribution for the entities in S (the dataset projection)
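The projection step can be illustrated with a minimal Python sketch. It assumes the links are available as (source, property, target) triples; the function name `projection`, the toy URIs and the choice of link properties are hypothetical and only serve the illustration.

```python
# Link properties treated as "connections" between datasets (an assumption
# for this sketch; the slide lists these as examples and elides the rest).
LINK_PROPERTIES = {"owl:sameAs", "rdfs:seeAlso", "skos:exactMatch"}


def projection(links, link_properties=LINK_PROPERTIES):
    """Return the entities that the source dataset links to."""
    return {target for (_source, prop, target) in links if prop in link_properties}


# Hypothetical toy data: three triples from a movie dataset, and the
# entities of the connected dataset D.
links = [
    ("lmdb:film/1", "owl:sameAs", "db:Metropolis_(1927_film)"),
    ("lmdb:film/2", "owl:sameAs", "db:Casablanca_(film)"),
    ("lmdb:film/2", "foaf:page", "http://example.org/page"),  # not a link property
]
D = {"db:Metropolis_(1927_film)", "db:Casablanca_(film)", "db:Blade_Runner"}

# S is the part of D that the source dataset actually points to.
S = projection(links) & D
print(sorted(S))
```

Here S is a subset of D, so any property distribution computed over S can be compared with the same distribution computed over all of D.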

Example: is LMDB biased?
• Compare dc:subject values for the entities in D and in S
LMDB is biased towards black and white movies
• Same for dbp:released
LMDB is biased towards older movies

Bias detection proposition
• Use SPARQL to build pairs of value distributions in S and D
• Given
• two populations (values) and
• the same observation (RDF property)
dc:subject(D) = {dbCat:ScienceFictionMovies, dbCat:Black&WhiteMovies}
dc:subject(S) = {dbCat:Black&WhiteMovies}
• Use the statistical t-tests commonly used to compare
observations
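As a rough illustration of the pairing step, the sketch below builds the two value distributions in Python over in-memory triples, standing in for the SPARQL queries mentioned above. The function `value_distribution`, the entity URIs and the toy triples are hypothetical.

```python
from collections import Counter


def value_distribution(entities, prop, triples):
    """Share of `entities` that have each value of `prop`."""
    # Deduplicate (entity, value) pairs so repeated triples are counted once.
    pairs = {(s, o) for (s, p, o) in triples if p == prop and s in entities}
    counts = Counter(o for (_s, o) in pairs)
    return {value: count / len(entities) for value, count in counts.items()}


# Hypothetical toy data: four movies in D, two of which belong to S.
triples = [
    ("db:m1", "dc:subject", "dbCat:ScienceFictionMovies"),
    ("db:m2", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:m3", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:m4", "dc:subject", "dbCat:ScienceFictionMovies"),
]
D = {"db:m1", "db:m2", "db:m3", "db:m4"}
S = {"db:m2", "db:m3"}

dist_D = value_distribution(D, "dc:subject", triples)
dist_S = value_distribution(S, "dc:subject", triples)

# One (share in S, share in D) pair per value of the property.
pairs = {
    value: (dist_S.get(value, 0.0), dist_D.get(value, 0.0))
    for value in sorted(set(dist_D) | set(dist_S))
}
print(pairs)
```

In this toy case every entity of S is a black and white movie, while only half of D is, which mirrors the dc:subject example above.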

T-Tests of statistical significance
• Test whether there is a significant difference between two populations
• calculate the probability p that the difference is due to
chance
• state a null hypothesis (i.e. the difference is due to chance)
• there is no bias in a property
• state an alternate hypothesis (the one you want to prove)
• there is bias in a property
• if p is below 0.05, one can reject the null hypothesis
• the lower p is, the more the property is biased
• Rank the properties according to p to find the most biased ones
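The test-and-rank procedure could look roughly like the following sketch. The slides do not say which t-test variant is used, so this assumes Welch's t statistic (unequal variances) and, to stay within the Python standard library, approximates the two-sided p-value with a normal distribution, which is only reasonable for large samples. For small samples an exact t distribution, for instance `scipy.stats.ttest_ind(a, b, equal_var=False)`, would be more appropriate. All names and numbers below are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def approx_p_value(sample_s, sample_d):
    """Approximate two-sided p-value for the difference of two sample means.

    Uses Welch's t statistic with a normal approximation of its distribution.
    """
    se = sqrt(variance(sample_s) / len(sample_s) + variance(sample_d) / len(sample_d))
    diff = mean(sample_s) - mean(sample_d)
    if se == 0:  # both samples are constant
        return 1.0 if diff == 0 else 0.0
    t = diff / se
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical numeric observations per property: (values in S, values in D).
observations = {
    "dbp:runtime": (
        [90, 95, 100, 105, 110, 92],
        [91, 96, 99, 104, 111, 93, 90, 95, 100, 105, 110, 92],
    ),
    "dbp:released": (
        [1931, 1934, 1936, 1938, 1939, 1933],
        [1931, 1934, 1936, 1938, 1939, 1933, 1982, 1999, 2005, 2010, 1977, 1968],
    ),
}

p_values = {prop: approx_p_value(s, d) for prop, (s, d) in observations.items()}

# Lowest p first: the properties most likely to be biased come first.
ranked = sorted(p_values, key=p_values.get)
for prop in ranked:
    print(prop, p_values[prop], "biased" if p_values[prop] < 0.05 else "no evidence of bias")
```

With this toy data the release years in S are concentrated in the 1930s, so dbp:released is ranked above dbp:runtime, whose values are similar in S and D.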

Experiments and results
• 30 datasets and 54 pairs from the DataHub [1]
• Varying in the number of entities in S (from approx. 30 to
60,000)
• Varying in domain (multi-domain, biomedical,
computer science, education, geography…)
[1] http://datahub.io/

When results are expected…
• NLFinland, places in Finland (connected to DBpedia)
class     prop                  value                       p
db:Place  dc:subject            db:CitiesAndTownsInFinland  p < 1.00e-15
db:Place  dbp:latd (average)    40.5                        p < 1.00e-15
db:Place  dbp:longd (average)   24.6                        p < 1.00e-15
• NLSpain, bibliographic Spanish data (connected to DBpedia)
class             prop             value       p
db:MusicalArtist  db:birthPlace    db:Spain    p < 1.13e-13
db:Writer         dbp:nationality  db:Spanish  p < 4.64e-03

…when results are less expected
• Uniprot, biomedical data (connected to
Bio2RDF/BioPax/DrugBank)
class       prop             value           p
up:Protein  up:isolatedFrom  uptissue:Brain  p < 1.33e-04
• RED, writers data (connected to DBpedia)
class     prop           value       p
db:Agent  db:genre       db:Novel    p < 1.00e-15
db:Agent  db:genre       db:Poetry   p < 1.00e-15
db:Agent  db:deathCause  db:Suicide  p < 1.00e-15

Conclusions and future work
• The importance of identifying the bias in a dataset
• Approach:
• with information from the connected datasets
• statistical t-tests on the distributions of the values of a property
• ranking properties based on the probability of being biased
• Evaluating Dedalo’s performance on Google Trends
Please participate!
http://linkedu.eu/dedalo/eval/

Thank you for your attention
Questions?
ilaria.tiddi@open.ac.uk
@IlaTiddi http://linkedu.eu/dedalo/eval/

Dedalo: explaining clusters with Linked Data
• Linked Data are a graph
• nodes : URIs
• edges : RDF properties
• From some nodes, a walk leads to the same node
Walk = a chain of RDF properties
• Walks can be an explanation for the cluster
ExplC = a chain of properties and one final entity
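To make the notion of a walk concrete, here is a small Python sketch that follows a chain of properties through a toy graph and checks whether it ends at a given entity. The graph encoding (a dictionary from (node, property) to a set of nodes), the function names and the URIs are all hypothetical; this is not Dedalo's implementation.

```python
def follow_walk(graph, start, walk):
    """Return the nodes reached from `start` by following the chain of properties `walk`."""
    frontier = {start}
    for prop in walk:
        frontier = {
            target
            for node in frontier
            for target in graph.get((node, prop), ())
        }
    return frontier


def is_explained_by(graph, entity, walk, final_entity):
    """True if following `walk` from `entity` reaches `final_entity`."""
    return final_entity in follow_walk(graph, entity, walk)


# Hypothetical toy graph: edges are RDF properties, nodes are URIs.
graph = {
    ("db:movie1", "dc:subject"): {"dbCat:CyberpunkFilms"},
    ("db:movie2", "dc:subject"): {"dbCat:SpaceOperaFilms"},
    ("db:movie3", "dc:subject"): {"dbCat:RomanticComedies"},
    ("dbCat:CyberpunkFilms", "skos:broader"): {"dbCat:ScienceFiction"},
    ("dbCat:SpaceOperaFilms", "skos:broader"): {"dbCat:ScienceFiction"},
    ("dbCat:RomanticComedies", "skos:broader"): {"dbCat:Comedy"},
}

# A candidate explanation: a chain of properties plus one final entity.
walk = ["dc:subject", "skos:broader"]
final_entity = "dbCat:ScienceFiction"

for movie in ["db:movie1", "db:movie2", "db:movie3"]:
    print(movie, is_explained_by(graph, movie, walk, final_entity))
```

In this toy graph the first two movies share the walk to the same final node, so the pair (walk, final entity) is a candidate explanation for a cluster containing them.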

Dedalo: explaining clusters with Linked Data
ExplC = “movies whose subject is a subcategory of Science Fiction”
• A* iterative search
• Entropy to drive the search expanding the graph
• Improving the F-score of ExplC at each iteration
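As a hedged illustration of scoring an explanation, the sketch below computes the standard F-score (harmonic mean of precision and recall) of the set of items an explanation covers against the cluster it should explain. The slides do not give the exact formula used by Dedalo, so the standard definition is an assumption, and the item names are made up.

```python
def f_score(covered, cluster):
    """Standard F-score of the items `covered` by an explanation w.r.t. `cluster`."""
    true_positives = len(covered & cluster)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(covered)
    recall = true_positives / len(cluster)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cluster of items and two candidate explanations.
cluster = {"a", "b", "c", "d"}
covered_first = {"a", "b"}             # precise, but misses half of the cluster
covered_second = {"a", "b", "c", "e"}  # covers more of the cluster, one false positive

print(f_score(covered_first, cluster))
print(f_score(covered_second, cluster))
```

An iterative search would keep the second candidate, since it scores higher than the first.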

Knowledge Discovery
[Figure: raw data → clean data → patterns → knowledge]
• The process of identifying patterns in data [1]
• Patterns are usually interpreted by the experts
• Linked Data can be used to automatically interpret patterns
• open, shared, multi-domain, connected knowledge
[1] Fayyad, 1998.

Contribution
The bias needs to be identified when building Linked Data systems
A recommender system based on DBpedia (any kind of movies)
DBpedia is linked to the Linked Movies Database (‘30s movies)
The recommendation might be compromised
We propose a process to identify and measure the bias based on
statistical methods

Motivation
• Students are interested in Health&Social Care since they live in
the Northern Hemisphere
• What about the other counties?
• are they connected to the “Northern Hemisphere” entity?
• There must be a bias: the information is unevenly distributed
• Solution: weighting properties to rebalance the unevenness

