An approach to identify how much a Linked Data dataset is biased, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
2. The problem
• Linked Data datasets are biased
• Bias = the information is unevenly distributed
• To detect such a bias, the information distribution in the
dataset should be compared to an unbiased one (ground
truth), which is not available
• Our proposal is to use information coming from the
connected datasets to approximate such a comparison
3. Is bias a problem?
• LMDB is biased towards old movies (i.e., it mostly
contains information about old movies)
• A recommender system would therefore produce
results biased towards old movies
• This bias needs to be identified
• to properly assess the results of Linked Data systems and
• to compensate for the bias.
4. Motivation
• Dedalo: using Linked Data to explain patterns
• Pattern
• Students of the Open University enroll into Health&Social Care
courses more often around Manchester than in other places
• Explanation
• Health&Social Care courses are popular in Manchester because it is
in the Northern Hemisphere
• In DBpedia, information about the locations of places is incomplete,
and this incompleteness is unevenly distributed, i.e. there is a bias
5. Identifying the bias
• Measure how much a dataset is biased when compared to
another one
[Figure: a Dataset linked to a connected dataset D through owl:sameAs, rdfs:seeAlso, skos:exactMatch, …; S is the part of D reached by these links]
• Use the projection S of the dataset into its connected dataset D
• compare the distribution of property values for the entities in D
• with that of the entities in S (the dataset projection)
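The projection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entity URIs, the link triples, and the `project` helper are hypothetical and not taken from the presented system.

```python
# Minimal sketch: compute S, the part of D reached from the dataset via links.
# All URIs below are made-up toy data, not real LMDB or DBpedia identifiers.

def project(dataset_entities, links):
    """Return the set of target entities linked from `dataset_entities`.

    `links` is an iterable of (source_entity, link_property, target_entity).
    """
    sources = set(dataset_entities)
    return {target for source, _prop, target in links if source in sources}


lmdb_entities = {"lmdb:film/1", "lmdb:film/2", "lmdb:film/3"}
links = [
    ("lmdb:film/1", "owl:sameAs", "db:Metropolis"),
    ("lmdb:film/2", "owl:sameAs", "db:Nosferatu"),
    ("other:x", "owl:sameAs", "db:Alien"),  # not an LMDB entity, so ignored
]

S = project(lmdb_entities, links)
print(sorted(S))
```

Entities of the dataset that carry no link (here `lmdb:film/3`) simply contribute nothing to S.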
6. Example: is LMDB biased?
• Compare dc:subject values for the entities in D and in S
LMDB is biased towards black and white movies
• Same for dbp:released
LMDB is biased towards older movies
7. Bias detection proposition
• Use SPARQL to build pairs of value distributions in S and D
• Given
• two populations (values) and
• the same observation (RDF property)
dc:subject(D) = {dbCat:ScienceFictionMovies,dbCat:Black&WhiteMovies}
dc:subject(S) = {dbCat:Black&WhiteMovies}
• Use the statistical t-tests commonly used to compare
observations
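As a rough illustration of what a pair of value distributions looks like, the sketch below counts property values over an in-memory list of triples instead of querying a SPARQL endpoint. The triples, the entity sets, and the helper names are assumptions made for the example only.

```python
from collections import Counter

# Sketch: build the pair of value distributions for one property in D and S.
# In-memory triples stand in for the results of a SPARQL query (assumption).

def value_distribution(triples, entities, prop):
    """Count how often each value of `prop` occurs among `entities`."""
    return Counter(o for s, p, o in triples if p == prop and s in entities)


def paired_frequencies(triples, D, S, prop):
    """Map each value to (relative frequency in D, relative frequency in S)."""
    dist_d = value_distribution(triples, D, prop)
    dist_s = value_distribution(triples, S, prop)
    return {v: (dist_d[v] / len(D), dist_s[v] / len(S)) for v in dist_d}


triples = [
    ("db:Metropolis", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:Nosferatu", "dc:subject", "dbCat:Black&WhiteMovies"),
    ("db:Alien", "dc:subject", "dbCat:ScienceFictionMovies"),
    ("db:Solaris", "dc:subject", "dbCat:ScienceFictionMovies"),
]
D = {"db:Metropolis", "db:Nosferatu", "db:Alien", "db:Solaris"}
S = {"db:Metropolis", "db:Nosferatu"}  # the projection, a subset of D

pairs = paired_frequencies(triples, D, S, "dc:subject")
for value, (freq_d, freq_s) in pairs.items():
    print(value, freq_d, freq_s)
```

In this toy data the black and white category covers half of D but all of S, which is the kind of gap the tests are meant to quantify.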
8. T-Tests of statistical significance
• Test whether there is a significant difference between two populations
• compute the probability p of observing such a difference by chance alone
• State a null hypothesis (i.e. the difference is due to chance)
• there is no bias in a property
• and an alternate hypothesis (the one you want to prove)
• there is bias in a property
• If p is below 0.05, one can reject the null hypothesis
• the lower p, the more the property is biased
• Rank the properties according to p to find the most biased ones
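A minimal sketch of the test-and-rank step, using only the Python standard library. Two things are assumptions of this sketch rather than part of the presented approach: the Welch variant of the independent t-test, and a normal approximation for the p-value (reasonable only for large samples; a statistics package would use the t distribution). The sample values and the `dbp:runtime` property are invented toy data.

```python
import math
from statistics import NormalDist, mean, variance

# Sketch: independent (Welch) t statistic per property, then rank by p.

def welch_t(a, b):
    """Welch's t statistic for two independent samples (needs non-zero variance)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def approx_p(t):
    """Two-sided p-value from a normal approximation (assumption, see lead-in)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def rank_properties(samples):
    """samples maps property -> (values in D, values in S); lowest p first."""
    scored = [(prop, approx_p(welch_t(d, s))) for prop, (d, s) in samples.items()]
    return sorted(scored, key=lambda item: item[1])


samples = {
    "dbp:released": ([1935, 1960, 1975, 1990, 2005, 2010],
                     [1931, 1933, 1935, 1936, 1938, 1939]),
    "dbp:runtime": ([90, 100, 110, 95, 105, 120],
                    [92, 101, 108, 97, 104, 118]),
}

ranking = rank_properties(samples)
for prop, p in ranking:
    print(prop, "biased" if p < 0.05 else "no evidence of bias")
```

Sorting by ascending p puts the properties with the strongest evidence of bias first.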
9. Experiments and results
• 30 datasets and 54 pairs from the DataHub [1]
• Varying in the number of entities in S (from 30 to 60,000
approx.)
• Varying in domain (multi-domain, biomedical,
computer science, education, geography…)
[1] http://datahub.io/
10. When results are expected…
• NLFinland, places in Finland (connected to DBpedia)
class       prop                  value                        p
db:Place    dc:subject            db:CitiesAndTownsInFinland   p < 1.00e-15
db:Place    dbp:latd (average)    40.5                         p < 1.00e-15
db:Place    dbp:longd (average)   24.6                         p < 1.00e-15
• NLSpain, bibliographic Spanish data (connected to DBpedia)
class              prop              value        p
db:MusicalArtist   db:birthPlace     db:Spain     p < 1.13e-13
db:Writer          dbp:nationality   db:Spanish   p < 4.64e-03
11. …when results are less expected
• Uniprot, biomedical data (connected to
Bio2RDF/BioPax/DrugBank)
class        prop              value            p
up:Protein   up:isolatedFrom   uptissue:Brain   p < 1.33e-04
• RED, writers data (connected to DBpedia)
class      prop            value        p
db:Agent   db:genre        db:Novel     p < 1.00e-15
db:Agent   db:genre        db:Poetry    p < 1.00e-15
db:Agent   db:deathCause   db:Suicide   p < 1.00e-15
12. Conclusions and future work
• The importance of identifying the bias in a dataset
• Approach:
• with information from the connected datasets
• statistical t-tests on the distributions of the values of a property
• ranking properties based on the probability of being biased
• Evaluating Dedalo’s performance on Google Trends
Please participate!
http://linkedu.eu/dedalo/eval/
13. Thank you for your attention
Questions?
ilaria.tiddi@open.ac.uk
@IlaTiddi http://linkedu.eu/dedalo/eval/
14. Dedalo: explaining clusters with Linked Data
• Linked Data are a graph
• nodes : URIs
• edges : RDF properties
• Some nodes lead to the same node through a walk
Walk = a chain of RDF properties
• A walk can serve as an explanation for the cluster
ExplC = a chain of properties and one final entity
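The idea of a walk can be illustrated with a tiny in-memory graph. This is a sketch under assumptions, not Dedalo's implementation: the graph encoding, the URIs, and the `follow_walk` helper are hypothetical.

```python
# Sketch: follow a walk (a chain of RDF properties) from a start node.
# The graph maps (node, property) to the list of nodes that edge leads to.

def follow_walk(graph, start, walk):
    """Return the set of nodes reached from `start` by following `walk` in order."""
    frontier = {start}
    for prop in walk:
        frontier = {o for node in frontier for o in graph.get((node, prop), ())}
    return frontier


graph = {
    ("db:Alien", "dc:subject"): ["dbCat:SpaceHorror"],
    ("db:Metropolis", "dc:subject"): ["dbCat:Dystopia"],
    ("dbCat:SpaceHorror", "skos:broader"): ["dbCat:ScienceFiction"],
    ("dbCat:Dystopia", "skos:broader"): ["dbCat:ScienceFiction"],
}
walk = ("dc:subject", "skos:broader")

ends_alien = follow_walk(graph, "db:Alien", walk)
ends_metropolis = follow_walk(graph, "db:Metropolis", walk)
print(ends_alien & ends_metropolis)
```

Both movies reach the same final entity through the same walk, so the pair (walk, final entity) is a candidate explanation for a cluster that contains them.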
15. Dedalo: explaining clusters with Linked Data
ExplC = “movies whose subject is a subcategory of Science Fiction”
A* iterative search
Entropy to drive the search expanding the graph
Improving the F-score of ExplC at each iteration
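How an explanation might be scored against a cluster can be sketched as follows. The slide does not say which F-score variant is used, so the balanced F1 below is an assumption, and the item names are toy data.

```python
# Sketch: F1 score of an explanation, given the items it covers and the cluster.

def f_score(covered, cluster):
    """Balanced F-score of the covered item set with respect to the cluster."""
    hits = len(covered & cluster)
    if hits == 0:
        return 0.0
    precision = hits / len(covered)  # covered items that are in the cluster
    recall = hits / len(cluster)     # cluster items that are covered
    return 2 * precision * recall / (precision + recall)


cluster = {"film:a", "film:b", "film:c", "film:d"}
covered = {"film:a", "film:b", "film:c", "film:x"}  # reached by the explanation

score = f_score(covered, cluster)
print(round(score, 2))
```

An iterative search would keep the explanation whose covered set yields the highest score so far.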
16. Knowledge Discovery
[Figure: raw data → clean data → Patterns → Knowledge]
The process of identifying patterns in data [1]
Patterns are usually interpreted by the experts
Linked Data can be used to automatically interpret patterns
open, shared, multi-domain, connected knowledge
[1] Fayyad, 1998.
17. Contribution
Need to identify the bias when producing Linked Data systems
A recommender system based on DBpedia (any kind of movies)
DBpedia is linked to the Linked Movies Database ('30s movies)
The recommendation might be compromised
We propose a process to identify and measure the bias based on
statistical methods
18. Motivation
• Students are interested in Health&Social Care since they live in
the Northern Hemisphere
• What about the other counties?
• are they connected to the “Northern Hemisphere” entity?
• There must be a bias: the information is unevenly distributed
• Solution: weighting properties to rebalance the unevenness
One dataset is linked to another
A subset is linked to another
Compare the subset of linked entities
The subset does not reflect the whole dataset
We propose a process based on statistical methods to do it
remove background
Distributions comparison
Use SPARQL to build pairs of distributions of values for (D, S)
Compare these distributions
How do we compare?
ind. >> compared by distributions
A low p = a significant difference (not random) = the most biased
Independent t-test for numerical values (dbp:released)
Paired t-test for the others (dc:subject)
Independent t-test >> the most common: the difference is shown by the groups' average, standard deviation and sample size
p is compared against alpha (type I error: assuming there is a relationship when there is not)
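For the paired case, a generic paired t statistic can be sketched with the standard library. The notes do not say how observations are paired, so pairing each value's relative frequency in D with its frequency in S is an assumption of this sketch, and the numbers are invented.

```python
import math
from statistics import mean, stdev

# Sketch: paired t statistic over (frequency in D, frequency in S) pairs.

def paired_t(pairs):
    """Return (t statistic, degrees of freedom) for paired observations."""
    diffs = [s - d for d, s in pairs]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical relative frequencies of four property values in D and in S.
frequency_pairs = [(0.10, 0.40), (0.20, 0.45), (0.15, 0.50), (0.05, 0.35)]

t_stat, dof = paired_t(frequency_pairs)
print(round(t_stat, 1), dof)
```

A large absolute t with few degrees of freedom still needs a t distribution lookup to obtain p, which this sketch leaves out.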
indications of the most biased prop and class and which values the
somr
presenting them in order of surprise