Exploiting Linked Open Data as Background Knowledge in Data Mining

Heiko Paulheim, University of Mannheim
10/08/13

Outline
• Motivation
• The original FeGeLOD framework
• Experiments
• Applications
• The RapidMiner Linked Open Data Extension
• Challenges and Future Work

Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN            City       Sold
3-2347-3427-1   Darmstadt  124
3-43784-324-2   Mannheim   493
3-145-34587-0   Roßdorf    14
...

ISBN            City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1   Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2   Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0   Roßdorf    12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
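
The enrichment in this example can be sketched as a plain join between the sales table and two lookup tables. The values are the ones shown on the slide; the dictionary and function names are illustrative assumptions, and in practice the background knowledge would be fetched from Linked Open Data rather than hard-coded.

```python
# Sales table from the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Background knowledge, hard-coded here so the sketch runs offline.
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_genre = {
    "3-2347-3427-1": "Crime",
    "3-43784-324-2": "Crime",
    "3-145-34587-0": "Travel",
}


def enrich(rows):
    """Attach population and genre columns to each sales row."""
    return [
        {
            "isbn": isbn,
            "city": city,
            "population": city_population[city],
            "genre": book_genre[isbn],
            "sold": sold,
        }
        for isbn, city, sold in rows
    ]


enriched = enrich(sales)

# Crime novels ordered by city size, to eyeball the trend from the slide.
crime = sorted(
    (r for r in enriched if r["genre"] == "Crime"),
    key=lambda r: r["population"],
)
for r in crime:
    print(r["city"], r["population"], r["sold"])
```

With the added columns, a learner can relate sales to city size and genre, which is not possible from the three original columns alone.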

Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?

Motivation
http://lod-cloud.net/

Motivation
• Idea:
– reuse background knowledge from Linked Open Data
– include it in the data mining process as needed
• Two main variants:
– develop mining/learning algorithms that run directly on Linked Data
– create relational features from Linked Data

Motivation
• Develop mining/learning algorithms
– e.g., DL Learner
– e.g., dedicated Kernel functions
• Advantages:
– can be quite efficient
– no reduction to “flat” table structure
– semantics can be respected directly

Motivation
• Create relational features
– e.g., LiDDM
– e.g., AutoSPARQL
– e.g., FeGeLOD / RapidMiner Linked Open Data Extension
• Advantages:
– Easy combination of knowledge from various sources
• including relational features in the original data
– Arbitrary mining algorithms/tools possible

FeGeLOD – Feature Generation from LOD
• Input record
– ISBN: 3-2347-3427-1
– City: Darmstadt
– # sold: 124
• Named Entity Recognition adds
– City_URI: http://dbpedia.org/resource/Darmstadt
• Feature Generation adds
– City_URI_dbpedia-owl:populationTotal: 141471
– City_URI_...: ...
• Feature Selection keeps
– City_URI_dbpedia-owl:populationTotal: 141471
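
A minimal sketch of the first two pipeline steps, assuming that entity recognition simply guesses a DBpedia URI from the label and that feature generation asks a SPARQL endpoint for literal-valued properties. The function names are hypothetical and are not taken from FeGeLOD; the endpoint result is hard-coded so that the sketch runs offline.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_uri(label):
    """Naive entity linking: turn a label into a DBpedia-style resource URI."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def data_property_query(uri):
    """Build a SPARQL query for all literal-valued properties of a resource."""
    return (
        "SELECT ?p ?v WHERE { "
        f"<{uri}> ?p ?v . FILTER(isLiteral(?v)) "
        "}"
    )


def add_features(row, column, bindings):
    """Add <column>_URI and one <column>_URI_<property> feature per binding."""
    out = dict(row)
    out[f"{column}_URI"] = guess_uri(row[column])
    for prop, value in bindings:
        out[f"{column}_URI_{prop}"] = value
    return out


row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}

# What an endpoint might return for the query; assumed, not fetched.
bindings = [("dbpedia-owl:populationTotal", 141471)]

print(data_property_query(guess_uri("Darmstadt")))
print(add_features(row, "City", bindings))
```

Guessing URIs only works when the label matches the resource name; ambiguous or differently spelled labels need a real linking step.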

FeGeLOD – Feature Generation from LOD
• Original prototype, based on Weka:
– Simple NER (guessing URIs)
– Seven generators:
• direct types
• data properties
• unqualified relations (boolean, numeric)
• qualified relations (boolean, numeric)
• individuals (dangerous!) - may be restricted to specific property
– Simple feature selection: filtering features
• that have only* different values (except numerical)
• that have only* identical values
• that are mostly missing*
*) 95% or 99%
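
The three filters can be sketched as a single predicate per feature column. This is an interpretation, not the original Weka code: it assumes that "only*" means "at least the threshold fraction of the non-missing values", and the function name is hypothetical.

```python
def keep_feature(values, threshold=0.95, numeric=False):
    """Return False for columns that are mostly missing, mostly constant,
    or (for non-numeric columns) made up of mostly distinct values."""
    n = len(values)
    present = [v for v in values if v is not None]
    if n == 0 or (n - len(present)) / n >= threshold:
        return False  # mostly missing

    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1

    if max(counts.values()) / len(present) >= threshold:
        return False  # (almost) all values identical
    if not numeric and len(counts) / len(present) >= threshold:
        return False  # (almost) all values different, e.g. an identifier
    return True


print(keep_feature(["Road", "City"] * 50))               # informative nominal
print(keep_feature([str(i) for i in range(100)]))        # ID-like column
print(keep_feature(list(range(100)), numeric=True))      # numeric is exempt
```

Numeric columns are exempt from the "all different" filter because a column such as population is naturally distinct for every row and still useful.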

Experiments
• Testing with two* standard machine learning data sets
– Zoo: classifying animals
– AAUP: predicting income of university employees
(regression task)
• Question: how much improvement do additional features bring?
*) standard ML datasets with speaking labels are scarce!

Experiments: Zoo Dataset

First Results: AAUP

Experiments: Early Insights
• Additional features often improve the results
• Zoo dataset:
– Ripper: 89.11 to 96.04
– SMO: 93.07 to 97.03
– No improvement for Naive Bayes
• AAUP dataset (compensation):
– M5: 59.88 to 51.28
– SMO: 74.12 to 61.97
– No improvement for linear regression
• ...but they may also cause problems
– extreme example: 6.54 to 189.90 for linear regression
– memory and timeouts due to large datasets

Experiments: Quality of Features
• Information gain of features on Zoo dataset

Experiments: Quality of Features
• Information gain of features on AAUP dataset (compensation)
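
For reference, information gain for a nominal feature can be computed as the class entropy minus the weighted entropy after splitting on the feature. The sketch below uses invented toy values, not the Zoo or AAUP data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Invented toy example: one feature separates the classes, one does not.
labels = ["mammal", "mammal", "bird", "bird"]
separating = [True, True, False, False]
uninformative = [True, False, True, False]

print(information_gain(separating, labels))
print(information_gain(uninformative, labels))
```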

Application: Classifying Events from Wikipedia
• Event Extraction from Wikipedia
• Joint work with Dennis Wegener and Daniel Hienert (GESIS)
• Task: event classification (e.g., Politics, Sports, ...)

• Source Material:
http://www.vizgr.org/historical-events/timeline/

• Positive Examples for class politics:
– 2011, March 15 - German chancellor Angela Merkel shuts down the seven
oldest German nuclear power plants.
– 2010, June 3 – Christian Wulff is nominated for President of Germany by
Angela Merkel.
• Negative Examples for class politics:
– 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first
time, along with Netherlands make the 2010 FIFA World Cup Final.
– 2012, February 16 – Roman Lob is selected to represent Germany in the
Eurovision Song Contest.

• Possible learned model:
– "Angela Merkel" → Politics

• How can we do better?
• Background knowledge from Linked Open Data
– 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts
down the seven oldest German nuclear power plants.
– 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class:
Politician] is elected to continue as Minister-President, heading an SPD-
Green coalition.
• Model learned in that case:
– "[class: Politician]" → Politics

• Much more general
– Can also classify events with politicians
not contained in the training set
• Fewer training examples required
– A few events with politicians, athletes, singers, ... are enough
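
The annotation step can be sketched as follows. The lookup table is a hypothetical stand-in for entity linking plus a type query against Linked Open Data; only the names and classes that appear in the slides are used.

```python
# Hypothetical stand-in for an entity linker followed by an rdf:type lookup.
ENTITY_CLASS = {
    "Angela Merkel": "Politician",
    "Hannelore Kraft": "Politician",
    "Peter Altmaier": "Politician",
}


def annotate(text, entity_class=ENTITY_CLASS):
    """Append "[class: X]" after each known entity mention."""
    for name, cls in entity_class.items():
        text = text.replace(name, f"{name} [class: {cls}]")
    return text


def class_features(text, entity_class=ENTITY_CLASS):
    """Return the classes of all known entities mentioned in the text."""
    return {cls for name, cls in entity_class.items() if name in text}


event = "German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants."
print(annotate(event))
print(class_features("Hannelore Kraft is elected to continue as Minister-President."))
```

A classifier trained on the class features never sees the individual names, so a politician who is absent from the training set still produces the same feature.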

• Experiments on Wikipedia data
– >10 categories
– 1,000 labeled examples as training set
– Classification accuracy: 80%
• Plus:
– We have trained a language-independent model!
• without such features, models are often like "elect*" → Politics, which only works for one language
– 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von
Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
– 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för
Vänsterpartiet efter Lars Ohly [class: Politician].

Application: Classifying Tweets
• Joint work with Axel Schulz and Petar Ristoski (SAP Research)
• Goal: using Twitter for emergency management
– “fire at #mannheim #university”
– “omg two cars on fire #A5 #accident”
– “fire at train station still burning”
– “my heart is on fire!!!”
– “come on baby light my fire”
– “boss should fire that stupid moron”

• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident/non-incident tweets
– Apply the model for detecting incident-related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)

• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city

• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident

Using LOD for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Background knowledge:
– dbpedia:Interstate_90 rdf:type dbpedia-owl:Road
– dbpedia:Interstate_51 rdf:type dbpedia-owl:Road
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy keeps up at 90%
– Overfitting is avoided
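
The effect can be reproduced with a deliberately simple toy learner that keeps the features shared by all positive examples. The mention-to-type table is a hypothetical stand-in for DBpedia Spotlight plus a type lookup, and the learner is an illustration, not the classifier used in the experiments.

```python
import re

# Hypothetical stand-in for entity linking plus an rdf:type lookup.
MENTION_TYPE = {"I90": "dbpedia-owl:Road", "I51": "dbpedia-owl:Road"}


def tokens(tweet):
    """Plain word features."""
    return set(re.findall(r"\w+", tweet))


def abstracted_tokens(tweet):
    """Word features with linked mentions replaced by their type."""
    return {MENTION_TYPE.get(t, t) for t in tokens(tweet)}


def learn_rule(examples, featurize):
    """Toy learner: the features shared by all positive examples."""
    return set.intersection(*(featurize(t) for t in examples))


def matches(rule, tweet, featurize):
    """A tweet matches if it contains every feature of the rule."""
    return rule <= featurize(tweet)


train = ["Again crash on I90", "Accident on I90"]
unseen = "Two cars crashed on I51"

word_rule = learn_rule(train, tokens)
type_rule = learn_rule(train, abstracted_tokens)

print(word_rule, matches(word_rule, unseen, tokens))
print(type_rule, matches(type_rule, unseen, abstracted_tokens))
```

The word-based rule depends on the token "I90" and misses the tweet about I51, while the type-based rule matches it.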

Explaining Statistics
• Statistics are very widespread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...

• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)

http://xkcd.com/605/

• What statistics often look like

• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• Approach:
– use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
– run analysis tools for finding explanations

Prototype Tool: Explain-a-LOD
• Loads a statistics file (e.g., CSV)
• Adds background knowledge
• Runs basic analysis (correlation, rule learning)
• Presents explanations
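
A minimal sketch of the load-and-correlate part of such a tool, assuming the enriched data is already available as CSV text. The column names mimic features mentioned in the slides, but the numbers are made up for illustration and the function names are hypothetical.

```python
import csv
import io
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if one side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(csv_text, target):
    """Rank numeric columns by absolute correlation with the target column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ys = [float(r[target]) for r in rows]
    scores = {}
    for col in rows[0]:
        if col == target:
            continue
        try:
            xs = [float(r[col]) for r in rows]
        except ValueError:
            continue  # skip non-numeric columns such as names
        scores[col] = pearson(xs, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up numbers, for illustration only.
DATA = """city,quality,junHighC,areaTotalKm
A,100,20,100
B,80,25,500
C,60,30,300
D,40,35,700
"""

ranked = rank_features(DATA, "quality")
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

A strong correlation found this way is a candidate explanation only; it still needs a human to judge whether it is meaningful.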

Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in 216 cities worldwide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• LOD data sets used in the examples:
– DBpedia
– CIA World Factbook for statistics by country

• Examples for low quality cities
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
• a very accurate rule
• but what's the interpretation?

• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results

• Example rules for countries with low corruption
– HDI > 78%
• Human Development Index, calculated from
life expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarters in that state,
but fewer than two cargo airlines

• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/

Evaluation of Feature Quality
• Quality of living dataset
[Chart: values on a scale from 1 to 5 for correlation and rule learning, by feature type: data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint]

Evaluation of Feature Quality
• Corruption dataset
[Chart: values on a scale from 1 to 5 for correlation and rule learning, by feature type: data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint]

• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
• Exception: Frankfurt

• Data Set: Antidepressant consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en

• Findings for antidepressant consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
• Nordic countries, countries at the Atlantic: high
• Mediterranean: medium
• Alpine countries: low
– High average age → high consumption
– High birth rates → high consumption

• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html

• Findings for suicide rates
– Democracies have lower suicide rates than other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
• At the sea → low
• In the mountains → high
– High Gini index → low suicide rate
• High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates

• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya

• Findings on sexual activity
– By geography:
• High in Europe, low in Asia
• Low in island states
– By language:
• English speaking: low
• French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity

Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/

RapidMiner Linked Open Data Extension
• August 16th, 2013: FeGeLOD celebrates its 2nd birthday
• Problems
– still no nice UI
– special configurations are tricky
– difficult to enhance
• Decision
– Reimplementation on RapidMiner platform
– September 13th, 2013: Release of RapidMiner Linked Open Data Extension
– Available from RapidMiner marketplace
• http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/

• Simple wiring of operators
– linkers
– generators
• Combination with powerful RapidMiner operators

• Easy SPARQL endpoint definitions
• Support of custom SPARQL statements

Challenges and Future Work
• SPARQL variants
– Some endpoints support special/non-standard SPARQL constructs
– COUNT(...)
– transitive closure
– exploit where applicable
• Implementations without SPARQL
– Freebase
– OpenCyc

• Linking is still challenging
– URI patterns are not flexible
– Search by label is time-consuming
– Services like DBpedia Lookup are scarce
• Limitations of completely unsupervised linking
– e.g., Hurricanes
– how to use headlines/attribute names?

• Linking as optimization problem
– find candidates for all entities, e.g., by DBpedia lookup
– find a selection of candidates that are most similar to each other
• e.g., all of them are U.S. cities
– some experiments with types and categories
• problem: not complete
– some problems cannot be addressed (e.g.: Hurricanes)
• Alternatives:
– semi-supervised linking – user provides some example links
– active learning
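
Linking as an optimization problem can be sketched as choosing one candidate per label so that the chosen candidates are as similar to each other as possible. The sketch below uses Jaccard similarity on type sets and brute-force search; the candidate identifiers and types are invented for illustration (the `ex:` prefix marks them as hypothetical), and a real implementation would need a search strategy that scales beyond a handful of labels.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_jointly(candidates):
    """Pick one (uri, types) candidate per label, maximizing pairwise type similarity."""
    labels = list(candidates)
    best, best_score = None, -1.0
    for choice in product(*(candidates[label] for label in labels)):
        score = sum(
            jaccard(t1, t2) for (_, t1), (_, t2) in combinations(choice, 2)
        )
        if score > best_score:
            best, best_score = choice, score
    return {label: uri for label, (uri, _) in zip(labels, best)}


# Invented candidates: each label has a city reading and, for two labels, another reading.
CANDIDATES = {
    "Phoenix": [
        ("ex:Phoenix_bird", {"MythicalCreature"}),
        ("ex:Phoenix_city", {"City", "Place"}),
    ],
    "Buffalo": [
        ("ex:Buffalo_animal", {"Animal", "Species"}),
        ("ex:Buffalo_city", {"City", "Place"}),
    ],
    "Austin": [("ex:Austin_city", {"City", "Place"})],
}

links = link_jointly(CANDIDATES)
print(links)
```

The brute-force search enumerates every combination of candidates, so its cost grows exponentially with the number of labels.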

• Exploiting semantics for feature selection
• Given two features:
– f1: type(RoadsInAlaska)
– f2: type(Road)
• and the schema definition RoadsInAlaska rdfs:subClassOf Road
• Exploit that information for feature selection
– e.g., gain(f1) ≈ gain(f2), f1<f2 → remove f1
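
A sketch of this heuristic, assuming that RoadsInAlaska is the more specific class (a subclass of Road) and that "≈" means "within a small tolerance". The gain values and the function name are made up for illustration.

```python
def prune_by_hierarchy(gains, subclass_of, tolerance=0.01):
    """Drop a subclass feature when its superclass feature has about the same gain."""
    removed = set()
    for sub, sup in subclass_of:
        if sub in gains and sup in gains and abs(gains[sub] - gains[sup]) <= tolerance:
            removed.add(sub)
    return {f: g for f, g in gains.items() if f not in removed}


# Made-up information gain values.
gains = {
    "type(RoadsInAlaska)": 0.300,
    "type(Road)": 0.305,
    "type(Bridge)": 0.100,
}

# (subclass feature, superclass feature) pairs derived from the schema.
subclass_of = [("type(RoadsInAlaska)", "type(Road)")]

selected = prune_by_hierarchy(gains, subclass_of)
print(sorted(selected))
```

Keeping the more general feature preserves the information while making the resulting model applicable to more instances.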

• Incompleteness of LOD
– e.g., type information in DBpedia
– may lead to findings such as
• if a city is of type Place, the quality of living is high
– possible remedy: autocomplete on the dataset
(e.g., Paulheim/Bizer 2013)
• Biases in LOD
– e.g., DBpedia has a bias towards western culture
– may lead to findings such as
• if many records have been made in a city, the quality of living is high

• Features not used for scalability reasons:
– features for single entities
• e.g., “Roman Polanski directorOf X”
– features more than one hop away
• e.g., “Cities with a university which has a computer science department”
– some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
• but subject to YAGO's selection bias
• Approaches are required to use such features
– which respect scalability
– “generate first, filter later” is not the best solution
• e.g., “Cities with at least one of ArtSchoolsInParis”
– on-the-fly filtering may be more suitable
• e.g., sampling

• Automatically exploit data sources with non-simple structures
– real example from CORDIS dataset:
EU18931 a Funding .
EU18931 has-grant-value [
    has-amount 1300000 ;
    has-unit-of-measure EUR
] .
• Support geo/temporal features
– e.g., Data Cubes
– e.g., Linked Geo Data
• Construct complex features (in a scalable way!)
– e.g., cinemas per inhabitant

Wrap-up
• Linked Data is useful as background knowledge
– especially on problems which have little knowledge in themselves
• Unsupervised methods
– avoid biases and work without knowledge about LOD
– but: scalability and generality problems
• RapidMiner LOD extension
– a constantly growing toolkit

Credits & Thanks
• Past contributors of FeGeLOD:
– Johannes Fürnkranz
– Raad Bahmani
– Alexander Gabriel
– Simon Holthausen
• Current team of RapidMiner Linked Open Data Extension:
– Chris Bizer
– Petar Ristoski
– Evgeny Mitichkin
