10/08/13 Heiko Paulheim 1
Exploiting Linked Open Data
as Background Knowledge in Data Mining
Heiko Paulheim, University of Mannheim
Outline
• Motivation
• The original FeGeLOD framework
• Experiments
• Applications
• The RapidMiner Linked Open Data Extension
• Challenges and Future Work
Motivation: An Example Data Mining Task
• Analyzing book sales
Original data:
ISBN           City       Sold
3-2347-3427-1  Darmstadt  124
3-43784-324-2  Mannheim   493
3-145-34587-0  Roßdorf    14
...

With background knowledge added:
ISBN           City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
Motivation
http://lod-cloud.net/
Motivation
• Idea:
– reuse background knowledge from Linked Open Data
– include it in the data mining process as needed
• Two main variants:
– develop mining/learning algorithms that run directly on Linked Data
– create relational features from Linked Data
Motivation
• Develop mining/learning algorithms
– e.g., DL Learner
– e.g., dedicated Kernel functions
• Advantages:
– can be quite efficient
– no reduction to “flat” table structure
– semantics can be respected directly
Motivation
• Create relational features
– e.g., LiDDM
– e.g., AutoSPARQL
– e.g., FeGeLOD / RapidMiner Linked Open Data Extension
• Advantages:
– Easy combination of knowledge from various sources
• including relational features in the original data
– Arbitrary mining algorithms/tools possible
FeGeLOD – Feature Generation from LOD
• Input table
  ISBN           City       # sold
  3-2347-3427-1  Darmstadt  124
• Step 1: Named Entity Recognition adds the column
  – City_URI: http://dbpedia.org/resource/Darmstadt
• Step 2: Feature Generation adds the columns
  – City_URI_dbpedia-owl:populationTotal: 141471
  – City_URI_...: ...
• Step 3: Feature Selection keeps
  – City_URI_dbpedia-owl:populationTotal: 141471
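The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FeGeLOD's actual code: `guess_dbpedia_uri` and `data_property_query` are hypothetical helper names, and the sketch only builds the URI and the SPARQL query text without contacting any endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label: str) -> str:
    """Guess a DBpedia URI from a plain label (hypothetical helper)."""
    # DBpedia resource names use underscores instead of spaces.
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def data_property_query(uri: str) -> str:
    """Build a SPARQL query for all literal-valued properties of a URI."""
    return (
        "SELECT ?p ?v WHERE { "
        f"<{uri}> ?p ?v . "
        "FILTER(isLiteral(?v)) }"
    )


# One row of the book sales table from the slide.
row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}

# Step 1: simple NER by guessing the URI from the city name.
row["City_URI"] = guess_dbpedia_uri(row["City"])

# Step 2: the query whose results would become new feature columns.
query = data_property_query(row["City_URI"])
```

Guessing URIs this way is cheap but fails for ambiguous or differently spelled labels, which is one reason the later slides treat linking as an open challenge.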
FeGeLOD – Feature Generation from LOD
• Original prototype, based on Weka:
– Simple NER (guessing URIs)
– Seven generators:
• direct types
• data properties
• unqualified relations (boolean, numeric)
• qualified relations (boolean, numeric)
• individuals (dangerous!), may be restricted to a specific property
– Simple feature selection: filtering features
• that have only* different values (except numerical)
• that have only* identical values
• that are mostly missing*
*) 95% or 99%
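The simple feature selection above could look roughly like the following Python sketch. It is an assumption about how the three filters might be implemented, not the original Weka-based code; `keep_feature` is a hypothetical name, and missing values are represented as `None`.

```python
from collections import Counter


def keep_feature(values, numeric, threshold=0.95):
    """Return False if a feature column should be filtered out.

    values:    list of feature values, None marks a missing value
    numeric:   True for numerical features
    threshold: fraction used for "only*" and "mostly*" (0.95 or 0.99)
    """
    n = len(values)
    present = [v for v in values if v is not None]

    # Filter 1: mostly missing values.
    if n == 0 or (n - len(present)) / n >= threshold:
        return False

    counts = Counter(present)

    # Filter 2: (almost) only identical values.
    if max(counts.values()) / len(present) >= threshold:
        return False

    # Filter 3: (almost) only different values, except for numerical features.
    if not numeric and len(counts) / len(present) >= threshold:
        return False

    return True
```

A nominal feature that is different for nearly every instance behaves like an identifier and cannot generalize, while a numerical feature with all-different values (such as population) can still be useful, hence the exception.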
Experiments
• Testing with two* standard machine learning data sets
– Zoo: classifying animals
– AAUP: predicting income of university employees
(regression task)
• Question: how much improvement do additional features bring?
*) standard ML datasets with meaningful, human-readable labels are scarce!
Experiments: Zoo Dataset
First Results: AAUP
Experiments: Early Insights
• Additional features often improve the results
• Zoo dataset (accuracy in %):
– Ripper: 89.11 to 96.04
– SMO: 93.07 to 97.03
– No improvement for Naive Bayes
• AAUP dataset (compensation; error, lower is better):
– M5: 59.88 to 51.28
– SMO: 74.12 to 61.97
– No improvement for linear regression
• ...but they may also cause problems
– extreme example: 6.54 to 189.90 for linear regression
– memory and timeouts due to large datasets
Experiments: Quality of Features
• Information gain of features on Zoo dataset
Experiments: Quality of Features
• Information gain of features on AAUP dataset (compensation)
Application: Classifying Events from Wikipedia
• Event Extraction from Wikipedia
• Joint work with Dennis Wegener and Daniel Hienert (GESIS)
• Task: event classification (e.g., Politics, Sports, ...)
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Source Material:
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Positive Examples for class politics:
– 2011, March 15 - German chancellor Angela Merkel shuts down the seven
oldest German nuclear power plants.
– 2010, June 3 – Christian Wulff is nominated for President of Germany by
Angela Merkel.
• Negative Examples for class politics:
– 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first
time, along with Netherlands make the 2010 FIFA World Cup Final.
– 2012, February 16 – Roman Lob is selected to represent Germany in the
Eurovision Song Contest.
• Possible learned model:
– "Angela Merkel" → Politics
• How can we do better?
• Background knowledge from Linked Open Data
– 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts
down the seven oldest German nuclear power plants.
– 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class:
Politician] is elected to continue as Minister-President, heading an SPD-
Green coalition.
• Model learned in that case:
– "[class: Politician]" → Politics
• Much more general
– Can also classify events with politicians
not contained in the training set
• Fewer training examples required
– A few events with politicians, athletes, singers, ... are enough
Application: Classifying Events from Wikipedia
• Experiments on Wikipedia data
– >10 categories
– 1,000 labeled examples as training set
– Classification accuracy: 80%
• Plus:
– We have trained a language-independent model!
• often, models are like "elect*" → Politics
– 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von
Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
– 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för
Vänsterpartiet efter Lars Ohly [class: Politician].
Application: Classifying Tweets
• Joint work with Axel Schulz and Petar Ristoski (SAP Research)
• Goal: using Twitter for emergency management
• Example tweets containing "fire":
– "fire at #mannheim #university"
– "omg two cars on fire #A5 #accident"
– "fire at train station still burning"
– "my heart is on fire!!!"
– "come on baby light my fire"
– "boss should fire that stupid moron"
Application: Classifying Tweets
• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident/non-incident tweets
– Apply the model to detect incident-related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)
Application: Classifying Tweets
• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
Application: Classifying Tweets
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident
Using LOD for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Background knowledge:
– dbpedia:Interstate_90 rdf:type dbpedia-owl:Road
– dbpedia:Interstate_51 rdf:type dbpedia-owl:Road
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy stays at ~90%
– Overfitting is avoided
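The effect can be illustrated with a small Python sketch. The dictionary `ENTITY_TYPES` is a hand-written, hypothetical stand-in for what an entity linker such as DBpedia Spotlight plus a type lookup would deliver; the function names are likewise assumptions made for this illustration.

```python
# Hypothetical stand-in for entity linking plus rdf:type lookup.
ENTITY_TYPES = {
    "I90": ("dbpedia:Interstate_90", "dbpedia-owl:Road"),
    "I51": ("dbpedia:Interstate_51", "dbpedia-owl:Road"),
}


def featurize(tweet, use_types):
    """Turn a tweet into a set of features.

    With use_types=True, linked entities are replaced by their type,
    so "I90" and "I51" both become "dbpedia-owl:Road".
    """
    features = set()
    for token in tweet.split():
        if use_types and token in ENTITY_TYPES:
            features.add(ENTITY_TYPES[token][1])
        else:
            features.add(token.lower())
    return features


train = ["Again crash on I90", "Accident on I90"]
unseen = "Two cars crashed on I51"


def shared_with_training(tweet, use_types):
    """Features of the tweet that also occurred in the training tweets."""
    train_features = set().union(*(featurize(t, use_types) for t in train))
    return featurize(tweet, use_types) & train_features


plain_overlap = shared_with_training(unseen, use_types=False)
typed_overlap = shared_with_training(unseen, use_types=True)
```

Without types, the unseen tweet shares only the uninformative token "on" with the training data; with types, it also shares `dbpedia-owl:Road`, so a model trained on that feature can transfer to roads it has never seen.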
Explaining Statistics
• Statistics are widespread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
Explaining Statistics
• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)
Explaining Statistics
http://xkcd.com/605/
Explaining Statistics
• What statistics often look like
Explaining Statistics
• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• Approach:
– use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
– run analysis tools for finding explanations
Prototype Tool: Explain-a-LOD
• Loads a statistics file (e.g., CSV)
• Adds background knowledge
• Runs basic analysis (correlation, rule learning)
• Presents explanations
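The four steps can be mimicked in a minimal Python sketch. Everything here is an assumption made for illustration and not Explain-a-LOD's implementation: the CSV content and the `BACKGROUND` dictionary are made-up toy values (not real statistics), and only a plain Pearson correlation is computed as the "basic analysis".

```python
import csv
import io
import math

# Made-up toy values, only to show the mechanics; not real statistics.
STATS_CSV = "city,quality\nA,100\nB,80\nC,60\nD,40\n"

# Hypothetical background knowledge, as it might come from Linked Open Data.
BACKGROUND = {
    "A": {"junHighC": 20.0},
    "B": {"junHighC": 24.0},
    "C": {"junHighC": 28.0},
    "D": {"junHighC": 32.0},
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# 1. Load the statistics file.
rows = list(csv.DictReader(io.StringIO(STATS_CSV)))

# 2. Add background knowledge.
for row in rows:
    row.update(BACKGROUND[row["city"]])

# 3. Run a basic analysis (correlation).
target = [float(r["quality"]) for r in rows]
feature = [float(r["junHighC"]) for r in rows]
r = pearson(feature, target)

# 4. Present an explanation candidate.
explanation = f"junHighC correlates with quality (r = {r:.2f})"
```

A correlation found this way is only a candidate explanation; whether it is meaningful still has to be judged by a human.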
Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in 216 cities worldwide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• LOD data sets used in the examples:
– DBpedia
– CIA World Factbook for statistics by country
Statistical Data: Examples
• Examples for low quality cities
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
• a very accurate rule
• but what's the interpretation?
[Image caption: "Next Record Studio 2547 miles"]
Statistical Data: Examples
Statistical Data: Examples
• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results
Statistical Data: Examples
• Example rules for countries with low corruption
– HDI > 78%
• Human Development Index, calculated from
life expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarters in that state,
but fewer than two cargo airlines
Statistical Data: Examples
• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
Evaluation of Feature Quality
• Quality of living dataset
[Figure: chart comparing Correlation and Rule Learning per feature type (Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint); axis from 1 to 5]
Evaluation of Feature Quality
• Corruption dataset
[Figure: chart comparing Correlation and Rule Learning per feature type (Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint); axis from 1 to 5]
Statistical Data: Examples
• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
• Exception: Frankfurt
Statistical Data: Examples
• Data Set: Antidepressant consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
Statistical Data: Examples
• Findings for antidepressant consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
• Nordic countries, countries at the Atlantic: high
• Mediterranean: medium
• Alpine countries: low
– High average age → high consumption
– High birth rates → high consumption
Statistical Data: Examples
• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html
Statistical Data: Examples
• Findings for suicide rates
– Democracies have lower suicide rates than other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
• At the sea → low
• In the mountains → high
– High Gini index → low suicide rate
• High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates
Statistical Data: Examples
• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya
Statistical Data: Examples
• Findings on sexual activity
– By geography:
• High in Europe, low in Asia
• Low in Island states
– By language:
• English speaking: low
• French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/
RapidMiner Linked Open Data Extension
• August 16th, 2013: FeGeLOD celebrates its 2nd birthday
• Problems
– still no nice UI
– special configurations are tricky
– difficult to enhance
• Decision
– Reimplementation on RapidMiner platform
– September 13th, 2013: Release of RapidMiner Linked Open Data Extension
– Available from RapidMiner marketplace
• http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
RapidMiner Linked Open Data Extension
• Simple wiring of operators
– linkers
– generators
• Combination with powerful RapidMiner operators
RapidMiner Linked Open Data Extension
• Easy SPARQL endpoint definitions
• Support of custom SPARQL statements
Challenges and Future Work
• SPARQL variants
– Some endpoints support special/non-standard SPARQL constructs
– COUNT(...)
– transitive closure
– exploit where applicable
• Implementations without SPARQL
– Freebase
– OpenCyc
Challenges and Future Work
• Linking is still challenging
– URI patterns are not flexible
– Search by label is time-consuming
– Services like DBpedia Lookup are scarce
• Limitations of completely unsupervised linking
– e.g., Hurricanes
– how to use headlines/attribute names?
Challenges and Future Work
• Linking as optimization problem
– find candidates for all entities, e.g., by DBpedia lookup
– find a selection of candidates that are most similar to each other
• e.g., all of them are U.S. cities
– some experiments with types and categories
• problem: not complete
– some problems cannot be addressed (e.g.: Hurricanes)
• Alternatives:
– semi-supervised linking: the user provides some example links
– active learning
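The idea of linking as an optimization problem can be sketched as follows. This is a toy illustration under stated assumptions: the candidate lists and their types are hand-written and hypothetical (in practice they might come from a lookup service), Jaccard similarity over type sets is just one possible similarity measure, and the exhaustive search is exponential in the number of entities, so it is only feasible for tiny inputs.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets of types."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_linking(candidates):
    """Pick one candidate per entity so that the picks are most similar.

    candidates: one list per entity, each entry a (uri, set_of_types) pair.
    Returns the list of chosen URIs.
    """
    best, best_score = None, -1.0
    for combo in product(*candidates):
        # Sum of pairwise type similarities within this selection.
        score = sum(
            jaccard(x[1], y[1])
            for i, x in enumerate(combo)
            for y in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = combo, score
    return [uri for uri, _ in best]


# Hypothetical candidates for the labels "Paris", "Washington", "Rome".
CITY = {"dbpedia-owl:City", "dbpedia-owl:Place"}
PERSON = {"dbpedia-owl:Person"}
candidates = [
    [("dbpedia:Paris_Hilton", PERSON), ("dbpedia:Paris", CITY)],
    [("dbpedia:George_Washington", PERSON), ("dbpedia:Washington,_D.C.", CITY)],
    [("dbpedia:Rome", CITY)],
]

links = best_linking(candidates)
```

Because the unambiguous "Rome" is a city, the selection in which all three entities are cities scores highest. As the slide notes, this breaks down when type information is incomplete.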
Challenges and Future Work
• Exploiting semantics for feature selection
• Given two features:
– f1: type(RoadsInAlaska)
– f2: type(Road)
• and the schema definition RoadsInAlaska rdfs:subClassOf Road
• Exploit that information for feature selection
– e.g., gain(f1) ≈ gain(f2), f1<f2 → remove f1
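A minimal Python sketch of this heuristic, assuming that f1 < f2 means "f1 is more specific than f2" (RoadsInAlaska being a subclass of Road). The function names, the tolerance value, and the toy feature columns are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Information gain of a nominal feature with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def prune_subsumed(features, labels, subclass_of, tolerance=0.05):
    """Drop the more specific feature if the general one is as informative.

    features:    dict mapping feature name to its column of values
    subclass_of: list of (specific, general) feature-name pairs
    """
    gains = {name: info_gain(col, labels) for name, col in features.items()}
    kept = dict(features)
    for specific, general in subclass_of:
        if specific in kept and general in kept:
            if abs(gains[specific] - gains[general]) <= tolerance:
                del kept[specific]
    return kept


# Toy data: both type features split the labels equally well.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "type(RoadsInAlaska)": [1, 1, 1, 0, 0, 0],
    "type(Road)": [1, 1, 1, 0, 0, 0],
}
schema = [("type(RoadsInAlaska)", "type(Road)")]

kept = prune_subsumed(features, labels, schema)
```

Keeping the more general feature is the design choice here: it is expected to cover more unseen instances while carrying the same information on the training data.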
Challenges and Future Work
• Incompleteness of LOD
– e.g., type information in DBpedia
– may lead to findings such as
• if a city is of type Place, the quality of living is high
– possible remedy: autocomplete on the dataset
(e.g., Paulheim/Bizer 2013)
• Biases in LOD
– e.g., DBpedia has a bias towards western culture
– may lead to findings such as
• if many records have been made in a city, the quality of living is high
Challenges and Future Work
• Features not used for scalability reasons:
– features for single entities
• e.g., “Roman Polanski directorOf X”
– features more than one hop away
• e.g., “Cities with a university which has a computer science department”
– some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
• but subject to YAGO's selection bias
• Approaches are required to use such features
– which respect scalability
– “generate first, filter later” is not the best solution
• e.g., “Cities with at least one of ArtSchoolsInParis”
– on-the-fly filtering may be more suitable
• e.g., sampling
Challenges and Future Work
• Automatically exploit data sources with non-simple structures
– real example from the CORDIS dataset:
EU18931 a Funding ;
    has-grant-value [
        has-amount 1300000 ;
        has-unit-of-measure EUR
    ] .
• Support geo/temporal features
– e.g., Data Cubes
– e.g., Linked Geo Data
• Construct complex features (in a scalable way!)
– e.g., cinemas per inhabitant
Wrap-up
• Linked Data is useful as background knowledge
– especially for problems whose data carries little knowledge of its own
• Unsupervised methods
– avoid biases and work without knowledge about LOD
– but: scalability and generality problems
• RapidMiner LOD extension
– a constantly growing toolkit
Credits & Thanks
• Past contributors of FeGeLOD:
– Johannes Fürnkranz
– Raad Bahmani
– Alexander Gabriel
– Simon Holthausen
• Current team of RapidMiner Linked Open Data Extension:
– Chris Bizer
– Petar Ristoski
– Evgeny Mitichkin

Exploiting Linked Open Data as Background Knowledge in Data Mining

  • 1.
    10/08/13 Heiko Paulheim1 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim
  • 2.
    10/08/13 Heiko Paulheim2 Outline • Motivation • The original FeGeLOD framework • Experiments • Applications • The RapidMiner Linked Open Data Extension • Challenges and Future Work
  • 3.
    10/08/13 Heiko Paulheim3 Motivation: An Example Data Mining Task • Analyzing book sales ISBN City Sold 3-2347-3427-1 Darmstadt 124 3-43784-324-2 Mannheim 493 3-145-34587-0 Roßdorf 14 ... ISBN City Population ... Genre Publisher ... Sold 3-2347-3427-1 Darm- stadt 144402 ... Crime Bloody Books ... 124 3-43784-324-2 Mann- heim 291458 … Crime Guns Ltd. … 493 3-145-34587-0 Roß- dorf 12019 ... Travel Up&Away ... 14 ... → Crime novels sell better in larger cities
  • 4.
    10/08/13 Heiko Paulheim4 Motivation • Many data mining problems are solved better – when you have more background knowledge (leaving scalability aside) • Problems: – Tedious work – Selection bias: what to include?
  • 5.
    10/08/13 Heiko Paulheim5 Motivation http://lod-cloud.net/
  • 6.
    10/08/13 Heiko Paulheim6 Motivation • Idea: – reuse background knowledge from Linked Open Data – include it in the data mining process as needed • Two main variants: – develop mining/learning algorithms that run directly on Linked Data – create relational features from Linked Data
  • 7.
    10/08/13 Heiko Paulheim7 Motivation • Develop mining/learning algorithms – e.g., DL Learner – e.g., dedicated Kernel functions • Advantages: – can be quite efficient – no reduction to “flat” table structure – semantics can be respected directly
  • 8.
    10/08/13 Heiko Paulheim8 Motivation • Create relational features – e.g., LiDDM – e.g., AutoSPARQL – e.g., FeGeLOD / RapidMiner Linked Open Data Extension • Advantages: – Easy combination of knowledge from various sources • including relational features in the original data – Arbitrary mining algorithms/tools possible
  • 9.
    10/08/13 Heiko Paulheim9 FeGeLOD – Feature Generation from LOD IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 N a m e d E n t it y R e c o g n it io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t F e a t u r e G e n e r a t io n IS B N 3 -2 3 4 7 -3 4 2 7 -1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e / D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l: p o p u la tio n T o ta l 1 4 1 4 7 1 C ity _ U R I_ ... ... F e a t u r e S e le c t io n IS B N 3 -2 3 4 7 -3 4 2 7 - 1 C ity D a r m s ta d t # s o ld 1 2 4 C ity _ U R I h ttp : / / d b p e d ia .o r g / r e s o u r c e/ D a r m s ta d t C ity _ U R I_ d b p e d ia -o w l:p o p u la tio n T o ta l 1 4 1 4 7 1
  • 10.
    10/08/13 Heiko Paulheim10 FeGeLOD – Feature Generation from LOD • Original prototype, based on Weka: – Simple NER (guessing URIs) – Seven generators: • direct types • data properties • unqualified relations (boolean, numeric) • qualified relations (boolean, numeric) • individuals (dangerous!) - may be restricted to specific property – Simple feature selection: filtering features • that have only* different values (expect numerical) • that have only* identical values • that are mostly missing* *) 95% or 99%
  • 11.
    10/08/13 Heiko Paulheim11 Experiments • Testing with two* standard machine learning data sets – Zoo: classifying animals – AAUP: predicting income of university employees (regression task) • Question: how much improvement do additional features bring? *) standard ML datasets with speaking labels are scarce!
  • 12.
    10/08/13 Heiko Paulheim12 Experiments: Zoo Dataset
  • 13.
    10/08/13 Heiko Paulheim13 First Results: AAUP
  • 14.
    10/08/13 Heiko Paulheim14 Experiments: Early Insights • Additional features often improve the results • Zoo dataset: – Ripper: 89.11 to 96.04 – SMO: 93.07 to 97.03 – No improvement for Naive Bayes • AAUP dataset (compensation): – M5: 59.88 to 51.28 – SMO: 74.12 to 61.97 – No improvement for linear regression • ...but they may also cause problems – extreme example: 6.54 to 189.90 for linear regression – memory and timeouts due to large datasets
  • 15.
    10/08/13 Heiko Paulheim15 Experiments: Quality of Features • Information gain of features on Zoo dataset
  • 16.
    10/08/13 Heiko Paulheim16 Experiments: Quality of Features • Information gain of features on AAUP dataset (compensation)
  • 17.
    10/08/13 Heiko Paulheim17 Application: Classifying Events from Wikipedia • Event Extraction from Wikipedia • Joint work with Dennis Wegener and Daniel Hienert (GESIS) • Task: event classification (e.g., Politics, Sports, ...) http://www.vizgr.org/historical-events/timeline/
  • 18.
    10/08/13 Heiko Paulheim18 Application: Classifying Events from Wikipedia • Source Material: http://www.vizgr.org/historical-events/timeline/
  • 19.
    10/08/13 Heiko Paulheim19 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
  • 20.
    10/08/13 Heiko Paulheim20 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest. • Possible learned model: – "Angela Merkel" → Politics
  • 21.
    10/08/13 Heiko Paulheim21 Application: Classifying Events from Wikipedia • Possibly Learned Model: – "Angela Merkel" → Politics • How can we do better? • Background knowledge from Linked Open Data – 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants. – 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD- Green coalition. • Model learned in that case: – "[class: Politician]" → Politics
  • 22.
    10/08/13 Heiko Paulheim22 Application: Classifying Events from Wikipedia • Model learned in that case: – "[class: Politician]" → Politics • Much more general – Can also classify events with politicians not contained in the training set • Less training examples required – A few events with politicians, athletes, singers, ... are enough
  • 23.
    10/08/13 Heiko Paulheim23 Application: Classifying Events from Wikipedia • Experiments on Wikipedia data – >10 categories – 1,000 labeled examples as training set – Classification accuracy: 80% • Plus: – We have trained a language-independent model! • often, models are like "elect*" → Politics – 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt. – 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician].
  • 24.
    10/08/13 Heiko Paulheim24 Application: Classifying Tweets • Joint work with Axel Schulz and Petar Ristoski (SAP Research) • Goal: using Twitter for emergency management fire at #mannheim #universityomg two cars on fire #A5 #accident fire at train station still burning my heart is on fire!!!come on baby light my fire boss should fire that stupid moron
  • 25.
    10/08/13 Heiko Paulheim25 Application: Classifying Tweets • Social media contains data on many incidents – But keyword search is not enough – Detecting small incidents is hard – Manual inspection is too expensive (and slow) • Machine learning could help – Train a model to classify incident/non incident tweets – Apply model for detecting incident related tweets • Training data: – Traffic accidents – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand labeled (50% related to traffic incidents)
  • 26.
    10/08/13 Heiko Paulheim26 Application: Classifying Tweets • Learning to classify tweets: – Positive and negative examples – Features: • Stemming • POS tagging • Word n-grams • … • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city
  • 27.
    10/08/13 Heiko Paulheim27 Application: Classifying Tweets • Example set: – “Again crash on I90” – “Accident on I90” • Model: – “I90” → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → not related to traffic accident
  • 28.
    10/08/13 Heiko Paulheim28 Using LOD for Preventing Overfitting • Example set: – “Again crash on I90” – “Accident on I90” dbpedia:Interstate_90 dbpedia-owl:Road rdf:type dbpedia:Interstate_51 rdf:type • Model: – dbpedia-owl:Road → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → indicates traffic accident • Using DBpedia Spotlight + FeGeLOD – Accuracy keeps up at 90% – Overfitting is avoided
  • 29.
    10/08/13 Heiko Paulheim29 Explaining Statistics • Statistics are very wide spread – Quality of living in cities – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  • 30.
    10/08/13 Heiko Paulheim30 Explaining Statistics • Questions we are often interested in – Why does city X have a high/low quality of living? – Why is the corruption higher in country A than in country B? – Will a new film create a high/low box office revenue? • i.e., we are looking for – explanations – forecasts (e.g., extrapolations)
  • 31.
    10/08/13 Heiko Paulheim31 Explaining Statistics http://xkcd.com/605/
  • 32.
    10/08/13 Heiko Paulheim32 Explaining Statistics • What statistics often look like
  • 33.
    10/08/13 Heiko Paulheim33 Explaining Statistics • There are powerful tools for finding correlations etc. – but many statistics cannot be interpreted directly – background knowledge is missing • Approach: – use Linked Open Data for enriching statistical data (e.g., FeGeLOD) – run analysis tools for finding explanations
  • 34.
    10/08/13 Heiko Paulheim34 Prototype Tool: Explain-a-LOD • Loads a statistics file (e.g., CSV) • Adds background knowledge • Runs basic analysis (correlation, rule learning) • Presents explanations
  • 35.
    10/08/13 Heiko Paulheim35 Statistical Data: Examples • Data Set: Mercer Quality of Living – Quality of living in 216 cities word wide – norm: NYC=100 (value range 23-109) – As of 1999 – http://across.co.nz/qualityofliving.htm • LOD data sets used in the examples: – DBpedia – CIA World Factbook for statistics by country
  • 36.
    10/08/13 Heiko Paulheim36 Statistical Data: Examples • Examples for low quality cities – big hot cities (junHighC >= 27 and areaTotalKm >= 334) – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16) – latitude <= 24 and longitude <= 47 • a very accurate rule • but what's the interpretation? Next Record Studio 2547 miles Next Record Studio 2547 miles
  • 37.
    10/08/13 Heiko Paulheim37 Statistical Data: Examples
  • 38.
    10/08/13 Heiko Paulheim38 Statistical Data: Examples • Data Set: Transparency International – 177 Countries and a corruption perception indicator (between 1 and 10) – As of 2010 – http://www.transparency.org/cpi2010/results
  • 39.
    10/08/13 Heiko Paulheim39 Statistical Data: Examples • Example rules for countries with low corruption – HDI > 78% • Human Development Index, calculated from live expectancy, education level, economic performance – OECD member states – Foundation place of more than nine organizations – More than ten mountains – More than ten companies with their headquarter in that state, but less than two cargo airlines
  • 40.
    10/08/13 Heiko Paulheim40 Statistical Data: Examples • Data Set: Burnout rates – 16 German DAX companies – Absolute and relative numbers – As of 2011 – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out- erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  • 41.
    Evaluation of Feature Quality • Quality of living dataset [Figure: values on an axis from 1 to 5 for the feature types data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), and joint, shown separately for correlation and rule learning]
  • 42.
    Evaluation of Feature Quality • Corruption dataset [Figure: values on an axis from 1 to 5 for the feature types data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), and joint, shown separately for correlation and rule learning]
  • 43.
    Statistical Data: Examples • Findings for burnout rates – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – German companies are less prone to burnout than international ones • Exception: Frankfurt
  • 44.
    Statistical Data: Examples • Data Set: Antidepressant consumption – In European countries – Source: OECD – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  • 45.
    Statistical Data: Examples • Findings for antidepressant consumption – Larger countries have higher consumption – Low HDI → high consumption – By geography: • Nordic countries, countries at the Atlantic: high • Mediterranean: medium • Alpine countries: low – High average age → high consumption – High birth rates → high consumption
  • 46.
    Statistical Data: Examples • Data Set: Suicide rates – By country – OECD states – As of 2005 – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  • 47.
    Statistical Data: Examples • Findings for suicide rates – Democracies have lower suicide rates than countries with other forms of government – High HDI → low suicide rate – High population density → high suicide rate – By geography: • At the sea → low • In the mountains → high – High Gini index → low suicide rate • High Gini index ↔ unequal distribution of wealth – High usage of nuclear power → high suicide rates
  • 48.
    Statistical Data: Examples • Data set: sexual activity – Percentage of people having sex weekly – By country – Survey by Durex 2005-2009 – http://chartsbin.com/view/uya
  • 49.
    Statistical Data: Examples • Findings on sexual activity – By geography: • High in Europe, low in Asia • Low in island states – By language: • English speaking: low • French speaking: high – Low average age → high activity – High GDP per capita → low activity – High unemployment rate → high activity – High number of ISPs → low activity
  • 50.
    Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. http://xkcd.com/552/
  • 51.
    RapidMiner Linked Open Data Extension • August 16th, 2013: FeGeLOD celebrates its 2nd birthday • Problems – still no nice UI – special configurations are tricky – difficult to enhance • Decision – Reimplementation on RapidMiner platform – September 13th, 2013: Release of RapidMiner Linked Open Data Extension – Available from RapidMiner marketplace • http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
  • 52.
    RapidMiner Linked Open Data Extension • Simple wiring of operators – linkers – generators • Combination with powerful RapidMiner operators
  • 53.
    RapidMiner Linked Open Data Extension • Easy SPARQL endpoint definitions • Support of custom SPARQL statements
  • 54.
    Challenges and Future Work • SPARQL variants – Some endpoints support special/non-standard SPARQL constructs – COUNT(...) – transitive closure – exploit where applicable • Implementations without SPARQL – Freebase – OpenCyc
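"Exploit where applicable" could look roughly like the sketch below: let the endpoint count when it supports aggregates, and fall back to counting result rows on the client otherwise. The query text uses the SPARQL 1.1 aggregate syntax. Everything else is an assumption for illustration: `run_query` is a hypothetical callable that sends a query and returns variable bindings as dictionaries, `fake_endpoint` is an in-memory stand-in so the example runs without a network, and the `supports_count` flag would have to come from the endpoint configuration.

```python
from typing import Callable, Dict, List

Bindings = List[Dict[str, str]]


def build_query(subject: str, predicate: str, aggregate: bool) -> str:
    """Build a query for the objects of (subject, predicate), or their count."""
    pattern = f"<{subject}> <{predicate}> ?o"
    if aggregate:
        # SPARQL 1.1 aggregate; not every endpoint accepts this form.
        return f"SELECT (COUNT(?o) AS ?n) WHERE {{ {pattern} }}"
    return f"SELECT ?o WHERE {{ {pattern} }}"


def count_objects(run_query: Callable[[str], Bindings],
                  subject: str, predicate: str, supports_count: bool) -> int:
    """Count related objects on the endpoint if possible, else on the client."""
    rows = run_query(build_query(subject, predicate, supports_count))
    if supports_count:
        return int(rows[0]["n"])
    return len(rows)


def fake_endpoint(query: str) -> Bindings:
    """Hypothetical stand-in for a real SPARQL endpoint with three matches."""
    if "COUNT(" in query:
        return [{"n": "3"}]
    return [{"o": "ex:o1"}, {"o": "ex:o2"}, {"o": "ex:o3"}]


print(count_objects(fake_endpoint, "ex:s", "ex:p", supports_count=True))
print(count_objects(fake_endpoint, "ex:s", "ex:p", supports_count=False))
```

The client-side fallback transfers every matching row, which is why the server-side aggregate is preferable wherever it is available.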
  • 55.
    Challenges and Future Work • Linking is still challenging – URI patterns are not flexible – Search by label is time consuming – Services like DBpedia Lookup are scarce • Limitations of completely unsupervised linking – e.g., Hurricanes – how to use headlines/attribute names?
  • 56.
    Challenges and Future Work • Linking as optimization problem – find candidates for all entities, e.g., by DBpedia lookup – find a selection of candidates that are most similar to each other • e.g., all of them are U.S. cities – some experiments with types and categories • problem: not complete – some problems cannot be addressed (e.g.: Hurricanes) • Alternatives: – semi-supervised linking – user provides some example links – active learning
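One way to read "a selection of candidates that are most similar to each other" is sketched below: pick one candidate per entity so that the summed pairwise Jaccard similarity of their type sets is maximal. The similarity measure, the exhaustive search, and the candidate URIs and types are assumptions chosen for illustration, not the method used in the experiments mentioned on the slide.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coherence(selection):
    """Sum of pairwise type-set similarities over one candidate per entity."""
    return sum(jaccard(types_a, types_b)
               for (_, types_a), (_, types_b) in combinations(selection, 2))


def link_jointly(candidates):
    """Choose one candidate per label so that the selection is most coherent."""
    labels = list(candidates)
    # Exhaustive search over all combinations: fine for a toy, exponential in general.
    best = max(product(*(candidates[label] for label in labels)), key=coherence)
    return {label: uri for label, (uri, _) in zip(labels, best)}


# Hypothetical lookup results: (candidate URI, set of types) per label.
CANDIDATES = {
    "Paris": [("ex:Paris_France", {"City", "FrenchCity"}),
              ("ex:Paris_Texas", {"City", "USCity"})],
    "Austin": [("ex:Austin_Texas", {"City", "USCity"}),
               ("ex:Austin_given_name", {"GivenName"})],
    "Dallas": [("ex:Dallas_Texas", {"City", "USCity"}),
               ("ex:Dallas_TV_series", {"TelevisionShow"})],
}

print(link_jointly(CANDIDATES))
```

In this toy input the three U.S. cities win because their type sets are identical. The incompleteness problem named on the slide shows up directly: a correct candidate with missing type statements gets a low similarity and loses.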
  • 57.
    Challenges and Future Work • Exploiting semantics for feature selection • Given two features: – f1: type(RoadsInAlaska) – f2: type(Road) • and the schema definition RoadsInAlaska rdfs:subClassOf Road • Exploit that information for feature selection – e.g., gain(f1) ≈ gain(f2), f1 < f2 (f1 is more specific) → remove f1
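A minimal sketch of this idea, assuming that RoadsInAlaska is the more specific class (a subclass of Road), that "gain" means information gain, and that "≈" means a difference below a fixed tolerance. The function names, the tolerance value, and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def prune_specific(features, labels, subclass_of, tolerance=0.01):
    """Drop a more specific feature whose gain is about equal to its parent's."""
    gains = {name: info_gain(column, labels) for name, column in features.items()}
    kept = dict(features)
    for child, parent in subclass_of.items():
        if child in gains and parent in gains:
            if abs(gains[child] - gains[parent]) <= tolerance:
                kept.pop(child, None)  # the general feature carries the same signal
    return kept


# Toy data: four instances, boolean type features, and a class label.
LABELS = ["high", "high", "low", "low"]
FEATURES = {
    "type(RoadsInAlaska)": [1, 1, 0, 0],
    "type(Road)": [1, 1, 0, 0],
    "type(Place)": [1, 1, 1, 1],
}
# child -> parent, read from rdfs:subClassOf statements in the schema.
SUBCLASS_OF = {"type(RoadsInAlaska)": "type(Road)", "type(Road)": "type(Place)"}

print(sorted(prune_specific(FEATURES, LABELS, SUBCLASS_OF)))
```

Here type(RoadsInAlaska) is dropped because it has the same gain as type(Road), while type(Road) is kept because it is far more informative than type(Place).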
  • 58.
    Challenges and Future Work • Incompleteness of LOD – e.g., type information in DBpedia – may lead to findings such as • if a city is of type Place, the quality of living is high – possible remedy: autocomplete on the dataset (e.g., Paulheim/Bizer 2013) • Biases in LOD – e.g., DBpedia has a bias towards western culture – may lead to findings such as • if many records have been made in a city, the quality of living is high
  • 59.
    Challenges and Future Work • Features not used for scalability reasons: – features for single entities • e.g., “Roman Polanski directorOf X” – features more than one hop away • e.g., “Cities with a university which has a computer science department” – some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990” • but subject to YAGO's selection bias • Approaches are required to use such features – which respect scalability – “generate first, filter later” is not the best solution • e.g., “Cities with at least one of ArtSchoolsInParis” – on-the-fly filtering may be more suitable • e.g., sampling
  • 60.
    Challenges and Future Work • Automatically exploit data sources with non-simple structures, e.g. (real example from CORDIS dataset): EU18931 a Funding . EU18931 has-grant-value [ has-amount 1300000 . has-unit-of-measure EUR . ] • Support geo/temporal features – e.g., Data Cubes – e.g., Linked Geo Data • Construct complex features (in a scalable way!) – e.g., cinemas per inhabitant
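Two of these points can be illustrated with a small sketch: flattening a nested structure into path-named columns, and constructing a ratio feature from two counts. The nested dictionary mirrors the funding example on the slide. The function names, the dotted path convention, and the numbers in the ratio example are assumptions for illustration only.

```python
def flatten(node, prefix=""):
    """Turn a nested structure into flat columns named by their property path."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into the nested node
        else:
            flat[path] = value
    return flat


def ratio_feature(numerator, denominator):
    """Complex feature such as 'cinemas per inhabitant'; None if undefined."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


# The funding resource from the slide, written as a nested dictionary.
funding = {
    "type": "Funding",
    "has-grant-value": {"has-amount": 1300000, "has-unit-of-measure": "EUR"},
}

print(flatten(funding))
# Invented numbers: 12 cinemas in a city of 300000 inhabitants.
print(ratio_feature(12, 300000))
```

The hard part is not addressed here: deciding automatically which nested paths and which feature combinations are worth generating at all.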
  • 61.
    Wrap-up • Linked Data is useful as background knowledge – especially on problems which have little knowledge in themselves • Unsupervised methods – avoid biases and work without knowledge about LOD – but: scalability and generality problems • RapidMiner LOD extension – a constantly growing toolkit
  • 62.
    Credits & Thanks • Past contributors of FeGeLOD: – Johannes Fürnkranz – Raad Bahmani – Alexander Gabriel – Simon Holthausen • Current team of RapidMiner Linked Open Data Extension: – Chris Bizer – Petar Ristoski – Evgeny Mitichkin
  • 63.
    Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim