Exploiting Linked Open Data as Background Knowledge in Data Mining

Invited talk at Data Mining on Linked Data (DMoLD) workshop, co-located with ECML-PKDD 2013

  1. Exploiting Linked Open Data as Background Knowledge in Data Mining
     Heiko Paulheim, University of Mannheim, 10/08/13
  2. Outline
     • Motivation
     • The original FeGeLOD framework
     • Experiments
     • Applications
     • The RapidMiner Linked Open Data Extension
     • Challenges and Future Work
  3. Motivation: An Example Data Mining Task
     • Analyzing book sales

       ISBN           City       Sold
       3-2347-3427-1  Darmstadt  124
       3-43784-324-2  Mannheim   493
       3-145-34587-0  Roßdorf    14
       ...

     • The same table, enriched with background knowledge

       ISBN           City       Population  ...  Genre   Publisher     ...  Sold
       3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
       3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
       3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
       ...

     → Crime novels sell better in larger cities
  4. Motivation
     • Many data mining problems are solved better
       – when you have more background knowledge (leaving scalability aside)
     • Problems:
       – Tedious work
       – Selection bias: what to include?
  5. Motivation
     http://lod-cloud.net/
  6. Motivation
     • Idea:
       – reuse background knowledge from Linked Open Data
       – include it in the data mining process as needed
     • Two main variants:
       – develop mining/learning algorithms that run directly on Linked Data
       – create relational features from Linked Data
  7. Motivation
     • Develop mining/learning algorithms
       – e.g., DL Learner
       – e.g., dedicated kernel functions
     • Advantages:
       – can be quite efficient
       – no reduction to “flat” table structure
       – semantics can be respected directly
  8. Motivation
     • Create relational features
       – e.g., LiDDM
       – e.g., AutoSPARQL
       – e.g., FeGeLOD / RapidMiner Linked Open Data Extension
     • Advantages:
       – Easy combination of knowledge from various sources
         • including relational features in the original data
       – Arbitrary mining algorithms/tools possible
  9. FeGeLOD – Feature Generation from LOD
     [Figure: processing pipeline, shown for one example record]
     • Input record:
       – ISBN: 3-2347-3427-1
       – City: Darmstadt
       – # sold: 124
     • Named Entity Recognition adds:
       – City_URI: http://dbpedia.org/resource/Darmstadt
     • Feature Generation adds:
       – City_URI_dbpedia-owl:populationTotal: 141471
       – City_URI_...: ...
     • Feature Selection keeps:
       – City_URI_dbpedia-owl:populationTotal: 141471
  10. FeGeLOD – Feature Generation from LOD
     • Original prototype, based on Weka:
       – Simple NER (guessing URIs)
       – Seven generators:
         • direct types
         • data properties
         • unqualified relations (boolean, numeric)
         • qualified relations (boolean, numeric)
         • individuals (dangerous!), may be restricted to a specific property
       – Simple feature selection: filtering features
         • that have only* different values (except numerical features)
         • that have only* identical values
         • that are mostly missing*
     *) 95% or 99%
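The simple feature selection on slide 10 can be sketched in a few lines. This is a minimal illustration, not the FeGeLOD code: the function name `keep_feature`, the use of `None` for missing values, and the reading of "only*" as "at least the threshold share of all rows" are assumptions made for the example.

```python
from collections import Counter


def keep_feature(values, threshold=0.95, numeric=False):
    """Return True if a generated feature survives the three simple filters.

    `values` holds one entry per row; None marks a missing value.
    `threshold` plays the role of the 95% / 99% setting on the slide.
    """
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if v is not None]

    # Filter: the feature is mostly missing.
    if (n - len(present)) / n >= threshold:
        return False

    # Filter: (almost) all rows carry the same value.
    most_common_count = Counter(present).most_common(1)[0][1]
    if most_common_count / n >= threshold:
        return False

    # Filter: (almost) all rows carry a different value.
    # Numeric features are exempt, since distinct numbers are expected.
    if not numeric and len(set(present)) / n >= threshold:
        return False

    return True
```

With these assumptions, a nominal column of 100 distinct strings is dropped, while a numeric column of 100 distinct numbers is kept.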
  11. Experiments
     • Testing with two* standard machine learning data sets
       – Zoo: classifying animals
       – AAUP: predicting income of university employees (regression task)
     • Question: how much improvement do additional features bring?
     *) standard ML datasets with speaking labels are scarce!
  12. Experiments: Zoo Dataset
  13. First Results: AAUP
  14. Experiments: Early Insights
     • Additional features often improve the results
     • Zoo dataset:
       – Ripper: 89.11 to 96.04
       – SMO: 93.07 to 97.03
       – No improvement for Naive Bayes
     • AAUP dataset (compensation):
       – M5: 59.88 to 51.28
       – SMO: 74.12 to 61.97
       – No improvement for linear regression
     • ...but they may also cause problems
       – extreme example: 6.54 to 189.90 for linear regression
       – memory problems and timeouts due to large datasets
  15. Experiments: Quality of Features
     • Information gain of features on Zoo dataset
  16. Experiments: Quality of Features
     • Information gain of features on AAUP dataset (compensation)
  17. Application: Classifying Events from Wikipedia
     • Event Extraction from Wikipedia
     • Joint work with Dennis Wegener and Daniel Hienert (GESIS)
     • Task: event classification (e.g., Politics, Sports, ...)
     http://www.vizgr.org/historical-events/timeline/
  18. Application: Classifying Events from Wikipedia
     • Source Material:
     http://www.vizgr.org/historical-events/timeline/
  19. Application: Classifying Events from Wikipedia
     • Positive examples for class politics:
       – 2011, March 15 – German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants.
       – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel.
     • Negative examples for class politics:
       – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final.
       – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
  20. Application: Classifying Events from Wikipedia
     • Positive examples for class politics:
       – 2011, March 15 – German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants.
       – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel.
     • Negative examples for class politics:
       – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final.
       – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
     • Possible learned model:
       – "Angela Merkel" → Politics
  21. Application: Classifying Events from Wikipedia
     • Possible learned model:
       – "Angela Merkel" → Politics
     • How can we do better?
     • Background knowledge from Linked Open Data
       – 2011, March 15 – German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants.
       – 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD-Green coalition.
     • Model learned in that case:
       – "[class: Politician]" → Politics
  22. Application: Classifying Events from Wikipedia
     • Model learned in that case:
       – "[class: Politician]" → Politics
     • Much more general
       – Can also classify events with politicians not contained in the training set
     • Fewer training examples required
       – A few events with politicians, athletes, singers, ... are enough
  23. Application: Classifying Events from Wikipedia
     • Experiments on Wikipedia data
       – >10 categories
       – 1,000 labeled examples as training set
       – Classification accuracy: 80%
     • Plus:
       – We have trained a language-independent model!
         • often, models are like "elect*" → Politics
       – 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
       – 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician].
  24. Application: Classifying Tweets
     • Joint work with Axel Schulz and Petar Ristoski (SAP Research)
     • Goal: using Twitter for emergency management
     • Example tweets:
       – fire at #mannheim #university
       – omg two cars on fire #A5 #accident
       – fire at train station still burning
       – my heart is on fire!!!
       – come on baby light my fire
       – boss should fire that stupid moron
  25. Application: Classifying Tweets
     • Social media contains data on many incidents
       – But keyword search is not enough
       – Detecting small incidents is hard
       – Manual inspection is too expensive (and slow)
     • Machine learning could help
       – Train a model to classify incident/non-incident tweets
       – Apply the model for detecting incident-related tweets
     • Training data:
       – Traffic accidents
       – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand labeled (50% related to traffic incidents)
  26. Application: Classifying Tweets
     • Learning to classify tweets:
       – Positive and negative examples
       – Features:
         • Stemming
         • POS tagging
         • Word n-grams
         • …
     • Accuracy ~90%
     • But
       – Accuracy drops to ~85% when applying the model to a different city
  27. Application: Classifying Tweets
     • Example set:
       – “Again crash on I90”
       – “Accident on I90”
     • Model:
       – “I90” → indicates traffic accident
     • Applying the model:
       – “Two cars crashed on I51” → not related to traffic accident
  28. Using LOD for Preventing Overfitting
     • Example set:
       – “Again crash on I90”
       – “Accident on I90”
     • Background knowledge:
       – dbpedia:Interstate_90 rdf:type dbpedia-owl:Road
       – dbpedia:Interstate_51 rdf:type dbpedia-owl:Road
     • Model:
       – dbpedia-owl:Road → indicates traffic accident
     • Applying the model:
       – “Two cars crashed on I51” → indicates traffic accident
     • Using DBpedia Spotlight + FeGeLOD
       – Accuracy keeps up at 90%
       – Overfitting is avoided
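The effect described on slides 27 and 28 can be reproduced with a toy example. This is a sketch only: the hand-written `ENTITY_TYPES` table stands in for an entity linker such as DBpedia Spotlight plus a type lookup, the "model" is simply the set of features shared by all positive examples, and the helper names and the one-word stop list are assumptions made for the illustration.

```python
# Hypothetical stand-in for entity linking plus an rdf:type lookup.
ENTITY_TYPES = {
    "I90": "dbpedia-owl:Road",
    "I51": "dbpedia-owl:Road",
}

# Tiny stop list so that function words do not end up in the toy model.
STOPWORDS = {"on"}


def features(text, entity_types):
    """Tokenize a tweet and replace known entities by their type."""
    return {
        entity_types.get(token, token.lower())
        for token in text.split()
        if token.lower() not in STOPWORDS
    }


def learn_indicators(positive_examples, entity_types):
    """Toy 'model': the features that occur in every positive example."""
    return set.intersection(*(features(t, entity_types) for t in positive_examples))


def is_incident(text, indicators, entity_types):
    """A tweet is flagged if it shares at least one learned indicator."""
    return bool(indicators & features(text, entity_types))


train = ["Again crash on I90", "Accident on I90"]
unseen = "Two cars crashed on I51"

plain_model = learn_indicators(train, {})            # learns the token "i90"
typed_model = learn_indicators(train, ENTITY_TYPES)  # learns dbpedia-owl:Road
```

Without type information the learned indicator is the literal road name, so the unseen tweet about I51 is missed; with types, both roads map to the same feature and the tweet is flagged.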
  29. Explaining Statistics
     • Statistics are very widespread
       – Quality of living in cities
       – Corruption by country
       – Fertility rate by country
       – Suicide rate by country
       – Box office revenue of films
       – ...
  30. Explaining Statistics
     • Questions we are often interested in
       – Why does city X have a high/low quality of living?
       – Why is the corruption higher in country A than in country B?
       – Will a new film create a high/low box office revenue?
     • i.e., we are looking for
       – explanations
       – forecasts (e.g., extrapolations)
  31. Explaining Statistics
     http://xkcd.com/605/
  32. Explaining Statistics
     • What statistics often look like
  33. Explaining Statistics
     • There are powerful tools for finding correlations etc.
       – but many statistics cannot be interpreted directly
       – background knowledge is missing
     • Approach:
       – use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
       – run analysis tools for finding explanations
  34. Prototype Tool: Explain-a-LOD
     • Loads a statistics file (e.g., CSV)
     • Adds background knowledge
     • Runs basic analysis (correlation, rule learning)
     • Presents explanations
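The four steps on slide 34 can be illustrated with a small correlation-only sketch. It is not the Explain-a-LOD implementation: the function names, the column name `entity`, and the `background` dictionary (standing in for features fetched from Linked Open Data) are assumptions, and the numbers are invented toy values rather than real statistics.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_features(csv_text, target, background):
    """Load a statistics file, join background features, rank by |correlation|."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ys = [float(row[target]) for row in rows]
    names = sorted({name for feats in background.values() for name in feats})
    scores = {}
    for name in names:
        xs = [float(background[row["entity"]][name]) for row in rows]
        scores[name] = pearson(xs, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Invented toy input: four entities with a score and two background features.
STATISTICS_CSV = "entity,score\nA,1\nB,2\nC,3\nD,4\n"
BACKGROUND = {
    "A": {"f_up": 10, "f_flat": 5},
    "B": {"f_up": 20, "f_flat": 5},
    "C": {"f_up": 30, "f_flat": 5},
    "D": {"f_up": 40, "f_flat": 5},
}

ranking = rank_features(STATISTICS_CSV, "score", BACKGROUND)
```

The top-ranked features are the candidates that a tool of this kind would then present as possible explanations; as slide 50 warns, a high correlation alone does not establish a cause.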
  35. Statistical Data: Examples
     • Data Set: Mercer Quality of Living
       – Quality of living in 216 cities worldwide
       – norm: NYC=100 (value range 23-109)
       – As of 1999
       – http://across.co.nz/qualityofliving.htm
     • LOD data sets used in the examples:
       – DBpedia
       – CIA World Factbook for statistics by country
  36. Statistical Data: Examples
     • Examples for low quality cities
       – big hot cities (junHighC >= 27 and areaTotalKm >= 334)
       – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16)
       – latitude <= 24 and longitude <= 47
         • a very accurate rule
         • but what's the interpretation?
     [Figure: “Next Record Studio 2547 miles”]
  37. Statistical Data: Examples
  38. Statistical Data: Examples
     • Data Set: Transparency International
       – 177 countries and a corruption perception indicator (between 1 and 10)
       – As of 2010
       – http://www.transparency.org/cpi2010/results
  39. Statistical Data: Examples
     • Example rules for countries with low corruption
       – HDI > 78%
         • Human Development Index, calculated from life expectancy, education level, economic performance
       – OECD member states
       – Foundation place of more than nine organizations
       – More than ten mountains
       – More than ten companies with their headquarters in that state, but fewer than two cargo airlines
  40. Statistical Data: Examples
     • Data Set: Burnout rates
       – 16 German DAX companies
       – Absolute and relative numbers
       – As of 2011
       – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  41. Evaluation of Feature Quality
     • Quality of living dataset
     [Figure: chart comparing Correlation and Rule Learning on a scale from 1 to 5 for each feature group: data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint]
  42. Evaluation of Feature Quality
     • Corruption dataset
     [Figure: chart comparing Correlation and Rule Learning on a scale from 1 to 5 for each feature group: data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint]
  43. Statistical Data: Examples
     • Findings for burnout rates
       – Positive correlation between turnover and burnout rates
       – Car manufacturers are less prone to burnout
       – German companies are less prone to burnout than international ones
         • Exception: Frankfurt
  44. Statistical Data: Examples
     • Data Set: Antidepressant consumption
       – In European countries
       – Source: OECD
       – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  45. Statistical Data: Examples
     • Findings for antidepressant consumption
       – Larger countries have higher consumption
       – Low HDI → high consumption
       – By geography:
         • Nordic countries, countries at the Atlantic: high
         • Mediterranean: medium
         • Alpine countries: low
       – High average age → high consumption
       – High birth rates → high consumption
  46. Statistical Data: Examples
     • Data Set: Suicide rates
       – By country
       – OECD states
       – As of 2005
       – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  47. Statistical Data: Examples
     • Findings for suicide rates
       – Democracies have lower suicide rates than other forms of government
       – High HDI → low suicide rate
       – High population density → high suicide rate
       – By geography:
         • At the sea → low
         • In the mountains → high
       – High Gini index → low suicide rate
         • High Gini index ↔ unequal distribution of wealth
       – High usage of nuclear power → high suicide rates
  48. Statistical Data: Examples
     • Data set: sexual activity
       – Percentage of people having sex weekly
       – By country
       – Survey by Durex 2005-2009
       – http://chartsbin.com/view/uya
  49. Statistical Data: Examples
     • Findings on sexual activity
       – By geography:
         • High in Europe, low in Asia
         • Low in island states
       – By language:
         • English speaking: low
         • French speaking: high
       – Low average age → high activity
       – High GDP per capita → low activity
       – High unemployment rate → high activity
       – High number of Internet service providers → low activity
  50. Try it... but be careful!
     • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod
     • including a demo video, papers, etc.
     http://xkcd.com/552/
  51. RapidMiner Linked Open Data Extension
     • August 16th, 2013: FeGeLOD celebrates its 2nd birthday
     • Problems
       – still no nice UI
       – special configurations are tricky
       – difficult to enhance
     • Decision
       – Reimplementation on RapidMiner platform
       – September 13th, 2013: Release of RapidMiner Linked Open Data Extension
       – Available from RapidMiner marketplace
         • http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
  52. RapidMiner Linked Open Data Extension
     • Simple wiring of operators
       – linkers
       – generators
     • Combination with powerful RapidMiner operators
  53. RapidMiner Linked Open Data Extension
     • Easy SPARQL endpoint definitions
     • Support of custom SPARQL statements
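To give an idea of what a custom SPARQL statement for feature generation might look like, the sketch below builds a query for a single data property, using the `dbpedia-owl:populationTotal` feature from the pipeline example on slide 9. It only constructs the query text and does not use or imitate the RapidMiner extension's API; the function name `data_property_query` and the variable name `?value` are assumptions.

```python
def data_property_query(entity_uri, prop="dbpedia-owl:populationTotal"):
    """Build a SPARQL SELECT query for one data property of one entity."""
    # Reject characters that would break out of the <...> URI brackets.
    if any(ch in entity_uri for ch in '<> "'):
        raise ValueError("not a plain URI: %r" % entity_uri)
    return (
        "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
        f"SELECT ?value WHERE {{ <{entity_uri}> {prop} ?value }}"
    )


query = data_property_query("http://dbpedia.org/resource/Darmstadt")
```

The resulting string could be sent to any SPARQL endpoint that hosts the data; which endpoint to use and how to handle its non-standard extensions is the topic of slide 54.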
  54. Challenges and Future Work
     • SPARQL variants
       – Some endpoints support special/non-standard SPARQL constructs
       – COUNT(...)
       – transitive closure
       – exploit where applicable
     • Implementations without SPARQL
       – Freebase
       – OpenCyc
  55. Challenges and Future Work
     • Linking is still challenging
       – URI patterns are not flexible
       – Search by label is time consuming
       – Services like DBpedia Lookup are scarce
     • Limitations of completely unsupervised linking
       – e.g., Hurricanes
       – how to use headlines/attribute names?
  56. Challenges and Future Work
     • Linking as optimization problem
       – find candidates for all entities, e.g., by DBpedia Lookup
       – find a selection of candidates that are most similar to each other
         • e.g., all of them are U.S. cities
       – some experiments with types and categories
         • problem: not complete
       – some problems cannot be addressed (e.g., Hurricanes)
     • Alternatives:
       – semi-supervised linking
       – user provides some example links
       – active learning
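The idea on slide 56 of choosing candidates that are similar to each other can be sketched with a greedy simplification: pick the type that covers the most labels, then link each label to its first candidate of that type. This is an illustrative heuristic, not the method from the talk; the function name, the candidate lists, and the `ex:` identifiers are hypothetical.

```python
from collections import Counter


def link_by_common_type(candidates):
    """Pick one candidate per label so that all picks share one type.

    `candidates` maps a label to a list of (uri, set_of_types) pairs,
    e.g. as returned by a lookup service.
    """
    # For each type, count how many labels have a candidate of that type.
    coverage = Counter()
    for cands in candidates.values():
        coverage.update(set().union(*(types for _, types in cands)))
    if not coverage:
        return {}
    # Ties are broken alphabetically to keep the result deterministic.
    best_type = max(sorted(coverage), key=lambda t: coverage[t])

    links = {}
    for label, cands in candidates.items():
        for uri, types in cands:
            if best_type in types:
                links[label] = uri
                break
    return links


# Hypothetical candidate lists; the top-ranked hit for "Phoenix" is wrong.
CANDIDATES = {
    "Phoenix": [("ex:Phoenix_(mythology)", {"MythologicalFigure"}),
                ("ex:Phoenix,_Arizona", {"City"})],
    "Austin": [("ex:Austin,_Texas", {"City"})],
    "Boston": [("ex:Boston", {"City"})],
}

links = link_by_common_type(CANDIDATES)
```

Labels without any candidate of the chosen type stay unlinked, which mirrors the incompleteness problem noted on the slide.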
  57. Challenges and Future Work
     • Exploiting semantics for feature selection
     • Given two features:
       – f1: type(RoadsInAlaska)
       – f2: type(Road)
     • and the schema definition RoadsInAlaska rdfs:subClassOf Road
     • Exploit that information for feature selection
       – e.g., gain(f1) ≈ gain(f2), f1 < f2 (f1 is the more specific feature) → remove f1
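The rule on slide 57 can be written down as a short sketch. Information gain is computed in the standard way; the function names, the `tolerance` parameter for "≈", and the representation of the schema as (subclass feature, superclass feature) pairs are assumptions for the illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def prune_subclass_features(features, labels, subclass_of, tolerance=0.05):
    """Drop a subclass feature whose gain is about equal to its superclass's."""
    gains = {name: information_gain(vals, labels) for name, vals in features.items()}
    kept = dict(features)
    for sub, sup in subclass_of:
        if sub in kept and sup in kept and abs(gains[sub] - gains[sup]) <= tolerance:
            del kept[sub]  # keep the more general feature
    return kept


LABELS = ["acc", "acc", "acc", "no", "no", "no"]
SCHEMA = [("type(RoadsInAlaska)", "type(Road)")]
```

Keeping the more general feature is what the slide suggests; a feature such as `type(Road)` also applies to entities outside the training data, in line with the overfitting argument on slide 28.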
  58. Challenges and Future Work
     • Incompleteness of LOD
       – e.g., type information in DBpedia
       – may lead to findings such as
         • if a city is of type Place, the quality of living is high
       – possible remedy: autocomplete on the dataset (e.g., Paulheim/Bizer 2013)
     • Biases in LOD
       – e.g., DBpedia has a bias towards western culture
       – may lead to findings such as
         • if many records have been made in a city, the quality of living is high
  59. Challenges and Future Work
     • Features not used for scalability reasons:
       – features for single entities
         • e.g., “Roman Polanski directorOf X”
       – features more than one hop away
         • e.g., “Cities with a university which has a computer science department”
       – some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
         • but subject to YAGO's selection bias
     • Approaches are required to use such features
       – which respect scalability
       – “generate first, filter later” is not the best solution
         • e.g., “Cities with at least one of ArtSchoolsInParis”
       – on-the-fly filtering may be more suitable
         • e.g., sampling
  60. Challenges and Future Work
     • Automatically exploit data sources with non-simple structures
       – real example from CORDIS dataset:

         EU18931 a Funding .
         EU18931 has-grant-value [
           has-amount 1300000 .
           has-unit-of-measure EUR .
         ]

     • Support geo/temporal features
       – e.g., Data Cubes
       – e.g., Linked Geo Data
     • Construct complex features (in a scalable way!)
       – e.g., cinemas per inhabitant
  61. Wrap-up
     • Linked Data is useful as background knowledge
       – especially on problems which have little knowledge in themselves
     • Unsupervised methods
       – avoid biases and work without knowledge about LOD
       – but: scalability and generality problems
     • RapidMiner LOD extension
       – a constantly growing toolkit
  62. Credits & Thanks
     • Past contributors of FeGeLOD:
       – Johannes Fürnkranz
       – Raad Bahmani
       – Alexander Gabriel
       – Simon Holthausen
     • Current team of RapidMiner Linked Open Data Extension:
       – Chris Bizer
       – Petar Ristoski
       – Evgeny Mitichkin
  63. Exploiting Linked Open Data as Background Knowledge in Data Mining
     Heiko Paulheim, University of Mannheim
