
Exploiting Linked Open Data as Background Knowledge in Data Mining

Invited talk at Data Mining on Linked Data (DMoLD) workshop, co-located with ECML-PKDD 2013


  1. 1. 10/08/13 Heiko Paulheim 1 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim
  2. 2. 10/08/13 Heiko Paulheim 2 Outline • Motivation • The original FeGeLOD framework • Experiments • Applications • The RapidMiner Linked Open Data Extension • Challenges and Future Work
  3. 3. 10/08/13 Heiko Paulheim 3 Motivation: An Example Data Mining Task • Analyzing book sales
     ISBN            City        Sold
     3-2347-3427-1   Darmstadt   124
     3-43784-324-2   Mannheim    493
     3-145-34587-0   Roßdorf     14
     ...

     ISBN            City        Population   ...   Genre    Publisher      ...   Sold
     3-2347-3427-1   Darmstadt   144402       ...   Crime    Bloody Books   ...   124
     3-43784-324-2   Mannheim    291458       ...   Crime    Guns Ltd.      ...   493
     3-145-34587-0   Roßdorf     12019        ...   Travel   Up&Away        ...   14
     ...
     → Crime novels sell better in larger cities
  4. 4. 10/08/13 Heiko Paulheim 4 Motivation • Many data mining problems are solved better – when you have more background knowledge (leaving scalability aside) • Problems: – Tedious work – Selection bias: what to include?
  5. 5. 10/08/13 Heiko Paulheim 5 Motivation http://lod-cloud.net/
  6. 6. 10/08/13 Heiko Paulheim 6 Motivation • Idea: – reuse background knowledge from Linked Open Data – include it in the data mining process as needed • Two main variants: – develop mining/learning algorithms that run directly on Linked Data – create relational features from Linked Data
  7. 7. 10/08/13 Heiko Paulheim 7 Motivation • Develop mining/learning algorithms – e.g., DL Learner – e.g., dedicated Kernel functions • Advantages: – can be quite efficient – no reduction to “flat” table structure – semantics can be respected directly
  8. 8. 10/08/13 Heiko Paulheim 8 Motivation • Create relational features – e.g., LiDDM – e.g., AutoSPARQL – e.g., FeGeLOD / RapidMiner Linked Open Data Extension • Advantages: – Easy combination of knowledge from various sources • including relational features in the original data – Arbitrary mining algorithms/tools possible
  9. 9. 10/08/13 Heiko Paulheim 9 FeGeLOD – Feature Generation from LOD
     ISBN 3-2347-3427-1, City Darmstadt, # sold 124
     → Named Entity Recognition
     ISBN 3-2347-3427-1, City Darmstadt, # sold 124, City_URI http://dbpedia.org/resource/Darmstadt
     → Feature Generation
     ISBN 3-2347-3427-1, City Darmstadt, # sold 124, City_URI http://dbpedia.org/resource/Darmstadt, City_URI_dbpedia-owl:populationTotal 141471, City_URI_... ...
     → Feature Selection
     ISBN 3-2347-3427-1, City Darmstadt, # sold 124, City_URI http://dbpedia.org/resource/Darmstadt, City_URI_dbpedia-owl:populationTotal 141471
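The Feature Generation step in the pipeline above can be sketched as follows. This is a minimal illustration, not necessarily the queries FeGeLOD issues: the function names `data_properties_query` and `to_features` are hypothetical, and the code only builds a standard SPARQL query string and names the resulting features; it does not contact an endpoint.

```python
# Sketch only: builds a SPARQL query string and feature names; no endpoint is contacted.
# Function names are hypothetical and not taken from FeGeLOD.

def data_properties_query(entity_uri: str) -> str:
    """Return a SPARQL query for all literal-valued properties of one entity."""
    return (
        f"SELECT ?property ?value WHERE {{ <{entity_uri}> ?property ?value . "
        "FILTER(isLiteral(?value)) }"
    )


def to_features(column: str, bindings):
    """Prefix each property with the source column name, as in the slide's
    City_URI_dbpedia-owl:populationTotal example."""
    return {f"{column}_{prop}": value for prop, value in bindings}


query = data_properties_query("http://dbpedia.org/resource/Darmstadt")
features = to_features("City_URI", [("dbpedia-owl:populationTotal", 141471)])
```

The bindings passed to `to_features` are hand-written here; in practice they would come from whatever SPARQL client is used to run the query.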
  10. 10. 10/08/13 Heiko Paulheim 10 FeGeLOD – Feature Generation from LOD • Original prototype, based on Weka: – Simple NER (guessing URIs) – Seven generators: • direct types • data properties • unqualified relations (boolean, numeric) • qualified relations (boolean, numeric) • individuals (dangerous!) - may be restricted to specific property – Simple feature selection: filtering features • that have only* different values (except numerical) • that have only* identical values • that are mostly missing* *) 95% or 99%
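The simple feature selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Weka-based prototype: the function name `select_features` is hypothetical, "only different values" is read as "the share of distinct values reaches the threshold", and "only identical values" as "the most frequent value reaches the threshold".

```python
from collections import Counter


def select_features(rows, threshold=0.95, numeric=()):
    """Filter generated features.

    rows: list of dicts mapping feature name -> value (None or absent = missing).
    numeric: names of numeric features, which are exempt from the
             "only different values" filter.
    Returns the sorted list of feature names that survive all three filters.
    """
    n = len(rows)
    keep = []
    for name in sorted({name for row in rows for name in row}):
        present = [row[name] for row in rows if row.get(name) is not None]
        if 1 - len(present) / n >= threshold:
            continue  # mostly missing
        counts = Counter(present)
        if counts.most_common(1)[0][1] / len(present) >= threshold:
            continue  # (almost) only identical values
        if name not in numeric and len(counts) / len(present) >= threshold:
            continue  # (almost) only different values, and not numeric
        keep.append(name)
    return keep


# Hand-made example data: 20 rows with five generated features.
rows = [
    {
        "id": f"e{i}",                      # unique per row: dropped
        "const": 1,                         # identical everywhere: dropped
        "population": 1000 + i,             # unique but numeric: kept
        "type_City": i % 2 == 0,            # informative boolean: kept
        "rare": "x" if i == 0 else None,    # missing in 95% of rows: dropped
    }
    for i in range(20)
]
selected = select_features(rows, threshold=0.95, numeric={"population"})
```

Raising the threshold to 0.99 makes the filter more permissive, which matches the two settings named on the slide.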
  11. 11. 10/08/13 Heiko Paulheim 11 Experiments • Testing with two* standard machine learning data sets – Zoo: classifying animals – AAUP: predicting income of university employees (regression task) • Question: how much improvement do additional features bring? *) standard ML datasets with speaking (i.e., human-readable) labels are scarce!
  12. 12. 10/08/13 Heiko Paulheim 12 Experiments: Zoo Dataset
  13. 13. 10/08/13 Heiko Paulheim 13 First Results: AAUP
  14. 14. 10/08/13 Heiko Paulheim 14 Experiments: Early Insights • Additional features often improve the results • Zoo dataset: – Ripper: 89.11 to 96.04 – SMO: 93.07 to 97.03 – No improvement for Naive Bayes • AAUP dataset (compensation): – M5: 59.88 to 51.28 – SMO: 74.12 to 61.97 – No improvement for linear regression • ...but they may also cause problems – extreme example: 6.54 to 189.90 for linear regression – memory and timeouts due to large datasets
  15. 15. 10/08/13 Heiko Paulheim 15 Experiments: Quality of Features • Information gain of features on Zoo dataset
  16. 16. 10/08/13 Heiko Paulheim 16 Experiments: Quality of Features • Information gain of features on AAUP dataset (compensation)
  17. 17. 10/08/13 Heiko Paulheim 17 Application: Classifying Events from Wikipedia • Event Extraction from Wikipedia • Joint work with Dennis Wegener and Daniel Hienert (GESIS) • Task: event classification (e.g., Politics, Sports, ...) http://www.vizgr.org/historical-events/timeline/
  18. 18. 10/08/13 Heiko Paulheim 18 Application: Classifying Events from Wikipedia • Source Material: http://www.vizgr.org/historical-events/timeline/
  19. 19. 10/08/13 Heiko Paulheim 19 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
  20. 20. 10/08/13 Heiko Paulheim 20 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest. • Possible learned model: – "Angela Merkel" → Politics
  21. 21. 10/08/13 Heiko Paulheim 21 Application: Classifying Events from Wikipedia • Possibly Learned Model: – "Angela Merkel" → Politics • How can we do better? • Background knowledge from Linked Open Data – 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants. – 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD-Green coalition. • Model learned in that case: – "[class: Politician]" → Politics
  22. 22. 10/08/13 Heiko Paulheim 22 Application: Classifying Events from Wikipedia • Model learned in that case: – "[class: Politician]" → Politics • Much more general – Can also classify events with politicians not contained in the training set • Fewer training examples required – A few events with politicians, athletes, singers, ... are enough
  23. 23. 10/08/13 Heiko Paulheim 23 Application: Classifying Events from Wikipedia • Experiments on Wikipedia data – >10 categories – 1,000 labeled examples as training set – Classification accuracy: 80% • Plus: – We have trained a language-independent model! • often, models are like "elect*" → Politics – 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt. – 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician].
  24. 24. 10/08/13 Heiko Paulheim 24 Application: Classifying Tweets • Joint work with Axel Schulz and Petar Ristoski (SAP Research) • Goal: using Twitter for emergency management • Example tweets: – fire at #mannheim #university – omg two cars on fire #A5 #accident – fire at train station still burning – my heart is on fire!!! – come on baby light my fire – boss should fire that stupid moron
  25. 25. 10/08/13 Heiko Paulheim 25 Application: Classifying Tweets • Social media contains data on many incidents – But keyword search is not enough – Detecting small incidents is hard – Manual inspection is too expensive (and slow) • Machine learning could help – Train a model to classify incident/non-incident tweets – Apply model for detecting incident-related tweets • Training data: – Traffic accidents – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand-labeled (50% related to traffic incidents)
  26. 26. 10/08/13 Heiko Paulheim 26 Application: Classifying Tweets • Learning to classify tweets: – Positive and negative examples – Features: • Stemming • POS tagging • Word n-grams • … • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city
  27. 27. 10/08/13 Heiko Paulheim 27 Application: Classifying Tweets • Example set: – “Again crash on I90” – “Accident on I90” • Model: – “I90” → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → not related to traffic accident
  28. 28. 10/08/13 Heiko Paulheim 28 Using LOD for Preventing Overfitting • Example set: – “Again crash on I90” – “Accident on I90” • Background knowledge: – dbpedia:Interstate_90 rdf:type dbpedia-owl:Road – dbpedia:Interstate_51 rdf:type dbpedia-owl:Road • Model: – dbpedia-owl:Road → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → indicates traffic accident • Using DBpedia Spotlight + FeGeLOD – Accuracy keeps up at 90% – Overfitting is avoided
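A toy illustration of why type features generalize where surface tokens do not. It is a sketch under stated assumptions: the `TYPES` table is a hand-written stand-in for the entity linking and type lookup that the slide attributes to DBpedia Spotlight and FeGeLOD, and the function names are hypothetical.

```python
# Hand-written stand-in for entity linking plus an rdf:type lookup (assumption).
TYPES = {
    "I90": "dbpedia-owl:Road",
    "I51": "dbpedia-owl:Road",
}

TRAIN = ["Again crash on I90", "Accident on I90"]


def extract_features(tweet, with_types):
    """Bag of tokens, optionally extended with the types of linked tokens."""
    tokens = set(tweet.split())
    if with_types:
        tokens |= {TYPES[t] for t in tokens if t in TYPES}
    return tokens


def shared_with_training(tweet, with_types):
    """Features common to all training tweets that also occur in the new tweet."""
    common = set.intersection(*(extract_features(t, with_types) for t in TRAIN))
    return common & extract_features(tweet, with_types)


new_tweet = "Two cars crashed on I51"
without_lod = shared_with_training(new_tweet, with_types=False)
with_lod = shared_with_training(new_tweet, with_types=True)
```

Without types, the only feature the unseen tweet shares with the training examples is the uninformative token "on"; with types, it also shares `dbpedia-owl:Road`, which is the feature the slide's model relies on.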
  29. 29. 10/08/13 Heiko Paulheim 29 Explaining Statistics • Statistics are widespread – Quality of living in cities – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  30. 30. 10/08/13 Heiko Paulheim 30 Explaining Statistics • Questions we are often interested in – Why does city X have a high/low quality of living? – Why is the corruption higher in country A than in country B? – Will a new film create a high/low box office revenue? • i.e., we are looking for – explanations – forecasts (e.g., extrapolations)
  31. 31. 10/08/13 Heiko Paulheim 31 Explaining Statistics http://xkcd.com/605/
  32. 32. 10/08/13 Heiko Paulheim 32 Explaining Statistics • What statistics often look like
  33. 33. 10/08/13 Heiko Paulheim 33 Explaining Statistics • There are powerful tools for finding correlations etc. – but many statistics cannot be interpreted directly – background knowledge is missing • Approach: – use Linked Open Data for enriching statistical data (e.g., FeGeLOD) – run analysis tools for finding explanations
  34. 34. 10/08/13 Heiko Paulheim 34 Prototype Tool: Explain-a-LOD • Loads a statistics file (e.g., CSV) • Adds background knowledge • Runs basic analysis (correlation, rule learning) • Presents explanations
  35. 35. 10/08/13 Heiko Paulheim 35 Statistical Data: Examples • Data Set: Mercer Quality of Living – Quality of living in 216 cities worldwide – norm: NYC=100 (value range 23-109) – As of 1999 – http://across.co.nz/qualityofliving.htm • LOD data sets used in the examples: – DBpedia – CIA World Factbook for statistics by country
  36. 36. 10/08/13 Heiko Paulheim 36 Statistical Data: Examples • Examples for low quality cities – big hot cities (junHighC >= 27 and areaTotalKm >= 334) – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16) – latitude <= 24 and longitude <= 47 • a very accurate rule • but what's the interpretation? [Image: “Next Record Studio 2547 miles”]
  37. 37. 10/08/13 Heiko Paulheim 37 Statistical Data: Examples
  38. 38. 10/08/13 Heiko Paulheim 38 Statistical Data: Examples • Data Set: Transparency International – 177 Countries and a corruption perception indicator (between 1 and 10) – As of 2010 – http://www.transparency.org/cpi2010/results
  39. 39. 10/08/13 Heiko Paulheim 39 Statistical Data: Examples • Example rules for countries with low corruption – HDI > 78% • Human Development Index, calculated from life expectancy, education level, economic performance – OECD member states – Foundation place of more than nine organizations – More than ten mountains – More than ten companies with their headquarters in that state, but fewer than two cargo airlines
  40. 40. 10/08/13 Heiko Paulheim 40 Statistical Data: Examples • Data Set: Burnout rates – 16 German DAX companies – Absolute and relative numbers – As of 2011 – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  41. 41. 10/08/13 Heiko Paulheim 41 Evaluation of Feature Quality • Quality of living dataset [Chart: series Correlation and Rule Learning; axis from 1 to 5; categories: Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint]
  42. 42. 10/08/13 Heiko Paulheim 42 Evaluation of Feature Quality • Corruption dataset [Chart: series Correlation and Rule Learning; axis from 1 to 5; categories: Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint]
  43. 43. 10/08/13 Heiko Paulheim 43 Statistical Data: Examples • Findings for burnout rates – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – German companies are less prone to burnout than international ones • Exception: Frankfurt
  44. 44. 10/08/13 Heiko Paulheim 44 Statistical Data: Examples • Data Set: Antidepressant consumption – In European countries – Source: OECD – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  45. 45. 10/08/13 Heiko Paulheim 45 Statistical Data: Examples • Findings for antidepressant consumption – Larger countries have higher consumption – Low HDI → high consumption – By geography: • Nordic countries, countries at the Atlantic: high • Mediterranean: medium • Alpine countries: low – High average age → high consumption – High birth rates → high consumption
  46. 46. 10/08/13 Heiko Paulheim 46 Statistical Data: Examples • Data Set: Suicide rates – By country – OECD states – As of 2005 – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  47. 47. 10/08/13 Heiko Paulheim 47 Statistical Data: Examples • Findings for suicide rates – Democracies have lower suicide rates than other forms of government – High HDI → low suicide rate – High population density → high suicide rate – By geography: • At the sea → low • In the mountains → high – High Gini index → low suicide rate • High Gini index ↔ unequal distribution of wealth – High usage of nuclear power → high suicide rates
  48. 48. 10/08/13 Heiko Paulheim 48 Statistical Data: Examples • Data set: sexual activity – Percentage of people having sex weekly – By country – Survey by Durex 2005-2009 – http://chartsbin.com/view/uya
  49. 49. 10/08/13 Heiko Paulheim 49 Statistical Data: Examples • Findings on sexual activity – By geography: • High in Europe, low in Asia • Low in island states – By language: • English speaking: low • French speaking: high – Low average age → high activity – High GDP per capita → low activity – High unemployment rate → high activity – High number of ISPs → low activity
  50. 50. 10/08/13 Heiko Paulheim 50 Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. http://xkcd.com/552/
  51. 51. 10/08/13 Heiko Paulheim 51 RapidMiner Linked Open Data Extension • August 16th, 2013: FeGeLOD celebrates its 2nd birthday • Problems – still no nice UI – special configurations are tricky – difficult to enhance • Decision – Reimplementation on RapidMiner platform – September 13th, 2013: Release of RapidMiner Linked Open Data Extension – Available from RapidMiner marketplace • http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
  52. 52. 10/08/13 Heiko Paulheim 52 RapidMiner Linked Open Data Extension • Simple wiring of operators – linkers – generators • Combination with powerful RapidMiner operators
  53. 53. 10/08/13 Heiko Paulheim 53 RapidMiner Linked Open Data Extension • Easy SPARQL endpoint definitions • Support of custom SPARQL statements
  54. 54. 10/08/13 Heiko Paulheim 54 Challenges and Future Work • SPARQL variants – Some endpoints support special/non-standard SPARQL constructs – COUNT(...) – transitive closure – exploit where applicable • Implementations without SPARQL – Freebase – OpenCyc
  55. 55. 10/08/13 Heiko Paulheim 55 Challenges and Future Work • Linking is still challenging – URI patterns are not flexible – Search by label is time consuming – Services like DBpedia Lookup are scarce • Limitations of completely unsupervised linking – e.g., Hurricanes – how to use headlines/attribute names?
  56. 56. 10/08/13 Heiko Paulheim 56 Challenges and Future Work • Linking as optimization problem – find candidates for all entities, e.g., by DBpedia lookup – find a selection of candidates that are most similar to each other • e.g., all of them are U.S. cities – some experiments with types and categories • problem: not complete – some problems cannot be addressed (e.g.: Hurricanes) • Alternatives: – semi-supervised linking – user provides some example links – active learning
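Linking as an optimization problem can be sketched as picking one candidate per label so that the chosen candidates are as similar to each other as possible. The sketch below is an assumption-laden illustration, not the author's method: it uses Jaccard overlap of type sets as the similarity, a brute-force search, and hand-written candidates with made-up `ex:` identifiers in place of a real lookup service.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Overlap of two type sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_jointly(candidates):
    """Choose one candidate per label, maximizing summed pairwise type similarity.

    candidates: dict label -> list of (uri, set_of_types).
    Returns dict label -> chosen uri (empty if some label has no candidates).
    """
    labels = list(candidates)
    best, best_score = None, -1.0
    for choice in product(*(candidates[label] for label in labels)):
        score = sum(jaccard(a[1], b[1]) for a, b in combinations(choice, 2))
        if score > best_score:
            best, best_score = choice, score
    if best is None:
        return {}
    return {label: uri for label, (uri, _) in zip(labels, best)}


# Hypothetical candidates; the "ex:" identifiers and type names are invented.
candidates = {
    "Paris": [("ex:Paris_France", {"City", "FrenchCity"}),
              ("ex:Paris_Texas", {"City", "USCity"})],
    "Austin": [("ex:Austin_Texas", {"City", "USCity"})],
    "Dallas": [("ex:Dallas_Texas", {"City", "USCity"}),
               ("ex:Dallas_TV_series", {"TelevisionShow"})],
}
links = link_jointly(candidates)
```

The exhaustive search grows exponentially with the number of labels, so a real implementation would need a greedy or otherwise approximate strategy; the sketch only shows the objective.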
  57. 57. 10/08/13 Heiko Paulheim 57 Challenges and Future Work • Exploiting semantics for feature selection • Given two features: – f1: type(RoadsInAlaska) – f2: type(Road) • and the schema definition RoadsInAlaska rdfs:subClassOf Road • Exploit that information for feature selection – e.g., gain(f1) ≈ gain(f2), f1<f2 (f1 is more specific than f2) → remove f1
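The hierarchy-based selection rule can be sketched as follows. This is a minimal illustration with assumptions: the function name `prune_by_hierarchy` and the tolerance value are hypothetical, the gain values are invented, and the subclass relation is taken to run from the more specific class to the more general one (RoadsInAlaska is a subclass of Road).

```python
def prune_by_hierarchy(gain, subclass_of, tolerance=0.01):
    """Drop the more specific type feature when its gain is about the same
    as that of its superclass.

    gain: dict class name -> information gain of the feature type(class).
    subclass_of: iterable of (subclass, superclass) pairs from the schema.
    tolerance: maximum gain difference still treated as "approximately equal".
    """
    removed = set()
    for sub, sup in subclass_of:
        if sub in gain and sup in gain and abs(gain[sub] - gain[sup]) <= tolerance:
            removed.add(sub)  # keep the more general feature
    return {cls: g for cls, g in gain.items() if cls not in removed}


# Invented gain values for illustration only.
gain = {"RoadsInAlaska": 0.400, "Road": 0.405, "City": 0.100}
schema = [("RoadsInAlaska", "Road")]
kept = prune_by_hierarchy(gain, schema)
```

Keeping the general feature is the choice the slide suggests; when the specific feature has a clearly higher gain, both are retained and a later selection step can decide.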
  58. 58. 10/08/13 Heiko Paulheim 58 Challenges and Future Work • Incompleteness of LOD – e.g., type information in DBpedia – may lead to findings such as • if a city is of type Place, the quality of living is high – possible remedy: autocomplete on the dataset (e.g., Paulheim/Bizer 2013) • Biases in LOD – e.g., DBpedia has a bias towards western culture – may lead to findings such as • if many records have been made in a city, the quality of living is high
  59. 59. 10/08/13 Heiko Paulheim 59 Challenges and Future Work • Features not used for scalability reasons: – features for single entities • e.g., “Roman Polanski directorOf X” – features more than one hop away • e.g., “Cities with a university which has a computer science department” – some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990” • but subject to YAGO's selection bias • Approaches are required to use such features – which respect scalability – “generate first, filter later” is not the best solution • e.g., “Cities with at least one of ArtSchoolsInParis” – on-the-fly filtering may be more suitable • e.g., sampling
  60. 60. 10/08/13 Heiko Paulheim 60 Challenges and Future Work • Automatically exploit data sources with non-simple structures, e.g. (real example from CORDIS dataset):
     EU18931 a Funding .
     EU18931 has-grant-value [ has-amount 1300000 ; has-unit-of-measure EUR ] .
     • Support geo/temporal features – e.g., Data Cubes – e.g., Linked Geo Data • Construct complex features (in a scalable way!) – e.g., cinemas per inhabitant
  61. 61. 10/08/13 Heiko Paulheim 61 Wrap-up • Linked Data is useful as background knowledge – especially on problems which have little knowledge in themselves • Unsupervised methods – avoid biases and work without knowledge about LOD – but: scalability and generality problems • RapidMiner LOD extension – a constantly growing toolkit
  62. 62. 10/08/13 Heiko Paulheim 62 Credits & Thanks • Past contributors of FeGeLOD: – Johannes Fürnkranz – Raad Bahmani – Alexander Gabriel – Simon Holthausen • Current team of RapidMiner Linked Open Data Extension: – Chris Bizer – Petar Ristoski – Evgeny Mitichkin
  63. 63. 10/08/13 Heiko Paulheim 63 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim
