Data Mining with Background Knowledge from the Web - Introducing the RapidMiner Linked Open Data Extension

Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, which can extend a dataset at hand with additional attributes drawn from the Linked Open Data (LOD) cloud, a large collection of publicly available datasets on various topics. The extension contains operators for linking local data to open data in the LOD cloud, and for augmenting it with additional attributes. In a case study, we show that the prediction error of car fuel consumption can be reduced by 50% by adding additional attributes, e.g., describing the automobile layout and the car body configuration, from Linked Open Data.

Transcript

  • 1. Data Mining with Background Knowledge from the Web Introducing the RapidMiner Linked Open Data Extension 08/20/14 Paulheim, Ristoski, Mitichkin, Bizer 1 Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
  • 2. Motivation: An Example Data Mining Task • Analyzing book sales
      Original data:
      ISBN            City       Sold
      3-2347-3427-1   Darmstadt  124
      3-43784-324-2   Mannheim   493
      3-145-34587-0   Roßdorf    14
      With background knowledge added:
      ISBN            City       Population  ...  Genre   Publisher     ...  Sold
      3-2347-3427-1   Darmstadt  144402      ...  Crime   Bloody Books  ...  124
      3-43784-324-2   Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
      3-145-34587-0   Roßdorf    12019       ...  Travel  Up&Away       ...  14
      → Crime novels sell better in larger cities
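The augmentation in the book sales example can be sketched in a few lines of Python. This is only an illustration of the idea of joining background attributes onto local rows, not the extension's implementation; the helper name `augment` and the dictionary layout are assumptions, and the values are the ones shown on the slide.

```python
# Local table from the slide: one row per book.
sales = [
    {"isbn": "3-2347-3427-1", "city": "Darmstadt", "sold": 124},
    {"isbn": "3-43784-324-2", "city": "Mannheim", "sold": 493},
    {"isbn": "3-145-34587-0", "city": "Roßdorf", "sold": 14},
]

# Background knowledge keyed by city (population figures as shown on the slide).
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}


def augment(rows, lookup, key, new_attr):
    """Return copies of rows with new_attr taken from lookup (None if missing).

    Hypothetical helper for illustration; the input rows are not modified.
    """
    return [{**row, new_attr: lookup.get(row[key])} for row in rows]


augmented = augment(sales, city_population, "city", "population")
for row in augmented:
    print(row["city"], row["population"], row["sold"])
```

Returning new rows instead of mutating the input keeps the original table available for a comparison with and without background knowledge.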
  • 3. Motivation • Many data mining problems are solved better – when you have more background knowledge (leaving scalability aside) • Problems: – Tedious work – Selection bias: what to include?
  • 4. Linked Open Data in a Nutshell • Started in 2007 • A collection of ~1,000 open datasets – from various domains, e.g., general knowledge, government data, … – using semantic web standards (HTTP, RDF, SPARQL, …) • Machine processable • Free of charge • Sophisticated tool stacks
  • 5. Linked Open Data in a Nutshell http://lod-cloud.net/
  • 6. Example: DBpedia
  • 7. The RapidMiner LOD Extension
  • 8. The RapidMiner LOD Extension • Automatic discovery of links to Linked Open Data – for local data objects – e.g., the database entry Boston is linked to http://dbpedia.org/resource/Boston • Automatic generation of attributes – e.g., add all numeric values found for Boston (and other cities) • Plus – Feature selection algorithms optimized for LOD – Automatic following of links to other datasets – Schema matching (coming soon) • No need to know Semantic Web technologies!
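The two steps named on this slide, linking a local value such as Boston to a DBpedia URI and then asking for its numeric values, might look roughly like the sketch below. It is not the extension's code: the function names are hypothetical, the linking is a naive URI-pattern guess (label appended to the DBpedia resource namespace), and the query is only built as a string, not sent to any endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def dbpedia_uri(label):
    """Guess a DBpedia URI from a plain label (naive pattern-based linking).

    Assumption for illustration: spaces become underscores and the rest is
    percent-encoded. Real link discovery would also need disambiguation.
    """
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def numeric_values_query(uri):
    """Build a SPARQL query string for all numeric literals of a resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


uri = dbpedia_uri("Boston")
query = numeric_values_query(uri)
print(uri)
print(query)
```

Each property returned by such a query would become one candidate attribute column for the linked entity.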
  • 9. Example: the Auto MPG Dataset • A well-known UCI dataset – Goal: predict fuel consumption of cars • Hypothesis: background knowledge → more accurate predictions • Used background knowledge: – Entity types and categories from DBpedia (=Wikipedia)
  • 10. Example: the Auto MPG Dataset • A well-known UCI dataset – Goal: predict fuel consumption of cars • Hypothesis: background knowledge → more accurate predictions • Used background knowledge: – Entity types and categories from DBpedia (=Wikipedia) • Result: M5Rules down to almost half the prediction error – i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
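The "almost half" on this slide can be checked from the two error values it quotes. The short calculation below only uses those two numbers; the variable names are chosen for illustration.

```python
# Average prediction errors in MPG, as quoted on the slide.
baseline_error = 2.9   # original attributes only
augmented_error = 1.6  # with attributes from Linked Open Data

# Relative reduction of the error achieved by the additional attributes.
relative_reduction = (baseline_error - augmented_error) / baseline_error
print(f"{relative_reduction:.1%}")  # prints 44.8%
```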
  • 11. Example: the Auto MPG Dataset • The original attributes are – cylinders, displacement, horsepower, weight, acceleration, model, origin – plus name (unique string) and mpg (target) • Models built are, e.g., – high horsepower/weight → high consumption • Additional attributes lead to further insights, e.g. – front-wheel drives have a lower consumption than rear-wheel drives – hatchbacks have a lower consumption than station wagons – rally cars generally have a low consumption
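One plausible way to turn entity types and categories into attributes a learner can use is one binary column per category. The sketch below assumes that encoding; it is not taken from the extension, and the entity names `car_a` and `car_b` and their category assignments are hypothetical placeholders, not data from the Auto MPG study.

```python
def binary_features(entity_categories):
    """Turn a mapping entity -> set of categories into binary attributes.

    Every category seen for any entity becomes one True/False column.
    """
    all_categories = sorted(
        {cat for cats in entity_categories.values() for cat in cats}
    )
    return {
        entity: {cat: (cat in cats) for cat in all_categories}
        for entity, cats in entity_categories.items()
    }


# Hypothetical placeholder entities; category labels follow the slide's wording.
cars = {
    "car_a": {"front-wheel drive", "hatchback"},
    "car_b": {"rear-wheel drive", "station wagon"},
}

features = binary_features(cars)
for name, row in features.items():
    print(name, row)
```

With columns like these, a rule learner can express findings such as the front-wheel versus rear-wheel comparison directly as a condition on one attribute.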
  • 12. Example: Analyzing Statistics • As shown, e.g., at ESWC 2012, SemStats 2013 • Statistics found on the web often contain only a few attributes – extreme case: only entity + target • Examples: – Quality of living in cities (right) – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  • 13. Example: Analyzing Statistics • Process in RapidMiner: – load statistic – link entities (cities, countries, etc.) to LOD cloud – collect additional attributes – analyze for correlations with target attribute of statistic
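The last step of this process, analyzing collected attributes for correlations with the target, can be sketched as ranking numeric attributes by their Pearson correlation with the target. This is an assumed, simplified stand-in for what happens inside RapidMiner; the function names are hypothetical, and the rows reuse the city population and sales figures from the book sales slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def rank_attributes(rows, target):
    """Rank numeric attributes by absolute correlation with the target."""
    names = [
        key for key, value in rows[0].items()
        if key != target and isinstance(value, (int, float))
    ]
    target_values = [row[target] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], target_values)
        for name in names
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Rows after linking and attribute collection (values from the book sales slide).
rows = [
    {"city": "Darmstadt", "population": 144402, "sold": 124},
    {"city": "Mannheim", "population": 291458, "sold": 493},
    {"city": "Roßdorf", "population": 12019, "sold": 14},
]

ranking = rank_attributes(rows, "sold")
print(ranking)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which matters for findings of the form "high X → low target".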
  • 14. Example: Analyzing Statistics • Quality of living in cities worldwide: indicators for low quality – too hot (highest temperature in June exceeds 27°C) – too cold (highest temperature in January below 16°C) – too big (total area exceeds 334km²) – poor cultural life (no music recordings made in this city) – or simply: wrong place on the map (latitude<24, longitude<47) • All those attributes come from LOD!
  • 15. Example: Analyzing Statistics • Corruption Perception Index (CPI) by Transparency International • Indicators for low corruption: – high HDI (human development index) – large number of companies – large number of NGOs – small number of cargo airlines?! • Burnout rates in German DAX companies – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – Local companies are less prone to burnout than international ones • Exception: Frankfurt
  • 16. Example: Analyzing Statistics • Sexual activity (based on Durex survey 2005-2009) – Higher in French-speaking than in English-speaking countries – High GDP per capita → low activity – High unemployment rate → high activity – High number of ISPs → low activity http://xkcd.com/552/
  • 17. Further Usage Examples • Classification of Twitter messages (SMILE, 2013) – given a target, e.g., messages related to car traffic – annotate message, extract abstract features for concepts – e.g. “I-90” → highway • Prediction of user location for Twitter (ICWSM, 2013) – useful, e.g., for market research – combination with sentiment analysis: public opinion maps • Identifying disputed topics in the news (LD4KD, 2014) – on a corpus of different online newspapers – identified, e.g., conflicting opinions on drug legislation and gay marriage • Debugging Linked Open Data as such – e.g., identifying wrong links and axioms – combination with outlier detection
  • 18. Conclusions • Many data mining tasks are better solved with more background knowledge – better predictive models – more insights from additional attributes • A lot of such knowledge exists as Linked Open Data • The Linked Open Data extension grants easy access to that data – from within RapidMiner – without the need to know anything about RDF, SPARQL, etc. • Try it out! – find “Linked Open Data” on the marketplace – Google Group: https://groups.google.com/forum/#!forum/rmlod