
Mining the Web of Linked Data with RapidMiner


A large amount of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the data mining platform RapidMiner and offers operators for accessing Linked Open Data, so that it can be used in sophisticated data analysis workflows without knowledge of SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.

Published in: Data & Analytics


  1. Mining the Web of Linked Data with RapidMiner
     Introducing the RapidMiner Linked Open Data Extension
     Petar Ristoski, Christian Bizer, Heiko Paulheim
  2. Motivation
     • Which factors lead to a high corruption rate?
     • How to improve the quality of living?
     • How to find good books to read?
     • How to publish more scientific articles?
     • How to prevent inflation?
     • What makes cars consume less fuel?
     • How to decrease electricity consumption?
     10/27/14 Ristoski, Bizer, Paulheim
  3. Motivation
     ??
  4. Motivation
     Diagram labels: Local, LOD, Data; link, combine, cleanse, transform, analyze
  5. RapidMiner Linked Open Data Extension
     Introducing RapidMiner:
     ● An open source platform for data mining and predictive analytics
     ● Processes are designed by wiring operators in a GUI (no programming)
     ● Operators for data loading, transformation, modeling, visualization, …
     ● Scalable, distributed, parallel processing in a cloud environment
     ● 200,000 active users
     ● Developers can write their own extensions
  6. RapidMiner Linked Open Data Extension
     • The extension adds operators for
       – accessing local and remote semantic web data (RDF, SPARQL, …)
       – linking local to remote data (e.g., DBpedia Lookup)
       – enriching local data (e.g., with data properties from LOD sources)
       – automatically following links to other datasets
       – exploiting semantic schemata for optimizing attribute subset selection (DiscoveryScience'14)
       – matching and fusing data from different sources
     • Data analysts can use it without knowing SPARQL etc.
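To make the "enriching local data" operator on this slide more concrete, the following is a minimal sketch of the kind of SPARQL query and result handling that the extension hides from the analyst. It is not the extension's implementation: the function names and the `SAMPLE_RESPONSE` values are hypothetical, and only the standard SPARQL JSON results layout (`results` → `bindings`) is assumed.

```python
# Hypothetical sketch: fetch literal-valued properties of one entity
# and flatten them into a feature dictionary. No network access is made.

def build_property_query(entity_uri: str, limit: int = 100) -> str:
    """Build a SPARQL query selecting literal-valued properties of one entity."""
    return (
        "SELECT ?p ?o WHERE { "
        f"<{entity_uri}> ?p ?o . "
        "FILTER(isLiteral(?o)) "
        f"}} LIMIT {limit}"
    )


def bindings_to_features(sparql_json: dict) -> dict:
    """Flatten a SPARQL JSON result into {property_uri: value}.

    Only the first value seen for each property is kept.
    """
    features = {}
    for row in sparql_json["results"]["bindings"]:
        features.setdefault(row["p"]["value"], row["o"]["value"])
    return features


# Made-up response in the SPARQL JSON results format (values are illustrative).
SAMPLE_RESPONSE = {
    "head": {"vars": ["p", "o"]},
    "results": {
        "bindings": [
            {
                "p": {"type": "uri", "value": "http://example.org/prop/population"},
                "o": {"type": "literal", "value": "1000"},
            },
            {
                "p": {"type": "uri", "value": "http://example.org/prop/population"},
                "o": {"type": "literal", "value": "1200"},
            },
            {
                "p": {"type": "uri", "value": "http://example.org/prop/area"},
                "o": {"type": "literal", "value": "50"},
            },
        ]
    },
}

if __name__ == "__main__":
    print(build_property_query("http://dbpedia.org/resource/Germany"))
    print(bindings_to_features(SAMPLE_RESPONSE))
```

In a real setting the query string would be sent to a SPARQL endpoint and the JSON response passed to `bindings_to_features`; both steps are left out here so the sketch stays self-contained.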
  7. Example Use Case
     • Which factors correlate with the increase of published scientific and technical journal articles?
     • RapidMiner workflow:
       – Import data from WorldBank RDF data cube
       – Link countries to DBpedia
       – Explore additional datasets
       – Generate attributes
       – Analyze the results
     • now live!
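The "link countries to DBpedia" and "generate attributes" steps of this workflow can be illustrated with a small sketch. It assumes a naive linking strategy (building a DBpedia resource URI from the country label) rather than the lookup service the extension can use, and the attribute table `lod_attributes` and all values in it are invented for illustration.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def naive_dbpedia_uri(label: str) -> str:
    """Guess a DBpedia resource URI from a label (spaces become underscores)."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def link_and_enrich(rows: list, lod_attributes: dict) -> list:
    """Add a 'uri' column to each row and merge in attributes keyed by that URI.

    Rows whose URI has no entry in lod_attributes are kept and only gain 'uri'.
    """
    enriched = []
    for row in rows:
        uri = naive_dbpedia_uri(row["country"])
        enriched.append({**row, "uri": uri, **lod_attributes.get(uri, {})})
    return enriched


if __name__ == "__main__":
    # Illustrative local data and background knowledge (not real figures).
    local_rows = [
        {"country": "Germany", "articles": 10},
        {"country": "United States", "articles": 20},
    ]
    lod_attributes = {
        "http://dbpedia.org/resource/Germany": {"gdp": 3.0},
    }
    for row in link_and_enrich(local_rows, lod_attributes):
        print(row)
```

A label-based guess like this fails for ambiguous or differently spelled names, which is one reason a dedicated lookup or linking operator is preferable in practice.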
  8. Example Use Case
     • Starting from links to DBpedia, we follow links and collect data from
       – DBpedia
       – Linked GeoData
       – Eurostat
       – GeoNames
       – WHO's Global Health Observatory
       – Linked Energy Data
       – OpenCyc
       – World Factbook
       – YAGO
     • Related data is fused
       – e.g., population figures from different sources
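The slide does not say how related values are fused. As one possible strategy, the sketch below takes the median of the values reported by the available sources, which is robust against a single outdated or wrong figure. The source names and population values are placeholders, not data from the deck.

```python
from statistics import median


def fuse_numeric(values_by_source: dict) -> float:
    """Fuse one numeric attribute reported by several sources via the median.

    Sources that report no value (None) are ignored.
    """
    present = [v for v in values_by_source.values() if v is not None]
    if not present:
        raise ValueError("no source provided a value")
    return median(present)


if __name__ == "__main__":
    # Placeholder population figures (in millions) from hypothetical sources.
    population = {
        "source_a": 80.6,
        "source_b": 81.8,
        "source_c": 81.0,
        "source_d": None,  # this source has no value for the entity
    }
    print(fuse_numeric(population))
```

Other fusion policies (most recent value, most trusted source, average) fit the same interface by swapping the last line of `fuse_numeric`.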
  9. Example Use Case
     • Factors that correlate with a large number of publications
       – The fragile state index – FSI (positive)
       – Human development index – HDI (positive)
       – GDP (positive)
         • wealthier countries being able to invest more federal money into science funding?
       – For EU countries, the number of EU seats (positive)
         • an increasing fraction of EU funding for science being attributed to those countries?
       – Many climate indicators (precipitation, hours of sun, temperature)
         • unequal distribution of wealth across different climate zones?
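The slide reports correlations without naming the measure. Assuming Pearson correlation as one plausible choice, ranking generated attributes against the publication count could be sketched as follows; the attribute names and numbers are toy values chosen only to make the example easy to follow.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def rank_attributes(target: list, attributes: dict) -> list:
    """Return (name, correlation) pairs sorted by absolute correlation, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in attributes.items()]
    return sorted(scored, key=lambda pair: -abs(pair[1]))


if __name__ == "__main__":
    articles = [10, 20, 30, 40]           # toy target: publications per country
    attributes = {
        "gdp": [1, 2, 3, 4],              # toy attribute, perfectly correlated
        "precipitation": [4, 1, 3, 2],    # toy attribute, weakly anti-correlated
    }
    for name, r in rank_attributes(articles, attributes):
        print(f"{name}: {r:+.2f}")
```

As the question marks on the slide suggest, a high correlation only points to a candidate explanation; it does not establish a causal factor.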
  10. Other Use Cases
      • Improving performance of predictive models (RMWorld'14)
        – UCI car dataset: predicting fuel consumption
      • Reducing the prediction error of M5' by half
        – on average, we are wrong by 1.6 instead of 2.9 MPG
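The "by half" claim can be checked against the two figures on the slide. The slide does not name the error metric; the sketch below assumes a mean absolute error, which matches the phrasing "on average, we are wrong by", and the helper names are illustrative.

```python
def mean_absolute_error(actual: list, predicted: list) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(error_before: float, error_after: float) -> float:
    """Fraction by which the error shrinks, e.g. 0.5 means the error was halved."""
    return (error_before - error_after) / error_before


if __name__ == "__main__":
    # Figures from the slide: 2.9 MPG without and 1.6 MPG with LOD features.
    print(f"{relative_reduction(2.9, 1.6):.1%}")
```

With the slide's numbers the reduction comes out at roughly 45%, so "by half" is a rounded description.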
  11. Other Use Cases
      • Building Semantic Recommender Systems (ESWC'14)
      • Combines two extensions:
        – Linked Open Data extension
        – Recommender system extension
      • Use data about books for content-based recommender
        – best system (out of 24) on two out of three tasks
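A content-based recommender over LOD-derived features can be sketched in a few lines. This is a generic illustration using Jaccard similarity between feature sets, not the system evaluated at ESWC'14; the book identifiers and feature strings are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked: list, catalog: dict, top_n: int = 3) -> list:
    """Rank books the user has not liked yet by similarity to the user's profile.

    The profile is the union of the feature sets of all liked books.
    Ties are broken alphabetically by book identifier.
    """
    profile = set().union(*(catalog[book] for book in liked))
    scores = {
        book: jaccard(profile, features)
        for book, features in catalog.items()
        if book not in liked
    }
    return sorted(scores, key=lambda book: (-scores[book], book))[:top_n]


if __name__ == "__main__":
    # Hypothetical features, as they might be extracted from LOD sources.
    catalog = {
        "book_a": {"genre:fantasy", "author:x"},
        "book_b": {"genre:fantasy", "author:y"},
        "book_c": {"genre:cooking", "author:z"},
    }
    print(recommend(["book_a"], catalog, top_n=2))
```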
  12. Other Use Cases
      • Debugging Linked Open Data
        – load a subset of statements
        – augment with additional features
        – run outlier detection
      • again: a special extension
      • Example: identify wrong dataset interlinks (WoDOOM'14)
        – AUC up to 85%
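The debugging recipe above (features, outlier scores, AUC) can be sketched end to end. The slide does not name the outlier detection method, so distance to the centroid is used here purely as a stand-in; the feature vectors and labels are toy data.

```python
from math import sqrt


def outlier_scores(vectors: list) -> list:
    """Score each feature vector by its Euclidean distance to the centroid."""
    n, dims = len(vectors), len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dims)]
    return [sqrt(sum((v[d] - centroid[d]) ** 2 for d in range(dims))) for v in vectors]


def auc(scores: list, labels: list) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) gets a
    higher score than a randomly chosen negative (label 0); ties count as 0.5.
    """
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy feature vectors for four links; the last one is labeled as wrong.
    link_features = [[0, 0], [0, 1], [1, 0], [10, 10]]
    is_wrong = [0, 0, 0, 1]
    print(auc(outlier_scores(link_features), is_wrong))
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for a sketch but would be replaced by a rank-based formula on larger link sets.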
  13. Summary
      • This challenge entry
        – brings data analysis to the web of data
        – can be used by data analysts without learning SPARQL
      • Availability
        – on the RapidMiner marketplace
        – installable from inside RapidMiner
        – >4,000 installations and counting