Many data mining problems can be solved better if more background knowledge is added: predictive models can become more accurate, and descriptive models can reveal more interesting findings. However, collecting and integrating background knowledge is tedious manual work. In this paper, we introduce the RapidMiner Linked Open Data Extension, which can extend a dataset at hand with additional attributes drawn from the Linked Open Data (LOD) cloud, a large collection of publicly available datasets on various topics. The extension contains operators for linking local data to open data in the LOD cloud, and for augmenting it with additional attributes. In a case study, we show that the prediction error for car fuel consumption can be reduced by almost 50% by adding attributes from Linked Open Data, e.g., attributes describing the automobile layout and the car body configuration.
Data Mining with Background Knowledge from the Web - Introducing the RapidMiner Linked Open Data Extension
1. Data Mining with Background Knowledge
from the Web
Introducing the RapidMiner
Linked Open Data Extension
08/20/14 Paulheim, Ristoski, Mitichkin, Bizer 1
Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
2. Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN            City        Sold
3-2347-3427-1   Darmstadt   124
3-43784-324-2   Mannheim    493
3-145-34587-0   Roßdorf     14

ISBN            City        Population   ...   Genre    Publisher      ...   Sold
3-2347-3427-1   Darmstadt   144402       ...   Crime    Bloody Books   ...   124
3-43784-324-2   Mannheim    291458       ...   Crime    Guns Ltd.      ...   493
3-145-34587-0   Roßdorf     12019        ...   Travel   Up&Away        ...   14
...
→ Crime novels sell better in larger cities
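The enrichment step behind this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the extension's implementation: the lookup tables `city_facts` and `book_facts` and the helper `enrich` are hypothetical stand-ins for background knowledge that would come from Linked Open Data, and they contain only the values shown on the slide.

```python
# Local sales table as on the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Hypothetical background tables; the values are copied from the slide.
city_facts = {
    "Darmstadt": {"population": 144402},
    "Mannheim": {"population": 291458},
    "Roßdorf": {"population": 12019},
}
book_facts = {
    "3-2347-3427-1": {"genre": "Crime", "publisher": "Bloody Books"},
    "3-43784-324-2": {"genre": "Crime", "publisher": "Guns Ltd."},
    "3-145-34587-0": {"genre": "Travel", "publisher": "Up&Away"},
}


def enrich(rows, city_facts, book_facts):
    """Attach background attributes to each sales row.

    Entities without a match simply get no extra attributes.
    """
    enriched = []
    for isbn, city, sold in rows:
        record = {"isbn": isbn, "city": city, "sold": sold}
        record.update(city_facts.get(city, {}))
        record.update(book_facts.get(isbn, {}))
        enriched.append(record)
    return enriched


enriched_sales = enrich(sales, city_facts, book_facts)
for record in enriched_sales:
    print(record)
```

Only after this join do population and genre sit next to the sales figure, which is what allows a pattern such as "crime novels sell better in larger cities" to be found at all.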
3. Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
4. Linked Open Data in a Nutshell
• Started in 2007
• A collection of ~1,000 open datasets
– from various domains, e.g., general knowledge, government data, …
– using semantic web standards (HTTP, RDF, SPARQL,…)
• Machine processable
• Free of charge
• Sophisticated tool stacks
5. Linked Open Data in a Nutshell
http://lod-cloud.net/
8. The RapidMiner LOD Extension
• Automatic discovery of links to Linked Open Data
– for local data objects
– e.g., the database entry Boston is linked to
http://dbpedia.org/resource/Boston
• Automatic generation of attributes
– e.g., add all numeric values found for Boston (and other cities)
• Plus
– Feature selection algorithms optimized for LOD
– Automatic following of links to other datasets
– Schema matching (coming soon)
• No need to know Semantic Web technologies!
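To give a feel for what linking and attribute generation involve under the hood, here is a minimal Python sketch. It is an assumption-laden illustration, not the extension's algorithm: `guess_dbpedia_uri` and `numeric_values_query` are hypothetical helpers, the naive label-to-URI pattern ignores disambiguation, and the SPARQL query is only built as a string, not sent to any endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label):
    """Guess a DBpedia resource URI from a plain label.

    Naive pattern: spaces become underscores, other unsafe
    characters are percent-encoded. No disambiguation is done.
    """
    name = label.strip().replace(" ", "_")
    return DBPEDIA_RESOURCE + quote(name, safe="_(),'")


def numeric_values_query(uri):
    """Build a SPARQL query asking for all numeric values of an entity."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


boston = guess_dbpedia_uri("Boston")
print(boston)
print(numeric_values_query(boston))
```

Each property returned by such a query would become one candidate attribute column for the local dataset; the point of the extension is that users get this from operators without writing the query themselves.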
10. Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (=Wikipedia)
• Result: M5Rules down to almost half the prediction error
– i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
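The "almost half" figure follows directly from the two error values on this slide. A quick check in Python, where `relative_reduction` is a helper name introduced here purely for illustration:

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which the error shrinks relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Error values taken from the slide: 2.9 MPG without and 1.6 MPG with
# the additional attributes.
reduction = relative_reduction(2.9, 1.6)
print(f"{reduction:.1%}")  # prints 44.8%
```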
11. Example: the Auto MPG Dataset
• The original attributes are
– cylinders, displacement, horsepower, weight, acceleration, model year, origin
– plus name (unique string) and mpg (target)
• Models built are, e.g.,
– high horsepower/weight → high consumption
• Additional attributes lead to further insights, e.g.
– front-wheel drives have a lower consumption than rear-wheel drives
– hatchbacks have a lower consumption than station wagons
– rally cars generally have a low consumption
12. Example: Analyzing Statistics
• As shown, e.g., at ESWC 2012, SemStats 2013
• Statistics found on the web often
contain only a few attributes
– extreme case: only entity + target
• Examples:
– Quality of living in cities (right)
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
13. Example: Analyzing Statistics
• Process in RapidMiner:
– load statistic
– link entities (cities, countries, etc.) to LOD cloud
– collect additional attributes
– analyze for correlations with target attribute of statistic
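The four steps above can be sketched as a small Python pipeline. This is a rough sketch under stated assumptions, not the RapidMiner process itself: the entity names and all numbers are invented toy values, the `attributes` dictionary stands in for the linking and attribute collection steps, and plain Pearson correlation is used as one possible way to "analyze for correlations".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Step 1: load the statistic (entity -> target). Invented toy values.
statistic = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}

# Steps 2-3: stand-in for linking entities and collecting attributes.
# Invented toy values, not real Linked Open Data.
attributes = {
    "A": {"attr_up": 10.0, "attr_down": 8.0},
    "B": {"attr_up": 20.0, "attr_down": 6.0},
    "C": {"attr_up": 30.0, "attr_down": 4.0},
    "D": {"attr_up": 40.0, "attr_down": 2.0},
}


def rank_attributes(statistic, attributes):
    """Step 4: correlate each attribute with the target, strongest first."""
    entities = sorted(statistic)
    target = [statistic[e] for e in entities]
    names = sorted({name for e in entities for name in attributes[e]})
    scores = {
        name: pearson([attributes[e][name] for e in entities], target)
        for name in names
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_attributes(statistic, attributes)
for name, r in ranking:
    print(name, round(r, 3))
```

Ranking by absolute correlation keeps strong negative relationships visible alongside positive ones.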
14. Example: Analyzing Statistics
• Quality of living in cities worldwide: indicators for low quality
– too hot (highest temperature in June exceeds 27°C)
– too cold (highest temperature in January below 16°C)
– too big (total area exceeds 334km²)
– poor cultural life (no music recordings made in this city)
– or simply: wrong place on the map (latitude<24, longitude<47)
all those attributes
come from LOD!
15. Example: Analyzing Statistics
• Corruption Perception Index (CPI) by Transparency International
• Indicators for low corruption:
– high HDI (human development index)
– large number of companies
– large number of NGOs
– small number of cargo airlines?!
• Burnout rates in German DAX companies
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– Local companies are less prone to burnout than international ones
• Exception: Frankfurt
16. Example: Analyzing Statistics
• Sexual activity (based on Durex survey 2005-2009)
– Higher in French-speaking than in English-speaking countries
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
http://xkcd.com/552/
17. Further Usage Examples
• Classification of Twitter messages (SMILE, 2013)
– given a target, e.g., messages related to car traffic
– annotate message, extract abstract features for concepts
– e.g. “I-90” → highway
• Prediction of user location for Twitter (ICWSM, 2013)
– useful, e.g., for market research
– combination with sentiment analysis: public opinion maps
• Identifying disputed topics in the news (LD4KD, 2014)
– on a corpus of different online newspapers
– identified, e.g., conflicting opinions on drug legislation and gay marriage
• Debugging Linked Open Data as such
– e.g., identifying wrong links and axioms
– combination with outlier detection
18. Conclusions
• Many data mining tasks are better solved
with more background knowledge
– better predictive models
– more insights from additional attributes
• A lot of such knowledge exists as Linked Open Data
• The Linked Open Data extension grants easy access to that data
– from within RapidMiner
– without the need to know anything about RDF, SPARQL, etc.
• Try it out!
– find “Linked Open Data” on the marketplace
– Google Group: https://groups.google.com/forum/#!forum/rmlod