10/08/13 Heiko Paulheim 1
Exploiting Linked Open Data
as Background Knowledge in Data Mining
Heiko Paulheim, University of Mannheim
Outline
• Motivation
• The original FeGeLOD framework
• Experiments
• Applications
• The RapidMiner Linked Open Data Extension
• Challenges and Future Work
Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN City Sold
3-2347-3427-1 Darmstadt 124
3-43784-324-2 Mannheim 493
3-145-34587-0 Roßdorf 14
...
ISBN           City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
Motivation
http://lod-cloud.net/
Motivation
• Idea:
– reuse background knowledge from Linked Open Data
– include it in the data mining process as needed
• Two main variants:
– develop mining/learning algorithms that run directly on Linked Data
– create relational features from Linked Data
Motivation
• Develop mining/learning algorithms
– e.g., DL Learner
– e.g., dedicated Kernel functions
• Advantages:
– can be quite efficient
– no reduction to “flat” table structure
– semantics can be respected directly
Motivation
• Create relational features
– e.g., LiDDM
– e.g., AutoSPARQL
– e.g., FeGeLOD / RapidMiner Linked Open Data Extension
• Advantages:
– Easy combination of knowledge from various sources
• including relational features in the original data
– Arbitrary mining algorithms/tools possible
FeGeLOD – Feature Generation from LOD
• Input record:
– ISBN: 3-2347-3427-1
– City: Darmstadt
– # sold: 124
• Named Entity Recognition adds:
– City_URI: http://dbpedia.org/resource/Darmstadt
• Feature Generation adds:
– City_URI_dbpedia-owl:populationTotal: 141471
– City_URI_...: ...
• Feature Selection keeps:
– City_URI_dbpedia-owl:populationTotal: 141471
(the other generated City_URI_... features are dropped)
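The linking and generation steps of this pipeline can be sketched as follows. This is a minimal illustration, not the FeGeLOD implementation: `TOY_LOD` is a hypothetical in-memory stand-in for a LOD source such as DBpedia, and `guess_uri` and `generate_features` are names invented for this sketch.

```python
# Toy stand-in for a LOD source (assumption: a real setup would query a
# SPARQL endpoint instead of this dictionary).
TOY_LOD = {
    "http://dbpedia.org/resource/Darmstadt": {
        "dbpedia-owl:populationTotal": 141471,
    },
}


def guess_uri(label):
    """Simple 'named entity recognition': guess a DBpedia-style URI from a label."""
    uri = "http://dbpedia.org/resource/" + label.strip().replace(" ", "_")
    return uri if uri in TOY_LOD else None


def generate_features(row, column):
    """Add the guessed URI and one feature per data property of the entity."""
    enriched = dict(row)
    uri = guess_uri(row[column])
    enriched[column + "_URI"] = uri
    for prop, value in TOY_LOD.get(uri, {}).items():
        enriched[column + "_URI_" + prop] = value
    return enriched


row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}
enriched = generate_features(row, "City")
print(enriched)
```

Labels that cannot be linked simply get an empty `City_URI` and no generated features, so the original columns are never lost.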
FeGeLOD – Feature Generation from LOD
• Original prototype, based on Weka:
– Simple NER (guessing URIs)
– Seven generators:
• direct types
• data properties
• unqualified relations (boolean, numeric)
• qualified relations (boolean, numeric)
• individuals (dangerous!) - may be restricted to specific property
– Simple feature selection: filtering features
• that have only* different values (except numerical)
• that have only* identical values
• that are mostly missing*
*) 95% or 99%
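The three filter heuristics can be sketched as below. How the threshold is applied to each heuristic is an assumption of this sketch (share of missing values, share of the most frequent value, share of distinct values); `keep_feature` is a hypothetical helper name, not part of FeGeLOD.

```python
from collections import Counter


def keep_feature(values, threshold=0.95):
    """Return False if a feature column matches one of the three filter heuristics."""
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if v is not None]
    # Heuristic: mostly missing
    if (n - len(present)) / n >= threshold:
        return False
    counts = Counter(present)
    # Heuristic: (almost) only identical values
    if counts.most_common(1)[0][1] / len(present) >= threshold:
        return False
    # Heuristic: (almost) only different values, skipped for numeric columns
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if not numeric and len(counts) / len(present) >= threshold:
        return False
    return True


ids = ["id%d" % i for i in range(20)]   # all different, non-numeric
constant = ["x"] * 20                   # all identical
sparse = [None] * 19 + ["x"]            # mostly missing
population = list(range(20))            # all different, but numeric
genre = ["Crime", "Travel"] * 10        # informative nominal feature
```

With these toy columns, only `population` and `genre` survive the filter.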
Experiments
• Testing with two* standard machine learning data sets
– Zoo: classifying animals
– AAUP: predicting income of university employees
(regression task)
• Question: how much improvement do additional features bring?
*) standard ML datasets with meaningful, human-readable labels are scarce!
Experiments: Zoo Dataset
First Results: AAUP
Experiments: Early Insights
• Additional features often improve the results
• Zoo dataset:
– Ripper: 89.11 to 96.04
– SMO: 93.07 to 97.03
– No improvement for Naive Bayes
• AAUP dataset (compensation):
– M5: 59.88 to 51.28
– SMO: 74.12 to 61.97
– No improvement for linear regression
• ...but they may also cause problems
– extreme example: 6.54 to 189.90 for linear regression
– memory and timeouts due to large datasets
Experiments: Quality of Features
• Information gain of features on Zoo dataset
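For reference, information gain for a nominal feature can be computed as in the following sketch. The labels and feature values are toy data invented for illustration, not taken from the Zoo dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data (hypothetical): one generated type feature separates the classes
# perfectly, the other carries no information about the class.
labels = ["mammal", "mammal", "bird", "bird"]
type_mammal = [True, True, False, False]
uninformative = [True, False, True, False]
```

A feature that splits the classes perfectly reaches the full label entropy as its gain; a feature that is independent of the class has a gain of zero.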
Experiments: Quality of Features
• Information gain of features on AAUP dataset (compensation)
Application: Classifying Events from Wikipedia
• Event Extraction from Wikipedia
• Joint work with Dennis Wegener and Daniel Hienert (GESIS)
• Task: event classification (e.g., Politics, Sports, ...)
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Source Material:
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Positive Examples for class politics:
– 2011, March 15 - German chancellor Angela Merkel shuts down the seven
oldest German nuclear power plants.
– 2010, June 3 – Christian Wulff is nominated for President of Germany by
Angela Merkel.
• Negative Examples for class politics:
– 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first
time, along with Netherlands make the 2010 FIFA World Cup Final.
– 2012, February 16 – Roman Lob is selected to represent Germany in the
Eurovision Song Contest.
• Possible learned model:
– "Angela Merkel" → Politics
• How can we do better?
• Background knowledge from Linked Open Data
– 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts
down the seven oldest German nuclear power plants.
– 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class:
Politician] is elected to continue as Minister-President, heading an SPD-
Green coalition.
• Model learned in that case:
– "[class: Politician]" → Politics
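The effect of the type annotation can be sketched as follows. `ENTITY_CLASS` is a hypothetical lookup table standing in for the result of linking mentions to a LOD source and reading their types; `type_features` is a name invented for this sketch.

```python
# Hypothetical entity-to-class lookup (assumption: in practice this comes from
# entity linking plus the types found in a LOD source such as DBpedia).
ENTITY_CLASS = {
    "Angela Merkel": "Politician",
    "Hannelore Kraft": "Politician",
    "Roman Lob": "Singer",
}


def type_features(text):
    """Return the set of '[class: X]' features for known entities in the text."""
    return {
        "[class: %s]" % cls
        for name, cls in ENTITY_CLASS.items()
        if name in text
    }


event_a = "German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants."
event_b = "Hannelore Kraft is elected to continue as Minister-President."
event_c = "Roman Lob is selected to represent Germany in the Eurovision Song Contest."
```

Both political events now share the feature `[class: Politician]` although they mention different people, which is what allows a learner to generalize beyond the names seen in training.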
• Much more general
– Can also classify events with politicians
not contained in the training set
• Fewer training examples required
– A few events with politicians, athletes, singers, ... are enough
Application: Classifying Events from Wikipedia
• Experiments on Wikipedia data
– >10 categories
– 1,000 labeled examples as training set
– Classification accuracy: 80%
• Plus:
– We have trained a language-independent model!
• often, models are like "elect*" → Politics
– 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von
Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
– 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för
Vänsterpartiet efter Lars Ohly [class: Politician].
Application: Classifying Tweets
• Joint work with Axel Schulz and Petar Ristoski (SAP Research)
• Goal: using Twitter for emergency management
Example tweets:
– "fire at #mannheim #university"
– "omg two cars on fire #A5 #accident"
– "fire at train station still burning"
– "my heart is on fire!!!"
– "come on baby light my fire"
– "boss should fire that stupid moron"
Application: Classifying Tweets
• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident/non-incident tweets
– Apply the model to detect incident-related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)
Application: Classifying Tweets
• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
Application: Classifying Tweets
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident
Using LOD for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
dbpedia:Interstate_90
dbpedia-owl:Road
rdf:type
dbpedia:Interstate_51
rdf:type
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy keeps up at 90%
– Overfitting is avoided
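The I90/I51 example can be replayed with a deliberately naive toy learner. Everything here is an assumption made for illustration: `TOKEN_TYPES` is a hypothetical lookup standing in for DBpedia Spotlight plus FeGeLOD, the stop word list is ad hoc, and the "model" is just the set of features shared by all positive examples, not the classifier used in the experiments.

```python
# Hypothetical token-to-type lookup and a tiny ad hoc stop word list.
TOKEN_TYPES = {"I90": "dbpedia-owl:Road", "I51": "dbpedia-owl:Road"}
STOPWORDS = {"on", "at", "the"}


def features(tweet, use_types):
    """Lower-cased tokens; linked tokens are replaced by their type if use_types is set."""
    feats = set()
    for token in tweet.split():
        if token.lower() in STOPWORDS:
            continue
        if use_types and token in TOKEN_TYPES:
            feats.add(TOKEN_TYPES[token])
        else:
            feats.add(token.lower())
    return feats


def learn_indicators(positive_tweets, use_types):
    """Toy 'model': the features that occur in every positive training example."""
    return set.intersection(*(features(t, use_types) for t in positive_tweets))


def predict(tweet, indicators, use_types):
    """Flag a tweet as incident-related if it contains any learned indicator."""
    return bool(features(tweet, use_types) & indicators)


train = ["Again crash on I90", "Accident on I90"]
new_tweet = "Two cars crashed on I51"

plain_model = learn_indicators(train, use_types=False)
typed_model = learn_indicators(train, use_types=True)
print(predict(new_tweet, plain_model, use_types=False))
print(predict(new_tweet, typed_model, use_types=True))
```

Without types, the only learned indicator is the token for I90, so the tweet about I51 is missed; with types, the indicator is `dbpedia-owl:Road` and the same tweet is recognized.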
Explaining Statistics
• Statistics are very widespread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
Explaining Statistics
• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)
Explaining Statistics
http://xkcd.com/605/
Explaining Statistics
• What statistics often look like
Explaining Statistics
• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• Approach:
– use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
– run analysis tools for finding explanations
Prototype Tool: Explain-a-LOD
• Loads a statistics file (e.g., CSV)
• Adds background knowledge
• Runs basic analysis (correlation, rule learning)
• Presents explanations
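The load, enrich, and correlate steps can be sketched as below. This is not the Explain-a-LOD code: the CSV content, the `BACKGROUND` table, and all numbers are invented for illustration, and `explain` is a hypothetical function name. Only the feature name `junHighC` is borrowed from the rule examples in this deck.

```python
import csv
import io
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if one of the series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical statistics file and background knowledge (all values invented).
STATS_CSV = "city,score\nA,90\nB,70\nC,50\n"
BACKGROUND = {
    "A": {"junHighC": 20.0},
    "B": {"junHighC": 25.0},
    "C": {"junHighC": 30.0},
}


def explain(csv_text, key, target):
    """Correlate every background feature with the target column of the statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = sorted({f for r in rows for f in BACKGROUND.get(r[key], {})})
    result = {}
    for name in names:
        # Use only rows for which the background feature is available.
        pairs = [
            (BACKGROUND[r[key]][name], float(r[target]))
            for r in rows
            if name in BACKGROUND.get(r[key], {})
        ]
        xs, ys = zip(*pairs)
        result[name] = pearson(xs, ys)
    return result


correlations = explain(STATS_CSV, key="city", target="score")
print(correlations)
```

Strongly positive or negative coefficients are candidates for explanations; whether they are meaningful still has to be judged by a human.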
Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in 216 cities worldwide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• LOD data sets used in the examples:
– DBpedia
– CIA World Factbook for statistics by country
Statistical Data: Examples
• Example rules for cities with low quality of living
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
• a very accurate rule
• but what's the interpretation?
[Figure: sign reading "Next Record Studio 2547 miles"]
Statistical Data: Examples
Statistical Data: Examples
• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results
Statistical Data: Examples
• Example rules for countries with low corruption
– HDI > 78%
• Human Development Index, calculated from
life expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarters in that state,
but fewer than two cargo airlines
Statistical Data: Examples
• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
Evaluation of Feature Quality
• Quality of living dataset
[Figure: chart comparing Correlation and Rule Learning per generator (Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint); value axis from 1 to 5]
Evaluation of Feature Quality
• Corruption dataset
[Figure: chart comparing Correlation and Rule Learning per generator (Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint); value axis from 1 to 5]
Statistical Data: Examples
• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
• Exception: Frankfurt
Statistical Data: Examples
• Data Set: Antidepressant consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
Statistical Data: Examples
• Findings for antidepressant consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
• Nordic countries, countries at the Atlantic: high
• Mediterranean: medium
• Alpine countries: low
– High average age → high consumption
– High birth rates → high consumption
Statistical Data: Examples
• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html
Statistical Data: Examples
• Findings for suicide rates
– Democracies have lower suicide rates than other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
• At the sea → low
• In the mountains → high
– High Gini index → low suicide rate
• High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates
Statistical Data: Examples
• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya
Statistical Data: Examples
• Findings on sexual activity
– By geography:
• High in Europe, low in Asia
• Low in Island states
– By language:
• English speaking: low
• French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/
RapidMiner Linked Open Data Extension
• August 16th, 2013: FeGeLOD celebrates its 2nd birthday
• Problems
– still no nice UI
– special configurations are tricky
– difficult to enhance
• Decision
– Reimplementation on RapidMiner platform
– September 13th, 2013: Release of RapidMiner Linked Open Data Extension
– Available from RapidMiner marketplace
• http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
RapidMiner Linked Open Data Extension
• Simple wiring of operators
– linkers
– generators
• Combination with powerful RapidMiner operators
RapidMiner Linked Open Data Extension
• Easy SPARQL endpoint definitions
• Support of custom SPARQL statements
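As an illustration of what such custom statements might look like, the sketch below builds two query strings: one for the direct types of an entity and one that counts outgoing relations per property. The function names are hypothetical and the queries are a guess at what a generator could send, not the extension's actual queries. The second query uses the SPARQL 1.1 aggregate COUNT, which not every endpoint may support.

```python
def direct_types_query(uri):
    """SPARQL query for the direct types of an entity."""
    return "SELECT DISTINCT ?type WHERE { <%s> a ?type }" % uri


def outgoing_relation_counts_query(uri):
    """SPARQL 1.1 query counting outgoing relations per property (needs aggregates)."""
    return (
        "SELECT ?p (COUNT(?o) AS ?n) "
        "WHERE { <%s> ?p ?o } GROUP BY ?p" % uri
    )


uri = "http://dbpedia.org/resource/Darmstadt"
print(direct_types_query(uri))
print(outgoing_relation_counts_query(uri))
```

The strings can be sent to any SPARQL endpoint with an HTTP client; the URI is inserted verbatim, so it should come from a trusted linking step.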
Challenges and Future Work
• SPARQL variants
– Some endpoints support special/non-standard SPARQL constructs
– COUNT(...)
– transitive closure
– exploit where applicable
• Implementations without SPARQL
– Freebase
– OpenCyc
Challenges and Future Work
• Linking is still challenging
– URI patterns are not flexible
– Search by label is time-consuming
– Services like DBpedia Lookup are scarce
• Limitations of completely unsupervised linking
– e.g., Hurricanes
– how to use headlines/attribute names?
Challenges and Future Work
• Linking as optimization problem
– find candidates for all entities, e.g., by DBpedia lookup
– find a selection of candidates that are most similar to each other
• e.g., all of them are U.S. cities
– some experiments with types and categories
• problem: not complete
– some problems cannot be addressed (e.g.: Hurricanes)
• Alternatives:
– semi-supervised linking – user provides some example links
– active learning
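The optimization view of linking can be sketched with a brute-force search: pick one candidate per entity so that the chosen candidates are as similar to each other as possible. Similarity is measured here as Jaccard overlap of type sets, which is one possible choice and an assumption of this sketch. The candidate lists, the `ex:` identifiers, and the type names are all invented, and every entity is assumed to have at least one candidate.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_jointly(candidates):
    """Choose one candidate per entity, maximizing the summed pairwise type similarity."""
    entities = list(candidates)
    best_score, best_choice = -1.0, None
    for choice in product(*(candidates[e] for e in entities)):
        score = sum(jaccard(x[1], y[1]) for x, y in combinations(choice, 2))
        if score > best_score:
            best_score, best_choice = score, choice
    return {e: uri for e, (uri, _types) in zip(entities, best_choice)}


# Hypothetical candidates, e.g. as a lookup service might return them.
CANDIDATES = {
    "Paris": [
        ("ex:Paris", {"City", "FrenchCity"}),
        ("ex:Paris,_Texas", {"City", "USCity"}),
    ],
    "Portland": [
        ("ex:Portland,_Oregon", {"City", "USCity"}),
        ("ex:Portland_(band)", {"Band"}),
    ],
    "Austin": [
        ("ex:Austin,_Texas", {"City", "USCity"}),
        ("ex:Austin_(name)", {"Name"}),
    ],
}

links = link_jointly(CANDIDATES)
print(links)
```

The exhaustive search grows exponentially with the number of entities, so a real implementation would need a heuristic or approximate search; the sketch also inherits the incompleteness problem noted above when type information is missing.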
Challenges and Future Work
• Exploiting semantics for feature selection
• Given two features:
– f1: type(RoadsInAlaska)
– f2: type(Road)
• and the schema definition RoadsInAlaska rdfs:subClassOf Road
• Exploit that information for feature selection
– e.g., gain(f1) ≈ gain(f2), f1<f2 → remove f1
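A minimal sketch of this idea, reading f1 < f2 as "f1 is more specific than f2" (RoadsInAlaska being the subclass, Road the superclass). The gain values, the tolerance, and the function name `prune_by_hierarchy` are assumptions made for illustration.

```python
def prune_by_hierarchy(gain, subclass_of, tolerance=0.05):
    """Drop a type feature if its superclass feature has about the same gain."""
    removed = set()
    for specific, general in subclass_of.items():
        if specific in gain and general in gain:
            if abs(gain[specific] - gain[general]) <= tolerance:
                removed.add(specific)
    return [f for f in gain if f not in removed]


# Hypothetical information gain values for three generated features.
gain = {"type(RoadsInAlaska)": 0.41, "type(Road)": 0.42, "type(City)": 0.10}
# Schema knowledge: the key is a subclass of the value.
subclass_of = {"type(RoadsInAlaska)": "type(Road)"}

print(prune_by_hierarchy(gain, subclass_of))
```

The more specific feature is removed because the general one carries almost the same information and is likely to apply to more unseen instances.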
Challenges and Future Work
• Incompleteness of LOD
– e.g., type information in DBpedia
– may lead to findings such as
• if a city is of type Place, the quality of living is high
– possible remedy: autocomplete on the dataset
(e.g., Paulheim/Bizer 2013)
• Biases in LOD
– e.g., DBpedia has a bias towards western culture
– may lead to findings such as
• if many records have been made in a city, the quality of living is high
Challenges and Future Work
• Features not used for scalability reasons:
– features for single entities
• e.g., “Roman Polanski directorOf X”
– features more than one hop away
• e.g., “Cities with a university which has a computer science department”
– some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
• but subject to YAGO's selection bias
• Approaches are required to use such features
– which respect scalability
– “generate first, filter later” is not the best solution
• e.g., “Cities with at least one of ArtSchoolsInParis”
– on-the-fly filtering may be more suitable
• e.g., sampling
Challenges and Future Work
• Automatically exploit data sources with non-simple structures
(real example from CORDIS dataset):
EU18931 a Funding .
EU18931 has-grant-value [
  has-amount 1300000 .
  has-unit-of-measure EUR .
]
• Support geo/temporal features
– e.g., Data Cubes
– e.g., Linked Geo Data
• Construct complex features (in a scalable way!)
– e.g., cinemas per inhabitant
Wrap-up
• Linked Data is useful as background knowledge
– especially on problems which have little knowledge in themselves
• Unsupervised methods
– avoid biases and work without knowledge about LOD
– but: scalability and generality problems
• RapidMiner LOD extension
– a constantly growing toolkit
Credits & Thanks
• Past contributors of FeGeLOD:
– Johannes Fürnkranz
– Raad Bahmani
– Alexander Gabriel
– Simon Holthausen
• Current team of RapidMiner Linked Open Data Extension:
– Chris Bizer
– Petar Ristoski
– Evgeny Mitichkin

More Related Content

Viewers also liked

Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...
Piet J.H. Daas
 
Quality management system procedures
Quality management system proceduresQuality management system procedures
Quality management system proceduresselinasimpson2101
 
Quality framework
Quality frameworkQuality framework
Quality frameworksaurabhshri
 
WebeX Presentation - Quality Consortium
WebeX Presentation - Quality ConsortiumWebeX Presentation - Quality Consortium
WebeX Presentation - Quality Consortium
The Avoca Group
 
Sharepoint quality management system
Sharepoint quality management systemSharepoint quality management system
Sharepoint quality management systemselinasimpson2101
 
Mixed Methods Research
Mixed Methods ResearchMixed Methods Research
Mixed Methods Research
Roller Research
 
Process asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolProcess asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolKobi Vider
 
2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDFMelissa Jones
 
QMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition DocumentQMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition Document
Melissa Jones
 
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data DictionaryPart 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Melissa Jones
 
QMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you useQMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you use
Melissa Jones
 
Quality framework 1
Quality framework 1Quality framework 1
Quality framework 1
Shwetha Bhat
 
Metadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full versionMetadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full version
Péter Király
 
Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?
Grzegorz Grela
 
Audit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAsAudit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAs
International Federation of Accountants
 
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...Barry Peters
 
QMS Calibration Powerpoint
QMS Calibration PowerpointQMS Calibration Powerpoint
QMS Calibration PowerpointDennis J Morgan
 
PAS: The Planning Quality Framework
PAS: The Planning Quality FrameworkPAS: The Planning Quality Framework
PAS: The Planning Quality Framework
PAS_Team
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
Kevin Watters
 

Viewers also liked (20)

Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...
 
Quality management system procedures
Quality management system proceduresQuality management system procedures
Quality management system procedures
 
Quality framework
Quality frameworkQuality framework
Quality framework
 
Bpo risk management 2013
Bpo risk management 2013Bpo risk management 2013
Bpo risk management 2013
 
WebeX Presentation - Quality Consortium
WebeX Presentation - Quality ConsortiumWebeX Presentation - Quality Consortium
WebeX Presentation - Quality Consortium
 
Sharepoint quality management system
Sharepoint quality management systemSharepoint quality management system
Sharepoint quality management system
 
Mixed Methods Research
Mixed Methods ResearchMixed Methods Research
Mixed Methods Research
 
Process asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolProcess asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing tool
 
2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF
 
QMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition DocumentQMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition Document
 
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data DictionaryPart 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
 
QMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you useQMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you use
 
Quality framework 1
Quality framework 1Quality framework 1
Quality framework 1
 
Metadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full versionMetadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full version
 
Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?
 
Audit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAsAudit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAs
 
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
 
QMS Calibration Powerpoint
QMS Calibration PowerpointQMS Calibration Powerpoint
QMS Calibration Powerpoint
 
PAS: The Planning Quality Framework
PAS: The Planning Quality FrameworkPAS: The Planning Quality Framework
PAS: The Planning Quality Framework
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
 

Similar to Exploiting Linked Open Data as Background Knowledge in Data Mining

Towards Topics-based, Semantics-assisted News Search | WIMS13
Towards Topics-based, Semantics-assisted News Search | WIMS13Towards Topics-based, Semantics-assisted News Search | WIMS13
Towards Topics-based, Semantics-assisted News Search | WIMS13
Fink & Partner Media Services GmbH
 
Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...
Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...
Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...
Dagmar Monett
 
Creation of custom KOS-based recommendation systems
Creation of custom KOS-based recommendation systemsCreation of custom KOS-based recommendation systems
Creation of custom KOS-based recommendation systemsGESIS
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge Discovery
Heiko Paulheim
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Heiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!
Heiko Paulheim
 
Open Learning Analytics LSAC2018
Open Learning Analytics LSAC2018Open Learning Analytics LSAC2018
Open Learning Analytics LSAC2018
Ian Dolphin
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UK
slejay
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge Graphs
Heiko Paulheim
 
Open Data and Data Journalism
Open Data and Data JournalismOpen Data and Data Journalism
Open Data and Data Journalism
Irina Radchenko
 
Research Data management - Importance, Good Practices, Guidance
Research Data management - Importance, Good Practices, GuidanceResearch Data management - Importance, Good Practices, Guidance
Research Data management - Importance, Good Practices, Guidance
Frank Uiterwaal
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentation
Piet J.H. Daas
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on Twitter
Heiko Paulheim
 
Spark
SparkSpark
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statistics
Edwin de Jonge
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?
cneudecker
 
Professional Information Research
Professional Information ResearchProfessional Information Research
Professional Information Research
Eric Kokke
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact Solutions
Mohd Izhar Firdaus Ismail
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)
Alex Clark
 






Exploiting Linked Open Data as Background Knowledge in Data Mining

  • 1. 10/08/13 Heiko Paulheim 1 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim
  • 2. 10/08/13 Heiko Paulheim 2 Outline • Motivation • The original FeGeLOD framework • Experiments • Applications • The RapidMiner Linked Open Data Extension • Challenges and Future Work
  • 3. 10/08/13 Heiko Paulheim 3 Motivation: An Example Data Mining Task • Analyzing book sales • Original table (ISBN | City | Sold): 3-2347-3427-1 | Darmstadt | 124; 3-43784-324-2 | Mannheim | 493; 3-145-34587-0 | Roßdorf | 14; ... • Enriched table (ISBN | City | Population | ... | Genre | Publisher | ... | Sold): 3-2347-3427-1 | Darmstadt | 144402 | ... | Crime | Bloody Books | ... | 124; 3-43784-324-2 | Mannheim | 291458 | ... | Crime | Guns Ltd. | ... | 493; 3-145-34587-0 | Roßdorf | 12019 | ... | Travel | Up&Away | ... | 14; ... → Crime novels sell better in larger cities
  • 4. 10/08/13 Heiko Paulheim 4 Motivation • Many data mining problems are solved better – when you have more background knowledge (leaving scalability aside) • Problems: – Tedious work – Selection bias: what to include?
  • 5. 10/08/13 Heiko Paulheim 5 Motivation http://lod-cloud.net/
  • 6. 10/08/13 Heiko Paulheim 6 Motivation • Idea: – reuse background knowledge from Linked Open Data – include it in the data mining process as needed • Two main variants: – develop mining/learning algorithms that run directly on Linked Data – create relational features from Linked Data
  • 7. 10/08/13 Heiko Paulheim 7 Motivation • Develop mining/learning algorithms – e.g., DL Learner – e.g., dedicated Kernel functions • Advantages: – can be quite efficient – no reduction to “flat” table structure – semantics can be respected directly
  • 8. 10/08/13 Heiko Paulheim 8 Motivation • Create relational features – e.g., LiDDM – e.g., AutoSPARQL – e.g., FeGeLOD / RapidMiner Linked Open Data Extension • Advantages: – Easy combination of knowledge from various sources • including relational features in the original data – Arbitrary mining algorithms/tools possible
  • 9. 10/08/13 Heiko Paulheim 9 FeGeLOD – Feature Generation from LOD • Input: ISBN 3-2347-3427-1, City Darmstadt, #sold 124 • Named Entity Recognition adds: City_URI http://dbpedia.org/resource/Darmstadt • Feature Generation adds: City_URI_dbpedia-owl:populationTotal 141471, City_URI_... ... • Feature Selection keeps: ISBN 3-2347-3427-1, City Darmstadt, #sold 124, City_URI http://dbpedia.org/resource/Darmstadt, City_URI_dbpedia-owl:populationTotal 141471
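The first two steps of the pipeline above can be sketched as follows. This is a minimal sketch, not FeGeLOD's actual code: it assumes the naive strategy of turning a city label into a DBpedia resource URI by joining its words with underscores, and the names `guess_dbpedia_uri` and `data_property_query` are hypothetical. The SPARQL query is only built as a string and is not sent to any endpoint.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label: str) -> str:
    """Guess a DBpedia URI from a plain label (naive, no disambiguation)."""
    return DBPEDIA_RESOURCE + "_".join(label.split())


def data_property_query(uri: str) -> str:
    """Build a SPARQL query for all literal-valued properties of `uri`."""
    return (
        "SELECT ?p ?v WHERE { "
        f"<{uri}> ?p ?v . "
        "FILTER(isLiteral(?v)) }"
    )


# One row of the book sales example from the slides.
row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "#sold": 124}

# Named entity recognition step: add a City_URI column.
row["City_URI"] = guess_dbpedia_uri(row["City"])

# Feature generation step: the query whose results would become new columns.
query = data_property_query(row["City_URI"])
```

A guessed URI may not exist or may point to the wrong entity, which is presumably why the slides call this NER step "simple".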
  • 10. 10/08/13 Heiko Paulheim 10 FeGeLOD – Feature Generation from LOD • Original prototype, based on Weka: – Simple NER (guessing URIs) – Seven generators: • direct types • data properties • unqualified relations (boolean, numeric) • qualified relations (boolean, numeric) • individuals (dangerous!), may be restricted to a specific property – Simple feature selection: filtering features • that have only* different values (except numerical) • that have only* identical values • that are mostly missing* *) 95% or 99%
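The three filtering rules of the simple feature selection can be sketched as below. This is one possible reading of the slide, not the Weka-based prototype: it assumes a feature is a plain list of values with `None` marking a missing value, and that "only*" means "at least the threshold fraction of the non-missing values". The names `keep_feature` and `select_features` are hypothetical.

```python
from collections import Counter


def keep_feature(values, threshold=0.95):
    """Return False if the feature matches one of the three filter rules."""
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if v is not None]

    # Rule 3: mostly missing.
    if 1 - len(present) / n >= threshold:
        return False

    counts = Counter(present)

    # Rule 2: (almost) only identical values.
    if counts.most_common(1)[0][1] / len(present) >= threshold:
        return False

    # Rule 1: (almost) only different values, except for numerical features.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if not numeric and len(counts) / len(present) >= threshold:
        return False

    return True


def select_features(table, threshold=0.95):
    """Keep only the columns of `table` (name -> list of values) that pass."""
    return {
        name: values
        for name, values in table.items()
        if keep_feature(values, threshold)
    }
```

With the 99% setting from the slide, pass `threshold=0.99`.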
  • 11. 10/08/13 Heiko Paulheim 11 Experiments • Testing with two* standard machine learning data sets – Zoo: classifying animals – AAUP: predicting income of university employees (regression task) • Question: how much improvement do additional features bring? *) standard ML datasets with speaking labels are scarce!
  • 12. 10/08/13 Heiko Paulheim 12 Experiments: Zoo Dataset
  • 13. 10/08/13 Heiko Paulheim 13 First Results: AAUP
  • 14. 10/08/13 Heiko Paulheim 14 Experiments: Early Insights • Additional features often improve the results • Zoo dataset: – Ripper: 89.11 to 96.04 – SMO: 93.07 to 97.03 – No improvement for Naive Bayes • AAUP dataset (compensation): – M5: 59.88 to 51.28 – SMO: 74.12 to 61.97 – No improvement for linear regression • ...but they may also cause problems – extreme example: 6.54 to 189.90 for linear regression – memory and timeouts due to large datasets
  • 15. 10/08/13 Heiko Paulheim 15 Experiments: Quality of Features • Information gain of features on Zoo dataset
  • 16. 10/08/13 Heiko Paulheim 16 Experiments: Quality of Features • Information gain of features on AAUP dataset (compensation)
  • 17. 10/08/13 Heiko Paulheim 17 Application: Classifying Events from Wikipedia • Event Extraction from Wikipedia • Joint work with Dennis Wegener and Daniel Hienert (GESIS) • Task: event classification (e.g., Politics, Sports, ...) http://www.vizgr.org/historical-events/timeline/
  • 18. 10/08/13 Heiko Paulheim 18 Application: Classifying Events from Wikipedia • Source Material: http://www.vizgr.org/historical-events/timeline/
  • 19. 10/08/13 Heiko Paulheim 19 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
  • 20. 10/08/13 Heiko Paulheim 20 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest. • Possible learned model: – "Angela Merkel" → Politics
  • 21. 10/08/13 Heiko Paulheim 21 Application: Classifying Events from Wikipedia • Possibly Learned Model: – "Angela Merkel" → Politics • How can we do better? • Background knowledge from Linked Open Data – 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants. – 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD-Green coalition. • Model learned in that case: – "[class: Politician]" → Politics
  • 22. 10/08/13 Heiko Paulheim 22 Application: Classifying Events from Wikipedia • Model learned in that case: – "[class: Politician]" → Politics • Much more general – Can also classify events with politicians not contained in the training set • Fewer training examples required – A few events with politicians, athletes, singers, ... are enough
  • 23. 10/08/13 Heiko Paulheim 23 Application: Classifying Events from Wikipedia • Experiments on Wikipedia data – >10 categories – 1,000 labeled examples as training set – Classification accuracy: 80% • Plus: – We have trained a language-independent model! • often, models are like "elect*" → Politics – 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt. – 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician].
  • 24. 10/08/13 Heiko Paulheim 24 Application: Classifying Tweets • Joint work with Axel Schulz and Petar Ristoski (SAP Research) • Goal: using Twitter for emergency management • Example tweets: – "fire at #mannheim #university" – "omg two cars on fire #A5 #accident" – "fire at train station still burning" – "my heart is on fire!!!" – "come on baby light my fire" – "boss should fire that stupid moron"
  • 25. 10/08/13 Heiko Paulheim 25 Application: Classifying Tweets • Social media contains data on many incidents – But keyword search is not enough – Detecting small incidents is hard – Manual inspection is too expensive (and slow) • Machine learning could help – Train a model to classify incident/non-incident tweets – Apply the model to detect incident-related tweets • Training data: – Traffic accidents – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand-labeled (50% related to traffic incidents)
  • 26. 10/08/13 Heiko Paulheim 26 Application: Classifying Tweets • Learning to classify tweets: – Positive and negative examples – Features: • Stemming • POS tagging • Word n-grams • … • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city
  • 27. 10/08/13 Heiko Paulheim 27 Application: Classifying Tweets • Example set: – “Again crash on I90” – “Accident on I90” • Model: – “I90” → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → not related to traffic accident
  • 28. 10/08/13 Heiko Paulheim 28 Using LOD for Preventing Overfitting • Example set: – “Again crash on I90” – “Accident on I90” • Background knowledge: – dbpedia:Interstate_90 rdf:type dbpedia-owl:Road – dbpedia:Interstate_51 rdf:type dbpedia-owl:Road • Model: – dbpedia-owl:Road → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → indicates traffic accident • Using DBpedia Spotlight + FeGeLOD – Accuracy keeps up at 90% – Overfitting is avoided
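The effect described on the last two slides can be reproduced with a toy rule learner. This is a minimal sketch under loud assumptions: `ENTITY_TYPES` is a hand-written stand-in for what an entity linker plus a type lookup would return, the "learner" simply keeps features shared by all positive examples and absent from the negatives, and all function names are hypothetical. It is not the classifier used in the reported experiments.

```python
# Hypothetical stand-in for entity linking plus rdf:type lookup.
ENTITY_TYPES = {"I90": "dbpedia-owl:Road", "I51": "dbpedia-owl:Road"}


def extract_features(tweet, use_types):
    """Bag of lowercased tokens; linked entities become their type if use_types."""
    feats = set()
    for token in tweet.split():
        if use_types and token in ENTITY_TYPES:
            feats.add(ENTITY_TYPES[token])
        else:
            feats.add(token.lower())
    return feats


def learn_indicators(positives, negatives, use_types):
    """Features present in every positive example and in no negative example."""
    pos_sets = [extract_features(t, use_types) for t in positives]
    common = set.intersection(*pos_sets)
    seen_negative = set().union(
        *(extract_features(t, use_types) for t in negatives)
    )
    return common - seen_negative


def is_accident(tweet, indicators, use_types):
    """Classify a tweet as accident-related if it contains any indicator."""
    return bool(extract_features(tweet, use_types) & indicators)


positives = ["Again crash on I90", "Accident on I90"]
negatives = ["my heart is on fire"]
new_tweet = "Two cars crashed on I51"

token_model = learn_indicators(positives, negatives, use_types=False)
type_model = learn_indicators(positives, negatives, use_types=True)

token_result = is_accident(new_tweet, token_model, use_types=False)
type_result = is_accident(new_tweet, type_model, use_types=True)
```

The token-based model learns the road name itself and misses the tweet about a different road, while the type-based model learns `dbpedia-owl:Road` and carries over.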
  • 29. 10/08/13 Heiko Paulheim 29 Explaining Statistics • Statistics are very widespread – Quality of living in cities – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  • 30. 10/08/13 Heiko Paulheim 30 Explaining Statistics • Questions we are often interested in – Why does city X have a high/low quality of living? – Why is the corruption higher in country A than in country B? – Will a new film create a high/low box office revenue? • i.e., we are looking for – explanations – forecasts (e.g., extrapolations)
  • 31. 10/08/13 Heiko Paulheim 31 Explaining Statistics http://xkcd.com/605/
  • 32. 10/08/13 Heiko Paulheim 32 Explaining Statistics • What statistics often look like
  • 33. 10/08/13 Heiko Paulheim 33 Explaining Statistics • There are powerful tools for finding correlations etc. – but many statistics cannot be interpreted directly – background knowledge is missing • Approach: – use Linked Open Data for enriching statistical data (e.g., FeGeLOD) – run analysis tools for finding explanations
  • 34. 10/08/13 Heiko Paulheim 34 Prototype Tool: Explain-a-LOD • Loads a statistics file (e.g., CSV) • Adds background knowledge • Runs basic analysis (correlation, rule learning) • Presents explanations
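The "basic analysis" step can be sketched as ranking generated features by their correlation with the statistic. This is an illustrative sketch only, not Explain-a-LOD's implementation: the rows are made-up toy values, and `pearson` and `rank_features` are hypothetical helper names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Made-up toy data: a statistic ("score") plus two generated features.
rows = [
    {"score": 100, "junHighC": 20, "areaTotalKm": 300},
    {"score": 90, "junHighC": 22, "areaTotalKm": 100},
    {"score": 60, "junHighC": 30, "areaTotalKm": 400},
    {"score": 40, "junHighC": 34, "areaTotalKm": 200},
]

def rank_features(rows, target):
    """Sort features by absolute correlation with the target column."""
    names = [k for k in rows[0] if k != target]
    ys = [r[target] for r in rows]
    scored = [(name, pearson([r[name] for r in rows], ys)) for name in names]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

for name, r in rank_features(rows, "score"):
    print(f"{name}: {r:+.2f}")
```

In this toy data the June temperature feature ranks first with a strongly negative correlation; whether such a correlation is a meaningful explanation still has to be judged by a human.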
  • 35. 10/08/13 Heiko Paulheim 35 Statistical Data: Examples • Data Set: Mercer Quality of Living – Quality of living in 216 cities worldwide – norm: NYC=100 (value range 23-109) – As of 1999 – http://across.co.nz/qualityofliving.htm • LOD data sets used in the examples: – DBpedia – CIA World Factbook for statistics by country
  • 36. 10/08/13 Heiko Paulheim 36 Statistical Data: Examples • Examples for low quality cities – big hot cities (junHighC >= 27 and areaTotalKm >= 334) – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16) – latitude <= 24 and longitude <= 47 • a very accurate rule • but what's the interpretation? [Image text: “Next Record Studio 2547 miles”]
  • 37. 10/08/13 Heiko Paulheim 37 Statistical Data: Examples
  • 38. 10/08/13 Heiko Paulheim 38 Statistical Data: Examples • Data Set: Transparency International – 177 Countries and a corruption perception indicator (between 1 and 10) – As of 2010 – http://www.transparency.org/cpi2010/results
  • 39. 10/08/13 Heiko Paulheim 39 Statistical Data: Examples • Example rules for countries with low corruption – HDI > 78% • Human Development Index, calculated from life expectancy, education level, economic performance – OECD member states – Foundation place of more than nine organizations – More than ten mountains – More than ten companies with their headquarters in that state, but fewer than two cargo airlines
  • 40. 10/08/13 Heiko Paulheim 40 Statistical Data: Examples • Data Set: Burnout rates – 16 German DAX companies – Absolute and relative numbers – As of 2011 – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  • 41. 10/08/13 Heiko Paulheim 41 Evaluation of Feature Quality • Quality of living dataset [Chart: scale from 1 to 5; series: Correlation, Rule Learning; categories: Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint]
  • 42. 10/08/13 Heiko Paulheim 42 Evaluation of Feature Quality • Corruption dataset [Chart: scale from 1 to 5; series: Correlation, Rule Learning; categories: Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), Joint]
  • 43. 10/08/13 Heiko Paulheim 43 Statistical Data: Examples • Findings for burnout rates – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – German companies are less prone to burnout than international ones • Exception: Frankfurt
  • 44. 10/08/13 Heiko Paulheim 44 Statistical Data: Examples • Data Set: Antidepressant consumption – In European countries – Source: OECD – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  • 45. 10/08/13 Heiko Paulheim 45 Statistical Data: Examples • Findings for antidepressant consumption – Larger countries have higher consumption – Low HDI → high consumption – By geography: • Nordic countries, countries at the Atlantic: high • Mediterranean: medium • Alpine countries: low – High average age → high consumption – High birth rates → high consumption
  • 46. 10/08/13 Heiko Paulheim 46 Statistical Data: Examples • Data Set: Suicide rates – By country – OECD states – As of 2005 – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  • 47. 10/08/13 Heiko Paulheim 47 Statistical Data: Examples • Findings for suicide rates – Democracies have lower suicide rates than other forms of government – High HDI → low suicide rate – High population density → high suicide rate – By geography: • At the sea → low • In the mountains → high – High Gini index → low suicide rate • High Gini index ↔ unequal distribution of wealth – High usage of nuclear power → high suicide rates
  • 48. 10/08/13 Heiko Paulheim 48 Statistical Data: Examples • Data set: sexual activity – Percentage of people having sex weekly – By country – Survey by Durex 2005-2009 – http://chartsbin.com/view/uya
  • 49. 10/08/13 Heiko Paulheim 49 Statistical Data: Examples • Findings on sexual activity – By geography: • High in Europe, low in Asia • Low in island states – By language: • English speaking: low • French speaking: high – Low average age → high activity – High GDP per capita → low activity – High unemployment rate → high activity – High number of ISPs → low activity
  • 50. 10/08/13 Heiko Paulheim 50 Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. http://xkcd.com/552/
  • 51. 10/08/13 Heiko Paulheim 51 RapidMiner Linked Open Data Extension • August 16th, 2013: FeGeLOD celebrates its 2nd birthday • Problems – still no nice UI – special configurations are tricky – difficult to enhance • Decision – Reimplementation on RapidMiner platform – September 13th, 2013: Release of RapidMiner Linked Open Data Extension – Available from RapidMiner marketplace • http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
  • 52. 10/08/13 Heiko Paulheim 52 RapidMiner Linked Open Data Extension • Simple wiring of operators – linkers – generators • Combination with powerful RapidMiner operators
  • 53. 10/08/13 Heiko Paulheim 53 RapidMiner Linked Open Data Extension • Easy SPARQL endpoint definitions • Support of custom SPARQL statements
  • 54. 10/08/13 Heiko Paulheim 54 Challenges and Future Work • SPARQL variants – Some endpoints support special/non-standard SPARQL constructs – COUNT(...) – transitive closure – exploit where applicable • Implementations without SPARQL – Freebase – OpenCyc
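One way to "exploit where applicable" is to keep two code paths: let the endpoint aggregate when it supports COUNT, and otherwise fetch plain bindings and count on the client. The sketch below only builds query strings and counts in memory; it does not contact any endpoint, and the function names are hypothetical rather than part of FeGeLOD or the RapidMiner extension.

```python
from collections import Counter

def count_query(entity_uri):
    """Query for endpoints with aggregate support (COUNT / GROUP BY)."""
    return (
        "SELECT ?p (COUNT(?o) AS ?n) "
        f"WHERE {{ <{entity_uri}> ?p ?o }} "
        "GROUP BY ?p"
    )

def plain_query(entity_uri):
    """Fallback query without aggregates; counting happens client-side."""
    return f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }}"

def count_client_side(bindings):
    """Count objects per property from (property, object) result pairs."""
    return Counter(prop for prop, _ in bindings)

# Illustrative result pairs, as a fallback query might return them.
bindings = [("rdf:type", "A"), ("rdf:type", "B"), ("rdfs:label", "x")]
print(count_query("http://dbpedia.org/resource/Mannheim"))
print(count_client_side(bindings))
```

The fallback transfers every binding over the network, so the aggregate variant is preferable whenever the endpoint accepts it.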
  • 55. 10/08/13 Heiko Paulheim 55 Challenges and Future Work • Linking is still challenging – URI patterns are not flexible – Search by label is time-consuming – Services like DBpedia Lookup are scarce • Limitations of completely unsupervised linking – e.g., Hurricanes – how to use headlines/attribute names?
  • 56. 10/08/13 Heiko Paulheim 56 Challenges and Future Work • Linking as optimization problem – find candidates for all entities, e.g., by DBpedia lookup – find a selection of candidates that are most similar to each other • e.g., all of them are U.S. cities – some experiments with types and categories • problem: not complete – some problems cannot be addressed (e.g.: Hurricanes) • Alternatives: – semi-supervised linking – user provides some example links – active learning
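The optimization view can be sketched as a search over candidate combinations that maximizes pairwise type similarity. Everything concrete here is an assumption for illustration: the `ex:` candidate URIs and their type sets are invented, Jaccard similarity is just one possible measure, and the exhaustive search is exponential in the number of entities, so it only works for toy inputs.

```python
from itertools import product

# Invented candidates per label, each with an invented set of types.
CANDIDATES = {
    "Paris": {"ex:Paris_city": {"City", "Place"},
              "ex:Paris_person": {"Person"}},
    "Darmstadt": {"ex:Darmstadt_city": {"City", "Place"}},
    "Mannheim": {"ex:Mannheim_city": {"City", "Place"},
                 "ex:Mannheim_band": {"Band"}},
}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def coherence(type_sets):
    """Sum of pairwise Jaccard similarities over the selected candidates."""
    return sum(
        jaccard(type_sets[i], type_sets[j])
        for i in range(len(type_sets))
        for j in range(i + 1, len(type_sets))
    )

def link(candidates):
    """Pick one candidate per label so that the selection is most coherent."""
    labels = list(candidates)
    best, best_score = None, -1.0
    for combo in product(*(candidates[label].items() for label in labels)):
        score = coherence([types for _, types in combo])
        if score > best_score:
            best, best_score = combo, score
    return {label: uri for label, (uri, _) in zip(labels, best)}

print(link(CANDIDATES))
```

As the slide notes, this depends on type information being complete; a candidate with missing types contributes no similarity and may lose against a wrong but well-typed one.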
  • 57. 10/08/13 Heiko Paulheim 57 Challenges and Future Work • Exploiting semantics for feature selection • Given two features: – f1: type(RoadsInAlaska) – f2: type(Road) • and the schema definition RoadsInAlaska rdfs:subClassOf Road • Exploit that information for feature selection – e.g., gain(f1) ≈ gain(f2), f1 < f2 (f1 is more specific) → remove f1
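A minimal sketch of this pruning rule, assuming information gain as the relevance measure and an absolute tolerance `tol` for "≈" (both the tolerance value and the helper names are assumptions, and the data is a made-up toy sample):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(v) / n) * log2(labels.count(v) / n) for v in set(labels)
    )

def gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def prune(features, labels, subclass_of, tol=0.05):
    """Drop a specific feature if its gain is within tol of the gain of
    its more general (superclass) feature."""
    kept = dict(features)
    for specific, general in subclass_of:
        if specific in kept and general in features:
            diff = abs(gain(features[specific], labels)
                       - gain(features[general], labels))
            if diff <= tol:
                del kept[specific]
    return kept

# Toy sample in which every road happens to be in Alaska.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "type(RoadsInAlaska)": [1, 1, 1, 0, 0, 0],
    "type(Road)": [1, 1, 1, 0, 0, 0],
}
subclass_of = [("type(RoadsInAlaska)", "type(Road)")]
print(list(prune(features, labels, subclass_of)))
```

Keeping the general feature is what lets a model transfer to roads outside the training region, in line with the tweet classification example.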
  • 58. 10/08/13 Heiko Paulheim 58 Challenges and Future Work • Incompleteness of LOD – e.g., type information in DBpedia – may lead to findings such as • if a city is of type Place, the quality of living is high – possible remedy: autocomplete on the dataset (e.g., Paulheim/Bizer 2013) • Biases in LOD – e.g., DBpedia has a bias towards western culture – may lead to findings such as • if many records have been made in a city, the quality of living is high
  • 59. 10/08/13 Heiko Paulheim 59 Challenges and Future Work • Features not used for scalability reasons: – features for single entities • e.g., “Roman Polanski directorOf X” – features more than one hop away • e.g., “Cities with a university which has a computer science department” – some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990” • but subject to YAGO's selection bias • Approaches are required to use such features – which respect scalability – “generate first, filter later” is not the best solution • e.g., “Cities with at least one of ArtSchoolsInParis” – on-the-fly filtering may be more suitable • e.g., sampling
  • 60. 10/08/13 Heiko Paulheim 60 Challenges and Future Work • Automatically exploit data sources with non-simple structures – real example from CORDIS dataset: EU18931 a Funding . EU18931 has-grant-value [ has-amount 1300000 ; has-unit-of-measure EUR ] . • Support geo/temporal features – e.g., Data Cubes – e.g., Linked Geo Data • Construct complex features (in a scalable way!) – e.g., cinemas per inhabitant
  • 61. 10/08/13 Heiko Paulheim 61 Wrap-up • Linked Data is useful as background knowledge – especially on problems which have little knowledge in themselves • Unsupervised methods – avoid biases and work without knowledge about LOD – but: scalability and generality problems • RapidMiner LOD extension – a constantly growing toolkit
  • 62. 10/08/13 Heiko Paulheim 62 Credits & Thanks • Past contributors of FeGeLOD: – Johannes Fürnkranz – Raad Bahmani – Alexander Gabriel – Simon Holthausen • Current team of RapidMiner Linked Open Data Extension: – Chris Bizer – Petar Ristoski – Evgeny Mitichkin
  • 63. 10/08/13 Heiko Paulheim 63 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim