This document summarizes a presentation given by Prof. Dr. Christian Bizer on global data integration and mining. It discusses the topology of the web of data, how global data integration can be achieved through a pay-as-you-go approach of publishing identity and vocabulary links, and how this enables global data mining. Examples of linked data uptake in different domains like government and libraries are also provided.
The document describes DBpedia, a project that extracts structured data from Wikipedia and makes it available on the Web. DBpedia has extracted over 2.6 million entities from Wikipedia and defined web-dereferenceable identifiers for each. As DBpedia covers many domains, other data sources on the Web have begun linking to DBpedia resources, making DBpedia a central hub. This has resulted in a Web of over 4.7 billion interlinked pieces of data across various domains.
This document discusses challenges and opportunities around discovering and using open government data. It notes that simply publishing data as linked data is not enough, and that metadata standards and presentation methods are needed to aid discovery and use. It highlights work done by Tetherless World Constellation to apply metadata standards to describe government datasets and create an aggregated catalog of over 1 million datasets. The use of schema.org and other semantic markup is discussed to enable search engines to more easily parse and index government data catalogs. Federation of catalogs using APIs and standards like DCAT and CKAN is also covered. The document emphasizes that exposing metadata is key to getting government data discovered.
This presentation provides an overview of Linked Data, its underlying principles, and its applications. It further discusses benefits and business models for enterprises.
Held at the Tiroler IT Tag 2010
This document summarizes Marin Dimitrov's presentation on linked data management at the 3rd GATE training course in Montreal in August 2010. The presentation covered linked data principles, key vocabularies and datasets, open government data initiatives, and tools for working with linked data. Some open issues discussed were the diversity of linked data schemas, data quality issues, reliability of endpoints, licensing concerns, and challenges of querying distributed data.
This document discusses establishing an Open Knowledge Foundation (OKF) chapter in Korea. It provides background on OKF and its goals of promoting open data and knowledge. It outlines reasons for starting an OKF Korea, including learning best practices from other countries, making Korean open data more accessible worldwide, and building better communities around open data. Plans are described to collaborate with existing groups doing related work and to launch open data projects and events. The vision is for OKF Korea to help advance the quality, accessibility and use of open data in Korea.
Open Government Data on the Web - A Semantic Approach (Peter Krantz)
(upload with permission from Armand Brahaj)
Initiatives to open up government data have been gaining interest. While this offers immense benefits for transparency, the data are frequently published in heterogeneous formats and lack the clear semantics needed to say what they describe. They are also often presented in ways that are not readily understandable to the broad range of user communities that rely on them to make informed decisions.
1) Linked data is a set of best practices for publishing structured data on the web so that both humans and machines can access and link related data across different sources. It realizes Tim Berners-Lee's vision of a Semantic Web.
2) The key principles of linked data are using URIs to identify things, providing HTTP URIs so that URIs can be looked up, and including links to other URIs to allow for discovery of related data on the web.
3) By following these principles, data sources on the web have been connected into a large Web of Data, with over 31 billion RDF triples organized into different domains such as media, geography, life sciences, and libraries. This enables new kinds of data applications.
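The three principles above can be sketched with plain Python tuples standing in for RDF triples. The specific URIs, triples, and the tiny `describe` helper below are illustrative assumptions made for this sketch, not real dataset contents:

```python
# Each statement is an RDF-style triple: (subject URI, predicate URI, object).
# The triples are made up for this sketch.
dbpedia = [
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    # A link into another dataset enables discovery of related data
    # (the "include links to other URIs" principle).
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://sws.geonames.org/2950159/"),
]

geonames = [
    ("http://sws.geonames.org/2950159/",
     "http://www.geonames.org/ontology#population", "3426354"),
]

def describe(uri, *graphs):
    """Collect every triple about a URI, following owl:sameAs links."""
    triples = [t for g in graphs for t in g if t[0] == uri]
    for s, p, o in list(triples):
        if p.endswith("#sameAs"):
            triples += describe(o, *graphs)
    return triples

# Looking up one HTTP URI yields its local facts plus, via the link,
# facts published by an entirely different source.
print(describe("http://dbpedia.org/resource/Berlin", dbpedia, geonames))
```

On the real web the lookup would be an HTTP dereference with content negotiation rather than a scan of in-memory lists, but the discovery-by-following-links pattern is the same.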
This is one of a series of presentations I gave during a recent trip to the United States. I will make them all public, although the content does not vary much between some of them.
Charleston 2012 - The Future of Serials in a Linked Data World (ProQuest)
The educational objective of this session is to review today’s MARC-based environment in which the serial record predominates, and compare that with what might be possible in a future world of linked data. The session will inspire conversation and reflection on a number of questions. What will a world of statement-based rather than record-based metadata look like? What will a new environment mean for library systems, workflows, and information dissemination?
DataCite – Bridging the gap and helping to find, access and reuse data – Herb... (OpenAIRE)
OpenAIRE Interoperability Workshop (8 Feb. 2013).
DataCite – Bridging the gap and helping to find, access and reuse data – Herbert Gruttemeier, INIST-CNRS
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
This document summarizes recent approaches to web data management including Fusion Tables, XML, and Linked Open Data (LOD). It discusses properties of web data like lack of schema, volatility, and scale. LOD uses RDF, global identifiers (URIs), and data links to query and integrate data from multiple sources while maintaining source autonomy. The LOD cloud has grown rapidly, currently consisting of over 3000 datasets with more than 84 billion triples.
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS (ijdms)
ABSTRACT
The amount of data stored in IoT databases increases as IoT applications extend throughout smart city appliances, industry, and agriculture. Contemporary database systems must process huge amounts of sensor and actuator data in real time or interactively. Facing this first wave of the IoT revolution, database vendors struggle daily to gain market share, develop new capabilities, and overcome the disadvantages of previous releases while providing features for the IoT.
There are two popular database types, relational database management systems and NoSQL databases, with NoSQL gaining ground for IoT data storage. This paper examines both. Focusing on open-source databases, the authors experiment on IoT data sets and address the question of which type performs better. It is a comparative study of the performance of commonly used open-source databases, presenting results for the NoSQL database MongoDB and the SQL databases MySQL and PostgreSQL.
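The paper's actual benchmarks run against MongoDB, MySQL, and PostgreSQL. As a minimal, self-contained stand-in for the comparison's shape, the sketch below times inserting synthetic sensor readings into an in-memory SQLite table (relational, fixed schema) versus a plain Python dict (schemaless key-value); the workload and numbers are invented for illustration and say nothing about the paper's results:

```python
import sqlite3
import time

# Synthetic sensor readings standing in for an IoT workload (illustrative only).
readings = [(i, f"sensor-{i % 10}", 20.0 + i % 5) for i in range(10_000)]

# Relational store: SQLite in memory, explicit schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)")
t0 = time.perf_counter()
db.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
db.commit()
sql_secs = time.perf_counter() - t0

# Schemaless key-value store: a plain dict keyed by reading id.
kv = {}
t0 = time.perf_counter()
for rid, sensor, value in readings:
    kv[rid] = {"sensor": sensor, "value": value}
kv_secs = time.perf_counter() - t0

print(f"SQLite insert: {sql_secs:.4f}s, dict insert: {kv_secs:.4f}s")
```

A real evaluation would also measure queries, concurrent writers, and durability, which is where the trade-offs between the two families actually show up.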
This document discusses data citation mechanisms and services for primary biodiversity data. It outlines the need for data citation to provide recognition for data producers and publishers. An ideal data citation framework would address social, technical, and policy issues to incentivize all stakeholders. Core technical components would include persistent identifiers, a data citation mechanism, and a data usage index. The document reviews the history calling for data citation standards and proposes requirements for an effective data citation model, including attributing roles across data production and publication. It also examines challenges in developing data citation practices.
This document discusses current issues regarding semantic web technologies in Korea. It provides an overview of presentations and trend reports authored by Hanmin Jung from KISTI on topics related to semantic web from 2009 to 2010. It also describes KISTI's work modeling ontologies, the role of ontologies compared to legacy databases, and KISTI's involvement in linking open government and public data to expand the semantic web.
The document summarizes a talk given by Dr. Johannes Keizer on the CIARD (Coherence in Information for Agricultural Research for development) initiative and a global infrastructure for linked open data (LOD). The CIARD initiative aims to provide open access to agricultural research by promoting standards and sharing information. It involves institutions contributing their research outputs through the CIARD RING and adopting standards. The infrastructure proposed includes distributed repositories linked through vocabularies and LOD. Tools are being developed to generate LOD and link datasets through shared concepts.
This document summarizes work by the RDA/WDS Publishing Data Interest Group to develop a conceptual and practical framework for linking data to literature. It describes the goals of linking research data and publications to increase discoverability, enable proper data reuse, and support attribution. It then outlines a proposed "multi-hub model" infrastructure as an inclusive, standards-based solution. Two key outputs are presented: 1) A prototype "Data-Literature Interlinking" service that has generated over 2 million links, and 2) The Scholix interoperability framework and guidelines for exchanging link data between sources in a standardized way. Participation by sharing link data or helping expand the Scholix standards is encouraged.
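A Scholix-style exchange record is essentially a small structured document describing one data-literature link. The sketch below shows the general shape of such a link package; the field names only approximate the published Scholix schema, and the identifiers and provider name are made up for the example:

```python
import json

# An illustrative Scholix-style link package connecting a dataset to the
# article that references it. Field names approximate the Scholix schema;
# the DOIs and the "ExampleHub" provider are hypothetical.
link = {
    "LinkPublicationDate": "2017-03-01",
    "LinkProvider": [{"Name": "ExampleHub"}],
    "RelationshipType": {"Name": "IsReferencedBy"},
    "Source": {
        "Identifier": {"ID": "10.1234/dataset.5678", "IDScheme": "DOI"},
        "Type": "dataset",
    },
    "Target": {
        "Identifier": {"ID": "10.1234/article.9012", "IDScheme": "DOI"},
        "Type": "literature",
    },
}

print(json.dumps(link, indent=2))
```

Because every hub exchanges the same minimal package, a service like the Data-Literature Interlinking prototype can aggregate links from many providers without bilateral agreements.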
Linked Data (1st Linked Data Meetup Malmö) - Anja Jentzsch
This document discusses Linked Data and outlines its key principles and benefits. It describes how Linked Data extends the traditional web by creating a single global data space using RDF to publish structured data on the web and by setting links between data items from different sources. The document outlines the growth of Linked Data on the web, with over 31 billion triples from 295 datasets as of 2011. It provides examples of large Linked Data sources like DBpedia and discusses best practices for publishing, consuming, and working with Linked Data.
Linked Data provides a standardized framework for publishing structured data on the web by linking data instead of documents. It uses URIs, HTTP, and RDF to link related data across different sources to create a global data space without silos. EnAKTing is a research project focused on building ontologies from large-scale user participation, querying linked data at web-scale, and visualizing the massive amounts of interconnected data. Some of its applications include services for discovering backlinks, geographical resources, and dataset equivalences in the Web of Data.
Beyond research data infrastructures: exploiting artificial & crowd intellige... (Stefan Dietze)
This document discusses using artificial and crowd intelligence to build research knowledge graphs from online data sources. It describes harvesting metadata about research datasets from open data portals and web pages marked up with schemas like RDFa. Machine learning techniques are used to clean and fuse the harvested metadata into a knowledge graph. The knowledge graph can be queried to provide information about research datasets and related entities. Additional methods are discussed for linking mentions of datasets in scholarly publications to real-world datasets.
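Fusing harvested metadata means deciding when two records from different portals describe the same dataset. The real pipeline uses machine learning; as a toy stand-in, the sketch below merges records whose titles are near-duplicates by string similarity (the records, threshold, and `fuse` helper are all invented for illustration):

```python
from difflib import SequenceMatcher

# Harvested dataset records from two hypothetical portals.
harvested = [
    {"title": "Global Temperature Records 1880-2016", "source": "portal-a"},
    {"title": "Global temperature records, 1880-2016", "source": "portal-b"},
    {"title": "Urban Air Quality Measurements", "source": "portal-a"},
]

def fuse(records, threshold=0.9):
    """Group records whose normalized titles are nearly identical."""
    merged = []
    for rec in records:
        for group in merged:
            ratio = SequenceMatcher(
                None, rec["title"].lower(), group["title"].lower()).ratio()
            if ratio >= threshold:
                group["sources"].append(rec["source"])
                break
        else:
            merged.append({"title": rec["title"], "sources": [rec["source"]]})
    return merged

print(fuse(harvested))
```

The learned approach replaces the hand-set threshold with a classifier trained over many features (titles, publishers, identifiers), but the clustering step it feeds is the same in spirit.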
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
From Open Linked Data towards an Ecosystem of Interlinked Knowledge - Sören Auer
This document discusses the development of linked open data and its potential to create an ecosystem of interlinked knowledge. It outlines achievements in extending the web with structured data and the growth of an open research community. However, it also identifies challenges regarding coherence, quality, performance and usability that must be addressed for linked data to reach its full potential as a global platform for knowledge integration. The document proposes that addressing these issues could ultimately lead to an ecosystem of interlinked knowledge on the semantic web.
Most research in Russian universities is performed by small groups of software enthusiasts, though interest has grown since 2013. Popular areas include natural language processing, ontology engineering, and linked data. Many projects are run by students and young researchers. Examples from NRU ITMO include linked learning projects, an ontology visualization tool, IoT projects, and open government data integration. Potential future areas include open government data, linked data in education, and digital libraries. The annual Russian Conference on Knowledge Engineering and Semantic Web grows each year and aims to include more international participation.
The document summarizes a presentation on semantic web activities in Russia. It discusses key players working in semantics and linked open data in Russia, including the W3C Russian office hosted by HSE. Products and projects presented include Eventos, OntosLive, OntoQUAD, and RIA Novosti's use of linked open data. Current activities focus on transforming data sources into linked open data, text understanding through NLP, and the RDFace editor. The presentation envisions expanding linked open data in Russia through applications on smart devices, collective intelligence, and improving the visibility of Russian universities and science.
STI International is a non-profit organization that aims to address challenges of communication and collaboration at large scales through semantic technologies. It has 11 partner organizations, 15 members, and several fellows. STI holds biennial Semantic Summits to discuss strategic issues and directions for semantic technology. The 2013 summit agenda shows sessions on topics like the semantic web in Russia, data science curriculum, and future funding for semantics in Europe.
This document outlines funding opportunities for various ICT-related challenges, research areas, and activities under Horizon2020. It provides details on funding levels, budgets, objectives, and types of projects (e.g. collaborative projects, coordination and support actions) for areas such as future internet, big data, language technologies, internet of things, and more. The overall goal is to support innovation and advancements in key ICT domains through research and development projects.
The document outlines research topics at the Systems Biomedical Informatics National Core Research Center (SBI-NCRC) in South Korea, including:
1. Activity recognition for personalized healthcare using sensors to monitor patients and detect risky situations.
2. A healthcare service framework for continuous context monitoring using smartphones to track things like diet, activity levels, and vital signs for conditions like obesity, elderly care, and cardiac issues.
3. Text mining of web content and personalized analysis of biological and medical data.
The SBI-NCRC conducts interdisciplinary research with various universities and organizations to develop digital health avatars and personalized medicine through integration of clinical and biological information using information technology.
The document discusses the DIADEM data extraction methodology. It describes DIADEM as a domain-centric intelligent automated methodology for extracting structured data from unstructured documents. The methodology was developed by a research group at the University of Oxford and Vienna University of Technology led by Georg Gottlob and Tim Furche.
The document describes OntoQuad, a native RDF database management system for semantic web data. It provides benchmarks showing OntoQuad outperforming other RDF stores on query speed for the Berlin SPARQL Benchmark. It also describes running OntoQuad on various platforms including Android and Raspberry Pi, and examples of semantic datasets powered by OntoQuad.
The document provides an overview of KAIST CSE (Computer Science and Engineering department). It discusses the establishment and history of KAIST CSE, including its merger with other departments. It outlines the department's core values and goals of becoming a top 10 computer science department globally and a competitive, specialized, and evolving department. It also provides statistics on faculty, students, research areas, rankings, and centers within KAIST CSE.
The document discusses semantic technologies and their tipping points. It provides examples of past technologies that reached a tipping point such as databases, client-server computing, the web, and cloud computing. It examines the potential tipping points for semantic technologies and the semantic web, noting they provide higher-order functionality and productivity but have not reached a tipping point yet. Finally, it addresses big data challenges around scale and integration and the role of semantic technologies in providing meaningful solutions.
The document discusses the dynamic web and current approaches for web-based communication and interaction. It describes how events and actions are currently handled through technologies like complex event processing. It proposes a more reactive approach using event-condition-action rules and integrating this with semantic technologies. Finally, it presents a "layer cake" model for the dynamic web with different levels of abstraction.
The document discusses using usage analysis to improve ontology engineering. It describes analyzing query logs over datasets like DBpedia to identify frequently queried triples and patterns. This can reveal missing or inconsistent data and suggest new links between entities. The analysis helps increase data quality and acquire new knowledge that benefits both the dataset and Web of Data as a whole. While complete automation may not be needed, supporting usage analysis and endpoint access allows publishers to play a role in maintaining datasets and the Web of Data.
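The core of such usage analysis is counting which triple patterns appear most often in an endpoint's query log. A minimal sketch, assuming an already-parsed log reduced to the predicate of each queried pattern (the log entries below are invented stand-ins, not real DBpedia logs):

```python
from collections import Counter

# Predicates extracted from a (hypothetical) SPARQL query log.
query_log = [
    "dbo:birthPlace", "dbo:birthPlace", "rdfs:label",
    "dbo:birthPlace", "owl:sameAs", "rdfs:label",
]

predicate_counts = Counter(query_log)

# Frequently queried predicates show what users expect the dataset to hold;
# a predicate queried often but sparsely populated hints at missing data.
for predicate, count in predicate_counts.most_common():
    print(predicate, count)
```

In practice the interesting signal comes from joining these counts against the dataset itself, to find the gap between what is asked for and what is actually present.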
The document proposes applying Linked Data principles to services and data streams. It suggests representing service inputs and outputs as Linked Data by encoding parameters in URIs and returning RDF data. For data streams, it recommends using HTTP as an access protocol and streaming RDF triples over an open HTTP connection. This would allow services and streams to be easily integrated and linked with other Linked Data on the web.
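Encoding service inputs in the URI means an invocation is itself a dereferenceable, linkable resource. A minimal sketch of that idea, using a hypothetical geocoding service and made-up parameter names:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def service_uri(base, **params):
    """Build a dereferenceable URI with the inputs in the query string."""
    return f"{base}?{urlencode(sorted(params.items()))}"

# Hypothetical service and parameters, for illustration only.
uri = service_uri("http://example.org/geocode", place="Berlin", lang="en")
print(uri)

# Any client (or another Linked Data source) can recover the inputs from
# the URI alone, which is what makes the invocation linkable.
recovered = parse_qs(urlparse(uri).query)
print(recovered)
```

Under the proposal, dereferencing such a URI would return RDF describing the result, so service outputs join the Web of Data like any other resource; for streams, the same triples would simply keep arriving over an open HTTP connection.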
The document discusses services and the web of data from an engineering perspective. It proposes that as linked data applications increase in complexity, there will need to be increased reuse of pre-existing solutions and components offered as services. Problem-solving methods research focused on decoupling problem-solving knowledge from domains to enable reuse. Infrastructure is needed to support systematically sharing and finding reusable functionality, including through the use of semantic technologies and problem-solving methods. Challenges include balancing overhead and performance with reuse and genericity.
The document discusses data integration challenges at large Fortune 100 companies. It notes that these companies typically have around 10,000 information systems and databases, with hundreds added each year. Data integration accounts for around 40% of software project costs due to the need to combine data from thousands of source databases across various business units. The conclusion is that every large organization depends critically on effective data and data integration to support their complex, interconnected information ecosystems.
1. The future of Semantic Technologies lies not in the current Semantic Web technology stack but in the underlying principles, such as making domain knowledge editable, shareable and linkable.
2. There are still many exciting topics for the future, such as pushing the boundaries of complex information processing in databases and using web-like data integration to tackle very large-scale data integration problems for entire enterprises or scientific fields.
This document discusses using visual analytics techniques on linked data. It begins by motivating the combination of these fields by noting that while linked data services excel at data access, they lack support for complex analytical scenarios. It then provides examples of how visual analytics has been used in other domains like analyzing financial data, patent trends, and simulating biological processes. Finally, it outlines how visual analytics could be applied to linked data, including aggregating and filtering data, implementing analytical workflows, and using visualization techniques to enable discovery and presentation of insights to domain experts. The goal would be supporting collaborative analytical tasks on a global scale.
1) Producing life sciences linked open data presents challenges as biologists want to publish and control their data but providing query and analysis services is expensive. They need technical assistance and funding support.
2) Consuming linked data in life sciences means connecting data to existing standards like pathways and proteins. Data analysis, mining, crawling and reasoning services are needed but expensive for individual database owners.
3) Scalability issues arise when reasoning over complex ontologies like BioPAX Level 3 with large datasets, as state-of-the-art reasoners cannot handle inconsistencies or provide query endpoints for such data.
The document discusses building semantic web applications using linked data. It describes typical applications, current approaches to supporting applications over linked data using representative architectures and crawling patterns. The document argues that semantics can help by providing SDKs underpinned by datasets and ontologies, supporting collaborative development, and using common front ends and application descriptions. Finally, it presents MicroWSMO and WSMO-Lite as ways to describe minimal service models and service lifecycles for semantic web applications.
Shortipedia is a website that collects assertions from various sources on the semantic web and displays them in an easy to understand format. It finds information about requested topics from Wikipedia, SameAs.org, and Sindice. Users can bind linked entities, see data from related entities, add their own assertions, and integrate additional data. Key lessons learned include that semantic web data can be noisy, hard to understand, and labels are unreliable. Representing diverse knowledge and deciding semantics is also challenging.
STI Summit 2011 - Global data integration and global data mining
1. STI Summit
July 6th, 2011, Riga, Latvia
Global Data Integration
and Global Data Mining
Prof. Dr. Christian Bizer
Freie Universität Berlin
Germany
Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
2. Outline
1. Topology of the Web of Data
What data is out there?
2. Global Data Integration
How to split the integration effort
3. Global Data Mining
The logical next step
3. Linked Data Deployment on the Web
Year   Datasets   Triples          Growth
2007   12         500,000,000
2008   45         2,000,000,000    300%
2009   95         6,726,000,000    236%
2010   203        26,930,509,703   300%
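The growth column is the year-over-year increase, (new − old) / old. A small sketch (our own illustration) that recomputes it from the triple counts above:

```python
# Triple counts per year, taken from the table above
triples = {2007: 500_000_000, 2008: 2_000_000_000,
           2009: 6_726_000_000, 2010: 26_930_509_703}

def growth(old, new):
    """Year-over-year growth in percent, rounded to a whole percent."""
    return round((new - old) / old * 100)

years = sorted(triples)
for prev, year in zip(years, years[1:]):
    print(year, f"{growth(triples[prev], triples[year])}%")
# prints: 2008 300% / 2009 236% / 2010 300%
```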
4. Uptake in the Government Domain
The EU is starting to publish Linked Data (LOD2, LATC)
Various other national efforts
W3C eGovernment Interest Group
5. Uptake in the Libraries Community
Institutions publishing Linked Data
Library of Congress (subject headings)
German National Library (PND dataset and subject headings)
Swedish National Library (Libris catalog)
Hungarian National Library (OPAC and Digital Library)
Europeana project just released data about 4 million artifacts
Growth of Library Linked Data (2009-2010): 1000%
W3C Library Linked Data Incubator Group
Goals:
1. Integrate Library Catalogs on a global scale.
2. Interconnect resources between repositories
(by topic, by location, by historical period, by ...).
6. LOD data set statistics as of November 2010
Domain          Data Sets   Triples          Percent   RDF Links     Percent
Cross-domain    20          1,999,085,950    7.42      29,105,638    7.36
Geographic      16          5,904,980,833    21.93     16,589,086    4.19
Government      25          11,613,525,437   43.12     17,658,869    4.46
Media           26          2,453,898,811    9.11      50,374,304    12.74
Libraries       67          2,237,435,732    8.31      77,951,898    19.71
Life sciences   42          2,664,119,184    9.89      200,417,873   50.67
User Content    7           57,463,756       0.21      3,402,228     0.86
Total           203         26,930,509,703             395,499,896
LOD Cloud Data Catalog on CKAN
http://www.ckan.net/group/lodcloud
More statistics
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
7. What are the big players doing?
8. Structured Data becomes a SEO Topic
Data Snippets
Query Answer
9. Result: Further growth …
Usage of RDFa has increased 510%
between March 2009 and October 2010
430 million webpages contain RDFa
Source: Yahoo
http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
10. The Structural Continuum
The Web of Data is interwoven with the classic Web.
Unstructured text: HTML
Structured data:
RDFa embedded in HTML (Open Graph)
Microdata embedded in HTML (Schema.org)
Microformats embedded in HTML
Linked data: RDF/XML
11. Topology of the Web of Data
12. How to get the data?
Download the Billion Triples Challenge Dataset
2 billion triples (20GB gzipped)
crawled from the public Web of Linked Data in May/June 2011
http://challenge.semanticweb.org/
Download the Sindice Dump
12 billion triples (164 GB gzipped, ~1.16 TB uncompressed)
crawled from the public Web of Linked Data and
includes RDFa, Microformat, and wrapped API data
http://data.sindice.com/trec2011/download.html
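Dumps of this size are best processed as a stream rather than loaded into memory. A minimal Python sketch (standard library only; the line-splitting is a simplification — a full N-Triples parser would also handle escapes and comments inside literals):

```python
import gzip
from collections import Counter

def iter_triples(path):
    """Stream (subject, predicate, object) strings from a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Naive split: subjects and predicates contain no spaces;
            # the object is everything up to the trailing " ."
            s, p, o = line.rstrip(" .").split(" ", 2)
            yield s, p, o

def predicate_counts(path):
    """Count triples per predicate without holding the dump in memory."""
    counts = Counter()
    for _s, p, _o in iter_triples(path):
        counts[p] += 1
    return counts
```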
13. 2. Global Data Integration
Applications hate heterogeneity!
The wild wild west vs. my little world
14. The Dataspace Vision
Alternative to classic data integration systems in
order to cope with the growing number of data sources.
Properties of dataspaces:
no upfront investment into a global schema
rely on pay-as-you-go data integration
give best-effort answers to queries
Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces –
A New Abstraction for Information Management. SIGMOD Record, 2005.
Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford
to Pay As You Go. CIDR 2007.
15. Linked Data relies on Pay-as-You-Go Idea
for Identity Management
for Schema/Vocabulary Management
16. Publish Identity Links on the Web
Identity Link
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
owl:sameAs
<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
You publish links pointing at other data sources.
Somebody else publishes links pointing at your data source.
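On the consuming side, owl:sameAs links form an equivalence relation: all URIs connected by such links denote the same real-world entity. A minimal union-find sketch (our own illustration, standard library only; the two URIs are taken from the example above, not from any toolkit's API):

```python
from collections import defaultdict

def sameas_clusters(links):
    """Group URIs into identity clusters given owl:sameAs pairs (union-find)."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two components

    clusters = defaultdict(set)
    for u in list(parent):
        clusters[find(u)].add(u)
    return list(clusters.values())

links = [("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
          "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer")]
print(sameas_clusters(links))  # one cluster containing both identifiers
```

Note that this treats owl:sameAs as symmetric and transitive, which is exactly what makes low-quality identity links risky: one wrong link merges two whole clusters.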
17. Effort Distribution between Publisher and Consumer
Effort distribution spectrum: either the consumer data-mines
identity links, or publishers or third parties provide identity links.
18. Vocabularies on the Web of Data
Everyone can use whatever vocabularies she likes
to publish data on the Web.
Or invest effort and reuse common vocabularies:
Friend-of-a-Friend for describing people and their social network
SIOC for describing forums and blogs
SKOS for representing topic taxonomies
Organization Ontology for describing the structure of organizations
GoodRelations provides terms for describing products and business entities
Music Ontology for describing artists, albums, and performances
Review Vocabulary provides terms for representing reviews
Many Linked Data sources use a mixture of common and
proprietary vocabulary terms.
19. Publish Vocabulary Links on the Web
Vocabulary Link
<http://xmlns.com/foaf/0.1/Person>
owl:equivalentClass
<http://dbpedia.org/ontology/Person> .
Simple Mappings: RDFS, OWL
rdfs:subClassOf, rdfs:subPropertyOf
owl:equivalentClass, owl:equivalentProperty
Complex Mappings: R2R
provides value transformation functions
and structural transformations
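Applying simple mappings on the consumer side amounts to rewriting terms of incoming triples onto a target vocabulary. A sketch (our own illustration, standard library only; the foaf:Person entry encodes the owl:equivalentClass example above, the dc:title entry is hypothetical). Strictly, plain replacement is only sound for equivalence mappings — rdfs:subClassOf/subPropertyOf only license adding the broader term, which this sketch glosses over:

```python
# Source-to-target term rewrites derived from simple RDFS/OWL mappings.
MAPPINGS = {
    "http://xmlns.com/foaf/0.1/Person": "http://dbpedia.org/ontology/Person",
    "http://purl.org/dc/elements/1.1/title": "http://purl.org/dc/terms/title",
}

def normalize(triples, mappings=MAPPINGS):
    """Rewrite each term of every (s, p, o) triple onto the target vocabulary."""
    for triple in triples:
        yield tuple(mappings.get(term, term) for term in triple)

data = [("http://example.org/alice",
         "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
         "http://xmlns.com/foaf/0.1/Person")]
# the foaf:Person object is rewritten to the DBpedia ontology class
print(list(normalize(data)))
```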
20. Deployment of Vocabulary Links
Source: Linked Open Vocabularies,
http://labs.mondeca.com/dataset/lov
21. Effort Distribution between Publisher and Consumer
Effort distribution spectrum: the consumer defines or data-mines
mappings; the publisher reuses vocabularies; the publisher or a
third party publishes mappings.
22. Somebody-Pays-As-You-Go
The overall data integration effort is split between the data
publisher, the data consumer, and third parties; the total
integration effort is fixed.
Data Publisher
publishes data as RDF
sets identity links
reuses terms or publishes mappings
Third Parties
set identity links pointing at your data
publish mappings to the Web
Data Consumer
has to do the rest,
using record linkage and schema matching techniques
23. Research Directions
1. More research on pay-as-you-go data integration is needed.
2. More research on data mining mappings and
identity resolution heuristics is needed.
Identity links make it easier to mine vocabulary links.
Vocabulary links make it easier to mine identity links.
3. More research on SPAM detection and data quality
assessment is needed.
24. LDIF – Linked Data Integration Framework
Combines vocabulary normalization and identity resolution
Currently only an in-memory implementation
Next release: Hadoop-based implementation
http://www4.wiwiss.fu-berlin.de/bizer/ldif/
(Pipeline: normalize vocabularies, then identity resolution)
25. What can we do afterwards …
… build better entity search engines
26. 3. Global Data Mining
28. Think about interesting questions …
… that you can answer based on the Web of Data
… that require
aggregation
summarization
classification
association rule mining
… combined with
text mining
sentiment analysis
29. Everybody has the tools to find the answers
30. Research Directions
1. More research on data space profiling is needed.
2. More research on global data mining is needed.
Google, Yahoo, Microsoft, Facebook will get there soon.
31. Semantic Web Challenge
Submission Statistics
Year Open Track Billion Triple Track
2008 13 9
2009 16 3
2010 14 4
Do something interesting with the Billion Triple Data
and submit your results to the challenge by October 1st, then
present your results at the 10th International Semantic Web Conference
(ISWC2011), October 2011, Koblenz, Germany
32. Conclusions
The Web of Data is there
Linked Data, Microdata, RDFa, Microformats
Upcoming research topics
pay-as-you-go data integration
mapping discovery, schema clustering
identity resolution heuristics discovery
probabilistic data integration
data quality assessment
data space profiling
global data mining
33. Thanks!
References
Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
Data Space. http://linkeddatabook.com/
Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far
http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf