STI Summit 2011 - Global data integration and global data mining
Presentation Transcript

  • STI Summit, July 6th, 2011, Riga, Latvia. Global Data Integration and Global Data Mining. Prof. Dr. Christian Bizer, Freie Universität Berlin, Germany. Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
  • Outline
    1. Topology of the Web of Data: What data is out there?
    2. Global Data Integration: How to split the integration effort
    3. Global Data Mining: The logical next step
  • Linked Data Deployment on the Web
    Year   Datasets   Triples          Growth
    2007   12         500,000,000
    2008   45         2,000,000,000    300%
    2009   95         6,726,000,000    236%
    2010   203        26,930,509,703   300%
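The growth column can be rechecked from the triple counts. The following is a minimal sketch that assumes "Growth" means the percentage increase over the previous year, rounded to a whole percent; the function name `yearly_growth` is made up for illustration.

```python
# Triple counts copied from the slide's table.
triples_by_year = {
    2007: 500_000_000,
    2008: 2_000_000_000,
    2009: 6_726_000_000,
    2010: 26_930_509_703,
}


def yearly_growth(series):
    """Return {year: percent increase over the previous year}, rounded."""
    years = sorted(series)
    return {
        year: round((series[year] - series[prev]) / series[prev] * 100)
        for prev, year in zip(years, years[1:])
    }


growth = yearly_growth(triples_by_year)
for year, pct in growth.items():
    print(year, f"{pct}%")
```

Under that reading, the computed values match the 300%, 236%, and 300% shown on the slide.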
  • Uptake in the Government Domain
    • The EU is starting to publish Linked Data (LOD2, LATC)
    • Various other national efforts
    • W3C eGovernment Interest Group
  • Uptake in the Libraries Community
    • Institutions publishing Linked Data
      • Library of Congress (subject headings)
      • German National Library (PND dataset and subject headings)
      • Swedish National Library (Libris catalog)
      • Hungarian National Library (OPAC and Digital Library)
      • Europeana project just released data about 4 million artifacts
    • Growth of Library Linked Data (2009-2010): 1000%
    • W3C Library Linked Data Incubator Group
    • Goals:
      1. Integrate library catalogs on a global scale.
      2. Interconnect resources between repositories (by topic, by location, by historical period, by ...).
  • LOD data set statistics as of November 2010
    Domain          Data Sets   Triples          Percent   RDF Links     Percent
    Cross-domain    20          1,999,085,950    7.42      29,105,638    7.36
    Geographic      16          5,904,980,833    21.93     16,589,086    4.19
    Government      25          11,613,525,437   43.12     17,658,869    4.46
    Media           26          2,453,898,811    9.11      50,374,304    12.74
    Libraries       67          2,237,435,732    8.31      77,951,898    19.71
    Life sciences   42          2,664,119,184    9.89      200,417,873   50.67
    User Content    7           57,463,756       0.21      3,402,228     0.86
    Total           203         26,930,509,703             395,499,896
    • LOD Cloud Data Catalog on CKAN: http://www.ckan.net/group/lodcloud
    • More statistics: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
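The two percent columns follow directly from the per-domain counts. The sketch below recomputes them, assuming each percentage is the domain's share of the column total rounded to two decimals; `lod_stats` and `percent_shares` are illustrative names, not part of the original slides.

```python
# (triples, RDF links) per domain, copied from the slide's table.
lod_stats = {
    "Cross-domain": (1_999_085_950, 29_105_638),
    "Geographic": (5_904_980_833, 16_589_086),
    "Government": (11_613_525_437, 17_658_869),
    "Media": (2_453_898_811, 50_374_304),
    "Libraries": (2_237_435_732, 77_951_898),
    "Life sciences": (2_664_119_184, 200_417_873),
    "User Content": (57_463_756, 3_402_228),
}


def percent_shares(values):
    """Return each entry's share of the total, in percent with two decimals."""
    total = sum(values.values())
    return {name: round(value / total * 100, 2) for name, value in values.items()}


triple_share = percent_shares({d: t for d, (t, _) in lod_stats.items()})
link_share = percent_shares({d: l for d, (_, l) in lod_stats.items()})

for domain in lod_stats:
    print(f"{domain:15s} {triple_share[domain]:6.2f} {link_share[domain]:6.2f}")
```

The contrast between the two columns is the point of the table: the government domain holds the largest share of triples, while the life sciences hold about half of all RDF links.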
  • What are the big players doing?
  • Structured Data Becomes an SEO Topic (figure labels: Data Snippets; Query Answer)
  • Result: Further growth
    • … usage of RDFa has increased 510% between March, 2009 and October, 2010
    • 430 million webpages contain RDFa
    • Source: Yahoo, http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
  • The Structural Continuum: The Web of Data is interwoven with the classic Web.
    • Unstructured text: HTML
    • Structured data:
      • RDFa embedded into HTML (Open Graph)
      • Microdata embedded into HTML (Schema.org)
      • Microformats embedded into HTML
    • Linked data: RDF/XML
  • Topology of the Web of Data
  • How to get the data?
    • Download the Billion Triples Challenge Dataset
      • 2 billion triples (20 GB gzipped)
      • crawled from the public Web of Linked Data in May/June 2011
      • http://challenge.semanticweb.org/
    • Download the Sindice Dump
      • 12 billion triples (164 GB gzipped, ~1.16 TB uncompressed)
      • crawled from the public Web of Linked Data
      • includes RDFa, Microformat, and wrapped API data
      • http://data.sindice.com/trec2011/download.html
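Dumps of this size are best processed as a stream rather than unpacked first. The sketch below assumes the download is a gzipped, line-oriented serialization (N-Triples or N-Quads, one statement per line); that format, the function name `count_statements`, and the file name in the usage note are assumptions, not details given on the slide.

```python
import gzip


def count_statements(path):
    """Count statements in a gzipped, line-oriented RDF dump.

    Blank lines and '#' comment lines are skipped. The file is read as a
    stream, so memory use stays constant regardless of dump size.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

A call such as `count_statements("btc-2011-chunk-000.nq.gz")` (a hypothetical file name) would then report how many statements one chunk contains.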
  • 2. Global Data Integration
    • Applications hate heterogeneity!
    • (figure labels: The wild wild west; My little world)
  • The Dataspace Vision: An alternative to classic data integration systems, in order to cope with the growing number of data sources.
    • Properties of dataspaces
      • no upfront investment into a global schema
      • rely on pay-as-you-go data integration
      • give best-effort answers to queries
    • Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces: A New Abstraction for Information Management, SIGMOD Rec. 2005.
    • Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford to Pay As You Go, CIDR 2007.
  • Linked Data relies on the Pay-as-You-Go Idea
    • for Identity Management
    • for Schema/Vocabulary Management
  • Publish Identity Links on the Web
    • Identity Link: <http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
    • You publish links pointing at other data sources.
    • Somebody else publishes links pointing at your data source.
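Because identity links come from many publishers, a consumer typically has to merge them into groups of URIs that denote the same entity. The following is a minimal sketch of that step using a union-find structure over owl:sameAs statements written as N-Triples with full URIs. It is an illustration only, not the method of any particular tool; the `example.org` URIs are hypothetical and only the first link comes from the slide.

```python
import re

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Matches a simple N-Triples line whose subject, predicate, and object are URIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def same_as_clusters(lines):
    """Group URIs connected by owl:sameAs links into sets (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == SAME_AS:
            root_a, root_b = find(match.group(1)), find(match.group(3))
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


links = [
    # Identity link from the slide.
    "<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .",
    # Hypothetical link published by a third party.
    "<http://example.org/people/bizer> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .",
    # Hypothetical unrelated link.
    "<http://example.org/people/a> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://example.org/people/b> .",
]
clusters = same_as_clusters(links)
```

Treating owl:sameAs as a symmetric, transitive relation is what lets the publisher's link and the third party's link end up in one cluster, even though neither party knows about the other.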
  • Effort Distribution between Publisher and Consumer (figure: the effort distribution ranges between two ends)
    • Consumer data mines identity links
    • Publishers or third parties provide identity links
  • Vocabularies on the Web of Data
    • Everyone can use whatever vocabularies she likes to publish data on the Web.
    • Or invest effort and reuse common vocabularies
      • Friend-of-a-Friend for describing people and their social network
      • SIOC for describing forums and blogs
      • SKOS for representing topic taxonomies
      • Organization Ontology for describing the structure of organizations
      • GoodRelations provides terms for describing products and business entities
      • Music Ontology for describing artists, albums, and performances
      • Review Vocabulary provides terms for representing reviews
    • Many Linked Data sources use a mixture of common and proprietary vocabulary terms.
  • Publish Vocabulary Links on the Web
    • Vocabulary Link: <http://xmlns.com/foaf/0.1/Person> owl:equivalentClass <http://dbpedia.org/ontology/Person> .
    • Simple Mappings: RDFS, OWL
      • rdfs:subClassOf, rdfs:subPropertyOf
      • owl:equivalentClass, owl:equivalentProperty
    • Complex Mappings: R2R
      • provides value transformation functions
      • structural transformations
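A consumer can use simple mappings such as owl:equivalentClass to rewrite incoming data into the vocabulary it prefers. The sketch below shows that idea as plain term rewriting over (subject, predicate, object) tuples. It is neither R2R nor OWL reasoning, the function names are made up for illustration, and the `example.org` resources are hypothetical; only the mapping itself is taken from the slide.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"


def build_class_map(mapping_triples, target_namespace):
    """Map foreign classes to their equivalents inside target_namespace.

    owl:equivalentClass is symmetric, so both directions are considered.
    """
    class_map = {}
    for subject, predicate, obj in mapping_triples:
        if predicate != OWL_EQUIVALENT_CLASS:
            continue
        if obj.startswith(target_namespace):
            class_map[subject] = obj
        elif subject.startswith(target_namespace):
            class_map[obj] = subject
    return class_map


def normalize_types(data_triples, class_map):
    """Rewrite the object of rdf:type triples according to class_map."""
    return [
        (s, p, class_map.get(o, o)) if p == RDF_TYPE else (s, p, o)
        for s, p, o in data_triples
    ]


# Vocabulary link from the slide.
mappings = [
    ("http://xmlns.com/foaf/0.1/Person",
     OWL_EQUIVALENT_CLASS,
     "http://dbpedia.org/ontology/Person"),
]
# Hypothetical source data that uses the FOAF class.
data = [
    ("http://example.org/people/bizer", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/people/bizer", "http://xmlns.com/foaf/0.1/name", "Christian Bizer"),
]

class_map = build_class_map(mappings, "http://dbpedia.org/ontology/")
normalized = normalize_types(data, class_map)
```

After normalization the person is typed with the DBpedia class while all other triples are left untouched. Value or structural transformations, which the slide attributes to complex mappings, are beyond this kind of one-to-one rewriting.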
  • Deployment of Vocabulary Links
    • Source: Linked Open Vocabularies, http://labs.mondeca.com/dataset/lov
  • Effort Distribution between Publisher and Consumer (figure: options along the effort distribution)
    • Consumer defines or data mines mappings
    • Publisher reuses vocabularies
    • Publisher or third party publishes mappings
  • Somebody-Pays-As-You-Go: The overall data integration effort is split between the data publisher, the data consumer, and third parties.
    • Data Publisher
      • publishes data as RDF
      • sets identity links
      • reuses terms or publishes mappings
    • Third Parties
      • set identity links pointing at your data
      • publish mappings to the Web
    • Data Consumer
      • has to do the rest
      • using record linkage and schema matching techniques
    • (figure: a fixed overall data integration effort divided into Publisher's Effort, Third Party Effort, and Consumer's Effort)
  • Research Directions
    1. More research on pay-as-you-go data integration is needed.
    2. More research on data mining mappings and identity resolution heuristics is needed.
       • Identity links make it easier to mine vocabulary links.
       • Vocabulary links make it easier to mine identity links.
    3. More research on spam detection and data quality assessment is needed.
  • LDIF – Linked Data Integration Framework
    • Combines vocabulary normalization and identity resolution
    • Currently only an in-memory implementation
    • Next release: Hadoop-based implementation
    • http://www4.wiwiss.fu-berlin.de/bizer/ldif/
    • (figure labels: Normalize vocabularies; Identity Resolution)
  • What can we do afterwards …
    • … build better entity search engines
  • 3. Global Data Mining
  • Think about interesting questions …
    • … that you can answer based on the Web of Data
    • … that require
      • aggregation
      • summarization
      • classification
      • association rule mining
    • … combined with
      • text mining
      • sentiment analysis
  • Everybody has the tools to find the answers
  • Research Directions
    1. More research on data space profiling is needed.
    2. More research on global data mining is needed.
       • Google, Yahoo, Microsoft, Facebook will get there soon.
  • Semantic Web Challenge
    • Submission Statistics
      Year   Open Track   Billion Triple Track
      2008   13           9
      2009   16           3
      2010   14           4
    • Do something interesting with the Billion Triple Data
    • and submit your results to the challenge by October 1st
    • present your results at the 10th International Semantic Web Conference (ISWC2011), October 2011, Koblenz, Germany
  • Conclusions
    • The Web of Data is there
      • Linked Data, Microdata, RDFa, Microformats
    • Upcoming research topics
      • pay-as-you-go data integration
      • mapping discovery, schema clustering
      • identity resolution heuristics discovery
      • probabilistic data integration
      • data quality assessment
      • data space profiling
      • global data mining
  • Thanks! References
    • Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. http://linkeddatabook.com/
    • Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far. http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf