How to integrate Linked Data into your application

Slides presented by Christian Becker at the Semantic Technology & Business Conference, San Francisco, June 2012.
More details at: http://ldif.wbsg.de


  1. SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE | SAN FRANCISCO, JUNE 5, 2012
     HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION
     LDIF Team:
     • Andreas Schultz, Freie Universität Berlin
     • Andrea Matteini, mes|semantics
     • Robert Isele, Freie Universität Berlin
     • Pablo N. Mendes, Freie Universität Berlin
     • Christian Becker, mes|semantics
     • Christian Bizer, Freie Universität Berlin
     With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.
  2. WHAT IS LINKED DATA?
     • Raw data (RDF)
     • Accessible on the web
     • Data can link to other data sources
     [Diagram: data sources A to E, each describing several Things, connected to one another by data links]
     • Benefits: Ease of access and re-use; enables discovery
     • One API for all data sources?
  3. LINKING OPEN DATA CLOUD
     [Figure: Linking Open Data cloud diagram. Data sets such as DBpedia, Freebase, MusicBrainz, GeoNames and UniProt are shown as linked nodes, grouped into Media, Geographic, Publications, User-generated content, Government, Cross-domain and Life sciences]
     Source: http://lod-cloud.net (as of September 2011)
  4. TYPES OF LINKED DATA ... AND WHAT YOU CAN DO WITH THEM
     Types:
     • Open, Linked Public Data (LOD Cloud)
     • Enterprise Data
     • Commercial Linked Data (very soon?)
     What you can do with them:
     • Provide interfaces on top of them
     • Augment your website
     • Integrate them into your application logic
     • Create specialized data marts
  5. AUGMENT YOUR WEBSITE: BBC
     BBC online properties make intensive use of data from Wikipedia and MusicBrainz
  6. DATA MARTS: NEUROWIKI
     • NeuroWiki creates views for genes, drugs and diseases data from four RDF data sources
     • Provides navigation and composition tools for accessing and mining the data
  7. APPLICATION LOGIC: IBM WATSON
     • IBM Watson makes use of Linked Data sources such as DBpedia
     Image source: http://www.flickr.com/photos/ibm_media/
  8. 4 STEPS TO LINKED DATA INTEGRATION
  9. STEP #1: ACCESS LINKED DATA
     • Linked Data is published via HTTP, SPARQL endpoints, RDF dumps
     Architectures, with their access methods and decision factors:
     • On-The-Fly Dereferencing
       Access methods: HTTP dereferencing
       Recency: High; Speed / Scalability: Low; Reliability: Low; Complexity: High
     • Query Federation
       Access methods: SPARQL
       Recency: High; Speed / Scalability: Low; Reliability: Decreases exponentially as new sources are added; Complexity: Moderate with SPARQL 1.1 SERVICE clause
     • Crawling and Caching
       Access methods: HTTP dereferencing, dump import, SPARQL
       Recency: Depends; Speed / Scalability: High; Reliability: High; Complexity: High
     Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
     • Live access allows quick prototyping and limited production use
     • As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary
  10. STEP #1: ACCESS LINKED DATA
     Implementations:
     • On-the-fly dereferencing
       • LDspider, SQUIN, Semantic Web Client library
     • Query federation
       • SPARQL 1.1 SERVICE clause
     • Crawling and Caching
       • Triplestore import script
       • Public caches (e.g. Sindice, OpenLink LOD endpoint)
       • LDIF
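The two live access methods above can be sketched with the Python standard library. This is a minimal illustration, not code from the slides: the function names are hypothetical, the DBpedia URLs are used only as example inputs, and nothing is sent over the network (the request object is built but never opened).

```python
from urllib.request import Request

# Assumption: the RDF serializations we are willing to accept.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def dereference_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET that asks for RDF via content negotiation."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


def federated_query(remote_endpoint: str, pattern: str) -> str:
    """Wrap a graph pattern in a SPARQL 1.1 SERVICE clause for a remote endpoint."""
    return (
        "SELECT * WHERE {\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {pattern}\n"
        "  }\n"
        "}"
    )


# Example inputs (illustrative only).
req = dereference_request("http://dbpedia.org/resource/Berlin")
query = federated_query("http://dbpedia.org/sparql", "?s a ?type .")
```

Passing `req` to `urllib.request.urlopen` would perform the actual dereferencing; a crawling and caching setup would store the responses locally instead of fetching them on every request.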
  11. STEP #2: NORMALIZE VOCABULARIES
     Data sources that overlap in content use a wide range of vocabularies.
     • Over 60 % of all LOD sources use proprietary vocabularies
     • It's up to the data consumer to normalize the vocabularies
     • Enterprise: Need to translate between internal and external vocabularies
     [Figure: Most widely used vocabularies in the LOD cloud (08/10/2011): mpeg7, swrc, po, dcam, bib, tl, wot, rdfg, txn, compass, metalex, doap, dc, wdrs, admingeo, vann, api, org, sawsdl, sdmx, geospecies, qb, xml, rev, vu-wordnet, umbel, uniprot, http, scovo, void, tag, dbp, bio, ore, dbo, gr, dbpedia, event, time, xsd, frbr, geonames, cc, sioc, foaf, vcard, mo, bibo, akt, xhtml, skos, geo]
     Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  12. STEP #2: NORMALIZE VOCABULARIES
     Approaches to Schema Mapping:
     • Hand-crafting queries against individual sources: no different than an API
         OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
         OPTIONAL { ?ow dbp-prop:location ?loc . ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc }
         OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
       Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
     • Ontology Representation Languages: OWL, RDFS
     • Rules: SWRL, RIF
     • Query Languages
       • SPARQL CONSTRUCT clause
       • TopQuadrant SPARQLMotion
       • Mosto
       • R2R (part of LDIF)
  13. STEP #2: NORMALIZE VOCABULARIES
     Using SPARQL:
     • Rename a class
         CONSTRUCT { ?s a mo:MusicArtist }
         WHERE { ?s a dbpedia-owl:MusicalArtist }
     • Value transformation
         CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
         WHERE { ?s dbpedia-owl:runtime ?runtime .
                 BIND(?runtime * 60 AS ?runtimeInMinutes) }
     • Create URI from literal
         CONSTRUCT { ?s diseasome:omim ?omimuri .
                     ?omimuri dc:identifier ?identifier . }
         WHERE { ?s dbpedia-owl:omim ?omim .
                 BIND(IRI(concat("http://bio2rdf.org/omim:", ?omim)) AS ?omimuri)
                 BIND(concat("omim:", ?omim) AS ?identifier) }
     Slide credits: Andreas Schultz
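For readers who think more easily in a general-purpose language, the three mapping patterns can be mirrored over in-memory triples. This is only a sketch under simplifying assumptions: triples are plain Python tuples, prefixed names stand in for full IRIs, the function name `translate` and the `ex:` sample subjects are hypothetical, and the factor of 60 is taken as-is from the slide.

```python
def translate(triples):
    """Emit target-vocabulary triples for the three mapping patterns on the slide."""
    out = []
    for s, p, o in triples:
        if p == "rdf:type" and o == "dbpedia-owl:MusicalArtist":
            # Rename a class.
            out.append((s, "rdf:type", "mo:MusicArtist"))
        elif p == "dbpedia-owl:runtime":
            # Value transformation (same factor as in the slide's BIND expression).
            out.append((s, "movie:runtime", o * 60))
        elif p == "dbpedia-owl:omim":
            # Create a URI from a literal, plus an identifier literal.
            omim_uri = "http://bio2rdf.org/omim:" + str(o)
            out.append((s, "diseasome:omim", omim_uri))
            out.append((omim_uri, "dc:identifier", "omim:" + str(o)))
    return out


# Hypothetical sample data.
source = [
    ("ex:artist", "rdf:type", "dbpedia-owl:MusicalArtist"),
    ("ex:film", "dbpedia-owl:runtime", 2),
    ("ex:disease", "dbpedia-owl:omim", "104300"),
    ("ex:artist", "foaf:name", "Some Artist"),  # no mapping defined, so not emitted
]
target = translate(source)
```

Like a CONSTRUCT query, the function only emits triples for patterns it matches; everything else is dropped from the target representation.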
  14. STEP #3: RESOLVE IDENTIFIERS
     Data sources that overlap in content use different identifiers for the same real-world entity.
     • Most LOD sources only provide owl:sameAs links to one other data source
     • It's up to the data consumer to generate additional links
     • Enterprise: Need to link both internal and external resources
     [Chart: Number of linked data sets per source (08/10/2011)]
     • 1 linked data set: 98 sources
     • 2 linked data sets: 62 sources
     • 3 linked data sets: 38 sources
     • 4 linked data sets: 19 sources
     • 5 linked data sets: 5 sources
     • 6 - 10 linked data sets: 17 sources
     • > 10 linked data sets: 27 sources
     Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  15. STEP #3: RESOLVE IDENTIFIERS
     Approaches to Identity Resolution:
     • Improvised or manual merging
     • Rule-based approaches:
       • SILK (part of LDIF)
       • LIMES
     [Diagram: candidate records "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco" (37° 47′ N, 122° 24′ W); the record "Union Sq." is matched (=) with "Union Square" at "Union Sq., San Francisco" (37° 47′ N, 122° 24′ W)]
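A rule-based matcher of the kind shown in the Union Square example can be approximated in a few lines. The sketch below is not Silk or LIMES and does not use their link specification languages; it simply combines a string similarity with a geographic distance threshold. The thresholds and all coordinates other than the San Francisco pair from the slide are illustrative assumptions.

```python
import math
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case the label and expand the abbreviation used on the slide."""
    return name.lower().replace("sq.", "square").strip()


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def distance_km(p, q) -> float:
    """Great-circle (haversine) distance between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def same_entity(a, b, min_similarity=0.9, max_km=2.0) -> bool:
    """Link two records only if both the label and the location agree."""
    return (name_similarity(a["label"], b["label"]) >= min_similarity
            and distance_km(a["coords"], b["coords"]) <= max_km)


# Hypothetical records; coordinates are approximate.
sf_a = {"label": "Union Sq.", "coords": (37.7833, -122.4000)}
sf_b = {"label": "Union Square", "coords": (37.7880, -122.4075)}
new_york = {"label": "Union Sq.", "coords": (40.7359, -73.9906)}
seattle = {"label": "Union Sq.", "coords": (47.6097, -122.3331)}
```

The label alone cannot separate the three squares, which is why the location comparison is needed: only the two San Francisco records pass both checks.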
  16. STEP #4: FILTER DATA
     Data sources that overlap in content provide data that is conflicting and of varying quality.
     • Data sources have...
       • ... different knowledge levels, views or intents
       • ... wrong, biased, inconsistent or outdated information
     • Approaches:
       • Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause
       • Sieve (part of LDIF)
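The Named Graph approach amounts to keeping a fourth element, the graph, next to every triple so that conflicting statements stay attributable to their source. A minimal in-memory sketch follows; the class and method names are hypothetical, the `ex:` identifiers are made up, and a real deployment would use a triplestore queried with the SPARQL GRAPH clause.

```python
from collections import defaultdict


class QuadStore:
    """Tiny in-memory store that keeps each source's triples in its own named graph."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph, s, p, o):
        self._graphs[graph].add((s, p, o))

    def objects(self, graph, s, p):
        """Values of (s, p) within a single named graph."""
        return sorted(o for (s2, p2, o) in self._graphs.get(graph, ()) if s2 == s and p2 == p)

    def values_by_graph(self, s, p):
        """Values of (s, p) grouped by the graph that asserts them."""
        result = {}
        for graph in self._graphs:
            values = self.objects(graph, s, p)
            if values:
                result[graph] = values
        return result


# Two hypothetical sources disagree about the same property.
store = QuadStore()
store.add("ex:sourceA", "ex:SanFrancisco", "ex:population", 700000)
store.add("ex:sourceB", "ex:SanFrancisco", "ex:population", 800000)
conflicts = store.values_by_graph("ex:SanFrancisco", "ex:population")
```

Because the values are grouped per graph, a consumer can decide later which source to trust instead of losing that information at import time.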
  17. LDIF – LINKED DATA INTEGRATION FRAMEWORK
     Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
     1 Collect data: Managed download and update
     2 Translate data into a single target vocabulary
     3 Resolve identifier aliases into local target URIs
     4 Cleanse data: resolve conflicting values (NEW)
     5 Output
     • Follows the Crawling and Caching Architecture Pattern
     • Open source (Apache License, Version 2.0)
     • Collaboration between Freie Universität Berlin and mes|semantics
  18. LDIF PIPELINE
     Steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     Supported data sources:
     • RDF dumps (all common formats)
     • SPARQL Endpoints
     • Crawling Linked Data via HTTP
  19. LDIF PIPELINE
     Steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     Sources use a wide range of different RDF vocabularies
     [Diagram: dbpedia-owl:City, schema:Place and fb:location.citytown are translated by R2R into local:City]
     • Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
     • Complex mappings with SPARQL expressivity
     • Built-in transformation function library (XPath)
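The simple mapping case (x rdfs:subClassOf y) can be illustrated without any RDF library. The sketch below is not R2R and does not use its mapping language; it only shows the effect of subclass statements on type triples. The function name, the single-parent dictionary and the `ex:` subjects are assumptions made for brevity.

```python
def apply_subclass_mappings(triples, subclass_of):
    """Add rdf:type triples implied by 'x rdfs:subClassOf y' statements.

    subclass_of maps each class to one superclass (a simplification).
    """
    out = set(triples)
    for s, p, o in triples:
        if p != "rdf:type":
            continue
        current, seen = o, {o}
        # Walk up the hierarchy, guarding against cycles.
        while current in subclass_of and subclass_of[current] not in seen:
            current = subclass_of[current]
            seen.add(current)
            out.add((s, "rdf:type", current))
    return out


# Source classes from the slide mapped onto the local target class.
mappings = {
    "dbpedia-owl:City": "local:City",
    "fb:location.citytown": "local:City",
}
data = [
    ("ex:Berlin", "rdf:type", "dbpedia-owl:City"),
    ("ex:Seattle", "rdf:type", "fb:location.citytown"),
]
typed = apply_subclass_mappings(data, mappings)
```

After the mapping, an application can query for `local:City` alone and still find entities from both sources.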
  20. LDIF PIPELINE
     Steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     Sources use different identifiers for the same entity
     [Diagram: Silk compares the candidate records "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco" (37° 47′ N, 122° 24′ W) and matches "Union Sq." = "Union Square" at "Union Sq., San Francisco" (37° 47′ N, 122° 24′ W)]
     • Automated link creation based on Link Specifications
     • Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)
  21. LDIF PIPELINE
     Steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     Sources provide different values for the same property
     [Diagram: "San Francisco population is 0.7M" (★★) and "San Francisco population is 0.8M" (★★★) are fused by Sieve into "San Francisco population is 0.8M"]
     1. Quality Assessment: assign quality scores to Named Graphs (by time, by source preference, thresholds)
     2. Data Fusion: resolve conflicting property values (according to quality scores, frequency, averages)
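The two Sieve stages can be pictured as a scoring function over Named Graphs followed by a conflict resolution function over values. The following sketch is not Sieve's configuration format or API; it is a stand-in that scores graphs by recency and keeps the value from the best-scored graph. The function names, graph names, update dates and the one-year cutoff are all hypothetical.

```python
from datetime import date


def recency_score(last_updated: date, today: date, max_age_days: int = 365) -> float:
    """Quality assessment: 1.0 for data updated today, falling linearly to 0.0."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / max_age_days)


def fuse_by_best_score(values_by_graph: dict, scores: dict):
    """Data fusion: keep the value asserted by the graph with the highest score."""
    best_graph = max(values_by_graph, key=lambda g: scores.get(g, 0.0))
    return values_by_graph[best_graph]


today = date(2012, 6, 5)
last_updated = {"ex:sourceA": date(2011, 6, 1), "ex:sourceB": date(2012, 5, 1)}
scores = {graph: recency_score(d, today) for graph, d in last_updated.items()}

# Conflicting population values for San Francisco, as on the slide.
population = {"ex:sourceA": 700_000, "ex:sourceB": 800_000}
fused = fuse_by_best_score(population, scores)
```

Other fusion policies named on the slide, such as frequency or averages, would replace `fuse_by_best_score` while the scoring step stays the same.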
  22. LDIF PIPELINE
     Steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     Output options:
     • N-Quads
     • N-Triples
     • SPARQL Update Stream
     • Provenance tracking using Named Graphs
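N-Quads extends N-Triples with a fourth term naming the graph, which is how provenance survives in the output. A simplified serializer is sketched below; it handles only IRIs and plain string literals (no blank nodes, datatypes or language tags), and the `IRI` helper class and example.org identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    """Marks a value as an IRI; plain Python strings are treated as literals."""
    value: str


def term(t) -> str:
    """Serialize an IRI or a plain string literal in N-Quads syntax."""
    if isinstance(t, IRI):
        return f"<{t.value}>"
    escaped = (str(t).replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def to_nquad(s, p, o, graph: IRI) -> str:
    """One statement per line: subject, predicate, object, graph, terminated by ' .'"""
    return f"{term(s)} {term(p)} {term(o)} {term(graph)} ."


line = to_nquad(
    IRI("http://example.org/SanFrancisco"),
    IRI("http://example.org/population"),
    "0.8M",
    IRI("http://example.org/graph/sourceB"),
)
```

Dropping the fourth term yields the corresponding N-Triples line, at the cost of losing the link back to the source graph.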
  23. LDIF ARCHITECTURE
     • Application Layer: Application Code, accessing the integrated data via SPARQL or RDF API
     • Data Access, Integration and Storage Layer: LDIF, consisting of
       • Web Data Access Module
       • Data Translation Module
       • Identity Resolution Module
       • Data Quality and Fusion Module
       • Integrated Web Data
     • Web of Data: accessed via HTTP
     • Publication Layer: RDFa, RDF/XML, LD Wrapper (Database A), LD Wrapper (Database B), CMS
  24. VERSIONS
     • In-memory
       • fast, but scalability limited by local RAM
     • RDF Store (TDB)
       • stores intermediate results in a Jena TDB RDF store
       • can process more data than In-memory but doesn't scale
     • Cluster (Hadoop)
       • scales by parallelizing work across multiple machines using Hadoop
       • can process a virtually unlimited amount of data
       • ready for Amazon Elastic MapReduce
  25. BENCHMARKS
     KEGG GENES VS. UNIPROT (CLUSTER)
     [Benchmark chart: 300M TRIPLES, 3.6B TRIPLES]
  26. Q&A
  27. THANKS!
     • Early adopters wanted!
     • Website: http://bit.ly/ldifweb
     • Google Group: http://bit.ly/ldifgroup
     • http://mes-semantics.com
     • Supported in part by
       • Vulcan Inc. as part of its Project Halo
       • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)
     • Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz
