Joshua Shinavier



 The state of the art in
     Linked Data


Advanced Semantic Web, Spring 2009
         Literature Survey
Outline
•   Linked Data

•   Linking Open Data

•   describing linked datasets

•   growing the data web

•   keeping Linked Data connected

•   indexing and searching

•   applications

•   navigation

•   state of the data web



                             2
Linked Data overview

•   resource -- an item of interest

•   URI -- global identifier for a resource

•   representation -- data corresponding to the state
    of a resource

•   information resource -- a “document” containing
    information

•   non-information resource -- anything else

•   associated description -- representation describing
    a Semantic Web resource
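
A minimal sketch of how these terms fit together when a URI is dereferenced, assuming the common convention that a non-information resource answers with a 303 redirect to one of its associated descriptions. The URIs, the DOCUMENTS table and the dereference helper are hypothetical, and nothing here touches the network.

```python
# Hypothetical table: non-information resource URI -> its associated
# descriptions (information resources), keyed by media type.
DOCUMENTS = {
    "http://example.org/resource/Berlin": {
        "application/rdf+xml": "http://example.org/data/Berlin",
        "text/html": "http://example.org/page/Berlin",
    }
}


def dereference(uri, accept="application/rdf+xml"):
    """Simulate an HTTP GET on `uri` and return (status, location)."""
    if uri in DOCUMENTS:
        # Non-information resource: redirect to a matching description.
        variants = DOCUMENTS[uri]
        if accept in variants:
            return 303, variants[accept]
        return 406, None
    for variants in DOCUMENTS.values():
        # Information resource: it can be served directly.
        if uri in variants.values():
            return 200, uri
    return 404, None
```

Calling dereference on the resource URI yields a redirect to the RDF document, while the same call with accept="text/html" redirects to the human-readable page.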




                             3
The Linking Open Data initiative
•   “bootstrap” the data web with large, interconnected data sets
    to reach a critical mass of semantics

•   strict adherence to W3C standards

    •   identification and transportation (URI, HTTP) of resource
        descriptions

    •   interpretation (RDF, RDFS, OWL) of resource descriptions

•   LOD grows as data providers:

    •   publish structured data on the Web

    •   set RDF links between entities in different data sources

•   transition of the web from a distributed document repository
    into a universal, ubiquitous database [Erling 09]
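
As an illustration of "setting an RDF link", the sketch below serializes a single triple whose subject and object live in different data sources. The two example.org/example.net URIs and the rdf_link helper are hypothetical; only the owl:sameAs URI is a real vocabulary term.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def rdf_link(subject, predicate, obj):
    """Serialize one triple of three URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical entities published by two different data sources.
link = rdf_link(
    "http://example.org/resource/Berlin",
    OWL_SAME_AS,
    "http://example.net/places/berlin",
)
```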

                                 4
The LOD cloud




      5
LOD data sets




      6
Link sets in LOD




        7
Describing linked datasets

•   voiD (Vocabulary of Interlinked Datasets)
    [Alexander, Cyganiak, Hausenblas, Zhao 09]

    •   describes data sets and the link sets between them

•   DING (Dataset RankING) [Toupikov, Umbrich,
    Delbru, Hausenblas, Tummarello 09]

    •   ranking of linked datasets using formal
        descriptions

•   modeling of the Linked Data domain [Halpin,
    Presutti 09]
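
A rough sketch of what a voiD description of two data sets and one link set might look like, generated as Turtle text. The data set names, the base namespace and the describe_linkset helper are hypothetical; the void: terms used are the ones I understand the vocabulary to define and should be checked against the voiD guide before use.

```python
VOID_NS = "http://rdfs.org/ns/void#"
OWL_NS = "http://www.w3.org/2002/07/owl#"


def describe_linkset(base, source, target, linkset):
    """Return a Turtle document describing two data sets and one link set."""
    return "\n".join([
        f"@prefix void: <{VOID_NS}> .",
        f"@prefix owl: <{OWL_NS}> .",
        f"@prefix : <{base}> .",
        "",
        f":{source} a void:Dataset .",
        f":{target} a void:Dataset .",
        # The link set records which data set holds the subjects and
        # which holds the objects of the interlinking triples.
        f":{linkset} a void:Linkset ;",
        f"    void:subjectsTarget :{source} ;",
        f"    void:objectsTarget :{target} ;",
        "    void:linkPredicate owl:sameAs .",
    ])


# Hypothetical names, for illustration only.
turtle = describe_linkset("http://example.org/void#", "Movies", "Places", "Movies2Places")
```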




                            8
Keeping Linked Data connected

•   network-shaped Entity Name System to enable
    systematic reuse of URIs [Bouquet, Stoermer,
    Cordioli, Tummarello 08]

    •   similar to DNS for interlinking hypertext

•   n2Mate framework [Peterson, Cregan, Atkinson,
    Brisbin 08]

    •   use social networking principles to facilitate
        vocabulary and instance reuse

•   graph-based disambiguation of Semantic Web
    entities with idMesh [Cudré-Mauroux, Haghani,
    Jost, Aberer, de Meer 09]



                              9
Managing co-reference
•   many conflated resources in DBpedia [Jaffri,
    Glaser, Millard 08]

    •   representative of LOD as a whole

•   Co-Reference Resolution Service [Glaser, Jaffri,
    Millard 09]

    •   when co-reference is context-specific,
        owl:sameAs is inappropriate

    •   stores co-reference information as a first-class
        entity

•   ontology-level alignment should precede data-level
    alignment [Nikolov, Uren, Motta 09]
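
One way to picture co-reference stored as a first-class entity is a set of URI "bundles" kept per context, separate from the data itself, so that no global owl:sameAs is ever asserted. The sketch below uses a union-find structure for this; it is an illustration under that assumption, not the Co-Reference Resolution Service's implementation, and BundleStore is a hypothetical name.

```python
from collections import defaultdict


class BundleStore:
    """Keeps co-reference bundles per context instead of asserting owl:sameAs."""

    def __init__(self):
        # context -> {uri: parent uri}, i.e. one union-find forest per context
        self._parent = defaultdict(dict)

    def _find(self, context, uri):
        parent = self._parent[context]
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    def add_coreference(self, context, uri_a, uri_b):
        """Record that two URIs co-refer within one context only."""
        root_a = self._find(context, uri_a)
        root_b = self._find(context, uri_b)
        if root_a != root_b:
            self._parent[context][root_b] = root_a

    def bundle(self, context, uri):
        """Return every URI known to co-refer with `uri` in this context."""
        root = self._find(context, uri)
        return {u for u in list(self._parent[context])
                if self._find(context, u) == root}
```

Because bundles are scoped by context, two URIs can be treated as equivalent for, say, publication lists without being merged for every other purpose.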



                             10
Growing the data web

•   how to get data out there?

•   challenges of the read-write Semantic Web

    •   user awareness of social context of data (e.g.
        licensing, privacy)

    •   view update problem

    •   is the wiki model applicable?

•   incentives for posting data on the SW

•   validating existing Linked Data with Vapour
    [Berrueta, Fernandez, Frade 08]



                              11
Examples of LOD data sets


•   DBpedia [Auer, Bizer, Kobilarov, Lehmann,
    Cyganiak, Ives 07]

    •   extracts structured information from Wikipedia

    •   linking hub for the LOD cloud

•   RDF Book Mashup [Bizer, Cyganiak, Gauss 07]

    •   product metadata from Amazon.com




                            12
Music and movies as Linked Data
•   Linked Movie Database [Hassanzadeh, Consens 09]

    •   combines data from IMDb, Freebase, OMDB,
        DBpedia, RottenTomatoes.com, Stanford Movie
        Database

•   interlinked music datasets [Raimond, Sutton,
    Sandler 08]

    •   combines data from Jamendo on DBTune, BBC
        John Peel sessions, SBSimilarity, MusicBrainz,
        DBpedia, GeoNames

    •   links artists, albums, tracks, personal music
        collections

    •   generates links based on similarity of resources
        and similarity of neighbors

                             13
Other sources of data


•   the hypertext Web itself [Li, Zhao 08]

    •   extraction of semantic links from hypertext links and
        hierarchical relationships among Web documents

•   RDF representation of the HTML DOM using SparqPlug
    [Coetzee, Heath, Motta 08]

•   multimedia metadata

    •   interlinking multimedia fragments [Hausenblas, Troncy,
        Bürger, Raimond 09]




                                14
Other sources of data (cont.)

•   eXtensible Business Reporting Language (XBRL)
    [Garcia, Gil 09]

    •   mapping data to RDF and schemas to OWL
        facilitates interoperability

•   large thesauri [Neubert 09]

    •   as interlinking hubs for professional communities

•   enterprise data, e.g. technical documentation [Servant
    08]

•   MARC21 bibliographic records [Styles, Ayers, Shabir
    08]



                             15
Mapping tools


•   D2R Server for customizable mappings from
    relational databases to ontologies [Bizer, Cyganiak
    06]

•   browser-based tools for defining RDB-to-RDF
    mappings [Zhou, Xu, Chen, Idehen 08]

•   Triplify [Auer, Dietzold, Lehmann, Hellmann,
    Aumueller 09]

•   from generic data silos to Linked Data using
    OpenLink Data Spaces [Idehen, Erling 08]
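
The basic idea behind relational-to-RDF mapping can be sketched in a few lines: each row becomes a subject URI and each column a predicate. This is a deliberately naive illustration and not the mapping language of D2R Server, Triplify or any other tool listed above; the BASE namespace, the table and the rows_to_ntriples helper are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace


def rows_to_ntriples(conn, table, key, columns):
    """Map each row to one subject URI and each listed column to one triple."""
    # Identifiers cannot be bound as SQL parameters; they are trusted here.
    query = f"SELECT {key}, {', '.join(columns)} FROM {table} ORDER BY {key}"
    lines = []
    for row in conn.execute(query):
        subject = f"<{BASE}{table}/{row[0]}>"
        for column, value in zip(columns, row[1:]):
            if value is None:
                continue  # NULL cells produce no triple
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}vocab/{column}> "{literal}" .')
    return lines


# Hypothetical in-memory table standing in for an existing data silo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("INSERT INTO book VALUES (1, 'An Example Title', 2009)")
conn.execute("INSERT INTO book VALUES (2, 'Untitled', NULL)")
triples = rows_to_ntriples(conn, "book", "id", ["title", "year"])
```

Real mapping tools add what this sketch leaves out: typed literals, links between tables via foreign keys, and reuse of existing vocabularies instead of a generated one.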




                           16
Aggregated resources


•   Open Archives Initiative Protocol for Metadata
    Harvesting (OAI-PMH)

    •   can be made Web-accessible with OAI2LOD
        Server [Haslhofer, Schandl 08]

•   Open Archives Initiative - Object Reuse and
    Exchange (OAI-ORE) [Van de Sompel, Lagoze,
    Nelson, Warner, Sanderson, Johnston 09]

    •   adheres to Web principles




                            17
User-driven Linked Data


•   existing Linked Data datasets are more
    appropriate for machine than human
    consumption

•   template-generated interlinks are of limited quality

•   data from existing silos quickly becomes out of
    date

•   need human involvement to grow the data web
    organically




                           18
User-driven Linked Data (cont.)
•   direct modification using SPARQL/Update

    •   e.g. in Tabulator [Berners-Lee, Hollenbach, Lu, Presbrey,
        Prud’hommeaux, Schraefel 08]

•   User Contributed Interlinking [Halb, Raimond, Hausenblas]

•   semantic wikis

•   Loomp [Roesch, Heese 09]

    •   semantic annotation of content using a text editor
        interface
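
For the SPARQL/Update bullet, a small sketch of what a direct modification request might look like when built by a client. The insert_data helper and the example.org URIs are hypothetical, and how the request is sent to a server is left out entirely.

```python
def insert_data(triples):
    """Build an INSERT DATA request from (subject, predicate, object) URI triples."""
    body = "\n".join(f"  <{s}> <{p}> <{o}> ." for s, p, o in triples)
    return "INSERT DATA {\n" + body + "\n}"


# Hypothetical statement a user might add while browsing.
request = insert_data([
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
])
```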




                                19
User-driven Linked Data (cont.)
•   public data from existing social networks

    •   wrappers for Web 2.0 services [Passant 08]

    •   unifying personal identity across various
        networks [Rowe 09]

•   Semantically Interlinked Online Communities
    (SIOC)

    •   integrating social media sites (forums, blogs,
        wikis, etc.) with the data web [Bojars, Passant,
        Cyganiak, Breslin 08]

•   Meaning of a Tag (MOAT) ontology gives meaning
    to tags on Web 2.0 [Passant, Laublet 08]



                             20
Usability and licensing

•   usability (for humans) of Linked Data [Halb,
    Raimond, Hausenblas 08]

    •   current LOD datasets are primarily for machine
        consumption

    •   low semantic strength of current LOD link sets

•   provenance information for Linked Data [Hartig
    09]

•   Open Data Commons license [Miller, Styles, Heath
    08]




                            21
Indexing and searching
•   W3C’s TAP semantic search [Guha, McCool 01]

•   Swoogle [Ding, Finin, Joshi, Pan, Cost, Peng, Reddivari,
    Doshi, Sachs 04]

    •   adapts PageRank concept to ontologies

•   SWSE [Hogan, Harth, Umbrich, Decker 07]

    •   MultiCrawler [Harth, Umbrich, Decker 06]

•   RDF Gateway search

•   Watson document-based search

•   Falcons [Cheng, Ge, Wu, Qu 08]

    •   textual search using class hierarchies for query restriction

•   Sindice Semantic Web index [Tummarello, Delbru, Oren 07]
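
For reference, the PageRank concept that Swoogle adapts can be sketched as a plain power iteration over a link graph. This is the textbook algorithm, not Swoogle's ontology-specific variant, and the three-node graph of vocabulary names is a made-up example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Hypothetical "imports" relationships between three vocabularies.
ranks = pagerank({"sioc": ["foaf", "dc"], "foaf": ["dc"], "dc": []})
```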
                                22
Link discovery


•   Silk link discovery framework [Volz, Bizer, Gaedke,
    Kobilarov 09]

    •   find relationships between entities within
        different data sources

    •   generation of owl:sameAs links

•   value of Web of Data depends on the amount and
    quality of links between data sources
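
A toy version of link discovery: compare entity labels from two sources and emit an owl:sameAs link when the best match is similar enough. This is not Silk's link specification language; the similarity measure (Python's difflib), the 0.9 threshold and the discover_links helper are assumptions chosen for brevity.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def discover_links(source, target, threshold=0.9):
    """source/target: dicts mapping entity URI -> label. Returns N-Triples lines."""
    links = []
    for s_uri, s_label in sorted(source.items()):
        best_uri, best_score = None, 0.0
        for t_uri, t_label in sorted(target.items()):
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score > best_score:
                best_uri, best_score = t_uri, score
        # Only the single best candidate is linked, and only above the threshold.
        if best_uri is not None and best_score >= threshold:
            links.append(f"<{s_uri}> <{OWL_SAME_AS}> <{best_uri}> .")
    return links
```

Label similarity alone is exactly the kind of heuristic that produces the low-quality links discussed elsewhere in this survey, which is why practical frameworks combine several metrics and restrict comparisons by type.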




                             23
Navigation
•   as on the early Web, it’s easy to get “Lost in Hyperspace”

•   Tabulator generic Linked Data browser [Berners-Lee,
    Chen, Chilton, Connolly, Dhanaraj, Hollenbach,
    Lerer, Sheets 06]

    •   encourage deployment of Linked Data

    •   test, refine and promote Linked Data standards

•   faceted views over large-scale linked data with
    Virtuoso Cluster Edition [Erling 09]

•   Explorator RDF browser [Araujo, Schwabe 09]

    •   exploratory search using direct manipulation



                            24
Navigation (cont.)
•   DBpedia Mobile map view and faceted Linked
    Data browser [Becker, Bizer 08]

    •   explore the geospatial Semantic Web

    •   uses current GPS position as a starting point

    •   potential for Linked Data publishing




                            25
Navigation (cont.)
•   Fenfire generic Linked Data browser [Hastrup,
    Cyganiak, Bojars 08]

    •   uses graph views rather than tables or outlines

    •   shows graph data as directly as possible

    •   related to Fentwine [Fallenstein, Lukka 04]




                            26
Navigation (cont.)


•   Humboldt [Kobilarov, Dickinson 08]

    •   exploratory browsing

    •   faceted views

    •   “resource at a time”

    •   uses a “pivot” operation to refocus the view




                                   27
Navigation (cont.)
•   zLinks plugin [Bergman, Giasson 08]

    •   WordPress plugin with supporting server

    •   relates hypertext links with contextually
        relevant Linked Data

    •   WOWY (WordNet, OpenCyc, Wikipedia, YAGO)

        •   distinguish between types of resources

        •   disambiguate alternate senses




                              28
Navigation (cont.)
•   mapping of Linked Data to a file system model
    [Schandl 09]

    •   enables use of this data within desktop
        applications




                            29
Other applications
•   how to use the data that is out there?

    •   emerging applications which exploit Linked
        Data [Hausenblas 09]

•   integrating data sources related to drug and
    clinical trials [Jentzsch, Andersson, Hassanzadeh,
    Stephens, Bizer 09]

•   mashups

    •   MashQL [Jarrar, Dikaiakos 09]

        •   Internet is a database, mashup is a query
            over that database

•   benefit of specialized, independent Linked Data
    services acting together [Bojars, Passant, Giasson,
    Breslin 07]
                              30
The gray area
•   U-P2P framework for peer-to-peer linked data [Davoust,
    Esfandiari 09]

    •   data replication provides a measure of popularity

•   Linked Data with Named Graphs

    •   e.g. interlinks with embedded provenance information
        [Zhao, Klyne, Shotton 08]

•   Ripple scripting language [Shinavier 07]

    •   embeds Turing-complete programs in the Web of Data




                                31
State of the data web
•   where are we with the Linked Data graph?

    •   size

    •   number and type of links

    •   usefulness to end users

    •   network characteristics

•   single-point-of-access (e.g. DBpedia, GeoNames)
    vs. distributed datasets (e.g. FOAF-o-sphere,
    SIOC-land)

•   syntactic and semantic analysis of the LOD
    dataset [Hausenblas, Halb, Raimond, Heath 08]



                            32
Statistics of the data web

•   today’s Linked Data is very different from the
    first-generation data web [Halpin 09]

    •   LOD data accounts for the vast majority of data

    •   power-law distributions are emerging

    •   data web is not growing organically

    •   Web standards are generally adhered to

•   is Linked Data useful to ordinary users?

    •   sampling of Linked Data using Live.com query
        logs and FALCON-S semantic search engine
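
To make "power-law distributions" concrete: on a rank/frequency plot with logarithmic axes, a power law shows up as a straight line, and its slope can be roughly estimated by least squares. The sketch below is only that rough estimate on synthetic data; it is not the method or the data of [Halpin 09], and rigorous power-law fitting needs more care than a log-log regression.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two positive frequencies.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic Zipf-like counts: frequency proportional to 1 / rank.
slope = loglog_slope([1000.0 / rank for rank in range(1, 51)])
```

A slope near -1 on such data corresponds to the classic Zipf shape; a curve that bends away from a straight line suggests the distribution is not a power law at all.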


                            33
Query popularity follows a power law




 •   ...




                 34
URI frequency... not so much




•   ...




                  35
Data publishing lacks a “long tail”




•   ...




                 36
A few dominant ontologies are emerging




          # of URIs by vocabulary
                     37
(DBpedia bias)




# of URIs by domain name
           38
Graph analysis for the data web

•   common network analysis techniques can be used
    to investigate interoperability and structural
    patterns of the LOD cloud [Rodriguez 09]

•   results based on March 2009 statistics of the LOD
    data set graph:

    •   LOD graph is not strongly connected

    •   diameter of 8 is large given relatively small size
        of the cloud

    •   data sets have nearly identical incoming and
        outgoing link patterns (⇒ majority of reciprocal
        owl:sameAs links)
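
The three measures mentioned here (strong connectivity, diameter, reciprocity of links) can be computed with nothing more than breadth-first search. The sketch below does so on a tiny made-up graph of data set names; it does not reproduce the March 2009 LOD statistics or the analysis of [Rodriguez 09], and it assumes a non-empty graph.

```python
from collections import deque


def reachable(graph, start):
    """Return {node: hop distance} for every node reachable from start (BFS)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def analyze(graph):
    """Return (strongly_connected, diameter, reciprocity) of a directed graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    distances = {n: reachable(graph, n) for n in nodes}
    strongly_connected = all(len(d) == len(nodes) for d in distances.values())
    # Longest shortest path among pairs that are connected at all.
    diameter = max(max(d.values()) for d in distances.values())
    edges = {(s, t) for s, targets in graph.items() for t in targets}
    reciprocal = sum(1 for s, t in edges if (t, s) in edges)
    reciprocity = reciprocal / len(edges) if edges else 0.0
    return strongly_connected, diameter, reciprocity


# Hypothetical link structure between four data sets.
toy = {
    "dbpedia": ["geonames", "musicbrainz"],
    "geonames": ["dbpedia"],
    "musicbrainz": ["dbpedia"],
    "foaf": ["dbpedia"],
}
result = analyze(toy)
```

In the toy graph nothing links back to "foaf", so the graph is not strongly connected even though four of its five links are reciprocated.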



                              39
Ranking and clustering of LOD data sets




                   40
•       Original slide show:

    •    http://tw.rpi.edu/proj/portal.wiki/images/f/f0/LinkedData.pdf

•       References:

    •    http://tw.rpi.edu/proj/portal.wiki/images/e/e0/LinkedDataSurvey.pdf

•       BibTeX:

    •    http://tw.rpi.edu/proj/portal.wiki/images/3/37/LinkedDataSurvey.bbl




                                     41
