Data on the Semantic Web
Peter Mika
Senior Research Scientist
Yahoo! Research
Vague, but exciting… Berners-Lee and the dawn of the Web




Semantic Web

  • Publish information in a way that is easier to process for machines
  • Web of Data instead of Web of Documents
  • Two main architectural challenges
      – A common format for sharing data
      – Sharing the meaning of data
          • Through social means (shared schemas)
          • By using powerful schema languages
  • Semantic Web standards from W3C
      – Languages (RDF, OWL, RIF)
      – Serializations (RDF/XML, RDFa)
      – Protocols (SPARQL, HTTP)
  • Semantic Web research into knowledge representation and
    reasoning, data integration, data quality and many other topics
  • Community efforts to publish data and develop schemas
Resource Description Framework (RDF)

  • Each resource (thing, entity) is identified by a URI
      – Globally unique identifiers
  • RDF represents knowledge as a set of triples
      – Each triple is a single fact about the entity (an attribute or a
        relationship)
  •   A set of triples forms an RDF graph

      RDF document (a graph of two triples about example:roi):
          example:roi   type   foaf:Person
          example:roi   name   “Roi Blanco”
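The two triples on this slide can be written down directly as data. The following is a minimal sketch, assuming that a plain Python set of (subject, predicate, object) tuples is an acceptable stand-in for an RDF graph; it does not distinguish URIs from literals the way a real RDF library would, and the example.org namespace is a placeholder.

```python
# Full URIs for the prefixes used on the slide. The RDF and FOAF namespaces
# are the standard ones; the example.org namespace is a placeholder.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"

# An RDF graph is a set of (subject, predicate, object) triples.
graph = {
    (EX + "roi", RDF + "type", FOAF + "Person"),  # a relationship to a class
    (EX + "roi", FOAF + "name", "Roi Blanco"),    # an attribute (literal value)
}


def facts_about(graph, subject):
    """Return the (predicate, object) pairs recorded for one resource."""
    return {(p, o) for (s, p, o) in graph if s == subject}


for predicate, obj in sorted(facts_about(graph, EX + "roi")):
    print(predicate, "->", obj)
```

Because the graph is a set, adding the same fact twice has no effect, which matches the set-of-triples view of RDF.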
Linking across the Web
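  [Figure: RDF graphs on two websites, linked to each other and to a shared ontology]

  Roi’s homepage
      example:roi   type        foaf:Person
      example:roi   name        “Roi Blanco”
      example:roi   sameAs      #roi2   (link to Yahoo!’s website)

  Yahoo!’s website
      #roi2         worksWith   #peter
      #peter        type        foaf:Person
      #peter        email       “pmika@yahoo-inc.com”

  Friend-of-a-Friend ontology
      foaf:Person, with the property knows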
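One way to read the sameAs link is as an instruction to treat two identifiers as one when graphs from different sites are combined. The sketch below assumes graphs are sets of string triples; the `yahoo:` prefix and the `merge_graphs` helper are hypothetical names introduced for illustration, and the arrow directions follow one reading of the slide.

```python
SAME_AS = "owl:sameAs"

# Graph published on Roi's homepage.
homepage = {
    ("example:roi", "rdf:type", "foaf:Person"),
    ("example:roi", "foaf:name", "Roi Blanco"),
    ("example:roi", SAME_AS, "yahoo:roi2"),
}

# Graph published on Yahoo!'s website (the yahoo: prefix is made up here).
yahoo = {
    ("yahoo:roi2", "yahoo:worksWith", "yahoo:peter"),
    ("yahoo:peter", "rdf:type", "foaf:Person"),
    ("yahoo:peter", "yahoo:email", "pmika@yahoo-inc.com"),
}


def merge_graphs(*graphs):
    """Union the graphs and rewrite sameAs aliases to a single identifier."""
    merged = set().union(*graphs)
    alias = {o: s for (s, p, o) in merged if p == SAME_AS}

    def canonical(term):
        seen = set()
        while term in alias and term not in seen:  # stop if aliases form a loop
            seen.add(term)
            term = alias[term]
        return term

    return {(canonical(s), p, canonical(o))
            for (s, p, o) in merged if p != SAME_AS}


merged = merge_graphs(homepage, yahoo)
```

After merging, the worksWith fact from Yahoo!'s website is attached to example:roi, so a consumer sees one person with facts from both sites.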
Vocabularies (ontologies)

   • Ontologies are collections of classes and properties used to
     describe objects in a particular domain
      – OWL (the Web Ontology Language) is the standard ontology
        language
      – OWL has an RDF serialization: ontologies are part of the
        Semantic Web
   • Classes can be described by sub- and superclasses,
     required properties
      – Class membership in RDF is expressed using the rdf:type
        property
      – An instance can have multiple classes (types)
      – A class can have multiple superclasses
   • Properties can be described by their domain, range,
     cardinalities, etc.
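The class hierarchy is what lets a consumer conclude more types than the publisher stated. A minimal sketch, assuming the subclass statements are available as a dictionary: foaf:Person being a subclass of foaf:Agent comes from FOAF, while ex:Researcher is a hypothetical class added for illustration. This covers only subclass inference, not the rest of OWL.

```python
# rdfs:subClassOf statements: class -> its direct superclasses.
# ex:Researcher is hypothetical; foaf:Person -> foaf:Agent is part of FOAF.
SUBCLASS_OF = {
    "ex:Researcher": {"foaf:Person"},
    "foaf:Person": {"foaf:Agent"},
}


def superclasses(cls, subclass_of):
    """All direct and indirect superclasses of cls."""
    found, todo = set(), [cls]
    while todo:
        for parent in subclass_of.get(todo.pop(), ()):
            if parent not in found:
                found.add(parent)
                todo.append(parent)
    return found


def all_types(asserted_types, subclass_of):
    """Asserted rdf:type values plus those implied by the class hierarchy."""
    result = set(asserted_types)
    for cls in asserted_types:
        result |= superclasses(cls, subclass_of)
    return result


print(sorted(all_types({"ex:Researcher"}, SUBCLASS_OF)))
```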
Example: schema.org

  • Agreement on a shared set of schemas for common types of
    web content
     – Bing, Google, and Yahoo! as initial supporters
     – Similar in intent to sitemaps.org (2006)
         • Use a single format to communicate the same information to all
           three search engines
  • Support for microdata
  • schema.org covers areas of interest to all search engines
     – Business listings (local), creative works (video), recipes,
       reviews
     – User defined extensions
  • Each search engine continues to develop its products


Documentation and OWL ontology




Sources of data
Data on the Web

  • Most web pages on the Web are generated from structured
    data
     – Data is stored in relational databases (typically)
     – Queried through web forms
     – Presented as tables or simply as unstructured text
  • The structure and semantics (meaning) of the data are not
    directly accessible to search engines
  • Two solutions
     – Extraction using Information Extraction (IE) techniques
       (implicit metadata)
         • Supervised vs. unsupervised methods
     – Relying on publishers to expose structured data using standard
       Semantic Web formats (explicit metadata)
         • Particularly interesting for long tail content
Information Extraction methods

   • Named Entity Recognition (NER) and Disambiguation
     (NED)
      – OpenCalais, Zemanta API, DBpedia Spotlight
      – Yahoo! Placemaker

   • Extraction of structured data from text
      – Yago system (demo)
   • Exploiting patterns in web page structure
      – Dapper
      – ScraperWiki
   • Extraction from HTML tables
      – Google Squared (deprecated)


Publishing and consuming data on the Semantic Web

  • Publishing data involves
     – Deciding in which format to publish your data
     – Deciding which schema (ontology, vocabulary) to use
         • OR you can create a new schema and publish it as well


  • Multiple ways of publishing RDF data:
     1. Linked Data
     2. Metadata in HTML
     3. SPARQL endpoints
     4. Feeds, e.g. OData


     Note: you may implement more than one

Option 1: Linked Data

   • A web of RDF documents in parallel to the current Web
      – Most often implemented as wrappers around databases or APIs
   • The four rules of Linked Data:
      – Use URIs to identify things.
      – Use HTTP URIs so that these things can be referred to and
        looked up ("dereferenced") by people and user agents.
      – Provide useful information about the thing when its URI is
        dereferenced, using standard formats such as RDF/XML.
      – Include links to other, related URIs in the exposed data to
        improve discovery of other related information on the Web.

     [Figure: the same small RDF graph, repeated across linked RDF documents]
         #PeterM   label        “Peter Mika”
         #PeterM   born         #Bud
         #Bud      label        “Budapest”
         #Bud      population   “2,000,000”
         #Bud      capital-of   #Hun
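Dereferencing a Linked Data URI is an ordinary HTTP GET in which the client asks for an RDF format. The sketch below only builds the request and does not send it; the example.org URI and the `linked_data_request` helper are assumptions made for illustration. It also shows that the part after `#` in an identifier such as #PeterM is never sent to the server.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def linked_data_request(uri, accept="application/rdf+xml"):
    """Build (but do not send) an HTTP request that asks for RDF about uri."""
    # The #fragment identifies the thing inside the document; only the
    # document URL itself is requested from the server.
    document_url, _fragment = urldefrag(uri)
    return Request(document_url, headers={"Accept": accept})


request = linked_data_request("http://example.org/people#PeterM")
print(request.full_url)
print(request.get_header("Accept"))
# To actually fetch the document: urllib.request.urlopen(request).read()
```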


Option 1: Linked Data

   • Advantages:
      – No change to the publishing of the HTML documents
      – Data can be published by a third party (e.g. DBpedia)
   • Disadvantages:
      – Web servers need to be configured to properly handle URIs that
        identify concepts instead of documents
      – Not favored by search engines
          • Lack of use cases
          • Crawling needs to be changed
          • Authority is difficult to determine
   • Tools
      – Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
      – RDB-to-RDF mappers (e.g. D2RQ, Triplify)
      – Validators (Vapour)
      – Linked Data browsers (many)
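The server configuration mentioned under the disadvantages usually amounts to redirecting a request for a concept URI to a document about that concept, commonly with an HTTP 303 response. A minimal sketch of that decision follows, assuming a URL layout with /resource/, /data/ and /page/ prefixes (an illustrative convention, not a standard) and a simplified reading of the Accept header that ignores quality values.

```python
RDF_TYPES = ("application/rdf+xml", "text/turtle")


def redirect_for(path, accept_header):
    """Return (status, location) for a request to a concept URI.

    Assumes concept URIs live under /resource/, RDF documents under /data/
    and HTML pages under /page/. This layout is illustrative only.
    """
    if not path.startswith("/resource/"):
        return 404, None
    name = path[len("/resource/"):]
    wants_rdf = any(media_type in accept_header for media_type in RDF_TYPES)
    return 303, ("/data/" if wants_rdf else "/page/") + name


print(redirect_for("/resource/Budapest", "application/rdf+xml"))
print(redirect_for("/resource/Budapest", "text/html"))
```

Front-ends of the kind listed under Tools take care of this step; the function above only illustrates the decision they have to make.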
Growth of Linked Data

   • Community effort to (re)publish open datasets as Linked
     Data
      – In particular, scientific and government datasets
      – see linkeddata.org, the Data Hub




Option 2: Metadata in HTML

  • Using microformats, RDFa, Microdata (more later)
  • Advantages:
      – Data and document are always in sync
      – Browser plug-in friendly
      – Search engine friendly
      – Copy-paste friendly
  • Tools:
      – Any23 (Anything to Triples)
      – RDFaCE
      – RDFa Distiller

     [Figure: web pages containing the text “Peter Mika was born in Budapest.”,
      each carrying its own embedded copy of the graph #PeterM, #Bud, #Hun with
      the edges label, born, population and capital-of]
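To make "data and document are always in sync" concrete, the sentence from the figure can be annotated in place and the data read back out of the same HTML. The sketch below assumes microdata with the schema.org Person type, which is one of several possible encodings and is not specified on the slide. `FlatMicrodataParser` is a hypothetical helper that handles a single, non-nested item only.

```python
from html.parser import HTMLParser

# The sentence from the figure, annotated with microdata attributes.
PAGE = (
    '<p itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Peter Mika</span> was born in '
    '<span itemprop="birthPlace">Budapest</span>.</p>'
)


class FlatMicrodataParser(HTMLParser):
    """Collects itemprop name/value pairs of a single, non-nested item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            if "content" in attrs:  # value given in an attribute, e.g. <meta>
                self.properties[attrs["itemprop"]] = attrs["content"]
            else:                   # value is the element's text
                self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


parser = FlatMicrodataParser()
parser.feed(PAGE)
print(parser.item_type, parser.properties)
```

Editing the visible text of the page changes the extracted data as well, since both come from the same elements.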




Example: Facebook’s Open Graph Protocol

  • RDF vocabulary to be used in conjunction with RDFa
      – Simplify the work of developers by restricting the freedom in RDFa
  • Activities, Businesses, Groups, Organizations, People, Places,
    Products and Entertainment
  • Only HTML <head> accepted
  • http://opengraphprotocol.org/

 <html xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
    <title>The Rock (1996)</title>
    <meta property="og:title" content="The Rock" />
    <meta property="og:type" content="movie" />
    <meta property="og:url"
          content="http://www.imdb.com/title/tt0117500/" />
    <meta property="og:image"
          content="http://ia.media-imdb.com/images/rock.jpg" /> …
 </head> ...
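Because Open Graph restricts itself to `<meta>` elements in the `<head>`, a consumer can read it with very little code. A minimal sketch using only the Python standard library follows; `OpenGraphParser` is a hypothetical helper, the page is a shortened copy of the markup above, and real consumers would need to be more tolerant of malformed HTML.

```python
from html.parser import HTMLParser

PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""


class OpenGraphParser(HTMLParser):
    """Collects og:* properties from <meta> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            if prop.startswith("og:"):
                self.properties[prop[len("og:"):]] = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


parser = OpenGraphParser()
parser.feed(PAGE)
print(parser.properties)
```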
Current state of metadata on the Web

   • 31% of webpages, 5% of domains contain some
     metadata
       – Analysis of the Bing Crawl (US crawl, January 2012)
       – RDFa is the most common format
           • By URL: 25% RDFa, 7% microdata, 9% microformat
           • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
       – Adoption is stronger among large publishers
           • Especially for RDFa and microdata
   • See also
      – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
        LDOW 2012
      – H. Mühleisen, C. Bizer. Web Data Commons - Extracting
        Structured Data from Two Large Web Corpora, LDOW 2012


Exponential growth in RDFa data

     [Chart: Percentage of URLs with embedded metadata in various formats]

  • Five-fold increase between March 2009 and October 2010
  • Another five-fold increase between October 2010 and January 2012
Option 3: SPARQL endpoints

     [Figure: the example RDF graph of #PeterM, #Bud and #Hun]

  • An API for accessing RDF databases on the Web
     – A query language and an HTTP protocol
  • Advantages:
     – Flexible access: make any query you want
     – Also possible to expose a traditional RDBMS via a wrapper
  • Disadvantages:
     – For the publisher: cost of supporting arbitrary queries
     – For the search engine: discovery of SPARQL servers is unsolved
  • Tools:
     – Triple stores
         • Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.
     – RDB-to-RDF mappers such as D2RQ and Triplify
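"A query language and an HTTP protocol" means that a query is plain text sent to the endpoint as an HTTP parameter. The sketch below builds such a request URL without sending it. The endpoint address and the example.org property URIs are placeholders; the `query` parameter name is the one defined by the SPARQL protocol.

```python
from urllib.parse import urlencode

# Who was born in the capital of #Hun? (URIs are placeholders.)
QUERY = """
SELECT ?person ?city WHERE {
  ?person <http://example.org/born> ?city .
  ?city <http://example.org/capital-of> <http://example.org/Hun> .
}
"""


def sparql_get_url(endpoint, query):
    """URL for a SPARQL protocol query via HTTP GET (query goes in ?query=)."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url("http://example.org/sparql", QUERY)
print(url)
```

The flexibility listed as an advantage is visible here: the client decides the shape of the query, which is also why supporting arbitrary queries is costly for the publisher.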

Example: DBpedia

  • demo




Crawling the Semantic Web

  • Linked Data
     – Similar to HTML crawling, but the crawler needs to parse
       RDF/XML (and others) to extract URIs to be crawled
     – Semantic Sitemap/VoID descriptions
  • RDFa
     – Same as HTML crawling, but data is extracted after crawling
     – Mika et al. Investigating the Semantic Gap through Query Log
       Analysis, ISWC 2010.
  • SPARQL endpoints
     – Endpoints are not linked, need to be discovered by other
       means
     – Semantic Sitemap/VoID descriptions
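For the Linked Data case, the link-extraction step of a crawler can be sketched as follows. This assumes RDF/XML input and looks only at rdf:about and rdf:resource attributes; it ignores predicate URIs, relative URIs and other serializations, and the example.org document is made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOCUMENT = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/vocab#">
  <rdf:Description rdf:about="http://example.org/PeterM">
    <ex:born rdf:resource="http://example.org/Bud"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/Bud">
    <ex:capitalOf rdf:resource="http://example.org/Hun"/>
  </rdf:Description>
</rdf:RDF>"""


def uris_to_crawl(rdf_xml):
    """Collect URIs named in rdf:about and rdf:resource attributes."""
    found = set()
    for element in ET.fromstring(rdf_xml).iter():
        for attribute in ("about", "resource"):
            uri = element.get("{%s}%s" % (RDF_NS, attribute))
            if uri:
                found.add(uri)
    return found


print(sorted(uris_to_crawl(DOCUMENT)))
```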


Data fusion

   • Ontology (schema) matching
      – Widely studied in Semantic Web research
          • ontologymatching.org
   • Entity resolution
      – Finding links between datasets
      – Tools: SILK, LIMES
   • Blending
      – Merging objects that represent the same real world entity and
        reconciling information from multiple sources
   • Cleaning
      – Google Refine
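The entity resolution step can be illustrated with a toy link-discovery routine. This is a sketch under strong assumptions: it compares labels only, using token overlap (Jaccard similarity) with an arbitrary threshold, and it is not how any of the tools named above work internally. All identifiers in the example are made up.

```python
import re


def tokens(label):
    """Lower-cased word tokens of a label, ignoring order and punctuation."""
    return set(re.findall(r"\w+", label.lower()))


def jaccard(a, b):
    """Share of tokens the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_links(labels_a, labels_b, threshold=0.8):
    """Pair up identifiers from two datasets whose labels look alike."""
    links = []
    for id_a, label_a in labels_a.items():
        for id_b, label_b in labels_b.items():
            if jaccard(tokens(label_a), tokens(label_b)) >= threshold:
                links.append((id_a, "owl:sameAs", id_b))
    return links


dataset_a = {"a:PeterM": "Peter Mika"}
dataset_b = {"b:p1": "Mika, Peter", "b:p2": "Roi Blanco"}
print(find_links(dataset_a, dataset_b))
```

Comparing every pair is quadratic in the number of entities, so this only works for small datasets; larger ones need some way of narrowing down candidate pairs first.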



More info

   • Ideas for hacks
      – http://challenge.semanticweb.org/
      – http://iswc2011.semanticweb.org/calls/linked-data-a-thon/
   • Book
      – Segaran, Evans and Taylor. Programming the Semantic Web.
        O’Reilly, 2009.
   • More tools
      – Exhibit: faceted browsing and other visualizations
      – http://www.dajobe.org/talks/200906-semtech-open/
      – LOD2 stack (stack.lod2.eu)





Editor's Notes

  • #9 Facebook was invited, but continues to pursue OGP (the Open Graph Protocol)