Hackathon s pb


Slides for the Open Data Hackathon at Saint Petersburg

  • Facebook invited, but continues to pursue OGP

    1. Data on the Semantic Web
        Peter Mika, Senior Research Scientist, Yahoo! Research
    2. Vague, but exciting…
        Berners-Lee and the dawn of the Web
    3. Semantic Web
        • Publish information in a way that is easier for machines to process
        • Web of Data instead of Web of Documents
        • Two main architectural challenges
            – A common format for sharing data
            – Sharing the meaning of data
                • Through social means (shared schemas)
                • By using powerful schema languages
        • Semantic Web standards from W3C
            – Languages (RDF, OWL, RIF)
            – Serializations (RDF/XML, RDFa)
            – Protocols (SPARQL, HTTP)
        • Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics
        • Community efforts to publish data and develop schemas
    4. Resource Description Framework (RDF)
        • Each resource (thing, entity) is identified by a URI
            – Globally unique identifiers
        • RDF represents knowledge as a set of triples
            – Each triple is a single fact about the entity (an attribute or a relationship)
        • A set of triples forms an RDF graph
        [Figure: an RDF document with two triples: example:roi type foaf:Person; example:roi name “Roi Blanco”]
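
The triple model on this slide can be sketched in a few lines of Python. This is only an illustration, not an RDF library: the URI chosen for example:roi is a made-up placeholder, and URIs and literals are both stored as plain strings, which a real RDF toolkit would keep apart.

```python
# A triple is (subject, predicate, object); an RDF graph is a set of triples.
# The RDF and FOAF namespaces are the published ones; ROI is a hypothetical
# expansion of the slide's "example:roi".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
ROI = "http://example.org/roi"

graph = {
    (ROI, RDF_TYPE, FOAF + "Person"),    # a relationship to a class
    (ROI, FOAF + "name", "Roi Blanco"),  # an attribute with a literal value
}


def objects(graph, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


print(objects(graph, ROI, FOAF + "name"))
```

Because the graph is a set, adding the same fact twice has no effect, which matches the idea that an RDF graph is a set of triples.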
    5. Linking across the Web
        [Figure: an RDF graph spanning three sources (Roi’s homepage, the Friend-of-a-Friend ontology, Yahoo!’s website). Nodes: example:roi, foaf:Person, #roi2, #peter; literals: “Roi Blanco”, “pmika@yahoo-inc.com”; properties: type, name, knows, sameAs, worksWith, email.]
    6. Vocabularies (ontologies)
        • Ontologies are collections of classes and properties used to describe objects in a particular domain
            – OWL (the Web Ontology Language) is the standard ontology language
            – OWL has an RDF serialization: ontologies are part of the Semantic Web
        • Classes can be described by sub- and superclasses, required properties
            – Class membership in RDF is expressed using the rdf:type property
            – An instance can have multiple classes (types)
            – A class can have multiple superclasses
        • Properties can be described by their domain, range, cardinalities, etc.
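
To make the class points concrete, here is a small sketch of how an instance with multiple types picks up further types through superclasses. It implements only the transitive rdfs:subClassOf rule, not OWL reasoning, and the ex: classes are invented for the example (foaf:Person being a subclass of foaf:Agent is taken from the FOAF vocabulary).

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical data: the ex: names are assumptions made for this sketch.
triples = {
    ("ex:roi", RDF_TYPE, "ex:Researcher"),   # an instance can have
    ("ex:roi", RDF_TYPE, "ex:Employee"),     # multiple types
    ("ex:Researcher", SUBCLASS_OF, "foaf:Person"),
    ("ex:Employee", SUBCLASS_OF, "foaf:Person"),
    ("foaf:Person", SUBCLASS_OF, "foaf:Agent"),
}


def superclasses(cls, triples):
    """All classes reachable from cls by following rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(instance, triples):
    """Direct rdf:type values plus the types implied by superclasses."""
    direct = {o for s, p, o in triples if s == instance and p == RDF_TYPE}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred


print(sorted(types_of("ex:roi", triples)))
```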
    7. Example: schema.org
        • Agreement on a shared set of schemas for common types of web content
            – Bing, Google, and Yahoo! as initial supporters
            – Similar in intent to sitemaps.org (2006)
                • Use a single format to communicate the same information to all three search engines
        • Support for microdata
        • schema.org covers areas of interest to all search engines
            – Business listings (local), creative works (video), recipes, reviews
            – User defined extensions
        • Each search engine continues to develop its products
    8. Documentation and OWL ontology
    9. Sources of data
    10. Data on the Web
        • Most pages on the Web are generated from structured data
            – Data is stored in relational databases (typically)
            – Queried through web forms
            – Presented as tables or simply as unstructured text
        • The structure and semantics (meaning) of the data are not directly accessible to search engines
        • Two solutions
            – Extraction using Information Extraction (IE) techniques (implicit metadata)
                • Supervised vs. unsupervised methods
            – Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)
                • Particularly interesting for long tail content
    11. Information Extraction methods
        • Named Entity Recognition (NER) and Disambiguation (NED)
            – OpenCalais, Zemanta API, DBpedia Spotlight
            – Yahoo! Placemaker
        • Extraction of structured data from text
            – Yago system (demo)
        • Exploiting patterns in web page structure
            – Dapper
            – ScraperWiki
        • Extraction from HTML tables
            – Google Squared (deprecated)
    12. Publishing and consuming data on the Semantic Web
        • Publishing data involves
            – Deciding in which format to publish your data
            – Deciding which schema (ontology, vocabulary) to use
                • OR you can create a new schema and publish it as well
        • Multiple ways of publishing RDF data:
            1. Linked Data
            2. Metadata in HTML
            3. SPARQL endpoints
            4. Feeds, e.g. OData
        Note: you may implement more than one
    13. Option 1: Linked Data
        • A web of RDF documents in parallel to the current Web
            – Most often implemented as wrappers around databases or APIs
        • The four rules of Linked Data:
            – Use URIs to identify things.
            – Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
            – Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
            – Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
        [Figure: interlinked RDF documents. Nodes: #PeterM, #Bud, #Hun; literals: “Peter Mika”, “Budapest”, “2,000,000”; properties: label, born, population, capital-of.]
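
The second and third rules amount to an ordinary HTTP GET with content negotiation. The sketch below uses only the Python standard library; the URI is a hypothetical placeholder, and whether a given server answers with RDF/XML depends on how it is configured.

```python
import urllib.request


def build_rdf_request(uri):
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def dereference(uri, timeout=10):
    """Look up a URI and return (content type, body). Needs network access."""
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as resp:
        return resp.headers.get("Content-Type"), resp.read()


# Hypothetical URI identifying a thing rather than a document.
req = build_rdf_request("http://example.org/id/PeterM")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Calling `dereference` on a real Linked Data URI would follow any redirect from the thing's URI to a document describing it, since `urlopen` follows HTTP redirects by default.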
    14. Option 1: Linked Data
        • Advantages:
            – No change to the publishing of the HTML documents
            – Data can be published by a third party (e.g. DBpedia)
        • Disadvantages:
            – Web servers need to be configured to properly handle URIs that identify concepts instead of documents
            – Not favored by search engines
                • Lack of use cases
                • Crawling needs to be changed
                • Authority is difficult to determine
        • Tools
            – Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
            – RDB-to-RDF mappers (e.g. D2RQ, Triplify)
            – Validators (Vapour)
            – Linked Data browsers (many)
    15. Growth of Linked Data
        • Community effort to (re)publish open datasets as Linked Data
            – In particular, scientific and government datasets
            – See linkeddata.org, the Data Hub
    16. Option 2: Metadata in HTML
        • Using microformats, RDFa, Microdata (more later)
        • Advantages:
            – Data and document are always in sync
            – Browser plug-in friendly
            – Search engine friendly
            – Copy-paste friendly
        • Tools:
            – Any23 (Anything to Triples)
            – RDFaCE
            – RDFa Distiller
        [Figure: the HTML text “Peter Mika was born in Budapest.” shown with its embedded RDF graph. Nodes: #PeterM, #Bud, #Hun; literals: “Peter Mika”, “Budapest”, “2,000,000”; properties: label, born, population, capital-of.]
    17. Example: Facebook’s Open Graph Protocol
        • RDF vocabulary to be used in conjunction with RDFa
            – Simplify the work of developers by restricting the freedom in RDFa
        • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
        • Only HTML <head> accepted
        • http://opengraphprotocol.org/

        <html xmlns:og="http://opengraphprotocol.org/schema/">
        <head>
          <title>The Rock (1996)</title>
          <meta property="og:title" content="The Rock" />
          <meta property="og:type" content="movie" />
          <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
          <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
          …
        </head>
        ...
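
Because OGP restricts itself to meta elements in the head, a consumer can get quite far without a full RDFa parser. The following is a minimal sketch using Python's standard html.parser; the class name is made up for this example, and it simply collects every meta element whose property starts with "og:".

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


# Shortened version of the markup on the slide.
page = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""

parser = OpenGraphParser()
parser.feed(page)
print(parser.properties)
```

This keeps only the last value if a property is repeated, so a page with several og:image entries would need a list-valued variant.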
    18. Current state of metadata on the Web
        • 31% of webpages, 5% of domains contain some metadata
            – Analysis of the Bing Crawl (US crawl, January 2012)
            – RDFa is the most common format
                • By URL: 25% RDFa, 7% microdata, 9% microformat
                • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
            – Adoption is stronger among large publishers
                • Especially for RDFa and microdata
        • See also
            – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
            – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
    19. Exponential growth in RDFa data
        [Figure: Percentage of URLs with embedded metadata in various formats. Annotations: five-fold increase between March 2009 and October 2010; another five-fold increase between October 2010 and January 2012.]
    20. Option 3: SPARQL endpoints
        • An API for accessing RDF databases on the Web
            – A query language and an HTTP protocol
        • Advantages:
            – Flexible access: make any query you want
            – Also possible to expose a traditional RDBMS via a wrapper
        • Disadvantages:
            – For the publisher: cost of supporting arbitrary queries
            – For the search engine: discovery of SPARQL servers is unsolved
        • Tools:
            – Triple stores
                • Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.
            – RDB-to-RDF mappers such as D2RQ and Triplify
        [Figure: RDF graph. Nodes: #PeterM, #Bud, #Hun; literals: “Peter Mika”, “Budapest”, “2,000,000”; properties: label, born, population, capital-of.]
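
The "query language and an HTTP protocol" pairing can be illustrated by building a SPARQL protocol request by hand: the query text travels in the `query` parameter of a GET request, and the Accept header asks for JSON results. The endpoint URL below is a hypothetical placeholder and the query is only an example; the request is built but not sent.

```python
import urllib.parse
import urllib.request

# Example query: names of up to ten things that have a foaf:name.
QUERY = """
SELECT ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request (query passed as ?query=...)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


# Hypothetical endpoint; substitute the address of a real SPARQL server.
req = build_sparql_request("http://example.org/sparql", QUERY)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would execute the query against a live endpoint, which is where the publisher's cost of supporting arbitrary queries comes in.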
    21. Example: DBpedia
        • demo
    22. Crawling the Semantic Web
        • Linked Data
            – Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
            – Semantic Sitemap/VoID descriptions
        • RDFa
            – Same as HTML crawling, but data is extracted after crawling
            – Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
        • SPARQL endpoints
            – Endpoints are not linked, need to be discovered by other means
            – Semantic Sitemap/VoID descriptions
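
For the Linked Data case, the link-extraction step might look like the sketch below. It is a simplification that assumes well-formed RDF/XML with absolute URIs: it reads only the rdf:about and rdf:resource attributes and ignores relative URIs, xml:base and other serializations. The sample document and its example.org URIs are invented.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_uris(rdf_xml):
    """Collect absolute HTTP(S) URIs from rdf:about and rdf:resource."""
    root = ET.fromstring(rdf_xml)
    uris = set()
    for element in root.iter():
        for attr in ("about", "resource"):
            value = element.get("{%s}%s" % (RDF_NS, attr))
            if value and value.startswith(("http://", "https://")):
                uris.add(value)
    return uris


# Hypothetical RDF/XML document as a crawler might receive it.
document = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/id/PeterM">
    <foaf:name>Peter Mika</foaf:name>
    <foaf:knows rdf:resource="http://example.org/id/roi"/>
  </foaf:Person>
</rdf:RDF>"""

frontier = sorted(extract_uris(document))
print(frontier)
```

The resulting list plays the role of the crawl frontier: each URI would be dereferenced in turn and parsed the same way.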
    23. Data fusion
        • Ontology (schema) matching
            – Widely studied in Semantic Web research
                • ontologymatching.org
        • Entity resolution
            – Finding links between datasets
            – Tools: SILK, LIMES
        • Blending
            – Merging objects that represent the same real-world entity and reconciling information from multiple sources
        • Cleaning
            – Google Refine
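
As a rough illustration of the blending step, the sketch below merges records that have already been linked by sameAs pairs, using a small union-find structure. It does not discover the links (that is what tools such as SILK and LIMES are for), it keeps all values rather than reconciling conflicts, and it assumes every URI in a link also appears in the records. The records reuse labels from the earlier linking slide, but how they are connected here is an assumption.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's group, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def blend(records, same_as):
    """Merge the properties of records connected by sameAs links."""
    parent = {uri: uri for uri in records}
    for a, b in same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # Use the lexicographically smaller URI as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    merged = defaultdict(lambda: defaultdict(set))
    for uri, properties in records.items():
        for key, value in properties.items():
            merged[find(parent, uri)][key].add(value)
    return {
        uri: {key: sorted(values) for key, values in props.items()}
        for uri, props in merged.items()
    }


# Illustrative records from two sources describing the same person.
records = {
    "example:roi": {"name": "Roi Blanco"},
    "#roi2": {"name": "Roi Blanco", "worksWith": "#peter"},
    "#peter": {"email": "pmika@yahoo-inc.com"},
}
same_as = [("example:roi", "#roi2")]

result = blend(records, same_as)
print(result)
```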
    24. More info
        • Ideas for hacks
            – http://challenge.semanticweb.org/
            – http://iswc2011.semanticweb.org/calls/linked-data-a-thon/
        • Book
            – Segaran, Evans and Taylor. Programming the Semantic Web. O’Reilly, 2009.
        • More tools
            – Exhibit: faceted browsing and other visualizations
            – http://www.dajobe.org/talks/200906-semtech-open/
            – LOD2 stack (stack.lod2.eu)