Semantic Web and



Talk given at SemTech 2014 (and earlier, at ISWC 2013) on the evolution of the Semantic Web and

Published in: Internet, Software, Technology

  1. What a long, strange trip it’s been R.V. Guha, Google
  2. Outline of talk • The context – How did we end up where we are • – What it is, status of adoption – principles, how does it work • Looking ahead – Next Generation Applications
  3. About 18 years ago, … • People started thinking about structured data on the web – A few people from Netscape, Microsoft and W3C got together @MIT • Trying to make sense of a flurry of activity/proposals – XML, MCF, CDF, Sitemaps, … • There were a number of problems – PICS, Meta data, sitemaps, … • But one unifying idea
  4. Context: The Web for humans Structured Data Web server HTML
  5. Goal: Web for Machines & Humans Structured Data Web server Apps
  6. What does that mean? [Diagram: Chuck Norris, with type Actor, birthplace Ryan, Oklahoma, birthdate March 10th 1940] • Notable points – Graph Data Model – Common Vocabulary
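
As a rough illustration of the graph data model on this slide, the sketch below stores the Chuck Norris example as (subject, property, value) triples. The `describe` helper and the use of plain strings as identifiers are assumptions made for illustration; they are not from the talk.

```python
# Each edge of the graph is one (subject, property, value) triple.
triples = [
    ("Chuck Norris", "type", "Actor"),
    ("Chuck Norris", "birthplace", "Ryan, Oklahoma"),
    ("Chuck Norris", "birthdate", "March 10th 1940"),
]


def describe(subject, graph):
    """Collect every property/value pair attached to one node."""
    return {prop: value for s, prop, value in graph if s == subject}


print(describe("Chuck Norris", triples))
```

The shared property names (`type`, `birthplace`, `birthdate`) play the role of the common vocabulary: any consumer that knows them can read the graph from any site.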
  7. How do we get there? • How does the author give us the graph – Data Model: Graph vs tree vs … – Syntax – Vocabulary – Identifiers for objects • Why should the author give us the graph?
  8. Going depth first • Many heated battles – Lots of proposals, standards, companies, … • Data model – Trees vs DLGs vs Vertical specific vs who needs one? • Syntax – XML vs RDF vs JSON vs … • Model theory anyone – We need one vs who cares vs what’s that?
  9. Timeline of ‘standards’ • ’96: Meta Content Framework (MCF) (Apple) • ’97: MCF using XML (Netscape) → RDF, CDF • ’99 -- : RDF, RDFS • ’01 -- : DAML, OWL, OWL EL, OWL QL, OWL RL • ’03: Microformats • And many many many more … SPARQL, Turtle, N3, GRDDL, R2RML, FOAF, SIOC, SKOS, … • Lots of bells & whistles: model theory, inference, type systems, …
  10. But something was missing … • Fewer than 1000 sites were using these standards • Something was clearly missing and it wasn’t more language features • We had forgotten the ‘Why’ part of the problem • The RSS story
  11. ’07 – : Rise of the consumers • Yahoo! Search Monkey, Google Rich Snippets, Facebook Open Graph • Offer webmasters a simple value proposition • Search engines to webmasters: – You give us data … we make your results nicer • Usage begins to take off – 1000x increase in marked-up pages in 3 years
  12. Yahoo Search Monkey • Give websites control over snippet presentation • Moderate adoption – Targeted at high end developers – Too many choices
  13. Google Rich Snippets: Reviews
  14. Google Rich Snippets: Events
  15. Google Rich Snippets • Multi-syntax • Ad hoc vocabulary for each vertical • Very clear carrot • Lots of experimentation on UI • Moderately successful: 10ks of sites • Scaling issues with vocabulary
  16. Situation in 2010 • Too many choices/decisions for webmasters – Divergence in vocabularies • Too much fragmentation • N versions of person, address, … • A lot of bad/wrong markup – ~25% for microformats, ~40% with RDFa – Some spam, mostly unintended mistakes • Absolute adoption numbers still rather low – Less than 100k sites
  17. • Work started in August 2010 – Google, Yahoo!, Microsoft & then Yandex • Goals: – One vocabulary understood by all the search engines – Make it very easy for the webmaster • It is A vocabulary. Not The vocabulary. – Webmasters can use it together with other vocabs – We might not understand the other vocabs. Others might
  18. Major sites • News: Nytimes, … • Movies: imdb, rottentomatoes, … • Jobs / careers: … • People: … • Products: … • Videos: youtube, dailymotion, … • Medical: … • Local: … • Events: …, eventful • Music: …
  19. principles: Simplicity • Simple things should be simple – For webmasters, not necessarily for consumers of markup – Webmasters shouldn’t have to deal with N namespaces • Complex things should be possible – Advanced webmasters should be able to mix and match vocabularies • Syntax – Microdata, usability studies – RDFa, JSON-LD, …
  20. principles: Simplicity • Can’t expect webmasters to understand Knowledge Representation, Semantic Web Query Languages, etc. • It has to fit in with existing workflows – A posteriori ‘markup tools’ don’t work • Avoid KR system driven artifacts – Multiple domain / range for attributes – No classes like ‘Agent’ – Categories and attributes should be concrete
  21. principles: Simplicity • Copy and edit as the default mode for authors – It is not a linear spec, but a tree of examples • Vocabularies – Authors only need to have local view – But tries to have a single global coherent vocabulary
  22. principles: Incremental • Started simple – ~ 100 categories at launch • Applies to every area – Add complexity after adoption – now ~1200 vocab items – Go back and fill in the blanks • Move fast, accept mistakes, iterate fast
  23. Principles: URIs • ~1000s of terms like Actor, birthdate – ~10s for most sites – Common across sites • ~10ks of terms like USA – External enumerations • ~1b-100b terms like Chuck Norris and Ryan, Oklahoma – Cannot expect agreement on these – Reference by description – Consumers can reconcile entity references [Diagram: Chuck Norris, with type Actor, birthdate March 10th 1940, citizenOf USA, birthplace Ryan, Oklahoma]
  24. [Diagram: reconciling references by description] An Actor named Chuck Norris (birthdate March 10th 1940, citizenOf USA, birthplace: a city named Ryan in the state OK) + An Actor named Chuck Norris (birthdate March 10th 1940, spouse: a Person named Geena O’Kelley) = Chuck Norris (type Actor, birthdate March 10th 1940, citizenOf USA, birthplace Ryan, Oklahoma, spouse Geena O’Kelley)
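
The reconciliation shown on this slide can be sketched roughly as follows: two sites describe the same person without sharing an identifier, and a consumer merges the descriptions when they agree on a few identifying properties. The matching rule (`IDENTIFYING_KEYS`), the helper names, and the dictionary layout are all hypothetical; the talk does not say how consumers actually reconcile references.

```python
# Hypothetical choice of identifying properties; a real consumer would
# need richer and fuzzier matching than exact equality.
IDENTIFYING_KEYS = ("type", "name", "birthdate")


def same_entity(a, b, keys=IDENTIFYING_KEYS):
    """True when both descriptions carry and agree on every identifying key."""
    return all(k in a and a[k] == b.get(k) for k in keys)


def merge(a, b):
    """Union of two descriptions; values from `a` win on conflict."""
    merged = dict(b)
    merged.update(a)
    return merged


# Description published by one site.
site_a = {
    "type": "Actor",
    "name": "Chuck Norris",
    "birthdate": "1940-03-10",
    "citizenOf": "USA",
    "birthplace": {"type": "City", "name": "Ryan", "state": "OK"},
}
# Description published by another site.
site_b = {
    "type": "Actor",
    "name": "Chuck Norris",
    "birthdate": "1940-03-10",
    "spouse": {"type": "Person", "name": "Geena O'Kelley"},
}

combined = merge(site_a, site_b) if same_entity(site_a, site_b) else None
print(combined)
```

Neither site had to agree on an identifier for Chuck Norris; the consumer does the reconciliation, which is the point of referring to entities by description.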
  25. Principles: Collaborations • Most discussions on public W3C lists • Work closely with interest communities • Work with others to incorporate their vocabularies – We give them attribution on – Webmasters should not have to worry about where each piece of the vocabulary came from – Webmasters can mix and match vocabs
  26. Principles: Collaborations • IPTC /NYTimes / Getty with rNews • Martin Hepp with Good Relations • US Veterans, Whitehouse, with Job Posting • Creative Commons with LRMI • NIH National Library of Medicine for Medical vocab. • Bibextend, Highwire Press for Bibliographic vocabulary • Benetech for Accessibility • BBC, European Broadcasting Union for TV & Radio schema • Stackexchange, SKOS group for message board • Lots and lots and lots of individuals
  27. Principles: Partners • Partner with Authoring platforms – Drupal, WordPress, Blogger, YouTube • Drupal 8 – markup for many types • News articles, comments, users, events, … – More types can be created by site author – Markup in HTML5 & RDFa Lite – Will come out early 2015
  28. Recent Additions • From Nouns to Verbs: Actions – Object → potential actions – Constraints on actions – E.g., ThorMovie → Stream, Buy, … • Introducing time: Roles – E.g., Joe Montana played for the SF 49ers from 1979 to 1992 in the position Quarterback
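
A minimal sketch of the Role idea, using the Joe Montana example from this slide: an intermediate node sits between the person and the team so that the relationship itself can carry dates and a position. The type and property names (`OrganizationRole`, `memberOf`, `startDate`, `endDate`, `roleName`) follow my reading of the schema.org Role pattern and should be checked against the current vocabulary before use; building the JSON-LD from a Python dictionary is just a convenience for the example.

```python
import json

# The Role node wraps the plain person -> team relationship so it can
# hold the time span and the position played.
montana = {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "Joe Montana",
    "memberOf": {
        "@type": "OrganizationRole",
        "memberOf": {"@type": "SportsTeam", "name": "San Francisco 49ers"},
        "startDate": "1979",
        "endDate": "1992",
        "roleName": "Quarterback",
    },
}

# Serialize to JSON-LD text that could be embedded in a page.
markup = json.dumps(montana, indent=2)
print(markup)
```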
  29. Recent Additions • Scholarly work, Comics, Serials, … • Communications: TV, Radio, Q&A, … • Accessibility • Commerce: Reservations • Sports • Buyer/Seller, etc. • BibTeX • The ontology is growing … – ~800 properties – ~600 classes
  30. Looking forward • is doing better than we expected – Thanks to millions of webmasters! • But this is not the final goal – Just the means to the next generation of applications • First generation of applications – Rich presentation of search results • Many new applications – Related to search and beyond
  31. Newer Applications: Knowledge Graph
  32. Newer Applications: Knowledge Graph
  33. Non search applications: Google Now User profile ( + structured data feeds
  34. Pinterest: for Rich Pins
  35. Reservations → Personal Assistant • OpenTable website → confirmation email → Android Reminder
  36. Vertical Search • Structured data in search – Web search: annotate search results OR – Filtering based on structured data • Only in specialized corpus • Ecommerce, real estate, etc. • How about filtering based on structured data across the web?
  37. Google Rich Snippets: Recipe View
  38. Web scale vertical search • Searching for Veteran friendly jobs
  39. Web Scale custom vertical search • Build your own custom vertical search engine – Google does the heavy lifting: crawling, indexing, etc. – You specify the restricts – APIs to help build your own UI • Searches over all pages on the web with a certain markup • Demo
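
To make "you specify the restricts" concrete, here is a toy sketch of filtering pages by the structured data found in their markup. Everything in it is hypothetical: the `pages` records stand in for a crawl, `search` is not a real API, and `veteranFriendly` is an invented property used only to echo the veteran-friendly jobs example.

```python
# Hypothetical crawl output: one record per page, holding the structured
# data items extracted from that page's markup.
pages = [
    {
        "url": "https://example.com/jobs/1",
        "items": [{"type": "JobPosting", "title": "Mechanic", "veteranFriendly": True}],
    },
    {
        "url": "https://example.com/jobs/2",
        "items": [{"type": "JobPosting", "title": "Clerk", "veteranFriendly": False}],
    },
    {
        "url": "https://example.com/recipes/9",
        "items": [{"type": "Recipe", "name": "Chili"}],
    },
]


def search(pages, item_type, **restricts):
    """Return URLs of pages that contain an item of `item_type` whose
    properties equal every given restrict."""
    hits = []
    for page in pages:
        for item in page["items"]:
            if item.get("type") == item_type and all(
                item.get(key) == value for key, value in restricts.items()
            ):
                hits.append(page["url"])
                break  # one matching item is enough for this page
    return hits


print(search(pages, "JobPosting", veteranFriendly=True))
```

In the setup the slide describes, the crawling and indexing behind `pages` would be done by the search engine; the builder of the vertical search engine would supply only the type and the restricts.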
  40. Scientific Data Publishing • US Govt alone spends over $60B/yr on scientific research • Primary output of most of this research is data – Most of the data is thrown away – All that is published are papers • We would like the data published in an easily reusable form
  41. Case study: Clinical Trials • Clinical trials • 4000+ clinical trials at any time in the US alone • Almost all the data ‘thrown away’ • All that gets published is a textual ‘abstract’ • Many of the trials are redundant • Earlier trials have the data • Assumptions, etc. cannot be re-examined • Longitudinal studies extremely hard, but super important • Having all the clinical trial data on the web, in a common schema will make this much easier!
  42. Case study: SkyServer • Huge amount of astronomy data • Jim Gray, NASA and others brought it all together, normalized it and made it available on the web • Has changed the way astronomy research takes place • Students in Africa getting PhDs without leaving Africa! • Radio/Ultra-violet/Visible light data easily brought together • Caveats • SQL biased, not distributed, not scalable • All normalization done by hand, once • Small number of data sources • But shows that it can be done …
  43. First steps for scientific data publication • OSTP directive for data from federally funded research to be freely available • Formation of new ‘Data Science’ institute inside NIH • Seeing traction in scientific data on the web • Lot of interest in creating schemas • Public repositories for scientific data starting
  44. Concluding • Structured data on the web is now ‘web scale’ • has got traction and is evolving • The most interesting applications are yet to come
  45. Questions?