This was a lecture I presented at Professor Stuart Madnick's class, "Evolution Towards Web 3.0" at the MIT Sloan School of Management on April 21, 2011. Please follow along with the speaker notes which add significant commentary to the slides.
1. Evolution Towards Web 3.0: The Semantic Web Experiences and Challenges on the Web and Inside Enterprises Lee Feigenbaum VP Technology & Client Services, Cambridge Semantics Co-chair W3C SPARQL Working Group lee@cambridgesemantics.com for “Evolution Towards Web 3.0”, April 21, 2011
2. Agenda How did we get here? Semantic Web: What and why How is it used today? Semantic Web challenges
3. Acknowledgement Much material used gratefully with permission of Tim Berners-Lee. All opinions and conclusions are Lee Feigenbaum’s.
4. Web Evolution 1992 1993 1994 Widespread success of Web 1.0 IMDB.com PizzaHut.com Whitehouse.gov Lycos.com Universality: anything can link to anything Push information to users Debut of Mosaic browser 1st image on the Web
5. Web Evolution 1994 1999 2004 2006 Web 1.0 is “here”. IE7 has 1st complete AJAX stack First Web 2.0 Conference Highlights User-Generated Content
7. Building Silos Web 2.0: The silo is the application Image originally from March 2008 issue of The Economist and used with permission of creator David Simonds
15. Web Evolution 1994 2004 2001 2007 2009 Web 1.0 is “here”. Web 2.0 is “here”. Semantic Web consumers include Google & Yahoo! Semantic Web publishers include Best Buy, NY Times, US and UK gov’ts
17. “The Semantic Web” Link explicit data on the World Wide Web in a machine-readable fashion …government data …commercial data …social data In order to enable… …targeted, semantic search …data browsing …automated agents Semantic Web – 1st view World Wide Web : Web pages :: The Semantic Web : Data
18. “Semantic Web technologies” A family of technology standards that ‘play nice together’, including: Flexible data model Expressive ontology language Distributed query language Drive Web sites, enterprise applications Data integration Business intelligence Large knowledgebases … Semantic Web – 2nd view The technologies enable us to build applications and solutions that were not possible, practical, or feasible traditionally.
20. Semantic Web Web of Data Giant Global Graph Data Web Web 3.0 Linked Data Web Semantic Data Web Enterprise Information Web Branding
21. Value propositions On the Web, the Semantic Web is about moving from linking documents to linking data What’s the value proposition within the enterprise?
26. Semantic Web Paradigm: Coping with Change The World Changes Traditionally: Change is costly Semantics: Change is cheap RDB 1 RDB 2
27. Integrated Enterprise Data Data Silos (structured, semi-structured, unstructured data) Excel Email MySQL Sybase Oracle …At and Beyond Enterprise Scale
36. Semantic Web In Use: Social Data People, relationships Friend Of A Friend (“FOAF”) – foaf:knows Self-published or site-published (LiveJournal, hi5, …) Blogs, discussion forums, mailing lists Semantically Interlinked Online Communities (“SIOC”) Plug-ins for popular blogging & CMS platforms Calendars, vCards, reviews, … One-offs Why don’t we have portable social networks? Yet?
44. Semantic Web In Use: Enterprises on the Web Thesis: Describe your business more precisely and drive more (and better) traffic to your site Example: NYTimes publishes their article classification scheme as linked data Example: Best Buy, Overstock.com use RDFa to annotate product listings
45. Measurable Results 30% increase in search-engine traffic 15% increase in click-through-rate for search ads
46. Many and Varied Applications Across Industries Health care and pharma integration, classification, ontologies Oil & Gas integration, classification Finance structured data, ontologies, XBRL Publishing metadata Libraries & museums metadata, classification IT rapid application development & evolution Semantic Web In Use: Inside the Enterprise
47. Targeting High-Potential Opportunities in Pharma . . . Profile Territory Preferred targets Regional Analyst Per-analyst relevance filter Universe of considered opportunities High-potential opportunities Mobile device
48. Delivering Dynamic, Data-driven Websites “ The development of this new high-performance dynamic semantic publishing stack is a great innovation for the BBC as we are the first to use this technology on such a high-profile site. It also puts us at the cutting edge of development for the next phase of the Internet, Web 3.0.
49. Semantic Web In Use: Government data Since January 2010, 2,500 (large) datasets published as Linked Data Since May 2009, 250,000 (smaller) datasets published (CSV, XML, …) RPI project to convert datasets to Linked Data
50. Tim Berners-Lee @ TED2010 http://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide.html
53. Companies range from small, family-owned businesses to massive global conglomerates. But the challenges faced by even the largest corporation pale in comparison to the scope of the challenges of building a world-wide Semantic Web.
54. Economic Model What sustains Semantic Web applications in industry? What sustains the Linked Data Web? Are there viable economic models for Linked Data?
55. Big Issue: Motivation Retailers have clear motivation to put their data on the Web. But… …what if your business is data? Thomson Reuters, Bloomberg, … …what if your business is your application? Facebook, LinkedIn, Yelp, …
58. Data Quality – Two Issues What ensures data quality on the Linked Data Web? Enterprises spend millions on data quality already Knowledge management Master data management Governance and curation processes …though data quality issues do seep in when enterprises use Semantic Web to link to partners and public sources of data!
59. Trust How do we know which contributions to the Linked Data Web to trust? Trust (distrust) the contributors? Trust (distrust) the contributions? Trust (distrust) the process? How is trust established within an enterprise’s Linked Data Web?
60. Adoption Suggestion: Progress towards enterprise linked data requires far fewer people to embrace Semantic Web technologies than a global Linked Data Web does
61. Other Challenges Data licensing Open world assumption Unique name assumption Temporal data What other challenges can you think of?
Happy to go off track and follow the thread of any interesting questions and discussion that arise as we go.
See http://www.w3.org/2009/Talks/0427-web30-tbl/ for Tim Berners-Lee’s take on many of these same themes.
I’m not going to dwell on this, because everyone in this class by now surely has a deeper and more sophisticated understanding of how we got to where we are. But looking at the steps to this point in the context of a timeline may help us understand the current Semantic Web landscape.
Two key characteristics of the birth and success of Web 1.0:
* From the very beginning, the Web was founded on the democratic principle that no node in the Web is privileged: anyone can link to anyone.
* There were (relatively speaking) very few data publishers on the Web initially. Most users browsed only to consume information.
The AJAX technology stack allowed developers to create mature Web applications (approaching parity with fat-client applications) rather than (only) Web pages. It also began allowing Web content to be repurposed to applications beyond the browser (desktop, embedded devices, mobile devices, …).
Eventually, these Web applications began allowing Web users to contribute to parts of the Web rather than (only) consume Web pages.
Beginning in 2002, Web thought leaders (esp. Dale Dougherty, Tim O’Reilly, John Battelle) began referring to the confluence of user-generated content, Web-as-platform, social Web, read-write Web, wisdom of crowds, … as Web 2.0.
At the physical level, computers are connected to switches, routers, etc. – network links.
The Internet directly links machines by abstracting away the network-link boundaries.
Each computer participating in the Web (Web server) is providing access to many documents (Web pages). The Web lets us make links between these documents.
The Web lets us abstract away the computers and the Internet and focus on the linked documents.
But people are rarely interested in the documents. They’re interested in the information—the data—within the documents.
The Semantic Web abstracts away the documents (the sources of the information), and leaves us with data linked together. “Linked Data”, “Web of Data”, etc. This is the “Web” part of Semantic Web.
It also gives us the tools to understand the Web of data and bring structure (“understanding”) to it. This is the “semantics” part of Semantic Web.
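To make the "data linked together" idea concrete, here is a minimal sketch in Python. It is not any particular Semantic Web library: it simply assumes that a Web of data can be pictured as a set of (subject, predicate, object) statements. The FOAF property URIs are the real ones; every example.org identifier and the helper names `match` and `names_of_friends` are made up for illustration.

```python
# Illustrative only: a "Web of data" reduced to a set of
# (subject, predicate, object) triples. The example.org URIs are made up.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

triples = {
    ("http://example.org/alice", FOAF_NAME, "Alice"),
    ("http://example.org/alice", FOAF_KNOWS, "http://example.org/bob"),
    ("http://example.org/bob", FOAF_NAME, "Bob"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def names_of_friends(graph, person):
    """Follow foaf:knows links, then look up each friend's foaf:name."""
    friends = [o for _, _, o in match(graph, s=person, p=FOAF_KNOWS)]
    return [
        name
        for friend in friends
        for _, _, name in match(graph, s=friend, p=FOAF_NAME)
    ]


print(names_of_friends(triples, "http://example.org/alice"))
```

The point of the sketch is that the query follows links between data items (Alice knows Bob, Bob has a name) without caring which document or server each statement came from.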
Tim Berners-Lee first used the term Semantic Web to describe a vision for the future of the Web as early as the first WWW conference in 1994.
Along with Jim Hendler and Ora Lassila, he laid out this vision in a 2001 article in Scientific American.
In 2007, the birth of the Linking Open Data project saw the first real concerted efforts to build out the Semantic Web by publishing data sets on the Web that could be queried and linked to one another.
2008-2010 saw significant uptake in Semantic Web support on the Web and inside enterprises, highlighted by support from Google and Yahoo! and data from Best Buy, the NY Times, and the US and UK governments. (Also: RDFa support in Drupal 7 and the Facebook Open Graph Protocol (2010).)
This is a long time span, and yet many (myself included) would hesitate to say that “the Semantic Web (Web 3.0) is here.” When will that day come? How do we tell?
What’s been happening this whole time? (Between the introduction of the vision and today.) A lot of technology, standards, tool, and product development. Also, a lot of advocacy.
This is the ultimate vision as per the original Scientific American article. Referred to last week as the “top-down approach”.
Many of the people that have been building the technologies, standards, and tools are doing so with these ends in mind. They have (disruptive, game-changing) problems today and these technologies provide a way to solve them today.
Different nuances, but the same actual thing. Still, you can often tell a lot about someone’s view of the Semantic Web based on the terms they choose to use to describe it. Linked Data Web has been – relatively speaking – successful in gaining traction.
Ideas?
* Incremental value – improved efficiency
* Link/integrate beyond traditional enterprise sources – greater value, more appealing partner
* Shadow data (emails, documents, spreadsheets, presentations, …)
* Partner data (upstream/downstream supply chain, customers, partners, channels, …)
* Needle in haystack (reasoning, inference, rules) – greater value
* Reach – improved efficiency
(This slide is best told with animation in the original PowerPoint.)
The Semantic Web paradigm allows new and updated data to be brought “into the fold” incrementally, without starting over. This makes it particularly amenable to changing requirements.
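One way to picture why change is cheap is the following Python sketch. It assumes (my assumption, not a description of any product) that each source's rows are flattened into subject/property/value statements, so that adding a second source or a new property is just a set union rather than a schema migration. The compound identifiers, column names, and values are all invented.

```python
def rows_to_triples(rows, id_key, base):
    """Flatten dict-shaped rows into (subject, property, value) triples."""
    out = set()
    for row in rows:
        subject = base + str(row[id_key])
        for column, value in row.items():
            if column != id_key and value is not None:
                out.add((subject, column, value))
    return out


def describe(graph, subject):
    """Collect everything currently known about one subject."""
    return {p: o for s, p, o in graph if s == subject}


BASE = "http://example.org/compound/"  # hypothetical namespace

# "RDB 1" only knows names; "RDB 2" arrives later with a new property.
rdb1 = [{"id": 1, "name": "compound-A"}]
rdb2 = [{"id": 1, "ic50_nM": 120}, {"id": 2, "name": "compound-B"}]

graph = rows_to_triples(rdb1, "id", BASE)
graph |= rows_to_triples(rdb2, "id", BASE)  # no schema change, just more triples

print(describe(graph, BASE + "1"))
```

Nothing about the first source had to be reloaded or redesigned when the second one arrived; the new statements about compound 1 simply attach to the same identifier.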
Databases that traditionally manage enterprise data are IT artifacts. They’re crafted by IT, for IT: asking scientists or other business domain experts to understand a relational model with scores of tables, IDs, key/value tables, unused columns, etc. is completely unrealistic.
The semantic model is a conceptual model. It eschews IDs, keys, etc. in favor of concepts and relationships expressed/expressible in human language. This is reflected in software that is built with Semantic Web data. This means that when a researcher is linking their results spreadsheet, they’re dealing only in concepts that they’re familiar with (organism, cell line, % inhibition, 4P, IC50, etc.). And that in turn means that this approach works regardless of whatever spreadsheet layout a particular collaborator is using: researchers can continue using their current spreadsheets, with no change.
We’re not yet at the point where the Semantic Web is a magic crank. It’s not yet:
* An automated way for pharmaceutical companies to discover new drugs for their pipelines
* An automated way for oil and gas companies to identify productive drilling locations
* The (generic, intelligent) travel butler, or other autonomous Web-based agent
But nevertheless, a lot of people are embracing linked data in a lot of ways, and a lot of companies are using Semantic Web technologies and a linked data approach successfully today. What follows are some examples.
Web 1.0 and Web 2.0(+) are core parts of our lives, from reading CNN.com to buying things on amazon.com to Facebook and Twitter and Web-delivered mobile apps for scanning bar codes, looking up music, etc.
Web 3.0 is not so obvious. The answer to the question is “at least occasionally, but you probably never see it.” We’ll see some examples of where you might be seeing the fringes of Web 3.0 in the coming slides, including:
* Facebook Open Graph Protocol
* Drupal
* RDFa in search results with Google and Yahoo!
* BBC World Cup site
* …
Possible answers:
* Few people are driven by data ownership, data portability
* People are drawn to specific sites
* People _want_ to segment their online profiles (cf. Facebook vs. LinkedIn)
Drupal, which runs 1% of the world’s Web sites, is on the leading edge of adoption of the Semantic Web for content-driven sites. Drupal 7 exposes the semantics of Drupal sites’ natural structures to Google/Yahoo! with RDFa. There are also modules for SIOC and Facebook OGP.
The key point here is that though FB published this protocol, it relies on open Semantic Web standards (RDFa) that anyone else can consume. The same semantics allow people to link the “Like” button to the type of artifact being liked (movie, here) and also can allow search engines to give more structure, query engines to find more data, etc.
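As a rough illustration of "anyone else can consume" this markup, the sketch below uses only Python's standard `html.parser` to pull Open Graph properties out of a page. The sample HTML, its values, and the class name `OpenGraphParser` are invented for the example; a real consumer would more likely use a full RDFa parser, which this is not.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


# Made-up sample page; the values are illustrative only.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="An Example Film" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="http://example.org/films/1" />
<meta name="description" content="Not an Open Graph property" />
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(SAMPLE_PAGE)
print(parser.properties["og:type"])
```

The same few lines of markup serve the "Like" button, a search engine, or a third-party script like this one, which is the practical payoff of building on an open standard.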
Image courtesy of http://bio2rdf.org/. Scientific data makes up a significant portion of the current Linked Data Web. This is information on proteins and genes, pathways and sequences, chemistry and genetics, … This diagram shows some of the information available and how it’s linked together. Nodes are sized according to their quantity of data, and links are sized according to the quantity of links.
Google (Rich Snippets) and Yahoo! (originally SearchMonkey) consume semantic markup to enhance search listings.
Many enterprise uses of Semantic Web / Linked Data are highlighted at: http://www.w3.org/2001/sw/sweo/public/UseCases/
Question: Where in this scenario do you think Semantic Web concepts and technologies are being employed? What would the alternative be?
Answers: integrating data to get as large a universe as possible; rules and reasoning to intelligently filter the data
Combine manual tagging with ontology-driven reasoning and ontology-driven dynamic aggregation (700 index pages, more than the rest of the sports site combined) to produce a dynamic, cross-indexed, cross-linked, useful site for the World Cup.
What is the semantic value here?
* Produce an information-rich site at many levels of aggregation (player, team, geography, group, …) without employing a large fleet of editors to curate the site’s _content_. Instead, maintain an ontology and provide a content tagging process.
* Use the ontology to help automate the tagging process (forward-chaining inference based on taxonomies)
For more details:
http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
http://www.bbc.co.uk/blogs/bbcinternet/2010/07/the_world_cup_and_a_call_to_ac.html
Other governments have similar efforts: Australia, Sweden, New Zealand, …, and various local governments.
At TED2010, Tim Berners-Lee reported back on one year’s worth of progress after the push for raw data began in 2009.
Q: What's special about Semantic Web / Linked Data here? What would be different if this were all put out using "Web 2.0" approaches?
* baked into the Web -- _easy to publish and consume via the existing Web infrastructure, flexible, heterogeneous_
* "semantic" -- _easy for 3rd parties to understand - no screen scraping, "guessing" - lots you can do with it (layer cake)_
It’s not all sunshine, rainbows, and puppies… (This slide is better with animation, sorry!)
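The idea of forward-chaining over a taxonomy can be sketched in a few lines of Python. This is not the BBC's implementation, only an illustration under my own assumptions: the ontology is reduced to a "broader concept" map, an editor tags each story with one specific concept, and index pages at every level of aggregation are derived automatically. All concept names and story titles are invented.

```python
# Hypothetical mini-ontology: each concept points to its broader concepts.
BROADER = {
    "player:example-striker": {"team:example-fc"},
    "team:example-fc": {"group:a", "country:exampleland"},
}


def infer_tags(manual_tags, broader):
    """Forward-chain: keep adding broader concepts until nothing new appears."""
    tags = set(manual_tags)
    frontier = set(manual_tags)
    while frontier:
        derived = set()
        for tag in frontier:
            derived |= broader.get(tag, set())
        frontier = derived - tags  # only concepts not seen before
        tags |= frontier
    return tags


def build_index(articles, broader):
    """Map every concept (manual or inferred) to the stories it should list."""
    index = {}
    for title, manual_tags in articles.items():
        for tag in infer_tags(manual_tags, broader):
            index.setdefault(tag, []).append(title)
    return {tag: sorted(titles) for tag, titles in index.items()}


articles = {
    "Striker scores twice": {"player:example-striker"},
    "Group A preview": {"group:a"},
}

index = build_index(articles, BROADER)
print(index["group:a"])
```

An editor only tagged the first story with a player, yet it also appears on the team, country, and group pages, which is the kind of aggregation that would otherwise need manual curation.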
Industry: no different from any other investment – expect to see ROI, whether in the form of time-to-market, competitive advantage, greater efficiencies, lessened resource requirements, etc. Look for disruptive (10x) improvements.
Linked Data Web:
* Putting raw data on the Linked Data Web takes work.
* Scientific data is funded by government money, with requirements for openness
* Commercial data is driven by ROI (cf. Best Buy’s experience)
* Government data is tricky: it is at the whims of politics. (cf. data.gov.uk with the change from Gordon Brown to David Cameron)
* Maintaining links between data sets is tricky. Is it any trickier than building the document Web? (Maybe not.)
Another example – NY Times, while embracing the Linked Data Web to some extent, is putting their real content behind a pay wall.
Image copyright Scott Brinker, with attribution to http://www.chiefmartec.com/2010/01/7-business-models-for-linked-data.html . See also http://www.ldodds.com/blog/2010/01/thoughts-on-linked-data-business-models/
Advertising is hard when people aren’t the consumers and when all data is semantically identified! (Advertising via Ts&Cs possible)
A large (Fortune 100) company might have 10,000 databases. And some of those databases might be huge – 10 TB or bigger. But large enterprises are also sub-segmentable in ways in which the Web is not. There are divisional, departmental, geographic problems that can be solved as if solving the problems of a much smaller enterprise.
There are social challenges (some of which are covered elsewhere in this section of the talk), but there are also pure technical challenges when working at Web scale:
* Distributed query
* Cache invalidation
* Link rot
* Data rot (Linked CT example)
* Rules / reasoning across data sets
(While this is a challenge for being able to fully exploit the Linked Data Web, it’s also an opportunity – before the Linked Data Web, there was little opportunity to find and improve these sorts of data quality issues. Linked Data gives us visibility into these data issues so that source data can be improved. But it is still a challenge to figure out a model for improving and verifying data quality before individual human interpretation can be removed from the chain.)
Possible ideas:
* Up-vote/down-vote for data and data sets (wisdom of crowds)
* Build agents off of authoritative (1st-party) sources
* Certified sources, audited sources, regulated sources…
Potential approaches to trust:
* Digital signatures
* Social network analysis
* Multiple assertions of the same fact (voting, data quality all over again)
* Provenance (how did we arrive at this data assertion)
* Trust the contributions – it’s data quality all over again! (specific facts, sets of facts, entire data sets)
Related issue: uncertainty
Within an enterprise: accepted sources of authority; default trust state
Enterprises can derive incremental value via a small number of Semantic Web vendors and Semantic Web knowledgeable system integrators (SIs). To gain traction at Web scale, however, requires the world of Web 1.0 & 2.0 (LAMP, JSON, …) developers to adopt these new (and arguably more complex) technologies.
Only 9% of Linked Open Data datasets include machine-readable license information. (http://ivan-herman.name/2011/03/29/ldow2011-workshop/)
Some links for further reading:
http://www.opendefinition.org/guide/data/
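As one hedged illustration of the "multiple assertions of the same fact" idea, the Python sketch below accepts a statement only when enough distinct sources assert it. The threshold, the source names, and the `ex:` identifiers are all assumptions made for the example; real trust models would also weigh provenance, signatures, and the reputation of each source.

```python
from collections import defaultdict


def corroborated(assertions, min_sources=2):
    """Return the facts asserted by at least `min_sources` distinct sources."""
    sources_by_fact = defaultdict(set)
    for source, fact in assertions:
        sources_by_fact[fact].add(source)
    return {
        fact
        for fact, sources in sources_by_fact.items()
        if len(sources) >= min_sources
    }


# Invented (source, triple) pairs for illustration.
assertions = [
    ("source-a", ("ex:drug1", "ex:treats", "ex:condition1")),
    ("source-b", ("ex:drug1", "ex:treats", "ex:condition1")),
    ("source-a", ("ex:drug1", "ex:treats", "ex:condition2")),
    ("source-a", ("ex:drug1", "ex:treats", "ex:condition2")),  # same source repeating itself
]

print(corroborated(assertions))
```

Counting distinct sources rather than raw assertions matters here: one publisher repeating a claim does not make it better supported, which is why the second fact is not accepted.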