Semantic markup with schema.org:
helping search engines understand the Web
PRESENTED BY Peter Mika, Director of Research, Yahoo Labs | March 26, 2015
Real problem
What is it like to be a machine?
Roi Blanco
What is it like to be a machine?
[Slide graphic: a page of random symbols, illustrating how text on the Web looks to a machine without semantic markup]
What can we do?
▪ Improve Information Retrieval
› Harder and harder given the same data
• Exploited term-based relevance models, hyperlink structure and interaction data
• Combination of features using machine learning
• Heavy investment in computational power
– real-time indexing, instant search, datacenters and edge services
▪ Improve the Web
› Make the Web more searchable?
The Semantic Web (2001-)
▪ Part of Tim Berners-Lee’s original proposal for the Web
▪ Beginning of a research community
› Formal ontology
› Logical reasoning
› Agents, web services
▪ Rough start in deployment
› Misplaced expectations
› Lack of adoption
▪ “The Semantic Web”, Scientific American, May 2001
▪ “At the doctor's office, Lucy instructed her
Semantic Web agent through her handheld Web
browser. The agent promptly retrieved
information about Mom's prescribed treatment
from the doctor's agent, looked up several lists
of providers, and checked for the ones in-plan
for Mom's insurance within a 20-mile radius of
her home and with a rating of excellent or very
good on trusted rating services. It then began
trying to find a match between available
appointment times (supplied by the agents of
individual providers through their Web sites) and
Pete's and Lucy's busy schedules.”
▪ (The emphasized keywords indicate terms
whose semantics, or meaning, were defined for
the agent through the Semantic Web.)
Misplaced expectations?
Lack of adoption
▪ Standardization ahead of adoption
› URI, RDF, RDF/XML, RDFa, JSON-LD, OWL, RIF, SPARQL, OWL-S, POWDER …
▪ Chicken-and-egg problem
› No users/use cases, hence no data
› No data, because no users/use cases
▪ By 2007, some modest progress
› Metadata in HTML: microformats (sketch below)
› Linked Data: simplifying the stack
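For illustration, a minimal hCard microformat of the kind meant above, where agreed-upon class names carry the semantics; the name, URL, and address are invented:

<!-- hCard microformat: machine-readable via shared class names -->
<div class="vcard">
  <a class="fn url" href="http://example.org/~jdoe">Jane Doe</a>
  <div class="adr">
    <span class="locality">Santa Clara</span>,
    <span class="region">CA</span>
  </div>
</div>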
Microsearch internal prototype (2007)
› Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)
› Conferences he plans to attend and his vacations, from his homepage, plus bio events from LinkedIn
› Geolocation
Yahoo SearchMonkey (2008)
1. Extract structured data
› Semantic Web markup
• Example (fuller sketch below):
<span property="vcard:city">Santa Clara</span>
<span property="vcard:region">CA</span>
› Information Extraction
2. Presentation
› Fixed presentation templates
• One template per object type
› Applications
• Third-party modules to display data (SearchMonkey)
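For context, a sketch of how the vcard snippet above would sit inside a complete RDFa fragment; the namespace binding and the about value are assumptions, not taken from an actual SearchMonkey page:

<div xmlns:vcard="http://www.w3.org/2006/vcard/ns#" about="#office">
  <!-- each property attribute emits one RDF triple about #office -->
  <span property="vcard:city">Santa Clara</span>,
  <span property="vcard:region">CA</span>
</div>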
Effectiveness of enhanced results
▪ Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional and an enhanced search result for the same page
• Users prefer enhanced results in 84% of cases and traditional results in 3% (N=384)
▪ Implicit user feedback
› Click-through rate analysis
• Long-dwell-time threshold of 100 s (Ciemiewicz et al. 2010)
• 15% increase in ‘good’ clicks
› User interaction model
• Enhanced results lead users to relevant documents (IV) even though they are less likely to be clicked than textual results (III)
• Enhanced results effectively reduce bad clicks!
▪ See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced Results for Web Search. SIGIR 2011: 725–734
Other applications of enhanced results
▪ Google Rich Snippets – June 2009
› Faceted search for recipes – Feb 2011
▪ Bing tiles – Feb 2011
▪ Facebook’s Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object
▪ Twitter Cards (2012)
› More visual/interactive tweets
Other types of applications: vertical search
Not just web pages: markup in email
▪ Google Now
▪ Yahoo Search/Mail
▪ Microsoft Cortana
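These assistants read schema.org markup embedded in the message body; a minimal sketch of a flight-reservation email, with all values invented:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "FlightReservation",
  "reservationNumber": "RXJ34P",
  "underName": { "@type": "Person", "name": "Jane Doe" },
  "reservationFor": {
    "@type": "Flight",
    "flightNumber": "110",
    "airline": { "@type": "Airline", "iataCode": "UA" },
    "departureAirport": { "@type": "Airport", "iataCode": "SFO" },
    "departureTime": "2015-04-30T20:15:00-08:00",
    "arrivalAirport": { "@type": "Airport", "iataCode": "JFK" }
  }
}
</script>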
Problem!
▪ Each of these applications requires a different markup
› Different schemas and syntax
▪ What’s a publisher to do?
› Mark up the same content differently for every consumer
• Time consuming
• Error prone
schema.org
▪ Collaborative effort sponsored by large consumers of Web data
› Bing, Google, and Yahoo! as initial founders (June 2011)
› Yandex joins schema.org in Nov 2011
▪ Agreement on a shared set of schemas for the Web
› Available at schema.org in HTML and machine-readable formats
› Free to use under W3C Royalty-Free terms
Example
View source
schema.org structure
▪ Classes
› Each class has a label and a description
› Classes form a class hierarchy
• Multiple inheritance allowed but rare (a class with two super-classes)
▪ Properties
› Each property has a label and a description
› Properties have domains and ranges, and inverse properties
▪ Datatypes
› Boolean, Date, DateTime, etc.
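A small microdata example that exercises all three pieces: Event and Place are classes, name/startDate/location are properties, and startDate takes a Date value (the event details are only an illustration):

<div itemscope itemtype="http://schema.org/Event">
  <span itemprop="name">WWW 2015</span>
  <meta itemprop="startDate" content="2015-05-18">  <!-- Date datatype -->
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">Florence</span>
  </div>
</div>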
schema.org usage in practice
▪ Depends on the skill set of the publisher
› Instances are rarely given an identifier; when they are, it is typically the URL of the webpage (sketch below)
› schema.org consumers (validators etc.) are tolerant of mistakes
• e.g. accept text even when an object is required
▪ Driven by applications
› Publishers often provide the minimal information required in a particular context
› Validators (Bing, Google, Yandex) validate different subsets
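For instance, in JSON-LD an identifier can be attached with @id (itemid in microdata); the @id line is the part publishers usually omit. A sketch with invented URLs:

{
  "@context": "http://schema.org",
  "@type": "Person",
  "@id": "http://example.com/people/jdoe#me",
  "name": "Jane Doe",
  "url": "http://example.com/people/jdoe"
}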
schema.org statistics
▪ R.V. Guha: Light at the End of the Tunnel (ISWC 2013 keynote)
› Over 15% of all pages now have schema.org markup
› Over 5 million sites, over 25 billion entity references
› In other words
• Same order of magnitude as the web
▪ See also
› P. Mika, T. Potter: Metadata Statistics for a Large Web Corpus. LDOW 2012
• Based on the Bing US corpus
• 31% of webpages, 5% of domains contain some metadata
› WebDataCommons
• Based on the CommonCrawl of Nov 2013
• 26% of webpages, 14% of domains contain some metadata
schema.org process
▪ Process
› Initial release
• Group of experts harmonizing existing vocabularies
› Regular updates based on public discussion
• Fixes
• Extensions
• Deprecation
– almost never
▪ Tooling
› Website (App Engine)
• Open source
› GitHub issue tracker
Extensions
▪ External proposals integrated
› News (IPTC)
› e-Commerce (GoodRelations)
› TV/Radio fixes (BBC/EBU)
› Content Accessibility (a11ymetadata.org, IMS)
› Not-for-profit Offers (BibExtend)
› Question/Answer (StackExchange, Drupal)
▪ Further integration
› Automotive
› GS1
▪ New extension mechanism
› Coming soon
schema.org and web standards
▪ schema.org builds on Semantic Web standards
› RDFa, JSON-LD, HTML5 microdata (compared side by side below)
▪ Not a standardization effort in the classical sense
› Continuously evolving ontology
› Huge scope (‘everything on the Web’)
› Shallow depth compared to more targeted efforts
▪ More specialized discussions typically happen at more targeted forums
› e.g. W3C Community Groups
▪ Large enumerations and/or rapidly changing knowledge maintained elsewhere
› e.g. PlaceOfWorship
› BuddhistTemple, CatholicChurch, Church, HinduTemple, Mosque, Synagogue …
› Meanwhile, over at Wikipedia:
• https://en.wikipedia.org/wiki/Place_of_worship
• https://www.wikidata.org/wiki/Q1370598
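As promised above, the same statement (a person's name) in each of the three syntaxes; a minimal sketch with an invented name:

<!-- HTML5 microdata -->
<span itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
</span>

<!-- RDFa Lite -->
<span vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
</span>

<!-- JSON-LD -->
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "Person", "name": "Jane Doe" }
</script>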
BibExtend Community Group (W3C)
What’s new?
Task completion
▪ We would like to help our users with task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases when verbs are added to queries
› We need to understand what the available actions are
▪ Schema.org Actions
› Describe what actions can be taken on a page or in an email
› See the blog post and overview article
Actions
▪ Schema.org v1.2 (April 2014)
› See the blog post and overview article for details,
› and the public-vocabs threads for even more detail.
Actions example

Here is a Product and a potential action (Buy):

{
  "@type": "Product",
  "url": "http://example.com/products/ipod",
  "potentialAction": {
    "@type": "BuyAction",
    "target": {
      "@type": "EntryPoint",
      "urlTemplate": "https://example.com/products/ipod/buy",
      "encodingType": "application/ld+json",
      "contentType": "application/ld+json"
    },
    "result": {
      "@type": "Order",
      "url-output": "required",
      "confirmationNumber-output": "required",
      "orderNumber-output": "required",
      "orderStatus-output": "required"
    }
  }
}

After POSTing the request to the EntryPoint, here is your completed action:

{
  "@type": "BuyAction",
  "actionStatus": "CompletedActionStatus",
  "object": "https://example.com/products/ipod",
  "result": {
    "@type": "Order",
    "url": "http://example.com/orders/1199334",
    "confirmationNumber": "1ABBCDDF23234",
    "orderNumber": "1199334",
    "orderStatus": "PROCESSING"
  }
}
Interactive search results (Yandex Islands)
(Possible) example: quick unsubscribe
How do I unsubscribe?
Not very visible to humans…
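A sketch of how such an action might be declared in the message itself, modeled on schema.org's email-action pattern; UnsubscribeAction is hypothetical (it is not in the vocabulary), and all URLs are invented:

<!-- Sketch only: "UnsubscribeAction" is a hypothetical type mirroring existing email actions -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "EmailMessage",
  "description": "Monthly newsletter",
  "potentialAction": {
    "@type": "UnsubscribeAction",
    "name": "Unsubscribe",
    "target": "https://example.com/newsletter/unsubscribe?user=123"
  }
}
</script>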
Q&A
▪ Many thanks to
› The schema.org group and the many contributors to schema.org
› Dan Brickley
▪ Get involved
› Join the discussion at public-vocabs@w3.org
› File a bug, fork a schema, track releases on GitHub
▪ Contact me
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/