A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time
Robert Meusel, Christian Bizer and Heiko Paulheim
2
Motivation - LOD Cloud with 1,000 data providers
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time - WIMS 2015
3
Motivation - schema.org Microdata (MD) with 700k data providers
4
Microdata in a Nutshell
 Adding structured information to web pages
• By marking up contents and entities
 Arbitrary vocabularies are possible
• Practically, only schema.org is deployed on a large scale
• Plus its historical predecessor: data-vocabulary.org
 Similar to RDFa
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
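As a rough illustration of how a consumer might read such markup, the sketch below pulls the `itemtype` and the `itemprop` values out of the snippet above using only Python's standard `html.parser`. The class name `ItempropCollector` is hypothetical, and the code assumes a single flat item without nested `itemscope` elements; it is not a complete Microdata parser.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects itemprop -> text for one flat Microdata item (sketch only)."""

    def __init__(self):
        super().__init__()
        self.itemtype = None        # value of the itemtype attribute, if any
        self.properties = {}        # itemprop name -> text content
        self._current_prop = None   # itemprop of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)         # valueless attributes map to None
        if "itemscope" in attrs and "itemtype" in attrs:
            self.itemtype = attrs["itemtype"]
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]
            self.properties[self._current_prop] = ""

    def handle_data(self, data):
        # Only text inside an itemprop element is recorded.
        if self._current_prop is not None:
            self.properties[self._current_prop] += data

    def handle_endtag(self, tag):
        self._current_prop = None


snippet = """
<div itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="name">Data and Web Science Group</span>
<span itemprop="addressLocality">Mannheim</span>,
<span itemprop="postalCode">68131</span>
<span itemprop="addressCountry">Germany</span>
</div>
"""

collector = ItempropCollector()
collector.feed(snippet)
collector.close()
print(collector.itemtype)
print(collector.properties)
```

Text outside an `itemprop` element, such as the comma after the locality, is deliberately ignored.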
5
Schema.org in a Nutshell
 Vocabulary for marking up entities on web pages
• 675 classes and 965 properties (as of May 2015, release 2.0)
 Promoted and consumed by major search engine companies
• Google, Bing, Yahoo!, and Yandex
• Google Rich Snippets
 Community-driven evolution and development
6
Schema.org in a Nutshell – Coverage
 Schema.org has incorporated some popular vocabularies, such as:
• GoodRelations (2012)
• W3C Schema Bib Extend (2014)
• MusicBrainz vocabulary (2015)
• Automotive Ontology (2015)
7
Microdata with Schema.org in HTML Pages
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh </h1>
<div>
<meta content="EUR">
<span data-sale-price="219.95">219,95</span>
…
</body>
</html>
HTML pages directly embed markup languages that annotate items using different vocabularies
<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…
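The step from annotated HTML to triples can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser`, not the extraction pipeline used for the study. It assumes well-nested tags and that every item carries an `itemtype`, it omits language tags, and it builds property URIs as `<itemtype>/<itemprop>` to mirror the triples on the slide. The names `MicrodataToTriples`, `VOID_TAGS`, and `page` are hypothetical, and `page` is a completed version of the elided HTML above.

```python
from html.parser import HTMLParser
from itertools import count

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}  # no end tag


class MicrodataToTriples(HTMLParser):
    """Turns nested Microdata items into (subject, predicate, object) tuples."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._ids = count(1)
        self._items = []   # stack of (blank node, itemtype) for open items
        self._open = []    # stack of (opened_item, itemprop, text buffer)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("itemprop")
        parent = self._items[-1] if self._items else None
        opened_item = False
        if "itemscope" in a:
            # A new item: type it and, if nested, link it from its parent.
            node = f"_:node{next(self._ids)}"
            itemtype = a.get("itemtype", "")
            self.triples.append((node, RDF_TYPE, itemtype))
            if prop and parent:
                self.triples.append((parent[0], f"{parent[1]}/{prop}", node))
            self._items.append((node, itemtype))
            opened_item, prop = True, None
        elif prop and parent and "content" in a:
            # e.g. <meta itemprop="priceCurrency" content="EUR">
            self.triples.append((parent[0], f"{parent[1]}/{prop}", a["content"]))
            prop = None
        if tag not in VOID_TAGS:
            self._open.append((opened_item, prop, []))

    def handle_data(self, data):
        for _, prop, buf in self._open:
            if prop:
                buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        opened_item, prop, buf = self._open.pop()
        if opened_item:
            self._items.pop()
        elif prop and self._items:
            node, itemtype = self._items[-1]
            self.triples.append((node, f"{itemtype}/{prop}", "".join(buf).strip()))


page = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
  </div>
</div>
"""

parser = MicrodataToTriples()
parser.feed(page)
parser.close()
for triple in parser.triples:
    print(triple)
```

In this sketch the nested `Offer` receives its own blank node and is linked to the product through the `offers` property.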
8
Wrap-Up
 Semantic annotations are used by more and more websites
 Entities on websites become machine-readable and machine-understandable
 schema.org together with Microdata is a success story
• Promoted by search engine companies
• Deployed by over 17% of all websites [1] (over 700k data providers)
 Usage complies with the schema more closely than, e.g., LOD [2]
[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html
[2] Meusel and Paulheim, ESWC 2015
9
Digging for Reasons
 So, Microdata is deployed more widely and is often more schema-compliant, even though there are millions of uncontrolled providers with different skill sets
 But why? Some hypotheses…
• Availability of documentation
• Tool support
• Business incentive
• Schema flexibility
 Can we confirm/reject those from looking at the data?
10
A Diachronic Perspective
 Versions of schema.org are archived over time
• Plus: there are several crawl releases per year
• i.e., we can look at change over time
 If we look at both schema and deployed data, we may observe
• Adoption rates of schema changes
• Data-first changes to the schema
• Convergence or divergence of deployed data
11
A Diachronic Perspective
 Three releases of WDC Microdata corpus [1]
• 2012, 2013, and 2014
 Versions of schema.org that were valid
• At the beginning of the crawl
• At the end of the crawl
[1] http://webdatacommons.org/structureddata
12
Top-down Adoption
 How fast are changes in the schema adopted?
• New classes/properties
• Deprecations
• Domain/range changes
 Measuring adoption: challenges
• Different crawls
• Overall growth of deployed schema.org
 Measure: normalized usage increase (nui) from i to j:
• nui(s)>1.05: usage of schema element s has increased significantly
• nui(s)<0.95: usage of schema element s has decreased significantly
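The formula for nui did not survive in the slide text. The sketch below assumes one plausible reading: the share of schema.org-deploying providers (PLDs) that use element s in crawl j, divided by the same share in crawl i. Dividing by the totals is what compensates for different crawl sizes and the overall growth of schema.org. The function names and all numbers are hypothetical; only the thresholds 1.05 and 0.95 come from the slide.

```python
def nui(plds_with_s_i, plds_total_i, plds_with_s_j, plds_total_j):
    """Assumed definition: ratio of the usage share of s in crawl j to crawl i.

    Raises ZeroDivisionError if s was unused in crawl i or a total is zero.
    """
    share_i = plds_with_s_i / plds_total_i
    share_j = plds_with_s_j / plds_total_j
    return share_j / share_i


def classify(value, upper=1.05, lower=0.95):
    """Applies the thresholds from the slide to a nui value."""
    if value > upper:
        return "increased"
    if value < lower:
        return "decreased"
    return "stable"


# Hypothetical counts: schema.org grows from 100,000 to 200,000 providers.
growing = nui(1_000, 100_000, 3_000, 200_000)   # share rises from 1% to 1.5%
lagging = nui(1_000, 100_000, 1_500, 200_000)   # share falls from 1% to 0.75%

print(classify(growing), classify(lagging))
```

The second case shows why normalization matters: the element gains providers in absolute terms, yet its nui falls below 0.95 because it grows more slowly than schema.org as a whole.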
13
Top-down Adoption
 Adoption of new classes and properties
• Almost half of all introduced classes are never used!
• Similar for new properties
 Reasons
• Bulk-addition of vocabularies
• not every term is equally needed
• e.g., medical vocabulary
• Blind spot of our approach
• some terms are mainly for e-mail markup
• e.g., Actions
SURPRISE!
14
Top-down Adoption
 Main domains of positive adoption
• Metadata for web content (schema.org/WebSite has the highest nui)
• Broadcasting (e.g., TV Episodes)
• Questions & Answers
• Postal addresses
 Classes featured in Google Rich Snippets
• Still growing at a high level (tens of thousands of data providers)
• But nui(s)<0.95
[Slide callouts: Yellow Pages; Search Engine Listings; Collaboration with BBC and EBU; Influence of CMS adoption; Q&A pages, such as Stack Overflow]
15
Top-down Adoption
 Adoption of domain/range changes
• Again: rather low overall adoption
 Adopted well for
• Products (height, width, itemCondition, …)
• Broadcasting domain (episode, actor, ...)
[Slide callouts: Search Engine Listings; Collaboration with BBC and EBU]
16
Top-down Adoption
 Adoption of deprecations
• Works well (29 out of 32 have a significantly low nui)
 Exceptions
• s:map (superseded by s:hasMap)
• s:maps (superseded by s:hasMap)
[Slide callout: for Google Maps (lots of outdated tutorials)]
17
Bottom-up Evolution
 Martin Luther
• Started the Protestant church
• A success story, too (like schema.org)
• (i.e., 800 million adopters worldwide)
 Famous quote:
• “Man muss […] dem gemeinen Mann aufs Maul schauen”
• (roughly:
“You have to listen to the way the common man really speaks.”)
Martin Luther, 1483-1546
Disclaimer: I do not speak for the Protestant church.
18
Bottom-up Evolution
 Are new features in the schema first used “unofficially”?
• New classes/properties
• Domain/range changes
 Instrument for measurement: ROC curves
• True positives mapped against false positives
• tp: elements used before
• fp: elements not used before
• Ranking by #PLDs
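A minimal sketch of how such a ROC curve could be computed: candidate elements are ranked by the number of PLDs that use them, and each one is labeled positive or negative. How exactly the study assigns labels is only hinted at on the slide ("used before" versus "not used before"), so the `is_positive` flag, the element names, and the counts below are all hypothetical.

```python
def roc_points(ranked_labels):
    """ROC points (fpr, tpr) for a ranked list of booleans (True = positive).

    Assumes at least one positive and one negative label.
    """
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical candidates: (element, number of PLDs, is_positive)
candidates = [
    ("elementA", 500, True),
    ("elementB", 300, False),
    ("elementC", 200, True),
    ("elementD", 50, False),
]

# Rank by #PLDs, most widely deployed first.
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
points = roc_points([is_positive for _, _, is_positive in ranked])
print(points, auc(points))
```

An area clearly above 0.5 would indicate that widely deployed elements are more likely to be positives than a random ranking would suggest.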
19
Bottom-up Evolution
 Some mild influences are observable
• Stronger for domain/range changes
• especially range changes
• Weaker for new classes/properties
[Figure: ROC curves for 2012→2013, 2013→2014, and 2012→2014; series: classes, properties, domains, ranges]
20
Bottom-up Evolution
 Extension mechanism
• Allows for user-defined classes/properties
• Those become subclasses implicitly
 Analysis over time
• No measurable impact on standard evolution
• “Unofficial” use is likelier than use of the extension mechanism
Examples: s:Product/ElectronicProduct, s:price/reducedPrice
21
Overall Convergence
 Measuring convergence
• i.e., homogeneity of descriptions of classes
• Example: two instances of s:LocalBusiness
[Figure: two s:LocalBusiness instances (_:1), each linked to an s:PostalAddress node (_:2); one with “Birmingham” and “Main Street 24”, the other with “Liverpool” and “Church Street 1”]
22
Overall Convergence
 Recap
• RDF from Microdata is a set of trees
• i.e., we can enumerate all paths to leaf nodes (omitting literals)
 Example:
[Figure: s:LocalBusiness instance _:1 linked to s:PostalAddress node _:2 with “Liverpool” and “Church Street 1”]
rdf:type-s:LocalBusiness,
s:address-rdf:type-s:PostalAddress,
s:address-s:addressLocality,
s:address-s:streetAddress
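The path enumeration can be sketched with a small recursive function. The nested-dictionary representation of an item (`type` plus `props`) and the function name `enumerate_paths` are assumptions made for this illustration; the property and class names are taken from the example above.

```python
def enumerate_paths(item, prefix=()):
    """Lists all paths from the root item to its leaves, omitting literals."""
    # Every item contributes its own type as a path.
    paths = ["-".join(prefix + ("rdf:type", item["type"]))]
    for prop, value in item["props"].items():
        if isinstance(value, dict):
            # Nested item: descend, carrying the property in the prefix.
            paths.extend(enumerate_paths(value, prefix + (prop,)))
        else:
            # Literal value: keep the property path, drop the literal.
            paths.append("-".join(prefix + (prop,)))
    return paths


business = {
    "type": "s:LocalBusiness",
    "props": {
        "s:address": {
            "type": "s:PostalAddress",
            "props": {
                "s:addressLocality": "Liverpool",
                "s:streetAddress": "Church Street 1",
            },
        },
    },
}

for path in enumerate_paths(business):
    print(path)
```

Because literals are dropped, the Birmingham and Liverpool businesses yield the same set of paths, which is what makes the descriptions comparable.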
23
Overall Convergence
 Using all paths, we can compute the entropy for each class
 Low entropy indicates high homogeneity
 We normalize by both the maximum entropy and the total number of paths
• i.e., we use the normalized entropy rate as a measure of homogeneity
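The entropy formula itself is not preserved in the slide text. The sketch below therefore shows only one plausible reading, stated as an assumption: each path is treated as a binary feature (present or absent in an instance), its binary entropy is computed from the fraction of instances that contain it, and the sum is divided by the number of distinct paths. Since binary entropy is at most 1 bit, the result lies between 0 and 1. The function names and example data are hypothetical.

```python
import math


def binary_entropy(q):
    """Entropy in bits of a binary variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))


def normalized_entropy_rate(instances):
    """Average binary entropy per path; 0.0 means identical path sets.

    instances: list of sets, one set of paths per instance of a class.
    """
    all_paths = set().union(*instances) if instances else set()
    if not all_paths:
        return 0.0
    n = len(instances)
    total = sum(
        binary_entropy(sum(path in inst for inst in instances) / n)
        for path in all_paths
    )
    return total / len(all_paths)


homogeneous = [
    {"rdf:type-s:LocalBusiness", "s:address-s:addressLocality"},
    {"rdf:type-s:LocalBusiness", "s:address-s:addressLocality"},
]
heterogeneous = [
    {"rdf:type-s:LocalBusiness", "s:address-s:addressLocality"},
    {"rdf:type-s:LocalBusiness", "s:telephone"},
]

print(normalized_entropy_rate(homogeneous))
print(normalized_entropy_rate(heterogeneous))
```

Under this reading, a class whose instances all use the same paths scores 0, and the score rises as providers describe the class in more divergent ways.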
24
Overall Convergence
 Observations
• Overall entropy decreases over time
 Classes with high convergence rates
• WebSite, Blog, …
• Hotel, Restaurant, …
• Product, Offer, …
• Rating, Review
[Slide callouts, in bullet order: Influence of CMS adoption; Yellow pages; Google Rich Snippets; ...all of the above]
25
Key Adoption Drivers
 Search Engine Optimization
• Website providers want to rank high in Google search results
• Direct business incentive!
 Tool adoption
• Major CMSs use schema.org
 Standard Agility
• schema.org: 25 revisions in last three years
• cf. FOAF: six revisions in last eight years
26
Summary
 Both top-down and bottom-up adoption can be observed
 Homogeneity of the deployed schema increases over time
 The empirical, data-driven study reveals valuable insights into how and why schema.org is a success story
 The observed key drivers and obstacles can also help to understand and analyze the adoption of other standards, e.g., LOD
 More fine-grained insights might be revealed by extending the analysis corpus to the mailing list archive and issue tracker
27
Thank you! Questions? Feedback?
Raw data can be found on the website of WebDataCommons:
http://webdatacommons.org/structureddata/
More interesting datasets and analyses:
http://webdatacommons.org/index.html
Acknowledgement
The extraction and analysis of the datasets were supported by an AWS in Education Grant.
