Validator and preview 
for the JobPosting data model 
of Schema.org 
Jindřich Mynarz 
Department of Information and Knowledge 
Engineering, 
University of Economics, Prague 
EC-WEB 2014, September 2, 2014
Motivation 
● Improve usability of vocabularies 
● Provide feedback on the use of 
vocabularies 
● Make vocabulary specification executable 
● Help ensure basic level of data quality 
● Capture application-specific requirements 
for data in validation rules
DámePráci.eu project 
“Matching jobs with unemployed 
through semantic data” 
Data model using Schema.org with 
an extension for the job market. 
Application for searching through job postings 
aggregated from distinct sources: 
www.damepraci.cz (in Czech)
Validation method 
● Rule-based, schema-aware 
validation 
● Operates in the RDF data model 
● Focuses on semantic errors, beyond well-formed 
markup 
● Partial open world assumption 
● Implemented as SPARQL 1.1 CONSTRUCT 
queries 
● Error reporting via SPIN RDF vocabulary
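The rule-based approach can be pictured as a set of independent checks, each of which turns input triples into zero or more violation records. The sketch below is plain Python over tuples, not the validator's actual SPARQL CONSTRUCT queries. The function names and the dictionary keys are hypothetical; the keys are only loosely modelled on SPIN's spin:ConstraintViolation and spin:violationRoot.

```python
RDF_TYPE = "rdf:type"

def rule_has_job_posting(triples):
    """Report a violation when no schema:JobPosting instance is present."""
    found = any(p == RDF_TYPE and o == "schema:JobPosting" for _, p, o in triples)
    if found:
        return []
    return [{
        "type": "spin:ConstraintViolation",  # class name taken from SPIN
        "violationRoot": None,               # no single offending resource here
        "label": "No schema:JobPosting instance found",
    }]

def validate(triples, rules):
    """Run every rule and concatenate the violations they report."""
    violations = []
    for rule in rules:
        violations.extend(rule(triples))
    return violations

# Illustrative input: a title without any typed job posting.
data = [("_:job", "schema:title", "SEO guru")]
report = validate(data, [rule_has_job_posting])
print(report)
```

Keeping each rule independent mirrors the one-query-per-rule design and makes every rule testable in isolation.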
Background knowledge 
schema.org 
+ extension for job market (RDFS) 
+ external enumerations: 
● ISO 4217 currency codes (SKOS) 
● ISO 639-1 language codes (SKOS) 
Loaded in separate named graphs that the 
validation rules can reference.
Validation rules 
● Data completeness 
● Distinction between datatype and object 
properties 
● Conflicting data 
● Datatype violations 
● Invalid codes
Data completeness 
● At least 1 instance 
of schema:JobPosting 
● Other type information (class membership, 
datatypes) left optional 
● Empty literals 
● Conditionally required data (e.g., 
compensation + currency)
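Two of the completeness checks above, empty literals and conditionally required data, can be sketched as follows. This is an illustrative Python approximation, not the validator's SPARQL. The property schema:baseSalary is an assumed stand-in for "compensation", and the function names are hypothetical.

```python
def empty_literals(triples):
    """Return (subject, property) pairs whose literal is empty or whitespace."""
    return [(s, p) for s, p, o in triples if isinstance(o, str) and o.strip() == ""]

def missing_dependent(triples, if_present, then_required):
    """Return subjects that have `if_present` but lack `then_required`."""
    with_trigger = {s for s, p, _ in triples if p == if_present}
    with_required = {s for s, p, _ in triples if p == then_required}
    return sorted(with_trigger - with_required)

# Illustrative data: job1 states a salary without a currency
# and has an empty region.
data = [
    ("_:job1", "schema:baseSalary", "30000"),
    ("_:job1", "schema:addressRegion", ""),
    ("_:job2", "schema:baseSalary", "25000"),
    ("_:job2", "schema:currency", "CZK"),
]

print(empty_literals(data))
print(missing_dependent(data, "schema:baseSalary", "schema:currency"))
```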
Distinction between datatype 
and object properties 
● Object properties with literal objects instead 
of URIs or blank nodes (and vice versa for 
datatype properties) 
● Simpler syntax of datatype 
properties 
○ Avoiding nested objects or difficulties with finding an 
object's URI 
● May be a symptom of incorrectly nested 
HTML elements
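The check described above compares the kind of each object (URI, blank node, or literal) with the kind its property expects. A minimal Python sketch, assuming a hand-written classification of properties; a real validator would presumably derive it from the schema, and the table and function name below are hypothetical.

```python
# Hypothetical, hand-written classification of properties.
PROPERTY_KIND = {
    "schema:title": "datatype",
    "schema:jobLocation": "object",
}

def property_kind_violations(triples):
    """Flag objects whose node kind contradicts the property's kind."""
    violations = []
    for s, p, (node_kind, _value) in triples:
        expected = PROPERTY_KIND.get(p)  # unknown properties are skipped
        if expected == "datatype" and node_kind != "literal":
            violations.append((s, p, "datatype property used as object property"))
        elif expected == "object" and node_kind == "literal":
            violations.append((s, p, "object property used as datatype property"))
    return violations

# Objects are (node_kind, value) pairs; node_kind is "iri", "bnode" or "literal".
data = [
    ("_:job", "schema:title", ("iri", "#title")),
    ("_:job", "schema:jobLocation", ("literal", "Munich")),
    ("_:job", "schema:description", ("literal", "Not classified, so not checked")),
]

for violation in property_kind_violations(data):
    print(violation)
```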
Conflicting data 
● Mutually-exclusive properties 
○ schema:jobLocation 
+ schema:isRemoteWork true 
● Cardinality violation for functional properties 
with > 1 object 
○ schema:startDate, schema:currency, 
schema:availableVacancies 
● Incompatible class membership inferences 
○ schema:domainIncludes, schema:rangeIncludes 
○ Incompatible class membership means a resource 
instantiates 2+ distinct classes that are not related 
by rdfs:subClassOf.
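The first two conflict checks above can be approximated in a few lines. The sketch is illustrative Python rather than the validator's SPARQL. The set of functional properties comes from the slide; the function names and the encoding of the boolean as the string "true" are assumptions.

```python
from collections import defaultdict

# Properties treated as functional (at most one object per subject).
FUNCTIONAL = {"schema:startDate", "schema:currency", "schema:availableVacancies"}

def cardinality_violations(triples):
    """Return (subject, property) pairs with more than one distinct object."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in FUNCTIONAL:
            values[(s, p)].add(o)
    return sorted(key for key, objs in values.items() if len(objs) > 1)

def remote_with_location(triples):
    """Return subjects that are remote jobs and also state a job location."""
    located = {s for s, p, _ in triples if p == "schema:jobLocation"}
    remote = {s for s, p, o in triples
              if p == "schema:isRemoteWork" and o == "true"}
    return sorted(located & remote)

data = [
    ("_:job1", "schema:startDate", "2014-09-02"),
    ("_:job1", "schema:startDate", "2014-10-01"),
    ("_:job1", "schema:jobLocation", "_:place"),
    ("_:job1", "schema:isRemoteWork", "true"),
    ("_:job2", "schema:currency", "CZK"),
]

print(cardinality_violations(data))
print(remote_with_location(data))
```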
Datatype violations 
● Detected by regular expressions and by casting errors 
raised by XPath datatype constructor functions 
● Date and time formats (xsd:date, 
xsd:duration) 
○ Not conforming to regular expressions 
○ Non-existent dates 
○ Dates from the future 
● Interval limits 
○ Positive integers for schema:availableVacancies
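The date and interval checks above can be sketched in Python as follows. This is a simplified stand-in for the regular expressions and XPath casts used in SPARQL: the pattern accepts only plain YYYY-MM-DD values, whereas xsd:date also allows a time zone suffix. The function names are hypothetical, and the reference date is passed in explicitly so that the check is reproducible.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def check_date(value, today):
    """Return an error message for a bad date string, or None if it is fine."""
    if not DATE_PATTERN.fullmatch(value):
        return "does not match YYYY-MM-DD"
    year, month, day = (int(part) for part in value.split("-"))
    try:
        parsed = date(year, month, day)  # raises ValueError for e.g. 30 February
    except ValueError:
        return "non-existent date"
    if parsed > today:
        return "date from the future"
    return None

def check_positive_integer(value):
    """Return an error message unless the string is an integer >= 1."""
    if not re.fullmatch(r"[+-]?[0-9]+", value):
        return "not an integer"
    if int(value) < 1:
        return "must be a positive integer"
    return None

today = date(2014, 9, 2)
print(check_date("2014-02-30", today))  # non-existent date
print(check_date("2015-01-01", today))  # date from the future
print(check_positive_integer("0"))      # must be a positive integer
```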
Invalid codes 
● Based on lookup in code lists enumerating 
every valid code 
● Includes language codes (ISO 639-1) and 
currency codes (ISO 4217)
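A code lookup of this kind reduces to set membership. The sketch below uses short excerpts in place of the full ISO code lists, which the validator loads as SKOS. The mapping from properties to code lists is an assumption; in particular, schema:inLanguage is only a plausible stand-in for the language property.

```python
# Excerpts only; the real code lists enumerate every valid code.
CURRENCY_CODES = frozenset({"CZK", "EUR", "GBP", "USD"})  # ISO 4217
LANGUAGE_CODES = frozenset({"cs", "de", "en"})            # ISO 639-1

# Hypothetical mapping from properties to the code list that constrains them.
CODE_LISTS = {
    "schema:currency": CURRENCY_CODES,
    "schema:inLanguage": LANGUAGE_CODES,
}

def invalid_codes(triples):
    """Return triples whose object is missing from the property's code list."""
    return [(s, p, o) for s, p, o in triples
            if p in CODE_LISTS and o not in CODE_LISTS[p]]

data = [
    ("_:job", "schema:currency", "Kč"),     # a symbol, not an ISO 4217 code
    ("_:job", "schema:inLanguage", "cs"),   # valid
    ("_:job", "schema:title", "SEO guru"),  # not constrained by a code list
]

print(invalid_codes(data))
```

The lookup is case-sensitive, so "czk" would be reported as well.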
Implementation 
Ruby on Rails web application 
backed by Jena Fuseki SPARQL 1.1 endpoint. 
● Validates both RDFa and HTML5 Microdata 
● Czech and English localization 
● Validation results in HTML or JSON-LD 
● RSpec tests for each validation rule 
● Open source: https://github.com/OPLZZ/job-posting-validator
Demo: bit.ly/broken-job-posting
Preview
Experimental validation 
of a JobPosting corpus 
● 1332 seed URLs from 752 distinct 
pay-level domains obtained via Google 
Custom Search Engine restricted to 
schema:JobPosting 
● Sample of 42 872 web pages obtained 
by crawling seed URLs 
● Each page validated, validation results 
in JSON-LD loaded to Elasticsearch 
for exploration
Most common errors
Datatype property used 
as object property 
Most common path to error: schema:title 
Possible cause: incorrect understanding of 
markup precedence rules: 
<a property="title" href="#title">SEO guru</a> 
Extracted: [] schema:title <#title> . 
Intended: [] schema:title "SEO guru" .
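The precedence at play can be illustrated with a simplified reading of how RDFa 1.1 picks the object for an element carrying @property. The function below is a hypothetical sketch that ignores @typeof and several other cases, so it is not a conforming RDFa processor.

```python
def rdfa_property_object(attrs, text):
    """Pick the object for an element with @property (simplified RDFa 1.1)."""
    # @datatype or @content forces a literal.
    if "datatype" in attrs or "content" in attrs:
        return ("literal", attrs.get("content", text))
    # Without @rel/@rev, a resource attribute wins over the text content.
    if "rel" not in attrs and "rev" not in attrs:
        for name in ("resource", "href", "src"):
            if name in attrs:
                return ("iri", attrs[name])
    # Otherwise the element's text becomes a plain literal.
    return ("literal", text)

# The link from the slide: @href takes precedence over the link text.
print(rdfa_property_object({"property": "title", "href": "#title"}, "SEO guru"))

# Adding @content restores the intended literal.
print(rdfa_property_object(
    {"property": "title", "href": "#title", "content": "SEO guru"}, "SEO guru"))
```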
Empty literal value 
Most common path to error: 
schema:addressRegion 
Possible cause: incomplete data used to 
generate HTML from fixed templates 
Less common in manually marked-up HTML
Incorrect character case 
in schema:Postaladdress (should be schema:PostalAddress) 
Both RDFa and HTML5 Microdata are case-sensitive. 
Spread across 116 unique PLDs. 
“The default mode of authoring [Schema.org 
markup] is copy and edit.” — R.V. Guha
Object property used 
as datatype property 
Most common path to error: schema:jobLocation 
Common cause: simpler markup without intermediate 
resources 
Simpler markup without intermediate resources: 
<p property="jobLocation"> 
  Munich 
</p> 
Markup with intermediate resources: 
<p rel="jobLocation"> 
  <p rel="address"> 
    <p property="addressLocality"> 
      Munich 
    </p> 
  </p> 
</p>
Unsuccessful experiments 
Web Data Commons 
● Errors smoothed by extraction to RDF 
● Not suitable as a source of seed URLs: job 
postings disappear quickly 
Veterans Job Bank 
● Data from few PLDs, lacks variety 
● Severe restrictions on automated downloads 
through its API
Questions? 
Acknowledgements: 
The presented research was partially supported by the project 
of Operational Programme Human Resources and Employment no. 
CZ.1.04/5.1.01/77.00440. 
Image credits: 
Check List designed by Arthur Shlain from thenounproject.com 
Puzzle designed by John from thenounproject.com
