Superset Me—Not:
Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)
JATS-Con
November 2, 2010
Summary
We have built a superset of the NLM Journal
Publishing Tag Set and a Schematron validator
to enforce business rules, data types, and
house style.
In retrospect, a JPTS subset—when used in
conjunction with the appropriate layer
validation technology, such as Schematron—
could have been sufficient to meet AGU's
needs.
Alexander (“Sasha”) Schwarzman, Superset Me—Not, JATS-Con, Nov 2, 2010
Contents
• Why we built a JPTS superset
• DTD vs. Schematron
– Attribute values
– Number of element occurrences
– Element position and sequence
– References
• Lessons learned
Why we built a JPTS superset
• No generic book model, e.g., no book-series-
meta for a book, no xi:include for chapters, etc.
• Lack of familiarity with Schematron
• Lack of mature tool support (running SVRL not a
viable option in Production environment)
• Lack of expertise in using Schematron to
validate against external data sources and
relational DB
• JPTS v2.3: no Compound Keywords, not all
content models parameterized
DTD vs. Schematron:
Attribute values
Requirement: Article type is required and can be one of three types:
a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD
<!ATTLIST article
article-type
(rga | cor | edt) #REQUIRED >
JPTS
<!ATTLIST article
article-type
CDATA #IMPLIED >
DTD vs. Schematron:
Attribute values (cont’d)
XML instance (contains non-allowed article type)
<article article-type='xxx'/>
Schematron
<rule context="article">
<assert test="@article-type=('rga','cor','edt')">
@article-type '<value-of select='@article-type'/>' not
allowed, must be 'rga', 'cor', or 'edt'</assert></rule>
Schematron message
@article-type 'xxx' not allowed, must be 'rga', 'cor', or
'edt'
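For readers without a Schematron engine at hand, the logic of this rule can be sketched in Python using only the standard-library ElementTree parser. This is an illustration, not AGU's validator; the function name `check_article_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the requirement stated on the slide.
ALLOWED_ARTICLE_TYPES = ("rga", "cor", "edt")


def check_article_type(xml_text):
    """Return a list of error messages for the root <article> element."""
    article = ET.fromstring(xml_text)
    value = article.get("article-type")  # None when the attribute is absent
    if value in ALLOWED_ARTICLE_TYPES:
        return []
    shown = "" if value is None else value
    return [f"@article-type '{shown}' not allowed, must be 'rga', 'cor', or 'edt'"]


print(check_article_type("<article article-type='xxx'/>"))
```

As with the Schematron assertion, a missing attribute fails the check, so the rule covers "required" as well as "one of three values".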
DTD vs. Schematron:
Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly
one paragraph, except for two journals (journal code ‘ja’ and
‘rg’) where Acknowledgments must contain two paragraphs
Strict DTD
<!ELEMENT ack (p, p?) >
JPTS
<!ELEMENT ack (p*) >
DTD vs. Schematron:
Number of occurrences (cont’d)
XML instance (wrong number of paragraphs)
<article>
...
<journal-id>jb</journal-id>
...
<ack>
<p>Blah</p>
<p>Blah-blah</p>
</ack>
</article>
DTD vs. Schematron:
Number of occurrences (cont’d)
Schematron
<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
<assert test="count(p) eq 2">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain exactly two paragraphs</assert></rule>
<rule context="ack">
<assert test="count(p) eq 1">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain only one paragraph</assert></rule>
DTD vs. Schematron:
Number of occurrences (cont’d)
Schematron message
'ack' in 'jb' must contain only one paragraph
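The journal-dependent count can be sketched in Python as well. The sketch assumes, as the simplified instance does, that `journal-id` sits somewhere inside `article`; the names `check_ack` and `TWO_PARAGRAPH_JOURNALS` are hypothetical.

```python
import xml.etree.ElementTree as ET

TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}  # journal codes from the requirement

SAMPLE = """<article>
  <journal-id>jb</journal-id>
  <ack><p>Blah</p><p>Blah-blah</p></ack>
</article>"""


def check_ack(xml_text):
    """Check the paragraph count of every <ack> against its journal's rule."""
    article = ET.fromstring(xml_text)
    journal_id = article.findtext(".//journal-id", default="")
    errors = []
    for ack in article.iter("ack"):
        count = len(ack.findall("p"))
        # Like the two Schematron rules, the special case is tried first and
        # the general rule applies only when the special case does not match.
        if journal_id in TWO_PARAGRAPH_JOURNALS:
            if count != 2:
                errors.append(f"'ack' in '{journal_id}' must contain exactly two paragraphs")
        elif count != 1:
            errors.append(f"'ack' in '{journal_id}' must contain only one paragraph")
    return errors


print(check_ack(SAMPLE))
```

The if/elif ordering mirrors the way a Schematron pattern fires only the first rule whose context matches, which is why the more specific rule is listed first.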
DTD vs. Schematron:
Element position & sequence
Requirement: If a journal uses subject grouping (a ToC category,
a disciplinary subset) and an article belongs to a special
collection (a special section, a theme), then subject grouping
metadata must precede special collection metadata
Strict DTD
<!ELEMENT article-categories
(subject-group*,
special-collection?) >
JPTS
<!ELEMENT article-categories
(subj-group*) >
DTD vs. Schematron:
Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups)
<article-categories>
<subj-group subj-group-type="special-section">
<subject content-type="EARLYWARN1">New Methods and
Applications of Earthquake Early Warning</subject>
</subj-group>
<subj-group subj-group-type="toc-category">
<subject content-type="SDE">Solid Earth</subject>
</subj-group>
</article-categories>
DTD vs. Schematron:
Element position & sequence (cont’d)
Schematron
<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
<assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
<name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear
after a ToC Category or a Subset when either is present</assert></rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear
after a ToC Category or a Subset when either is present
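The same ordering test can be sketched in Python. The set names and `check_group_order` are hypothetical; the type values are the four used in the Schematron rule.

```python
import xml.etree.ElementTree as ET

SPECIAL_COLLECTION_TYPES = {"special-section", "theme"}
SUBJECT_GROUPING_TYPES = {"toc-category", "subset"}

SAMPLE = """<article-categories>
  <subj-group subj-group-type="special-section">
    <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category">
    <subject content-type="SDE">Solid Earth</subject>
  </subj-group>
</article-categories>"""


def check_group_order(xml_text):
    """Report special-collection groups that precede a subject grouping."""
    categories = ET.fromstring(xml_text)
    groups = [g.get("subj-group-type") for g in categories.findall("subj-group")]
    errors = []
    for index, group_type in enumerate(groups):
        if group_type not in SPECIAL_COLLECTION_TYPES:
            continue
        # Equivalent of the following-sibling test: no subject grouping
        # may come later than a special collection.
        if any(later in SUBJECT_GROUPING_TYPES for later in groups[index + 1:]):
            errors.append(
                f"subj-group/@subj-group-type='{group_type}' must appear "
                "after a ToC Category or a Subset when either is present"
            )
    return errors


print(check_group_order(SAMPLE))
```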
DTD vs. Schematron:
References
Validating references is a challenge:
• Variety vs. the need to enforce editorial style
Strict DTD:
• Fixed element order, no mixed content
• Punctuation, spacing, face markup – on output
JPTS:
• Lots of elements, any order, mixed content
• Punctuation, spacing, face markup included
DTD vs. Schematron:
References (cont’d)
Strict DTD
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
<!ATTLIST book-standalone-citation
id ID #REQUIRED >
DTD vs. Schematron:
References (cont’d)
JPTS
<!ELEMENT mixed-citation
(#PCDATA | person-group | string-name |
year | source | edition | size |
elocation-id | publisher-name |
publisher-loc | ... | ...)* >
<!ATTLIST mixed-citation
id ID #IMPLIED
publication-type CDATA #IMPLIED >
DTD vs. Schematron:
References (cont’d)
Example:
Mood, A. M., and F. A. Graybill (1963),
Introduction to the Theory of Statistics, 2nd ed.,
295 pp., McGraw-Hill, New York.
DTD vs. Schematron:
References (cont’d)
XML instance (strict DTD)
<book-standalone-citation id="mood63">
<person-group person-group-type="author">
<name><surname>Mood</surname>
<given-names>A. M.</given-names></name>
<name><surname>Graybill</surname>
<given-names>F. A.</given-names></name>
</person-group>
<year>1963</year>
<source>Introduction to the Theory of Statistics</source>
<edition>2nd</edition>
<size units="page">295 pp</size>
<publisher-name>McGraw-Hill</publisher-name>
<publisher-loc>New York</publisher-loc>
</book-standalone-citation>
DTD vs. Schematron:
References (cont’d)
XML instance (JPTS)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names> <surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<source><italic>Introduction to the
Theory of Statistics</italic></source>,
<edition>2</edition>nd ed.,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron:
References (cont’d)
Before we proceed, please note:
- required elements
- edition, if present, follows source
- optional elements between source and publisher-name:
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
DTD vs. Schematron:
References (cont’d)
• Schematron can check that all required elements
are present:
<rule context="mixed-citation[@publication-type='book-standalone']">
<assert test="(person-group | string-name) and year
and source and publisher-name
and publisher-loc">
required element missing</assert></rule>
• & that the elements are in the correct sequence:
DTD vs. Schematron:
References (cont’d)
XML instance (JPTS) (edition is in the wrong place)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names><surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<edition>2</edition>nd ed.,
<source><italic>Introduction to the Theory …</italic></source>,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron:
References (cont’d)
This Schematron uses the positional predicate [1] to check that year is
immediately followed by source:
<rule context="mixed-citation[@publication-type=
'book-standalone']/year">
<assert test="following-sibling::*[1]/self::source">
'<name/>' must be immediately followed by 'source', not by '<value-of
select='name(following-sibling::*[1])'/>'
</assert></rule>
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
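The positional test can be sketched in Python. Iterating over an ElementTree element yields child elements only, so the punctuation and spacing of the mixed content are skipped, much as `following-sibling::*` skips text nodes. `WRONG_ORDER` is a shortened version of the instance above, and `check_year_then_source` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

WRONG_ORDER = """<mixed-citation publication-type="book-standalone">
<string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>
(<year>1963</year>), <edition>2</edition>nd ed.,
<source>Introduction to the Theory of Statistics</source>,
<publisher-name>McGraw-Hill</publisher-name>, <publisher-loc>New York</publisher-loc>.
</mixed-citation>"""


def check_year_then_source(xml_text):
    """Require that every <year> is immediately followed by <source>."""
    citation = ET.fromstring(xml_text)
    children = list(citation)  # child elements only; mixed-content text is skipped
    errors = []
    for index, child in enumerate(children):
        if child.tag != "year":
            continue
        # Name of the next sibling element, or "" when year is the last child.
        next_name = children[index + 1].tag if index + 1 < len(children) else ""
        if next_name != "source":
            errors.append(
                f"'year' must be immediately followed by 'source', not by '{next_name}'"
            )
    return errors


print(check_year_then_source(WRONG_ORDER))
```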
DTD vs. Schematron:
References (cont’d)
But how to check the sequence of required elements when there might
be optional elements interspersed between them?
This Schematron checks that required publisher-name is preceded by
required source, regardless of any optional elements that may
occur in-between:
<rule context="mixed-citation[@publication-type=
'book-standalone']/publisher-name">
<assert test="preceding-sibling::source">
'<name/>' must be preceded by 'source'</assert></rule>
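A Python sketch of two of the checks shown so far: presence of the required elements, and publisher-name preceded by source with optional elements allowed in between. It assumes the caller has already selected a citation whose publication-type is 'book-standalone'; `check_book_citation` and `REQUIRED` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Required child elements besides the author, as listed in the assertion.
REQUIRED = ("year", "source", "publisher-name", "publisher-loc")


def check_book_citation(xml_text):
    """Run the presence check and the 'preceded by source' check."""
    citation = ET.fromstring(xml_text)
    names = [child.tag for child in citation]  # child element names in order
    errors = []
    # Required elements: an author in either form, plus the four names above.
    has_author = "person-group" in names or "string-name" in names
    if not has_author or any(name not in names for name in REQUIRED):
        errors.append("required element missing")
    # publisher-name needs a source somewhere before it, whatever
    # optional elements sit in between.
    for index, name in enumerate(names):
        if name == "publisher-name" and "source" not in names[:index]:
            errors.append("'publisher-name' must be preceded by 'source'")
    return errors


good = ("<mixed-citation><string-name/>, <year/>, <source/>, <size/> pp., "
        "<publisher-name/>, <publisher-loc/>.</mixed-citation>")
print(check_book_citation(good))
```

Pairwise checks of this kind grow quickly as the number of elements rises, which is the motivation for the content-model approach that follows.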
DTD vs. Schematron:
References (cont’d)
• Rick Jelliffe’s approach combines flexibility of JPTS
with benefits of a DTD-like fixed element order:
– Each element’s content rewritten as a string of its child
element names
– Content model represented as a regular expression
– Schematron checks the string of names against regex
– Schematron generates an error message if content
does not match the model
DTD vs. Schematron:
References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation
models:
...
<model publication-type="book-standalone">
((string-name | person-group),
year,
source,
edition?,
(string-name | person-group)?,
size?,
elocation-id?,
publisher-name,
publisher-loc)
</model>
...
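The idea of matching a string of child element names against a content model can be sketched in Python. This is not Rick Jelliffe's implementation: Python's `re` module stands in for the regular-expression support of XPath 2.0, and the translation in `model_to_regex` (a hypothetical helper) is one assumed way of turning a DTD-style model into a regex. The model follows the strict DTD shown earlier, with edition optional. Note that, as written, the model admits a single author element before year.

```python
import re
import xml.etree.ElementTree as ET

MODEL = """((string-name | person-group), year, source, edition?,
(string-name | person-group)?, size?, elocation-id?,
publisher-name, publisher-loc)"""


def model_to_regex(model):
    """Translate a DTD-style content model into a regex over 'name ' tokens."""
    # Keep names and the operators ( ) | ? * +; commas and whitespace are
    # dropped because sequence is expressed by simple concatenation.
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()|?*+]", model)
    parts = []
    for token in tokens:
        if token in "()|?*+":
            parts.append(token)
        else:
            # Each name must be followed by one space, so that 'year'
            # cannot match the start of a longer element name.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def child_names(xml_text):
    """Rewrite an element's content as a string of its child element names."""
    element = ET.fromstring(xml_text)
    return "".join(child.tag + " " for child in element)


def matches_model(xml_text, model=MODEL):
    return model_to_regex(model).fullmatch(child_names(xml_text)) is not None


citation = ("<mixed-citation><string-name>Mood</string-name> (<year>1963</year>), "
            "<source>T</source>, <edition>2</edition>nd ed., <size>295</size> pp., "
            "<publisher-name>P</publisher-name>, <publisher-loc>L</publisher-loc>."
            "</mixed-citation>")
print(child_names(citation))
print(matches_model(citation))
```

Because only element names enter the string, punctuation and spacing in the mixed content do not affect the match, so the instance stays valid to the loose DTD while the order is still enforced.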
DTD vs. Schematron:
References (cont’d)
• Advantages:
– XML is still DTD-valid
– Mixed content is permitted
– Type-sensitive handling of references is possible
• Caveat: XSLT 2.0!
Lessons learned
• AGU Tag Set + Schematron (200+ checks)
– Ensures data quality
– Ensures markup integrity
– Provides control over production processes
– Enforces business rules, data types, and house style
• AGU Tag Set is a superset of JPTS
– Based on JPTS
– Uses the same modularization principles
– Can be easily mapped to JPTS
• BUT: Were we to do this again, we would build
a JPTS subset and a Schematron schema for it
Lessons learned (cont’d)
• Appropriate layer validation—advantages:
– Even the most “Prussian” DTD can’t enforce all
business rules, data types, and house style
– Rules-based checking needed anyway
– May as well use “Californian” JPTS (de facto
industry standard) adopted by publishers,
conversion & composition vendors, archives, etc.
– Can use tools developed for JATS: Preview XSLT
stylesheets, EPUBS conversion processes, etc.
• Paradigm shift: the crux of validation shifts
from XML parser to Schematron engine
Lessons learned (cont’d)
• This shift is not without costs:
– Content may be valid to JPTS but make no sense
– Dependency on Schematron for semantic integrity
– Preserving each Schematron release and adding
version info to the content’s metadata (?)
– Constraints on business partners: must be
Schematron-capable and have tools
– Schematron does not “fix” problems—people do.
Processes and procedures must be well-defined
Lessons learned (cont’d)
• Writing a simple Schematron is easy;
building a complex and efficient one is not:
– Elicit, document, convey, and clarify the Requirements
– Ensure Schematron fits into your workflow
– Modularize Schematron
– Ensure that individual Schematron rules aren’t in conflict
– Optimize Schematron performance
– Employ XSLT 2.0
– Test, test, test
– Cultivate Schematron & XSLT 2.0 expertise in-house
Conclusion
• What about content that is not like a journal
article, e.g., generic (non-NCBI) books and their
parts/chapters?
• When this deficiency is addressed, the NLM
Archiving and Interchange Tag Suite could truly
say:
“Superset Me—Not!”
