1. Superset Me—Not:
Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)
JATS-Con
November 2, 2010
2. Summary
We have built a superset of the NLM Journal
Publishing Tag Set and a Schematron validator
to enforce business rules, data types, and
house style.
In retrospect, a JPTS subset—when used in
conjunction with the appropriate layer
validation technology, such as Schematron—
could have been sufficient to meet AGU's
needs.
Alexander (“Sasha”) Schwarzman 2Superset Me—Not JATS-Con Nov 2, 2010
3. Contents
• Why we built a JPTS superset
• DTD vs. Schematron
– Attribute values
– Number of element occurrences
– Element position and sequence
– References
• Lessons learned
4. Why we built a JPTS superset
• No generic book model, e.g., no book-series-meta for a book,
no xi:include for chapters, etc.
• Lack of familiarity with Schematron
• Lack of mature tool support (running SVRL was not a
viable option in the Production environment)
• Lack of expertise in using Schematron to
validate against external data sources and
relational databases
• JPTS v2.3: no compound keywords, not all
content models parameterized
5. DTD vs. Schematron:
Attribute values
Requirement: Article type is required and can be one of three types:
a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD
<!ATTLIST article
article-type
(rga | cor | edt) #REQUIRED >
JPTS
<!ATTLIST article
article-type
CDATA #IMPLIED >
6. DTD vs. Schematron:
Attribute values (cont’d)
XML instance (contains non-allowed article type)
<article article-type='xxx'/>
Schematron
<rule context="article">
<assert test="@article-type=('rga','cor','edt')">
@article-type '<value-of select='@article-type'/>' not
allowed, must be 'rga', 'cor', or 'edt'</assert></rule>
Schematron message
@article-type 'xxx' not allowed, must be 'rga', 'cor', or
'edt'
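The same attribute-value rule can be sketched outside Schematron. The following is a minimal Python illustration using only the standard library, not the AGU validator; the function name check_article_type is hypothetical, and the sketch assumes the document element is article.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the requirement stated on the slide.
ALLOWED_ARTICLE_TYPES = {"rga", "cor", "edt"}

def check_article_type(xml_text):
    """Return an error message, or None when @article-type is allowed.

    Hypothetical helper: assumes the root element is <article>.
    """
    article = ET.fromstring(xml_text)
    value = article.get("article-type", "")  # a missing attribute is reported as ''
    if value not in ALLOWED_ARTICLE_TYPES:
        return (f"@article-type '{value}' not allowed, "
                "must be 'rga', 'cor', or 'edt'")
    return None

print(check_article_type("<article article-type='xxx'/>"))
# prints: @article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
```

Because the check lives outside the DTD, the attribute can stay declared as CDATA #IMPLIED and the document remains JPTS-valid.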
7. DTD vs. Schematron:
Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly
one paragraph, except for two journals (journal code ‘ja’ and
‘rg’) where Acknowledgments must contain two paragraphs
Strict DTD
<!ELEMENT ack (p, p?) >
JPTS
<!ELEMENT ack (p*) >
8. DTD vs. Schematron:
Number of occurrences (cont’d)
XML instance (wrong number of paragraphs)
<article>
...
<journal-id>jb</journal-id>
...
<ack>
<p>Blah</p>
<p>Blah-blah</p>
</ack>
</article>
9. DTD vs. Schematron:
Number of occurrences (cont’d)
Schematron
<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
<assert test="count(p) eq 2">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain exactly two paragraphs</assert></rule>
<rule context="ack">
<assert test="count(p) eq 1">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain only one paragraph</assert></rule>
10. DTD vs. Schematron:
Number of occurrences (cont’d)
Schematron message
'ack' in 'jb' must contain only one paragraph
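As a rough illustration of the journal-dependent occurrence rule, here is a minimal Python sketch using only the standard library. It is not the AGU Schematron; the function name check_ack is hypothetical, and it assumes a single journal-id somewhere inside article, as in the simplified instance above.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs (from the requirement).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}

def check_ack(xml_text):
    """Return a list of messages for <ack> elements with the wrong paragraph count."""
    article = ET.fromstring(xml_text)
    journal = article.findtext(".//journal-id", default="")
    expected = 2 if journal in TWO_PARAGRAPH_JOURNALS else 1
    wording = "exactly two paragraphs" if expected == 2 else "only one paragraph"
    messages = []
    for ack in article.iter("ack"):
        if len(ack.findall("p")) != expected:  # count direct <p> children only
            messages.append(f"'ack' in '{journal}' must contain {wording}")
    return messages

sample = """<article>
  <journal-id>jb</journal-id>
  <ack><p>Blah</p><p>Blah-blah</p></ack>
</article>"""
print(check_ack(sample))
# prints: ["'ack' in 'jb' must contain only one paragraph"]
```

A DTD content model cannot express this rule because the allowed count depends on a value elsewhere in the document.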
11. DTD vs. Schematron:
Element position & sequence
Requirement: If a journal uses subject grouping (a ToC category,
a disciplinary subset) and an article belongs to a special
collection (a special section, a theme), then subject grouping
metadata must precede special collection metadata
Strict DTD
<!ELEMENT article-categories
(subject-group*,
special-collection?) >
JPTS
<!ELEMENT article-categories
(subj-group*) >
12. DTD vs. Schematron:
Element position & sequence (cont’d)
XML instance (wrong sequence of subject groups)
<article-categories>
<subj-group subj-group-type="special-section">
<subject content-type="EARLYWARN1">New Methods and
Applications of Earthquake Early Warning</subject>
</subj-group>
<subj-group subj-group-type="toc-category">
<subject content-type="SDE">Solid Earth</subject>
</subj-group>
</article-categories>
13. DTD vs. Schematron:
Element position & sequence (cont’d)
Schematron
<rule context="article-categories/
subj-group[@subj-group-type=('special-section','theme')]">
<assert test="not(following-sibling::
subj-group[@subj-group-type=('toc-category','subset')])">
<name/>/@subj-group-type='<value-of
select='@subj-group-type'/>' must appear after a ToC Category
or a Subset when either is present</assert></rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear
after a ToC Category or a Subset when either is present
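The sequence rule amounts to: no special-collection group may have a subject-grouping group after it. A minimal Python sketch of that logic follows, using only the standard library. The function name check_subj_group_order is hypothetical, and the sketch assumes it is handed the article-categories element as the root.

```python
import xml.etree.ElementTree as ET

# Type values taken from the Schematron rule on the slide.
COLLECTION_TYPES = {"special-section", "theme"}
GROUPING_TYPES = {"toc-category", "subset"}

def check_subj_group_order(xml_text):
    """Report special-collection groups that precede a subject-grouping group."""
    categories = ET.fromstring(xml_text)
    groups = categories.findall("subj-group")
    messages = []
    for index, group in enumerate(groups):
        group_type = group.get("subj-group-type")
        if group_type not in COLLECTION_TYPES:
            continue
        # Equivalent of following-sibling::subj-group[...] in the XPath test.
        later_types = {g.get("subj-group-type") for g in groups[index + 1:]}
        if later_types & GROUPING_TYPES:
            messages.append(
                f"subj-group/@subj-group-type='{group_type}' must appear "
                "after a ToC Category or a Subset when either is present"
            )
    return messages

sample = """<article-categories>
  <subj-group subj-group-type="special-section">
    <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category">
    <subject content-type="SDE">Solid Earth</subject>
  </subj-group>
</article-categories>"""
print(check_subj_group_order(sample))
```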
14. DTD vs. Schematron:
References
Validating references is a challenge:
• Variety vs. the need to enforce editorial style
Strict DTD:
• Fixed element order, no mixed content
• Punctuation, spacing, face markup – on output
JPTS:
• Lots of elements, any order, mixed content
• Punctuation, spacing, face markup included
15. DTD vs. Schematron:
References (cont’d)
Strict DTD
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
<!ATTLIST book-standalone-citation
id ID #REQUIRED >
16. DTD vs. Schematron:
References (cont’d)
JPTS
<!ELEMENT mixed-citation
(#PCDATA | person-group | string-name |
year | source | edition | size |
elocation-id | publisher-name |
publisher-loc | ... | ...)* >
<!ATTLIST mixed-citation
id ID #IMPLIED
publication-type CDATA #IMPLIED >
17. DTD vs. Schematron:
References (cont’d)
Example:
Mood, A. M., and F. A. Graybill (1963),
Introduction to the Theory of Statistics, 2nd ed.,
295 pp., McGraw-Hill, New York.
18. DTD vs. Schematron:
References (cont’d)
XML instance (strict DTD)
<book-standalone-citation id="mood63">
<person-group person-group-type="author">
<name><surname>Mood</surname>
<given-names>A. M.</given-names></name>
<name><surname>Graybill</surname>
<given-names>F. A.</given-names></name>
</person-group>
<year>1963</year>
<source>Introduction to the Theory of Statistics</source>
<edition>2nd</edition>
<size units="page">295 pp</size>
<publisher-name>McGraw-Hill</publisher-name>
<publisher-loc>New York</publisher-loc>
</book-standalone-citation>
19. DTD vs. Schematron:
References (cont’d)
XML instance (JPTS)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names> <surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<source><italic>Introduction to the
Theory of Statistics</italic></source>,
<edition>2</edition>nd ed.,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
20. DTD vs. Schematron:
References (cont’d)
Before we proceed, please note:
- required elements
- edition, if present, follows source
- optional elements between source and publisher-name:
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
21. DTD vs. Schematron:
References (cont’d)
• Schematron can check that all required elements
are present:
<rule context="mixed-citation[@publication-type=
'book-standalone']">
<assert test="(person-group | string-name) and year
and source and publisher-name
and publisher-loc">
required element missing</assert></rule>
• and that the elements are in the correct sequence:
22. DTD vs. Schematron:
References (cont’d)
XML instance (JPTS) (edition is in the wrong place)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names><surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<edition>2</edition>nd ed.,
<source><italic>Introduction to the Theory …</italic></source>,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
23. DTD vs. Schematron:
References (cont’d)
This Schematron uses the positional predicate [1] to check that year is
immediately followed by source:
<rule context="mixed-citation[@publication-type=
'book-standalone']/year">
<assert test="following-sibling::*[1]/self::source">
'<name/>' must be immediately followed by 'source', not by '<value-of
select='name(following-sibling::*[1])'/>'
</assert></rule>
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
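To make the "immediately followed by" logic concrete, here is a minimal Python sketch using only the standard library. It stands in for following-sibling::*[1] by looking at the next element child and ignoring the interleaved text of the mixed content. The function name check_year_followed_by_source is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_year_followed_by_source(xml_text):
    """Check that each <year> in a book-standalone citation is directly followed by <source>."""
    citation = ET.fromstring(xml_text)
    if citation.get("publication-type") != "book-standalone":
        return []  # the rule applies to this citation type only
    children = list(citation)  # element children; punctuation text is skipped
    messages = []
    for index, child in enumerate(children):
        if child.tag != "year":
            continue
        # Name of the next sibling element, or '' when year is the last child.
        following = children[index + 1].tag if index + 1 < len(children) else ""
        if following != "source":
            messages.append(
                "'year' must be immediately followed by 'source', "
                f"not by '{following}'"
            )
    return messages

sample = """<mixed-citation publication-type="book-standalone">
<string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>
(<year>1963</year>), <edition>2</edition>nd ed.,
<source><italic>Introduction to the Theory ...</italic></source>,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>"""
print(check_year_followed_by_source(sample))
# prints: ["'year' must be immediately followed by 'source', not by 'edition'"]
```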
24. DTD vs. Schematron:
References (cont’d)
But how to check the sequence of required elements when there might
be optional elements interspersed between them?
This Schematron checks that required publisher-name is preceded by
required source, regardless of any optional elements that may
occur in-between:
<rule context="mixed-citation[@publication-type=
'book-standalone']/publisher-name">
<assert test="preceding-sibling::source">
'<name/>' must be preceded by 'source'</assert></rule>
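The "preceded by, with anything in between" check can be sketched as a single pass that remembers whether source has been seen. This is a minimal Python illustration using only the standard library, not the AGU rule set; the function name check_publisher_preceded_by_source is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_publisher_preceded_by_source(xml_text):
    """Check that <publisher-name> has a <source> somewhere before it among its siblings."""
    citation = ET.fromstring(xml_text)
    if citation.get("publication-type") != "book-standalone":
        return []  # the rule applies to this citation type only
    seen_source = False
    messages = []
    for child in citation:  # direct element children, in document order
        if child.tag == "source":
            seen_source = True
        elif child.tag == "publisher-name" and not seen_source:
            # Equivalent of a failed preceding-sibling::source test.
            messages.append("'publisher-name' must be preceded by 'source'")
    return messages

ok = ('<mixed-citation publication-type="book-standalone">'
      '<year>1963</year><source>S</source><edition>2</edition>'
      '<size units="page">295</size><publisher-name>P</publisher-name>'
      '</mixed-citation>')
bad = ('<mixed-citation publication-type="book-standalone">'
       '<year>1963</year><publisher-name>P</publisher-name><source>S</source>'
       '</mixed-citation>')
print(check_publisher_preceded_by_source(ok))   # prints: []
print(check_publisher_preceded_by_source(bad))
# prints: ["'publisher-name' must be preceded by 'source'"]
```

Optional elements such as edition or size between source and publisher-name do not affect the result, which is the point of using a non-positional test.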
25. DTD vs. Schematron:
References (cont’d)
• Rick Jelliffe’s approach combines the flexibility of JPTS
with the benefits of a DTD-like fixed element order:
– Each citation’s content is rewritten as a string of its
child element names
– The content model is represented as a regular expression
– Schematron checks the string of names against the regex
– Schematron generates an error message if the content
does not match the model
26. DTD vs. Schematron:
References (cont’d)
An XML file, e.g., citation-models.xml, specifies structured citation
models:
...
<model publication-type="book-standalone">
((string-name | person-group),
year,
source,
edition?,
(string-name | person-group)?,
size?,
elocation-id?,
publisher-name,
publisher-loc)
</model>
...
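The name-string-versus-regex idea can be illustrated in a few lines of Python. This is only a sketch of the general technique under stated assumptions, not Rick Jelliffe's or AGU's XSLT 2.0 implementation. The helper names model_to_regex and matches_model are hypothetical, the model text follows the strict DTD content model shown earlier (with edition optional), and the converter handles only the operators that appear in it (sequence, choice, and "?").

```python
import re
import xml.etree.ElementTree as ET

# Content model for book-standalone citations, written DTD-style.
BOOK_STANDALONE_MODEL = """
((string-name | person-group), year, source, edition?,
 (string-name | person-group)?, size?, elocation-id?,
 publisher-name, publisher-loc)
"""

def model_to_regex(model):
    """Turn a DTD-style model into a regex over space-terminated element names."""
    compact = re.sub(r"\s+", "", model)
    # Wrap each element name as a group that matches "name ".
    named = re.sub(r"[\w-]+",
                   lambda m: "(?:" + re.escape(m.group(0)) + " )",
                   compact)
    # In a regex, sequence is plain concatenation, so the commas are dropped.
    return re.compile(named.replace(",", ""))

def matches_model(citation_xml, pattern):
    """Rewrite the citation's child elements as a name string and test it."""
    citation = ET.fromstring(citation_xml)
    names = "".join(child.tag + " " for child in citation)
    return pattern.fullmatch(names) is not None

pattern = model_to_regex(BOOK_STANDALONE_MODEL)
good = ('<mixed-citation publication-type="book-standalone">'
        '<string-name>Mood</string-name> (<year>1963</year>), '
        '<source>Title</source>, <edition>2</edition>nd ed., '
        '<publisher-name>McGraw-Hill</publisher-name>, '
        '<publisher-loc>New York</publisher-loc>.</mixed-citation>')
bad = good.replace('<source>Title</source>, <edition>2</edition>nd ed., ',
                   '<edition>2</edition>nd ed., <source>Title</source>, ')
print(matches_model(good, pattern))  # prints: True
print(matches_model(bad, pattern))   # prints: False
```

Only element names take part in the match, so punctuation and spacing in the mixed content are ignored while element order is still enforced.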
27. DTD vs. Schematron:
References (cont’d)
• Advantages:
– XML is still DTD-valid
– Mixed content is permitted
– Type-sensitive handling of references is possible
• Caveat: requires XSLT 2.0!
28. Lessons learned
• AGU Tag Set + Schematron (200+ checks)
– Ensures data quality
– Ensures markup integrity
– Provides control over production processes
– Enforces business rules, data types, and house style
• AGU Tag Set is a superset of JPTS
– Based on JPTS
– Uses the same modularization principles
– Can be easily mapped to JPTS
• BUT: Were we to do this again, we would build
a JPTS subset and a Schematron for it
29. Lessons learned (cont’d)
• Appropriate layer validation—advantages:
– Even the most “Prussian” DTD can’t enforce all
business rules, data types, and house style
– Rules-based checking needed anyway
– May as well use “Californian” JPTS (de facto
industry standard) adopted by publishers,
conversion & composition vendors, archives, etc.
– Can use tools developed for JATS: Preview XSLT
stylesheets, EPUB conversion processes, etc.
• Paradigm shift: the crux of validation shifts
from XML parser to Schematron engine
30. Lessons learned (cont’d)
• This shift is not without costs:
– Content may be valid to JPTS but make no sense
– Dependency on Schematron for semantic integrity
– Preserving each Schematron release and adding
version info to the content’s metadata (?)
– Constraints on business partners: must be
Schematron-capable and have tools
– Schematron does not “fix” problems—people do.
Processes and procedures must be well-defined
31. Lessons learned (cont’d)
• Writing a simple Schematron is easy;
building a complex and efficient one is not:
– Elicit, document, convey, and clarify the Requirements
– Ensure Schematron fits into your workflow
– Modularize Schematron
– Ensure that individual Schematron rules aren’t in conflict
– Optimize Schematron performance
– Employ XSLT 2.0
– Test, test, test
– Cultivate Schematron & XSLT 2.0 expertise in-house
32. Conclusion
• What about content that is not like a journal
article, e.g., generic (non-NCBI) books and their
parts/chapters?
• When this deficiency is addressed, the NLM
Archiving and Interchange Tag Suite could truly
say:
“Superset Me—Not!”