NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider World - Successful Applications of Linked Data

The Challenges of Describing Best Tagging Practices for JATS
Jeffrey Beck, NCBI/NLM/NIH
NISO/NFAIS Joint Virtual Conference:
Connecting the Library to the Wider World:
Successful Applications of Linked Data
Wednesday, December 3, 2014

Intro to JATS
JATS refers to NISO Z39.96-2012 Journal Article
Tag Suite.
It is a NISO standard that describes XML
elements and attributes and three article
models in XML.

JATS was based on the “NLM DTDs”, which have
been used to describe journal articles since
2003.
The “NLM DTDs” grew out of work being done
on the NCBI PubMed Central (PMC) DTDs in
2002.

So, what is this DTD you speak of?
DTD is Document Type Definition
– One of many (3 really) schema languages for
defining XML documents
– Essentially a set of rules for what can be in your
document, what must be in your document, and the
order of things if you wish to enforce order
We’ll get to “Why DTD” later.

A Brief History
• NLM Version 1 was released in December
2002 with the Archiving and Interchange DTD
and the Journal Publishing DTD.
• Version 1 was based on work at NCBI to
upgrade the PubMed Central DTD and a
project at Harvard University funded by the
Mellon Foundation to address the problems of
archiving scholarly journals in electronic form
(E-journals).

• The initial meeting included participants from
NCBI, Harvard, and the Mellon Foundation
along with NCBI’s consultants, Mulberry
Technologies, and Harvard’s consultants,
Inera, Inc.
But there was confusion about what the model
should be.

Easy Target for Conversion?
• Should the new DTD be a broad, descriptive
target that would be easy to translate articles
from other SGML or XML models into?
A model like this would have many optional
elements with few things in a prescribed
order, and different ways to tag the same
object.

Easy model to create content in?
• Or should the new DTD be a narrower,
prescriptive target that would give creators of
new XML articles guidance about how to make
a valid article?
A model like this would have more required
elements with fewer choices on how to tag
the same object.

The DTD Spectrum
[Diagram: a spectrum running from "Optimized for Conversion" at one end to "Optimized to Create Content in" at the other.]

The DTD Spectrum
[Diagram: on the Conversion-to-Creation spectrum, the Archiving and Interchange DTD sits toward Conversion and the Journal Publishing DTD toward Creation.]

Everything was fine, until
<x>

The two archiving strategies
Archiving the intellectual content of the article?
Or
Archiving the article file?

If you need to archive the entire file, you need a
way to keep those items in the file that the
Archiving and Interchange DTD did not worry
about.

Punctuation in Keywords.
Keyword Group in Archiving 1.0:
<!ELEMENT kwd-group (title?, kwd+) >
Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.
<kwd-group>
<kwd>DNA analysis</kwd>
<kwd>gene expression</kwd>
<kwd>parallel cloning</kwd>
<kwd>fluid microarray</kwd>
</kwd-group>
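
A minimal sketch of what the Archiving 1.0 tagging keeps and what it drops, using Python's standard xml.etree.ElementTree. The heading, separator, and terminator arguments are assumptions supplied by the rendering code; none of them is stored in the XML, which is exactly the point of the slide.

```python
import xml.etree.ElementTree as ET

KWD_GROUP_V1 = """
<kwd-group>
<kwd>DNA analysis</kwd>
<kwd>gene expression</kwd>
<kwd>parallel cloning</kwd>
<kwd>fluid microarray</kwd>
</kwd-group>
"""

def render_keywords(xml_text, heading="Keywords: ", separator="; ", terminator="."):
    """Rebuild a display string from the <kwd> elements.

    heading, separator, and terminator are hypothetical display choices:
    the XML itself carries only the keyword text.
    """
    group = ET.fromstring(xml_text.strip())
    keywords = [kwd.text for kwd in group.findall("kwd")]
    return heading + separator.join(keywords) + terminator

rendered = render_keywords(KWD_GROUP_V1)
print(rendered)
```

The intellectual content survives, but the punctuation in the output comes from the renderer, not from the archived file.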

Punctuation in Keywords.
Keyword Group in Archiving 2.0:
<!ELEMENT kwd-group (title?, (kwd | x)+) >
Keywords: DNA analysis; gene expression; parallel cloning; fluid microarray.
<kwd-group>
<title>Keywords: </title>
<kwd>DNA analysis</kwd><x>; </x>
<kwd>gene expression</kwd><x>; </x>
<kwd>parallel cloning</kwd><x>; </x>
<kwd>fluid microarray</kwd><x>.</x>
</kwd-group>
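
The difference between the two content models can be sketched in Python by treating each model as a regular expression over the sequence of child element names. This is a simplification for illustration, not a DTD validator, and the helper name matches_model is hypothetical. The XML is written without line breaks so that joining the text nodes reproduces the printed string exactly.

```python
import re
import xml.etree.ElementTree as ET

# The two content models from the slides, rewritten as regexes over child
# element names. Each name is followed by one space so names stay separate.
MODEL_V1 = re.compile(r"(title )?(kwd )+")        # (title?, kwd+)
MODEL_V2 = re.compile(r"(title )?((kwd|x) )+")    # (title?, (kwd | x)+)

KWD_GROUP_V2 = (
    "<kwd-group>"
    "<title>Keywords: </title>"
    "<kwd>DNA analysis</kwd><x>; </x>"
    "<kwd>gene expression</kwd><x>; </x>"
    "<kwd>parallel cloning</kwd><x>; </x>"
    "<kwd>fluid microarray</kwd><x>.</x>"
    "</kwd-group>"
)

def matches_model(element, model):
    """Return True if the element's child sequence fits the content model."""
    child_names = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_names) is not None

group = ET.fromstring(KWD_GROUP_V2)
valid_v1 = matches_model(group, MODEL_V1)   # <x> is not allowed in 1.0
valid_v2 = matches_model(group, MODEL_V2)   # <x> is allowed in 2.0
printed_text = "".join(group.itertext())    # text exactly as tagged

print(valid_v1, valid_v2)
print(printed_text)
```

With <x> in the model, the file itself carries the heading and the punctuation, so the printed line can be recovered without any rendering rules.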

The DTD Spectrum
[Diagram: the same Conversion-to-Creation spectrum, now with the Article Authoring DTD added.]

JATS?
Journal Article Tag Suite
The Tag Suite is the collection of all Elements
and Attributes.
Each model (Archiving, Publishing, Authoring) is
a Tag Set.
Each schema (DTD, XSD, RELAX NG) represents a
model or Tag Set.

NLM DTDs v 2.1 September 2005
NLM DTDs v 2.0 November 2004
NLM DTDs v 1.0 March 2003
This was when the Article Archiving and Journal Publishing models
became more open and we added the Authoring model.

NLM DTDs v 2.2 June 2006

Decision to formalize standard with NISO
Laura Kelly suggested that this would be a
good time to clean up those little things that we
know are problems but we haven’t fixed
because we wanted all of the new models to be
backward-compatible.

Backward-compatibility
• Means that all existing XML instances will be
valid according to the new model.
• Mostly we had minor housekeeping issues that we
had been putting off.
• In version 1.0, the @id on <list-item> was
defined as CDATA (when it obviously should
have been defined as ID to allow ID/IDREF
functionality).
• So, any existing <list-item id="45qrt"> would be
valid under version 1.0 but not valid once the
attribute was properly defined as type ID, because
an ID value must be an XML name and cannot start
with a digit.
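
A rough way to find existing values that would break under the stricter attribute type is to test them against the XML Name pattern. The sketch below is an assumption-laden simplification: it covers only the ASCII part of the Name production, and the sample values other than "45qrt" are made up for illustration.

```python
import re

# ASCII-only simplification of the XML "Name" production. The real
# production also allows many non-ASCII characters (assumption noted).
ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")

def is_valid_id_value(value):
    """Return True if the value could be used for an attribute of type ID."""
    return ASCII_XML_NAME.fullmatch(value) is not None

# "45qrt" comes from the slide; the other two values are hypothetical.
existing_ids = ["45qrt", "li45qrt", "_item-3"]
report = {value: is_valid_id_value(value) for value in existing_ids}
print(report)
```

A full check would also need to confirm that each ID value is unique within its document, which type ID requires and CDATA does not.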

Backward-incompatible release

NLM DTDs v 3.1
NLM DTD Working Group is dissolved, and
the NISO Journal Article Tag Suite Working
Group is created.

• March 2011: NISO Z39.96 JATS v 0.4
• August 2012: NISO Z39.96-2012 is official
• December 2013: JATS v1.1d1 released
• December 2014??: JATS v1.1d2

Maintained in DTD
• We deliver DTD, XSD, and RNG as non-normative
supporting material to the
standard.
• But the models are written and maintained in
DTD and the other schemas are derived from
them.

Q: But this means that you will not get any of
the advantages of the more modern schema
languages in JATS?
A: Yes. That is correct.
Q: And that is bad!
A: Not necessarily.
Q: But, but … data typing!!!

In defense of DTD
• First, DTD is still the schema language of
choice for most users of JATS – publishers and
tagging vendors.

But, but … data typing!!!
Data typing gives the schema writer control
over the value of an element or attribute,
like saying that a value must be an integer or
that a string of characters must be a date.
There is little data typing in DTD.

Let’s consider dates
It is reasonable to say that when we are creating
content to publish, we want the values that are
written as dates to be dates.
• The 14th of Smoon
• January 7, 1
• 1947-02-30
Are all a little hinky and should not be published!
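
What a typed date buys you when creating new content can be sketched with Python's standard datetime module. The sketch assumes dates are written as ISO 8601 YYYY-MM-DD strings, similar in spirit to a schema date type; the function name is_iso_date is hypothetical.

```python
from datetime import date

def is_iso_date(value):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

candidates = ["1947-02-30", "1947-02-28", "The 14th of Smoon"]
results = {value: is_iso_date(value) for value in candidates}
print(results)
```

A strict check like this rejects February 30 outright, which is what you want at authoring time.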

But what if they already exist?
If you are tagging a journal’s historical content in
XML and you come across an issue with a cover
date of February 30, 1947, what do you do?
A: Fix it!
Q: What is it “supposed” to be?

If a date can sometimes not be a date, then you
cannot have a hard and fast rule built into your schema
that says it must always be a date.
{Thanks to Tommie Usdin of Mulberry Technologies
and Co-Chair of the JATS Standing Committee for
this wonderful example that I stole.}
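
One way to live with impossible dates in historical content is to keep the string as published and record separately whether it parses. The sketch below is a hypothetical processing policy, not something JATS prescribes; the RecordedDate class and record_date function are invented for illustration and assume ISO 8601 YYYY-MM-DD input.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RecordedDate:
    as_published: str        # the string exactly as it appears in the source
    parsed: Optional[date]   # None when the string is not a real calendar date

def record_date(value):
    """Keep the published value and attach a parsed date only when one exists."""
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        parsed = None
    return RecordedDate(as_published=value, parsed=parsed)

cover = record_date("1947-02-30")
print(cover)
```

Nothing is rejected and nothing is silently "fixed": downstream users can see both what the cover said and that it is not a usable date.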

So, how do you tag a … ?
• But sometimes people want to be told what to
do.

• The JATS Tag Sets, especially Archiving
and Interchange and even Journal
Publishing, are very flexible models that allow
content to be tagged in different ways.

A reasonable question
• (1) It seems from the element reference page for <chem-struct-wrap> that
one could omit explicit labels because "A <chem-struct-wrap> may also be
numbered, automatically by a formatting application or by preserving the
number inside a <label> element." Having seen this, but not found similar
comments about "automatic numbering" for other elements that may
typically be numbered/labelled, I would like to know what the assumption
is about omitting labels in general for these (e.g. chemical structures,
equations, figures, tables, etc.): is a formatting application expected by
default to generate a number/label? If so, is there a way to suppress
numbering for some occurrences?
• (2) Relatedly, what is the expected behaviour for an <xref> element that
has no content (e.g. one that (a) references an element for which
automatic numbering has been assumed and which therefore lacks a
<label>, or (b) one that references an element possessing a <label>)?
• Message from Simon Newton to jats-list@lists.mulberrytech.com on
September 7, 2011

• Simon was asking for “Best Practices”
• So I was thrilled to see the following response:

I don't think any assumptions are made regarding
when and exactly how numbering should be
automated; there is only a recognition that it is
commonly done in publishing systems, and JATS is
designed to support this (or no numbering at all) or
not, depending on local policies.
Neither is there any expectation that by default, a
formatting application will number things.
This means you have both the opportunity and the
burden to define a policy that makes the most
sense for your data and workflow.
Message from Wendell Piez to jats-list@lists.mulberrytech.com on
September 8, 2011
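
To make "define a policy" concrete, here is a sketch of one possible local policy in Python: keep a <label> when the source has one, generate a number in document order when it does not, and fill empty <xref> elements from the result. The policy, the "Figure " prefix, the sample XML, and the function name fill_empty_xrefs are all assumptions for illustration; JATS does not mandate any of them.

```python
import xml.etree.ElementTree as ET

ARTICLE = """
<body>
<p>See <xref ref-type="fig" rid="f1"/> and <xref ref-type="fig" rid="f2"/>.</p>
<fig id="f1"><label>Figure 1</label></fig>
<fig id="f2"/>
</body>
"""

def fill_empty_xrefs(root, generated_prefix="Figure "):
    """Apply one hypothetical policy: preserve labels, else number by order."""
    labels = {}
    for position, fig in enumerate(root.iter("fig"), start=1):
        label = fig.find("label")
        if label is not None and label.text:
            labels[fig.get("id")] = label.text                      # preserved
        else:
            labels[fig.get("id")] = f"{generated_prefix}{position}"  # generated
    for xref in root.iter("xref"):
        has_content = bool(xref.text) or len(xref) > 0
        if not has_content:
            xref.text = labels.get(xref.get("rid"), "")
    return root

root = fill_empty_xrefs(ET.fromstring(ARTICLE.strip()))
xref_texts = [xref.text for xref in root.iter("xref")]
print(xref_texts)
```

A different workflow could just as reasonably leave unlabeled figures unnumbered; the schema allows either, so the choice has to be written down somewhere outside it.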

Best Practices must be scoped
• They must make sense with your content.
… with your workflow
… and for any users of your content down the
line.

The Standing Committee position
The JATS Standing Committee makes an effort to
make the Tag Suite as useful as possible for all
users: creators of content, publishers, archives,
and other aggregators.
To do this “all reasonable practices” are
documented as much as possible in the non-normative
supporting information available at
http://jats.nlm.nih.gov.

But there are efforts to define tagging best
practices – or at least practices.

PMC Tagging Guidelines
We have the PMC Tagging Guidelines
(http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html),
which are essentially "Best Practices" for tagging
articles in NLM XML for submission to PMC.
These are still surprisingly open.

In response to the article “Inconsistent XML as a Barrier
to Reuse of Open Access Content”
(http://www.ncbi.nlm.nih.gov/books/NBK159964/), which focused on
inconsistent tagging in the PMC Open Access articles
available for reuse, a group of mainly open access
publishers formed a group called JATS for Reuse to define
some best tagging practices.
See http://jats4r.github.io/

Questions?
Come to
http://jats.nlm.nih.gov/jats-con
