By now, you have heard how important structured content is. But, maybe you poked around with something like DITA and were baffled by the complexity. Or, maybe you still aren’t sure what XSLT stands for. This workshop will take participants back to the basics, to provide a foundation for higher-level concepts that have taken hold of our industry. Topics will include:
- What XML looks like, what it does, and how to create it.
- How to define a structure model, including whether to use a DTD, Schema, etc.
- What XSLT looks like, what it does, and how to make it work.
- What DITA and DocBook really are and whether one is right for you.
Russell Ward is an experienced technical writer and structured technologies developer. He has spent many years working with structured content to maximize efficiency in the techcomm environment, both as an employee and as an independent consultant. He is also an experienced trainer and speaks periodically at conferences and other peer events.
2. Speaker contact information
Russ Ward
Senior Technical Writer at Spirent Communications in Frederick,
MD.
Owner of West Street Consulting, a part-time enterprise
specializing in Structured FrameMaker plugins and custom
development.
5280 Corporate Drive, Suite A100
Frederick, MD 21703
301.444.2489
russ.ward@spirent.com
www.spirent.com
357 W. North St.
Carlisle, PA 17013
717.240.2989
russ@weststreetconsulting.com
www.weststreetconsulting.com
3. Workshop purpose
Overall purpose
To give you a functional knowledge of markup at a basic level, so you
know where to get started.
Things we will cover in some depth
What markup really means (and clarify some of the jargon).
The essential structure of XML
How to make rules about the structure of XML (DTD, Schema, etc.)
How to do anything interesting with XML once you have it (XSLT, DOM,
etc.)
Things we will cover with less depth
Common applications for XML in the technical communications space,
such as DITA, DocBook, and other standards.
Other types of markup.
4. Disclaimers
You cannot go from a dummy to an expert in an afternoon! If you
really want to make markup work for you, plan for a dedicated pursuit
of knowledge which could last your entire career.
The concepts of markup are as big as the universe of technology. We will
attempt to focus on areas of interest within the technical
communications field.
If you want expertise with markup, you have to want it.
5. XML/markup myths (according to Russ Ward)
I’m a technical writer, so I’m concerned about words, not XML.
Markup is easy… it’s just a bunch of tags.
I only care about markup if I want to use a CMS / reuse content / (insert
specific reason here).
All I have to do is convert to XML and the magic starts to happen.
I don’t have the time to study this stuff / convert my content / (insert
task here)
I’m a lone writer, so it’s not worth my time to fuss with markup.
XML authoring = DITA.
6. XML!
eXtensible Markup Language… one of many ways to mark up content in
text format. By far, it is the most widely used format in technical
communication and the greater IT universe. Therefore, it is the primary
subject of this workshop.
XML markup follows a hierarchical tree format, similar to the inherent
structure of a written document. Therefore, not only does XML provide
technical advantages, but the words of a document also fit naturally into it.
By itself, XML is just a text file that does nothing! The magic happens
when you use XML-aware tools to read the markup and do cool stuff.
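XML must be well-formed (it obeys the basic syntax rules, such as properly
nested and closed tags) and normally should be validated (checked against a
structure definition).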
Several validation formats are available, with DTD and Schema as the
most popular.
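As a rough illustration of the tree format described above, here is a minimal sketch of a well-formed XML file. The element and attribute names (manual, chapter, title, para, audience) are invented for this example and do not come from any standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Exactly one root element wraps everything else -->
<manual>
  <!-- Elements nest inside each other to form the tree -->
  <chapter audience="administrator">
    <title>Getting started</title>
    <!-- Every opening tag has a matching closing tag -->
    <para>Install the software before you begin.</para>
  </chapter>
</manual>
```

This file is well-formed, but it has not been validated: nothing in it names a DTD or Schema that says which elements are allowed.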
7. DTD vs. Schema
Two competing methods to define a structure and validate any
compliant document.
DTD:
• Older and originated with SGML.
• Uses a unique syntax.
• Still works just fine; that is, remains supported by any mainstream XML tool.
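To give a feel for the DTD syntax, here is a minimal sketch of a tiny book-and-page structure with the DTD embedded at the top of the document. Real projects normally keep the DTD in a separate file; embedding it here is only a convenience for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book [
  <!-- A book holds one or more page elements -->
  <!ELEMENT book (page+)>
  <!-- A page holds text only -->
  <!ELEMENT page (#PCDATA)>
]>
<book>
  <page>This is page one.</page>
  <page>This is page two.</page>
</book>
```

The lines between the square brackets are the DTD; note that they do not look like XML tags.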
Schema:
• Newer, applicable to XML only.
• Uses an XML syntax which makes it easier for a computer to read, but also
more difficult for a human to read.
• More features than DTD.
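For a sense of what Schema looks like, here is a minimal sketch of an XSD file that describes a book element containing one or more page elements, each holding plain text. The element names are chosen for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <!-- minOccurs defaults to 1, so this means "one or more page elements" -->
        <xs:element name="page" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The rules are themselves written as XML elements, which is why a computer can parse them with the same tools it uses for the content.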
8. DTD vs. Schema (cont’d)
The choice should be made based on what works for you and the tools
you intend to use. Here are some reasons you might choose Schema*:
• It is easier to describe allowable document content
• It is easier to validate the correctness of data
• It is easier to define data facets (restrictions on data)
• It is easier to define data patterns (data formats)
• It is easier to convert data between different data types
*(taken from https://www.w3schools.com/xml/schema_intro.asp)
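As one example of a data pattern, here is a minimal sketch of a Schema rule that restricts an element to a US ZIP code format. The element name zip is hypothetical; a DTD could only say that the element contains text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="zip">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <!-- Five digits, optionally followed by a hyphen and four more digits -->
        <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
```

With this rule, a value like 21703 or 21703-1234 passes validation, while a value like ABCDE is rejected.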
9. RELAX NG
Another method to define a data structure. It is less common than DTD or
Schema but is in active use.
Stands for REgular LAnguage for XML Next Generation.
Can be written in XML, like Schema, or in an alternative compact syntax.
Simple example:
• XML document:
<book>
  <page>This is page one.</page>
  <page>This is page two.</page>
</book>
• Corresponding RELAX NG schema in XML format:
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="page">
      <text/>
    </element>
  </oneOrMore>
</element>
• Compact syntax:
element book {
  element page { text }+
}
*(Some info taken from https://en.wikipedia.org/wiki/RELAX_NG)
10. Element vs. attribute markup
Data can be stored within element tags or attribute values. Why choose one
or the other? Here are some reasons why attribute data is more limited*:
• Attributes cannot contain multiple values (child elements can).
• Attributes are not easily expandable (for future changes).
• Attributes cannot describe structures (child elements can).
• Attributes are more difficult to manipulate by program code.
• Attribute values are not easy to test against a DTD.
*(taken from https://www.w3schools.com/xml/xml_dtd_el_vs_attr.asp)
In technical communication, the most common use of attributes is to store
formatting, filtering, and reuse information. The body of elements is
reserved for the literary content.
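A minimal sketch of the trade-off, using invented element and attribute names (procedures, step, contact, phone, audience) and fictitious phone numbers:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<procedures>
  <!-- Attributes carry filtering and reuse information about the element -->
  <step id="restart" audience="administrator">
    <!-- The element body carries the words the reader will see -->
    <para>Restart the service.</para>
  </step>
  <!-- Data crammed into an attribute is one flat string -->
  <contact phones="301-555-0100, 301-555-0101"/>
  <!-- Child elements can repeat and can be extended later -->
  <contact>
    <phone type="office">301-555-0100</phone>
    <phone type="mobile">301-555-0101</phone>
  </contact>
</procedures>
```

The first contact element forces any program to split the string itself; the second lets the structure do that work.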
11. So now we have markup. What to do with it?
Markup is not just for fun, although some people think it is fun. Markup
should serve some useful purpose, like:
• Facilitate content reuse
• Direct automated formatting processes
• Enhance options for content storage and portability
ABOVE ALL ELSE, REMEMBER THIS: Markup provides a roadmap for your
content that a computer can read. That is, it makes your content look like
data. Once you make your content easily processable by a computer
algorithm, the computer can do more work for you. In other words, it can
automate things like:
• Repetitious busywork
• Moving content through any kind of enhanced reuse or publishing
process
The more busywork that the computer does, the more reliable the results.
Furthermore, you have more time to WRITE THE CONTENT THAT YOUR
AUDIENCE NEEDS.
12. A brief intro to XSLT and publishing concepts
XML is not useful by itself! Nobody wants to read an XML file. Therefore,
some type of publishing process is necessary.
A publishing process should:
• Use the markup as the fundamental guide to generate output.
• Be as automated as possible.
• Have a close relationship with the original design of the structure definition,
typically having been developed in parallel.
Countless publishing processes exist in the world… some simple, some
complex… some based on off-the-shelf (OTS) tools and others completely custom…
there is no right answer for every situation, although some consultants
and vendors will say otherwise!
Many publishing processes, OTS and custom, use XSLT as the foundation
for creating something new from an XML source. “Something new” might
include common human-readable formats such as HTML (in a browser)
and PDF (in a reader).
13. What is XSLT?
Stands for eXtensible Stylesheet Language Transformations.
It is a mature standard for converting XML to some other text format,
such as HTML, CSV, other XML, or any other text-based format.
Well-supported in the IT community through forums, tools, tutorials,
etc.
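To illustrate a conversion to plain text, here is a minimal sketch of a stylesheet that turns the book example from the RELAX NG slide into comma-separated lines. It assumes the page text contains no commas or quotes, so it skips the escaping that real CSV would need.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Ask the engine for plain text instead of XML or HTML -->
  <xsl:output method="text"/>
  <xsl:template match="/book">
    <!-- Visit each page element in document order -->
    <xsl:for-each select="page">
      <!-- First column: the position of the page (1, 2, ...) -->
      <xsl:value-of select="position()"/>
      <xsl:text>,</xsl:text>
      <!-- Second column: the text inside the page element -->
      <xsl:value-of select="."/>
      <!-- End the line with a newline character -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the two-page book, this produces two lines: "1,This is page one." and "2,This is page two."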
14. Components required to make XSLT happen
An XML file with the data to transform
An XSLT stylesheet with the instructions for the transformation
An engine that applies the stylesheet to the XML file; that is, does all
the work
Some mechanism to capture the output
[Diagram: the XML file and the stylesheet feed into the engine, which produces the output.]
15. Key concepts about the XSLT processing flow
By default (using an empty stylesheet), the output of a transform
is all the text node data of the original XML file. Effectively, the
process starts at the root element and automatically walks
through every branch of the tree.
When you want something specific to happen, your stylesheet
must effectively put up a “red light” to stop this flow at some
node. Once you stop the flow, you can start to customize the
output however you want.
The key element to stop the automatic flow is <xsl:template>. All
instructions for a customized output live in one of these
elements.
To resume the flow (if desired), the <xsl:apply-templates>
element is effectively a green light.
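The red light and green light idea can be sketched with a small stylesheet. This is a minimal example that assumes an input document with a book root element and page children, like the one on the RELAX NG slide; it is one of many ways to write such a transform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Red light: stop the automatic flow at the book element -->
  <xsl:template match="book">
    <html>
      <body>
        <!-- Green light: resume the flow into the children of book -->
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <!-- Red light at each page: wrap its text in an HTML paragraph -->
  <xsl:template match="page">
    <p>
      <!-- Green light: let the default flow output the text of the page -->
      <xsl:apply-templates/>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

The result is an HTML page with one paragraph per page element. Removing both templates would leave only the default behavior, which outputs the text of the pages with no HTML tags at all.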
16. About processing engines
Many different processing engines are available. Some are free and
some are not. All are designed for operation within some particular
context. For example:
• XML editors – Any worthy XML editor will include XSLT processing. In this
context, it is often used for stylesheet design and testing. The output is
typically rendered in some window within the editor interface. To learn
more, see the Wikipedia article that compares XML editors.
• Programming and scripting languages – Most mainstream languages offer XSLT
support through built-in or readily available libraries. Invoking XSLT from a
language typically means that you have your stylesheet ready and you are
looking to automate the process, for whatever reason.
• Web browsers – All major browsers can do XSLT. For a browser, the output
is normally the browser window. Therefore, the typical use case for XSLT in
a browser is to dynamically transform some kind of XML content into a
browser-ready format (HTML).
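For the browser case, the usual hookup is a processing instruction near the top of the XML file that names the stylesheet. In this minimal sketch, book.xsl is a hypothetical file name, and the example assumes the stylesheet is served from the same location as the XML file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Tells an XSLT-capable browser which stylesheet to apply -->
<?xml-stylesheet type="text/xsl" href="book.xsl"?>
<book>
  <page>This is page one.</page>
  <page>This is page two.</page>
</book>
```

When the browser opens this file, it acts as the engine: it reads the instruction, fetches the stylesheet, and displays the transformed result instead of the raw XML.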
17. Off-the-shelf XML standards
DITA
• Stands for Darwin Information Typing Architecture.
• The newest and most mainstream structure standard.
• A downloadable package includes all DTDs to write your XML and a full
toolkit for publishing a variety of formats.
• Very technically complex, both in the structure definition and the publishing
components.
• Has a strong emphasis on topic-level authoring.
• Implements a clever mechanism (specialization) that allows customization of DTDs while
still allowing any DITA-compliant tool to render the content, at least basically.
• Is an OASIS standard and has an organized committee that maintains it.
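To show what topic-level authoring looks like, here is a minimal sketch of a DITA task topic. The id value and the text are invented for illustration, and the DOCTYPE line assumes that the standard DITA task DTD is available to your tools.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<!-- One file holds one self-contained topic, here a task -->
<task id="restart_service">
  <title>Restarting the service</title>
  <taskbody>
    <steps>
      <!-- Each step requires a cmd element that states the action -->
      <step><cmd>Open the admin console.</cmd></step>
      <step><cmd>Click Restart.</cmd></step>
    </steps>
  </taskbody>
</task>
```

A full document is then assembled from many such topics rather than written as one long file.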
18. Off-the-shelf XML standards (cont’d)
DocBook
• Older than DITA, less popular but still used.
• Does not have a toolkit as advanced as DITA, but stylesheets and other
components for publishing are available.
• Maintained by Norman Walsh and a DocBook Project development team.
• Traditionally focused on full-document authoring and print publishing,
although supporters are quick to note that it is not limited to this
methodology.
• Sample file:
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Very simple book</title>
  <chapter xml:id="chapter_1">
    <title>Chapter 1</title>
    <para>Hello world!</para>
    <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
  </chapter>
  <chapter xml:id="chapter_2">
    <title>Chapter 2</title>
    <para>Hello again, world!</para>
  </chapter>
</book>
19. Just for fun, another type of markup - AsciiDoc
A super-simplified markup language designed to get readable words on the page, as quickly as
possible. Sample from https://en.wikipedia.org/wiki/AsciiDoc:
20. Email from the IRS – A good reason to know XML
Dear Free File Taxpayer:
The IRS has rejected your federal return. This means that your return has not been filed.
. . .
Here's the reason for the rejection:
Issue : Business Rule X0000-005 - The XML data has failed schema validation. cvc-complex-type.2.4.b. The content of element 'EgyPropCrMainHomeUSAddress' is not complete. One of '{"http://www.irs.gov/efile":ZIPCd}' is expected.
The following information may help you determine the form at issue:
Field/Xpath:
/efile:Return[1]/efile:ReturnData[1]/efile:IRS5695[1]/efile:NonBusinessEgyEffcntPropCrGrp[1]/efile:EgyPropCrMainHomeUSAddress[1]/efile:StateAbbreviationCd[1]
If you are unable to fix the issue, you will have to print the return and file by mail.
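Read with a little XML knowledge, the message says that an address element ended before a required ZIP code element appeared. Here is a hypothetical reconstruction of the offending fragment, using only the element names quoted in the email; the state value is a placeholder, and the actual IRS file is not shown in the message.

```xml
<efile:EgyPropCrMainHomeUSAddress xmlns:efile="http://www.irs.gov/efile">
  <!-- Other address elements, not named in the email, are omitted here -->
  <efile:StateAbbreviationCd>MD</efile:StateAbbreviationCd>
  <!-- The validator expected an efile:ZIPCd element here, but the address element ended -->
</efile:EgyPropCrMainHomeUSAddress>
```

The XPath in the email points to the last element the validator accepted, which is why it names the state field even though the ZIP code is the missing piece.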