DITA-OT Pipeline Webinar



  1. 1. Understanding the DITA-OT Pipeline<br />Aryeh Sanders, Suite Solutions<br />
  2. 2. Who Are We?<br />Our Mission<br />To increase our customers’ profitability by significantly improving the efficiency of their information development and delivery processes.<br />Qualitative Advantage<br />Content Lifecycle Implementation (CLI) is Suite Solutions’ comprehensive approach – from concept to publication – to maximizing the value of your information assets.<br />Our professionals are with you at every phase, determining, recommending and implementing the most cost-effective, flexible and long-term solution for your business.<br />
  3. 3. Clients and Partners<br />Private and Confidential<br />Suite Solutions ©2009<br />
  4. 4. Introduction<br />We will discuss how the DITA-OT is constructed and why<br />We’ll give some insight into this:<br />
  5. 5. Overview<br />Problem Statement<br />Solution: The DITA-OT Pipeline<br />Overview of Preprocessing<br />Overview of XHTML Output<br />Overview of PDF Output<br />
  6. 6. A Sample DITA Topic<br /><?xml version='1.0' encoding='UTF-8'?><br /><!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "c:\ditaot\dtd\topic.dtd"><br /><topic id="topic" xml:lang="fr-fr"><br /> <title>Sample Topic</title><br /> <body><br /> <p conref="conrefs.xml#conrefid/intropara"/><br /> <p>For more information, please see <xref href="infotable.xml#topicid/tableid"/>.</p><br /> <p audience="developers">Information that only developers want to know.</p><br /> </body><br /></topic><br />
  7. 7. Items That Need Preprocessing<br /><?xml version='1.0' encoding='UTF-8'?><br /><!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "c:\ditaot\dtd\topic.dtd"><br /><topic id="topic" xml:lang="fr-fr"><br /> <title>Sample Topic</title><br /> <body><br /> <p conref="conrefs.xml#conrefid/intropara"/><br /> <p>For more information, please see <xref href="infotable.xml#topicid/tableid"/>.</p><br /> <p audience="developers">Information that only developers want to know.</p><br /> </body><br /></topic><br />
  8. 8. Items That Need Preprocessing<br />In the previous slide we saw:<br />Conrefs<br />Cross References<br />Conditional Text<br />All of these need some modification in our final output<br />But there’s more!<br />
  9. 9. Items That Need Preprocessing<br />Lots more!<br />Navtitle<br />Keyref<br />Moving metadata from the map<br />Reltables<br />Sibling and Parent / Child related links<br />Chunking<br />Copy-to<br />Coderef<br />And more…<br />
  10. 10. Problem: Process DITA Correctly<br />We want a solution that delivers on all of the features in the DITA spec<br />Our solution should make it easy to reason about correctness<br />Our solution should use common processing for most DITA features for many output formats<br />
  11. 11. Solution: DITA Pipeline<br />Our Solution: We should build the DITA-OT as a pipeline<br />Each step makes one type of change to the DITA, and the output is also DITA<br />Output from each step is the input to the next step<br />This makes it easier to reason about the correctness of the implementation: each step should do one thing well<br />It doesn’t solve all of our problems<br />We still need to make sure each step is correct, of course<br />We need to ensure the order is correct<br />
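The pipeline idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how the DITA-OT is implemented (the toolkit uses Ant, Java and XSLT): the function names, the step list and the simplified `xtrf`/`xtrc` values are assumptions made for the example. The point is that every step takes a DITA tree and returns a DITA tree, so steps can be chained and checked one at a time.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def add_debug_attrs(root, source_name):
    """Step 1 (hypothetical): record where each element came from.

    The xtrf/xtrc values written here are a simplification for illustration.
    """
    seen = Counter()
    for elem in root.iter():
        seen[elem.tag] += 1
        elem.set("xtrf", source_name)
        elem.set("xtrc", f"{elem.tag}:{seen[elem.tag]}")
    return root


def filter_audience(root, excluded):
    """Step 2 (hypothetical): remove elements whose audience is excluded."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("audience") in excluded:
                parent.remove(child)
    return root


def run_pipeline(root, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        root = step(root)
    return root


topic = ET.fromstring(
    '<topic id="topic"><title>Sample Topic</title><body>'
    '<p>Shown to everyone.</p>'
    '<p audience="developers">Information that only developers want to know.</p>'
    '</body></topic>'
)
steps = [
    lambda r: add_debug_attrs(r, "test.xml"),
    lambda r: filter_audience(r, {"developers"}),
]
result = run_pipeline(topic, steps)
paragraphs = result.findall("./body/p")
print(ET.tostring(result, encoding="unicode"))
```

Because the result of every step is still a DITA-like tree, any step can be tested in isolation by comparing its input and output.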
  12. 12. Example of a Pipeline Step: Conref<br />Before:<br /> <p class="- topic/p " conref="conrefs.xml#conrefid/intropara" xtrf="C:\cygwin\home\Family\webinar\test.xml" xtrc="p:1"></p><br />After:<br /><p class="- topic/p " xtrf="C:\cygwin\home\Family\webinar\conrefs.xml" xtrc="p:1">This is an introductory paragraph.</p><br />Note, if you try this and compare the files: this isn’t the only difference between them, but it’s the only meaningful one. The other differences are cosmetic (e.g. <xref/> vs. <xref></xref>).<br />
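A rough Python sketch of what the conref step does to the paragraph above. It is not the toolkit's code: `parse_conref`, `resolve_conrefs` and the in-memory `documents` dictionary are hypothetical, there is no error handling, and the `class`, `xtrf` and `xtrc` attributes are left out to keep the example short.

```python
import copy
import xml.etree.ElementTree as ET


def parse_conref(value):
    """Split 'file.xml#topicid/elementid' into its three parts."""
    filename, _, fragment = value.partition("#")
    topic_id, _, element_id = fragment.partition("/")
    return filename, topic_id, element_id


def resolve_conrefs(root, documents):
    """Pull the referenced content into every element that carries a conref."""
    for elem in list(root.iter()):
        value = elem.get("conref")
        if value is None:
            continue
        filename, topic_id, element_id = parse_conref(value)
        source_root = documents[filename]
        # Find the referenced topic first, then the element inside it.
        topic = next(e for e in source_root.iter() if e.get("id") == topic_id)
        target = next(e for e in topic.iter() if e.get("id") == element_id)
        del elem.attrib["conref"]
        elem.text = target.text
        elem.extend([copy.deepcopy(child) for child in target])
    return root


# Stand-in for the files on disk; the contents are made up for the example.
documents = {
    "conrefs.xml": ET.fromstring(
        '<topic id="conrefid"><title>Reusable content</title><body>'
        '<p id="intropara">This is an introductory paragraph.</p>'
        '</body></topic>'
    )
}
topic = ET.fromstring(
    '<topic id="topic"><body>'
    '<p conref="conrefs.xml#conrefid/intropara"/>'
    '</body></topic>'
)
resolve_conrefs(topic, documents)
resolved = topic.find("./body/p")
print(ET.tostring(resolved, encoding="unicode"))
```

After the step, the paragraph no longer has a `conref` attribute and carries the referenced text, so later steps never need to know that a conref was there.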
  13. 13. Technical Details About the Log<br />The DITA-OT is mostly implemented in a mixture of three languages:<br />Ant<br />Java<br />XSLT<br />Ant drives the whole process<br />It is not used in the way familiar to Java developers: it does not manage dependencies on changed files, it mostly just runs the steps in sequence<br />java -jar dost.jar is just a wrapper around Ant, which sets up clearer logging, provides an easier way to enter some parameters, and runs the integrator<br />
  14. 14. Technical Details About the Log<br />If you run via Java, your log shows messages like:<br />Debug and filter input files...<br />Debug and filter input files...<br />Copy image files...<br />Copy html files...<br />Copy flag files...<br />Copy subsidiary files...<br />Copy generated files...<br />Resolve conref push...<br />Resolve conref in input files...<br />If you run via Ant, your log shows messages like:<br />debug-filter-flag-check:<br />debug:<br />debug-and-filter:<br />debug-filter:<br />copy-image-check:<br />copy-image:<br />copy-html-check:<br />
  15. 15. Technical Details About the Log<br />Both the Ant logs and the Java logs are reporting the same events<br />Ant is more verbose, since it logs each “target”, some of which aren’t very interesting<br />The Java version tries to give messages that are closer to English using the target description instead of the name<br />The “pipeline” that we’ve been discussing isn’t logged separately<br />The “pipeline” is implemented just as a set of steps in Ant<br />So there’s no direct way to view the pipeline itself<br />Confusing:<br />There’s an Ant task called <pipeline> that runs some parts of the pipeline, but not all<br />
  16. 16. Order in the Pipeline<br />Best current source of information, from Robert Anderson:<br />http://dita.xml.org/node/2469<br />Some steps must come before others:<br />If you want to figure out the text for a cross reference to a topic, you need to know the name of the topic – but if you use navtitle and locktitle in your map, the title will change<br />Therefore, extract the title after you process locktitle<br />Some steps should come before others:<br />Process conditional attributes early, so that you don’t have to waste time with other processing if you will remove the element anyway<br />At that link, you’ll find much more detailed information about each step<br />
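The ordering constraints on this slide can be written down as a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the step names are hypothetical labels, not DITA-OT target names, and both the "must" and the "should" constraints are treated as hard edges for simplicity.

```python
from graphlib import TopologicalSorter

# Each key must run after every step listed in its value.
# Step names are illustrative labels, not real DITA-OT targets.
must_run_after = {
    # Filter early so later steps skip elements that will be removed anyway.
    "apply-locktitle": {"filter-conditions"},
    "resolve-conref": {"filter-conditions"},
    # Cross-reference text needs the final title, so locktitle comes first.
    "fill-xref-text": {"apply-locktitle"},
}

order = list(TopologicalSorter(must_run_after).static_order())
print(order)
```

Any order the sorter returns satisfies the constraints; when two steps are independent of each other, their relative position is not fixed by the graph.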
  17. 17. Maintaining Valid DITA<br />Each step during preprocessing outputs valid DITA<br />This reduces the dependence between steps – if you skip a step, everything’s fine.<br />The OT does skip steps if it knows they’re not needed, e.g. it doesn’t do conref processing if there are no conrefs<br />This also helps catch errors, since the output gets validated at each step<br />The DITA specification has features to help:<br />xtrc, xtrf attributes for debugging – the toolkit fills these in at the beginning, and the values are maintained throughout processing.<br />related-links section in each topic to hold the links gathered from reltables and generated by relationships<br />Similar metadata that allows metadata from the map to be pushed into the topics<br />
  18. 18. Goal of Preprocessing<br />All DITA features should be processed into simpler but valid DITA<br />Example: Conrefs are eliminated<br />Example: Descriptions are filled in within cross references<br />Each DITA file should stand alone<br />All the information needed for output is now in the individual files<br />There’s a single DITA map that stands alone<br />All the information needed for output is in that map; all the submaps are joined together<br />New files created from chunk and copy-to are also “ready to go”<br />All that’s left to do with the DITA is switching to a new vocabulary – such as HTML or XSL-FO<br />
  19. 19. Performance Issues<br />In theory, the pipeline reads and processes files many times<br />DTDs (once for every step for every file)<br />XSLT Stylesheets (once for every time each is run)<br />The DITA files themselves (once for every step for every file)<br />Mitigation<br />Recent DITA-OT releases include a patch from Eliot Kimber that reads the DTDs only once and caches them<br />Ant caches stylesheets<br />There’s a price to pay: it costs memory – if you run out, you can shut off this cache with dita.preprocess.reloadstylesheet=true<br />There’s no cache for the DITA files themselves yet<br />There have been some discussions on the developer group about changing the pipeline implementation, so this might be provided as part of that<br />
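The trade-off between caching and reloading can be illustrated with a minimal cache. This is a generic sketch, not Ant's stylesheet cache: the `StylesheetCache` class and the fake loader are hypothetical, and the `reload_each_time` flag only mimics the effect that the slide attributes to `dita.preprocess.reloadstylesheet=true`.

```python
class StylesheetCache:
    """Keep compiled stylesheets in memory unless told to reload each time."""

    def __init__(self, loader, reload_each_time=False):
        self._loader = loader
        self._reload = reload_each_time
        self._cache = {}
        self.loads = 0  # how many times the loader actually ran

    def get(self, path):
        if not self._reload and path in self._cache:
            return self._cache[path]
        self.loads += 1
        compiled = self._loader(path)
        if not self._reload:
            self._cache[path] = compiled  # costs memory, saves time
        return compiled


def fake_loader(path):
    """Stand-in for an expensive parse-and-compile operation."""
    return f"compiled:{path}"


cached = StylesheetCache(fake_loader)
uncached = StylesheetCache(fake_loader, reload_each_time=True)
for _ in range(3):
    cached.get("conref.xsl")
    uncached.get("conref.xsl")
print(cached.loads, uncached.loads)
```

With the cache on, the stylesheet is loaded once and held in memory; with reloading on, nothing is held but the load cost is paid on every run.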
  20. 20. Overview of HTML Processing<br />HTML itself is “simple” – it translates the DITA into corresponding XHTML elements<br />The stylesheets do have to do work to change between structures that aren’t quite similar, such as between certain DITA tables and HTML tables<br />Most are straightforward:<br /><uicontrol> becomes <span class="uicontrol"><br />Formatting is handled in the CSS file, not in processing<br />DITA topics are processed to make the HTML files<br />The merged map is processed to make the TOC<br />
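The `<uicontrol>` mapping can be sketched as a tiny recursive transform. The toolkit does this in XSLT; the Python below is only an illustration, and the `SAME_NAME` set and the fallback rule (any other element becomes a `span` whose class is the DITA element name) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Elements assumed, for this sketch, to keep the same name in HTML.
SAME_NAME = {"p", "ul", "li"}


def to_html(elem):
    """Map a DITA element to an HTML element, recursing into children."""
    if elem.tag in SAME_NAME:
        html = ET.Element(elem.tag)
    else:
        # Styling is left to CSS: only the class name is emitted here.
        html = ET.Element("span", {"class": elem.tag})
    html.text = elem.text
    html.tail = elem.tail
    html.extend([to_html(child) for child in elem])
    return html


dita = ET.fromstring("<p>Click <uicontrol>OK</uicontrol> to continue.</p>")
html_text = ET.tostring(to_html(dita), encoding="unicode")
print(html_text)
```

The output carries no formatting of its own, which matches the slide's point that the look of `uicontrol` is decided in the CSS file.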
  21. 21. Overview of PDF Processing<br />PDF processing in the toolkit is more complicated<br />You need at least one more step – convert DITA to an intermediate format that is straightforward to convert to PDF<br />In theory, it didn’t have to be so complicated, since you can generate PDFs from HTML and CSS<br />In practice, CSS used to be less sophisticated than it is now, and people demand more of their PDFs<br />Index<br />Language-specific font control<br />You might want your Chinese characters in a Chinese font, and your English characters in an English font – no font covers all languages<br />We’re not going to discuss all the things that PDF output is doing, just the steps it takes<br />
  22. 22. PDF Processing Steps (1)<br />Topicmerge – merge all the topics and the map into one big file<br />This is a good opportunity to do certain kinds of processing, such as creating fake topics that say “MISSING TOPIC” if a topic is missing<br />This step already does create fake topics to correspond to <topichead> in the map, since the main PDF processing is done topic by topic<br />Note: topicmerge is done in Java, then post-processed by topicmerge.xsl in the FO plugin<br />_MERGED.ditamap to stage1.xml<br />Collects all the indexterms from the topics, and puts them at the end for later processing to create the index<br />
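A simplified sketch of the merge idea, assuming topics are already loaded into a dictionary keyed by href. It is not the toolkit's Java topicmerge: the `merged` root element and the helper names are hypothetical, nesting is flattened, and the "MISSING TOPIC" placeholder is the possibility the slide mentions rather than confirmed toolkit behavior.

```python
import xml.etree.ElementTree as ET


def placeholder(title):
    """Build a fake topic that carries only a title."""
    topic = ET.Element("topic")
    ET.SubElement(topic, "title").text = title
    return topic


def merge(map_root, topics):
    """Walk the map in document order and collect one topic per entry."""
    merged = ET.Element("merged")  # hypothetical wrapper element
    for node in map_root.iter():
        if node.tag == "topichead":
            # A topichead has no file behind it, so a fake topic stands in.
            merged.append(placeholder(node.get("navtitle")))
        elif node.tag == "topicref":
            topic = topics.get(node.get("href"))
            merged.append(topic if topic is not None else placeholder("MISSING TOPIC"))
    return merged


ditamap = ET.fromstring(
    '<map><topichead navtitle="Getting Started">'
    '<topicref href="intro.xml"/><topicref href="gone.xml"/>'
    '</topichead></map>'
)
topics = {
    "intro.xml": ET.fromstring('<topic id="intro"><title>Introduction</title></topic>')
}
merged = merge(ditamap, topics)
titles = [t.findtext("title") for t in merged]
print(titles)
```

Once everything is in one tree, later processing can treat headings, real topics and placeholders uniformly, topic by topic.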
  23. 23. PDF Processing Steps (2)<br />stage1.xml to stage2.fo<br />The PDF plugin converts DITA to XSL-FO, an intermediate XML language from which PDFs can easily be generated<br />This step does the main conversion: it converts DITA to XSL-FO. It also adds information about page sizes, font sizes, fonts, and more or less everything else needed to generate the PDF.<br />Unlike HTML, XSL-FO is a very simple language. For instance, you must explicitly number steps instead of letting your web browser do it automatically. This step adds things like those numbers.<br />This step doesn’t actually use real fonts. Instead it uses “logical fonts” such as Sans or Monospace or FrontCoverFont, because…<br />
  24. 24. PDF Processing Steps (3)<br />stage2.fo to stage3.fo<br />Java step which looks for characters in other languages, which will be marked for font processing<br />stage3.fo to topic.fo<br />Replaces logical fonts with real font names from font-mappings.xml<br />If you specify that Chinese should have a different Sans font, then English characters will have the English Sans font, and Chinese characters will have the Chinese Sans font, even if they both appear in the same element<br />topic.fo to PDF<br />Actual PDF is created by an FO Processor, probably one of:<br />Apache FOP (free, but doesn’t support indexes)<br />RenderX XEP<br />Antenna House Formatter<br />
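The two font steps can be pictured as "split the text into runs by script, then look up a real font for each run". The sketch below is an illustration only: the `FONT_MAPPINGS` table and its font names are made up (the real mappings live in font-mappings.xml, whose format is not shown here), and Chinese is detected with just the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from itertools import groupby

# Hypothetical mapping from (logical font, character class) to a real font.
FONT_MAPPINGS = {
    ("Sans", "default"): "Helvetica",
    ("Sans", "chinese"): "SimSun",
}


def char_class(ch):
    """Classify a character; only the basic CJK ideograph block is checked."""
    return "chinese" if "\u4e00" <= ch <= "\u9fff" else "default"


def split_runs(text, logical_font):
    """Return (real font, text) pairs, one per run of same-class characters."""
    return [
        (FONT_MAPPINGS[(logical_font, cls)], "".join(chars))
        for cls, chars in groupby(text, key=char_class)
    ]


sample = "DITA \u6587\u6863 rocks"  # two Chinese characters between English words
runs = split_runs(sample, "Sans")
print(runs)
```

Both kinds of text come from the same element with the same logical font, yet each run ends up with its own real font, which is the behavior the slide describes.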
  25. 25. Questions?<br />Any questions?<br />Be in touch!<br />Aryeh Sanders<br />aryehs@suite-sol.com<br />