What publishers need to know about digitization

Webinar given on November 12, 2008 as part of an O'Reilly Tools of Change series on publishing and technology.

More information on Liza Daly and threepress can be found at http://www.threepress.org/

  1. What publishers need to know about digitization. Liza Daly, Consultant, Threepress Consulting Inc. http://threepress.org/ Thursday, November 13, 2008
  2. Introduction. Liza Daly, liza@threepress.org. Software engineer and consultant specializing in web-based publishing applications. Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications. Online reference products for Oxford University Press and Columbia University Press. Current: ebook applications and consulting.
  3. Introduction: What I’ll cover. 1. Digitization 101: from scanning to OCR to XML. 2. Smart vendor selection. 3. A gentle introduction to XML. 4. I’ve got digital content: now what?
  4. What we talk about when we talk about digitization. Turning printed content... text ...or microfilm archives ...or documents in legacy systems ...into modern digital forms. (Sometimes starting from print is easier.) <text>
  5. Digitization 101. Assume that we’re starting from a print archive. (If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!)
  6. Scan. From paper to digital images...
  7. OCR. ...to digital text...
  8. XML. ...to reusable markup.
  9. Digitization 101: Scanning. http://www.flickr.com/photos/heather-dietz/448629362/
  10. Digitization 101: Scanning. Scan. http://www.flickr.com/photos/heather-dietz/448629362/
  11. Digitization 101: Scanning methods. Destructive scanning: Pages are cut out of the binding and machine-fed into the scanner in batch. (Imagine a huge office copier.) Scanned copies are normally destroyed.
  12. Digitization 101: Scanning methods. Non-destructive scanning: Pages kept in their original binding. Manual page-turning. Originals are returned to the source. Primarily for rare or historical works.
  13. Digitization 101: Scanning methods. High-volume, non-destructive automated scanning also exists.
  14. Digitization 101: OCR. Optical Character Recognition. OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors. Common errors include wordsruntogether or speling mistakes.
  15. Digitization 101: OCR. OCR quality is sensitive to a number of factors. Is the document in good condition with clear type? Is the layout simple or complex? Is a custom dictionary required for proper names or obscure terms?
  16. This is easy.
  17. This is hard.
  18. http://timesmachine.nytimes.com/
  19. Digitization 101: OCR. Better OCR versus worse OCR. Layout: simple text versus multicolumn, sidebars. Vocabulary: common versus specialized. Source quality: clean and legible versus damaged, dirty or partial.
  20. Digitization 101: OCR. Limitations and cautions: Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries. Tables and equations aren’t suitable for OCR. A human check is always advisable.
  21. If the goal of digitization is to make content findable on the web, the text needs to be correct.
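
The dictionary-assisted correction mentioned in the OCR slides can be illustrated with a minimal Python sketch. It is not how any particular OCR product works: the tiny `DICTIONARY`, the 0.8 similarity cutoff, and the function names are all assumptions made for illustration. It tries a fuzzy spelling match first, then tries to split run-together words, and otherwise leaves the token alone for a human check.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real project would load a full word list
# plus any custom terms (proper names, specialized jargon).
DICTIONARY = {
    "common", "errors", "include", "words", "run",
    "together", "or", "spelling", "mistakes",
}


def split_run_together(token, dictionary):
    """Split a token into dictionary words, or return None if that fails."""
    if token == "":
        return []
    # Try the longest possible leading word first, then recurse on the rest.
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in dictionary:
            rest = split_run_together(token[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None


def correct_token(token, dictionary):
    """Return a list of corrected words for one OCR token."""
    word = token.lower()  # simplification: original capitalization is lost
    if word in dictionary:
        return [word]
    # Fuzzy match catches small misspellings (assumed cutoff of 0.8).
    close = get_close_matches(word, dictionary, n=1, cutoff=0.8)
    if close:
        return close
    pieces = split_run_together(word, dictionary)
    if pieces:
        return pieces
    return [token]  # unknown: leave untouched for a human check


def correct_line(line, dictionary=DICTIONARY):
    """Correct every whitespace-separated token in a line of OCR text."""
    corrected = []
    for token in line.split():
        corrected.extend(correct_token(token, dictionary))
    return " ".join(corrected)


print(correct_line("wordsruntogether or speling mistakes"))
# prints: words run together or spelling mistakes
```

Tokens that match nothing in the dictionary pass through unchanged, which is where a custom dictionary or a human check would come in.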
  22. SCAN the documents to convert to digital files. Apply OCR to the scans to get computer-ready text. Convert the text into XML.
  23. Digitization 101: XML. Not all digitization projects end with XML. Why?
  24. Characters-per-page versus digitization cost/time. [Chart: 1,000, 1,500, 2,000, 3,000+; series: XML, human-checked OCR, machine OCR]
  25. Vendor selection and costs
  26. Consider: Quantity of material. Quality of the originals. Layout complexity. Vocabulary. But also: Project management. Shipping. Heterogeneous content. Front/back matter & indexes.
  28. Vendor tips. Send samples before considering any estimate ...and have the output evaluated. Compare not just cost-per-page but estimated time. Feel comfortable with their project management. Check references!
  29. Should you partner?
  30. ?
  31. ? ?
  32. It’s too early to say whether Google Books is right for all publishers. But you’re certainly giving up: 1. Control 2. Revenue share 3. Ownership
  33. Creative partnerships. Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
  34. XML 101
  35. XML 101: What’s XML? XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
  36. XML 101: What’s XML? Text with “markup” is an old idea. This is a paragraph.¶ This is another paragraph.
  37. XML 101: What’s XML? XML just changes the symbols around. <p>This is a paragraph.</p> <p>This is another paragraph.</p>
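
As a rough illustration of "changing the symbols around", the following Python sketch turns pilcrow-separated text into `<p>` elements using the standard library's `xml.etree.ElementTree`. The function name and the wrapping `<document>` root element are assumptions for the example; the slides only show the `<p>` tags.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_xml(text, root_tag="document"):
    """Wrap each pilcrow-separated paragraph in a <p> element.

    root_tag is a hypothetical wrapper; XML needs a single root element.
    """
    root = ET.Element(root_tag)
    for para in text.split("¶"):
        para = para.strip()
        if para:  # skip empty pieces, e.g. after a trailing pilcrow
            ET.SubElement(root, "p").text = para
    return ET.tostring(root, encoding="unicode")


marked_up = paragraphs_to_xml("This is a paragraph.¶ This is another paragraph.")
print(marked_up)
# prints: <document><p>This is a paragraph.</p><p>This is another paragraph.</p></document>
```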
  38. XML 101: What’s XML good for? 1. Everybody speaks it. 2. Once you have one kind of XML, it’s easy to turn it into another kind.
  39. When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
  40. Kinds of XML
  41. Kinds of XML: DTD
  42. Kinds of XML: Language, DTD
  43. Kinds of XML: Language, DTD, Format
  44. Kinds of XML: Language, DTD, Schema, Format
  45. Kinds of XML: Language, DTD, Schema, Format, XSD
  47. XML 101: Schema vocabulary. The schema defines the list of <tags> that appear in a document, and what they mean. A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
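
The claim that one kind of XML is easy to turn into another can be sketched by renaming tags from one vocabulary to another. This is a simplified Python example, not a real schema conversion: the `TAG_MAP` below (including the `document` and `article` root names) is hypothetical, and production conversions are often written in XSLT and also have to handle attributes, nesting rules, and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's tag names to another's.
TAG_MAP = {"document": "article", "p": "para"}


def convert_vocabulary(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in tag_map; keep the rest."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() visits the root and all descendants
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


converted = convert_vocabulary("<document><p>This is a paragraph.</p></document>")
print(converted)
# prints: <article><para>This is a paragraph.</para></article>
```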
  48. METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  49. XML: METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  50. XML 101: Choosing a schema. Books: DocBook, DAISY, ePub, TEI. Magazines/Newspapers: METS/ALTO, PRISM. Scholarly: TEI, MathML.
  51. XML 101: DIY schemas. Creating your own schema should be a last resort. Expensive to build and maintain. High training and hiring costs. Reduced opportunities for interoperability. Regulatory compliance.
  53. Complex schemas cost more... ...but also provide more opportunity for product development. [Chart: cost from $ to $$$, scale from Low to High]
  54. Now what?
  55. Monetizing XML conversion: XML
  56. Monetizing XML conversion: XML, web
  57. XML, web
  59. UGC, web
  60. Remixing content. XML allows content to be distributed, altered, and recontextualized in unexpected ways. http://flickr.com/photos/thomashawk/2492298772/
  61. Small Beer Press
  62. Questions? Liza Daly, Threepress Consulting Inc. +01 617 301 0552 liza@threepress.org
