What publishers need to know about digitization

Webinar given on November 12, 2008 as part of an O'Reilly Tools of Change series on publishing and technology.

More information on Liza Daly and threepress can be found at http://www.threepress.org/

Usage Rights

CC Attribution License

    Presentation Transcript

    • What publishers need to know about digitization Liza Daly Consultant, Threepress Consulting Inc. http://threepress.org/ Thursday, November 13, 2008
    • Introduction: Liza Daly (liza@threepress.org). Software engineer and consultant specializing in web-based publishing applications. Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications. Online reference products for Oxford University Press and Columbia University Press. Current: ebook applications and consulting.
    • Introduction: What I’ll cover. 1. Digitization 101: from scanning to OCR to XML. 2. Smart vendor selection. 3. A gentle introduction to XML. 4. I’ve got digital content: now what?
    • What we talk about when we talk about digitization: turning printed content, or microfilm archives, or documents in legacy systems, into modern digital forms. (Sometimes starting from print is easier.)
    • Digitization 101: Assume that we’re starting from a print archive. (If you’re starting from a digital file, congratulations, your costs just went down, but not to zero!)
    • Scan: from paper to digital images...
    • OCR: ...to digital text...
    • XML: ...to reusable markup.
    • Digitization 101: Scanning (photo: http://www.flickr.com/photos/heather-dietz/448629362/)
    • Digitization 101: Scanning methods. Destructive scanning: pages are cut out of the binding and machine-fed into the scanner in batch. (Imagine a huge office copier.) Scanned copies are normally destroyed.
    • Digitization 101: Scanning methods. Non-destructive scanning: pages are kept in their original binding, with manual page-turning. Originals are returned to the source. Primarily for rare or historical works.
    • Digitization 101: Scanning methods. High-volume, non-destructive automated scanning also exists.
    • Digitization 101: OCR (Optical Character Recognition). OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors. Common errors include wordsruntogether or speling mistakes.
    • Digitization 101: OCR. OCR quality is sensitive to a number of factors. Is the document in good condition with clear type? Is the layout simple or complex? Is a custom dictionary required for proper names or obscure terms?
    • This is easy.
    • This is hard.
    • http://timesmachine.nytimes.com/
    • Digitization 101: OCR.
        Factor            Better OCR           Worse OCR
        Layout            Simple text          Multicolumn, sidebars
        Vocabulary        Common               Specialized
        Source quality    Clean and legible    Damaged, dirty or partial
    • Digitization 101: OCR. Limitations and cautions: Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries. Tables and equations aren’t suitable for OCR. A human check is always advisable.
    • If the goal of digitization is to make content findable on the web, the text needs to be correct.
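The dictionary-based correction mentioned in the OCR slides can be sketched roughly as follows. This is a minimal illustration, not how any particular OCR product works: the tiny `DICTIONARY`, the function names, and the 0.8 similarity cutoff are all assumptions made for the example. It repairs the two error types named on the slide (run-together words and near-miss spellings) and leaves anything it cannot resolve untouched for a human check.

```python
import difflib

# Hypothetical mini-dictionary; a real project would load a full word list
# plus any custom vocabulary (proper names, specialized jargon).
DICTIONARY = {
    "common", "errors", "include", "words", "run",
    "together", "or", "spelling", "mistakes",
}


def split_run_together(token, dictionary):
    """Split a token into dictionary words, or return None if impossible."""
    if not token:
        return []
    # Try the longest prefix first so whole words win over short fragments.
    for end in range(len(token), 0, -1):
        prefix = token[:end]
        if prefix in dictionary:
            rest = split_run_together(token[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None


def correct_token(token, dictionary):
    """Return the list of words that one OCR token should become."""
    lower = token.lower()
    if lower in dictionary:
        return [lower]
    # Near-miss spelling: take the closest dictionary word if it is close enough.
    close = difflib.get_close_matches(lower, sorted(dictionary), n=1, cutoff=0.8)
    if close:
        return [close[0]]
    # Otherwise check whether several dictionary words were run together.
    parts = split_run_together(lower, dictionary)
    if parts:
        return parts
    # Unknown token: keep it unchanged so a human can review it.
    return [token]


def correct_line(line, dictionary=DICTIONARY):
    """Correct every whitespace-separated token in a line of OCR text."""
    words = []
    for token in line.split():
        words.extend(correct_token(token, dictionary))
    return " ".join(words)


print(correct_line("common errors include wordsruntogether or speling mistakes"))
```

A word that is missing from the dictionary, such as a proper name, passes through unchanged, which is why the slides recommend custom dictionaries and a human check.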
    • Scan the documents to convert to digital files. Apply OCR to the scans to get computer-ready text. Convert the text into XML.
    • Digitization 101: XML. Not all digitization projects end with XML. Why?
    • [Chart: characters per page (1,000; 1,500; 2,000; 3,000+) versus digitization cost/time, for machine OCR, human-checked OCR, and XML.]
    • Vendor selection and costs
    • Consider: quantity of material; quality of the originals; layout complexity; vocabulary. But also: project management; shipping; heterogeneous content; front/back matter & indexes.
    • Vendor tips: Send samples before considering any estimate, and have the output evaluated. Compare not just cost-per-page but estimated time. Feel comfortable with their project management. Check references!
    • Should you partner?
    • It’s too early to say whether Google Books is right for all publishers. But you’re certainly giving up: 1. Control. 2. Revenue share. 3. Ownership.
    • Creative partnerships: Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
    • XML 101
    • XML 101: What’s XML? XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
    • XML 101: What’s XML? Text with “markup” is an old idea. This is a paragraph.¶ This is another paragraph.
    • XML 101: What’s XML? XML just changes the symbols around. <p>This is a paragraph.</p> <p>This is another paragraph.</p>
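To show what "markers to tell a computer what the text means" looks like in practice, here is a small sketch that reads the two paragraphs from the slide using Python's standard library. The `<doc>` wrapper element and the `paragraphs` function are assumptions added for the example, since an XML document needs a single top-level element and the slide shows only the two `<p>` elements.

```python
import xml.etree.ElementTree as ET

# The two paragraphs from the slide, wrapped in a hypothetical <doc> root
# because an XML document must have exactly one top-level element.
SOURCE = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"


def paragraphs(xml_text):
    """Return the text of every <p> element, in document order."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("p")]


print(paragraphs(SOURCE))
# prints ['This is a paragraph.', 'This is another paragraph.']
```

Because the markup labels each paragraph explicitly, the program does not have to guess where one ends and the next begins.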
    • XML 101: What’s XML good for? 1. Everybody speaks it. 2. Once you have one kind of XML, it’s easy to turn it into another kind.
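The claim that one kind of XML is easy to turn into another can be illustrated with a simple tag-renaming sketch. The `TAG_MAP` below and the `<doc>` root are assumptions made for the example; the target names are DocBook-style (`<article>`, `<para>`), but the result is not claimed to be a valid DocBook document. Real conversions are more often written in XSLT and handle far more than tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a simple HTML-like vocabulary to DocBook-style names.
TAG_MAP = {"doc": "article", "p": "para"}


def convert(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in tag_map; leave others alone."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"
print(convert(source))
# prints <article><para>This is a paragraph.</para><para>This is another paragraph.</para></article>
```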
    • When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
    • Kinds of XML: Language, DTD, Schema, Format, XSD.
    • XML 101: Schema vocabulary. The schema defines the list of <tags> that appear in a document, and what they mean. A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
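The idea that a schema defines the list of allowed tags can be sketched as a simple vocabulary check. `SCHEMA_A`, `SCHEMA_B`, and `unknown_tags` are hypothetical names for this example. Real DTD or XSD validation also constrains nesting, order, and attributes, so this shows only the vocabulary part.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabularies: each set lists the tags one schema allows.
SCHEMA_A = {"doc", "p"}
SCHEMA_B = {"doc", "para"}


def unknown_tags(xml_text, allowed):
    """Return, sorted, the tags used in the document that the vocabulary lacks."""
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}
    return sorted(used - allowed)


document = "<doc><p>This is a paragraph.</p></doc>"
print(unknown_tags(document, SCHEMA_A))  # prints []
print(unknown_tags(document, SCHEMA_B))  # prints ['p']
```

The same document passes against one vocabulary and fails against the other, which is why the choice of schema has to be made before conversion begins.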
    • XML: METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI.
    • XML 101: Choosing a schema. Books: DocBook, DAISY, ePub, TEI. Magazines/Newspapers: METS/ALTO, PRISM. Scholarly: TEI, MathML.
    • XML 101: DIY schemas. Creating your own schema should be a last resort. Expensive to build and maintain. High training and hiring costs. Reduced opportunities for interoperability. Regulatory compliance.
    • Complex schemas cost more ($ at low complexity, $$$ at high), but also provide more opportunity for product development.
    • Now what?
    • Monetizing XML conversion [diagram: XML, web]
    • [diagram: UGC, web]
    • Remixing content: XML allows content to be distributed, altered, and recontextualized in unexpected ways. (photo: http://flickr.com/photos/thomashawk/2492298772/)
    • Small Beer Press
    • Questions? Liza Daly, Threepress Consulting Inc., +01 617 301 0552, liza@threepress.org