What publishers need to
         know about digitization
         Liza Daly
         Consultant, Threepress Consulting Inc.
         http://threepress.org/




Thursday, November 13, 2008
Introduction
         Liza Daly                               liza@threepress.org


              Software engineer and consultant specializing in
              web-based publishing applications
              Digitization projects for Ford Foundation, Arnold
              Arboretum, Rosen Publishing and SAGE Publications
              Online reference products for Oxford University Press
              and Columbia University Press
              Current: ebook applications and consulting



Introduction
         What I’ll cover


                       1. Digitization 101: from scanning to OCR to XML
                       2. Smart vendor selection
                       3. A gentle introduction to XML
                       4. I’ve got digital content: now what?



What we talk about
          when we talk about digitization


              Turning printed content...
              ...or microfilm archives
              ...or documents in legacy systems
              ...into modern digital forms.
              (sometimes starting from print is easier)

              [Graphic: "text" becoming "<text>"]



Digitization 101

                    Assume that we’re starting from a print archive.
                    (If you’re starting from a digital file, congratulations,
                    your costs just went down -- but not to zero!)




Scan

                              From paper to digital images...



OCR

                              ...to digital text...



XML

                              ...to reusable markup.

Digitization 101
         Scanning




 http://www.flickr.com/photos/heather-dietz/448629362/

Digitization 101
         Scanning methods

                              Destructive scanning
                              Pages are cut out of the binding and
                              machine-fed into the scanner in batch.
                              (Imagine a huge office copier.)
                              The scanned originals are normally destroyed.




Digitization 101
         Scanning methods

          Non-destructive scanning
          Pages kept in their original binding
          Manual page-turning
          Originals are returned to the source
          Primarily for rare or historical works




Digitization 101
         Scanning methods

                   High-volume, non-destructive automated scanning also exists.




Digitization 101
         OCR
                  Optical Character Recognition
                  OCR software “guesses” the letters that appear in an
                  image. A dictionary is used to help correct errors.




                  Common errors include wordsruntogether or
                  speling mistakes.
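
The dictionary idea above can be illustrated with a minimal Python sketch. It is not how any particular OCR product works: the tiny word list, the `correct_word` and `split_run_together` helpers, and the 0.8 similarity cutoff are all assumptions chosen for illustration. It uses only the standard library's `difflib`.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real project would load a full word list
# plus any custom vocabulary (proper names, jargon).
DICTIONARY = {"common", "errors", "include", "words", "run",
              "together", "or", "spelling", "mistakes"}


def correct_word(token, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the token unchanged."""
    if token in dictionary:
        return token
    matches = get_close_matches(token, sorted(dictionary), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def split_run_together(token, dictionary):
    """Split a token into dictionary words, or return None if impossible."""
    # best[i] is a list of words that exactly covers token[:i], if one exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in dictionary:
                best[end] = best[start] + [token[start:end]]
                break
    return best[-1]


print(correct_word("speling", DICTIONARY))
print(split_run_together("wordsruntogether", DICTIONARY))
```

A scheme like this also hints at why specialized vocabulary is harder: a correct term that is missing from the dictionary may be "corrected" into a more common word.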



Digitization 101
         OCR

              OCR quality is sensitive to a number of factors.
              Is the document in good condition with clear type?
              Is the layout simple or complex?
              Is a custom dictionary required for proper names or
              obscure terms?




This is easy.



This is hard.



http://timesmachine.nytimes.com/


Digitization 101
         OCR

                                 Better OCR          Worse OCR
              Layout             Simple text         Multicolumn, sidebars
              Vocabulary         Common              Specialized
              Source quality     Clean and legible   Damaged, dirty or partial


Digitization 101
         OCR

              Limitations and cautions:
              Documents with specialized jargon, such as medical
              journals or archaic texts, will require custom
              dictionaries.
              Tables and equations aren’t suitable for OCR.
              A human check is always advisable.



If the goal of digitization is to make content findable on the web,
the text needs to be correct.


              1. Scan the documents to convert them to digital files.
              2. Apply OCR to the scans to get computer-ready text.
              3. Convert the text into XML.

Digitization 101
         XML


                         Not all digitization projects end with XML.

                         Why?




Characters-per-page versus digitization cost/time

              [Chart: cost/time against characters per page
              (1,000; 1,500; 2,000; 3,000+) for machine OCR,
              human-checked OCR, and XML.]




Vendor selection
                              and costs



Consider:
                  Quantity of material
                  Quality of the originals
                  Layout complexity
                  Vocabulary
But also:
                  Project management
                  Shipping
                  Heterogeneous content
                  Front/back matter and indexes




Vendor tips
                   Send samples before considering any estimate
                       ...and have the output evaluated.
                   Compare not just cost-per-page but estimated time.
                   Feel comfortable with their project management.
                   Check references!
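
One way to have a vendor's sample output evaluated is to compare it against a small passage that a person has keyed in by hand. The sketch below is only an illustration under that assumption: `word_accuracy` is a hypothetical helper, the two sample strings are made up from the OCR error examples earlier in this talk, and a real evaluation would use many pages.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that also appear, in order, in the OCR text."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 1.0
    # autojunk=False keeps difflib from discarding very frequent words
    # when the sample passage is long.
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hand-keyed reference text versus a hypothetical vendor sample.
reference = "Common errors include words run together or spelling mistakes."
vendor_sample = "Common errors include wordsruntogether or speling mistakes."

print(f"Word accuracy: {word_accuracy(reference, vendor_sample):.0%}")
```

Running the same check on samples from each vendor gives a rough, comparable number to set beside cost-per-page and estimated time.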




Should you partner?




It’s too early to say whether
                              Google Books is right for all
                              publishers.


                              But you’re certainly giving up:
                                1. Control
                                2. Revenue share
                                3. Ownership



Creative partnerships
                                      Consider whether some of
                                      your backlist is public
                                      domain or can be released
                                      under a Creative
                                      Commons license.




XML 101




XML 101
         What’s XML?


                      XML is just plain text, with markers to
                      tell a computer what the text means
                      and how it should be laid out.




XML 101
         What’s XML?

         Text with “markup” is an old idea.



                              This is a paragraph.¶
                              This is another paragraph.




XML 101
         What’s XML?

         XML just changes the symbols around.



                              <p>This is a paragraph.</p>
                              <p>This is another paragraph.</p>
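
As a rough illustration of "changing the symbols around", the following Python sketch turns pilcrow-separated text into `<p>` elements. The `pilcrow_to_xml` function is a hypothetical helper written for this example, not part of any conversion tool.

```python
from xml.sax.saxutils import escape


def pilcrow_to_xml(text):
    """Wrap each pilcrow-separated paragraph in <p>...</p> tags."""
    paragraphs = [part.strip() for part in text.split("¶") if part.strip()]
    # escape() protects characters such as & and < that are special in XML.
    return "\n".join(f"<p>{escape(paragraph)}</p>" for paragraph in paragraphs)


print(pilcrow_to_xml("This is a paragraph.¶This is another paragraph."))
```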




XML 101
         What’s XML good for?


                          1. Everybody speaks it.

                          2. Once you have one kind of XML,
                             it’s easy to turn it into another kind.




When you decide to digitize to XML,
             you’ll need to pick what kind of XML you want.




Kinds of XML

              DTD, language, schema, format, XSD




XML 101
         Schema vocabulary


              The schema defines the list of <tags> that appear in a
              document, and what they mean.
              A paragraph ¶ in one schema might be <p>, but in
              another it might be <para>.
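
Because the difference between two schemas is often a matter of vocabulary, moving between them can sometimes be as simple as renaming tags. The sketch below assumes a made-up source document with a `<doc>` root and a hypothetical `TAG_MAP`; real conversions usually involve structural changes as well and are often written in XSLT instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's tag names to another's.
TAG_MAP = {"p": "para"}


def convert_tags(xml_text, tag_map):
    """Rename every element whose tag appears in tag_map."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"
print(convert_tags(source, TAG_MAP))
```

DocBook, for instance, marks paragraphs with `<para>`, so a mapping like this is a small first step toward that vocabulary.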




              METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI




XML 101
         Choosing a schema

              Books                   DocBook, DAISY, ePub, TEI
              Magazines/Newspapers    METS/ALTO, PRISM
              Scholarly               TEI, MathML



XML 101
         DIY schemas

                              Creating your own schema
                                should be a last resort.

                     Expensive to build and maintain.
                     High training and hiring costs.
                     Reduced opportunities for interoperability.
                     Regulatory compliance.

Complex schemas cost more...

              [Chart: cost, from $ to $$$, along an axis running from Low to High.]

              ...but also provide more opportunity
              for product development.
Now what?




Monetizing
         XML conversion



                              XML   web


UGC                 web


Remixing content


         XML allows content to be distributed, altered,
         and recontextualized in unexpected ways.




                                       http://flickr.com/photos/thomashawk/2492298772/
Small Beer Press




Questions?

                     Liza Daly
                     Threepress Consulting Inc.
                     +01 617 301 0552
                     liza@threepress.org





What publishers need to know about digitization

  • 1. What publishers need to know about digitization Liza Daly Consultant, Threepress Consulting Inc. http://threepress.org/ Thursday, November 13, 2008
  • 2. Introduction Liza Daly liza@threepress.org Software engineer and consultant specializing in web-based publishing applications Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications Online reference products for Oxford University Press and Columbia University Press Current: ebook applications and consulting Thursday, November 13, 2008
  • 3. Introduction What I’ll cover 1. Digitization 101: from scanning to OCR to XML 2. Smart vendor selection 3. A gentle introduction to XML 4. I’ve got digital content: now what? ? Thursday, November 13, 2008
  • 4. What we talk about when we talk about digitization Turning printed content... text ...or microfilm archives ...or documents in legacy systems ...into modern digital forms. (sometimes starting from print is easier) <text> Thursday, November 13, 2008
  • 5. Digitization 101 Assume that we’re starting from a print archive. (If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!) Thursday, November 13, 2008
  • 6. Scan From paper to digital images... Thursday, November 13, 2008
  • 7. OCR ...to digital text... Thursday, November 13, 2008
  • 8. XML ...to reusable markup. Thursday, November 13, 2008
  • 9. Digitization 101 Scanning http://www.flickr.com/photos/heather-dietz/448629362/ Thursday, November 13, 2008
  • 10. Digitization 101 Scanning Scan http://www.flickr.com/photos/heather-dietz/448629362/ Thursday, November 13, 2008
  • 11. Digitization 101 Scanning methods Destructive scanning: Pages are cut out of the binding and machine-fed into the scanner in batch. (Imagine a huge office copier.) The physical copies that were scanned are normally destroyed.
  • 12. Digitization 101 Scanning methods Non-destructive scanning: Pages are kept in their original binding. Manual page-turning. Originals are returned to the source. Primarily for rare or historical works.
  • 13. Digitization 101 Scanning methods High-volume, non-destructive automated scanning also exists.
  • 14. Digitization 101 OCR Optical Character Recognition OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors. Common errors include wordsruntogether or speling mistakes.
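The dictionary step on slide 14 can be pictured with a small sketch. This is not how any particular OCR product works; it only assumes a tiny, hypothetical word list and uses Python's standard `difflib` module to replace an unrecognized word with its closest dictionary entry. The names `DICTIONARY`, `correct_word` and `correct_text` are made up for illustration.

```python
import difflib

# Hypothetical word list. A real project would load a full dictionary,
# plus custom entries for proper names or specialized terms.
DICTIONARY = ["common", "errors", "include", "spelling", "mistakes",
              "words", "run", "together"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct each whitespace-separated word independently (punctuation is ignored)."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("speling mistakes"))
```

A lookup like this can repair a near-miss such as "speling", but it leaves "wordsruntogether" untouched because no single dictionary entry is close to it. Cases like that are one reason a human check remains advisable.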
  • 15. Digitization 101 OCR OCR quality is sensitive to a number of factors. Is the document in good condition with clear type? Is the layout simple or complex? Is a custom dictionary required for proper names or obscure terms?
  • 16. This is easy.
  • 17. This is hard.
  • 19. Digitization 101 OCR Better OCR versus worse OCR. Layout: simple text (better); multicolumn, sidebars (worse). Vocabulary: common (better); specialized (worse). Source quality: clean and legible (better); damaged, dirty or partial (worse).
  • 20. Digitization 101 OCR Limitations and cautions: Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries. Tables and equations aren’t suitable for OCR. A human check is always advisable.
  • 21. If the goal of digitization is to make content findable on the web, the text needs to be correct.
  • 22. Scan the documents to convert to digital files. Apply OCR to the scans to get computer-ready text. Convert the text into XML.
  • 23. Digitization 101 XML Not all digitization projects end with XML. Why?
  • 24. Chart: characters-per-page (1,000; 1,500; 2,000; 3,000+) versus digitization cost/time, for XML, human-checked OCR and machine OCR.
  • 25. Vendor selection and costs
  • 26. Consider: Quantity of material. Quality of the originals. Layout complexity. Vocabulary. But also: Project management. Shipping. Heterogeneous content. Front/back matter & indexes.
  • 28. Vendor tips: Send samples before considering any estimate, and have the output evaluated. Compare not just cost-per-page but estimated time. Feel comfortable with their project management. Check references!
  • 29. Should you partner?
  • 32. It’s too early to say whether Google Books is right for all publishers. But you’re certainly giving up: 1. Control 2. Revenue share 3. Ownership
  • 33. Creative partnerships Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
  • 35. XML 101 What’s XML? XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
  • 36. XML 101 What’s XML? Text with “markup” is an old idea. This is a paragraph.¶ This is another paragraph.
  • 37. XML 101 What’s XML? XML just changes the symbols around. <p>This is a paragraph.</p> <p>This is another paragraph.</p>
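To make the step from slide 36 to slide 37 concrete, here is a minimal sketch that turns pilcrow-separated text into `<p>` elements using Python's standard `xml.etree.ElementTree`. The function name `pilcrow_to_xml` and the wrapping `<body>` element are assumptions added for illustration; the slides only show the `<p>` tags.

```python
import xml.etree.ElementTree as ET


def pilcrow_to_xml(text, root_tag="body", para_tag="p"):
    """Wrap each pilcrow-separated chunk of text in its own paragraph element."""
    root = ET.Element(root_tag)  # hypothetical wrapper; XML needs a single root
    for chunk in text.split("¶"):
        chunk = chunk.strip()
        if chunk:  # skip empty chunks, e.g. after a trailing pilcrow
            ET.SubElement(root, para_tag).text = chunk
    return ET.tostring(root, encoding="unicode")


print(pilcrow_to_xml("This is a paragraph.¶ This is another paragraph."))
# prints <body><p>This is a paragraph.</p><p>This is another paragraph.</p></body>
```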
  • 38. XML 101 What’s XML good for? 1. Everybody speaks it. 2. Once you have one kind of XML, it’s easy to turn it into another kind.
  • 39. When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
  • 40. Kinds of XML: Language, DTD, Schema, Format, XSD
  • 47. XML 101 Schema vocabulary The schema defines the list of <tags> that appear in a document, and what they mean. A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
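Because two schemas can name the same thing differently, turning one kind of XML into another often starts with translating tag names. The sketch below assumes a hypothetical one-entry mapping (`TAG_MAP`) and a made-up `<doc>` root element; it uses Python's standard `xml.etree.ElementTree`. Real conversions are usually more involved and are commonly written in XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's tag names to another's.
TAG_MAP = {"p": "para"}


def translate_tags(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in tag_map; leave the rest as-is."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # visits the root and all descendants
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


print(translate_tags("<doc><p>This is a paragraph.</p></doc>"))
# prints <doc><para>This is a paragraph.</para></doc>
```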
  • 48. METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  • 49. XML: METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  • 50. XML 101 Choosing a schema. Books: DocBook, DAISY, ePub, TEI. Magazines/Newspapers: METS/ALTO, PRISM. Scholarly: TEI, MathML.
  • 51. XML 101 DIY schemas Creating your own schema should be a last resort. Expensive to build and maintain. High training and hiring costs. Reduced opportunities for interoperability. Regulatory compliance.
  • 53. Complex schemas cost more (from $ at low complexity to $$$ at high complexity), but also provide more opportunity for product development.
  • 55. Monetizing XML conversion XML
  • 56. Monetizing XML conversion XML web
  • 57. XML web
  • 59. UGC web
  • 60. Remixing content XML allows content to be distributed, altered, and recontextualized in unexpected ways. http://flickr.com/photos/thomashawk/2492298772/
  • 61. Small Beer Press
  • 62. Questions? Liza Daly Threepress Consulting Inc. +01 617 301 0552 liza@threepress.org