June 14, 2013
Page 1
Content Conversion Specialists
WS Refinement and Quality Assessment
Claus Gravenhorst
Director Strategic Initiatives
CCS
Content Conversion Specialists
europeana newspapers
Workshop Refinement and Quality Assessment, Belgrade 14.6.2013
OLR at CCS
From unstructured to structured newspaper data and the role
of content providers in the overall process
Agenda
• About CCS
• General workflow for mass digitization of newspapers
• OLR – Layout and structure analysis
• ENP OLR workflow (involvement of content providers, CPs)
• Quality assurance
• Output - METS/ALTO package
• Demo of first results
About CCS
• CCS Content Conversion Specialists GmbH (Hamburg), as technical project
partner, will provide its expertise and docWorks technology to set up and
operate a mass digitisation workflow to create high-quality structured content
from 2 million scanned newspaper pages provided by 5 library partners
• Page volume:
BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
• The distributed OLR workflow enables project partners
(content providers) to contribute to the integrated quality assurance process
• CCS will also contribute to the specification of the metadata model
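As a quick cross-check of the page volumes listed above, the per-partner figures can be summed and compared with the "2 million pages" headline. This is only an arithmetic sketch: the dictionary name and the rounding are illustrative choices, and the figures are taken directly from the slide (in thousands of pages).

```python
# Page volumes per library partner, in thousands of pages (from the slide).
PAGE_VOLUME_K = {
    "BNF": 1000,
    "NLE": 500,
    "SUB HH": 480,
    "NLF": 90,
    "SBB": 10,
}

# Total volume in thousands of pages.
total_k = sum(PAGE_VOLUME_K.values())

# Share of each partner in percent, rounded to one decimal place.
shares = {
    partner: round(100 * volume / total_k, 1)
    for partner, volume in PAGE_VOLUME_K.items()
}

if __name__ == "__main__":
    print(f"Total: {total_k} k pages")
    for partner, share in shares.items():
        print(f"{partner}: {share} %")
```

The total comes to 2,080 k pages, which is consistent with the rounded figure of 2 million.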
General workflow for mass digitization
[Figure: workflow diagram. Labels recoverable from the slide:
Check in, Scanning (robot, book, document and microfilm scanners), Imaging,
Conversion, Layout Analysis, OCR, ISR, QA + Correction, Automated QA,
Manual QA (in-house, near-shore, off-shore, multiple locations), random QA,
Reject, Re-Scan, Condition, Final Output, Delivery, Check out,
Item Tracking (document UID, barcode), Metadata (Z39.50),
Image/Metadata Database, Repository]
Layout and structure analysis
• Layout analysis based on a "bottom-up" approach
• A general rule system enables recognition of words, text
lines, text blocks and columns, and classification of text
blocks, illustrations, advertisements, tables and the
following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
• Structure analysis through classification of headlines
and grouping of zones into articles
(incl. article continuation)
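The four page types above can be read as a small decision rule over the zone types found on a page. The following is a minimal sketch of that reading, not the docWorks rule system: the function name, the zone labels ("text", "headline", "illustration", "advertisement") and the order in which the rules are applied are assumptions made for illustration. In particular, the slide does not say how a page with text and adverts but no illustration is typed; the sketch treats it as a content page.

```python
from enum import Enum


class PageType(Enum):
    TITLE = "title page"
    CONTENT = "content page"
    ILLUSTRATION = "illustration page"
    ADVERTISEMENT = "advertisement page"


def classify_page(zone_types, is_issue_title_page=False):
    """Assign one of the four page types from the zone types on a page.

    zone_types: iterable of zone labels (hypothetical label set).
    is_issue_title_page: True if this is the title page of the issue.
    The rule precedence below is an assumption, not taken from the slide.
    """
    if is_issue_title_page:
        return PageType.TITLE
    kinds = set(zone_types)
    # A page that contains adverts only.
    if kinds and kinds <= {"advertisement"}:
        return PageType.ADVERTISEMENT
    # A page that has at least one illustration.
    if "illustration" in kinds:
        return PageType.ILLUSTRATION
    # Everything else is treated as a content/text page (assumption).
    return PageType.CONTENT


if __name__ == "__main__":
    print(classify_page(["headline", "text", "illustration"]).value)
```

Keeping the rule in one small function makes the assumed precedence explicit, so it is easy to change if the actual rule system resolves overlapping cases differently.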
ENP OLR workflow | Conversion without scanning
[Figure: workflow diagram. Labels recoverable from the slide:
Material location, Conversion facility, Digital Image / Metadata Delivery,
Inspection / Automatic QA, Reject, Conversion, MD Recording, Doc Delivery,
Digital Object Return]
Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center,
final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS,
final QA at the library by backup shipment
Scenario B | Remote QA at library
[Figure: Scenario B diagram. Labels recoverable from the slide:
Offshore Processing @ CCS (Storage, IN, OUT, POOL, dW Share, Master),
Internet, QA on-site @ Library (Storage, POOL, dW Share, RQA),
INPUT, OUTPUT (METS ALTO)]
Quality assurance
• @ CCS | Automated markup and basic manual correction:
- headlines, illustrations, tables, captions, advertisements, etc.
- article segmentation and grouping of zones into articles (incl. continuation)
• @ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) into articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
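The recommended metadata check (title, issue date, issue number) lends itself to a simple automated pre-check before manual QA. The sketch below is an illustration only: the `IssueMetadata` class, the field names, the ISO date format and the sample values are all assumptions, and it does not reflect how docWorks or the remote QA client stores or validates metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IssueMetadata:
    title: str
    issue_date: str    # assumed to be an ISO date string, YYYY-MM-DD
    issue_number: str


def metadata_problems(md):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if not md.title.strip():
        problems.append("title is empty")
    try:
        parsed = date.fromisoformat(md.issue_date)
    except ValueError:
        problems.append("issue date is not a valid ISO date")
    else:
        # A newspaper issue cannot have been published in the future.
        if parsed > date.today():
            problems.append("issue date lies in the future")
    if not md.issue_number.strip():
        problems.append("issue number is empty")
    return problems


if __name__ == "__main__":
    # Made-up sample record for demonstration.
    sample = IssueMetadata("Example Gazette", "1913-06-14", "137")
    print(metadata_problems(sample) or "no problems found")
```

A check like this only catches formal errors; whether the title and date are actually correct for the scanned issue still needs a human reviewer.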
Output | METS/ALTO package
• METS/ALTO metadata schemas describe the structured digital output object
• A newspaper issue processed in docWorks is converted into one METS XML
file. It reflects the whole physical and logical structure and manages all links to the
image files and the related ALTO XML files. ALTO is based on a standardized
page description schema and contains all information about a page (print space,
margins, coordinates, OCR results).
• Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and
linguistic technologies
- user engagement for manual online text correction, article classification,
annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
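To make the relationship between the two files concrete, the sketch below reads the file links from a METS document and the word coordinates from an ALTO page using Python's standard library. The two XML fragments are heavily simplified, made-up samples (the `USE` values, IDs and paths are assumptions), not actual docWorks output; real packages contain far more structure, including the physical and logical structMaps.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal METS fragment: one image and one ALTO file.
METS_SAMPLE = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp USE="images">
      <file ID="IMG00001">
        <FLocat LOCTYPE="URL" xlink:href="images/00001.tif"/>
      </file>
    </fileGrp>
    <fileGrp USE="alto">
      <file ID="ALTO00001">
        <FLocat LOCTYPE="URL" xlink:href="alto/00001.xml"/>
      </file>
    </fileGrp>
  </fileSec>
</mets>"""

# Hypothetical, minimal ALTO fragment: one text line with two words.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Newspapers" HPOS="420" VPOS="200" WIDTH="330" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def linked_files(mets_xml):
    """Map each fileGrp's USE attribute to the list of file locations it links."""
    root = ET.fromstring(mets_xml)
    result = {}
    for group in root.iter():
        if local_name(group.tag) != "fileGrp":
            continue
        result[group.get("USE")] = [
            el.get(XLINK_HREF)
            for el in group.iter()
            if local_name(el.tag) == "FLocat"
        ]
    return result


def alto_words(alto_xml):
    """Return (text, horizontal position, vertical position) for each word."""
    root = ET.fromstring(alto_xml)
    return [
        (el.get("CONTENT"), float(el.get("HPOS")), float(el.get("VPOS")))
        for el in root.iter()
        if local_name(el.tag) == "String"
    ]


if __name__ == "__main__":
    print(linked_files(METS_SAMPLE))
    print(alto_words(ALTO_SAMPLE))
```

Matching on local element names rather than a fixed namespace is a deliberate simplification so that the same sketch would tolerate different ALTO schema versions.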
Access and Presentation
• Access through Europeana as well as content provider portals
• Existing newspaper presentation systems at National Library of Australia
(Trove), Library of Congress/NDNP (Chronicling America), Dutch National
Library (DDD), National Library of Luxembourg (eLuxemburgensia), ...
• Veridian demo:
Example of a newspaper presentation system to demonstrate access to
already processed ENP newspaper issues
Questions + answers
Contact
Claus Gravenhorst
Director Strategic Initiatives
CCS Content Conversion Specialists GmbH
Weidestr. 134
22083 Hamburg
Germany
c.gravenhorst@content-conversion.com
www.content-conversion.com

ENP Belgrade WS OLR @ CCS


Editor's Notes

  • #12 DDD = Databank of Digital Daily newspapers