ENP Belgrade WS OLR @ CCS

  • DDD = Databank of Digital Daily newspapers

Transcript

  • 1. June 14, 2013 | Page 1 | Content Conversion Specialists | WS Refinement and Quality Assessment | Claus Gravenhorst, Director Strategic Initiatives, CCS Content Conversion Specialists
    Europeana Newspapers Workshop "Refinement and Quality Assessment", Belgrade, 14.6.2013
    OLR at CCS: From unstructured to structured newspaper data and the role of content providers in the overall process
  • 2. Agenda
    - About CCS
    - General workflow for mass digitization of newspapers
    - OLR: layout and structure analysis
    - ENP OLR workflow (involvement of CPs)
    - Quality assurance
    - Output: METS/ALTO package
    - Demo of first results
  • 3. About CCS
    - CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitisation workflow that creates high-quality structured content from 2 million scanned newspaper pages provided by 5 library partners
    - Page volume: BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
    - The distributed OLR workflow enables project partners (content providers) to contribute to the integrated quality assurance process
    - CCS will also contribute to the specification of the metadata model
  • 4. General workflow for mass digitization
    [Workflow diagram. Labels: Check in, Scanning, Scanner (Robot-, Book-, Document-, Microfilm-), Re-Scan, Imaging, Layout Analysis, OCR, ISR, Conversion, QA + Correction, Automated QA, Manual QA (in-house, near-shore, off-shore, multiple locations), Reject, Condition, random QA, Delivery, Final Output, Check out, Image/Metadata Database, Repository, Document UID, Barcode, Item Tracking, Z 39.50 Metadata]
  • 5. Layout and structure analysis
    - Layout analysis based on a "bottom-up" approach
    - A general rule system enables recognition of words, text lines, text blocks and columns, and classification of text blocks, illustrations, advertisements, tables and the following page types:
      - title page (the title page of an issue)
      - content page (a page that consists of content/text only)
      - illustration page (a page that has at least one illustration)
      - advertisement page (a page that contains adverts only)
    - Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
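The four page types on slide 5 can be read as simple rules over the block types found on a page. The following is a minimal sketch of that reading, not the docWorks rule system: the function name, the block-type strings, the `is_first_page_of_issue` flag, the order in which the rules are tried, and the "unclassified" fallback are all assumptions made for illustration.

```python
from typing import List


def classify_page(block_types: List[str], is_first_page_of_issue: bool = False) -> str:
    """Assign one of the slide's four page types (hypothetical rule order)."""
    # Assumption: the title page is identified by its position in the issue.
    if is_first_page_of_issue:
        return "title page"
    # Assumption: a page with no recognised blocks gets no page type.
    if not block_types:
        return "unclassified"
    # "a page that contains adverts only"
    if all(b == "advertisement" for b in block_types):
        return "advertisement page"
    # "a page that has at least one illustration"
    if any(b == "illustration" for b in block_types):
        return "illustration page"
    # Everything else is treated as content/text (an interpretation of the slide).
    return "content page"


if __name__ == "__main__":
    print(classify_page(["headline", "text", "illustration"]))
    print(classify_page(["advertisement", "advertisement"]))
```

The slide does not say which rule wins when several apply (for example, an advert-only page that also is the first page of an issue), so the precedence above is a design choice rather than a documented behaviour.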
  • 6. ENP OLR workflow | Conversion without scanning
    [Workflow diagram between material location and conversion facility. Labels: Digital Image / Metadata Delivery, Doc Delivery, Inspection / Automatic QA, Reject, Conversion, MD Recording, Digital Object Return]
  • 7. Possible conversion scenarios
    A) Conversion at library (on-site)
    B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)
    C) Conversion off-shore at CCS, final QA at the library by backup shipment
  • 8. Scenario B | Remote QA at library
    [Diagram of two sites connected via the Internet. Labels at CCS: Offshore Processing @ CCS, Master, Storage, IN/OUT, POOL, dW Share, OUTPUT METS/ALTO. Labels at the library: QA on-site @ Library, RQA, Storage, POOL, dW Share, INPUT]
  • 9. Quality assurance
    - @ CCS | Automated markup and basic manual correction:
      - headlines, illustrations, tables, captions, advertisements, etc.
      - article segmentation and grouping of zones into articles (incl. continuation)
    - @ Content Provider (Library)
      Recommended:
      - Zoning: correct classification of blocks as "text" or "illustration"
      - Article segmentation: correct identification of headlines/text blocks/captions
      - Grouping: correct grouping of blocks (text, illustration) into articles
      - Metadata: correct title, issue date and issue number
      Optional:
      - Page types: correct page types
      - Page numbers: correct page sequence
      - OCR: perform text correction of specific zones (e.g. headlines, captions)
  • 10. Output | METS/ALTO package
    - METS/ALTO metadata schemas describe the structured digital output object
    - A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure and manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information about a page (print space, margins, coordinates, OCR results).
    - Benefits of structural markup:
      - better browsing and more precise text search
      - better access and display on tablet and mobile devices
      - automated article classification and clustering through data/text mining and linguistic technologies
      - user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
      - sharing articles via social media platforms like Facebook, Twitter, etc.
    METS = Metadata Encoding and Transmission Standard
    ALTO = Analyzed Layout and Text Object
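To illustrate what "coordinates, OCR results" in an ALTO file look like to a consumer, here is a small sketch that reads word text and positions from an ALTO page using only the Python standard library. The element and attribute names (`String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) come from the ALTO schema; the sample document, the function names, and the choice to ignore XML namespaces are assumptions for illustration and do not represent actual docWorks output.

```python
import xml.etree.ElementTree as ET

# Invented sample page; real ALTO files carry many more elements and attributes.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Newspapers" HPOS="420" VPOS="200" WIDTH="330" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml: str) -> list:
    """Return one dict per ALTO String element with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "hpos": float(element.get("HPOS", "0")),
                "vpos": float(element.get("VPOS", "0")),
                "width": float(element.get("WIDTH", "0")),
                "height": float(element.get("HEIGHT", "0")),
            })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word)
```

Word-level coordinates like these are what make features such as search-hit highlighting on the page image possible in a presentation system.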
  • 11. Access and Presentation
    - Access through Europeana as well as content provider portals
    - Existing newspaper presentation systems at National Library of Australia (Trove), Library of Congress/NDNP (Chronicling America), Dutch National Library (DDD), National Library of Luxembourg (eLuxemburgensia), ...
    - Veridian demo: example of a newspaper presentation system to demonstrate access to already processed ENP newspaper issues
  • 12. Questions + answers
  • 13. Contact
    Claus Gravenhorst
    Director Strategic Initiatives
    CCS Content Conversion Specialists GmbH
    Weidestr. 134
    22083 Hamburg
    Germany
    c.gravenhorst@content-conversion.com
    www.content-conversion.com