On the two sides of the pond


Published in: Technology

  1. On the Two Sides of the Pond. By Hans-Jörg Lieder, Head of the Department of Bibliographic Services – Union Catalogue of Serials, Staatsbibliothek zu Berlin - Preußischer Kulturbesitz; and Dr. Katalin Radics, Distinguished Librarian, Librarian of the West European Collections and Classics, Young Research Library, University of California, Los Angeles
  3. Partnership between the UCLA Library and Staatsbibliothek zu Berlin
  4. Newspapers on the way to discoloring and disintegration. Storage facility of the University of California Libraries on the UCLA campus
  5. Leaflets
     - 13” x 18.5” or 33 cm x 47 cm
     - Imprint indicating the title, date, the number of the issue; warning
     - Published four or five times a day
  6. UCLA stamps including receiving dates
     - Packed in wrapping paper probably after 1940, packages of 700-800 sheets
     - No documentation (ordering or receiving records) in the library archives; no correspondence
     - Normal serial subscription scheme (?)
     - Very minimal cataloging record – very low use
  7. Towards a Weeding Decision
     - Brittle condition
     - Check for other holdings in California, US and world libraries
     - OCLC – no other holdings at the time of checking
     - Nine 1938 issues at BNF
     - No holding at the German National Library (Deutsche Nationalbibliothek)
     - Contact with head of Zeitungsabteilung, Staatsbibliothek – no holding in Germany
     - UNIQUE!!!
     - Decision: keep and preserve the UCLA holdings.
  8. Keep and Preserve
     - 9600 pages, 1936-1940 with gaps
     - Acid-free boxes
     - The most fragile pages in mylar
  9. Digitization Project
     - Funding for digitization
     - Highest quality resolution: 600 dpi RGB
     - Add minimal metadata
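To give a feel for what "600 dpi RGB" means for leaflets of the size given on slide 5, the following is a rough back-of-the-envelope sketch. It assumes 8 bits per channel, uncompressed storage, and that all 9600 pages share the 13” x 18.5” format; none of these assumptions is stated in the slides, and real master files would differ with bit depth, compression and scan margins.

```python
# Rough estimate of scan dimensions and raw storage.
# Assumptions (not from the slides): 8 bits per channel, no compression,
# every page has the leaflet format quoted on slide 5.
WIDTH_IN, HEIGHT_IN = 13.0, 18.5  # leaflet size in inches
DPI = 600                         # scanning resolution
BYTES_PER_PIXEL = 3               # RGB at 8 bits per channel (assumed)
PAGES = 9600                      # page count quoted on slide 8


def scan_size(width_in, height_in, dpi, bytes_per_pixel=BYTES_PER_PIXEL):
    """Return (width_px, height_px, uncompressed_bytes) for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


width_px, height_px, page_bytes = scan_size(WIDTH_IN, HEIGHT_IN, DPI)
total_bytes = page_bytes * PAGES

print(f"{width_px} x {height_px} px per page")
print(f"{page_bytes / 1e6:.1f} MB per page, uncompressed")
print(f"{total_bytes / 1e12:.2f} TB for {PAGES} pages, uncompressed")
```

Under these assumptions a single page is 7800 x 11100 pixels, roughly 260 MB uncompressed, and the whole collection is on the order of 2.5 TB.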
  10. Title: Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 1581, 1938 October 1, Erste Morgen-Ausgabe
      Alt ID: 3813183_1938-10-01_1581 [Local]
      AltTitle: Erste Morgen-Ausgabe [Descriptive]; Deutsches Nachrichtenbüro [Descriptive]
      Date: October 1, 1938 [Publication]; 1938-10-01 [Normalized]
      Format: 1 p. [Extent]
      Language: ger
      Name: University of California, Los Angeles. Library. Dept. of Special Collections [Repository]
      Type: newspapers [Genre]; text [Type Of Resource]
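Since the collection is searchable only by date, the Alt ID in the record above is a natural key to work with. The sketch below splits it into its parts. The interpretation of the three fields (a local record number, the normalized date, and the issue number) is an assumption inferred from this single record, where the last two parts match the Date and "Nr. 1581" values; the names `IssueId` and `parse_alt_id` are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumed layout: <record number>_<YYYY-MM-DD>_<issue number>
ALT_ID_PATTERN = re.compile(
    r"^(?P<record>\d+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<issue>\d+)$"
)


@dataclass(frozen=True)
class IssueId:
    record_id: str      # local record number (assumed meaning)
    issue_date: date    # normalized publication date
    issue_number: int   # running issue number ("Nr.")


def parse_alt_id(alt_id: str) -> IssueId:
    """Split an Alt ID such as '3813183_1938-10-01_1581' into its parts."""
    match = ALT_ID_PATTERN.match(alt_id)
    if match is None:
        raise ValueError(f"unexpected Alt ID format: {alt_id!r}")
    return IssueId(
        record_id=match["record"],
        issue_date=date.fromisoformat(match["date"]),
        issue_number=int(match["issue"]),
    )


issue = parse_alt_id("3813183_1938-10-01_1581")
print(issue.issue_date.isoformat(), issue.issue_number)
```

Parsing the date into a `date` object rather than keeping the string makes range queries (for example, all issues from one week) straightforward.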
  11. Digitized copies: part of UCLA Digital Library at http://digital2.library.ucla.edu/ (freely accessible)
      - Searchable only by date
      - More sophisticated searching capability needed – day by day chronicle of the Third Reich for a short period of time: events, names, institutions etc.
      - Deutsches Nachrichten Büro – December 5, 1933; network of 36 local services (Landesdienste)
  12. Indexing needed
      - Fraktur – major problem
      - Transliteration into Latin characters
      - OCR (Optical Character Recognition) – has to be made in Germany
      - Looking for a German partner
  13. Not a problem … here we are!
  14. … but who are “we”?
      • Project: Europeana Newspapers: http://www.europeana-newspapers.eu/
      • 18 partners from 12 countries
      • Tasks:
        • Provide OCR for 18 million pages
        • Provide OLR for 2 million pages
        • Provide NER experimentally in assorted languages
        • Provide best practice recommendations for newspaper metadata
        • Provide quality prediction tools
        • Aggregate content and make it available to TEL and Europeana
      OCR = Optical Character Recognition; OLR = Optical Layout Recognition; NER = Named Entities Recognition
  15. A Dance of Acronyms: UCLA, SBB and CCS
      UCLA sent data on hard drive
      SBB
        • Checked data for correctness and moved images into directory structure
        • Sent data to CCS in Hamburg for OCR and OLR
      CCS (Content Conversion Specialists)
        • Created full texts per article
        • Stuck data in NZ web service for preliminary presentation purposes
      SBB
        • Will perform QA of OCR and OLR results
        • Will provide all data to UCLA for further use
        • Will present data in ZEFYS, its own newspaper portal; to the Deutsche Digitale Bibliothek; to TEL (The European Library) and to Europeana
  16. Layout and structure analysis
      • Recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types:
        - title page (the title page of an issue)
        - content page (a page that consists of content/text only)
        - illustration page (a page that has at least one illustration)
        - advertisement page (a page that contains adverts only)
      • Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
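The four page types above can be read as simple rules over the classified blocks on a page. The following is only an illustrative sketch of such rules, not the method used by the conversion software. The block labels, the function name `classify_page`, the idea that the title page is the first page of an issue, and the order in which the rules are checked are all assumptions.

```python
from typing import Iterable


def classify_page(block_types: Iterable[str], is_first_page: bool = False) -> str:
    """Assign one of the four page types from the block labels on a page.

    block_types: labels such as "text", "illustration", "advertisement",
    "table" (hypothetical label names).
    """
    kinds = set(block_types)
    if is_first_page:
        # Assumption: the title page is the first page of an issue.
        return "title page"
    if kinds and kinds <= {"advertisement"}:
        # A page that contains adverts only.
        return "advertisement page"
    if "illustration" in kinds:
        # A page that has at least one illustration.
        return "illustration page"
    # Everything else is treated as a content/text page in this sketch,
    # including a page with no recognized blocks.
    return "content page"


print(classify_page(["text", "text", "illustration"]))
print(classify_page(["advertisement", "advertisement"]))
```

The rule order matters: a page with both adverts and an illustration is labeled an illustration page here, which is one possible reading of the definitions on the slide.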
  17. ENP OLR workflow | Conversion without scanning
      [Workflow diagram. Labels: Material location; Conversion facility; Digital Image; Metadata; Delivery; Inspection; Conversion; MD Recording; Automatic QA; Reject; Doc Delivery; Digital Object; Return]
  18. Quality assurance
      • @ CCS | Automated markup and basic manual correction:
        - headlines, illustrations, tables, captions, advertisements, etc.
        - article segmentation and grouping of zones into articles (incl. continuation)
      • @ Content Provider (Library)
        Recommended:
        - Zoning: correct classification of blocks as “text” or “illustration”
        - Article segmentation: correct identification of headlines/text blocks/captions
        - Grouping: correct grouping of blocks (text, illustration) to articles
        - Metadata: correct title, issue date and issue number
        Optional:
        - Page types: correct page types
        - Page numbers: correct page sequence
        - OCR: perform text correction of specific zones (e.g. headlines, captions)
  19. Output | METS/ALTO package
      • METS/ALTO metadata schemas describe the structured digital output object
      • A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure and manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information about a page (print space, margins, coordinates, OCR results).
      • Benefits of structural markup:
        - better browsing and more precise text search
        - better access and display on tablet and mobile devices
        - automated article classification and clustering through data/text mining and linguistic technologies
        - user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
        - sharing articles via social media platforms like Facebook, Twitter, etc.
      METS = Metadata Encoding and Transmission Standard; ALTO = Analyzed Layout and Text Object
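To illustrate how the OCR results stored in an ALTO file can be read back, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document is hand-written for this example (its coordinates are invented), it uses the ALTO version 2 namespace as an assumption about the delivered files, and the helper names `local_name` and `text_lines` are hypothetical. Real packages would also require following the file links in the METS file, which is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample, not taken from the project data. Coordinates are made up.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Deutsches" HPOS="100" VPOS="50" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Nachrichtenbüro" HPOS="420" VPOS="50" WIDTH="520" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Erste" HPOS="100" VPOS="110" WIDTH="150" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Morgen-Ausgabe" HPOS="270" VPOS="110" WIDTH="480" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml: str) -> List[str]:
    """Return the recognized text of each TextLine, words joined by spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


for line in text_lines(SAMPLE_ALTO):
    print(line)
```

Matching on local element names instead of a fixed namespace is a deliberate choice in this sketch, since the namespace URI differs between ALTO schema versions.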