
LIDA 2012: ADOCO


ADOCO: Facilitating Quality Control in Mass Digitisation

  • Processing consists of cleaning, cropping, and digitally "flattening" pages, in addition to optical character recognition. Volumes are processed shortly after they are scanned, and reprocessed infrequently after that. Analysis consists of organizing processed pages into a complete volume, including the selection of higher-quality pages in cases where there is more than one candidate page, and putting the pages of a volume in the correct order.
  • Special software (ADOCO – ABO Download and Control) was implemented and is continuously developed to meet the needs of the quality auditing process. ADOCO enables simultaneous, multithreaded downloads. It is based on Primefaces and Spring Webflow, uses Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...), and uses a MySQL database for technical and bibliographic metadata. It allows for various searches and views on the relevant volumes. Components of the stack:
      – Primefaces: Java-based Ajax framework with JSF components ( http://primefaces.org/ ), used for the implementation of the GUI
      – Jersey RESTful WebService: JAX-RS (JSR 311) Reference Implementation for building RESTful Web services, used to communicate with other ONB internal systems, e.g. fulltext search ( http://jersey.java.net/ )
      – Spring: application development framework for enterprise Java™
      – Hibernate: Java persistence framework to perform object-relational mapping and query databases using HQL and SQL. ADOCO uses SQL instead of HQL when performance is an issue ( http://www.hibernate.org/ )
      – Wrapped CLI-Tools: Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...)
      – MySQL: database for technical and bibliographic metadata ( http://www.mysql.com/ )
      – NetApp Filer: stores jp2, hocr, mets and txt files in PairTree
      – Redhat Linux: Linux distribution
  • SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
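To make the key/value idea in the SequenceFile note more concrete, the following is a minimal plain-Python sketch of the MapReduce paradigm over such pairs. It does not use the Hadoop API. The record layout (key = page file path, value = hOCR markup), the sample records and the function names `map_phase`, `reduce_phase` and `run_job` are all assumptions made for illustration, not part of ADOCO or SCAPE.

```python
import re
from collections import defaultdict


def map_phase(key, value):
    """Map step: emit (volume_id, word_count) for one page record.

    Assumption: the key looks like "<volume>/<page>.html" and the value
    is the hOCR/HTML markup of that page.
    """
    volume_id = key.split("/")[0]
    text = re.sub(r"<[^>]+>", " ", value)  # crude tag stripping, illustration only
    yield volume_id, len(text.split())


def reduce_phase(key, values):
    """Reduce step: sum the per-page word counts of one volume."""
    return key, sum(values)


def run_job(records):
    """Run map, group the intermediate pairs by key, then reduce, all in memory."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_phase(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


# Hypothetical key/value records standing in for the contents of a SequenceFile.
records = [
    ("vol1/0001.html", "<p>Austrian Books Online</p>"),
    ("vol1/0002.html", "<p>quality control</p>"),
    ("vol2/0001.html", "<p>mass digitisation</p>"),
]

if __name__ == "__main__":
    print(run_job(records))
```

On a real cluster the grouping step would be done by the framework across many machines; the sketch only shows how the map and reduce functions relate to the key/value records.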

    1. ADOCO: Facilitating Quality Control in Mass Digitisation. Georg Petz. LIBRARIES IN THE DIGITAL AGE (LIDA) 2012, 18 - 22 June 2012, Zadar, Croatia. georg.petz@onb.ac.at
    2. Austrian Books Online (Public Private Partnership with Google). www.onb.ac.at/ev/austrianbooksonline/
    3. Key Data Austrian Books Online (ABO)
        • Digitization ~ 600.000 Volumes / ca. 200 Mio. pages
        • Only public domain material
        • Project start
            – Planning and Preparation Phase: July – Dec 2010
            – Operational Project start (Manipulation): Dec 2011
            – Operational Project start (Digitization): March 2011
        • ~70 project team members, 20+ in core team
        • 7 work packages
        • ~65K physical volumes scanned so far
    4. Division of cost and work load
        • Google
            – Transport
            – Insurance
            – Scanning
            – OCR
            – Image processing
            – Quality control
            – Google Books
        • ONB
            – Provision of Metadata
            – Selection
            – Internal logistics
            – Conservational assessment
            – Barcoding
            – Metadata adjustments
            – Data download and Quality control
            – Data storage & digital preservation
            – Digital Library
    5. ADOCO (Austrian Books Online Download & Control)
        • Digitisation
        • Data Download
        • Storage in Pair Tree ( https://confluence.ucop.edu/display/Curation/PairTree )
        • Quality Control
        • Access
    6. Symlink Tree
    7. Download and Quality Assurance – ADOCO
        • Method
            – QA started July 2011
            – Searching for systematic, not individual errors
            – Mix of automatic and manual methods
            – The amount of pages is impossible to check manually
        • Tool: ADOCO
            – Downloading volumes
            – Internal viewer with possibility for error annotations
            – Clustering of errors and suggestions of suspicious files for manual audit
            – Reporting module and statistics (currently in MySQL)
            – SCAPE collaboration
    8. QC in typical inhouse project vs. ABO
        • Inhouse
            – manual quality control
            – rescan
        • ABO
            – automatic and manual quality control
            – no rescan but reprocessing
    9. ADOCO Technology Stack
        • JSF (Primefaces)
        • Jersey RESTful WebService
        • Spring Framework
        • Hibernate
        • Wrapped CLI-Tools
        • Apache Tomcat
        • MySQL
        • NetApp Filer
        • Redhat Linux
    10. Austrian Books Online: Catalogue / “Quick Search”, Book Viewer, [Mobile Apps], Full text Search
    11. Data Access
        • JPEG-2000 Master-Files stored redundantly
        • Access-Copies generated on the fly
        • Digitised Books linked with online catalogue
        • URN-Resolver for permanent identification underway (OBVSG - Austrian Library Network)
        • Searchable and accessible via
            – TEL http://search.theeuropeanlibrary.org/portal/en/index.html
            – Europeana http://www.europeana.eu/portal/
    12. Austrian Books Online: Screencast
    13. Austrian Books Online
    14. Austrian Books Online
        • co-funded by the European Union under FP7
        • develop scalable services for planning and execution of institutional preservation strategies
        • SCAPE Preservation Platform makes use of Hadoop
    15. Austrian Books Online
        • framework for the distributed processing of large data sets across clusters of computers
        • overcome limitations of SQL-oriented databases
        • MapReduce paradigm
        • Sequence files: possibly compressed, containing pairs of writable key/values
    16. Screencast: Loading Books from PairTree into HDFS
        • fs: The FileSystem (FS) shell is invoked by bin/hadoop fs <args>.
        • jar: Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
        • load hocr files into SequenceFile in HDFS: hadoop jar seqfileutility.jar -m -d /home/onbscs/testdata/abo/samples/small -e html -c NONE
        • source code: https://github.com/openplanets/scape/tree/master/tb-lsdr-seqfilecreator
    17. Thank You! georg.petz@onb.ac.at, www.onb.ac.at/austrianbooksonline, twitter.com/abooksonline. Photographs: Ingrid Oentrich
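The slides mention storage in PairTree without showing how an identifier becomes a directory path. The sketch below follows the identifier-to-path mapping as described in the Pairtree specification (character cleaning, then splitting into two-character directory names). It is an illustration only: the function names `clean_id` and `pairtree_path` are hypothetical, the sample identifier is the one used in the Pairtree specification rather than an ABO identifier, and nothing here is taken from the ADOCO code base.

```python
# Characters that the Pairtree specification hex-encodes as ^hh,
# in addition to anything outside visible ASCII (0x21-0x7e).
HEX_ENCODED = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
SUBSTITUTIONS = str.maketrans("/:.", "=+,")


def clean_id(identifier):
    """Return the filesystem-safe form of an identifier."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODED:
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    return "".join(out).translate(SUBSTITUTIONS)


def pairtree_path(identifier, root="pairtree_root"):
    """Map an identifier to its pairtree directory path.

    The cleaned identifier is cut into two-character segments; a trailing
    single character becomes its own one-character directory.
    """
    cleaned = clean_id(identifier)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + segments)


if __name__ == "__main__":
    # Example identifier taken from the Pairtree specification.
    print(pairtree_path("ark:/13030/xt12t3"))
    # pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
```

Because the path is derived purely from the identifier, an object's files (jp2, hocr, mets, txt) can be located without a database lookup, which is one common reason for choosing this layout.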
