LIDA 2012: ADOCO
1. ADOCO:
Facilitating Quality Control in
Mass Digitisation
Georg Petz
LIBRARIES IN THE DIGITAL AGE (LIDA) 2012
18 - 22 June 2012, Zadar, Croatia
georg.petz@onb.ac.at
2. Austrian Books Online
Austrian Books Online
(Public Private Partnership with Google)
www.onb.ac.at/ev/austrianbooksonline/
3. Austrian Books Online
Key Data Austrian Books Online (ABO)
• Digitization of ~600,000 volumes / approx. 200 million pages
• Only public domain material
• Project start
– Planning and Preparation Phase: July – Dec 2010
– Operational Project start (Manipulation): Dec 2011
– Operational Project start (Digitization): March 2011
• ~70 project team members, 20+ in core team
• 7 work packages
• ~65K physical volumes scanned so far
4. Austrian Books Online
Division of cost and work load
• Google
– Transport
– Insurance
– Scanning
– OCR
– Image processing
– Quality control
– Google Books
• ONB
– Provision of metadata
– Selection
– Internal logistics
– Conservational assessment
– Barcoding
– Metadata adjustments
– Data download and quality control
– Data storage & digital preservation
– Digital Library
5. Austrian Books Online
Workflow diagram, stages in reading order:
• Digitisation
• Data Download
• Storage in PairTree (https://confluence.ucop.edu/display/Curation/PairTree)
• Quality Control
• Access
Diagram label: ADOCO (Austrian Books Online Download & Control)
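To make the PairTree storage step concrete, here is a minimal sketch of how the PairTree convention maps an identifier to a directory path. It follows the published character-cleaning rules (hex-encode a few special characters, then map `/`, `:` and `.` to `=`, `+` and `,`) and splits the result into two-character segments. The function names and the `pairtree_root` default are illustrative assumptions, and the identifier is the generic example from the PairTree specification, not an ABO identifier.

```python
def pairtree_clean(identifier: str) -> str:
    """Apply the two-step PairTree character cleaning to an identifier."""
    # Step 1: hex-encode non-visible ASCII and a small set of special characters.
    hex_encode = set('"*+,<=>?\\^|')
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in hex_encode:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(char)
    # Step 2: single-character substitutions for common identifier punctuation.
    return "".join(cleaned).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Map an identifier to its pair path (root name is an assumption)."""
    cleaned = pairtree_clean(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + pairs)


print(pairtree_path("ark:/13030/xt12t3"))
```

Short two-character directory names keep the fan-out of each directory bounded, which is one reason the convention suits collections with very many objects.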
7. Austrian Books Online
Download and Quality Assurance – ADOCO
• Method
– QA started July 2011
– Searching for systematic, not individual errors
– Mix of automatic and manual methods
– Checking every page manually is impossible at this volume
• Tool: ADOCO
– Downloading volumes
– Internal viewer with possibility for error annotations
– Clustering of errors and suggestion of suspicious files for manual audit
– Reporting module and statistics (currently in MySQL)
– SCAPE collaboration
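The idea of looking for systematic rather than individual errors can be sketched as follows. This is not ADOCO's implementation: the annotation fields (volume id, batch id, error type), the grouping by batch and the threshold are all hypothetical and serve only to illustrate how clustered annotations might be used to suggest further volumes for manual audit.

```python
from collections import Counter

# Hypothetical annotation shape: (volume_id, batch_id, error_type).
# These fields are assumptions for illustration, not ADOCO's data model.


def cluster_errors(annotations):
    """Count annotated errors per (batch, error type) cluster."""
    clusters = Counter()
    for _volume_id, batch_id, error_type in annotations:
        clusters[(batch_id, error_type)] += 1
    return clusters


def suspicious_volumes(annotations, all_volumes, min_cluster_size=2):
    """Suggest unaudited volumes from batches whose errors look systematic."""
    clusters = cluster_errors(annotations)
    # A batch counts as suspicious once one error type recurs often enough.
    bad_batches = {batch for (batch, _), n in clusters.items() if n >= min_cluster_size}
    audited = {volume_id for volume_id, _, _ in annotations}
    return sorted(
        volume for volume, batch in all_volumes.items()
        if batch in bad_batches and volume not in audited
    )
```

With two "cropped" annotations in one batch, the remaining unaudited volumes of that batch would be proposed for audit, while a single isolated error elsewhere would not trigger any suggestion.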
8. Austrian Books Online
QC in typical inhouse project vs. ABO
• Inhouse
– manual quality control
– rescan
• ABO
– automatic and manual quality control
– no rescan but reprocessing
9. Austrian Books Online
ADOCO Technology Stack
• JSF (Primefaces)
• Jersey RESTful WebService
• Spring Framework
• Hibernate
• Wrapped CLI tools
• Apache Tomcat
• MySQL
• NetApp Filer
• Red Hat Linux
10. Austrian Books Online
Book Viewer
Diagram labels:
• Catalogue / “Quick Search”
• Full text Search
• [Mobile Apps]
11. Austrian Books Online
Data Access
• JPEG-2000 Master-Files stored redundantly
• Access-Copies generated on the fly
• Digitised Books linked with online catalogue
• URN-Resolver for permanent identification underway (OBVSG - Austrian Library Network)
• Searchable and accessible via
– TEL http://search.theeuropeanlibrary.org/portal/en/index.html
– Europeana http://www.europeana.eu/portal/
12. Austrian Books Online
Screencast
13. Austrian Books Online
14. Austrian Books Online
SCAPE
• co-funded by the European Union under FP7
• develops scalable services for planning and execution of institutional preservation strategies
• SCAPE Preservation Platform makes use of Hadoop
15. Austrian Books Online
Hadoop
• framework for the distributed processing of large data sets across clusters of computers
• overcomes limitations of SQL-oriented databases
• MapReduce paradigm
• Sequence files: possibly compressed, containing pairs of writable key/values
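The MapReduce paradigm named above can be illustrated with a small in-memory sketch. It does not use Hadoop at all: the three phases (map, shuffle, reduce) are plain Python functions, and the word-count job over page texts is an assumed example, not a job from the ABO or SCAPE workflows.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record and emit its pairs."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(page_id, text):
    # Mapper: emit (word, 1) for every word on a page.
    for word in text.lower().split():
        yield word, 1


def total(word, counts):
    # Reducer: sum the partial counts for one word.
    return sum(counts)


def word_count(records):
    return reduce_phase(shuffle(map_phase(records, count_words)), total)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data between them; the sketch only shows the data flow.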
16. Austrian Books Online
Screencast: Loading Books from PairTree into HDFS
• fs
The FileSystem (FS) shell is invoked by bin/hadoop fs <args>.
• jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and
execute it using this command.
• load hocr files into SequenceFile in HDFS:
hadoop jar seqfileutility.jar -m -d /home/onbscs/testdata/abo/samples/small -e html -c NONE
• source code:
https://github.com/openplanets/scape/tree/master/tb-lsdr-seqfilecreator
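Conceptually, the loading step turns many small files into a stream of key/value pairs. The sketch below shows only that idea, under the assumption that the key is the file's relative path and the value its raw bytes. It does not write Hadoop's binary SequenceFile format and is not the seqfileutility code; the function name and parameters are hypothetical.

```python
from pathlib import Path


def collect_pairs(directory, extension):
    """Yield (relative path, file bytes) for every file with the given extension.

    Illustrative only: a real SequenceFile is a Hadoop-specific binary
    container, which this function does not produce.
    """
    root = Path(directory)
    for path in sorted(root.rglob(f"*.{extension}")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_bytes()
```

Packing many small files into few large key/value containers matters because HDFS and MapReduce handle a small number of large files far better than millions of tiny ones.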
17. Austrian Books Online
Thank You!
georg.petz@onb.ac.at
www.onb.ac.at/austrianbooksonline
twitter.com/abooksonline
Photographs: Ingrid Oentrich
Editor's Notes
Processing consists of cleaning, cropping, and digitally "flattening" pages, in addition to optical character recognition. Volumes are processed shortly after they are scanned, and reprocessed infrequently after that. Analysis consists of organizing processed pages into a complete volume, including selecting the higher-quality page where there is more than one candidate, and putting the pages of a volume in the correct order.
Special software (ADOCO – ABO Download and Control) was implemented and is continuously developed to meet the needs of the quality auditing process. ADOCO enables simultaneous, multithreaded downloads. It is based on Primefaces and Spring Webflow, wraps Linux command line tools in Java, and uses a MySQL database for technical and bibliographic metadata. It allows for various searches and views on the relevant volumes.
Components:
• Primefaces: Java-based Ajax framework with JSF components, used for the implementation of the GUI ( http://primefaces.org/ )
• Jersey RESTful WebService: JAX-RS (JSR 311) reference implementation for building RESTful web services, used to communicate with other ONB internal systems, e.g. full-text search ( http://jersey.java.net/ )
• Spring: application development framework for enterprise Java
• Hibernate: Java persistence framework that performs object-relational mapping and queries databases using HQL and SQL. ADOCO uses SQL instead of HQL when performance is an issue ( http://www.hibernate.org/ )
• Wrapped CLI tools: Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...)
• MySQL: database for technical and bibliographic metadata ( http://www.mysql.com/ )
• NetApp Filer: stores jp2, hocr, mets and txt files in PairTree
• Red Hat Linux: Linux distribution
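The combination of parallel work and checksum tools mentioned in these notes can be sketched in a few lines. ADOCO itself wraps md5sum and other tools in Java; the Python below is only an assumed illustration of the same idea, verifying MD5 checksums of already downloaded files with a thread pool. The function names and the shape of the `expected` mapping are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_all(expected, max_workers=4):
    """Check files against expected digests in parallel.

    expected: {path: md5 hex digest}; returns {path: True or False}.
    """
    paths = list(expected)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so digests line up with paths.
        digests = pool.map(md5_of, paths)
        return {path: digest == expected[path] for path, digest in zip(paths, digests)}
```

Threads are a reasonable fit here because the work is dominated by file and network I/O rather than computation.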
SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format.