ADOCO:
Facilitating Quality Control in
Mass Digitisation
Georg Petz
LIBRARIES IN THE DIGITAL AGE (LIDA) 2012
18 - 22 June 2012, Zadar, Croatia
georg.petz@onb.ac.at
Austrian Books Online




Austrian Books Online
(Public Private Partnership with Google)

www.onb.ac.at/ev/austrianbooksonline/






Key Data Austrian Books Online (ABO)

• Digitisation of ~600,000 volumes / approx. 200 million pages
• Only public domain material
• Project start
   – Planning and Preparation Phase: July – Dec 2010
   – Operational Project start (Manipulation): Dec 2011
   – Operational Project start (Digitization): March 2011
• ~70 project team members, 20+ in core team
• 7 work packages
• ~65K physical volumes scanned so far






Division of cost and work load
Google                  ONB

•   Transport          •   Provision of metadata
•   Insurance          •   Selection
•   Scanning           •   Internal logistics
•   OCR                •   Conservational assessment
•   Image processing   •   Barcoding
•   Quality control    •   Metadata adjustments
•   Google Books       •   Data download and quality control
                       •   Data storage & digital preservation
                       •   Digital Library





Digitisation → Data Download → Storage in PairTree → Quality Control → Access

ADOCO (Austrian Books Online Download & Control): Data Download, Storage in PairTree, Quality Control
PairTree: https://confluence.ucop.edu/display/Curation/PairTree
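
The PairTree storage step can be illustrated with a minimal sketch of how an identifier is turned into a directory path. It follows my reading of the PairTree draft specification (character cleaning, then splitting into two-character directory names). The function names and the default root name are assumptions for illustration, not ADOCO code.

```python
# Characters that the PairTree draft hex-encodes as ^hh
# (in addition to bytes outside the visible ASCII range).
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied to the remaining characters.
_SWAP = {"/": "=", ":": "+", ".": ","}


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe to use as a sequence of directory names."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(_SWAP.get(ch, ch))
    return "".join(out)


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root, *pairs])


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
```

Short, evenly sized directory names keep any single directory from growing very large, which matters when hundreds of thousands of volumes share one file system.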




Symlink Tree







Download and Quality Assurance – ADOCO

• Method
   –   QA started July 2011
   –   Searching for systematic, not individual errors
   –   Mix of automatic and manual methods
   –   Number of pages is too large to check manually

• Tool: ADOCO
   – Downloading volumes
   – Internal viewer with possibility for error annotations
   – Clustering of errors and suggestions of suspicious files for
     manual audit
   – Reporting module and statistics (currently in MySQL)
   – SCAPE collaboration
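
The idea of looking for systematic rather than individual errors, and of suggesting suspicious files for manual audit, might be sketched as follows. This is not ADOCO's algorithm: the `Annotation` fields, the `batch` grouping attribute, the error labels and the threshold are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    volume_id: str
    batch: str        # hypothetical grouping attribute, e.g. a delivery batch
    error_type: str   # hypothetical label, e.g. "cropped_page"


def systematic_clusters(annotations, min_volumes=3):
    """Group annotations by (batch, error_type) and keep groups that
    affect at least min_volumes distinct volumes."""
    groups = defaultdict(set)
    for a in annotations:
        groups[(a.batch, a.error_type)].add(a.volume_id)
    return {k: sorted(v) for k, v in groups.items() if len(v) >= min_volumes}


def audit_candidates(clusters, volumes_by_batch):
    """Suggest not-yet-annotated volumes from batches that show a cluster."""
    suggestions = set()
    for (batch, _error_type), flagged in clusters.items():
        suggestions.update(set(volumes_by_batch.get(batch, [])) - set(flagged))
    return sorted(suggestions)
```

An error seen once stays below the threshold and is treated as individual. The same error across several volumes of one batch forms a cluster, and the remaining volumes of that batch become candidates for manual audit.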





QC in typical inhouse project vs. ABO

• Inhouse
   – manual quality control
   – rescan


• ABO
   – automatic and manual quality control
   – no rescan but reprocessing








ADOCO Technology Stack



JSF (Primefaces)   |   Jersey RESTful WebService
Spring Framework
Hibernate          |   Wrapped CLI-Tools
Apache Tomcat
MySQL              |   NetApp Filer
Redhat Linux





Book Viewer

• Catalogue / “Quick Search”
• Full text Search
• Book Viewer [Mobile Apps]




Data Access
• JPEG-2000 Master-Files stored redundantly

• Access-Copies generated on the fly

• Digitised Books linked with online catalogue

• URN-Resolver for permanent identification underway
 (OBVSG - Austrian Library Network)

• Searchable and accessible via
    • TEL http://search.theeuropeanlibrary.org/portal/en/index.html
    • Europeana http://www.europeana.eu/portal/







Screencast

SCAPE

• co-funded by the European Union under FP7

• develop scalable services for planning and execution of
  institutional preservation strategies

• SCAPE Preservation Platform makes use of Hadoop








Hadoop

•   framework for the distributed processing of large data sets across clusters
    of computers

•   overcomes limitations of SQL-oriented databases

•   MapReduce paradigm

•   Sequence files:
    possibly compressed,
    containing pairs of writable
    key/values
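
The MapReduce paradigm over key/value records can be sketched in plain Python, without Hadoop. The example counts words over page texts. The page keys and texts are invented, and the `shuffle` function only imitates the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain


def map_page(key, text):
    """Map step: emit (word, 1) for every word on one page."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(records):
    """Run map, shuffle and reduce over an iterable of (key, text) records."""
    mapped = chain.from_iterable(map_page(k, v) for k, v in records)
    return dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
```

Because each page is mapped independently and each word is reduced independently, both steps can be spread across many machines, which is what makes the approach attractive for millions of pages.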




Screencast: Loading Books from PairTree into HDFS

•   fs
    The FileSystem (FS) shell is invoked by bin/hadoop fs <args>.
•   jar
    Runs a jar file. Users can bundle their Map Reduce code in a jar file and
    execute it using this command.

•   load hOCR files into a SequenceFile in HDFS:
    hadoop jar seqfileutility.jar -m -d /home/onbscs/testdata/abo/samples/small -e html -c NONE

•   source code:
    https://github.com/openplanets/scape/tree/master/tb-lsdr-seqfilecreator
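
Conceptually, loading many small files into one key/value container starts by walking a directory and pairing each file name with its content. The sketch below shows only that step in plain Python. It does not reproduce seqfileutility or write Hadoop's SequenceFile format, and treating the `-d` and `-e` arguments above as an input directory and a file extension is my assumption.

```python
from pathlib import Path


def collect_records(directory, extension="html"):
    """Yield (relative path, file bytes) for every file below directory
    that has the given extension; the path serves as the record key."""
    base = Path(directory)
    for path in sorted(base.rglob(f"*.{extension}")):
        if path.is_file():
            yield path.relative_to(base).as_posix(), path.read_bytes()
```

Packing many small files into a few large key/value files avoids the overhead that a very large number of tiny files causes in HDFS.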





        Thank You!




georg.petz@onb.ac.at
www.onb.ac.at/austrianbooksonline
twitter.com/abooksonline

Photographs: Ingrid Oentrich

Editor's Notes

  1. Processing consists of cleaning, cropping, and digitally "flattening" pages, in addition to optical character recognition. Volumes are processed shortly after they are scanned, and reprocessed infrequently after that. Analysis consists of organizing processed pages into a complete volume, including the selection of higher-quality pages in cases where there is more than one candidate page, and putting the pages of a volume in the correct order.
  2. Special software (ADOCO – ABO Download and Control) was implemented and is continuously developed to meet the needs of the quality auditing process. ADOCO enables simultaneous, multithreaded downloads. It is based on Primefaces and Spring Webflow, uses Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...) and stores technical and bibliographic metadata in a MySQL database. It allows for various searches and views on the relevant volumes.
     – Primefaces: Java-based Ajax framework with JSF components ( http://primefaces.org/ ), used for the implementation of the GUI
     – Jersey RESTful WebService: JAX-RS (JSR 311) reference implementation for building RESTful web services, used to communicate with other ONB internal systems (e.g. full-text search) ( http://jersey.java.net/ )
     – Spring: application development framework for enterprise Java
     – Hibernate: Java persistence framework to perform object-relational mapping and query databases using HQL and SQL. ADOCO uses SQL instead of HQL when performance is an issue ( http://www.hibernate.org/ )
     – Wrapped CLI-Tools: Linux command line tools wrapped in Java (wget, tar, gpg, exiftool for image metadata, md5sum, ...)
     – MySQL: database for technical and bibliographic metadata ( http://www.mysql.com/ )
     – NetApp Filer: stores jp2, hocr, mets and txt files in PairTree
     – Redhat Linux: Linux distribution
  3. SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
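
The multithreaded downloads and checksum tools mentioned in note 2 can be illustrated with a small sketch. It is not ADOCO code (ADOCO wraps wget and md5sum in Java). The `fetch` callable, the volume identifiers and the `expected_md5` mapping are hypothetical, and Python's `hashlib` stands in for the md5sum command.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest, as printed by md5sum."""
    return hashlib.md5(data).hexdigest()


def download_and_verify(volume_ids, fetch, expected_md5, max_workers=4):
    """Fetch volumes in parallel threads and report, per volume,
    whether the downloaded bytes match the expected checksum.

    fetch(volume_id) -> bytes is supplied by the caller (hypothetical).
    """
    def task(volume_id):
        data = fetch(volume_id)
        return volume_id, md5_hex(data) == expected_md5[volume_id]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, volume_ids))
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular transfer tool and makes it easy to try out with an in-memory stand-in.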