An overview of the different application scenarios at the Austrian National Library related to Web Archiving and the Austrian Books Online project.
SCAPE Information Day at BL - Large Scale Processing with Hadoop (SCAPE Project)
This document discusses using Hadoop for large scale processing. It provides an overview of Hadoop and MapReduce frameworks and how they allow distributing processing across many nodes to efficiently process large amounts of data in parallel. It also gives examples of how Hadoop has been used at the British Library for digital preservation tasks like format migration and analysis.
Hadoop and its applications at the State and University Library, SCAPE Inform... (SCAPE Project)
Per Møldrup-Dalum introduced how the State and University Library in Denmark has deployed Hadoop in connection with the SCAPE project. With Hadoop the library has been able to process large amounts of data much faster than before.
The presentation was given at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. For more information about the demo day, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D... (SCAPE Project)
The State and University Library, Denmark, hosted an information and demonstration day on 25 June 2014 for delegates from other large cultural heritage institutions in Denmark. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
One of the presentations was given by Asger Askov Blekinge, who showed how the library has worked on integrating its digital object management system with Hadoop. The library is currently digitizing 32 million newspaper pages and is using Hadoop map/reduce jobs to do quality assurance on the digitized files, with the help of the SCAPE Stager/Loader, so that updated, QA’ed files are stored in the repository.
Hybrid Cloud for CERN
Experience with Open Telecom Cloud and OpenStack
1) CERN uses a hybrid cloud approach combining on-premise resources with public clouds like the Open Telecom Cloud to address its increasing computing needs for projects like the LHC.
2) A pilot project using the Open Telecom Cloud was successful but identified some issues around networking and storage.
3) CERN is now participating in the HNSciCloud project to jointly procure innovative hybrid cloud services that fully integrate commercial clouds with in-house and European e-infrastructure resources.
The document provides an overview of the ResourceSync framework, which aims to enable synchronization of web resources between source and destination servers. It describes the core capabilities that a source server can provide, including describing content through resource lists, packaging content in dumps, describing changes through change lists, and packaging changes in dumps. It also outlines key processes for destinations, such as baseline and incremental synchronization. The agenda covers motivation/use cases, framework walkthrough, technical details, and implementation. ResourceSync is designed as a modular framework based on sitemaps to describe resources and changes.
Jachym Cepicky gave a status report on PyWPS. PyWPS is an implementation of the OGC WPS standard written in Python. Version 4 is being rewritten to take advantage of improvements in Python and geospatial libraries since version 1 was created in 2006. Version 4.0 includes validators, a server based on Werkzeug, an IOHandler, and file storage. Version 4.1 is planned to include output via GeoServer, MapServer and QGIS, a REST API, and database/external storage. Progress has been limited by lack of resources for the open source project.
RIPEstat is introducing new features including a Widget API, Data API, and improved performance. The Widget API allows users to embed RIPEstat plugins on their own websites. The Data API provides access to RIPEstat data in JSON format. Performance has been improved by migrating several plugins to a new backend data cluster. Future plans include adding more widgets and data services, extending existing plugins, and improving backend performance. Feedback is encouraged on the RIPEstat website and mailing lists.
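The Data API returns plain JSON over HTTPS. Below is a minimal sketch in Java of querying it, assuming the public https://stat.ripe.net/data/<name>/data.json?resource=... URL layout of the RIPEstat Data API; the "network-info" endpoint and the example resource are illustrative only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a RIPEstat Data API query (Java 11+ HttpClient).
public class RipeStatQuery {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Endpoint name and resource are illustrative assumptions.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(
                    "https://stat.ripe.net/data/network-info/data.json?resource=193.0.6.139"))
                .build();
        // The response body is a JSON document with a "data" object.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```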
On 29 January 2020 ARCHIVER launched its Request for Tender with the purpose of awarding several Framework Agreements and work orders for the provision of R&D for hybrid end-to-end archival and preservation services that meet the innovation challenges of European Research communities, in the context of the European Open Science Cloud.
The tender was closed on 28 April 2020 and 15 R&D bids were submitted, with consortia that included 43 companies and organisations. The best bids have been selected and will start the first phase of the ARCHIVER R&D (Solution Design) in June 2020.
On Monday 8 June the selected consortia for the ARCHIVER design phase were announced during a Public Award Ceremony starting at 14.00 CEST.
In light of the COVID-19 outbreak and the consequent movement restrictions imposed in several countries, the event was organised as a webinar, virtually hosted by Port d’Informació Científica (PIC), a member of the Buyers Group of the ARCHIVER consortium.
The Kick-off marks the beginning of the Solution Design Phase.
Tim Bell gave a presentation on 4 November 2014 in Paris about using OpenStack at CERN to help answer fundamental physics questions. Some key challenges discussed included the large amount of data generated from particle collisions, which is expected to grow to 400 PB/year by 2023. OpenStack has been in production at CERN since 2013 and is used across multiple clouds totaling over 150,000 cores. The presentation covered CERN's experience migrating to OpenStack and addressed cultural barriers to adoption.
CERN operates the Large Hadron Collider (LHC) and other particle physics experiments that generate enormous amounts of data. To handle this data, CERN uses a hybrid cloud model combining on-premise resources with public cloud services. CERN is participating in a joint procurement project called HNSciCloud to procure cloud services that can seamlessly integrate with CERN's internal resources and European research networks. The goal is to establish a hybrid cloud platform to meet the growing computing needs of particle physics and other research domains dealing with large datasets.
This webinar covered tools from the UK Data Service Census Support for working with UK census data, boundaries, and postcodes. It demonstrated how to use the Boundary Data Selector to download census boundaries, the Thematic Mapper to create choropleth maps from census data, and the Postcode Data Selector to extract postcode data and add lookups to other geographies. The webinar provided an overview of the UK census and types of data available, and explained how these online tools can be used to access and visualize UK census and geographic data.
CERN is an international research organization located near Geneva that operates the largest particle physics laboratory in the world. It has over 3,000 staff members and associates from over 20 member states, with an annual budget of 1 billion Swiss francs. CERN operates the Large Hadron Collider, a 27 km ring that collides protons and heavy ions to study particle physics. Data from particle collisions is collected by large detectors and analyzed using the Worldwide LHC Computing Grid, which distributes computing tasks across data centers in over 40 countries. Some of CERN's key results include the 2012 discovery of the Higgs boson. CERN faces challenges in scaling its computing capabilities to handle rapidly increasing data volumes from the LHC and is transforming its computing infrastructure in response.
The document discusses the Finnish Meteorological Institute's (FMI) approach to providing weather data in an INSPIRE compliant format. It describes how FMI opened all of its data in 2013 through a single data portal that serves as both an open data and INSPIRE portal. It then covers the various data models used to structure different types of weather data, including observations, forecasts, and radar images. Finally, it discusses experiences with implementing the different models and serving the wide range of weather data sets.
The document discusses HDF and netCDF data support in ArcGIS. It provides an overview of how HDF and netCDF data can be directly ingested and used as raster datasets, mosaic datasets, feature layers, and tables in ArcGIS. This allows for scientific data to be displayed, analyzed, and shared using common GIS tools and services. It also describes existing Python tools for working with netCDF data and outlines future areas of development, including improved support for HDF5, THREDDS/OPeNDAP access, and evolving data standards.
This document discusses using RIPE Atlas measurements to analyze how "local" internet traffic stays within countries. The presenter describes running traceroutes between RIPE Atlas probes within countries to identify the presence of internet exchange points (IXPs) and out-of-country paths. Case studies on Sweden, France, and Argentina/Chile show results. Code for processing RIPE Atlas data and running monthly measurements for many countries is provided, with the goal of identifying opportunities for networks to improve local peering and routing.
The HDF Group provides a hosted JupyterLab environment called HDF Kita Lab, which offers access to HDF data stored on AWS S3 via the HDF Kita Server. HDF Kita Lab extends JupyterLab with features like auto-configuration of the Kita Server and HDF branding. It runs on a Kubernetes cluster in AWS that can scale to handle different numbers of users. Each user gets computing resources and access to HDF data on S3 for analysis via commonly used Python packages. The data on S3 provides unlimited storage and sharing capabilities between users.
This document provides an overview of visualizing big imaging data in radio astronomy. It discusses:
1) Facilities like Pawsey Supercomputing Centre and ICRAR that provide computational resources for processing and visualizing large astronomy data.
2) Common astronomy image formats like FITS and emerging "big data" formats like JPEG2000 that allow for multi-resolution and streaming visualization.
3) The SkuareView framework that implements remote visualization of JPEG2000 encoded astronomy data using JPIP by streaming different resolutions and regions of interest without downloading full datasets.
4) A demo of using SkuareView to interactively visualize multi-TB radio astronomy datasets stored in the cloud.
This document discusses disaster recovery for OpenStack clouds using hybrid cloud solutions. It describes using replication between on-premises and cloud storage for near-zero downtime disaster recovery. The goals are to protect OpenStack application instances and data, and recover them in an alternative cloud location if needed. The proposed solution would build disaster recovery as an OpenStack project using existing components like Heat, Swift, and Cinder with a pluggable architecture.
Prototype Phase Kick-off Event and Ceremony (Archiver)
On Monday 7 December 2020, the selected consortia for the ARCHIVER prototype phase were announced during a Public Award Ceremony.
The Kick-off marks the beginning of the Prototype Implementation Phase, in which the three consortia selected to move forward will build prototypes of their solutions, including all components; basic functionality, interoperability, and security tests will be performed by IT specialists from the buyers’ group.
The document discusses the National Snow and Ice Data Center's (NSIDC) use of HDF and HDF-EOS file formats to manage and distribute scientific cryosphere data. NSIDC collects data from satellites like MODIS, AMSR-E and aircraft missions, processes the data, and makes it available in HDF(5) formats. The HDF formats allow for efficient storage of multi-dimensional scientific datasets along with metadata. NSIDC develops tools to allow users to access, analyze and visualize data stored in HDF files.
This document summarizes a presentation given at the ONE Conference 2013 about using cloud computing for Earth observation ground segments. It describes four cases where the European Space Agency used cloud computing:
1. Mass re-processing of satellite data on Amazon Web Services for validation purposes, allowing processing of 30,000 products in 5 weeks.
2. Coupling large data dissemination and processing capabilities on dedicated servers from Hetzner for analyzing 38,000 satellite images and serving 3,000 users.
3. A collaborative exploitation platform using multiple cloud providers through Helix Nebula for exploiting Earth observation data from various sources and making it available to over 200 users.
4. Plans for a sandbox service providing researchers and service providers
The document describes several deployment scenarios for storing and processing scientific data from instruments like the MAGIC Telescopes.
It outlines scenarios for: 1) large file safe-keeping, 2) mixed file safe-keeping including smaller files and reprocessing outputs, 3) in-archive data processing, 4) distributing data to instrument analysts, and 5) external user access.
Challenges include meeting data transfer timelines, ensuring data integrity over long periods, and providing flexible access and metadata tools to support analysis and discovery. Commercial providers could offer solutions but managing trust and customization needs is also discussed.
This document provides an overview of the status of HDF-EOS software and tools. It describes HDF-EOS5, a rewrite of HDF-EOS2 based on HDF5, which is used operationally by EOS instrument teams. The document also outlines software releases, major developments including bug fixes, and future plans, and provides contact information for support.
The document summarizes theories on the relationship between the mind and the brain. It explains that consciousness has two components, motor and perceptual, and that the brain is composed of the limbic system and the reticular formation. It also analyzes monist and dualist theories on the relationship between immaterial experience and matter, and approaches for assessing the mind-brain state, such as neuromagnetic imaging and neuropsychological tests.
Learn about the People to People Ambassador Program's student travel program to London this holiday. This 10-day program gives students an opportunity to explore London, one of the most amazing cities in the world. Students will experience this sought-after global destination and its holiday traditions in ways most only dream about. Learn more: http://bit.ly/15EoukD
This document discusses various security issues that can arise in source control systems. It describes buffer overflow attacks, where a program writes data past the end of a memory buffer. It also discusses citizen/casual programmers who may not follow proper security practices. Covert channels that can transfer data in violation of security policies are described. The document outlines controls and best practices around these issues like parameter checking, memory protection, and auditing and logging.
The poem is about two lovers who have been through ups and downs in their relationship but ultimately keep coming back to each other. It describes how despite the stops and starts, their love is resilient and they are meant to be together forever. The poem uses imagery of two angels rescued from falling to represent the narrator and their partner finding their way back to each other through all of life's challenges.
This document summarizes different tenses in English grammar including the present simple, present continuous, past simple, past continuous, present perfect simple, will, and going to. For each tense it provides the positive, negative and interrogative forms, example sentences using the tense, time expressions commonly used with the tense, and the uses or meanings of the tense.
The letter provides elders guidance on handling accusations of child abuse, including legal reporting requirements and protecting victims. It outlines steps to take regarding known abusers, including warnings to parents if an individual is considered a predator likely to reoffend. Elders are directed to contact the Legal Department for legal advice on any abuse matters and to protect both children and the congregation's reputation.
The document summarizes the European Archival Records and Knowledge (E-ARK) project, which developed an OAIS-compliant system for fast creation, search, and access of archival information packages. It describes the key components and functionality of the E-ARK reference implementation, including tools for ingest, archival storage, data management, access, and data mining of archived content. Current pilots of the E-ARK system are being used by several national archives for large-scale archiving and access of records.
The Escuela Superior Politécnica de Chimborazo offers a levelling course from April to August 2013 with a module on problem solving. The course includes six students: Yajaira Villalva, Karina Tigse, Liseth Carpintero, Sadia Shiguango, Jhonny Quishpi and Edison Quishpe.
This document discusses information security governance and risk management. It covers several topics:
- The roles and responsibilities of various information security positions such as the security officer, system administrators, and end users.
- Security policies, standards, procedures, and frameworks that organizations can implement to formalize security practices.
- Compliance with regulations and how to map different compliance frameworks.
- Managing risks from third parties, acquisitions, and other organizational changes.
- Ensuring proper information security governance through activities like risk analysis, security awareness training, and oversight from executive management.
Hemophilia is a sex-linked disorder caused by a mutation in a blood clotting protein, affecting about 1 in 5,000 males. Symptoms include deep bruising, swelling from internal bleeding, joint pain, headaches, and unusual bleeding after immunizations. It is treated by replacing the deficient clotting factor through medication or donated blood plasma.
SCAPE Presentation at the Elag2013 conference in Gent/Belgium (Sven Schlarb)
Presentation of the European project SCAPE (www.scape-project.eu) at the Elag2013 conference in Gent/Belgium. The presentation includes details about use cases and implementation at the Austrian National Library.
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20... (SCAPE Project)
This presentation was given by Per Møldrup-Dalum at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation an overview of the project, its results and how to sustain it is given. For more information, see this blog post, http://bit.ly/SCAPE_SB_Demo, about the event.
This presentation describes the EU-funded project SCAPE – Scalable Preservation Environments –, its developments and sustainability plans.
The SCAPE project has developed scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects.
The project run-time was around 3½ years from 2011 to 2014.
Read more about SCAPE at www.scape-project.eu
SCAPE Information Day at BL - Some of the SCAPE Outputs Available (SCAPE Project)
The British Library hosted a ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Some tools were presented and demonstrated in more detail (see the other presentations) and the day was closed with a presentation by Will Palmer, Carl Wilson and Peter May of some of the other outputs that SCAPE has delivered.
A brief introduction to the SCAPE project co-funded by the European Union under the FP7 ICT program. A blog post leading you through the presentation can be found here: http://www.openplanetsfoundation.org/blogs/2012-12-10-scape-project-%E2%80%93-brief-introduction
SCAPE Webinar: Tools for uncovering preservation risks in large repositories (SCAPE Project)
This presentation originates from a webinar presented by Luís Faria. The webinar presents the SCAPE-developed tools Scout and C3PO and demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
Scout, the preservation watch system, centralizes all the necessary knowledge on the same platform, cross-referencing this knowledge to uncover all preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to get large-scale characterization profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, to allow the community to bring together all the necessary information into the system. The sharing of information opens new opportunities for joining forces against common problems.
The webinar was held 26 June 2014.
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat... (SCAPE Project)
At the ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014 Rune Bruun Ferneke-Nielsen presented how the library uses Jpylyzer, a SCAPE developed tool, to validate millions of JPEG 2000 files in connection with a large newspaper digitization project.
The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
Ross King, Project Director of SCAPE, gave a short presentation of the EU funded project SCAPE, including descriptions of tools for planning and monitoring digital preservation, scalable computation and repositories, SCAPE Testbeds and where to learn more.
The presentation was given at the workshop ‘Preservation at Scale’ http://bit.ly/17ppAln in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.
Scape information day at BL - Using Jpylyzer and Schematron for validating JP... (SCAPE Project)
The SCAPE developed tool Jpylyzer has long been in production use at a variety of institutions. The British Library uses Jpylyzer in combination with Schematron to validate JPEG2000 files.
The presentation by Will Palmer was given at the ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
Preservation Policy in SCAPE - Training, Aarhus (SCAPE Project)
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Barbara Sierman, Koninklijke Bibliotheek in the Netherlands, introduced the policy concept, previous work on policies and the work that has been done within SCAPE on preservation policies. SCAPE will build a catalogue of policy elements with three levels – guidance, preservation procedure, and control policies.
Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the five-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by DPC, in Glasgow on 15-19 July 2013.
The presentation gives an introduction to the SCAPE Platform, presents scenarios from the SCAPE Testbeds, and finally describes how to create scalable workflows and execute them on the SCAPE Platform.
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014 (SCAPE Project)
Hadoop has been used at the State and University Library, Denmark, in connection with an experiment on the migration of a large collection of audio files from mp3 to wav. This experiment was presented by Bolette Ammitzbøll Jurik at ‘SCAPE Information Day at the State and University Library, Denmark’, on 25 June 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
The experiment used Hadoop and Taverna but also xcorrSound waveform-compare which is a small tool developed within SCAPE to compare the content of audio files.
Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.
This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.
Artur Kulmukhametov, Vienna University of Technology, introduced the importance of content profiling and how this can be done with the help of the SCAPE developed tool C3PO. Content profiling is based on characteristics extracted from the files’ metadata and will help the user to plan digital preservation. The tool C3PO can be easily integrated with both PLATO and Scout.
At the iPres2013 conference in Lisbon, Portugal, in September 2013 Luís Faria, KEEP SOLUTIONS LDA, presented SCAPE work on monitoring of digital repositories and the tool, Scout, which has been developed in this connection. Scout is a web-based service that assists content holders in monitoring their digital repository and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.
Automatic Preservation Watch Using Information Extraction on the Web (Luis Faria)
iPRES 2013 presentation of a proof-of-concept experiment of using Information Extraction Technologies to do automatic preservation watch using natural language information on the Web.
SCAPE Information Day at BL - Characterising content in web archives with Nanite (SCAPE Project)
This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced the SCAPE developed tool Nanite which can help institutions analyze their web archive data.
EOSC support for the scientific computing needs of Earth Observation with the EGI Federated Cloud
The European Open Science Cloud (EOSC) supports multi-disciplinary science, and Earth Observation is one of the major use cases.
EOSC will provide capacity and capabilities for fostering the exploitation of EO data; this can be achieved by federating the cloud providers of EGI and DIAS with data analytics tools. In this presentation, we show how EOSC can rely on a public-private cloud federation to deliver its compute platform for EO.
An image based approach for content analysis in document collections (SCAPE Project)
Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013.
The development of tools for library workflows for duplicate content detection and content verification of complex documents was presented, accompanied by results of the work.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
GridMate - End to end testing is a critical piece to ensure quality and avoid... (ThomasParaiso2)
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
20240609 QFM020 Irresponsible AI Reading List May 2024
Application scenarios of the SCAPE project at the Austrian National Library
1. Sven Schlarb
Österreichische Nationalbibliothek
LIBER Satellite Event: APARSEN & SCAPE Workshop
21 May 2014, Austrian National Library, Vienna
Application scenarios of the SCAPE project at the Austrian National Library
2. Overview
• Examples of Big Data in memory institutions
• What are the SCAPE Testbeds?
• Motivation for the Austrian National Library
• Hadoop in a nutshell
• SCAPE Platform setup at the Austrian National Library
• Selected SCAPE tools
• Application scenarios
• Web Archiving
• Austrian Books Online
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
3. • Google Books Project: 30 million digital books
• http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
• Europeana: Metadata about over 24 million objects
• Europeana annual report and accounts 2012, Europeana Foundation, April 2013
• Hathi Trust: 10 million volumes (over 5.6 million titles) comprising over 3.7 billion book page images
• http://www.hathitrust.org/statistics_info
• Internet Archive: 364 billion pages, about 10 petabytes.
• http://archive.org and http://archive.org/web/petabox.php
Books, Journals, Newspapers, Websites. Big data?
4. SCAPE Project Overview
Takeup
• Stakeholders and Communities
• Dissemination
• Training Activities
• Sustainability
Platform
• Automation
• Workflows
• Parallelization
• Virtualization
5. • Good:
• Storing structured data
• Expressive query language
• ACID, type safety
• But:
• SQL Joins not efficient at scale
• ÖNB 2011: failed to create a complete web-archive index using single-instance MySQL (write performance!)
• Solution?
• Scaling vertically → bigger servers → hardware costs!
• Scaling horizontally → sharding → maintenance costs!
Pushing the boundaries of RDBMSs (e.g. MySQL)
6. Comparison of storage costs
• Hadoop means a cost advantage because:
• It usually runs on relatively inexpensive (commodity) hardware
• No binding to specific vendors
• Open-source software
Source: BITKOM guide ‘Big-Data-Technologien – Wissen für Entscheider’, 2014, p. 39
7. Dealing with large amounts of data
• Required to move data: from NAS to server, or to the cloud. Multi-terabyte scenarios?
• Immediate processing
• Unified storage and processing capabilities
• Distributed I/O
8. Some basic Hadoop assumptions
• When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor
• Fine-granular parallelisation: all processing cores of the cluster are used as processors
• Designed for failure: in large clusters hardware failure is the norm rather than the exception
• Redundancy: redundant storage of data blocks (default: 3 copies)
• Data locality: free nodes with direct access to the data do the processing
9. What is Hadoop (physically)?
Hadoop = MapReduce + HDFS: distributed processing (MapReduce) on top of distributed storage (HDFS).
• 2 x quad-core CPUs: 10 Map slots (parallelisation), 4 Reduce slots (aggregation)
• 4 x 1 TB hard disks with replication factor 3: 1.33 TB effective
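The effective-capacity figure follows directly from dividing the raw disk capacity by HDFS's replication factor of three:

```latex
\text{effective capacity} = \frac{\text{raw capacity}}{\text{replication factor}}
                          = \frac{4 \times 1\,\text{TB}}{3} \approx 1.33\,\text{TB}
```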
10. Configuration per CPU
Configuration of one quad-core CPU (= 1 node):
• 4 physical cores
• 8 hyper-threading cores (the system "sees" 8 cores)
• Core allocation: 1 for the OS, 5 for Map tasks, 2 for Reduce tasks
11. Experimental Cluster
Name Node / Job Tracker (master):
• CPU: 1 x 2.53 GHz quad-core CPU (8 hyper-threading cores)
• RAM: 16 GB
• DISK: 2 x 1 TB disks configured as RAID0 (performance) – 2 TB effective
Data Nodes / Task Trackers (workers):
• CPU: 2 x 2.40 GHz quad-core CPUs (16 hyper-threading cores)
• RAM: 24 GB
• DISK: 3 x 1 TB disks configured as RAID5 (redundancy) – 2 TB effective
• Of the 16 HT cores: 5 for Map, 2 for Reduce, 1 for the operating system (configuration sketch below)
• Cluster total: 25 processing cores for Map tasks, 10 processing cores for Reduce tasks
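In the MRv1 generation of Hadoop shown here (Job Tracker / Task Trackers), this Map/Reduce split is a per-node setting in mapred-site.xml. A minimal sketch with values matching the allocation above (these are the standard MRv1 property names, not the library's actual cluster files):

    <configuration>
      <!-- Concurrent map tasks per Task Tracker -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>5</value>
      </property>
      <!-- Concurrent reduce tasks per Task Tracker -->
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>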
12. What is Hadoop (conceptually)?
[Diagram: the input data is divided into input splits (here 3 splits of 3 records each, records 1–9); each split is processed by a Map task (Task 1, Task 2, Task 3); the intermediate output is sorted, shuffled, and merged; Reduce tasks then write the aggregated results as the output data.]
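In code, this model boils down to two methods. A minimal sketch of the canonical word-count job, shown only to make the Map and Reduce signatures concrete:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: called once per record of an input split; emits (word, 1) pairs
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: after sort/shuffle/merge, all values for one key arrive together
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(word, new IntWritable(sum));
            }
        }
    }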
13. Platform instance architecture at the Austrian National Library
• Access via REST API
• Taverna workflow engine for complex jobs
• Hive as the frontend for analytic queries
• MapReduce/Pig for Extraction, Transform, and Load (ETL)
• "Small" objects in HDFS or HBase
• "Large" digital objects stored on a NetApp filer
15. Overview of application scenarios
• Web Archiving
• Web archive MIME type identification
• Characterisation of web archive data
• Austrian Books Online
• Scenario 2: Image file format migration
• Scenario 3: Comparison of book derivatives
• Scenario 4: MapReduce in digitised book quality assurance
16. Web archiving
• Storage: ca. 45 TB
• ca. 1.7 billion objects
• Domain harvesting
• the entire top-level domain .at every 2 years
• Selective harvesting
• important websites that change regularly
• Event harvesting
• special occasions and events (e.g. elections)
17. File format identification in web archives
18. File format identification in web archives
[Diagram: (W)ARC container files, as written by the HERITRIX web crawler, hold heterogeneous records (JPG, GIF, HTML, MIDI, ...). A custom (W)ARC InputFormat with a (W)ARC RecordReader feeds the records into MapReduce; in the Map phase Apache Tika detects each record's MIME type (e.g. image/jpg), and the Reduce phase aggregates the counts per MIME type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]
Software integration                                   Throughput (GB/min)
Tika detector API in the Map phase                     6.17
FILE as a command-line application with MapReduce      1.70
Tika JAR as a command-line application with MapReduce  0.01

Data volume   Number of ARC files   Throughput (GB/min)
1 GB          10 x 100 MB           1.57
2 GB          20 x 100 MB           2.50
10 GB         100 x 100 MB          3.06
20 GB         200 x 100 MB          3.40
100 GB        1000 x 100 MB         3.71
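The first table shows why the integration method matters: embedding the Tika detector API directly in the Map phase avoids starting an external process for every record, which makes it orders of magnitude faster than invoking the Tika JAR on the command line. A sketch of such a mapper, assuming a (W)ARC input format whose record reader hands each record's payload to the map task as a BytesWritable (the class and key names are illustrative, not the actual SCAPE code):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.Tika;

    // Assumes an input format whose record reader emits (record URL, record payload)
    public class MimeDetectMapper
            extends Mapper<Text, BytesWritable, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Tika tika = new Tika(); // reused for all records: no process start-up cost

        @Override
        protected void map(Text url, BytesWritable payload, Context ctx)
                throws IOException, InterruptedException {
            // Detect the MIME type directly from the record bytes
            String mime = tika.detect(
                    new ByteArrayInputStream(payload.getBytes(), 0, payload.getLength()));
            // A summing reducer (as in the word-count sketch) aggregates counts per type
            ctx.write(new Text(mime), ONE);
        }
    }

The second table shows throughput rising with the number of ARC files: larger inputs produce more splits and keep more of the cluster's Map slots busy at once.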
19. Characterisation of web archive data
20. Characterisation of web archive data
21. Characterisation of web archive data
22. Austrian Books Online
• Public-private partnership with Google
• Only public-domain works
• Objective: scan ~ 600,000 volumes (~ 200 million pages)
• ~ 70 project team members, 20+ in the core team
• ~ 200,000 physical volumes scanned so far (~ 60 million pages)
23. ADOCO (Austrian Books Online Download & Control)
[Diagram: Google – public-private partnership – ADOCO download and control workflow; see PairTree: https://confluence.ucop.edu/display/Curation/PairTree]
24. Quality-assured image file format migration
• TIFF to JPEG2000 migration
• Objective: reduce storage costs by reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: mitigate the JPEG2000 file format obsolescence risk
• Different preservation tool categories (a wrapper sketch follows below):
• Validation
• Migration
• Quality assurance
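Tools in all three categories are typically command-line applications; a common pattern, generalised by SCAPE's ToMaR, is to wrap the invocation in a map task so each node processes the files it can reach. A minimal sketch, assuming an input text file of image paths and an encoder installed on every node (the opj_compress call and paths are purely illustrative, not the project's actual setup):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: a text file with one image path per line (illustrative setup).
    // Each map call shells out to a migration tool installed on every node.
    public class MigrationMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text pathLine, Context ctx)
                throws IOException, InterruptedException {
            String src = pathLine.toString().trim();
            String dst = src.replaceAll("\\.tif$", ".jp2");
            // Hypothetical encoder invocation; any TIFF-to-JP2 tool would do here
            Process p = new ProcessBuilder("opj_compress", "-i", src, "-o", dst)
                    .redirectErrorStream(true)
                    .start();
            int rc = p.waitFor();
            ctx.write(new Text(src), new Text(rc == 0 ? "OK" : "FAILED rc=" + rc));
        }
    }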
25. Comparison of book derivatives
• Compare different versions of the same book
• Images come from different scanning sources
• Images have been manipulated (cropped, rotated)
26. Using MapReduce for Quality Assurance
• 60,000 books, ~ 24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of Hadoop jobs:
• Hadoop Streaming API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
27. Using MapReduce for Quality Assurance
[Figure: examples of correct cropping vs. a cropping error, annotated with image width (Bildbreite) and text-block width (Blockbreite)]
Assumption: a "significant" difference between the average block width and the image width is an indicator of possible text loss due to a cropping error.
28. Using MapReduce for Quality Assurance
• Create input text files with file paths (JP2 & HTML)
• Read image metadata using Exiftool (Hadoop Streaming API)
• Create a sequence file containing all HTML files
• Calculate the average block width using MapReduce (see the sketch below)
• Load the data into Hive tables
• Execute the SQL test query
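A sketch of the averaging step, assuming the sequence file delivers (page identifier, OCR HTML) pairs to the mapper and that block widths can be pulled from the markup with a simple pattern (the block-width attribute is hypothetical; real OCR formats encode geometry differently):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BlockWidthJob {
        // Hypothetical pattern; real OCR output encodes block geometry differently
        private static final Pattern WIDTH = Pattern.compile("block-width=\"(\\d+)\"");

        // Map: emit (page id, width) for every text block found on the page
        public static class WidthMapper
                extends Mapper<Text, Text, Text, IntWritable> {
            @Override
            protected void map(Text pageId, Text html, Context ctx)
                    throws IOException, InterruptedException {
                Matcher m = WIDTH.matcher(html.toString());
                while (m.find()) {
                    ctx.write(pageId, new IntWritable(Integer.parseInt(m.group(1))));
                }
            }
        }

        // Reduce: average the block widths of each page
        public static class AvgReducer
                extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0, n = 0;
                for (IntWritable w : widths) { sum += w.get(); n++; }
                if (n > 0) {
                    ctx.write(pageId, new DoubleWritable((double) sum / n));
                }
            }
        }
    }

The Hive test query then relates each page's average block width to the image width extracted by Exiftool and flags pages where the relation looks suspicious.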
29. Summary
• Libraries can build cost-efficient solutions for storing large data collections
• HDFS as storage master or staging area?
• Local cluster vs. cloud?
• Apache Hadoop offers a stable core for building a large-scale processing platform, ready to be used in production
• Carefully select the additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, YARN, Ambari, etc.) that fit your needs
30. Further information
These slides on SlideShare:
• http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library
• Project website: www.scape-project.eu
• GitHub repository: www.github.com/openplanets
• Project wiki: www.wiki.opf-labs.org/display/SP/Home
SCAPE tools mentioned:
• ToMaR: http://openplanets.github.io/ToMaR/#
• Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
• Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
• C3PO: http://ifs.tuwien.ac.at/imp/c3po
Thank you! Questions?
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).