Presentation of the European project SCAPE (www.scape-project.eu) at the ELAG 2013 conference in Gent, Belgium. The presentation includes details about use cases and implementation at the Austrian National Library.
Sven Schlarb, Scientist at AIT Austrian Institute of Technology GmbH
SCAPE presentation at the ELAG 2013 conference in Gent, Belgium
1. Sven Schlarb
Austrian National Library
ELAG 2013
Gent, Belgium, May 29, 2013
An open source infrastructure for preserving large collections of digital objects
The SCAPE project at the Austrian National Library
2. • SCAPE project overview
• Application areas at the Austrian National Library
• Web Archiving
• Austrian Books Online
• SCAPE at the Austrian National Library
• Hardware set-up
• Open source software architecture
• Application Scenarios
• Lessons learnt
Overview
This work was partially supported by the SCAPE Project.
The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
3. • Ability to process large and complex data sets in preservation scenarios
• Increasing amount of data in data centers and memory institutions
Motivation
[Chart: Volume, Velocity, and Variety of data; projected data growth from 1970 to 2030. Cf. Jisc (2012), Activity Data: Delivering benefits from the data deluge, available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx]
4. • "Big data" is a buzzword, just a vague idea
• There is no definitive GB, TB, PB (…) threshold where data becomes "big data"; it depends on the institutional context
• Massive growth of data to be stored and processed
• A situation where the usually employed solutions no longer fulfill the new scalability requirements
• Pushing the limits of conventional database solutions
• Simple batch processing becomes tedious or even impossible
"Big Data" in a library context?
5. • EU-funded FP7 project, led by the Austrian Institute of Technology
• Consortium: 16 partners
  • National libraries
  • Data centers and memory institutions
  • Research institutes and universities
  • Commercial partners
• Started 2011, runs until mid/end-2014
SCAPE Consortium
7. MapReduce/Hadoop in a nutshell
[Diagram: the input data is split across parallel Map tasks (Task 1, Task 2, Task 3); their intermediate results are handed to Reduce tasks, which combine them into aggregated results that form the output data]
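The canonical concrete example of this pattern is word counting: the Map phase tokenises its input split and emits (word, 1) pairs, the framework groups the pairs by word, and the Reduce phase sums the counts per word, mirroring the aggregation step in the diagram. A minimal, era-appropriate sketch against Hadoop's org.apache.hadoop.mapreduce API (all class and path names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: tokenise each input line and emit (word, 1) for every token
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups the pairs by word; sum the counts per word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same map-emit/reduce-aggregate shape recurs in the application scenarios later in the deck, with web archive records or book pages in place of text lines.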
8. Experimental Cluster
• Name Node / Job Tracker (controller):
  CPU: 2 x 2.40GHz quad-core CPUs (16 HyperThreading cores)
  RAM: 24GB
  DISK: 3 x 1TB disks configured as RAID5 (redundancy) – 2 TB effective
• Data Nodes / Task Trackers (5 worker nodes), each:
  CPU: 1 x 2.53GHz quad-core CPU (8 HyperThreading cores)
  RAM: 16GB
  DISK: 2 x 1TB disks configured as RAID0 (performance) – 2 TB effective
• Of the 8 HT cores per worker node: 5 for Map, 2 for Reduce, 1 for the operating system – in total, 25 processing cores for Map tasks and 10 cores for Reduce tasks
9. • Access via REST API
• Workflow engine for complex jobs
• Hive as the frontend for analytic queries (see the query sketch below)
• MapReduce/Pig for Extraction, Transform, and Load (ETL)
• "Small" objects in HDFS or HBase
• "Large" digital objects stored on a NetApp filer
Platform Architecture
[Diagram: Taverna workflow engine driving the platform via the REST API]
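To make the "Hive as frontend" bullet concrete: clients can submit analytic SQL to Hive over JDBC, and Hive compiles the query into MapReduce jobs on the cluster. A minimal sketch, assuming a HiveServer (version 1, matching the Hadoop generation discussed here) listening on port 10000 and a hypothetical table named mimetypes; both are assumptions, not the project's actual setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer1-era JDBC driver, in keeping with Hadoop ~1.x deployments
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // Count records per MIME type; Hive turns this into MapReduce jobs
    // (table name 'mimetypes' is hypothetical)
    ResultSet rs = stmt.executeQuery(
        "SELECT mimetype, COUNT(*) FROM mimetypes GROUP BY mimetype");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}
```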
10. • Web Archiving
• Scenario 1: Web Archive MIME Type Identification
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
Application scenarios
11. • Physical storage: 19 TB
• Raw data: 32 TB
• Number of objects: 1,241,650,566
• Domain harvesting
  • The entire top-level domain .at every 2 years
• Selective harvesting
  • Important websites that change regularly
• Event harvesting
  • Special occasions and events (e.g. elections)
Key Data Web Archiving
12. [Diagram: a (W)ARC container holding JPG, GIF, HTM, and MID records is read through a (W)ARC InputFormat and (W)ARC RecordReader, based on the HERITRIX web crawler's (W)ARC read/write libraries. In the MapReduce job, the Map phase runs Apache Tika on each record to detect its MIME type (e.g. image/jpg), and the Reduce phase counts the records per MIME type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1. A mapper sketch follows below.]
Scenario 1: Web Archive MIME Type Identification
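A sketch of the Map side of this job, assuming a hypothetical WarcInputFormat that presents each archive record to map() as a (record URL, payload bytes) pair; SCAPE's actual readers are built on the Heritrix (W)ARC libraries named above. The Reduce side is the same summing reducer as in the word-count sketch earlier:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;

public class MimeTypeMapper
    extends Mapper<Text, BytesWritable, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Tika tika = new Tika(); // Apache Tika type detector
  private final Text mimeType = new Text();

  @Override
  protected void map(Text recordUrl, BytesWritable payload, Context context)
      throws IOException, InterruptedException {
    // Detect the MIME type of the record content with Tika
    String detected = tika.detect(new ByteArrayInputStream(
        payload.getBytes(), 0, payload.getLength()));
    mimeType.set(detected);
    // Emit (mime type, 1); the reducer sums these, exactly like word count
    context.write(mimeType, ONE);
  }
}
```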
14. • Public-private partnership with Google
• Only public domain works
• Objective: scan ~600,000 volumes
  • ~200 million pages
• ~70 project team members
  • 20+ in the core team
• ~130K physical volumes scanned so far
  • ~40 million pages
Key Data Austrian Books Online
15.
ADOCO (Austrian Books Online Download & Control)
[Diagram: Google delivers the scanned volumes via the public-private partnership; ADOCO downloads and controls them, storing the content in a PairTree directory structure (https://confluence.ucop.edu/display/Curation/PairTree)]
16. • Task: image file format migration (see the mapper sketch below)
  • TIFF to JPEG2000 migration
    • Objective: reduce storage costs by reducing the size of the images
  • JPEG2000 to TIFF migration
    • Objective: mitigation of the JPEG2000 file format obsolescence risk
• Challenges:
  • Integrating validation, migration, and quality assurance
  • Computing-intensive quality assurance
Scenario 2: Image file format migration
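A minimal sketch of how such a migration can be parallelised with MapReduce, assuming the job input is a text file of TIFF paths on storage every worker can reach (cf. the NetApp filer on the architecture slide) and that ImageMagick's convert is installed on each node; the validation and quality-assurance steps named above are deliberately left out:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TiffToJp2Mapper
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input line holds one TIFF path on shared storage
    String tiffPath = value.toString().trim();
    String jp2Path = tiffPath.replaceAll("\\.tiff?$", ".jp2");

    // Shell out to the migration tool; ImageMagick picks the target
    // format from the .jp2 extension
    Process p = new ProcessBuilder("convert", tiffPath, jp2Path)
        .redirectErrorStream(true)
        .start();
    int exitCode = p.waitFor();

    // Emit (input path, status) so failed migrations can be collected
    // from the job output and retried
    context.write(new Text(tiffPath),
        new Text(exitCode == 0 ? "OK " + jp2Path : "FAILED " + exitCode));
  }
}
```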
17. • Task: compare different versions of the same book
  • Images have been manipulated (cropped, rotated) and stored in different locations
  • Images come from different scanning sources or were subject to different modification procedures
• Challenges:
  • Computing-intensive (average runtime per book on a single quad-core server: ~4.5 hours)
  • 130,000 books, ~320 pages each
• SCAPE tool: Matchbox
Scenario 3: Comparison of book derivatives
18. • ETL processing of 60,000 books, ~24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of Hadoop jobs (see the driver sketch below):
  • Hadoop Streaming API
  • Hadoop MapReduce
  • Hive
• Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
Scenario 4: MapReduce in Quality Assurance
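In the scenario itself, Taverna performs the orchestration by invoking the hadoop command line over ssh. Purely to illustrate the sequencing dependency between chained jobs, a minimal plain-Java driver that runs one job to completion before starting the next could look as follows (job names and paths are hypothetical; the default identity map/reduce stands in for the real processing steps):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);        // e.g. lists of book pages
    Path intermediate = new Path(args[1]); // hand-over between the two jobs
    Path output = new Path(args[2]);

    // Job 1: e.g. extraction over the input files (identity map/reduce here)
    Job extract = new Job(conf, "extract");
    extract.setJarByClass(ChainedDriver.class);
    FileInputFormat.addInputPath(extract, input);
    FileOutputFormat.setOutputPath(extract, intermediate);
    if (!extract.waitForCompletion(true)) {
      System.exit(1); // stop the chain if the first job fails
    }

    // Job 2: e.g. aggregation over job 1's output; starts only now
    Job aggregate = new Job(conf, "aggregate");
    aggregate.setJarByClass(ChainedDriver.class);
    FileInputFormat.addInputPath(aggregate, intermediate);
    FileOutputFormat.setOutputPath(aggregate, output);
    System.exit(aggregate.waitForCompletion(true) ? 0 : 1);
  }
}
```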
19. • Create input text files containing the file paths (JP2 & HTML)
• Read image metadata using ExifTool (Hadoop Streaming API)
• Create a sequence file containing all HTML files
• Calculate the average block width using MapReduce (sketched below)
• Load the data into Hive tables
• Execute an SQL test query
Scenario 4: MapReduce in Quality Assurance
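A sketch of the average-block-width step, assuming the upstream step writes text lines of the form "<bookId><TAB><blockWidth>" (the field layout is an assumption, not the project's actual format). The two classes plug into a driver exactly like the word-count example earlier:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageBlockWidth {

  // Map: parse "<bookId>\t<blockWidth>" and emit (bookId, width)
  public static class WidthMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        context.write(new Text(fields[0]),
            new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // Reduce: average all block widths seen for one book
  public static class AverageReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text bookId, Iterable<DoubleWritable> widths,
        Context context) throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable w : widths) {
        sum += w.get();
        count++;
      }
      context.write(bookId, new DoubleWritable(sum / count));
    }
  }
}
```

The per-book averages can then be loaded into a Hive table for the SQL test query in the final step.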
25. • Emergence of new options for creating large-scale storage and processing infrastructures
  • HDFS as storage master or staging area?
  • Create a local cluster or rent a cloud infrastructure?
• Apache Hadoop offers a stable core for building a large-scale processing platform that is ready to be used in production
• It is important to carefully select the additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, YARN, Ambari, etc.) that fit your needs
Lessons learnt