
LIBER Satellite Event, SCAPE by Sven Schlarb


Sven Schlarb from the Austrian National Library gave an overview of the library's application scenarios related to Web Archiving and the Austrian Books Online project.
The presentation was given at the LIBER Satellite Event on long-term accessibility of digital resources in theory and practice, https://liber2014.univie.ac.at/satellite-event/, in Vienna on 21 May 2014.


  1. Application scenarios of the SCAPE project at the Austrian National Library
     Sven Schlarb, Österreichische Nationalbibliothek
     LIBER Satellite Event: APARSEN & SCAPE Workshop
     21 May 2014, Austrian National Library, Vienna
  2. Overview
     • Examples of Big Data in memory institutions
     • What are the SCAPE Testbeds?
     • Motivation for the Austrian National Library
     • Hadoop in a nutshell
     • SCAPE Platform setup at the Austrian National Library
     • Selected SCAPE tools
     • Application scenarios
       • Web Archiving
       • Austrian Books Online
     This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  3. Books, Journals, Newspapers, Websites. Big data?
     • Google Books Project: 30 million digital books
       • http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
     • Europeana: Metadata about over 24 million objects
       • Europeana annual report and accounts 2012, Europeana Foundation, April 2013
     • Hathi Trust: 10 million volumes (over 5.6 million titles) comprising over 3.7 billion book page images
       • http://www.hathitrust.org/statistics_info
     • Internet Archive: 364 billion pages, about 10 petabytes
       • http://archive.org and http://archive.org/web/petabox.php
  4. SCAPE Project Overview
     Takeup
     • Stakeholders and Communities
     • Dissemination
     • Training Activities
     • Sustainability
     Platform
     • Automation
     • Workflows
     • Parallelization
     • Virtualization
  5. Pushing the boundaries of RDBMS (e.g. MySQL)
     • Good:
       • Storing structured data
       • Expressive query language
       • ACID, type safety
     • But:
       • SQL joins are not efficient at scale
       • ÖNB 2011: Failed to create a complete web-archive index using a single-instance MySQL (write performance!)
     • Solution?
       • Scaling vertically → bigger servers → hardware costs!
       • Scaling horizontally → sharding → maintenance costs!
  6. Comparison of storage costs
     • Hadoop offers a cost advantage because:
       • It usually runs on relatively inexpensive (commodity) hardware
       • No binding to specific vendors
       • Open-source software
     Source: BITKOM guide "Big-Data-Technologien – Wissen für Entscheider", 2014, p. 39
  7. Dealing with large amounts of data
     • Required to move data
       • From NAS to server
       • To cloud
     • Multi-terabyte scenarios?
     • Immediate processing
     • Unified storage and processing capabilities
     • Distributed I/O
  8. Some basic Hadoop assumptions
     • When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor
     • Fine-grained parallelisation: all processing cores of the cluster are used as processors
     • Designed for failure: in large clusters hardware failure is the norm rather than the exception
     • Redundancy: redundant storage of data blocks (default: 3 copies)
     • Data locality: free nodes with direct access to the data do the processing
  9. What is Hadoop (physically)?
     • Hadoop = MapReduce + HDFS
     • Distributed processing (MapReduce)
       • 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation)
     • Distributed storage (HDFS)
       • 4 x 1 TB hard disks with redundancy 3: 1.33 TB effective
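The effective-capacity figure on this slide follows from dividing raw disk space by the replication factor. A minimal sketch of that arithmetic, where the function name and parameters are illustrative and not part of Hadoop:

```python
def effective_capacity_tb(disks: int, disk_size_tb: float, replication: int = 3) -> float:
    """Usable capacity when every data block is stored `replication` times.

    Hypothetical helper for illustration; it ignores file system overhead
    and space reserved for non-HDFS use.
    """
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return disks * disk_size_tb / replication


# 4 x 1 TB disks, 3 copies of every block, as on the slide
print(round(effective_capacity_tb(4, 1.0), 2))
```

With a replication factor of 1 the same four disks would offer the full 4 TB, but without any redundancy against disk or node failure.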
  10. Configuration per CPU
     • Configuration of one quad-core CPU (= 1 node)
     • 4 physical cores
     • 8 hyperthreading cores (the system "sees" 8 cores)
     • Core allocation: 1 x OS, 5 x Map, 2 x Reduce
  11. Experimental Cluster
     • Roles: Job Tracker, Task Trackers, Data Nodes, Name Node
     • CPU: 1 x 2.53 GHz quad-core CPU (8 hyperthreading cores); RAM: 16 GB; disk: 2 x 1 TB disks configured as RAID0 (performance), 2 TB effective
       • Of 8 HT cores: 5 for Map, 2 for Reduce, 1 for the operating system
       → 25 processing cores for Map tasks
       → 10 processing cores for Reduce tasks
     • CPU: 2 x 2.40 GHz quad-core CPU (16 hyperthreading cores); RAM: 24 GB; disk: 3 x 1 TB disks configured as RAID5 (redundancy), 2 TB effective
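The cluster-wide totals can be reproduced from the per-node split of 5 Map and 2 Reduce cores. The sketch below assumes five worker nodes; that number is not stated on the slide and is inferred here from the totals (25 / 5 and 10 / 2).

```python
from typing import Tuple


def cluster_slots(workers: int, map_per_node: int = 5, reduce_per_node: int = 2) -> Tuple[int, int]:
    """Return (map cores, reduce cores) available across all worker nodes.

    The per-node defaults are taken from the slide; the worker count is an
    assumption supplied by the caller.
    """
    if workers < 0:
        raise ValueError("number of workers cannot be negative")
    return workers * map_per_node, workers * reduce_per_node


map_cores, reduce_cores = cluster_slots(workers=5)  # 5 workers is inferred, not stated
print(map_cores, reduce_cores)
```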
  12. What is Hadoop (conceptually)?
     • Input data
       • Input split 1: Record 1, Record 2, Record 3
       • Input split 2: Record 4, Record 5, Record 6
       • Input split 3: Record 7, Record 8, Record 9
     • Map: Task 1, Task 2, Task 3 (one per input split)
     • Sort, Shuffle, Merge
     • Reduce
     • Output data: aggregated results
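The data flow on this slide (splits, Map tasks, sort/shuffle, Reduce, aggregated output) can be imitated in a few lines of single-process Python. This is only a conceptual sketch and not the Hadoop API; the record values and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split, mapper):
    """Run the mapper over every record of one input split (one 'map task')."""
    return [pair for record in split for pair in mapper(record)]


def shuffle(pairs):
    """Group values by key and sort the keys, as the shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def run_job(splits, mapper, reducer):
    """Map each split independently, shuffle, then reduce each key group."""
    mapped = chain.from_iterable(map_phase(split, mapper) for split in splits)
    return {key: reducer(key, values) for key, values in shuffle(mapped).items()}


# Three input splits with three records each (illustrative data)
splits = [
    ["tiff", "jp2", "tiff"],
    ["jp2", "jp2", "html"],
    ["html", "tiff", "jp2"],
]
result = run_job(
    splits,
    mapper=lambda record: [(record, 1)],
    reducer=lambda key, values: sum(values),
)
print(result)
```

In a real cluster each split would be mapped on a node that holds the corresponding data blocks, and the shuffle would move intermediate pairs across the network to the reducers.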
  13. Platform instance architecture at the Austrian National Library
     • Access via REST API
     • Workflow engine for complex jobs
     • Hive as the frontend for analytic queries
     • MapReduce/Pig for Extract, Transform, Load (ETL)
     • "Small" objects in HDFS or HBase
     • "Large" digital objects stored on NetApp Filer
     Diagram labels: Taverna workflow engine, REST API
  14. Selected SCAPE tools in various application scenarios
     • ToMaR: scalable command-line processing
     • C3PO: large-scale content profiling
     • Jpylyzer: JPEG2000 file format validation
     • Matchbox: duplicate image detection
  15. Overview of application scenarios
     • Web Archiving
       • Web Archive Mime Type Identification
       • Characterisation of web archive data
     • Austrian Books Online
       • Scenario 2: Image File Format Migration
       • Scenario 3: Comparison of Book Derivatives
       • Scenario 4: MapReduce in Digitised Book Quality Assurance
  16. Web archiving
     • Storage: approx. 45 TB
     • Approx. 1.7 billion objects
     • Domain harvesting
       • Entire top-level domain .at every 2 years
     • Selective harvesting
       • Important websites that change regularly
     • Event harvesting
       • Special occasions and events (e.g. elections)
  17. File format identification in web archives
  18. File format identification in web archives
     • (W)ARC container holding records such as JPG, GIF, HTM, HTM, MID
     • (W)ARC InputFormat and (W)ARC RecordReader, based on the HERITRIX web crawler
     • Map: Apache Tika detects the MIME type of each record (e.g. JPG → image/jpg)
     • Reduce: counts per MIME type (image/jpg 1, image/gif 1, text/html 2, audio/midi 1)

     | Software integration | Throughput (GB/min) |
     |---|---|
     | TIKA detector API in Map phase | 6.17 |
     | FILE as command-line application with MapReduce | 1.70 |
     | TIKA JAR as command-line application with MapReduce | 0.01 |

     | Data volume | Number of ARC files | Throughput (GB/min) |
     |---|---|---|
     | 1 GB | 10 x 100 MB | 1.57 |
     | 2 GB | 20 x 100 MB | 2.5 |
     | 10 GB | 100 x 100 MB | 3.06 |
     | 20 GB | 200 x 100 MB | 3.40 |
     | 100 GB | 1000 x 100 MB | 3.71 |
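The counting pattern on this slide (the Map step emits one MIME type per record, the Reduce step sums per type) can be sketched as follows. The sketch makes a strong simplification: Apache Tika detects types from record content, whereas the hypothetical `detect_mime` below only looks up a file extension in a hand-written table, and the record names are invented. The MIME labels are copied from the slide.

```python
from collections import Counter
from pathlib import PurePosixPath

# Stand-in lookup table (assumption); Tika itself inspects the record bytes.
EXTENSION_TO_MIME = {
    ".jpg": "image/jpg",
    ".gif": "image/gif",
    ".htm": "text/html",
    ".mid": "audio/midi",
}


def detect_mime(record_name: str) -> str:
    """Hypothetical detector: map a record name to a MIME type by extension."""
    suffix = PurePosixPath(record_name).suffix.lower()
    return EXTENSION_TO_MIME.get(suffix, "application/octet-stream")


def map_record(record_name: str):
    """Map step: emit (MIME type, 1) for a single container record."""
    return detect_mime(record_name), 1


def count_mime_types(record_names):
    """Reduce step: sum the emitted ones per MIME type."""
    counts = Counter()
    for mime, one in map(map_record, record_names):
        counts[mime] += one
    return dict(counts)


container = ["a.jpg", "b.gif", "c.htm", "d.htm", "e.mid"]  # illustrative record names
print(count_mime_types(container))
```

The throughput table suggests why the in-process detector API was preferred over launching a command-line tool per record: process start-up cost dominates when records are small.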
  19. Characterisation of web archive data
  20. Characterisation of web archive data
  21. Characterisation of web archive data
  22. Austrian Books Online
     • Public-private partnership with Google
     • Only public domain
     • Objective: scan ~ 600,000 volumes
       • ~ 200 million pages
     • ~ 70 project team members
       • 20+ in core team
     • ~ 200,000 physical volumes scanned so far
       • ~ 60 million pages
  23. ADOCO (Austrian Books Online Download & Control)
     • https://confluence.ucop.edu/display/Curation/PairTree
     Diagram labels: Google, Public Private Partnership, ADOCO
  24. Quality assured image file format migration
     • TIFF to JPEG2000 migration
       • Objective: reduce storage costs by reducing the size of the images
     • JPEG2000 to TIFF migration
       • Objective: mitigate the JPEG2000 file format obsolescence risk
     • Different preservation tool categories:
       • Validation
       • Migration
       • Quality assurance
  25. Comparison of book derivatives
     • Compare different versions of the same book
     • Images come from different scanning sources
     • Images have been manipulated (cropped, rotated)
  26. Using MapReduce for Quality Assurance
     • 60,000 books, ~ 24 million pages
     • Using Taverna's "Tool service" (remote ssh execution)
     • Orchestration of different types of Hadoop jobs
       • Hadoop Streaming API
       • Hadoop Map/Reduce
       • Hive
     • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
     • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
  27. Using MapReduce for Quality Assurance
     • Assumption: a "significant" difference between average block width and image width is an indicator of possible text loss due to a cropping error.
     Figure labels: image width, block width, cropping error, correct cropping
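The assumption on this slide can be written down as a simple check. The slide does not say what counts as "significant", so the relative-difference measure and the threshold of 0.5 below are purely illustrative assumptions, as are the function names.

```python
from statistics import mean
from typing import Sequence


def relative_width_gap(image_width: int, block_widths: Sequence[int]) -> float:
    """Difference between image width and average block width, relative to the image width."""
    if image_width <= 0:
        raise ValueError("image width must be positive")
    if not block_widths:
        raise ValueError("at least one text block is required")
    return abs(image_width - mean(block_widths)) / image_width


def flag_page(image_width: int, block_widths: Sequence[int], threshold: float = 0.5) -> bool:
    """Flag a page for manual inspection when the gap exceeds the (assumed) threshold."""
    return relative_width_gap(image_width, block_widths) > threshold


print(flag_page(1000, [800, 900, 700]))  # average block width 800
print(flag_page(1000, [300, 200, 400]))  # average block width 300
```

A flagged page is only a candidate; the check narrows millions of pages down to a set small enough for a person to review.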
  28. Using MapReduce for Quality Assurance
     • Create input text files with file paths (JP2 & HTML)
     • Read image metadata using Exiftool (Hadoop Streaming API)
     • Create a sequence file containing all HTML files
     • Calculate average block width using MapReduce
     • Load data into Hive tables
     • Execute SQL test query
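The "calculate average block width" step can be pictured as a streaming-style reducer. The sketch assumes that an earlier map step has already emitted one tab-separated line per text block (`page id`, `block width`) and that the lines arrive sorted by page id, as they would after the shuffle; this line format and the page identifiers are assumptions, not the project's actual data layout.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def parse(line: str) -> Tuple[str, int]:
    """Split one 'page_id<TAB>block_width' line into its two fields."""
    page_id, width = line.rstrip("\n").split("\t")
    return page_id, int(width)


def reduce_average_width(sorted_lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (page_id, average block width) for input sorted by page id."""
    records = (parse(line) for line in sorted_lines if line.strip())
    for page_id, group in groupby(records, key=lambda record: record[0]):
        widths = [width for _, width in group]
        yield page_id, sum(widths) / len(widths)


# Illustrative input; in a streaming job these lines would come from sys.stdin
lines = ["page-0001\t800\n", "page-0001\t900\n", "page-0002\t300\n"]
for page_id, average in reduce_average_width(lines):
    print(f"{page_id}\t{average:.1f}")
```

The resulting per-page averages are the kind of rows that could then be loaded into a Hive table and joined against the image widths read in the Exiftool step.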
  29. Summary
     • Possibility for libraries to build cost-efficient solutions for storing large data collections
       • HDFS as storage master or staging area?
       • Local cluster vs. cloud?
     • Apache Hadoop offers a stable core for building a large-scale processing platform; ready to be used in production
     • Carefully select additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, Yarn, Ambari, etc.) that fit your needs
  30. Further information
     • These slides on Slideshare: http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library
     • Project website: www.scape-project.eu
     • Github repository: www.github.com/openplanets
     • Project Wiki: www.wiki.opf-labs.org/display/SP/Home
     SCAPE tools mentioned
     • ToMaR: http://openplanets.github.io/ToMaR/#
     • Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
     • Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
     • C3PO: http://ifs.tuwien.ac.at/imp/c3po
     Thank you! Questions?