An Open Source Infrastructure for Preserving Large Collections of Digital Objects

Today’s libraries curate large digital collections, index millions of full-text documents, and preserve terabytes of data for future generations. To do so, they must adopt new methods for processing large amounts of data. This is where the SCAPE project (www.scape-project.eu) comes into play: it offers an open source infrastructure, together with a variety of tools and services, for the distributed processing of large data sets with a focus on long-term preservation.
In this context, we present an open source infrastructure for preserving large collections of digital objects. It was created at the Austrian National Library for quality assurance tasks in the management of a large digital book collection. We describe the experimental cluster hardware and the software components used to build the infrastructure. More concretely, we show a set of best practices for analysing large document image collections with Apache Hadoop. Different types of Hadoop jobs (Hadoop-Streaming-API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) orchestrates complex data processing tasks.

Published in: Technology, Art & Photos

  1. An open source infrastructure for preserving large collections of digital objects: The SCAPE project at the Austrian National Library. Sven Schlarb, Austrian National Library. ELAG 2013, Gent, Belgium, May 29, 2013.
  2. Overview
     • SCAPE project overview
     • Application areas at the Austrian National Library
       • Web Archiving
       • Austrian Books Online
     • SCAPE at the Austrian National Library
       • Hardware set-up
       • Open source software architecture
     • Application scenarios
     • Lessons learnt
     This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  3. Motivation
     • Ability to process large and complex data sets in preservation scenarios
     • Increasing amount of data in data centers and memory institutions
     [Figure: volume, velocity, and variety of data; time axis 1970, 2000, 2030.] Cf. Jisc (2012), Activity Data: Delivering benefits from the data deluge, available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
  4. “Big Data” in a library context?
     • “Big data” is a buzzword, just a vague idea
     • No definitive GB, TB, PB (…) threshold at which data becomes “big data”; it depends on the institutional context
     • Massive growth of data to be stored and processed
     • Situation where the solutions usually employed no longer fulfill new scalability requirements
       • Pushing the limits of conventional database solutions
       • Simple batch processing becomes tedious or even impossible
  5. SCAPE Consortium
     • EU-funded FP7 project, led by the Austrian Institute of Technology
     • Consortium: 16 partners
       • National libraries
       • Data centers and memory institutions
       • Research institutes and universities
       • Commercial partners
     • Started in 2011, runs until mid/end 2014
  6. SCAPE Project Overview
     • Testbeds: data sets, integration, evaluation
     • Preservation Components: quality assurance, scalable components, automation-ready tools
     • Planning and Watch: institutional policies, technical watch, automated planning
     • Takeup: stakeholders and communities, dissemination, training activities, sustainability
     • Platform: automation, workflows, parallelization, virtualization
  7. MapReduce/Hadoop in a nutshell
     • Input data is divided into input splits, each holding several records (input split 1: records 1 to 3; input split 2: records 4 to 6; input split 3: records 7 to 9)
     • Map: Task 1, Task 2, Task 3
     • Sort, shuffle, merge
     • Reduce: aggregated results
     • Output data
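The flow on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the record contents and the function names (`map_record`, `run_map_task`, `shuffle`, `reduce_values`, `run_job`) are hypothetical and only mirror the stages named on the slide.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: three input splits with three records each,
# mirroring the layout on the slide. The record contents are invented.
INPUT_SPLITS = [
    ["apple", "pear", "apple"],
    ["pear", "plum", "apple"],
    ["plum", "plum", "plum"],
]


def map_record(record):
    """Map step: emit one (key, value) pair per record."""
    return [(record, 1)]


def run_map_task(split):
    """One map task processes all records of one input split."""
    return list(chain.from_iterable(map_record(r) for r in split))


def shuffle(map_outputs):
    """Sort/shuffle/merge: group the values by key across all map tasks."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_outputs):
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_values(key, values):
    """Reduce step: aggregate all values that share a key."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle, and reduce in sequence and return the aggregates."""
    map_outputs = [run_map_task(split) for split in splits]
    grouped = shuffle(map_outputs)
    return dict(reduce_values(key, values) for key, values in grouped.items())
```

Calling `run_job(INPUT_SPLITS)` returns one aggregated count per distinct record value. In a real cluster the map tasks would run in parallel on different nodes, which this sketch does not attempt to model.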
  8. Experimental Cluster
     • Node roles: Job Tracker, Task Trackers, Data Nodes, Name Node
     • CPU: 1 x 2.53 GHz quad-core CPU (8 HyperThreading cores); RAM: 16 GB; disk: 2 x 1 TB disks configured as RAID0 (performance), 2 TB effective
     • Of 16 HT cores: 5 for Map, 2 for Reduce, 1 for the operating system, resulting in 25 processing cores for Map tasks and 10 cores for Reduce tasks
     • CPU: 2 x 2.40 GHz quad-core CPU (16 HyperThreading cores); RAM: 24 GB; disk: 3 x 1 TB disks configured as RAID5 (redundancy), 2 TB effective
  9. Platform Architecture
     • Access via REST API
     • Workflow engine for complex jobs
     • Hive as the frontend for analytic queries
     • MapReduce/Pig for Extract, Transform, and Load (ETL)
     • “Small” objects in HDFS or HBase
     • “Large” digital objects stored on NetApp Filer
     [Diagram components: Digital Objects Storage; hOCR/Text/METS/(W)ARC in HDFS; MapReduce; Hive (SQL); Pig (ETL); HBase; Taverna workflow engine; REST API]
  10. Application scenarios
      • Web Archiving
        • Scenario 1: Web Archive MIME Type Identification
      • Austrian Books Online
        • Scenario 2: Image File Format Migration
        • Scenario 3: Comparison of Book Derivatives
        • Scenario 4: MapReduce in Digitised Book Quality Assurance
  11. Key Data Web Archiving
      • Physical storage: 19 TB
      • Raw data: 32 TB
      • Number of objects: 1,241,650,566
      • Domain harvesting
        • Entire top-level domain .at every 2 years
      • Selective harvesting
        • Important websites that change regularly
      • Event harvesting
        • Special occasions and events (e.g. elections)
  12. Scenario 1: Web Archive MIME Type Identification
      • Input: (W)ARC container holding records such as JPG, GIF, HTM, HTM, MID
      • (W)ARC InputFormat and (W)ARC RecordReader, based on the (W)ARC read/write code of the HERITRIX web crawler
      • Map: Apache Tika detects the MIME type of each record (e.g. JPG yields image/jpg)
      • Reduce: count per MIME type (image/jpg 1, image/gif 1, text/html 2, audio/midi 1)

      Tool integration pattern                             Throughput (GB/min)
      TIKA detector API call in Map phase                  6.17
      FILE called as command line tool from map/reduce     1.70
      TIKA JAR command line tool called from map/reduce    0.01

      Amount of data    Number of ARC files    Throughput (GB/min)
      1 GB              10 x 100 MB            1.57
      2 GB              20 x 100 MB            2.5
      10 GB             100 x 100 MB           3.06
      20 GB             200 x 100 MB           3.40
      100 GB            1000 x 100 MB          3.71
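The map and reduce roles in this scenario can be sketched in a few lines. This is a simplified illustration, not the project's implementation: the hypothetical `EXTENSION_TO_MIME` lookup stands in for Apache Tika (which inspects content rather than file names), and the record names in `RECORDS` are invented.

```python
from collections import Counter

# Stand-in for Apache Tika: a hypothetical lookup by file extension.
# Tika itself inspects the record content; this table is only an assumption
# that reproduces the MIME type labels shown on the slide.
EXTENSION_TO_MIME = {
    "jpg": "image/jpg",
    "gif": "image/gif",
    "htm": "text/html",
    "mid": "audio/midi",
}

# Invented record names matching the record types on the slide.
RECORDS = ["a.jpg", "b.gif", "c.htm", "d.htm", "e.mid"]


def detect_mime(record_name):
    """Return the MIME type for a record, or a generic fallback."""
    extension = record_name.rsplit(".", 1)[-1].lower()
    return EXTENSION_TO_MIME.get(extension, "application/octet-stream")


def map_phase(record_names):
    """Map: emit (MIME type, 1) for every record in the container."""
    for name in record_names:
        yield detect_mime(name), 1


def reduce_phase(pairs):
    """Reduce: sum the counts per MIME type."""
    counts = Counter()
    for mime_type, count in pairs:
        counts[mime_type] += count
    return dict(counts)
```

`reduce_phase(map_phase(RECORDS))` yields the same per-type counts as the slide's example. The throughput table suggests why the detector is best called through its API inside the map phase: starting an external process per record is far slower.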
  13. Scenario 1: Web Archive MIME Type Identification [Figure: TIKA 1.0 and DROID 6.01]
  14. Key Data Austrian Books Online
      • Public-private partnership with Google
      • Only public domain
      • Objective: scan ~600,000 volumes
        • ~200 million pages
      • ~70 project team members
        • 20+ in core team
      • ~130K physical volumes scanned so far
        • ~40 million pages
  15. ADOCO (Austrian Books Online Download & Control)
      • Process stages: Digitisation, Download & Storage, Quality Control, Access
      • Diagram labels: Google Public Private Partnership, ADOCO
      • https://confluence.ucop.edu/display/Curation/PairTree
  16. Scenario 2: Image file format migration
      • Task: Image file format migration
        • TIFF to JPEG2000 migration
          • Objective: Reduce storage costs by reducing the size of the images
        • JPEG2000 to TIFF migration
          • Objective: Mitigate the risk of JPEG2000 file format obsolescence
      • Challenges:
        • Integrating validation, migration, and quality assurance
        • Computationally intensive quality assurance
  17. Scenario 3: Comparison of book derivatives
      • Task: Compare different versions of the same book
        • Images have been manipulated (cropped, rotated) and stored in different locations
        • Images come from different scanning sources or were subject to different modification procedures
      • Challenges:
        • Computationally intensive (average runtime per book on a single quad-core server: ~4.5 hours)
        • 130,000 books, ~320 pages each
      • SCAPE tool: Matchbox
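The runtime figures on this slide imply a total that makes the need for parallel processing concrete. The arithmetic below uses only the two numbers from the slide. The helper `estimated_days` is hypothetical and assumes perfect linear scaling across servers, which a real cluster would not fully reach.

```python
# Figures taken from the slide.
BOOKS = 130_000
HOURS_PER_BOOK = 4.5  # average runtime on a single quad-core server

# Sequential processing on one server.
total_hours = BOOKS * HOURS_PER_BOOK  # 585,000 hours
total_days = total_hours / 24         # 24,375 days
total_years = total_days / 365        # roughly 66.8 years


def estimated_days(servers):
    """Idealised estimate: assumes the work divides evenly across servers."""
    return total_days / servers
```

On a single server the comparison of the whole collection would take decades, so even an idealised estimate such as `estimated_days(5)` (4,875 days) shows that both more nodes and faster per-book processing are needed.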
  18. Scenario 4: MapReduce in Quality Assurance
      • ETL processing of 60,000 books, ~24 million pages
      • Using Taverna’s “Tool service” (remote SSH execution)
      • Orchestration of different types of Hadoop jobs
        • Hadoop-Streaming-API
        • Hadoop Map/Reduce
        • Hive
      • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
      • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
  19. Scenario 4: MapReduce in Quality Assurance
      • Create input text files containing file paths (JP2 & HTML)
      • Read image metadata using Exiftool (Hadoop Streaming API)
      • Create sequence file containing all HTML files
      • Calculate average block width using MapReduce
      • Load data into Hive tables
      • Execute SQL test query
  20. Reading image metadata
      • Jp2PathCreator: find lists the JP2 files on the NAS and writes their paths to a text file (1.4 GB):
          /NAS/Z119585409/00000001.jp2
          /NAS/Z119585409/00000002.jp2
          /NAS/Z119585409/00000003.jp2
          …
          /NAS/Z117655409/00000001.jp2
          /NAS/Z117655409/00000002.jp2
          /NAS/Z117655409/00000003.jp2
          …
          /NAS/Z119585987/00000001.jp2
          /NAS/Z119585987/00000002.jp2
          /NAS/Z119585987/00000003.jp2
          …
          /NAS/Z119584539/00000001.jp2
          /NAS/Z119584539/00000002.jp2
          /NAS/Z119584539/00000003.jp2
          …
          /NAS/Z119599879/00000001.jp2
          /NAS/Z119589879/00000002.jp2
          /NAS/Z119589879/00000003.jp2
          ...
      • HadoopStreamingExiftoolRead: reads the files from the NAS and writes one identifier and value per page (1.2 GB):
          Z119585409/00000001 2345
          Z119585409/00000002 2340
          Z119585409/00000003 2543
          …
          Z117655409/00000001 2300
          Z117655409/00000002 2300
          Z117655409/00000003 2345
          …
          Z119585987/00000001 2300
          Z119585987/00000002 2340
          Z119585987/00000003 2432
          …
          Z119584539/00000001 5205
          Z119584539/00000002 2310
          Z119584539/00000003 2134
          …
          Z119599879/00000001 2312
          Z119589879/00000002 2300
          Z119589879/00000003 2300
          ...
      • 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h
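A Hadoop Streaming mapper for this step could look roughly like the sketch below. It is an assumption about the shape of such a script, not the workflow's actual code. The function names are hypothetical, the output value is assumed to be the image width, and `exiftool` is assumed to be installed on every worker node. The `-s3` option prints the bare tag value and `-ImageWidth` selects the tag.

```python
import subprocess
import sys


def page_id(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into 'Z119585409/00000001'."""
    book, filename = path.strip().split("/")[-2:]
    return f"{book}/{filename.rsplit('.', 1)[0]}"


def exiftool_width(path):
    """Ask the exiftool command line tool for the image width in pixels."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def run_mapper(lines, read_width=exiftool_width, out=sys.stdout):
    """Streaming mapper: one file path per input line, tab-separated output."""
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines in the path list
        out.write(f"{page_id(path)}\t{read_width(path)}\n")
```

Used as a streaming mapper, the script would end with a call to `run_mapper(sys.stdin)`. Passing the width reader as a parameter keeps the path handling testable without the external tool.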
  21. SequenceFile creation
      • HtmlPathCreator: find lists the HTML files on the NAS and writes their paths to a text file (1.4 GB):
          /NAS/Z119585409/00000707.html
          /NAS/Z119585409/00000708.html
          /NAS/Z119585409/00000709.html
          …
          /NAS/Z138682341/00000707.html
          /NAS/Z138682341/00000708.html
          /NAS/Z138682341/00000709.html
          …
          /NAS/Z178791257/00000707.html
          /NAS/Z178791257/00000708.html
          /NAS/Z178791257/00000709.html
          …
          /NAS/Z967985409/00000707.html
          /NAS/Z967985409/00000708.html
          /NAS/Z967985409/00000709.html
          …
          /NAS/Z196545409/00000707.html
          /NAS/Z196545409/00000708.html
          /NAS/Z196545409/00000709.html
          ...
      • SequenceFileCreator: reads the files from the NAS and writes them into a SequenceFile (997 GB uncompressed), keyed by page identifier:
          Z119585409/00000707
          Z119585409/00000708
          Z119585409/00000709
          Z119585409/00000710
          Z119585409/00000711
          Z119585409/00000712
      • 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h
  22. Calculate average block width using MapReduce
      • HadoopAvBlockWidthMapReduce: reads the SequenceFile and writes a text file
      • Input keys (SequenceFile):
          Z119585409/00000001
          Z119585409/00000002
          Z119585409/00000003
          Z119585409/00000004
          Z119585409/00000005
          ...
      • Map output (page identifier, one value per block):
          Z119585409/00000001: 2100, 2200, 2300, 2400
          Z119585409/00000002: 2100, 2200, 2300, 2400
          Z119585409/00000003: 2100, 2200, 2300, 2400
          Z119585409/00000004: 2100, 2200, 2300, 2400
          Z119585409/00000005: 2100, 2200, 2300, 2400
      • Reduce output (page identifier, average):
          Z119585409/00000001 2250
          Z119585409/00000002 2250
          Z119585409/00000003 2250
          Z119585409/00000004 2250
          Z119585409/00000005 2250
      • 60,000 books (24 million pages): ~6 h
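The map and reduce logic of this step can be sketched as follows. This is a sketch under assumptions, not the HadoopAvBlockWidthMapReduce source: it assumes that blocks are hOCR elements of class `ocr_carea` whose `title` attribute carries `bbox x0 y0 x1 y1`, and that a block's width is `x1 - x0`. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a block is an element with class "ocr_carea", and its title
# attribute holds the bounding box as "bbox x0 y0 x1 y1".
BLOCK_BBOX = re.compile(
    r"class=['\"]ocr_carea['\"][^>]*title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)"
)


def map_page(page_id, html):
    """Map: emit (page identifier, block width) for every block on the page."""
    for match in BLOCK_BBOX.finditer(html):
        x0, _, x1, _ = (int(value) for value in match.groups())
        yield page_id, x1 - x0


def reduce_page(page_id, widths):
    """Reduce: average all block widths of one page."""
    return page_id, round(sum(widths) / len(widths))


def average_block_width(pages):
    """Run map, group by page identifier, and reduce for a dict of pages."""
    grouped = defaultdict(list)
    for page_id, html in pages.items():
        for key, width in map_page(page_id, html):
            grouped[key].append(width)
    return dict(reduce_page(key, widths) for key, widths in grouped.items())
```

For a page with four blocks of width 2100, 2200, 2300, and 2400, the reducer returns 2250, which matches the example values on the slide. Pages without any matching block produce no output row in this sketch.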
  23. Analytic Queries
      • HiveLoadExifData & HiveLoadHocrData load the two text files into Hive tables
      • CREATE TABLE jp2width (jid STRING, jwidth INT)
      • CREATE TABLE htmlwidth (hid STRING, hwidth INT)

      Table jp2width:
      jid                    jwidth
      Z119585409/00000001    2250
      Z119585409/00000002    2150
      Z119585409/00000003    2125
      Z119585409/00000004    2125
      Z119585409/00000005    2250

      Table htmlwidth:
      hid                    hwidth
      Z119585409/00000001    1870
      Z119585409/00000002    2100
      Z119585409/00000003    2015
      Z119585409/00000004    1350
      Z119585409/00000005    1700
  24. Analytic Queries
      • HiveSelect joins the tables jp2width (jid, jwidth) and htmlwidth (hid, hwidth)
      • select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

      Result:
      jid                    jwidth    hwidth
      Z119585409/00000001    2250      1870
      Z119585409/00000002    2150      2100
      Z119585409/00000003    2125      2015
      Z119585409/00000004    2125      1350
      Z119585409/00000005    2250      1700
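The join on this slide can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for Hive, so column types and execution differ from a real Hive deployment. The table names, column names, and rows are the ones shown on the slides, while `run_query` is a hypothetical helper.

```python
import sqlite3

# Rows as shown on the slides.
JP2_ROWS = [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
HTML_ROWS = [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]


def run_query():
    """Load both tables into an in-memory database and run the join."""
    con = sqlite3.connect(":memory:")
    # SQLite types replace Hive's STRING and INT here.
    con.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
    con.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
    con.executemany("INSERT INTO jp2width VALUES (?, ?)", JP2_ROWS)
    con.executemany("INSERT INTO htmlwidth VALUES (?, ?)", HTML_ROWS)
    rows = con.execute(
        "SELECT jid, jwidth, hwidth FROM jp2width "
        "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
    ).fetchall()
    con.close()
    return rows
```

`run_query()` returns one row per page with the image width next to the text block width, so pages where the two values diverge strongly stand out as candidates for closer inspection.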
  25. Lessons learnt
      • Emergence of new options for creating large-scale storage and processing infrastructures
        • HDFS as storage master or staging area?
        • Create a local cluster or rent a cloud infrastructure?
      • Apache Hadoop offers a stable core for building a large-scale processing platform that is ready to be used in production
      • Important to carefully select additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, Yarn, Ambari, etc.) that fit your needs
  26. Further information
      • Project website: www.scape-project.eu
      • GitHub repository: www.github.com/openplanets
      • Project Wiki: www.wiki.opf-labs.org/display/SP/Home
      SCAPE tools mentioned
      • SCAPE Platform
        • http://www.scape-project.eu/publication/an-architectural-overview-of-the-scape-preservation-platform
      • Jpylyzer: JPEG2000 validation
        • http://www.openplanetsfoundation.org/software/jpylyzer
      • Matchbox: image comparison
        • https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
      Thank you! Questions?
