SCAPE Presentation at the Elag2013 conference in Gent/Belgium

Presentation of the European project SCAPE (www.scape-project.eu) at the Elag2013 conference in Gent/Belgium. The presentation includes details about use cases and their implementation at the Austrian National Library.

1. Sven Schlarb, Austrian National Library
   Elag 2013, Gent, Belgium, May 29, 2013
   An open source infrastructure for preserving large collections of digital objects
   The SCAPE project at the Austrian National Library
2. Overview
   • SCAPE project overview
   • Application areas at the Austrian National Library
     • Web Archiving
     • Austrian Books Online
   • SCAPE at the Austrian National Library
     • Hardware set-up
     • Open source software architecture
   • Application scenarios
   • Lessons learnt
   This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
3. Motivation
   • Ability to process large and complex data sets in preservation scenarios
   • Increasing amount of data in data centers and memory institutions
   • Volume, velocity, and variety of data
   [Chart: data growth, 1970–2030; cf. Jisc (2012), Activity Data: Delivering benefits from the data deluge, available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx]
4. "Big Data" in a library context?
   • "Big data" is a buzzword, just a vague idea
   • There is no definitive GB, TB, PB (…) threshold where data becomes "big data"; it depends on the institutional context
   • Massive growth of data to be stored and processed
   • A situation where the usual solutions no longer meet new scalability requirements
     • Pushing the limits of conventional database solutions
     • Simple batch processing becomes tedious or even impossible
5. SCAPE Consortium
   • EU-funded FP7 project, led by the Austrian Institute of Technology
   • Consortium: 16 partners
     • National libraries
     • Data centers and memory institutions
     • Research institutes and universities
     • Commercial partners
   • Started in 2011, runs until mid/end 2014
6. SCAPE Project Overview
   Take-up:
   • Stakeholders and communities
   • Dissemination
   • Training activities
   • Sustainability
   Platform:
   • Automation
   • Workflows
   • Parallelization
   • Virtualization
7. MapReduce/Hadoop in a nutshell
   [Diagram: input data is split across parallel tasks (Task 1, Task 2, Task 3); each task's partial output is aggregated, and the aggregated results are combined into the final output data]
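The slide shows only the shape of the computation. As a concrete illustration (not taken from the deck), the following minimal Hadoop job counts how often each value occurs in its input, one value per line; it is the same map-then-aggregate pattern used for MIME type counting in Scenario 1 below. The class and job names are invented for the example; the API calls are the standard org.apache.hadoop.mapreduce ones.

```java
// Minimal MapReduce sketch: counts occurrences of each input line
// (e.g. a MIME type per line). Illustrative only, not SCAPE project code.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ValueCount {

  // Map phase: emit (value, 1) for every input line.
  public static class ValueMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(line.toString().trim()), ONE);
    }
  }

  // Reduce phase: sum all counts emitted for the same value.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text value, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(value, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "value count");
    job.setJarByClass(ValueCount.class);
    job.setMapperClass(ValueMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop schedules many instances of the map function in parallel across the cluster, moving the computation to the data stored in HDFS; the reduce phase then folds all values for a key into one aggregated result, exactly the split/aggregate flow pictured on the slide.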
8. Experimental Cluster
   • Controller node (Job Tracker, Name Node):
     • CPU: 2 x 2.40GHz quad-core CPUs (16 HyperThreading cores)
     • RAM: 24GB
     • Disk: 3 x 1TB disks configured as RAID5 (redundancy) – 2 TB effective
   • Worker nodes (Task Trackers, Data Nodes):
     • CPU: 1 x 2.53GHz quad-core CPU (8 HyperThreading cores)
     • RAM: 16GB
     • Disk: 2 x 1TB disks configured as RAID0 (performance) – 2 TB effective
     • Of the 8 HyperThreading cores per worker node: 5 for Map, 2 for Reduce, 1 for the operating system; across the cluster this gives 25 processing cores for Map tasks and 10 cores for Reduce tasks
9. Platform Architecture
   • Access via REST API
   • Taverna workflow engine for complex jobs
   • Hive as the front end for analytic queries
   • MapReduce/Pig for Extraction, Transformation, and Load (ETL)
   • "Small" objects in HDFS or HBase
   • "Large" digital objects stored on a NetApp filer
10. Application scenarios
    • Web Archiving
      • Scenario 1: Web archive MIME type identification
    • Austrian Books Online
      • Scenario 2: Image file format migration
      • Scenario 3: Comparison of book derivatives
      • Scenario 4: MapReduce in digitised book quality assurance
11. Key Data: Web Archiving
    • Physical storage: 19 TB
    • Raw data: 32 TB
    • Number of objects: 1,241,650,566
    • Domain harvesting: the entire top-level domain .at every 2 years
    • Selective harvesting: important websites that change regularly
    • Event harvesting: special occasions and events (e.g. elections)
12. Scenario 1: Web Archive MIME Type Identification
    [Diagram: a (W)ARC container holding JPG, GIF, HTM, and MID records is read by a (W)ARC InputFormat and (W)ARC RecordReader based on the Heritrix web crawler's (W)ARC read/write code; a MapReduce job applies Apache Tika to detect each record's MIME type and aggregates the counts, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1]
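A hedged sketch of the detection step just described: Apache Tika's facade class guesses the MIME type from a record's raw bytes (magic-byte detection), and the mapper emits (mimeType, 1) pairs for a counting reducer like the one in the sketch after slide 7. The key/value types are assumptions standing in for whatever the Heritrix-based (W)ARC record reader actually delivers.

```java
// Sketch of a Tika MIME detection mapper. The (W)ARC record reader from the
// slide is represented by a generic (url, payload bytes) input pair.
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;

public class TikaDetectMapper
    extends Mapper<Text, BytesWritable, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Tika tika = new Tika(); // one instance per map task

  @Override
  protected void map(Text url, BytesWritable payload, Context context)
      throws IOException, InterruptedException {
    // Detect the MIME type from the record content, not the file extension.
    String mimeType = tika.detect(payload.copyBytes());
    context.write(new Text(mimeType), ONE);
  }
}
```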
13. Scenario 1: Web Archive MIME Type Identification
    [Chart: identification results with TIKA 1.0 and DROID 6.01]
14. Key Data: Austrian Books Online
    • Public-private partnership with Google
    • Public domain material only
    • Objective: scan ~600,000 volumes (~200 million pages)
    • ~70 project team members, 20+ in the core team
    • ~130,000 physical volumes (~40 million pages) scanned so far
15. ADOCO (Austrian Books Online Download & Control)
    [Diagram: Google (public-private partnership) → ADOCO → local storage; PairTree: https://confluence.ucop.edu/display/Curation/PairTree]
16. Scenario 2: Image File Format Migration
    • Task: image file format migration
      • TIFF to JPEG2000 migration (objective: reduce storage costs by reducing the size of the images)
      • JPEG2000 to TIFF migration (objective: mitigate the JPEG2000 file format obsolescence risk)
    • Challenges:
      • Integrating validation, migration, and quality assurance
      • Computationally intensive quality assurance
17. Scenario 3: Comparison of Book Derivatives
    • Task: compare different versions of the same book
      • Images have been manipulated (cropped, rotated) and stored in different locations
      • Images come from different scanning sources or were subject to different modification procedures
    • Challenges:
      • Computationally intensive (average runtime per book on a single quad-core server: ~4.5 hours)
      • 130,000 books, ~320 pages each
    • SCAPE tool: Matchbox
18. Scenario 4: MapReduce in Quality Assurance
    • ETL processing of 60,000 books, ~24 million pages
    • Using Taverna's "Tool service" (remote ssh execution)
    • Orchestration of different types of Hadoop jobs:
      • Hadoop Streaming API
      • Hadoop MapReduce
      • Hive
    • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
    • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
19. Scenario 4: MapReduce in Quality Assurance
    • Create input text files containing file paths (JP2 & HTML)
    • Read image metadata using ExifTool (Hadoop Streaming API)
    • Create a sequence file containing all HTML files
    • Calculate average block width using MapReduce
    • Load data into Hive tables
    • Execute SQL test query
20. Reading image metadata
    • Jp2PathCreator: a find run over the NAS collects the JP2 paths into a text file (~1.4 GB), e.g. /NAS/Z119585409/00000001.jp2
    • HadoopStreamingExiftoolRead: a Hadoop Streaming job runs ExifTool on each path and emits one "bookId/page width" line per image (~1.2 GB), e.g. Z119585409/00000001 2345
    • Runtime for 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h
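With the Streaming API, the mapper is any executable that reads lines on stdin and writes tab-separated key/value lines to stdout. The deck does not show the mapper itself, so the following is a hedged sketch of what it can look like: it shells out to ExifTool for every path, assuming exiftool is on the PATH of every worker node and that paths follow the /NAS/<bookId>/<page>.jp2 layout shown on the slide.

```java
// Hedged sketch of a Hadoop Streaming mapper: reads one JP2 path per line
// from stdin, asks ExifTool for the image width, and writes
// "<bookId>/<page>\t<width>" to stdout. Not the project's actual mapper.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExiftoolWidthMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
    String path;
    while ((path = stdin.readLine()) != null) {
      path = path.trim();
      if (path.isEmpty()) continue;
      // -ImageWidth restricts output to the width tag; -s3 prints the bare value.
      Process p = new ProcessBuilder("exiftool", "-ImageWidth", "-s3", path).start();
      try (BufferedReader out =
               new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        String width = out.readLine();
        p.waitFor();
        if (width == null) continue; // unreadable image: skip it
        // Derive the "bookId/page" key from /NAS/<bookId>/<page>.jp2
        String[] parts = path.split("/");
        String key = parts[parts.length - 2] + "/"
            + parts[parts.length - 1].replaceAll("\\.jp2$", "");
        System.out.println(key + "\t" + width.trim());
      }
    }
  }
}
```

A mapper like this would be submitted with the stock hadoop-streaming jar (-input, -output, -mapper); since the step only extracts metadata, the job can run map-only, with no reducer.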
21. SequenceFile creation
    • HtmlPathCreator: a find run over the NAS collects the HTML (hOCR) paths into a text file (~1.4 GB), e.g. /NAS/Z119585409/00000707.html
    • SequenceFileCreator: packs all HTML files into a Hadoop SequenceFile keyed by "bookId/page", e.g. Z119585409/00000707 (~997 GB uncompressed)
    • Runtime for 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h
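HDFS and MapReduce handle a few large files far better than 24 million small ones, which is the point of this packing step. Below is a minimal sketch of such a packer using the standard Hadoop SequenceFile API; the class name and key derivation are illustrative, not the project's SequenceFileCreator.

```java
// Hedged sketch: pack many small HTML files into one Hadoop SequenceFile,
// keyed by "bookId/page". args[0] = text file of paths, args[1] = output file.
import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
             SequenceFile.Writer.file(new Path(args[1])),
             SequenceFile.Writer.keyClass(Text.class),
             SequenceFile.Writer.valueClass(BytesWritable.class));
         BufferedReader paths = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = paths.readLine()) != null) {
        String p = line.trim();
        if (p.isEmpty()) continue;
        // Key "Z119585409/00000707" derived from /NAS/Z119585409/00000707.html
        String[] parts = p.split("/");
        String key = parts[parts.length - 2] + "/"
            + parts[parts.length - 1].replaceAll("\\.html$", "");
        byte[] content = Files.readAllBytes(Paths.get(p));
        writer.append(new Text(key), new BytesWritable(content));
      }
    }
  }
}
```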
22. Calculate average block width using MapReduce (HadoopAvBlockWidthMapReduce)
    • Input: the SequenceFile of hOCR pages, keyed by page ID, e.g. Z119585409/00000001
    • Map: emits one width value per text block on a page, e.g. Z119585409/00000001 → 2100, 2200, 2300, 2400
    • Reduce: averages the block widths per page and writes a text file, e.g. Z119585409/00000001 → 2250
    • Runtime for 60,000 books (24 million pages): ~6 h
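The reduce side of this job is a plain arithmetic mean over all block widths emitted for a page. A hedged sketch follows; the class name is assumed, and the input/output pairs match the example above.

```java
// Hedged sketch of the averaging reducer: receives one (pageId, blockWidth)
// pair per text block and emits the mean block width per page.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgBlockWidthReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    int count = 0;
    for (IntWritable w : widths) {
      sum += w.get();
      count++;
    }
    if (count > 0) {
      // e.g. 2100, 2200, 2300, 2400 -> 2250, as in the slide's example
      context.write(pageId, new IntWritable((int) (sum / count)));
    }
  }
}
```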
23. Analytic queries: HiveLoadExifData & HiveLoadHocrData
    • The ExifTool image widths are loaded into a jp2width table, the average hOCR block widths into an htmlwidth table:
      CREATE TABLE jp2width(jid STRING, jwidth INT)
      CREATE TABLE htmlwidth(hid STRING, hwidth INT)
    • Sample rows: jp2width: Z119585409/00000001 → 2250; htmlwidth: Z119585409/00000001 → 1870
24. Analytic queries: HiveSelect
    select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
    • Joins the JP2 image width with the average hOCR block width per page, e.g. Z119585409/00000001 → jwidth 2250, hwidth 1870
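For completeness, a hedged sketch of issuing that join from Java over Hive's JDBC interface. Only the query comes from the slide; the HiveServer2 endpoint, URL, and empty credentials are assumptions that depend on the installation.

```java
// Hedged sketch: run the width-comparison join through Hive's JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWidthCheck {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select jid, jwidth, hwidth from jp2width "
                 + "inner join htmlwidth on jid = hid")) {
      while (rs.next()) {
        // Columns: page ID, JP2 image width, average hOCR block width.
        System.out.printf("%s\t%d\t%d%n",
            rs.getString(1), rs.getInt(2), rs.getInt(3));
      }
    }
  }
}
```

In the workflow itself this query runs as a Hive step orchestrated by Taverna; comparing the two widths per page is the "SQL test query" of the quality assurance scenario.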
25. Lessons learnt
    • New options are emerging for building large-scale storage and processing infrastructures
    • HDFS as storage master or as a staging area?
    • Build a local cluster or rent a cloud infrastructure?
    • Apache Hadoop offers a stable core for building a large-scale processing platform that is ready to be used in production
    • Carefully select the additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, YARN, Ambari, etc.) that fit your needs
26. Further information
    • Project website: www.scape-project.eu
    • GitHub repository: www.github.com/openplanets
    • Project wiki: www.wiki.opf-labs.org/display/SP/Home
    SCAPE tools mentioned:
    • SCAPE Platform: http://www.scape-project.eu/publication/an-architectural-overview-of-the-scape-preservation-platform
    • Jpylyzer (JPEG2000 validation): http://www.openplanetsfoundation.org/software/jpylyzer
    • Matchbox (image comparison): https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
    Thank you! Questions?
