SlideShare a Scribd company logo
William Palmer 
Some slides courtesy of Per Møldrup-Dalum (State and University Library, Denmark) and Sven Schlarb (Austrian National Library) 
SCAPE Information Day 
British Library, UK, 14th July 2014 
Large Scale Processing with Hadoop
2 
Large Scale Processing Methodologies 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
•Traditional 
•One central large processor capability 
•One+ central storage instance 
•Data stored away from processor 
•Paradigm: “Move the data to the processor” 
•Hadoop 
•Many smaller commodity computers/CPUs 
•Storage capacity in all computers, federated together 
•Easily expandable 
•Paradigm: “Move the processor to the data”
•The New York Times + Hadoop on Amazon Web Services 
•11 million articles (1851-1980) that need to be converted to PDF 
•4TB TIFF data 
•24 hours wall time to complete the migration 
•Cost: $240 (not including bandwidth) 
•http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super- computing-fun/ 
•http://cse.unl.edu/~byrav/INFOCOM2011/workshops/papers/p1099-xiao.pdf 
3 
Example 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
4 
Hadoop Ecosystem: The Zoo 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
HDFS – data locality MapReduce 
•••
5 
MapReduce 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
MAP 
REDUCE
6 
MapReduce in detail 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
Map 
Reduce 
Sort 
Shuffle 
Merge 
Input 
Input Split 
Record 
Record 
Record 
… 
Input Split 
Record 
Record 
Record 
… 
Input Split 
Record 
Record 
Record 
… 
Map Output 
Map Output 
Reducer Output 
…
7 
Hadoop In Action 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
•Designed for processing text 
•Capacity can be reduced/expanded 
•Comes with HDFS filesystem, with federation and redundancy (three copies of data by default) 
•Using commodity hardware node failures are expected 
•A node being down should not affect the cluster 
•Data locality is considered when distributing computation, processing data where it is stored, reducing the need to transfer it 
•Very large community and ecosystem
8 
(Obligatory) Hadoop Screenshots 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
14/02/13 11:22:33 INFO gzchecker.GZChecker: Loading paths... 
14/02/13 11:22:36 INFO gzchecker.GZChecker: Setting paths... 
14/02/13 11:22:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. … 
14/02/13 11:22:39 INFO mapred.FileInputFormat: Total input paths to process : 1 
14/02/13 11:22:40 INFO mapred.JobClient: Running job: job_201401131502_0058 
14/02/13 11:22:41 INFO mapred.JobClient: map 0% reduce 0% 
…
9 
Hadoop In Action 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
•We are using Hadoop/MapReduce for parallelisation 
•Non standard use case 
•As a parallelisation method costs are associated … 
•… but get a lot of well supported features for free 
•HDFS 
•Administration 
•Support 
•Once a MapReduce program is developed scalability just happens 
•Can theoretically prototype on a Raspberry Pi and run on a 3000 node super cluster
10 
Hadoop In Action 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
•Do I have to copy data to HDFS for processing? 
•1TB of data took 8 hours to copy from NAS to HDFS 
•Image format migration (TIFF-JP2) took ~57hours 
•… still got to get the data back to the NAS 
•What if I don’t? 
•Same image format migration code accessing/posting data directly from/to Repository took ~58hours 
•No copying data before/after 
•More efficient as processing time is greater per file 
•Won’t necessarily hold for different preservation actions (see: “small files problem”)
11 
Hadoop at The British Library 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
•Two Hadoop clusters: 
•Digital Preservation Team Cluster 
•Virtualised hardware 
•1 management node, 1 master node 
•28 worker nodes (1 core/1 CPU, 6GB RAM each) 
•14TB raw storage, 5TB useable @ replication of 3 
•Cloudera Hadoop (CDH4) 
•For testing/R&D 
•Web Archiving Team Cluster 
•Physical hardware 
•80 nodes (8 cores/2CPUs, 16GB RAM) 
•700TB raw storage, 233TB useable @ replication of 3 
•Cloudera Hadoop (CDH3) 
•In production use
•TIFF->JP2 migration with QA 
•Single node @ 26 files/hour (with OpenJPEG) 
•28 nodes @ 735 files/hour (with OpenJPEG) 
•2409 files/hour with Kakadu 
•Detecting DRM in PDF files 
•28 nodes @ 51869 files/hour 
•Identifying web content 
•5.3million files/hour 
12 
SCAPE Workflow Results 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
•SCAPE tools are treated as individual components and should be reusable on other large scale execution platforms (all tools described today are, at least) 
•British Library Digital Library System (DLS) has a bespoke workflow execution system where SCAPE tools have been integrated 
•Other platforms: GNU Parallel … 
•Tools can be integrated with your own systems 
13 
Other Large Scale Execution Platforms 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

More Related Content

What's hot

Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
DataWorks Summit
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Hortonworks
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
Nishant Gandhi
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
Federico Cargnelutti
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Mathieu Dumoulin
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
Hortonworks
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
The HDF-EOS Tools and Information Center
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
sunera pathan
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
Sandip Darwade
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku
 
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain ProjectCeph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
inside-BigData.com
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
cneudecker
 
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
Big Data Joe™ Rossi
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
The HDF-EOS Tools and Information Center
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
KrishnenduKrishh
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
The HDF-EOS Tools and Information Center
 
Data Are from Mars, Tools Are from Venus
Data Are from Mars, Tools Are from VenusData Are from Mars, Tools Are from Venus
Data Are from Mars, Tools Are from Venus
The HDF-EOS Tools and Information Center
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
Sharad Pandey
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
Senthil Kumar
 

What's hot (20)

Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain ProjectCeph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
Ceph on the Brain: Storage and Data-Movement Supporting the Human Brain Project
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
Hadoop: Past, Present and Future - v2.1 - SQLSaturday #340
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Data Are from Mars, Tools Are from Venus
Data Are from Mars, Tools Are from VenusData Are from Mars, Tools Are from Venus
Data Are from Mars, Tools Are from Venus
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 

Viewers also liked

Mild reminder
Mild reminderMild reminder
Introducing Big Data
Introducing Big DataIntroducing Big Data
Introducing Big Data
Pravin Kumar Singh, PMP, PSM
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
DataWorks Summit
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
Denis Shestakov
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Yahoo Developer Network
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
Mahdi Esmailoghli
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
Sonatype
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
Bernard Marr
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
Nasrin Hussain
 

Viewers also liked (11)

Mild reminder
Mild reminderMild reminder
Mild reminder
 
Introducing Big Data
Introducing Big DataIntroducing Big Data
Introducing Big Data
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
 
String matching algorithms
String matching algorithmsString matching algorithms
String matching algorithms
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to SCAPE Information Day at BL - Large Scale Processing with Hadoop

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
Sven Schlarb
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
SCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Project
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
StampedeCon
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Global Business Events
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
Sam Ng
 
Hadoop pycon2011uk
Hadoop pycon2011ukHadoop pycon2011uk
Hadoop pycon2011uk
Aditya Sakhuja
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
Webinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduceWebinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduce
inovex GmbH
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
AmirReza Mohammadi
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
Lokesh Ramaswamy
 
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Abdul Nasir
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Project
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
John Sing
 

Similar to SCAPE Information Day at BL - Large Scale Processing with Hadoop (20)

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Hadoop pycon2011uk
Hadoop pycon2011ukHadoop pycon2011uk
Hadoop pycon2011uk
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Webinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduceWebinar - Big Data: Einführung in Hadoop und MapReduce
Webinar - Big Data: Einführung in Hadoop und MapReduce
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 

More from SCAPE Project

C sz z6
C sz z6C sz z6
C sz z6
SCAPE Project
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Project
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
SCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
SCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
SCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
SCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
SCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
SCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
SCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
SCAPE Project
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPE
SCAPE Project
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
SCAPE Project
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
SCAPE Project
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation
SCAPE Project
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPE
SCAPE Project
 

More from SCAPE Project (20)

C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPE
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPE
 

Recently uploaded

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 

Recently uploaded (20)

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 

SCAPE Information Day at BL - Large Scale Processing with Hadoop

  • 1. William Palmer Some slides courtesy of Per Møldrup-Dalum (State and University Library, Denmark) and Sven Schlarb (Austrian National Library) SCAPE Information Day British Library, UK, 14th July 2014 Large Scale Processing with Hadoop
  • 2. 2 Large Scale Processing Methodologies This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). •Traditional •One central large processor capability •One+ central storage instance •Data stored away from processor •Paradigm: “Move the data to the processor” •Hadoop •Many smaller commodity computers/CPUs •Storage capacity in all computers, federated together •Easily expandable •Paradigm: “Move the processor to the data”
  • 3. •The New York Times + Hadoop on Amazon Web Services •11 million articles (1851-1980) that need to be converted to PDF •4TB TIFF data •24 hours wall time to complete the migration •Cost: $240 (not including bandwidth) •http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super- computing-fun/ •http://cse.unl.edu/~byrav/INFOCOM2011/workshops/papers/p1099-xiao.pdf 3 Example This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 4. 4 Hadoop Ecosystem: The Zoo This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). HDFS – data locality MapReduce •••
  • 5. 5 MapReduce This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). MAP REDUCE
  • 6. 6 MapReduce in detail This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). Map Reduce Sort Shuffle Merge Input Input Split Record Record Record … Input Split Record Record Record … Input Split Record Record Record … Map Output Map Output Reducer Output …
  • 7. 7 Hadoop In Action This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). •Designed for processing text •Capacity can be reduced/expanded •Comes with HDFS filesystem, with federation and redundancy (three copies of data by default) •Using commodity hardware node failures are expected •A node being down should not affect the cluster •Data locality is considered when distributing computation, processing data where it is stored, reducing the need to transfer it •Very large community and ecosystem
  • 8. 8 (Obligatory) Hadoop Screenshots This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 14/02/13 11:22:33 INFO gzchecker.GZChecker: Loading paths... 14/02/13 11:22:36 INFO gzchecker.GZChecker: Setting paths... 14/02/13 11:22:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. … 14/02/13 11:22:39 INFO mapred.FileInputFormat: Total input paths to process : 1 14/02/13 11:22:40 INFO mapred.JobClient: Running job: job_201401131502_0058 14/02/13 11:22:41 INFO mapred.JobClient: map 0% reduce 0% …
  • 9. 9 Hadoop In Action This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). •We are using Hadoop/MapReduce for parallelisation •Non standard use case •As a parallelisation method costs are associated … •… but get a lot of well supported features for free •HDFS •Administration •Support •Once a MapReduce program is developed scalability just happens •Can theoretically prototype on a Raspberry Pi and run on a 3000 node super cluster
  • 10. 10 Hadoop In Action This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). •Do I have to copy data to HDFS for processing? •1TB of data took 8 hours to copy from NAS to HDFS •Image format migration (TIFF-JP2) took ~57hours •… still got to get the data back to the NAS •What if I don’t? •Same image format migration code accessing/posting data directly from/to Repository took ~58hours •No copying data before/after •More efficient as processing time is greater per file •Won’t necessarily hold for different preservation actions (see: “small files problem”)
  • 11. 11 Hadoop at The British Library This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). •Two Hadoop clusters: •Digital Preservation Team Cluster •Virtualised hardware •1 management node, 1 master node •28 worker nodes (1 core/1 CPU, 6GB RAM each) •14TB raw storage, 5TB useable @ replication of 3 •Cloudera Hadoop (CDH4) •For testing/R&D •Web Archiving Team Cluster •Physical hardware •80 nodes (8 cores/2CPUs, 16GB RAM) •700TB raw storage, 233TB useable @ replication of 3 •Cloudera Hadoop (CDH3) •In production use
  • 12. •TIFF->JP2 migration with QA •Single node @ 26 files/hour (with OpenJPEG) •28 nodes @ 735 files/hour (with OpenJPEG) •2409 files/hour with Kakadu •Detecting DRM in PDF files •28 nodes @ 51869 files/hour •Identifying web content •5.3million files/hour 12 SCAPE Workflow Results This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 13. •SCAPE tools are treated as individual components and should be reusable on other large scale execution platforms (all tools described today are, at least) •British Library Digital Library System (DLS) has a bespoke workflow execution system where SCAPE tools have been integrated •Other platforms: GNU Parallel … •Tools can be integrated with your own systems 13 Other Large Scale Execution Platforms This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).