SlideShare a Scribd company logo
1 of 22
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
myHadoop - Hadoop-on-Demand
on Traditional HPC Resources
Sriram Krishnan, Ph.D.
sriram@sdsc.edu
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Acknowledgements
• MahidharTatineni
• ChaitanyaBaru
• Jim Hayes
• ShavaSmallen
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Outline
• Motivations
• Technical Challenges
• Implementation Details
• Performance Evaluation
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Motivations
• An open source tool for running Hadoop jobs on
HPC resources
• Easy to configure and use for the end-user
• Play nice with existing batch systems on HPC resources
• Why do we need such a tool?
• End-users: I already have Hadoop code – and I only have
access to regular HPC-style resources
• Computer Scientists: I want to study the implications of
using Hadoop on HPC resources
• And I don’t have root access to these resources
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Some Ground Rules
• What this presentation is:
• A “how-to” for running Hadoop jobs on HPC resources
using myHadoop
• A description of the performance implications of using
myHadoop
• What this presentation is not:
• A propaganda for the use of Hadoop on HPC resources
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Main Challenges
• Shared-nothing (Hadoop) versus HPC-style
architectures
• In terms of philosophies and implementation
• Control and co-existence of Hadoop and HPC
batch systems
• Typically both Hadoop and HPC batch systems
(viz., SGE, PBS) need completely control over the
resources for scheduling purposes
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Traditional HPC Architecture
PARALLEL FILE SYSTEM COMPUTE CLUSTER WITH
MINIMAL LOCAL
STORAGE
Shared-nothing (MapReduce-style) Architectures
COMPUTE/DATA CLUSTER
WITH LOCAL STOARGE
ETHERNET
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Hadoop and HPC Batch Systems
• Access to HPC resources is typically via batch systems
– viz. PBS, SGE, Condor, etc
• These systems have complete control over the compute resources
• Users typically can’t log in directly to the compute nodes (via ssh) to
start various daemons
• Hadoop manages its resources using its own set of
daemons
• NameNode&DataNodefor Hadoop Distributed File System (HDFS)
• JobTracker&TaskTrackerfor MapReduce jobs
• Hadoop daemons and batch systems can’t co-exist
seamlessly
• Will interfere with each other’s scheduling algorithms
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
myHadoop Requirements
1. Enabling execution of Hadoop jobs on shared HPC
resources via traditional batch systems
a) Working with a variety of batch systems (PBS, SGE, etc)
2. Allowing users to run Hadoop jobs without needing
root-level access
3. Enabling multiple users to simultaneously execute
Hadoop jobs on the shared resource
4. Allowing users to either run a fresh Hadoop instance
each time (a), or store HDFS state for future runs (b)
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
COMPUTE NODES
PERSISTENT MODE NON-PERSISTENT MODE
BATCH PROCESSING SYSTEM (PBS, SGE)
PARALLEL FILE SYSTEM
HADOOP DAEMONS
myHadoop Architecture
[2, 3]
[1]
[4(a)][4(b)]
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Implementation Details: PBS, SGE
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
User Workflow
BOOTSTRAP
TEARDOWN
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Performance Evaluation
• Goals and non-goals
• Study the performance overhead and implication of myHadoop
• Not to optimize/improve existing Hadoop code
• Software and Hardware
• Triton Compute Cluster (http://tritonresource.sdsc.edu/)
• Triton Data Oasis (Lustre-based parallel file system) for data storage, and for
HDFS in “persistent mode”
• Apache Hadoop version 0.20.2
• Various parameters tuned for performance on Triton
• Applications
• Compute-intensive: HadoopBlast (Indiana University)
• Modest-sized inputs – 128 query sequences (70K each)
• Compared against NR database – 200MB in size
• Data-intensive: Data Selections (OpenTopography Facility at SDSC)
• Input size from 1GB to 100GB
• Sub-selecting around 10% of the entire dataset
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
HadoopBlast
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Data Selections
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Related Work
• Recipe for running Hadoop over PBS in blogosphere
• http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-
pbs.html
• myHadoop is “inspired” by their approach – but is more general-
purpose and configurable
• Apache Hadoop On Demand (HOD)
• http://hadoop.apache.org/common/docs/r0.17.0/hod.html
• Only PBS support, needs external HDFS, harder to use, and has
trouble with multiple concurrent Hadoop instances
• CloudBatch – batch queuing system on clouds
• Use of Hadoop to run batch systems like PBS
• Exact opposite of our goals – but similar approach
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Center for Large-Scale Data Systems Research (CLDS)
CLDS
Industry Advisory
Board
Academic Advisory
Board
Benchmarking,Perfor
mance Evaluation
and Systems
Development
Projects
Industry
Forums and
Professional
Education
Industry-University Consortium on Software for Large-scale Data
Systems
How Much
Information?
Project
Public
Private
Personal
Visiting Fellows
Information Metrology
Data Growth, Information Mgt
Cloud Storage
Architecture
Cloud Storage and
Performance Benchmarking
Industry Interchange
Mgt, Technical Forums
• Student internships
• Joint collaborations
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Summary
• myHadoop – an open source tool for running Hadoop
jobs on HPC resources
• Without need for root-level access
• Co-exists with traditional batch systems
• Allows “persistent” and “non-persistent” modes to save HDFS state
across runs
• Tested on SDSC Triton, TeraGrid and UC Grid resources
• More information
• Software: https://sourceforge.net/projects/myhadoop/
• SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-
2011-2-Hadoop.pdf
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Questions?
• Email me at sriram@sdsc.edu
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Appendix
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
io.file.buffer.size 131072 Size of read/write buffer
fs.inmemory.size.mb 650 Size of in-memory FS for merging outputs
io.sort.mb 650 Memory limit for sorting data
core-site.xml:
dfs.replication 2 Number of times data is replicated
dfs.block.size 134217728 HDFS block size in bytes
dfs.datanode.handler.count 64 Number of handlers to serve block requests
hdfs-site.xml:
mapred.reduce.parallel.copies 4 Number of parallel copies run by
reducers
mapred.tasktracker.map.tasks.maximum 4 Max map tasks to run simultaneously
mapred.tasktracker.reduce.tasks.maximum 2 Max reduce tasks to run simultaneously
mapred.job.reuse.jvm.num.tasks 1 Reuse the JVM between tasks
mapred.child.java.opts -Xmx1024m Large heap size for child JVMs
hdfs-site.xml:
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Data SelectCounts on Dash

More Related Content

What's hot

Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...Databricks
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
Never late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNNever late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNDataWorks Summit
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchDataWorks Summit/Hadoop Summit
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesDataWorks Summit
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesDataWorks Summit
 

What's hot (20)

Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Real Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with SparkReal Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with Spark
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
ebay
ebayebay
ebay
 
Never late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNNever late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARN
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 

Similar to myHadoop - Hadoop-on-Demand on Traditional HPC Resources

Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...inside-BigData.com
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesWorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesIlkay Altintas, Ph.D.
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Big data analytics_using_hadoop
Big data analytics_using_hadoopBig data analytics_using_hadoop
Big data analytics_using_hadoopKnowledgehut
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundNidhiAhuja30
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 

Similar to myHadoop - Hadoop-on-Demand on Traditional HPC Resources (20)

Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
C cerin piv2017_c
C cerin piv2017_cC cerin piv2017_c
C cerin piv2017_c
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB...
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Big Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in ChandigarhBig Data and Hadoop Training in Chandigarh
Big Data and Hadoop Training in Chandigarh
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesWorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS of Data Science in the Presence of Heterogenous Computing Architectures
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Big data analytics_using_hadoop
Big data analytics_using_hadoopBig data analytics_using_hadoop
Big data analytics_using_hadoop
 
Hadoop
Hadoop Hadoop
Hadoop
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

myHadoop - Hadoop-on-Demand on Traditional HPC Resources

  • 1. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO myHadoop - Hadoop-on-Demand on Traditional HPC Resources Sriram Krishnan, Ph.D. sriram@sdsc.edu
  • 2. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Acknowledgements • MahidharTatineni • ChaitanyaBaru • Jim Hayes • ShavaSmallen
  • 3. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Outline • Motivations • Technical Challenges • Implementation Details • Performance Evaluation
  • 4. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Motivations • An open source tool for running Hadoop jobs on HPC resources • Easy to configure and use for the end-user • Play nice with existing batch systems on HPC resources • Why do we need such a tool? • End-users: I already have Hadoop code – and I only have access to regular HPC-style resources • Computer Scientists: I want to study the implications of using Hadoop on HPC resources • And I don’t have root access to these resources
  • 5. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Some Ground Rules • What this presentation is: • A “how-to” for running Hadoop jobs on HPC resources using myHadoop • A description of the performance implications of using myHadoop • What this presentation is not: • A propaganda for the use of Hadoop on HPC resources
  • 6. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Main Challenges • Shared-nothing (Hadoop) versus HPC-style architectures • In terms of philosophies and implementation • Control and co-existence of Hadoop and HPC batch systems • Typically both Hadoop and HPC batch systems (viz., SGE, PBS) need completely control over the resources for scheduling purposes
  • 7. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Traditional HPC Architecture PARALLEL FILE SYSTEM COMPUTE CLUSTER WITH MINIMAL LOCAL STORAGE Shared-nothing (MapReduce-style) Architectures COMPUTE/DATA CLUSTER WITH LOCAL STOARGE ETHERNET
  • 8. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Hadoop and HPC Batch Systems • Access to HPC resources is typically via batch systems – viz. PBS, SGE, Condor, etc • These systems have complete control over the compute resources • Users typically can’t log in directly to the compute nodes (via ssh) to start various daemons • Hadoop manages its resources using its own set of daemons • NameNode&DataNodefor Hadoop Distributed File System (HDFS) • JobTracker&TaskTrackerfor MapReduce jobs • Hadoop daemons and batch systems can’t co-exist seamlessly • Will interfere with each other’s scheduling algorithms
  • 9. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO myHadoop Requirements 1. Enabling execution of Hadoop jobs on shared HPC resources via traditional batch systems a) Working with a variety of batch systems (PBS, SGE, etc) 2. Allowing users to run Hadoop jobs without needing root-level access 3. Enabling multiple users to simultaneously execute Hadoop jobs on the shared resource 4. Allowing users to either run a fresh Hadoop instance each time (a), or store HDFS state for future runs (b)
  • 10. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO COMPUTE NODES PERSISTENT MODE NON-PERSISTENT MODE BATCH PROCESSING SYSTEM (PBS, SGE) PARALLEL FILE SYSTEM HADOOP DAEMONS myHadoop Architecture [2, 3] [1] [4(a)][4(b)]
  • 11. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Implementation Details: PBS, SGE
  • 12. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO User Workflow BOOTSTRAP TEARDOWN
  • 13. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Performance Evaluation • Goals and non-goals • Study the performance overhead and implication of myHadoop • Not to optimize/improve existing Hadoop code • Software and Hardware • Triton Compute Cluster (http://tritonresource.sdsc.edu/) • Triton Data Oasis (Lustre-based parallel file system) for data storage, and for HDFS in “persistent mode” • Apache Hadoop version 0.20.2 • Various parameters tuned for performance on Triton • Applications • Compute-intensive: HadoopBlast (Indiana University) • Modest-sized inputs – 128 query sequences (70K each) • Compared against NR database – 200MB in size • Data-intensive: Data Selections (OpenTopography Facility at SDSC) • Input size from 1GB to 100GB • Sub-selecting around 10% of the entire dataset
  • 14. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO HadoopBlast
  • 15. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Data Selections
  • 16. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Related Work • Recipe for running Hadoop over PBS in blogosphere • http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using- pbs.html • myHadoop is “inspired” by their approach – but is more general- purpose and configurable • Apache Hadoop On Demand (HOD) • http://hadoop.apache.org/common/docs/r0.17.0/hod.html • Only PBS support, needs external HDFS, harder to use, and has trouble with multiple concurrent Hadoop instances • CloudBatch – batch queuing system on clouds • Use of Hadoop to run batch systems like PBS • Exact opposite of our goals – but similar approach
  • 17. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Center for Large-Scale Data Systems Research (CLDS) CLDS Industry Advisory Board Academic Advisory Board Benchmarking,Perfor mance Evaluation and Systems Development Projects Industry Forums and Professional Education Industry-University Consortium on Software for Large-scale Data Systems How Much Information? Project Public Private Personal Visiting Fellows Information Metrology Data Growth, Information Mgt Cloud Storage Architecture Cloud Storage and Performance Benchmarking Industry Interchange Mgt, Technical Forums • Student internships • Joint collaborations
  • 18. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Summary • myHadoop – an open source tool for running Hadoop jobs on HPC resources • Without need for root-level access • Co-exists with traditional batch systems • Allows “persistent” and “non-persistent” modes to save HDFS state across runs • Tested on SDSC Triton, TeraGrid and UC Grid resources • More information • Software: https://sourceforge.net/projects/myhadoop/ • SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR- 2011-2-Hadoop.pdf
  • 19. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Questions? • Email me at sriram@sdsc.edu
  • 20. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Appendix
  • 21. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO io.file.buffer.size 131072 Size of read/write buffer fs.inmemory.size.mb 650 Size of in-memory FS for merging outputs io.sort.mb 650 Memory limit for sorting data core-site.xml: dfs.replication 2 Number of times data is replicated dfs.block.size 134217728 HDFS block size in bytes dfs.datanode.handler.count 64 Number of handlers to serve block requests hdfs-site.xml: mapred.reduce.parallel.copies 4 Number of parallel copies run by reducers mapred.tasktracker.map.tasks.maximum 4 Max map tasks to run simultaneously mapred.tasktracker.reduce.tasks.maximum 2 Max reduce tasks to run simultaneously mapred.job.reuse.jvm.num.tasks 1 Reuse the JVM between tasks mapred.child.java.opts -Xmx1024m Large heap size for child JVMs hdfs-site.xml:
  • 22. SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO Data SelectCounts on Dash

Editor's Notes

  1. Mahidhar, Chaitan: Architecture, prototypingJim: myHadoop rollShava: UC Grid support
  2. Motivations – why myHadoopTechnical challenges – what are the problemsImplementation Details – howPerformance evaluation - findings
  3. Note: we didn’t make up these requirements. Came out of our existing requirements as end-users and computer scientistsMost of us have access to resources such as the TeraGrid, UC Grid, SDSC TritonI have no official affiliation with any of those resources – but I had access to these resources, and wanted to use them for performance studies
  4. Co-location of data and computes in shared-nothing – no centralized shared storage in the Hadoop modelHigh performance parallel file systems for HPC resources
  5. Scheduler access2, 3) Non-root concurrent users4) Persistent and non-persistent modes
  6. Data loads not very dominant – more CPU-intensivePerformance slightly better on local disk – more contention on Oasis, and also that Lustre is optimized for large files, not lots of smaller ones
  7. Data loads are the dominating factor for the non-persistent runsMakes more sense to leave the data on shared file system – and use that at the HDFS locationOutput writes are also time consuming in this case – so might as well leave the data in HDFS for future runs