SlideShare a Scribd company logo
myHadoop 0.30 and Beyond!
Glenn K. Lockwood, Ph.D.!
User Services Group!
San Diego Supercomputer Center!

SAN DIEGO SUPERCOMPUTER CENTER
Hadoop and HPC!
PROBLEM: domain scientists aren't using Hadoop!
•  toy for computer scientists and salespeople!
•  Java is not high-performance!
•  don't want to learn computer science and Java!

!
SOLUTION: make Hadoop easier for HPC users!
•  use existing HPC clusters and software!
•  use Perl/Python/C/C++/Fortran instead of Java!
•  make starting Hadoop as easy as possible!

SAN DIEGO SUPERCOMPUTER CENTER
Compute: Traditional vs. Data-Intensive!
Traditional HPC!
•  CPU-bound problems!
•  Solution: OpenMPand MPI-based
parallelism!

Data-Intensive!
•  IO-bound problems!
•  Solution: Map/reducebased parallelism!
SAN DIEGO SUPERCOMPUTER CENTER
Architecture for Both Workloads!
• 
• 
• 
• 

PROs!
High-speed
interconnect!
Complementary
object storage!
Fast CPUs, RAM!
Less faulty!

CONs!
•  Nodes aren't storagerich!
•  Transferring data
between HDFS and
object storage*!

!
* unless using Lustre, S3, etc backends!
SAN DIEGO SUPERCOMPUTER CENTER
Add Data Analysis to Existing Compute
Infrastructure!

SAN DIEGO SUPERCOMPUTER CENTER
Add Data Analysis to Existing Compute
Infrastructure!

SAN DIEGO SUPERCOMPUTER CENTER
Add Data Analysis to Existing Compute
Infrastructure!

SAN DIEGO SUPERCOMPUTER CENTER
Add Data Analysis to Existing Compute
Infrastructure!

SAN DIEGO SUPERCOMPUTER CENTER
myHadoop – 3-step Install!
1. Download Apache Hadoop 1.x and myHadoop
0.30!
$ wget http://apache.cs.utah.edu/hadoop/common/
hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz!
$ wget glennklockwood.com/files/myhadoop-0.30.tar.gz!

2. Unpack both Hadoop and myHadoop!
$ tar zxvf hadoop-1.2.1-bin.tar.gz!
$ tar zxvf myhadoop-0.30.tar.gz!

3. Apply myHadoop patch to Hadoop!
$ cd hadoop-1.2.1/conf!
$ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch!

!
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop – 3-step Cluster!
1. Set a few environment variables!
# sets HADOOP_HOME, JAVA_HOME, and PATH!
$ module load hadoop!
$ export HADOOP_CONF_DIR=$HOME/mycluster.conf!
!

2. Run myhadoop-configure.sh to set up Hadoop!
$ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID!
!

3. Start cluster with Hadoop's start-all.sh!
$ start-all.sh!

SAN DIEGO SUPERCOMPUTER CENTER
Easy Wordcount in Python!
#!/usr/bin/env	
  python	
  
	
  
import	
  sys	
  
	
  
for	
  line	
  in	
  sys.stdin:	
  
	
  	
  line	
  =	
  line.strip()	
  
	
  	
  keys	
  =	
  line.split()	
  
	
  	
  for	
  key	
  in	
  keys	
  
	
  	
  	
  	
  value	
  =	
  1	
  
	
  	
  	
  	
  print(	
  '%st%d'	
  %	
  	
  
	
  	
  	
  	
  	
  	
  (key,	
  value)	
  )	
  

#!/usr/bin/env	
  python	
  
	
  
import	
  sys	
  
	
  
last_key	
  =	
  None	
  
running_tot	
  =	
  0	
  
	
  
for	
  input_line	
  in	
  sys.stdin:	
  
	
  	
  input_line	
  =	
  input_line.strip()	
  
	
  	
  this_key,	
  value	
  =	
  input_line.split("t",	
  1)	
  
	
  	
  value	
  =	
  int(value)	
  
	
  	
  if	
  last_key	
  ==	
  this_key:	
  
	
  	
  	
  	
  running_tot	
  +=	
  value	
  
	
  	
  else:	
  
	
  	
  	
  	
  if	
  last_key:	
  
	
  	
  	
  	
  	
  	
  print(	
  "%st%d"	
  %	
  (last_key,	
  running_tot)	
  )
	
  
	
  	
  	
  	
  	
  	
  running_tot	
  =	
  value	
  
	
  	
  	
  	
  	
  	
  last_key	
  =	
  this_key	
  
	
  	
  
if	
  last_key	
  ==	
  this_key:	
  
	
  	
  print(	
  "%st%d"	
  %	
  (last_key,	
  running_tot)	
  )	
  

https://github.com/glennklockwood/hpchadoop/tree/master/wordcount.py!
SAN DIEGO SUPERCOMPUTER CENTER
Easy Wordcount in Python!
$	
  hadoop	
  dfs	
  -­‐put	
  ./input.txt	
  mobydick.txt	
  
	
  
$	
  hadoop	
  jar	
  	
  
	
  	
  	
  	
  /opt/hadoop/contrib/streaming/hadoop-­‐streaming-­‐1.1.1.jar	
  	
  
	
  	
  	
  	
  -­‐mapper	
  "python	
  $PWD/mapper.py"	
  	
  
	
  	
  	
  	
  -­‐reducer	
  "python	
  $PWD/reducer.py"	
  	
  
	
  	
  	
  	
  -­‐input	
  mobydick.txt	
  	
  
	
  	
  	
  	
  -­‐output	
  output	
  
	
  
$	
  hadoop	
  dfs	
  -­‐cat	
  output/part-­‐*	
  >	
  ./output.txt	
  

SAN DIEGO SUPERCOMPUTER CENTER
Data-Intensive Performance Scaling!
8-node Hadoop cluster on Gordon
!
8 GB VCF (Variant Calling Format) file
!

9x speedup using
2 mappers/node
!

SAN DIEGO SUPERCOMPUTER CENTER
Data-Intensive Performance Scaling!
7.7 GB of chemical evolution data, 270 MB/s
processing rate
!

SAN DIEGO SUPERCOMPUTER CENTER
Advanced Features - Useability!
•  System-wide default configurations!
•  myhadoop-0.30/conf/myhadoop.conf!
•  MH_SCRATCH_DIR – specify location of node-local
storage for all users!
•  MH_IPOIB_TRANSFORM – specify regex to transform
node hostnames into IP over InfiniBand hostnames!

•  Users can remain totally ignorant of scratch
disks and InfiniBand!
•  Literally define HADOOP_CONF_DIR and run
myhadoop-configure.sh with no parameters –
myHadoop figures out everything else!
SAN DIEGO SUPERCOMPUTER CENTER
Advanced Features - Useability!
•  Parallel filesystem support!
•  HDFS on Lustre via myHadoop persistent mode (-p)!
•  Direct Lustre support (IDH)!
•  No performance loss at smaller scales for HDFS on Lustre!

•  Resource managers supported in unified
framework:!
• 
• 
• 
• 

Torque 2.x and 4.x – Tested on SDSC Gordon!
SLURM 2.6 – Tested on TACC Stampede!
Grid Engine!
Can support LSF, PBSpro, Condor easily (need testbeds)!
SAN DIEGO SUPERCOMPUTER CENTER
New Features!
•  Perl reimplementation for expandability!
•  Scalability improvements!
•  Separate namenode, secondary namenode, jobtrackers!
•  Automatic switchover based on cluster size best practices!

•  Hadoop 2.0 / YARN support (halfway there)!
•  Spark integration!

SAN DIEGO SUPERCOMPUTER CENTER

More Related Content

What's hot

Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
Farzad Nozarian
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
Hanborq Inc.
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
Great Wide Open
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
Etu Solution
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Jetlore
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
Big Data Interview Questions
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Sumeet Singh
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
techieguy85
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
Kalyan Hadoop
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
Emilio Coppa
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
DataWorks Summit/Hadoop Summit
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
Newvewm
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

What's hot (19)

Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 

Viewers also liked

Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
Glenn K. Lockwood
 
Large-scale Genomic Analysis Enabled by Gordon
Large-scale Genomic Analysis Enabled by GordonLarge-scale Genomic Analysis Enabled by Gordon
Large-scale Genomic Analysis Enabled by Gordon
Glenn K. Lockwood
 
Лекция 3. Распределённая файловая система HDFS
Лекция 3. Распределённая файловая система HDFSЛекция 3. Распределённая файловая система HDFS
Лекция 3. Распределённая файловая система HDFS
Technopark
 
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
Hakky St
 
Лекция 12. Spark
Лекция 12. SparkЛекция 12. Spark
Лекция 12. Spark
Technopark
 
スパース性に基づく機械学習 2章 データからの学習
スパース性に基づく機械学習 2章 データからの学習スパース性に基づく機械学習 2章 データからの学習
スパース性に基づく機械学習 2章 データからの学習
hagino 3000
 
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
narumikanno0918
 
スパースモデリング入門
スパースモデリング入門スパースモデリング入門
スパースモデリング入門
Hideo Terada
 
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
Yota Ishida
 
リクルートにおける画像解析事例紹介
リクルートにおける画像解析事例紹介リクルートにおける画像解析事例紹介
リクルートにおける画像解析事例紹介
Recruit Technologies
 
Deep Learning による視覚×言語融合の最前線
Deep Learning による視覚×言語融合の最前線Deep Learning による視覚×言語融合の最前線
Deep Learning による視覚×言語融合の最前線
Yoshitaka Ushiku
 

Viewers also liked (11)

Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Large-scale Genomic Analysis Enabled by Gordon
Large-scale Genomic Analysis Enabled by GordonLarge-scale Genomic Analysis Enabled by Gordon
Large-scale Genomic Analysis Enabled by Gordon
 
Лекция 3. Распределённая файловая система HDFS
Лекция 3. Распределённая файловая система HDFSЛекция 3. Распределённая файловая система HDFS
Лекция 3. Распределённая файловая система HDFS
 
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
スパース性に基づく機械学習(機械学習プロフェッショナルシリーズ) 1章
 
Лекция 12. Spark
Лекция 12. SparkЛекция 12. Spark
Лекция 12. Spark
 
スパース性に基づく機械学習 2章 データからの学習
スパース性に基づく機械学習 2章 データからの学習スパース性に基づく機械学習 2章 データからの学習
スパース性に基づく機械学習 2章 データからの学習
 
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
スパースモデリング、スパースコーディングとその数理(第11回WBA若手の会)
 
スパースモデリング入門
スパースモデリング入門スパースモデリング入門
スパースモデリング入門
 
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
ディープラーニングでおそ松さんの6つ子は見分けられるのか? FIT2016
 
リクルートにおける画像解析事例紹介
リクルートにおける画像解析事例紹介リクルートにおける画像解析事例紹介
リクルートにおける画像解析事例紹介
 
Deep Learning による視覚×言語融合の最前線
Deep Learning による視覚×言語融合の最前線Deep Learning による視覚×言語融合の最前線
Deep Learning による視覚×言語融合の最前線
 

Similar to myHadoop 0.30

Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without Java
Glenn K. Lockwood
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
Big Data Joe™ Rossi
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
attilacsordas
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
George Markomanolis
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
Dave Hiltbrand
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
Julien Anguenot
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Vijay Rayapati
 
php & performance
 php & performance php & performance
php & performance
simon8410
 
Deep learning - the conf br 2018
Deep learning - the conf br 2018Deep learning - the conf br 2018
Deep learning - the conf br 2018
Fabio Janiszevski
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
jtdudley
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Data Con LA
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outspardhavi reddy
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016
Ehsan Totoni
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDataWorks Summit
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-Cloud
Soam Acharya
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 

Similar to myHadoop 0.30 (20)

Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without Java
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
php & performance
 php & performance php & performance
php & performance
 
Deep learning - the conf br 2018
Deep learning - the conf br 2018Deep learning - the conf br 2018
Deep learning - the conf br 2018
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Hadoop tutorial hand-outs
Hadoop tutorial hand-outsHadoop tutorial hand-outs
Hadoop tutorial hand-outs
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Debugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-CloudDebugging Hive with Hadoop-in-the-Cloud
Debugging Hive with Hadoop-in-the-Cloud
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 

Recently uploaded (20)

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

myHadoop 0.30

  • 1. myHadoop 0.30 and Beyond! Glenn K. Lockwood, Ph.D.! User Services Group! San Diego Supercomputer Center! SAN DIEGO SUPERCOMPUTER CENTER
  • 2. Hadoop and HPC! PROBLEM: domain scientists aren't using Hadoop! •  toy for computer scientists and salespeople! •  Java is not high-performance! •  don't want to learn computer science and Java! ! SOLUTION: make Hadoop easier for HPC users! •  use existing HPC clusters and software! •  use Perl/Python/C/C++/Fortran instead of Java! •  make starting Hadoop as easy as possible! SAN DIEGO SUPERCOMPUTER CENTER
  • 3. Compute: Traditional vs. Data-Intensive! Traditional HPC! •  CPU-bound problems! •  Solution: OpenMPand MPI-based parallelism! Data-Intensive! •  IO-bound problems! •  Solution: Map/reducebased parallelism! SAN DIEGO SUPERCOMPUTER CENTER
  • 4. Architecture for Both Workloads! •  •  •  •  PROs! High-speed interconnect! Complementary object storage! Fast CPUs, RAM! Less faulty! CONs! •  Nodes aren't storagerich! •  Transferring data between HDFS and object storage*! ! * unless using Lustre, S3, etc backends! SAN DIEGO SUPERCOMPUTER CENTER
  • 5. Add Data Analysis to Existing Compute Infrastructure! SAN DIEGO SUPERCOMPUTER CENTER
  • 6. Add Data Analysis to Existing Compute Infrastructure! SAN DIEGO SUPERCOMPUTER CENTER
  • 7. Add Data Analysis to Existing Compute Infrastructure! SAN DIEGO SUPERCOMPUTER CENTER
  • 8. Add Data Analysis to Existing Compute Infrastructure! SAN DIEGO SUPERCOMPUTER CENTER
  • 9. myHadoop – 3-step Install! 1. Download Apache Hadoop 1.x and myHadoop 0.30! $ wget http://apache.cs.utah.edu/hadoop/common/ hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz! $ wget glennklockwood.com/files/myhadoop-0.30.tar.gz! 2. Unpack both Hadoop and myHadoop! $ tar zxvf hadoop-1.2.1-bin.tar.gz! $ tar zxvf myhadoop-0.30.tar.gz! 3. Apply myHadoop patch to Hadoop! $ cd hadoop-1.2.1/conf! $ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch! ! SAN DIEGO SUPERCOMPUTER CENTER
  • 10. myHadoop – 3-step Cluster! 1. Set a few environment variables! # sets HADOOP_HOME, JAVA_HOME, and PATH! $ module load hadoop! $ export HADOOP_CONF_DIR=$HOME/mycluster.conf! ! 2. Run myhadoop-configure.sh to set up Hadoop! $ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID! ! 3. Start cluster with Hadoop's start-all.sh! $ start-all.sh! SAN DIEGO SUPERCOMPUTER CENTER
  • 11. Easy Wordcount in Python! #!/usr/bin/env  python     import  sys     for  line  in  sys.stdin:      line  =  line.strip()      keys  =  line.split()      for  key  in  keys          value  =  1          print(  '%st%d'  %                (key,  value)  )   #!/usr/bin/env  python     import  sys     last_key  =  None   running_tot  =  0     for  input_line  in  sys.stdin:      input_line  =  input_line.strip()      this_key,  value  =  input_line.split("t",  1)      value  =  int(value)      if  last_key  ==  this_key:          running_tot  +=  value      else:          if  last_key:              print(  "%st%d"  %  (last_key,  running_tot)  )              running_tot  =  value              last_key  =  this_key       if  last_key  ==  this_key:      print(  "%st%d"  %  (last_key,  running_tot)  )   https://github.com/glennklockwood/hpchadoop/tree/master/wordcount.py! SAN DIEGO SUPERCOMPUTER CENTER
  • 12. Easy Wordcount in Python! $  hadoop  dfs  -­‐put  ./input.txt  mobydick.txt     $  hadoop  jar            /opt/hadoop/contrib/streaming/hadoop-­‐streaming-­‐1.1.1.jar            -­‐mapper  "python  $PWD/mapper.py"            -­‐reducer  "python  $PWD/reducer.py"            -­‐input  mobydick.txt            -­‐output  output     $  hadoop  dfs  -­‐cat  output/part-­‐*  >  ./output.txt   SAN DIEGO SUPERCOMPUTER CENTER
  • 13. Data-Intensive Performance Scaling! 8-node Hadoop cluster on Gordon ! 8 GB VCF (Variant Calling Format) file ! 9x speedup using 2 mappers/node ! SAN DIEGO SUPERCOMPUTER CENTER
  • 14. Data-Intensive Performance Scaling! 7.7 GB of chemical evolution data, 270 MB/s processing rate ! SAN DIEGO SUPERCOMPUTER CENTER
  • 15. Advanced Features - Useability! •  System-wide default configurations! •  myhadoop-0.30/conf/myhadoop.conf! •  MH_SCRATCH_DIR – specify location of node-local storage for all users! •  MH_IPOIB_TRANSFORM – specify regex to transform node hostnames into IP over InfiniBand hostnames! •  Users can remain totally ignorant of scratch disks and InfiniBand! •  Literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters – myHadoop figures out everything else! SAN DIEGO SUPERCOMPUTER CENTER
  • 16. Advanced Features - Useability! •  Parallel filesystem support! •  HDFS on Lustre via myHadoop persistent mode (-p)! •  Direct Lustre support (IDH)! •  No performance loss at smaller scales for HDFS on Lustre! •  Resource managers supported in unified framework:! •  •  •  •  Torque 2.x and 4.x – Tested on SDSC Gordon! SLURM 2.6 – Tested on TACC Stampede! Grid Engine! Can support LSF, PBSpro, Condor easily (need testbeds)! SAN DIEGO SUPERCOMPUTER CENTER
  • 17. New Features! •  Perl reimplementation for expandability! •  Scalability improvements! •  Separate namenode, secondary namenode, jobtrackers! •  Automatic switchover based on cluster size best practices! •  Hadoop 2.0 / YARN support (halfway there)! •  Spark integration! SAN DIEGO SUPERCOMPUTER CENTER