Hadoop Streaming
Programming Hadoop without Java

Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center
University of California San Diego

November 8, 2013
Hadoop Streaming
HADOOP ARCHITECTURE RECAP
Map/Reduce Parallelism

(Diagram: six independent tasks, task 0 through task 5, each operating on its own chunk of the data.)
Magic of HDFS
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Workflow
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Processing Pipeline

1.  Map – convert raw input into key/value pairs on each node
2.  Shuffle/Sort – send all key/value pairs with the same key to the same reducer node
3.  Reduce – for each unique key, do something with all the corresponding values
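A minimal sketch (not from the slides) of these three stages in plain Python, over two made-up input lines:

    from itertools import groupby

    lines = ["the cat sat", "the dog sat"]            # made-up input records

    # 1. Map: turn each line into (key, value) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # 2. Shuffle/Sort: bring all pairs with the same key together
    mapped.sort()
    grouped = {key: [v for _, v in group]
               for key, group in groupby(mapped, key=lambda kv: kv[0])}

    # 3. Reduce: do something with all of the values for each unique key
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)                  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}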

Hadoop Streaming
WORDCOUNT EXAMPLES
Hadoop and Python

•  Hadoop streaming w/ Python mappers/reducers
   •  portable
   •  most difficult (or least difficult) to use
   •  you are the glue between Python and Hadoop
•  mrjob (or others: hadoopy, dumbo, etc.)
   •  comprehensive integration
   •  Python interface to Hadoop streaming
   •  analogous interface libraries exist in R and Perl
   •  can interface directly with Amazon
Wordcount Example
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Streaming with Python

•  "Simplest" (most portable) method
•  Uses raw Python and Hadoop – you are the glue:

   cat input.txt | mapper.py | sort | reducer.py > output.txt

   You provide these two scripts; Hadoop does the rest.

•  Generalizable to any language you want (Perl, R, etc.)
HANDS ON – Hadoop Streaming

Located in streaming/streaming/:
•  wordcount-streaming-mapper.py
   We'll look at this first.
•  wordcount-streaming-reducer.py
   We'll look at this second.
•  run-wordcount.sh
   All of the Hadoop commands needed to run this example. Run the script (./run-wordcount.sh) or paste each command line-by-line.
•  pg2701.txt
   The full text of Melville's Moby Dick.
Wordcount: Hadoop Streaming Mapper

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )
What One Mapper Does

line = Call me Ishmael. Some years ago--never mind how long

keys = Call  me  Ishmael.  Some  years  ago--never  mind  how  long

emit.keyval(key, value) ...

    Call 1    me 1    Ishmael. 1    Some 1    years 1
    ago--never 1    mind 1    how 1    long 1

    ... to the reducers
Reducer Loop

•  If this key is the same as the previous key,
   •  add this key's value to our running total.
•  Otherwise,
   •  print out the previous key's name and the running total,
   •  reset our running total to 0,
   •  add this key's value to the running total, and
   •  "this key" is now considered the "previous key".
Wordcount: Streaming Reducer (1/2)

#!/usr/bin/env python

import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

(to be continued...)
Wordcount: Streaming Reducer (2/2)

    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )
Testing Mappers/Reducers

•  Debugging Hadoop is not fun

$ head -n100 pg2701.txt |
    ./wordcount-streaming-mapper.py | sort |
    ./wordcount-streaming-reducer.py
...
with                5
word,               1
world.              1
www.gutenberg.org   1
you                 3
You                 1
Launching Hadoop Streaming

$ hadoop dfs -copyFromLocal ./pg2701.txt mobydick.txt

$ hadoop jar \
      /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
      -D mapred.reduce.tasks=2 \
      -mapper "$(which python) $PWD/wordcount-streaming-mapper.py" \
      -reducer "$(which python) $PWD/wordcount-streaming-reducer.py" \
      -input mobydick.txt \
      -output output

$ hadoop dfs -cat output/part-* > ./output.txt
Hadoop with Python - mrjob

•  Mapper, reducer written as functions
•  Can serialize (Pickle) objects to use as values (see the sketch after the launch command)
•  Presents a single key + all values at once
•  Extracts map/reduce errors from Hadoop for you
•  Hadoop runs entirely through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output
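A hedged sketch of the pickle point above (not in the hands-on files; the class name and the tuple-valued example are illustrative): switching mrjob's internal protocol to pickle lets mappers pass arbitrary picklable Python objects to reducers.

    from mrjob.job import MRJob
    from mrjob.protocol import PickleProtocol

    class MRPickledValues(MRJob):

        # values passed between mapper and reducer are pickled instead of
        # JSON-encoded; sets, tuples, or custom objects all work as values
        INTERNAL_PROTOCOL = PickleProtocol

        def mapper(self, _, line):
            for word in line.strip().split():
                yield word, (1, len(word))     # value is a (count, length) tuple

        def reducer(self, word, values):
            pairs = list(values)               # materialize the one-pass iterator
            yield word, (sum(count for count, _ in pairs), pairs[0][1])

    if __name__ == '__main__':
        MRPickledValues.run()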
HANDS ON - mrjob

Located in streaming/mrjob:
•  wordcount-mrjob.py
   Contains both mapper and reducer code.
•  run-wordcount-mrjob.sh
   All of the Hadoop commands needed to run this example. Run the script (./run-wordcount-mrjob.sh) or paste each command line-by-line.
•  pg2701.txt
   The full text of Melville's Moby Dick.
mrjob - Mapper

#!/usr/bin/env python

from mrjob.job import MRJob

class MRwordcount(MRJob):

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

(The slide shows this alongside the raw streaming mapper for comparison: the explicit "for line in sys.stdin" loop and its print('%s\t%d' % (key, value)) collapse into the body of mapper(), which receives one line at a time and yields key/value pairs instead of printing them.)
mrjob - Reducer

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

•  Reducer gets one key and ALL of its values
•  No need to loop through key/value pairs
•  Use list methods/iterators to deal with the values
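As an aside (not on the slide), having the whole value iterator at once makes other aggregations one-liners. A hedged sketch with an illustrative job that reports the longest word per starting letter:

    from mrjob.job import MRJob

    class MRLongestWord(MRJob):      # illustrative name, not in the hands-on files

        def mapper(self, _, line):
            for word in line.strip().split():
                yield word[0].lower(), word    # key = first letter, value = word

        def reducer(self, letter, words):
            # 'words' iterates over ALL words that share this first letter
            yield letter, max(words, key=len)

    if __name__ == '__main__':
        MRLongestWord.run()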

mrjob – Job Launch

Run as a Python script like any other; you can pass Hadoop parameters (and many more!) in through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output

Default file locations are NOT on HDFS; copying to and from HDFS is done automatically.

Default output action is to print results to your screen.
Hadoop Streaming
VCF PARSING: A REAL EXAMPLE
VCF Parsing Problem

•  Variant Calling Files (VCFs) are a standard in bioinformatics
•  Large files (> 10 GB), semi-structured
•  Format is a moving target BUT parsing libraries exist (PyVCF, VCFtools)
•  Large VCFs still take too long to process serially
VCF File Format

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth
...
#CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^IFORMAT^IHT020en^IHT0
1^I10186^I.^IT^IG^I45.44^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=-0.584;DP=43
1^I10198^I.^IT^IG^I33.46^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=0.277;DP=51
1^I10279^I.^IT^IG^I48.49^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=1.855;DP=28
1^I10389^I.^IAC^IA^I288.40^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=2.540;DP=
...

(The ^I characters are literal tabs. The structure of the entire header must remain intact in order to describe each variant record.)
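As a hedged aside (not on the slide), this is roughly how PyVCF exposes such a file; it assumes PyVCF is installed and that the records carry an AF entry in INFO:

    import vcf                                     # PyVCF

    reader = vcf.Reader(open('sample.vcf', 'r'))   # header parsed for field metadata
    for record in reader:
        # each tab-separated record becomes an object, typed via the header
        print(record.CHROM, record.POS, record.REF, record.ALT,
              record.INFO.get('AF'))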
Strategy: Expand the Pipeline

1.  Preprocess VCF to separate the header (a hedged sketch of this step follows the list)
2.  Map
    1.  read in the header to make sense of records
    2.  filter out useless records
    3.  generate key/value pairs for interesting variants
3.  Sort/Shuffle
4.  Reduce (if necessary)
5.  Postprocess (upload to PostgreSQL)
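A hedged sketch of step 1 (the real logic lives in preprocess.py); it assumes the header lines all precede the first record, as in the sample file:

    #!/usr/bin/env python
    import sys

    # copy the '#' header lines into header.txt so every mapper can read them;
    # HDFS splits the input without regard to where the header ends
    with open(sys.argv[1]) as vcf_in, open('header.txt', 'w') as header_out:
        for line in vcf_in:
            if not line.startswith('#'):
                break                              # first record reached
            header_out.write(line)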
HANDS ON - VCF

Located in: streaming/vcf/
•  preprocess.py
   Extracts the header from the VCF file.
•  mapper.py
   Simple PyVCF-based mapper.
•  run-parsevcf.sh
   Commands to launch the simple VCF parser example.
•  sample.vcf
   Sample VCF (cancer).
•  parsevcf.py
   Full preprocess+map+reduce+postprocess application.
•  run-parsevcf-full.sh
   Commands to run the full pre+map+red+post pipeline.
Our Hands-On Version: Mapper

#!/usr/bin/env python

import vcf
import sys

vcf_reader = vcf.Reader(open(vcfHeader, 'r'))

vcf_reader._reader = sys.stdin

vcf_reader.reader = (line.rstrip() for line in
    vcf_reader._reader if line.rstrip() and line[0] != '#')

(continued...)
Our Hands-On Version: Mapper (continued)

for record in vcf_reader:
    chrom = record.CHROM
    id = record.ID
    pos = record.POS
    ref = record.REF
    alt = record.ALT

    try:
        for idx, af in enumerate(record.INFO['AF']):
            if af > target_af:
                print( "%d\t%s\t%d\t%s\t%s\t%.2f\t%d\t%d" % (
                    record.POS, record.CHROM, record.POS,
                    record.REF, record.ALT[idx],
                    record.INFO['AF'][idx], record.INFO['AC'][idx],
                    record.INFO['AN'] ) )
    except KeyError:
        pass
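For the first record of the sample shown earlier (CHROM=1, POS=10186, REF=T, ALT=G, AF=0.500, AC=2, AN=4), and assuming the 0.30 AF threshold passed in by the launch script below, the mapper would emit a tab-separated line roughly like (tabs shown as spaces):

    10186  1  10186  T  G  0.50  2  4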
Our Hands-On Version: Reducer

No reduction step is needed, so the reducer can effectively be turned off:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" \
      -reducer "$(which python) $PWD/parsevcf.py -r" \
      -input vcfparse-input/sample.vcf \
      -output vcfparse-output
No Reducer – What's the Point?

8-node test: two mappers per node = 9x speedup
