Hadoop Streaming
Programming Hadoop without Java

Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center
University of California San Diego

November 8, 2013
Hadoop Streaming
HADOOP ARCHITECTURE RECAP
Map/Reduce Parallelism

(Diagram: six independent tasks, task 0 through task 5, each operating on its own chunk of the data.)
Magic of HDFS
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Workflow
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Processing Pipeline

1.  Map – convert raw input into key/value pairs on each node
2.  Shuffle/Sort – send all key/value pairs with the same key to the same reducer node
3.  Reduce – for each unique key, do something with all the corresponding values
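A minimal sketch (not from the slides) of these three stages in plain Python, over two made-up input lines:

    from itertools import groupby

    lines = ["the cat sat", "the dog sat"]            # made-up input records

    # 1. Map: turn each line into (key, value) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # 2. Shuffle/Sort: bring all pairs with the same key together
    mapped.sort()
    grouped = {key: [v for _, v in group]
               for key, group in groupby(mapped, key=lambda kv: kv[0])}

    # 3. Reduce: do something with all of the values for each unique key
    counts = {word: sum(values) for word, values in grouped.items()}
    print(counts)                  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}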

Hadoop Streaming
WORDCOUNT EXAMPLES
Hadoop and Python

•  Hadoop streaming w/ Python mappers/reducers
   •  portable
   •  most difficult (or least difficult) to use
   •  you are the glue between Python and Hadoop
•  mrjob (or others: hadoopy, dumbo, etc.)
   •  comprehensive integration
   •  Python interface to Hadoop streaming
   •  analogous interface libraries exist in R and Perl
   •  can interface directly with Amazon
Wordcount Example
"

SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Hadoop Streaming with Python

•  "Simplest" (most portable) method
•  Uses raw Python and Hadoop – you are the glue:

   cat input.txt | mapper.py | sort | reducer.py > output.txt

   You provide these two scripts; Hadoop does the rest.

•  Generalizable to any language you want (Perl, R, etc.)
HANDS ON – Hadoop Streaming

Located in streaming/streaming/:
•  wordcount-streaming-mapper.py
   We'll look at this first.
•  wordcount-streaming-reducer.py
   We'll look at this second.
•  run-wordcount.sh
   All of the Hadoop commands needed to run this example. Run the script (./run-wordcount.sh) or paste each command line-by-line.
•  pg2701.txt
   The full text of Melville's Moby Dick.
Wordcount: Hadoop Streaming Mapper

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )
What One Mapper Does

line = Call me Ishmael. Some years ago--never mind how long

keys = Call  me  Ishmael.  Some  years  ago--never  mind  how  long

emit.keyval(key, value) ...

    Call 1    me 1    Ishmael. 1    Some 1    years 1
    ago--never 1    mind 1    how 1    long 1

    ... to the reducers
Reducer Loop

•  If this key is the same as the previous key,
   •  add this key's value to our running total.
•  Otherwise,
   •  print out the previous key's name and the running total,
   •  reset our running total to 0,
   •  add this key's value to the running total, and
   •  "this key" is now considered the "previous key".
Wordcount: Streaming Reducer (1/2)

#!/usr/bin/env python

import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

(to be continued...)
Wordcount: Streaming Reducer (2/2)

    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )
Testing Mappers/Reducers

•  Debugging Hadoop is not fun

$ head -n100 pg2701.txt |
    ./wordcount-streaming-mapper.py | sort |
    ./wordcount-streaming-reducer.py
...
with                5
word,               1
world.              1
www.gutenberg.org   1
you                 3
You                 1
Launching Hadoop Streaming

$ hadoop dfs -copyFromLocal ./pg2701.txt mobydick.txt

$ hadoop jar \
      /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
      -D mapred.reduce.tasks=2 \
      -mapper "$(which python) $PWD/wordcount-streaming-mapper.py" \
      -reducer "$(which python) $PWD/wordcount-streaming-reducer.py" \
      -input mobydick.txt \
      -output output

$ hadoop dfs -cat output/part-* > ./output.txt
Hadoop with Python - mrjob

•  Mapper, reducer written as functions
•  Can serialize (Pickle) objects to use as values (see the sketch after the launch command)
•  Presents a single key + all values at once
•  Extracts map/reduce errors from Hadoop for you
•  Hadoop runs entirely through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output
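A hedged sketch of the pickle point above (not in the hands-on files; the class name and the tuple-valued example are illustrative): switching mrjob's internal protocol to pickle lets mappers pass arbitrary picklable Python objects to reducers.

    from mrjob.job import MRJob
    from mrjob.protocol import PickleProtocol

    class MRPickledValues(MRJob):

        # values passed between mapper and reducer are pickled instead of
        # JSON-encoded; sets, tuples, or custom objects all work as values
        INTERNAL_PROTOCOL = PickleProtocol

        def mapper(self, _, line):
            for word in line.strip().split():
                yield word, (1, len(word))     # value is a (count, length) tuple

        def reducer(self, word, values):
            pairs = list(values)               # materialize the one-pass iterator
            yield word, (sum(count for count, _ in pairs), pairs[0][1])

    if __name__ == '__main__':
        MRPickledValues.run()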
HANDS ON - mrjob

Located in streaming/mrjob:
•  wordcount-mrjob.py
   Contains both mapper and reducer code.
•  run-wordcount-mrjob.sh
   All of the Hadoop commands needed to run this example. Run the script (./run-wordcount-mrjob.sh) or paste each command line-by-line.
•  pg2701.txt
   The full text of Melville's Moby Dick.
mrjob - Mapper

#!/usr/bin/env python

from mrjob.job import MRJob

class MRwordcount(MRJob):

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

(The slide shows this alongside the raw streaming mapper for comparison: the explicit "for line in sys.stdin" loop and its print('%s\t%d' % (key, value)) collapse into the body of mapper(), which receives one line at a time and yields key/value pairs instead of printing them.)
mrjob - Reducer

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

•  Reducer gets one key and ALL of its values
•  No need to loop through key/value pairs
•  Use list methods/iterators to deal with the values
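As an aside (not on the slide), having the whole value iterator at once makes other aggregations one-liners. A hedged sketch with an illustrative job that reports the longest word per starting letter:

    from mrjob.job import MRJob

    class MRLongestWord(MRJob):      # illustrative name, not in the hands-on files

        def mapper(self, _, line):
            for word in line.strip().split():
                yield word[0].lower(), word    # key = first letter, value = word

        def reducer(self, letter, words):
            # 'words' iterates over ALL words that share this first letter
            yield letter, max(words, key=len)

    if __name__ == '__main__':
        MRLongestWord.run()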

mrjob – Job Launch

Run as a Python script like any other; you can pass Hadoop parameters (and many more!) in through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output

Default file locations are NOT on HDFS; copying to and from HDFS is done automatically.

Default output action is to print results to your screen.
Hadoop Streaming
VCF PARSING: A REAL EXAMPLE
VCF Parsing Problem

•  Variant Calling Files (VCFs) are a standard in bioinformatics
•  Large files (> 10 GB), semi-structured
•  Format is a moving target BUT parsing libraries exist (PyVCF, VCFtools)
•  Large VCFs still take too long to process serially
VCF File Format

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth
...
#CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^IFORMAT^IHT020en^IHT0
1^I10186^I.^IT^IG^I45.44^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=-0.584;DP=43
1^I10198^I.^IT^IG^I33.46^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=0.277;DP=51
1^I10279^I.^IT^IG^I48.49^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=1.855;DP=28
1^I10389^I.^IAC^IA^I288.40^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=2.540;DP=
...

(The ^I characters are literal tabs. The structure of the entire header must remain intact in order to describe each variant record.)
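As a hedged aside (not on the slide), this is roughly how PyVCF exposes such a file; it assumes PyVCF is installed and that the records carry an AF entry in INFO:

    import vcf                                     # PyVCF

    reader = vcf.Reader(open('sample.vcf', 'r'))   # header parsed for field metadata
    for record in reader:
        # each tab-separated record becomes an object, typed via the header
        print(record.CHROM, record.POS, record.REF, record.ALT,
              record.INFO.get('AF'))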
Strategy: Expand the Pipeline

1.  Preprocess VCF to separate the header (a hedged sketch of this step follows the list)
2.  Map
    1.  read in the header to make sense of records
    2.  filter out useless records
    3.  generate key/value pairs for interesting variants
3.  Sort/Shuffle
4.  Reduce (if necessary)
5.  Postprocess (upload to PostgreSQL)
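A hedged sketch of step 1 (the real logic lives in preprocess.py); it assumes the header lines all precede the first record, as in the sample file:

    #!/usr/bin/env python
    import sys

    # copy the '#' header lines into header.txt so every mapper can read them;
    # HDFS splits the input without regard to where the header ends
    with open(sys.argv[1]) as vcf_in, open('header.txt', 'w') as header_out:
        for line in vcf_in:
            if not line.startswith('#'):
                break                              # first record reached
            header_out.write(line)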
HANDS ON - VCF

Located in: streaming/vcf/
•  preprocess.py
   Extracts the header from the VCF file.
•  mapper.py
   Simple PyVCF-based mapper.
•  run-parsevcf.sh
   Commands to launch the simple VCF parser example.
•  sample.vcf
   Sample VCF (cancer).
•  parsevcf.py
   Full preprocess+map+reduce+postprocess application.
•  run-parsevcf-full.sh
   Commands to run the full pre+map+red+post pipeline.
Our Hands-On Version: Mapper

#!/usr/bin/env python

import vcf
import sys

vcf_reader = vcf.Reader(open(vcfHeader, 'r'))

vcf_reader._reader = sys.stdin

vcf_reader.reader = (line.rstrip() for line in
    vcf_reader._reader if line.rstrip() and line[0] != '#')

(continued...)
Our Hands-On Version: Mapper (continued)

for record in vcf_reader:
    chrom = record.CHROM
    id = record.ID
    pos = record.POS
    ref = record.REF
    alt = record.ALT

    try:
        for idx, af in enumerate(record.INFO['AF']):
            if af > target_af:
                print( "%d\t%s\t%d\t%s\t%s\t%.2f\t%d\t%d" % (
                    record.POS, record.CHROM, record.POS,
                    record.REF, record.ALT[idx],
                    record.INFO['AF'][idx], record.INFO['AC'][idx],
                    record.INFO['AN'] ) )
    except KeyError:
        pass
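For the first record of the sample shown earlier (CHROM=1, POS=10186, REF=T, ALT=G, AF=0.500, AC=2, AN=4), and assuming the 0.30 AF threshold passed in by the launch script below, the mapper would emit a tab-separated line roughly like (tabs shown as spaces):

    10186  1  10186  T  G  0.50  2  4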
Our Hands-On Version: Reducer

No reduction step is needed, so the reducer can effectively be turned off:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" \
      -reducer "$(which python) $PWD/parsevcf.py -r" \
      -input vcfparse-input/sample.vcf \
      -output vcfparse-output
No Reducer – What's the Point?

8-node test: two mappers per node = 9x speedup
