Hadoop Streaming: Programming Hadoop without Java
An introduction to using Hadoop Streaming to write map/reduce functions without knowing Java. Includes examples written in Python.

Presentation Transcript

    • Hadoop Streaming: Programming Hadoop without Java
  Glenn K. Lockwood, Ph.D.
  User Services Group
  San Diego Supercomputer Center
  University of California San Diego
  November 8, 2013
    • Hadoop Streaming: HADOOP ARCHITECTURE RECAP
    • Map/Reduce Parallelism
  [diagram: six independent tasks, task 0 through task 5, each paired with its own block of data]
    • Magic of HDFS
  [diagram slide]
    • Hadoop Workflow
  [diagram slide]
    • Hadoop Processing Pipeline
  1. Map: convert raw input into key/value pairs on each node
  2. Shuffle/Sort: send all key/value pairs with the same key to the same reducer node
  3. Reduce: for each unique key, do something with all the corresponding values
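  These three stages are easy to mimic in plain Python. The sketch below is not from the slides; the input lines and the wordcount logic are illustrative stand-ins for what Hadoop does across many nodes:

    from itertools import groupby

    # Illustrative input; on a real cluster each mapper sees only its own split
    lines = ["Call me Ishmael", "Call me maybe"]

    # Map: turn raw input into key/value pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle/Sort: group identical keys together (Hadoop also routes
    # each key to a single reducer at this point)
    mapped.sort(key=lambda kv: kv[0])

    # Reduce: combine all values belonging to each unique key
    for word, pairs in groupby(mapped, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(v for _, v in pairs)))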
    • Hadoop Streaming: WORDCOUNT EXAMPLES
    • Hadoop and Python
  • Hadoop streaming w/ Python mappers/reducers
    • portable
    • most difficult (or least difficult) to use
    • you are the glue between Python and Hadoop
  • mrjob (or others: hadoopy, dumbo, etc.)
    • comprehensive integration
    • Python interface to Hadoop streaming
    • analogous interface libraries exist in R and Perl
    • can interface directly with Amazon
    • Wordcount Example
  [diagram slide]
    • Hadoop Streaming with Python
  • "Simplest" (most portable) method
  • Uses raw Python and Hadoop; you are the glue:

        cat input.txt | mapper.py | sort | reducer.py > output.txt

    You provide these two scripts; Hadoop does the rest.
  • Generalizable to any language you want (Perl, R, etc.)
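  Note what sort is doing in that pipeline: it stands in for Hadoop's shuffle/sort phase, putting identical keys on consecutive lines so the reducer can count on seeing all values for a key together. That ordering contract is the only thing the reducer shown later relies on.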
    • HANDS ON: Hadoop Streaming
  Located in streaming/streaming/:
  • wordcount-streaming-mapper.py
    We'll look at this first
  • wordcount-streaming-reducer.py
    We'll look at this second
  • run-wordcount.sh
    All of the Hadoop commands needed to run this example. Run the script (./run-wordcount.sh) or paste each command line-by-line.
  • pg2701.txt
    The full text of Melville's Moby Dick
    • Wordcount: Hadoop streaming mapper

    #!/usr/bin/env python
    import sys

    # Read raw input lines from stdin, which Hadoop feeds to each mapper
    for line in sys.stdin:
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            # Emit one tab-delimited key/value pair per word
            print('%s\t%d' % (key, value))
    • What One Mapper Does

    line = "Call me Ishmael. Some years ago--never mind how long"
    keys = Call / me / Ishmael. / Some / years / ago--never / mind / how / long

  Each key is emitted to the reducers with a value of 1:
  (Call, 1), (me, 1), (Ishmael., 1), (Some, 1), (years, 1), (ago--never, 1), (mind, 1), (how, 1), (long, 1)
    • Reducer Loop
  • If this key is the same as the previous key,
    • add this key's value to our running total.
  • Otherwise,
    • print out the previous key's name and the running total,
    • reset our running total to 0,
    • add this key's value to the running total, and
    • "this key" is now considered the "previous key"
    • Wordcount: Streaming Reducer (1/2)

    #!/usr/bin/env python
    import sys

    last_key = None
    running_total = 0

    # Input arrives already sorted by key, one tab-delimited pair per line
    for input_line in sys.stdin:
        input_line = input_line.strip()
        this_key, value = input_line.split("\t", 1)
        value = int(value)

    (to be continued...)
    • Wordcount: Streaming Reducer (2/2)

        if last_key == this_key:
            running_total += value
        else:
            # A new key begins: flush the previous key's count first
            if last_key:
                print("%s\t%d" % (last_key, running_total))
            running_total = value
            last_key = this_key

    # After the loop ends, the final key has not been printed yet
    if last_key == this_key:
        print("%s\t%d" % (last_key, running_total))
    • Testing Mappers/Reducers
  • Debugging Hadoop is not fun

    $ head -n100 pg2701.txt |
        ./wordcount-streaming-mapper.py | sort |
        ./wordcount-streaming-reducer.py
    ...
    with    5
    word,   1
    world.  1
    www.gutenberg.org   1
    you 3
    You 1
    • Launching Hadoop Streaming

    $ hadoop dfs -copyFromLocal ./pg2701.txt mobydick.txt

    $ hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
          -D mapred.reduce.tasks=2 \
          -mapper "$(which python) $PWD/wordcount-streaming-mapper.py" \
          -reducer "$(which python) $PWD/wordcount-streaming-reducer.py" \
          -input mobydick.txt \
          -output output

    $ hadoop dfs -cat output/part-* > ./output.txt
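  A quick sanity check before pulling results back: the output directory on HDFS should hold one part-* file per reducer, so two here because of mapred.reduce.tasks=2. Using the same Hadoop 1.x command set as above:

    $ hadoop dfs -ls output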
    • Hadoop with Python: mrjob
  • Mapper, reducer written as functions
  • Can serialize (Pickle) objects to use as values
  • Presents a single key + all values at once
  • Extracts map/reduce errors from Hadoop for you
  • Hadoop runs entirely through Python:

        $ ./wordcount-mrjob.py \
              --jobconf mapred.reduce.tasks=2 \
              -r hadoop \
              hdfs:///user/glock/mobydick.txt \
              --output-dir hdfs:///user/glock/output
    • HANDS ON: mrjob
  Located in streaming/mrjob/:
  • wordcount-mrjob.py
    Contains both mapper and reducer code
  • run-wordcount-mrjob.sh
    All of the Hadoop commands needed to run this example. Run the script (./run-wordcount-mrjob.sh) or paste each command line-by-line.
  • pg2701.txt
    The full text of Melville's Moby Dick
    • mrjob: Mapper
  The raw streaming mapper carries over almost line-for-line; print becomes yield:

    #!/usr/bin/env python
    from mrjob.job import MRJob

    class MRwordcount(MRJob):

        def mapper(self, _, line):
            line = line.strip()
            keys = line.split()
            for key in keys:
                value = 1
                yield key, value

        def reducer(self, key, values):
            yield key, sum(values)

    if __name__ == '__main__':
        MRwordcount.run()
    • mrjob: Reducer

        def mapper(self, _, line):
            line = line.strip()
            keys = line.split()
            for key in keys:
                value = 1
                yield key, value

        def reducer(self, key, values):
            yield key, sum(values)

    if __name__ == '__main__':
        MRwordcount.run()

  • Reducer gets one key and ALL values
  • No need to loop through key/value pairs
  • Use list methods/iterators to deal with keys
    • mrjob: Job Launch
  Run it as a Python script like any other; Hadoop parameters (and many more!) can be passed in through Python:

    $ ./wordcount-mrjob.py \
          --jobconf mapred.reduce.tasks=2 \
          -r hadoop \
          hdfs:///user/glock/mobydick.txt \
          --output-dir hdfs:///user/glock/output

  • Default file locations are NOT on HDFS; copying to/from HDFS is done automatically
  • Default output action is to print results to your screen
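  Those two defaults make local testing painless. A minimal sketch, assuming mrjob's default local (inline) runner when no -r option is given:

    $ ./wordcount-mrjob.py pg2701.txt > counts.txt

  mrjob runs the mapper and reducer in an ordinary local Python process and prints the resulting key/value pairs, which the shell redirect captures; no Hadoop cluster or HDFS is involved.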
    • Hadoop Streaming: VCF PARSING, A REAL EXAMPLE
    • VCF Parsing Problem
  • Variant Calling Files (VCFs) are a standard in bioinformatics
  • Large files (> 10 GB), semi-structured
  • Format is a moving target, BUT parsing libraries exist (PyVCF, VCFtools)
  • Large VCFs still take too long to process serially
    • VCF File Format

    ##fileformat=VCFv4.1
    ##FILTER=<ID=LowQual,Description="Low quality">
    ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth
    ...
    #CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^IFORMAT^IHT020en^IHT0
    1^I10186^I.^IT^IG^I45.44^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=-0.584;DP=43
    1^I10198^I.^IT^IG^I33.46^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=0.277;DP=51
    1^I10279^I.^IT^IG^I48.49^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=1.855;DP=28
    1^I10389^I.^IAC^IA^I288.40^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=2.540;DP=
    ...

  The structure of the entire header must remain intact to describe each variant record. (^I marks a tab character; the record body is tab-delimited.)
    • Strategy: Expand the Pipeline
  1. Preprocess VCF to separate header (a minimal sketch of this step follows below)
  2. Map
     1. read in header to make sense of records
     2. filter out useless records
     3. generate key/value pairs for interesting variants
  3. Sort/Shuffle
  4. Reduce (if necessary)
  5. Postprocess (upload to PostgreSQL)
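  The preprocessing step is just header extraction. This sketch is not the course's preprocess.py, only a hypothetical stand-in assuming the standard VCF convention that every header line begins with '#':

    #!/usr/bin/env python
    import sys

    # Split a VCF arriving on stdin: header lines go to header.txt
    # (the file each mapper will read to make sense of the records),
    # while variant records pass through to stdout for Hadoop to consume.
    with open('header.txt', 'w') as header:
        for line in sys.stdin:
            if line.startswith('#'):
                header.write(line)       # meta-information lines and the #CHROM row
            else:
                sys.stdout.write(line)   # tab-delimited variant records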
    • HANDS ON: VCF
  Located in streaming/vcf/:
  • preprocess.py
    Extracts the header from the VCF file
  • mapper.py
    Simple PyVCF-based mapper
  • run-parsevcf.sh
    Commands to launch the simple VCF parser example
  • sample.vcf
    Sample VCF (cancer)
  • parsevcf.py
    Full preprocess+map+reduce+postprocess application
  • run-parsevcf-full.sh
    Commands to run full pre+map+red+post pipeline
    • Our Hands-On Version: Mapper

    #!/usr/bin/env python
    import vcf
    import sys

    # Build a PyVCF reader from the extracted header file
    # (vcfHeader holds its path; it is set elsewhere in the script)
    vcf_reader = vcf.Reader(open(vcfHeader, 'r'))

    # Swap the reader's input for stdin, skipping blank and header lines,
    # so it parses whatever slice of records Hadoop feeds this mapper
    vcf_reader._reader = sys.stdin
    vcf_reader.reader = (line.rstrip() for line in
        vcf_reader._reader if line.rstrip() and line[0] != '#')

    (continued...)
    • Our Hands-On Version: Mapper

    for record in vcf_reader:
        chrom = record.CHROM
        id = record.ID
        pos = record.POS
        ref = record.REF
        alt = record.ALT

        try:
            # Emit a tab-delimited line for every alternate allele whose
            # frequency exceeds the target threshold
            for idx, af in enumerate(record.INFO['AF']):
                if af > target_af:
                    print("%d\t%s\t%d\t%s\t%s\t%.2f\t%d\t%d" % (
                        record.POS, record.CHROM, record.POS,
                        record.REF, record.ALT[idx],
                        record.INFO['AF'][idx], record.INFO['AC'][idx],
                        record.INFO['AN']))
        except KeyError:
            # Skip records that lack AF/AC/AN annotations
            pass
    • Our Hands-On Version: Reducer
  No reduction step needed; the reducer can be turned off entirely:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" \
        -reducer "$(which python) $PWD/parsevcf.py -r" \
        -input vcfparse-input/sample.vcf \
        -output vcfparse-output
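  With mapred.reduce.tasks=0 this becomes a map-only job: Hadoop skips the shuffle/sort phase and writes each mapper's output directly to its own part-* file under vcfparse-output, and the -reducer argument above simply goes unused.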
    • No Reducer: What's the Point?
  8-node test, two mappers per node: 9x speedup