Hadoop Streaming: Programming Hadoop without Java

An introduction to using Hadoop Streaming to write map/reduce functions without knowing Java. Includes examples written in Python.

  1. Hadoop Streaming: Programming Hadoop without Java. Glenn K. Lockwood, Ph.D., User Services Group, San Diego Supercomputer Center, University of California, San Diego. November 8, 2013.
  2. Hadoop Streaming: Hadoop Architecture Recap.
  3. Map/Reduce Parallelism. [Figure: six independent tasks (task 0 through task 5), each paired with its own piece of the data.]
  4. Magic of HDFS.
  5. Hadoop Workflow.
  6. Hadoop Processing Pipeline:
     1. Map: convert raw input into key/value pairs on each node.
     2. Shuffle/Sort: send all key/value pairs with the same key to the same reducer node.
     3. Reduce: for each unique key, do something with all the corresponding values.
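     To make the data flow concrete, here is a minimal sketch (not from the original slides) that simulates the three stages in a single Python process for a toy wordcount; the function names are illustrative only.

        # Toy single-process simulation of map -> shuffle/sort -> reduce
        # (illustrative only; real Hadoop distributes these stages across nodes).
        from itertools import groupby
        from operator import itemgetter

        def map_phase(lines):
            # Map: each raw input line becomes many (key, value) pairs
            for line in lines:
                for word in line.split():
                    yield (word, 1)

        def shuffle_sort(pairs):
            # Shuffle/Sort: group the pairs so each key's values arrive together
            return groupby(sorted(pairs), key=itemgetter(0))

        def reduce_phase(grouped):
            # Reduce: combine all of the values seen for each unique key
            for key, group in grouped:
                yield (key, sum(value for _, value in group))

        lines = ["call me ishmael", "call me maybe"]
        print(dict(reduce_phase(shuffle_sort(map_phase(lines)))))
        # {'call': 2, 'ishmael': 1, 'maybe': 1, 'me': 2}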
  7. Hadoop Streaming: Wordcount Examples.
  8. Hadoop and Python:
     • Hadoop streaming with Python mappers/reducers: portable; most difficult (or least difficult) to use; you are the glue between Python and Hadoop.
     • mrjob (or others: hadoopy, dumbo, etc.): comprehensive integration; a Python interface to Hadoop streaming; analogous interface libraries exist in R and Perl; can interface directly with Amazon.
  9. Wordcount Example.
  10. Hadoop Streaming with Python:
     • "Simplest" (most portable) method.
     • Uses raw Python and Hadoop; you are the glue:

        cat input.txt | mapper.py | sort | reducer.py > output.txt

     • You provide these two scripts; Hadoop does the rest.
     • Generalizable to any language you want (Perl, R, etc.).
  11. Hands On: Hadoop Streaming. Located in streaming/streaming/:
     • wordcount-streaming-mapper.py: we'll look at this first.
     • wordcount-streaming-reducer.py: we'll look at this second.
     • run-wordcount.sh: all of the Hadoop commands needed to run this example. Run the script (./run-wordcount.sh) or paste each command line by line.
     • pg2701.txt: the full text of Melville's Moby Dick.
  12. Wordcount: Hadoop streaming mapper.

        #!/usr/bin/env python
        import sys

        for line in sys.stdin:
            line = line.strip()
            keys = line.split()
            for key in keys:
                value = 1
                print('%s\t%d' % (key, value))
  13. What One Mapper Does.

        line = "Call me Ishmael. Some years ago--never mind how long"
        keys = ["Call", "me", "Ishmael.", "Some", "years", "ago--never", "mind", "how", "long"]

     Each key is then emitted to the reducers with a value of 1: ("Call", 1), ("me", 1), ("Ishmael.", 1), ("Some", 1), ("years", 1), ("ago--never", 1), ("mind", 1), ("how", 1), ("long", 1).
  14. Reducer Loop:
     • If this key is the same as the previous key, add this key's value to our running total.
     • Otherwise:
        • print out the previous key's name and the running total,
        • reset our running total,
        • add this key's value to the running total, and
        • "this key" is now considered the "previous key".
  15. Wordcount: Streaming Reducer (1/2).

        #!/usr/bin/env python
        import sys

        last_key = None
        running_total = 0

        for input_line in sys.stdin:
            input_line = input_line.strip()
            this_key, value = input_line.split("\t", 1)
            value = int(value)

        (to be continued...)
  16. Wordcount: Streaming Reducer (2/2).

            if last_key == this_key:
                running_total += value
            else:
                if last_key:
                    print("%s\t%d" % (last_key, running_total))
                running_total = value
                last_key = this_key

        if last_key == this_key:
            print("%s\t%d" % (last_key, running_total))
  17. Testing Mappers/Reducers.
     • Debugging Hadoop is not fun, so test the scripts locally first:

        $ head -n100 pg2701.txt | \
              ./wordcount-streaming-mapper.py | sort | \
              ./wordcount-streaming-reducer.py
        ...
        with    5
        word,   1
        world.  1
        www.gutenberg.org       1
        you     3
        You     1
  18. Launching Hadoop Streaming.

        $ hadoop dfs -copyFromLocal ./pg2701.txt mobydick.txt

        $ hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
              -D mapred.reduce.tasks=2 \
              -mapper "$(which python) $PWD/wordcount-streaming-mapper.py" \
              -reducer "$(which python) $PWD/wordcount-streaming-reducer.py" \
              -input mobydick.txt \
              -output output

        $ hadoop dfs -cat output/part-* > ./output.txt
  19. Hadoop with Python: mrjob.
     • Mapper and reducer are written as functions.
     • Can serialize (pickle) objects to use as values.
     • Presents a single key plus all of its values at once.
     • Extracts map/reduce errors from Hadoop for you.
     • Hadoop runs entirely through Python:

        $ ./wordcount-mrjob.py \
              --jobconf mapred.reduce.tasks=2 \
              -r hadoop \
              hdfs:///user/glock/mobydick.txt \
              --output-dir hdfs:///user/glock/output
  20. Hands On: mrjob. Located in streaming/mrjob/:
     • wordcount-mrjob.py: contains both the mapper and reducer code.
     • run-wordcount-mrjob.sh: all of the Hadoop commands needed to run this example. Run the script (./run-wordcount-mrjob.sh) or paste each command line by line.
     • pg2701.txt: the full text of Melville's Moby Dick.
  21. mrjob: Mapper. Instead of printing tab-separated key/value pairs to stdout as the raw streaming mapper did, the mrjob mapper method yields them:

        #!/usr/bin/env python
        from mrjob.job import MRJob

        class MRwordcount(MRJob):

            def mapper(self, _, line):
                line = line.strip()
                keys = line.split()
                for key in keys:
                    value = 1
                    yield key, value

            def reducer(self, key, values):
                yield key, sum(values)

        if __name__ == '__main__':
            MRwordcount.run()
  22. mrjob: Reducer.

            def mapper(self, _, line):
                line = line.strip()
                keys = line.split()
                for key in keys:
                    value = 1
                    yield key, value

            def reducer(self, key, values):
                yield key, sum(values)

        if __name__ == '__main__':
            MRwordcount.run()

     • The reducer gets one key and ALL of its values.
     • No need to loop through key/value pairs yourself.
     • Use list methods/iterators to deal with the values (see the sketch below).
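     As an illustration of that last point, here is a small hypothetical variant (not part of the original hands-on code; the class name and job are assumptions) whose reducer applies max() to the values iterator instead of sum():

        #!/usr/bin/env python
        # Hypothetical example: find the length of the longest input line.
        from mrjob.job import MRJob

        class MRLongestLine(MRJob):

            def mapper(self, _, line):
                # Every line maps to the same key, so one reducer sees all lengths
                yield 'longest_line_chars', len(line)

            def reducer(self, key, values):
                # values is an iterator over every value emitted for this key
                yield key, max(values)

        if __name__ == '__main__':
            MRLongestLine.run()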
  23. mrjob: Job Launch. Run it as a Python script like any other; you can pass Hadoop parameters (and many more) in through Python:

        $ ./wordcount-mrjob.py \
              --jobconf mapred.reduce.tasks=2 \
              -r hadoop \
              hdfs:///user/glock/mobydick.txt \
              --output-dir hdfs:///user/glock/output

     • Default file locations are NOT on HDFS; copying to/from HDFS is done automatically.
     • The default output action is to print results to your screen.
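     A job can also be driven from another Python script through mrjob's runner interface. The sketch below is illustrative only and not part of the original hands-on material; the module name wordcount_mrjob and the argument list are assumptions:

        # Hypothetical driver script; assumes the MRwordcount class above is importable
        from wordcount_mrjob import MRwordcount  # module name is an assumption

        job = MRwordcount(args=['-r', 'hadoop',
                                'hdfs:///user/glock/mobydick.txt',
                                '--output-dir', 'hdfs:///user/glock/output'])
        with job.make_runner() as runner:
            runner.run()  # submits the job and blocks until it finishes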
  24. Hadoop Streaming: VCF Parsing, a Real Example.
  25. The VCF Parsing Problem:
     • Variant Calling Files (VCFs) are a standard in bioinformatics.
     • Large files (> 10 GB), semi-structured.
     • The format is a moving target, BUT parsing libraries exist (PyVCF, VCFtools).
     • Large VCFs still take too long to process serially.
  26. VCF File Format. The structure of the entire header must remain intact to describe each variant record. (^I in the original slide denotes a literal tab separating columns.)

        ##fileformat=VCFv4.1
        ##FILTER=<ID=LowQual,Description="Low quality">
        ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ...
        ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth ...
        ...
        #CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^IFORMAT^IHT020en^IHT0...
        1^I10186^I.^IT^IG^I45.44^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=-0.584;DP=43...
        1^I10198^I.^IT^IG^I33.46^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=0.277;DP=51...
        1^I10279^I.^IT^IG^I48.49^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=1.855;DP=28...
        1^I10389^I.^IAC^IA^I288.40^I.^IAC=2;AF=0.500;AN=4;BaseQRankSum=2.540;DP=...
        ...
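     For context, this is roughly how PyVCF reads such a file serially (a minimal sketch, assuming sample.vcf sits in the working directory; contrast it with the streaming mapper on slide 29, which splices sys.stdin into the reader instead):

        # Minimal serial PyVCF read (illustrative only)
        import vcf

        vcf_reader = vcf.Reader(open('sample.vcf', 'r'))
        for record in vcf_reader:
            # A few of the parsed fields used later in the mapper
            print(record.CHROM, record.POS, record.REF, record.ALT,
                  record.INFO.get('AF'))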
  27. Strategy: Expand the Pipeline (a sketch of the preprocessing step follows this list):
     1. Preprocess the VCF to separate the header.
     2. Map:
        1. read in the header to make sense of the records,
        2. filter out useless records,
        3. generate key/value pairs for interesting variants.
     3. Sort/Shuffle.
     4. Reduce (if necessary).
     5. Postprocess (upload to PostgreSQL).
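     The hands-on code ships a preprocess.py for step 1, but its contents are not shown on the slides. A minimal sketch of the idea might look like the following (the function name, argument handling, and file paths are assumptions):

        #!/usr/bin/env python
        # Hypothetical sketch of the header-extraction step; the real
        # preprocess.py in streaming/vcf/ may differ.
        import sys

        def split_vcf(vcf_path, header_path):
            """Write the '#' header lines to header_path and the records to stdout."""
            with open(vcf_path) as vcf_file, open(header_path, 'w') as header:
                for line in vcf_file:
                    if line.startswith('#'):
                        header.write(line)       # header must stay intact
                    else:
                        sys.stdout.write(line)   # records go on to the mappers

        if __name__ == '__main__':
            split_vcf(sys.argv[1], sys.argv[2])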
  28. Hands On: VCF. Located in streaming/vcf/:
     • preprocess.py: extracts the header from the VCF file.
     • mapper.py: simple PyVCF-based mapper.
     • run-parsevcf.sh: commands to launch the simple VCF parser example.
     • sample.vcf: sample VCF (cancer).
     • parsevcf.py: full preprocess+map+reduce+postprocess application.
     • run-parsevcf-full.sh: commands to run the full pre+map+red+post pipeline.
  29. Our Hands-On Version: Mapper (1/2).

        #!/usr/bin/env python
        import vcf
        import sys

        # vcfHeader holds the path to the header file extracted during preprocessing
        vcf_reader = vcf.Reader(open(vcfHeader, 'r'))

        # Splice stdin into the reader so records stream in from Hadoop,
        # skipping blank lines and any header lines
        vcf_reader._reader = sys.stdin
        vcf_reader.reader = (line.rstrip() for line in
            vcf_reader._reader if line.rstrip() and line[0] != '#')

        (continued...)
  30. Our Hands-On Version: Mapper (2/2).

        for record in vcf_reader:
            chrom = record.CHROM
            id = record.ID
            pos = record.POS
            ref = record.REF
            alt = record.ALT

            try:
                for idx, af in enumerate(record.INFO['AF']):
                    # target_af is the allele-frequency cutoff (0.30 in the launch command)
                    if af > target_af:
                        print("%d\t%s\t%d\t%s\t%s\t%.2f\t%d\t%d" % (
                            record.POS, record.CHROM, record.POS,
                            record.REF, record.ALT[idx],
                            record.INFO['AF'][idx], record.INFO['AC'][idx],
                            record.INFO['AN']))
            except KeyError:
                pass
  31. Our Hands-On Version: Reducer. No reduction step; the reducer can be turned off entirely:

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
              -D mapred.reduce.tasks=0 \
              -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" \
              -reducer "$(which python) $PWD/parsevcf.py -r" \
              -input vcfparse-input/sample.vcf \
              -output vcfparse-output
  32. No Reducer: What's the Point? An 8-node test with two mappers per node gave a 9x speedup.
