A Guide to Python Frameworks for Hadoop

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/17fsvKl.

Uri Laserson reviews the available Python frameworks for Hadoop, comparing their performance, ease of use and installation, implementation differences, and other features. Filmed at qconnewyork.com.

Uri Laserson is a data scientist at Cloudera. He received his PhD from MIT, where he developed applications of high-throughput DNA sequencing to immunology. During that time, he co-founded Good Start Genetics, a next-generation diagnostics company focused on genetic carrier screening. In 2012 he was named to Forbes' 30 Under 30 list.

Transcript

  • 1. A Guide to Python Frameworks for Hadoop
       Uri Laserson | Data Scientist
       laserson@cloudera.com
       14 June 2013
  • 2. InfoQ.com: News & Community Site
       • 750,000 unique visitors/month
       • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
       • Post content from our QCon conferences
       • News 15-20 / week
       • Articles 3-4 / week
       • Presentations (videos) 12-15 / week
       • Interviews 2-3 / week
       • Books 1 / month
       Watch the video with slide synchronization on InfoQ.com!
       http://www.infoq.com/presentations/python-hadoop
  • 3. Presented at QCon New York (www.qconnewyork.com)
       Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
       Strategy: practitioner-driven conference designed for YOU, the influencers of change and innovation in your teams
       - speakers and topics driving the evolution and innovation
       - connecting and catalyzing the influencers and innovators
       Highlights
       - attended by more than 12,000 delegates since 2007
       - held in 9 cities worldwide
  • 4. About the speaker
       • Joined Cloudera late 2012
       • Focused on life sciences/medical
       • PhD in BME/computational biology at MIT/Harvard (2005-2012)
       • Focused on genomics
       • Cofounded Good Start Genetics (2007-)
       • Applying next-gen DNA sequencing to genetic carrier screening
  • 5. About the speaker
       • No formal training in computer science
       • Never touched Java
       • Almost all work using Python
  • 6. [image slide]
  • 7. Python frameworks for Hadoop
       • Hadoop Streaming
       • mrjob (Yelp)
       • dumbo
       • Luigi (Spotify)
       • hadoopy
       • pydoop
       • PySpark
       • happy
       • Disco
       • octopy
       • Mortar Data
       • Pig UDF/Jython
       • hipy
  • 8. Goals for a Python framework
       1. "Pseudocodiness"/simplicity
       2. Flexibility/generality
       3. Ease of use/installation
       4. Performance
  • 9. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams)
       An n-gram is a tuple of n words.
  • 10. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams)
        An n-gram is a tuple of n words.
        [diagram: eight numbered words bracketed together as an 8-gram]
  • 11. "A partial differential equation is an equation that contains partial derivatives."
  • 12. A partial differential equation is an equation that contains partial derivatives.
        1-grams:
        A 1
        partial 2
        differential 1
        equation 2
        is 1
        an 1
        that 1
        contains 1
        derivatives. 1
  • 13. A partial differential equation is an equation that contains partial derivatives.
        2-grams:
        A partial 1
        partial differential 1
        differential equation 1
        equation is 1
        is an 1
        an equation 1
        equation that 1
        that contains 1
        contains partial 1
        partial derivatives. 1
  • 14. A partial differential equation is an equation that contains partial derivatives.
        5-grams:
        A partial differential equation is 1
        partial differential equation is an 1
        differential equation is an equation 1
        equation is an equation that 1
        is an equation that contains 1
        an equation that contains partial 1
        equation that contains partial derivatives. 1
  • 15. [image slide]
  • 16. Sample of the raw data:
        2-gram         year  matches  pages  volumes
        flourished in  1993  2        2      2
        flourished in  1998  2        2      1
        flourished in  1999  6        6      4
        flourished in  2000  5        5      5
        flourished in  2001  1        1      1
        flourished in  2002  7        7      3
        flourished in  2003  9        9      4
        flourished in  2004  22       21     13
        flourished in  2005  37       37     22
        flourished in  2006  55       55     38
        flourished in  2007  99       98     76
        flourished in  2008  220      215    118
        fluid of       1899  2        2      1
        fluid of       2000  3        3      1
        fluid of       2002  2        1      1
        fluid of       2003  3        3      1
        fluid of       2004  3        3      3
  • 17. Compute how often two words are near each other in a given year.
        Two words are "near" if they are both present in a 2-, 3-, 4-, or 5-gram.
  • 18. Raw data:
        ...2-grams...
        (cat, the) 1999 14
        (the, cat) 1999 7002
        ...3-grams...
        (the, cheshire, cat) 1999 563
        ...4-grams...
        ...5-grams...
        (the, cat, in, the, hat) 1999 1023
        (the, dog, chased, the, cat) 1999 403
        (cat, is, one, of, the) 1999 24
        Aggregated results (note the lexicographic ordering of each pair):
        (cat, the) 1999 8006
        (hat, the) 1999 1023
        Internal n-grams are counted by the smaller n-grams:
        • avoids double-counting
        • increases sensitivity (observed at least 40 times)
  • 19. Pseudocode for MapReduce

        def map(record):
            (ngram, year, count) = unpack(record)
            # ensure word1 has the lexicographically first word:
            (word1, word2) = sorted(ngram[first], ngram[last])
            key = (word1, word2, year)
            emit(key, count)

        def reduce(key, values):
            emit(key, sum(values))

        All source code available on GitHub: https://github.com/cloudera/python-ngrams
  • 20. Native Java

        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class NgramsDriver extends Configured implements Tool {

            public int run(String[] args) throws Exception {
                Job job = new Job(getConf());
                job.setJarByClass(getClass());

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                job.setMapperClass(NgramsMapper.class);
                job.setCombinerClass(NgramsReducer.class);
                job.setReducerClass(NgramsReducer.class);

                job.setOutputKeyClass(TextTriple.class);
                job.setOutputValueClass(IntWritable.class);

                job.setNumReduceTasks(10);

                return job.waitForCompletion(true) ? 0 : 1;
            }

            public static void main(String[] args) throws Exception {
                int exitCode = ToolRunner.run(new NgramsDriver(), args);
                System.exit(exitCode);
            }
        }

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;
        import org.apache.log4j.Logger;

        public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

            private Logger LOG = Logger.getLogger(getClass());

            private int expectedTokens;

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
                LOG.info("inputFile: " + inputFile);
                Pattern c = Pattern.compile("([\\d]+)gram");
                Matcher m = c.matcher(inputFile);
                m.find();
                expectedTokens = Integer.parseInt(m.group(1));
                return;
            }

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] data = value.toString().split("\t");

                if (data.length < 3) {
                    return;
                }

                String[] ngram = data[0].split("\\s+");
                String year = data[1];
                IntWritable count = new IntWritable(Integer.parseInt(data[2]));

                if (ngram.length != this.expectedTokens) {
                    return;
                }

                // build keyOut
                List<String> triple = new ArrayList<String>(3);
                triple.add(ngram[0]);
                triple.add(ngram[expectedTokens - 1]);
                Collections.sort(triple);
                triple.add(year);
                TextTriple keyOut = new TextTriple(triple);

                context.write(keyOut, count);
            }
        }

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Reducer;

        public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

            @Override
            protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import java.util.List;

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.WritableComparable;

        public class TextTriple implements WritableComparable<TextTriple> {

            private Text first;
            private Text second;
            private Text third;

            public TextTriple() {
                set(new Text(), new Text(), new Text());
            }

            public TextTriple(List<String> list) {
                set(new Text(list.get(0)),
                    new Text(list.get(1)),
                    new Text(list.get(2)));
            }

            public void set(Text first, Text second, Text third) {
                this.first = first;
                this.second = second;
                this.third = third;
            }

            public void write(DataOutput out) throws IOException {
                first.write(out);
                second.write(out);
                third.write(out);
            }

            public void readFields(DataInput in) throws IOException {
                first.readFields(in);
                second.readFields(in);
                third.readFields(in);
            }

            @Override
            public int hashCode() {
                return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
            }

            @Override
            public boolean equals(Object obj) {
                if (obj instanceof TextTriple) {
                    TextTriple tt = (TextTriple) obj;
                    return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
                }
                return false;
            }

            @Override
            public String toString() {
                return first + "\t" + second + "\t" + third;
            }

            public int compareTo(TextTriple other) {
                int comp = first.compareTo(other.first);
                if (comp != 0) {
                    return comp;
                }
                comp = second.compareTo(other.second);
                if (comp != 0) {
                    return comp;
                }
                return third.compareTo(other.third);
            }
        }
  • 21. Native Java
        • Maximum flexibility
        • Fastest performance
        • Native to Hadoop
        • Most difficult to write
  • 22. Python implementation strategies
        • Hadoop Streaming
          • mrjob
          • dumbo
          • hadoopy
        • Hadoop Pipes
          • pydoop
        • Non-Hadoop
          • Disco
          • octopy
  • 23. Hadoop Streaming: execution

        hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
            -input /ngrams \
            -output /output-streaming \
            -mapper mapper.py \
            -combiner reducer.py \
            -reducer reducer.py \
            -jobconf stream.num.map.output.key.fields=3 \
            -jobconf stream.num.reduce.output.key.fields=3 \
            -jobconf mapred.reduce.tasks=10 \
            -file mapper.py \
            -file reducer.py
  • 24. Hadoop Streaming: code [shown as an image in the original slide]
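        Since the slide's code survives only as a screenshot, here is a minimal sketch of what the mapper.py and reducer.py used above might look like, written in Python 2 (the test cluster ran Python 2.6). The per-file expected-token check from the Java version is omitted for brevity, and the exact code is an assumption rather than the author's (theirs is on GitHub). Note that the reducer must detect key boundaries itself; itertools.groupby over the sorted stdin stream is one common way.

        #!/usr/bin/env python
        # mapper.py -- read tab-separated (n-gram, year, count, ...) records from
        # stdin and emit a 3-field key (word1, word2, year) plus the count
        import sys

        for line in sys.stdin:
            data = line.rstrip('\n').split('\t')
            if len(data) < 3:
                continue
            ngram = data[0].split()
            # lexicographically order the first and last words of the n-gram
            word1, word2 = sorted((ngram[0], ngram[-1]))
            print '%s\t%s\t%s\t%s' % (word1, word2, data[1], data[2])

        #!/usr/bin/env python
        # reducer.py -- Streaming hands us key-sorted lines on stdin, so key
        # boundaries are detected manually (here via itertools.groupby)
        import sys
        from itertools import groupby

        def parse(stdin):
            for line in stdin:
                fields = line.rstrip('\n').split('\t')
                yield tuple(fields[:3]), int(fields[3])

        for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
            total = sum(count for _, count in group)
            print '%s\t%d' % ('\t'.join(key), total)

        These match the -jobconf stream.num.*.output.key.fields=3 settings in the execution command, which tell Streaming that the first three tab-separated fields form the key.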
  • 25. Hadoop Streaming: features
        • Canonical method for using any executable as mapper/reducer
        • Includes shell commands, like grep
        • Transparent communication with Hadoop through stdin/stdout
        • Key boundaries manually detected in reducer
        • Built in with Hadoop: should require no additional framework installation
        • Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
  • 26. mrjob

        class NgramNeighbors(MRJob):
            # specify input/intermediate/output serialization
            # default output protocol is JSON; here we set it to text
            OUTPUT_PROTOCOL = RawProtocol

            def mapper(self, key, line):
                pass

            def combiner(self, key, counts):
                pass

            def reducer(self, key, counts):
                pass

        if __name__ == '__main__':
            # sets up a runner, based on command line options
            NgramNeighbors.run()
  • 27. mrjob: runner

        ./ngrams.py -r hadoop \
            --hadoop-bin /usr/bin/hadoop \
            --jobconf mapred.reduce.tasks=10 \
            -o hdfs:///output-mrjob \
            hdfs:///ngrams
  • 28. mrjob: code [shown as an image in the original slide]
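        The full mrjob implementation is likewise an image in the deck. The sketch below is a hypothetical reconstruction from the slide-26 skeleton and the earlier pseudocode, not the author's exact code (which lives at https://github.com/cloudera/python-ngrams):

        # ngrams.py -- hypothetical reconstruction of the mrjob job
        from mrjob.job import MRJob
        from mrjob.protocol import RawProtocol

        class NgramNeighbors(MRJob):
            # emit plain tab-separated text instead of the default JSON
            OUTPUT_PROTOCOL = RawProtocol

            def mapper(self, key, line):
                data = line.split('\t')
                if len(data) < 3:
                    return
                ngram = data[0].split()
                # lexicographically order the first and last words
                word1, word2 = sorted((ngram[0], ngram[-1]))
                yield (word1, word2, data[1]), int(data[2])

            def combiner(self, key, counts):
                yield key, sum(counts)

            def reducer(self, key, counts):
                # RawProtocol expects a string key and a string value
                yield '\t'.join(key), str(sum(counts))

        if __name__ == '__main__':
            NgramNeighbors.run()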
  • 29. mrjob: features
        • Abstracted MapReduce interface
        • Handles complex Python objects
        • Multi-step MapReduce workflows
        • Extremely tight AWS integration
        • Easily choose to run locally, on a Hadoop cluster, or on EMR
        • Actively developed; great documentation
  • 30. mrjob: serialization

        class MyMRJob(mrjob.job.MRJob):
            INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol  # default
            INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol   # default
            OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol     # default

        Available protocols:
        RawProtocol / RawValueProtocol
        JSONProtocol / JSONValueProtocol
        PickleProtocol / PickleValueProtocol
        ReprProtocol / ReprValueProtocol

        Custom protocols can be written.
        No current support for binary serialization schemes.
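        Custom protocols are duck-typed: any object with read(line) and write(key, value) methods works. A minimal illustrative sketch (hypothetical, not from the slides):

        # a hypothetical custom protocol: key and value as tab-separated text
        class TabProtocol(object):

            def read(self, line):
                # split one input line into (key, value)
                key, _, value = line.partition('\t')
                return key, value

            def write(self, key, value):
                # render (key, value) back into one output line
                return '%s\t%s' % (key, value)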
  • 31. dumbo
        • Similar in spirit to mrjob:
          • abstracted
          • complex objects
          • various runners
          • composable jobs
        • Sporadically developed?
        • Documentation is a series of blog posts
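        For comparison, a minimal sketch of the same job in dumbo, based on its documented mapper/reducer conventions; this code and the launch command are assumptions, not material from the slides:

        # ngrams_dumbo.py -- launched with dumbo's CLI, e.g.:
        #   dumbo start ngrams_dumbo.py -hadoop /usr/lib/hadoop \
        #       -input /ngrams -output /output-dumbo
        def mapper(key, value):
            data = value.split('\t')
            if len(data) < 3:
                return
            ngram = data[0].split()
            word1, word2 = sorted((ngram[0], ngram[-1]))
            yield (word1, word2, data[1]), int(data[2])

        def reducer(key, values):
            yield key, sum(values)

        if __name__ == '__main__':
            import dumbo
            dumbo.run(mapper, reducer, combiner=reducer)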
  • 32. dumbo: serialization
        • Typed bytes added to Hadoop, allowing binary data
        • ctypedbytes
          • binary serialization
          • packs Python objects in C structs
        • Much faster and more efficient than JSON or pickle
        • Natively read SequenceFiles
        • Execute code from any Python egg or JAR
        • Point to any Java InputFormat
  • 33. dumbo: installation notes
        • Required manual install on each node
        • dumbo and typedbytes had to be installed as Python eggs
        • Had trouble running a combiner due to MemoryErrors
  • 34. hadoopy
        • Similar to dumbo, with better docs
        • Typed bytes serialization
        • Experimental HBase integration
        • Allows launching Python jobs even on nodes that do not have Python
        • No command-line utility: must launch MR jobs from within a Python program
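        Because there is no CLI, a hadoopy job is a job script plus a Python launcher. A hedged sketch based on hadoopy's documented run() and launch_frozen() entry points; file names and paths are illustrative, not from the slides:

        # ngrams_hadoopy.py -- the job script
        import hadoopy

        def mapper(key, value):
            data = value.split('\t')
            if len(data) < 3:
                return
            ngram = data[0].split()
            word1, word2 = sorted((ngram[0], ngram[-1]))
            yield (word1, word2, data[1]), int(data[2])

        def reducer(key, values):
            yield key, sum(values)

        if __name__ == '__main__':
            hadoopy.run(mapper, reducer, reducer)  # mapper, reducer, combiner

        # launch.py -- launch_frozen freezes the script together with the Python
        # runtime it needs, which is how jobs can run on nodes without Python
        import hadoopy
        hadoopy.launch_frozen('/ngrams', '/output-hadoopy', 'ngrams_hadoopy.py')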
  • 35. pydoop
        • Wraps Hadoop Pipes (C++ API) instead of Streaming
        • HDFS commands communicate through libhdfs rather than the shell
        • Ability to implement a Python Partitioner, RecordReader, and RecordWriter
        • All input/output must be strings
        • Could not install it
  • 36. luigi
        • Full-fledged workflow management, task scheduling, and dependency resolution tool in Python (similar to Apache Oozie)
        • Built-in support for Hadoop by wrapping Streaming
        • Not as fully featured as mrjob for Hadoop, but easily customizable
        • Internal serialization through repr/eval
        • Actively developed at Spotify
        • README is good but documentation is lacking
  • 37. luigi: runner

        python ngrams.py Ngrams \
            --local-scheduler \
            --n-reduce-tasks 10 \
            --source /ngrams \
            --destination /output-luigi
  • 38. luigi: code [shown as an image in the original slide]
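        As with the other code slides, the luigi task survives only as an image. Below is a plausible reconstruction consistent with the runner invocation above, using luigi's Streaming wrapper (luigi.hadoop.JobTask); the class layout, HDFS targets, and the n_reduce_tasks parameter are assumptions, not the author's exact code:

        # ngrams_luigi.py -- hypothetical reconstruction
        import luigi
        import luigi.hadoop
        import luigi.hdfs

        class InputText(luigi.ExternalTask):
            # pre-existing n-gram data on HDFS
            path = luigi.Parameter()

            def output(self):
                return luigi.hdfs.HdfsTarget(self.path)

        class Ngrams(luigi.hadoop.JobTask):
            source = luigi.Parameter()
            destination = luigi.Parameter()
            # surfaced on the command line as --n-reduce-tasks
            n_reduce_tasks = luigi.IntParameter(default=10)

            def requires(self):
                return InputText(self.source)

            def output(self):
                return luigi.hdfs.HdfsTarget(self.destination)

            def mapper(self, line):
                data = line.split('\t')
                if len(data) < 3:
                    return
                ngram = data[0].split()
                word1, word2 = sorted((ngram[0], ngram[-1]))
                yield (word1, word2, data[1]), int(data[2])

            def combiner(self, key, counts):
                yield key, sum(counts)

            def reducer(self, key, counts):
                yield key, sum(counts)

        if __name__ == '__main__':
            luigi.run()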
  • 46. Python frameworks for Hadoop
        • Hadoop Streaming ✓
        • mrjob (Yelp) ✓
        • dumbo ✓
        • Luigi (Spotify) ✓
        • hadoopy ✓
        • pydoop ❌
        • PySpark: not Hadoop
        • happy: abandoned? Jython-based
        • Disco: not Hadoop
        • octopy: not serious/not Hadoop
        • Mortar Data: Hadoop as a service; supports numpy, scipy, nltk; pip-installable packages in UDFs
        • Pig UDF/Jython: Pig is another talk; Jython is limited
        • hipy: Python syntactic sugar to construct Hive queries
  • 47. Commit activity [charts: mrjob, dumbo]
  • 48. Commit activity [charts: luigi, hadoopy]
  • 49. The cluster
        • 5 virtual machines
          • 4 CPUs
          • 10 GB RAM
          • 100 GB disk
        • CentOS 6.2
        • CDH4 (Hadoop 2)
        • 20 map tasks
        • 10 reduce tasks
        • Python 2.6
  • 50. (Unscientific) performance comparison [chart]
  • 51. (Unscientific) performance comparison [chart; annotation: Streaming has the lowest overhead]
  • 52. (Unscientific) performance comparison [chart; annotation: JSON SerDe]
  • 53. (Unscientific) performance comparison [chart; annotation: combiner was not used]
  • 54. Feature comparison [table]
  • 55. Feature comparison [table]
  • 56. Conclusions
        • Prefer Hadoop Streaming if possible
          • It's easy enough
          • Lowest overhead
        • Prefer mrjob for higher abstraction
          • Actively developed/great documentation
          • Feature-rich (incl. composable jobs)
          • Integration with AWS
        • Prefer luigi for more complicated job flows
          • Actively developed
          • Much more general than purely Hadoop
  • 57. [image slide]
  • 58. Watch the video with slide synchronization on InfoQ.com!
        http://www.infoq.com/presentations/python-hadoop