A Guide to Python Frameworks for Hadoop

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/17fsvKl.

Uri Laserson reviews the different available Python frameworks for Hadoop, including a comparison of performance, ease of use/installation, differences in implementation, and other features. Filmed at qconnewyork.com.

Uri Laserson is a data scientist at Cloudera. Previously, he received his PhD from MIT developing applications of high-throughput DNA sequencing to immunology. During that time, he co-founded Good Start Genetics, a next-generation diagnostics company focused on genetic carrier screening. In 2012 he was named to Forbes's 30 Under 30 list.


Transcript of "A Guide to Python Frameworks for Hadoop"

  1. A Guide to Python Frameworks for Hadoop. Uri Laserson | Data Scientist. laserson@cloudera.com. 14 June 2013
  2. InfoQ.com: News & Community Site. • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/python-hadoop
  3. Presented at QCon New York. www.qconnewyork.com. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: practitioner-driven conference designed for YOU: influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  4. About the speaker. • Joined Cloudera late 2012 • Focused on life sciences/medical • PhD in BME/computational biology at MIT/Harvard (2005-2012) • Focused on genomics • Cofounded Good Start Genetics (2007-) • Applying next-gen DNA sequencing to genetic carrier screening
  5. About the speaker. • No formal training in computer science • Never touched Java • Almost all work using Python
  6. (image-only slide)
  7. Python frameworks for Hadoop. • Hadoop Streaming • mrjob (Yelp) • dumbo • Luigi (Spotify) • hadoopy • pydoop • PySpark • happy • Disco • octopy • Mortar Data • Pig UDF/Jython • hipy
  8. Goals for Python framework. 1. “Pseudocodiness”/simplicity 2. Flexibility/generality 3. Ease of use/installation 4. Performance
  9. An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data. http://books.google.com/ngrams
  10. An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data. http://books.google.com/ngrams (diagram: eight numbered words bracketed together as an 8-gram)
  11. "A partial differential equation is an equation that contains partial derivatives."
  12. A partial differential equation is an equation that contains partial derivatives. 1-grams:
      A 1
      partial 2
      differential 1
      equation 2
      is 1
      an 1
      that 1
      contains 1
      derivatives. 1
  13. A partial differential equation is an equation that contains partial derivatives. 2-grams:
      A partial 1
      partial differential 1
      differential equation 1
      equation is 1
      is an 1
      an equation 1
      equation that 1
      that contains 1
      contains partial 1
      partial derivatives. 1
  14. A partial differential equation is an equation that contains partial derivatives. 5-grams:
      A partial differential equation is 1
      partial differential equation is an 1
      differential equation is an equation 1
      equation is an equation that 1
      is an equation that contains 1
      an equation that contains partial 1
      equation that contains partial derivatives. 1
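The windowing behind these slides is simple to express in Python. This snippet is an illustrative sketch, not code from the talk; the ngrams helper is hypothetical:

      # slide an n-word window across a list of words, yielding each n-gram
      def ngrams(words, n):
          for i in range(len(words) - n + 1):
              yield tuple(words[i:i + n])

      sentence = ("A partial differential equation is an equation "
                  "that contains partial derivatives.").split()
      print(list(ngrams(sentence, 5))[0])
      # ('A', 'partial', 'differential', 'equation', 'is')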
  15. (image-only slide)
  16. Sample records (columns: 2-gram, year, matches, pages, volumes):
      flourished in 1993 2 2 2
      flourished in 1998 2 2 1
      flourished in 1999 6 6 4
      flourished in 2000 5 5 5
      flourished in 2001 1 1 1
      flourished in 2002 7 7 3
      flourished in 2003 9 9 4
      flourished in 2004 22 21 13
      flourished in 2005 37 37 22
      flourished in 2006 55 55 38
      flourished in 2007 99 98 76
      flourished in 2008 220 215 118
      fluid of 1899 2 2 1
      fluid of 2000 3 3 1
      fluid of 2002 2 1 1
      fluid of 2003 3 3 1
      fluid of 2004 3 3 3
  17. Compute how often two words are near each other in a given year. Two words are "near" if they are both present in a 2-, 3-, 4-, or 5-gram.
  18. Raw data:
      ...2-grams...
      (cat, the) 1999 14
      (the, cat) 1999 7002
      ...3-grams...
      (the, cheshire, cat) 1999 563
      ...4-grams...
      ...5-grams...
      (the, cat, in, the, hat) 1999 1023
      (the, dog, chased, the, cat) 1999 403
      (cat, is, one, of, the) 1999 24
      Aggregated results (lexicographic ordering):
      (cat, the) 1999 8006
      (hat, the) 1999 1023
      Internal n-grams counted by smaller n-grams: • avoids double-counting • increases sensitivity (observed at least 40 times)
  19. Pseudocode for MapReduce:
      def map(record):
          (ngram, year, count) = unpack(record)
          # ensure word1 has the lexicographically first word:
          (word1, word2) = sorted(ngram[first], ngram[last])
          key = (word1, word2, year)
          emit(key, count)

      def reduce(key, values):
          emit(key, sum(values))
      All source code available on GitHub: https://github.com/cloudera/python-ngrams
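The same pseudocode, rendered as runnable Python for reference. This is an editorial sketch: the function names are hypothetical, and the record format is assumed to be the tab-delimited data shown on slide 16:

      def map_record(record):
          # record: "w1 w2 ... wn<TAB>year<TAB>count"
          ngram_field, year, count = record.split("\t")[:3]
          words = ngram_field.split()
          # ensure word1 is the lexicographically first word
          word1, word2 = sorted((words[0], words[-1]))
          yield (word1, word2, year), int(count)

      def reduce_records(key, values):
          yield key, sum(values)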
  20. Native Java.
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class NgramsDriver extends Configured implements Tool {

          public int run(String[] args) throws Exception {
              Job job = new Job(getConf());
              job.setJarByClass(getClass());

              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              job.setMapperClass(NgramsMapper.class);
              job.setCombinerClass(NgramsReducer.class);
              job.setReducerClass(NgramsReducer.class);

              job.setOutputKeyClass(TextTriple.class);
              job.setOutputValueClass(IntWritable.class);

              job.setNumReduceTasks(10);

              return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] args) throws Exception {
              int exitCode = ToolRunner.run(new NgramsDriver(), args);
              System.exit(exitCode);
          }
      }

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;
      import org.apache.log4j.Logger;

      public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

          private Logger LOG = Logger.getLogger(getClass());

          private int expectedTokens;

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
              String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
              LOG.info("inputFile: " + inputFile);
              Pattern c = Pattern.compile("([\\d]+)gram");
              Matcher m = c.matcher(inputFile);
              m.find();
              expectedTokens = Integer.parseInt(m.group(1));
              return;
          }

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] data = value.toString().split("\\t");

              if (data.length < 3) {
                  return;
              }

              String[] ngram = data[0].split("\\s+");
              String year = data[1];
              IntWritable count = new IntWritable(Integer.parseInt(data[2]));

              if (ngram.length != this.expectedTokens) {
                  return;
              }

              // build keyOut
              List<String> triple = new ArrayList<String>(3);
              triple.add(ngram[0]);
              triple.add(ngram[expectedTokens - 1]);
              Collections.sort(triple);
              triple.add(year);
              TextTriple keyOut = new TextTriple(triple);

              context.write(keyOut, count);
          }
      }

      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.mapreduce.Reducer;

      public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

          @Override
          protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.WritableComparable;

      public class TextTriple implements WritableComparable<TextTriple> {

          private Text first;
          private Text second;
          private Text third;

          public TextTriple() {
              set(new Text(), new Text(), new Text());
          }

          public TextTriple(List<String> list) {
              set(new Text(list.get(0)),
                  new Text(list.get(1)),
                  new Text(list.get(2)));
          }

          public void set(Text first, Text second, Text third) {
              this.first = first;
              this.second = second;
              this.third = third;
          }

          public void write(DataOutput out) throws IOException {
              first.write(out);
              second.write(out);
              third.write(out);
          }

          public void readFields(DataInput in) throws IOException {
              first.readFields(in);
              second.readFields(in);
              third.readFields(in);
          }

          @Override
          public int hashCode() {
              return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
          }

          @Override
          public boolean equals(Object obj) {
              if (obj instanceof TextTriple) {
                  TextTriple tt = (TextTriple) obj;
                  return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
              }
              return false;
          }

          @Override
          public String toString() {
              return first + "\t" + second + "\t" + third;
          }

          public int compareTo(TextTriple other) {
              int comp = first.compareTo(other.first);
              if (comp != 0) {
                  return comp;
              }
              comp = second.compareTo(other.second);
              if (comp != 0) {
                  return comp;
              }
              return third.compareTo(other.third);
          }
      }
  21. Native Java. • Maximum flexibility • Fastest performance • Native to Hadoop • Most difficult to write
  22. Python implementation strategies. • Hadoop Streaming (mrjob, dumbo, hadoopy) • Hadoop Pipes (pydoop) • Non-Hadoop (Disco, octopy)
  23. Hadoop Streaming: execution.
      hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
          -input /ngrams \
          -output /output-streaming \
          -mapper mapper.py \
          -combiner reducer.py \
          -reducer reducer.py \
          -jobconf stream.num.map.output.key.fields=3 \
          -jobconf stream.num.reduce.output.key.fields=3 \
          -jobconf mapred.reduce.tasks=10 \
          -file mapper.py \
          -file reducer.py
  24. Hadoop Streaming: code. (The slide shows mapper.py and reducer.py; the image is not in this transcript. A sketch follows below.)
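What follows is an editorial sketch of what mapper.py and reducer.py plausibly look like, consistent with the execution command on slide 23 and the features listed on slide 25; it is not the slide's actual code. Note the manual key-boundary detection in the reducer, done here with itertools.groupby:

      #!/usr/bin/env python
      # mapper.py (sketch): read tab-delimited n-gram records on stdin,
      # emit "word1<TAB>word2<TAB>year<TAB>count" with the words sorted
      import sys

      for line in sys.stdin:
          data = line.rstrip("\n").split("\t")
          if len(data) < 3:
              continue
          words = data[0].split()
          word1, word2 = sorted((words[0], words[-1]))
          print("\t".join((word1, word2, data[1], data[2])))

      #!/usr/bin/env python
      # reducer.py (sketch): Hadoop delivers key-sorted lines on stdin;
      # key boundaries must be detected manually, here via groupby
      import sys
      from itertools import groupby

      def parsed(stdin):
          for line in stdin:
              fields = line.rstrip("\n").split("\t")
              yield tuple(fields[:3]), int(fields[3])

      for key, group in groupby(parsed(sys.stdin), key=lambda kv: kv[0]):
          total = sum(count for _, count in group)
          print("\t".join(key) + "\t" + str(total))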
  25. Hadoop Streaming: features. • Canonical method for using any executable as mapper/reducer • Includes shell commands, like grep • Transparent communication with Hadoop through stdin/stdout • Key boundaries manually detected in reducer • Built-in with Hadoop: should require no additional framework installation • Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
  26. mrjob.
      class NgramNeighbors(MRJob):
          # specify input/intermed/output serialization
          # default output protocol is JSON; here we set it to text
          OUTPUT_PROTOCOL = RawProtocol

          def mapper(self, key, line):
              pass

          def combiner(self, key, counts):
              pass

          def reducer(self, key, counts):
              pass

      if __name__ == '__main__':
          # sets up a runner, based on command line options
          NgramNeighbors.run()
  27. mrjob: runner.
      ./ngrams.py -r hadoop \
          --hadoop-bin /usr/bin/hadoop \
          --jobconf mapred.reduce.tasks=10 \
          -o hdfs:///output-mrjob \
          hdfs:///ngrams
  28. mrjob: code. (The slide shows the full job source; the image is not in this transcript. A filled-in sketch follows below.)
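Here is an editorial sketch that fills in the skeleton from slide 26 along the lines of the pseudocode on slide 19; it is not the talk's actual code. RawProtocol expects string keys and values, hence the join/str in the reducer:

      from mrjob.job import MRJob
      from mrjob.protocol import RawProtocol

      class NgramNeighbors(MRJob):
          # emit plain tab-delimited text instead of the default JSON
          OUTPUT_PROTOCOL = RawProtocol

          def mapper(self, _, line):
              data = line.split("\t")
              if len(data) < 3:
                  return
              words = data[0].split()
              word1, word2 = sorted((words[0], words[-1]))
              yield (word1, word2, data[1]), int(data[2])

          def combiner(self, key, counts):
              yield key, sum(counts)

          def reducer(self, key, counts):
              # RawProtocol writes key and value as raw strings
              yield "\t".join(key), str(sum(counts))

      if __name__ == '__main__':
          NgramNeighbors.run()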
  29. mrjob: features. • Abstracted MapReduce interface • Handles complex Python objects • Multi-step MapReduce workflows • Extremely tight AWS integration • Easily choose to run locally, on Hadoop cluster, or on EMR • Actively developed; great documentation
  30. mrjob: serialization. Defaults:
      class MyMRJob(mrjob.job.MRJob):
          INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
          INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
          OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol
      Available: RawProtocol / RawValueProtocol, JSONProtocol / JSONValueProtocol, PickleProtocol / PickleValueProtocol, ReprProtocol / ReprValueProtocol. Custom protocols can be written. No current support for binary serialization schemes.
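A custom protocol in mrjob is just an object with read and write methods; a minimal sketch (the TabProtocol class here is hypothetical, not from the talk):

      # tab-joined key fields with an integer value
      class TabProtocol(object):
          def read(self, line):
              fields = line.split("\t")
              return tuple(fields[:-1]), int(fields[-1])

          def write(self, key, value):
              return "\t".join(list(key) + [str(value)])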
  31. dumbo. • Similar in spirit to mrjob • abstracted • complex objects • various runners • composable jobs • Sporadically developed? • Documentation is a series of blog posts
  32. dumbo: serialization. • Typed bytes added to Hadoop allowing binary data • ctypedbytes • binary serialization • packs Python objects in C structs • Much faster and more efficient than JSON or pickle • Natively read SequenceFile • Execute code from any Python egg or JAR • Point to any Java InputFormat
  33. dumbo: installation notes. • Required manual install on each node • dumbo and typedbytes had to be installed as Python eggs • Had trouble running a combiner due to MemoryErrors
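For comparison, a dumbo job has the same mapper/reducer shape and is driven by dumbo.run. This is an editorial sketch, not code from the talk; for text input, dumbo passes the byte offset as the key and the line as the value:

      def mapper(key, value):
          data = value.split("\t")
          if len(data) >= 3:
              words = data[0].split()
              word1, word2 = sorted((words[0], words[-1]))
              yield (word1, word2, data[1]), int(data[2])

      def reducer(key, values):
          yield key, sum(values)

      if __name__ == "__main__":
          import dumbo
          dumbo.run(mapper, reducer, combiner=reducer)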
  34. hadoopy. • Similar to dumbo, with better docs • Typedbytes serialization • Experimental HBase integration • Allows launching Python jobs even on nodes that do not have Python • No command-line utility: must launch MR jobs within a Python program
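Because hadoopy has no command-line launcher, the job script is submitted from inside a Python program. A minimal sketch under that assumption, hedged rather than taken from the slides: hadoopy.run executes the mapper/reducer inside the task, and a separate driver calls something like hadoopy.launch('/ngrams', '/output-hadoopy', 'ngrams_hadoopy.py') to submit it:

      # ngrams_hadoopy.py (sketch): same mapper/reducer shape as dumbo,
      # but the job is launched from inside a Python program
      import hadoopy

      def mapper(key, value):
          data = value.split("\t")
          if len(data) >= 3:
              words = data[0].split()
              word1, word2 = sorted((words[0], words[-1]))
              yield (word1, word2, data[1]), int(data[2])

      def reducer(key, values):
          yield key, sum(values)

      if __name__ == "__main__":
          hadoopy.run(mapper, reducer)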
  35. pydoop. • Wraps Hadoop Pipes (C++ API) instead of Streaming • HDFS commands communicate through libhdfs rather than the shell • Ability to implement a Python Partitioner, RecordReader, and RecordWriter • All input/output must be strings • Could not install it
  36. luigi. • Full-fledged workflow management, task scheduling, and dependency resolution tool in Python (similar to Apache Oozie) • Built-in support for Hadoop by wrapping Streaming • Not as fully featured as mrjob for Hadoop, but easily customizable • Internal serialization through repr/eval • Actively developed at Spotify • README is good but documentation is lacking
  37. luigi: runner.
      python ngrams.py Ngrams \
          --local-scheduler \
          --n-reduce-tasks 10 \
          --source /ngrams \
          --destination /output-luigi
  38. luigi: code. (The slide shows the job source; the image is not in this transcript. A sketch follows below.)
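Below is an editorial sketch consistent with the runner invocation on slide 37, written against luigi's 2013-era luigi.hadoop API; the InputText helper task and the parameter wiring are assumptions, not the talk's actual code:

      import luigi
      import luigi.hadoop
      import luigi.hdfs

      class InputText(luigi.ExternalTask):
          # the raw n-gram data already sitting in HDFS
          path = luigi.Parameter()

          def output(self):
              return luigi.hdfs.HdfsTarget(self.path)

      class Ngrams(luigi.hadoop.JobTask):
          source = luigi.Parameter()
          destination = luigi.Parameter()
          n_reduce_tasks = luigi.IntParameter(default=10)

          def requires(self):
              return InputText(self.source)

          def output(self):
              return luigi.hdfs.HdfsTarget(self.destination)

          def mapper(self, line):
              data = line.split("\t")
              if len(data) < 3:
                  return
              words = data[0].split()
              word1, word2 = sorted((words[0], words[-1]))
              yield (word1, word2, data[1]), int(data[2])

          def reducer(self, key, values):
              yield key, sum(values)

      if __name__ == '__main__':
          luigi.run()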
  39-45. Python frameworks for Hadoop. (Progressive build: these seven slides repeat the framework list, each adding one of the annotations shown in full on slide 46.)
  46. Python frameworks for Hadoop. • Hadoop Streaming ✓ • mrjob (Yelp) ✓ • dumbo ✓ • Luigi (Spotify) ✓ • hadoopy ✓ • pydoop ❌ • PySpark: not Hadoop • happy: abandoned? Jython-based • Disco: not Hadoop • octopy: not serious/not Hadoop • Mortar Data: HaaS; supports numpy, scipy, nltk; pip-installable in UDF • Pig UDF/Jython: Pig is another talk; Jython limited • hipy: Python syntactic sugar to construct Hive queries
  47. Commit activity. (charts: mrjob, dumbo)
  48. Commit activity. (charts: luigi, hadoopy)
  49. The cluster. • 5 virtual machines • 4 CPUs • 10 GB RAM • 100 GB disk • CentOS 6.2 • CDH4 (Hadoop 2) • 20 map tasks • 10 reduce tasks • Python 2.6
  50. (Unscientific) performance comparison. (chart)
  51. (Unscientific) performance comparison. (chart) Streaming has lowest overhead.
  52. (Unscientific) performance comparison. (chart) JSON SerDe.
  53. (Unscientific) performance comparison. (chart) Combiner was not used.
  54. Feature comparison. (table)
  55. Feature comparison. (table)
  56. Conclusions. • Prefer Hadoop Streaming if possible • It's easy enough • Lowest overhead • Prefer mrjob for higher abstraction • Actively developed/great documentation • Feature-rich (incl. composable jobs) • Integration with AWS • Prefer luigi for more complicated job flows • Actively developed • Much more general than purely Hadoop
  57. (image-only slide)
  58. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/python-hadoop
