Python in the Hadoop Ecosystem (Rock Health presentation)


A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.


1. A Guide to Python Frameworks for Hadoop. Uri Laserson, laserson@cloudera.com, 20 March 2014
2. Goals for today
    1. Easy to jump into Hadoop with Python
    2. Describe 5 ways to use Python with Hadoop, batch and interactive
    3. Guidelines for choosing a Python framework
3. Code: https://github.com/laserson/rock-health-python
    Blog post: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
    Slides: http://www.slideshare.net/urilaserson/
4. About the speaker
    • Joined Cloudera late 2012
    • Focus on life sciences/medical
    • PhD in BME/computational biology at MIT/Harvard (2005-2012); focused on genomics
    • Cofounded Good Start Genetics (2007-); applying next-gen DNA sequencing to genetic carrier screening
5. About the speaker
    • No formal training in computer science
    • Never touched Java
    • Almost all work using Python
7. Python frameworks for Hadoop
    • Hadoop Streaming
    • mrjob (Yelp)
    • dumbo
    • Luigi (Spotify)
    • hadoopy
    • pydoop
    • PySpark
    • happy
    • Disco
    • octopy
    • Mortar Data
    • Pig UDF/Jython
    • hipy
    • Impala + Numba
8. Goals for a Python framework
    1. “Pseudocodiness”/simplicity
    2. Flexibility/generality
    3. Ease of use/installation
    4. Performance
9. Python frameworks for Hadoop (the same list, repeated)
10. Python frameworks for Hadoop
    • Hadoop Streaming
    • mrjob (Yelp)
    • dumbo
    • Luigi (Spotify)
    • hadoopy
    • pydoop
    • PySpark
    • happy: abandoned? Jython-based
    • Disco: not Hadoop
    • octopy: not serious/not Hadoop
    • Mortar Data: HaaS; supports numpy, scipy, nltk, pip-installable in UDF
    • Pig UDF/Jython: Pig is another talk; Jython limited
    • hipy: Python syntactic sugar to construct Hive queries
    • Impala + Numba
11. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams). An n-gram is a tuple of n words.
12. Problem: aggregating the Google n-gram data (http://books.google.com/ngrams). An n-gram is a tuple of n words. (Figure: eight consecutive numbered words bracketed as an 8-gram.)
  13. 13. 13 "A partial differential equation is an equation that contains partial derivatives."
14. 1-grams of "A partial differential equation is an equation that contains partial derivatives.":
    A 1, partial 2, differential 1, equation 2, is 1, an 1, that 1, contains 1, derivatives. 1
15. 2-grams of the same sentence:
    A partial 1, partial differential 1, differential equation 1, equation is 1, is an 1, an equation 1, equation that 1, that contains 1, contains partial 1, partial derivatives. 1
16. 5-grams of the same sentence:
    A partial differential equation is 1, partial differential equation is an 1, differential equation is an equation 1, equation is an equation that 1, is an equation that contains 1, an equation that contains partial 1, equation that contains partial derivatives. 1
18. goto code
19. Sample of the raw Google 2-gram data (columns: 2-gram, year, matches, pages, volumes):
    flourished in    1993      2      2      2
    flourished in    1998      2      2      1
    flourished in    1999      6      6      4
    flourished in    2000      5      5      5
    flourished in    2001      1      1      1
    flourished in    2002      7      7      3
    flourished in    2003      9      9      4
    flourished in    2004     22     21     13
    flourished in    2005     37     37     22
    flourished in    2006     55     55     38
    flourished in    2007     99     98     76
    flourished in    2008    220    215    118
    fluid of         1899      2      2      1
    fluid of         2000      3      3      1
    fluid of         2002      2      1      1
    fluid of         2003      3      3      1
    fluid of         2004      3      3      3
20. Compute how often two words are near each other in a given year. Two words are “near” if they are both present in a 2-, 3-, 4-, or 5-gram.
21. Raw data:
    ...2-grams...
    (cat, the)                    1999    14
    (the, cat)                    1999    7002
    ...3-grams...
    (the, cheshire, cat)          1999    563
    ...4-grams...
    ...5-grams...
    (the, cat, in, the, hat)      1999    1023
    (the, dog, chased, the, cat)  1999    403
    (cat, is, one, of, the)       1999    24

    Aggregated results:
    (cat, the)    1999    8006
    (hat, the)    1999    1023

    Word pairs use lexicographic ordering, and internal n-grams are counted by the smaller n-grams: this avoids double-counting and increases sensitivity (observed at least 40 times).
22. What is Hadoop?
    • Ecosystem of tools
    • Core is the HDFS file system
    • Downloadable set of jars that can be run on any machine
23. HDFS design assumptions
    • Based on Google File System
    • Files are large (GBs to TBs)
    • Failures are common
        • Massive scale means failures very likely
        • Disk, node, or network failures
    • Accesses are large and sequential
    • Files are append-only
24. HDFS properties
    • Fault-tolerant: gracefully responds to node/disk/network failures
    • Horizontally scalable: low marginal cost
    • High-bandwidth
    (Figure: an input file split into blocks 1-5, each replicated across Nodes A-E.)
25. MapReduce computation
26. MapReduce computation
    • Structured as
        1. Embarrassingly parallel “map stage”
        2. Cluster-wide distributed sort (“shuffle”)
        3. Aggregation “reduce stage”
    • Data-locality: process the data where it is stored
    • Fault-tolerance: failed tasks automatically detected and restarted
    • Schema-on-read: data need not be stored conforming to a rigid schema
27. Pseudocode for MapReduce

    def map(record):
        (ngram, year, count) = unpack(record)
        # ensure word1 is the lexicographically first word
        (word1, word2) = sorted((ngram[first], ngram[last]))
        key = (word1, word2, year)
        emit(key, count)

    def reduce(key, values):
        emit(key, sum(values))

    All source code available on GitHub: https://github.com/laserson/rock-health-python
28. Native Java

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class NgramsDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            Job job = new Job(getConf());
            job.setJarByClass(getClass());
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(NgramsMapper.class);
            job.setCombinerClass(NgramsReducer.class);
            job.setReducerClass(NgramsReducer.class);
            job.setOutputKeyClass(TextTriple.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(10);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new NgramsDriver(), args);
            System.exit(exitCode);
        }
    }

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.log4j.Logger;

    public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {
        private Logger LOG = Logger.getLogger(getClass());
        private int expectedTokens;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            LOG.info("inputFile: " + inputFile);
            Pattern c = Pattern.compile("([\\d]+)gram");
            Matcher m = c.matcher(inputFile);
            m.find();
            expectedTokens = Integer.parseInt(m.group(1));
            return;
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] data = value.toString().split("\t");
            if (data.length < 3) {
                return;
            }
            String[] ngram = data[0].split("\\s+");
            String year = data[1];
            IntWritable count = new IntWritable(Integer.parseInt(data[2]));
            if (ngram.length != this.expectedTokens) {
                return;
            }
            // build keyOut
            List<String> triple = new ArrayList<String>(3);
            triple.add(ngram[0]);
            triple.add(ngram[expectedTokens - 1]);
            Collections.sort(triple);
            triple.add(year);
            TextTriple keyOut = new TextTriple(triple);
            context.write(keyOut, count);
        }
    }

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {
        @Override
        protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class TextTriple implements WritableComparable<TextTriple> {
        private Text first;
        private Text second;
        private Text third;

        public TextTriple() {
            set(new Text(), new Text(), new Text());
        }

        public TextTriple(List<String> list) {
            set(new Text(list.get(0)), new Text(list.get(1)), new Text(list.get(2)));
        }

        public void set(Text first, Text second, Text third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }

        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
            third.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
            third.readFields(in);
        }

        @Override
        public int hashCode() {
            return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof TextTriple) {
                TextTriple tt = (TextTriple) obj;
                return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
            }
            return false;
        }

        @Override
        public String toString() {
            return first + "\t" + second + "\t" + third;
        }

        public int compareTo(TextTriple other) {
            int comp = first.compareTo(other.first);
            if (comp != 0) {
                return comp;
            }
            comp = second.compareTo(other.second);
            if (comp != 0) {
                return comp;
            }
            return third.compareTo(other.third);
        }
    }
29. Native Java
    • Maximum flexibility
    • Fastest performance
    • Native to Hadoop
    • Most difficult to write
30. Hadoop Streaming

    hadoop jar hadoop-streaming-*.jar \
        -input path/to/input \
        -output path/to/output \
        -mapper "grep WARN"

31. Hadoop Streaming: features
    • Canonical method for using any executable as mapper/reducer
    • Includes shell commands, like grep
    • Transparent communication with Hadoop through stdin/stdout
    • Key boundaries manually detected in reducer
    • Built-in with Hadoop: should require no additional framework installation
    • Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
32. Hadoop Streaming: goto code
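    The actual Streaming scripts are in the GitHub repo ("goto code"); as a hedged illustration of the points on slide 31 (stdin/stdout communication and manual key-boundary detection in the reducer), a minimal mapper/reducer pair for the 2-gram aggregation might look roughly like the following. The file names, the comma-joined composite key, and the parsing details are assumptions for this sketch, not the repo's exact code.

    # mapper.py: read raw Google 2-gram records from stdin, emit "key<TAB>count".
    # The composite key (word1, word2, year) is kept in a single tab-delimited
    # field so Hadoop's default sort/partition on the first field groups it correctly.
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue
        ngram, year, count = fields[0], fields[1], fields[2]
        words = ngram.split()
        if len(words) < 2:
            continue
        w1, w2 = sorted([words[0], words[-1]])  # lexicographically first word first
        sys.stdout.write('%s,%s,%s\t%s\n' % (w1, w2, year, count))

    # reducer.py: input arrives sorted by key, so key boundaries are detected by hand.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                sys.stdout.write('%s\t%d\n' % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        sys.stdout.write('%s\t%d\n' % (current, total))

    These would be wired into the streaming jar invocation from slide 30, e.g. by adding -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py.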
33. mrjob

    class NgramNeighbors(MRJob):
        # specify input/intermed/output serialization
        # default output protocol is JSON; here we set it to text
        OUTPUT_PROTOCOL = RawProtocol

        def mapper(self, key, line):
            pass

        def combiner(self, key, counts):
            pass

        def reducer(self, key, counts):
            pass

    if __name__ == '__main__':
        # sets up a runner, based on command line options
        NgramNeighbors.run()
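    The method bodies are elided on the slide (the full version is in the repo); a hedged sketch of how they might be filled in for the 2-gram aggregation, mirroring the pseudocode on slide 27 (the parsing details and protocol choices here are assumptions, not the repo's exact code):

    from mrjob.job import MRJob
    from mrjob.protocol import RawProtocol

    class NgramNeighbors(MRJob):
        OUTPUT_PROTOCOL = RawProtocol

        def mapper(self, _, line):
            # default input protocol hands us the raw tab-separated record
            fields = line.split('\t')
            if len(fields) < 3:
                return
            ngram, year, count = fields[0], fields[1], fields[2]
            words = ngram.split()
            w1, w2 = sorted([words[0], words[-1]])
            yield (w1, w2, year), int(count)

        def combiner(self, key, counts):
            yield key, sum(counts)

        def reducer(self, key, counts):
            # RawProtocol expects plain strings for key and value
            yield '\t'.join(key), str(sum(counts))

    if __name__ == '__main__':
        NgramNeighbors.run()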
34. mrjob: features
    • Abstracted MapReduce interface
    • Handles complex Python objects
    • Multi-step MapReduce workflows
    • Extremely tight AWS integration
    • Easily choose to run locally, on a Hadoop cluster, or on EMR
    • Actively developed; great documentation
35. mrjob: goto code
36. mrjob: serialization

    class MyMRJob(mrjob.job.MRJob):
        INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
        INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
        OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

    The values shown are the defaults. Available protocols: RawProtocol / RawValueProtocol, JSONProtocol / JSONValueProtocol, PickleProtocol / PickleValueProtocol, ReprProtocol / ReprValueProtocol. Custom protocols can be written. No current support for binary serialization schemes.
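    Since the slide notes that custom protocols can be written, here is a hedged sketch of what one might look like (a hypothetical tab-separated text protocol, not something from the talk): an mrjob protocol is just an object with read(line) and write(key, value) methods.

    from mrjob.job import MRJob

    class TSVProtocol(object):
        """Hypothetical protocol: plain 'key<TAB>value' text lines."""
        def read(self, line):
            key, value = line.split('\t', 1)
            return key, value

        def write(self, key, value):
            return '%s\t%s' % (key, value)

    class MyMRJob(MRJob):
        OUTPUT_PROTOCOL = TSVProtocol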
37. luigi
    • Full-fledged workflow management, task scheduling, and dependency resolution tool in Python (similar to Apache Oozie)
    • Built-in support for Hadoop by wrapping Streaming
    • Not as fully-featured as mrjob for Hadoop, but easily customizable
    • Internal serialization through repr/eval
    • Actively developed at Spotify
    • README is good but documentation is lacking
38. luigi: goto code
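    As a hedged illustration of luigi's task/dependency model (task names, file names, and the plain local tasks here are made up for the sketch; the repo's version wraps Hadoop Streaming instead):

    import luigi

    class InputData(luigi.ExternalTask):
        """Declares an existing input file as an upstream dependency."""
        def output(self):
            return luigi.LocalTarget('ngrams.tsv')

    class CountPairs(luigi.Task):
        """Depends on InputData; writes aggregated pair counts to its own target."""
        def requires(self):
            return InputData()

        def output(self):
            return luigi.LocalTarget('pair_counts.tsv')

        def run(self):
            counts = {}
            with self.input().open('r') as infile:
                for line in infile:
                    fields = line.rstrip('\n').split('\t')
                    if len(fields) < 3:
                        continue
                    ngram, year, count = fields[0], fields[1], fields[2]
                    words = ngram.split()
                    key = tuple(sorted([words[0], words[-1]])) + (year,)
                    counts[key] = counts.get(key, 0) + int(count)
            with self.output().open('w') as outfile:
                for key, total in counts.items():
                    outfile.write('%s\t%d\n' % ('\t'.join(key), total))

    if __name__ == '__main__':
        luigi.run()

    luigi resolves the dependency graph itself: CountPairs only runs once InputData's output exists, and is skipped entirely if pair_counts.tsv is already present, which is what makes it a good fit for the regular ETL workflows mentioned above.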
39. The cluster used for benchmarking
    • 5 virtual machines, each with 4 CPUs, 10 GB RAM, and 100 GB disk
    • CentOS 6.2
    • CDH4 (Hadoop 2)
    • 20 map tasks, 10 reduce tasks
    • Python 2.6
40. (Unscientific) performance comparison (chart)
41. (Unscientific) performance comparison: Streaming has the lowest overhead
42. (Unscientific) performance comparison: JSON SerDe
43. Feature comparison (table)
44. Feature comparison (table)
45. Questions?
48. What is Spark?
    • Started in 2009 as an academic project in the AMPLab at UC Berkeley; now an ASF project with >100 contributors
    • In-memory distributed execution engine
    • Operates on Resilient Distributed Datasets (RDDs)
    • Provides richer distributed computing primitives for various problems
    • Can support SQL, stream processing, ML, graph computation
    • Supports Scala, Java, and Python
49. Spark uses a general DAG scheduler
    • Application-aware scheduler
    • Uses locality for both disk and memory
    • Partitioning-aware to avoid shuffles
    • Can rewrite and optimize the graph based on analysis
    (Figure: an RDD lineage graph of map, join, union, and groupBy operations split into stages, with cached data partitions marked.)
50. Operations on RDDs (figure from Zaharia 2011)
51. Apache Spark

    Log filtering (Python):

    file = spark.textFile("hdfs://...")
    errors = file.filter(lambda line: "ERROR" in line)
    # Count all the errors
    errors.count()
    # Count errors mentioning MySQL
    errors.filter(lambda line: "MySQL" in line).count()
    # Fetch the MySQL errors as an array of strings
    errors.filter(lambda line: "MySQL" in line).collect()

    Logistic regression (Scala):

    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D)  // current separating plane
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final separating plane: " + w)
52. Apache Spark: goto code
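    The PySpark version of the job is what "goto code" points at in the repo; as a hedged sketch (paths and parsing details are assumptions, not the repo's exact code), the 2-gram aggregation might look roughly like:

    from pyspark import SparkContext

    sc = SparkContext(appName='ngram-neighbors')

    def parse(line):
        # raw record: ngram, year, count, ... (tab-separated)
        ngram, year, count = line.split('\t')[:3]
        words = ngram.split()
        w1, w2 = sorted([words[0], words[-1]])
        return ((w1, w2, year), int(count))

    counts = (sc.textFile('hdfs:///path/to/2grams')
                .map(parse)
                .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile('hdfs:///path/to/output')

    Because the RDD stays in memory across operations, the same counts RDD can also be explored interactively (counts.take(10), further filters and aggregations) without rereading the input, which is the "work interactively"/"in-memory analytics" case from the conclusions.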
53. What’s Impala?
    • Interactive SQL
        • Typically 4-65x faster than the latest Hive (observed up to 100x faster)
        • Responses in seconds instead of minutes (sometimes sub-second)
    • ANSI-92 standard SQL queries with HiveQL
        • Compatible SQL interface for existing Hadoop/CDH applications
        • Based on industry-standard SQL
    • Natively on Hadoop/HBase storage and metadata
        • Flexibility, scale, and cost advantages of Hadoop
        • No duplication/synchronization of data and metadata
        • Local processing to avoid network bottlenecks
    • Separate runtime from batch processing
        • Hive, Pig, MapReduce are designed and great for batch
        • Impala is purpose-built for low-latency SQL queries on Hadoop
54. Cloudera Impala

    SELECT cosmic as snp_id,
           vcf_chrom as chr,
           vcf_pos as pos,
           sample_id as sample,
           vcf_call_gt as genotype,
           sample_affection as phenotype
    FROM hg19_parquet_snappy_join_cached_partitioned
    WHERE COSMIC IS NOT NULL
      AND dbSNP IS NULL
      AND sample_study = "breast_cancer"
      AND VCF_CHROM = "16";
55. Impala Architecture: Planner
    • Example: query with join and aggregation

      SELECT state, SUM(revenue)
      FROM HdfsTbl h JOIN HbaseTbl b ON (...)
      GROUP BY 1 ORDER BY 2 desc LIMIT 10

    (Figure: the distributed plan; HDFS scans and HBase scans feed a hash join and partial aggregation at the DataNodes and region servers, with exchanges up to a final aggregation and top-N at the coordinator.)
56. Impala User-defined Functions (UDFs)
    • Tuple => scalar value
        • Substring
        • sin, cos, pow, …
        • Machine-learning models
    • Supports Hive UDFs (Java)
        • Highly unpleasurable
    • Impala (native) UDFs
        • C++ interface designed for efficiency
        • Similar to Postgres UDFs
        • Runs any LLVM-compiled code
57. LLVM compiler infrastructure (figure)
58. LLVM: C++ example

    bool StringEq(FunctionContext* context,
                  const StringVal& arg1,
                  const StringVal& arg2) {
      if (arg1.is_null != arg2.is_null) return false;
      if (arg1.is_null) return true;
      if (arg1.len != arg2.len) return false;
      return (arg1.ptr == arg2.ptr) ||
             memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
    }
59. LLVM: IR output

    ; ModuleID = '<stdin>'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
    target triple = "x86_64-apple-macosx10.7.0"

    %"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
    %"class.impala::FunctionContextImpl" = type opaque
    %"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
    %"struct.impala_udf::AnyVal" = type { i8 }

    ; Function Attrs: nounwind readonly ssp uwtable
    define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
    entry:
      %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
      %0 = load i8* %is_null, align 1, !tbaa !0, !range !3
      %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
      %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
      %cmp = icmp eq i8 %0, %1
      br i1 %cmp, label %if.end, label %return

    if.end:                                           ; preds = %entry
      %tobool = icmp eq i8 %0, 0
      br i1 %tobool, label %if.end7, label %return

    if.end7:                                          ; preds = %if.end
      %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
      %2 = load i32* %len, align 4, !tbaa !4
      %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
      %3 = load i32* %len8, align 4, !tbaa !4
      %cmp9 = icmp eq i32 %2, %3
      br i1 %cmp9, label %if.end11, label %return

    if.end11:                                         ; preds = %if.end7
      %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
      %4 = load i8** %ptr, align 8, !tbaa !5
      %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
      %5 = load i8** %ptr12, align 8, !tbaa !5
      %cmp13 = icmp eq i8* %4, %5
      br i1 %cmp13, label %return, label %lor.rhs

    lor.rhs:                                          ; preds = %if.end11
      %conv17 = sext i32 %2 to i64
      %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
      %cmp18 = icmp eq i32 %call, 0
      br label %return
60. LLVM compiler infrastructure (figure repeated, with Numba/Python added as a frontend)
61. Iris data and BigML

    def predict_species_orig(sepal_width=None,
                             petal_length=None,
                             petal_width=None):
        """ Predictor for species from model/52952081035d07727e01d836

            Predictive model by BigML - Machine Learning Made Easy
        """
        if (petal_width is None): return u'Iris-virginica'
        if (petal_width > 0.8):
            if (petal_width <= 1.75):
                if (petal_length is None): return u'Iris-versicolor'
                if (petal_length > 4.95):
                    if (petal_width <= 1.55): return u'Iris-virginica'
                    if (petal_width > 1.55):
                        if (petal_length > 5.45): return u'Iris-virginica'
                        if (petal_length <= 5.45): return u'Iris-versicolor'
                if (petal_length <= 4.95):
                    if (petal_width <= 1.65): return u'Iris-versicolor'
                    if (petal_width > 1.65): return u'Iris-virginica'
            if (petal_width > 1.75):
                if (petal_length is None): return u'Iris-virginica'
                if (petal_length > 4.85): return u'Iris-virginica'
                if (petal_length <= 4.85):
                    if (sepal_width is None): return u'Iris-virginica'
                    if (sepal_width <= 3.1): return u'Iris-virginica'
                    if (sepal_width > 3.1): return u'Iris-versicolor'
        if (petal_width <= 0.8): return u'Iris-setosa'
62. Impala + Numba: goto code
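    The Impala UDF integration demoed here is pre-alpha (next slide), so the following is only a hedged, minimal illustration of the Numba half of the idea: decorating an ordinary Python numeric function so Numba compiles it through LLVM. The function and values are made up for this sketch and are not the talk's UDF code.

    import math
    from numba import jit

    @jit(nopython=True)
    def logistic(x):
        # plain Python arithmetic, compiled to native machine code via LLVM
        return 1.0 / (1.0 + math.exp(-x))

    print(logistic(0.0))  # 0.5; the first call triggers compilation, later calls run natively

    The prototype applies the same machinery to UDFs: the decorated Python function is lowered to LLVM IR, which Impala can run natively, which is why it promises both faster execution and far easier authoring than the Java (Hive UDF) or C++ routes on slide 56.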
63. Impala + Numba
    • Still pre-alpha
    • Significantly faster execution thanks to native LLVM
    • Significantly easier to write UDFs
64. Conclusions
65. If you have access to a Hadoop cluster and you want a one-off, quick-and-dirty job… Hadoop Streaming
66. If you want an expressive Pythonic interface to build complex, regular ETL workflows… Luigi
67. If you want to integrate Hadoop with other regular processes… Luigi
68. If you don’t have access to Hadoop and want to try stuff out… mrjob
69. If you’re heavily using AWS… mrjob
70. If you want to work interactively… PySpark
71. If you want to do in-memory analytics… PySpark
72. If you want to do anything…* PySpark
73. If you want the ease of Python with high performance… Impala + Numba
74. If you want to write Python UDFs for SQL queries… Impala + Numba
75. Code: https://github.com/laserson/rock-health-python
    Blog post: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
    Slides: http://www.slideshare.net/urilaserson/