Compiled Python UDFs for Impala
Uri Laserson
20 May 2014
Impala User-defined Functions (UDFs)
• Tuple => Scalar value
  • Substring
  • sin, cos, pow, …
  • Machine-learning models
• Supports Hive UDFs (Java)
  • Relatively unpleasant to write
  • Slower
• Impala (native) UDFs
  • C++ interface designed for efficiency
  • Similar to Postgres UDFs
  • Runs any LLVM-compiled code
LLVM compiler infrastructure
LLVM: C++ example
bool StringEq(FunctionContext* context,
              const StringVal& arg1,
              const StringVal& arg2) {
  // A NULL value never equals a non-NULL value.
  if (arg1.is_null != arg2.is_null)
    return false;
  // Two NULL values are treated as equal.
  if (arg1.is_null)
    return true;
  if (arg1.len != arg2.len)
    return false;
  // Same buffer, or same bytes.
  return (arg1.ptr == arg2.ptr) ||
         memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
}
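The NULL handling in StringEq can be mirrored in plain Python for quick experimentation. This is only a sketch: the StringVal dataclass below is a hypothetical stand-in for the Impala type, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StringVal:
    """Hypothetical stand-in for impala_udf::StringVal (illustration only)."""
    is_null: bool = True
    data: Optional[bytes] = None


def string_eq(arg1: StringVal, arg2: StringVal) -> bool:
    # One NULL and one non-NULL value never compare equal.
    if arg1.is_null != arg2.is_null:
        return False
    # Both NULL: treated as equal, as in the C++ version.
    if arg1.is_null:
        return True
    # Different lengths cannot be equal.
    if len(arg1.data) != len(arg2.data):
        return False
    # Same object, or same bytes (the memcmp step in C++).
    return arg1.data is arg2.data or arg1.data == arg2.data
```

For example, string_eq(StringVal(), StringVal()) compares two NULLs, and string_eq(StringVal(False, b"abc"), StringVal(False, b"abd")) compares two non-NULL values of equal length.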
LLVM: IR output
; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"
%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }
; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"*
nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
%is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
%0 = load i8* %is_null, align 1, !tbaa !0, !range !3
%is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
%1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %return
if.end: ; preds = %entry
%tobool = icmp eq i8 %0, 0
br i1 %tobool, label %if.end7, label %return
if.end7: ; preds = %if.end
%len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
%2 = load i32* %len, align 4, !tbaa !4
%len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
%3 = load i32* %len8, align 4, !tbaa !4
%cmp9 = icmp eq i32 %2, %3
br i1 %cmp9, label %if.end11, label %return
if.end11: ; preds = %if.end7
%ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
%4 = load i8** %ptr, align 8, !tbaa !5
%ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
%5 = load i8** %ptr12, align 8, !tbaa !5
%cmp13 = icmp eq i8* %4, %5
br i1 %cmp13, label %return, label %lor.rhs
lor.rhs: ; preds = %if.end11
%conv17 = sext i32 %2 to i64
%call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
%cmp18 = icmp eq i32 %call, 0
br label %return
Data type compatibility
C++:
struct AnyVal {
  bool is_null;
};
struct StringVal : public AnyVal {
  int len;
  uint8_t* ptr;
};
LLVM IR:
%AnyVal = type { i8 }
%StringVal = type { %AnyVal, i32, i8* }
; or
%StringVal = type { { i8 }, i32, i8* }
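The same layout can be inspected from Python with the standard ctypes module. This is a sketch under assumptions: the class and field names below are chosen for illustration, and the offsets in the usage note are what common platforms produce, not something the slides state.

```python
import ctypes


class AnyVal(ctypes.Structure):
    # Mirrors %AnyVal = type { i8 }
    _fields_ = [("is_null", ctypes.c_uint8)]


class StringVal(ctypes.Structure):
    # Mirrors %StringVal = type { %AnyVal, i32, i8* }
    _fields_ = [
        ("base", AnyVal),                         # base-class subobject comes first
        ("len", ctypes.c_int32),                  # i32
        ("ptr", ctypes.POINTER(ctypes.c_uint8)),  # i8*
    ]


def field_offsets(struct_type):
    """Return {field name: byte offset} for a ctypes Structure type."""
    return {name: getattr(struct_type, name).offset
            for name, _ in struct_type._fields_}
```

Calling field_offsets(StringVal) shows where padding is inserted after the one-byte is_null flag so that len and ptr are aligned.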
Register and execute the function
CREATE FUNCTION StringEq(STRING, STRING)
  RETURNS BOOLEAN
  LOCATION '/path/to/bitcode.ll'
  SYMBOL='StringEq';
SELECT StringEq(a, b) FROM mytable;
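When registering many functions from a script, the statement can be assembled programmatically. The helper below is hypothetical (it is not part of impyla or Impala) and does no escaping, so it assumes trusted inputs.

```python
def create_function_sql(name, arg_types, return_type, location, symbol):
    """Build a CREATE FUNCTION statement in the form shown on the slide.

    Hypothetical helper for illustration; inputs are not escaped.
    """
    args = ", ".join(arg_types)
    return (
        f"CREATE FUNCTION {name}({args}) "
        f"RETURNS {return_type} "
        f"LOCATION '{location}' "
        f"SYMBOL='{symbol}'"
    )


stmt = create_function_sql(
    "StringEq", ["STRING", "STRING"], "BOOLEAN",
    "/path/to/bitcode.ll", "StringEq",
)
```

The resulting string can then be passed to a DB API cursor's execute method.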
Numba compiler
[Diagram: Numba, Python]
Impyla: Python Library for Impala
• pip install impyla
• DB API v2.0 (PEP 249) compatible
• Prototype sklearn API for Impala ML
• Numba integration (described here)
• See blog post: http://blog.cloudera.com/blog/2014/04/a-new-python-client-for-impala/
LLVM: Python example
@udf(IntVal(FunctionContext, StringVal))
def hour_from_weird_date_format(context, date):
    return int(split(date, '-')[1])

ship_udf(cursor, hour_from_weird_date_format,
         '/path/to/store/udf.ll', 'my.impala.host')

cursor.execute('SELECT hour_from_weird_date_format(date) ' +
               'AS hour FROM mytable LIMIT 100')
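Before compiling and shipping a UDF, its logic can be checked locally in plain Python. This sketch assumes that split(date, '-') in the UDF behaves like str.split; the function name with the _py suffix and the sample input are hypothetical.

```python
def hour_from_weird_date_format_py(date):
    """Plain-Python twin of the UDF body, for local testing only."""
    # Take the second '-'-separated field and convert it to an integer.
    return int(date.split('-')[1])


# Hypothetical sample value in a made-up "weird" format: <date>-<hour>-<rest>
sample_hour = hour_from_weird_date_format_py('20140520-13-0500')
```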
Model Scoring: BigML on Census Data
MLaaS
Example: 100 Node Decision Tree
def predict_income(impala_function_context, age, workclass, final_weight, education, education_num, marital_status, occupation, relationship,
                   race, sex, hours_per_week, native_country, income):
    if (marital_status is None):
        return '<=50K'
    if (marital_status == 'Married-civ-spouse'):
        if (education_num is None):
            return '<=50K'
        if (education_num > 12):
            if (hours_per_week is None):
                return '>50K'
            if (hours_per_week > 31):
                if (age is None):
                    return '>50K'
                if (age > 28):
                    if (education_num > 13):
                        if (age > 58):
                            return '>50K'
                        if (age <= 58):
                            return '>50K'
                    if (education_num <= 13):
                        if (occupation is None):
                            return '>50K'
                        if (occupation == 'Exec-managerial'):
                            return '>50K'
                        if (occupation != 'Exec-managerial'):
                            return '>50K'
                if (age <= 28):
                    if (age > 24):
                        if (occupation is None):
                            return '<=50K'
                        if (occupation == 'Tech-support'):
                            return '>50K'
                        if (occupation != 'Tech-support'):
                            return '<=50K'
                    if (age <= 24):
                        if (final_weight is None):
                            return '<=50K'
                        if (final_weight > 492053):
                            return '>50K'
                        if (final_weight <= 492053):
                            return '<=50K'
            if (hours_per_week <= 31):
                if (sex is None):
                    return '<=50K'
                if (sex == 'Male'):
                    # (the slide cuts off here; the rest of the tree is not shown)
Batch Scoring with PySpark
# parse the text data
observations = sc.textFile('/path/to/census_data').map(parse_obs)
# perform batch scoring
predictions = observations.map(lambda tup: predict_income(*tup))
# trigger computation
distinct = predictions.distinct().collect()
Batch Scoring with Impala
# compile the scoring function
predict_income = udf(signature)(predict_income)
ship_udf(cursor, predict_income, ...)
# perform batch scoring
cursor.execute('SELECT DISTINCT predict_income(age, ... ) ' +
               'FROM census_text')
distinct = cursor.fetchall()
Execution Time
execution_time =
per_job_overhead +
N * ( per_record_exec + memcmp_exec )
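The cost model can be written as a small function to see how the fixed overhead and the per-record terms trade off. The numbers in the example are toy values chosen for easy arithmetic; they are not measurements from the talk.

```python
def execution_time(per_job_overhead, n_records, per_record_exec, memcmp_exec):
    """Total time = fixed job overhead + N * (per-record work + memcmp work)."""
    return per_job_overhead + n_records * (per_record_exec + memcmp_exec)


# Toy inputs (hypothetical): 10 s overhead, 1000 records,
# 2 ms of tree evaluation and 1 ms of string comparison per record.
toy_total = execution_time(10.0, 1000, 0.002, 0.001)
```

With these toy inputs the per-record part contributes about 3 s on top of the 10 s overhead, so the overhead dominates until N grows.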
PySpark vs. Impala Performance
Tree size (nodes)  Spark execution time (s)  Impala execution time (s)  Fold difference  Impala compilation time (s)  Bytecode size (bytes)  Percent memcmp nodes
0                  160                       9                          17x              0                            4
100                175                       22                         8x               1                            2254                   22%
500                178                       27                         7x               4                            9803                   35%
1000               184                       32                         6x               16                           23495                  34%
1500               188                       35                         5x               18                           28301                  34%
2000               196                       37                         5x               31                           42442                  33%
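The fold difference column appears to be the ratio of Spark time to Impala time, reported as a whole number. A short sketch that recomputes the ratios from the timings in the table (the variable and function names are illustrative):

```python
# (tree size in nodes, Spark time in s, Impala time in s), copied from the table
TIMINGS = [
    (0, 160, 9),
    (100, 175, 22),
    (500, 178, 27),
    (1000, 184, 32),
    (1500, 188, 35),
    (2000, 196, 37),
]


def fold_differences(rows):
    """Map tree size to the Spark/Impala execution-time ratio."""
    return {nodes: spark / impala for nodes, spark, impala in rows}


for nodes, ratio in fold_differences(TIMINGS).items():
    print(f"{nodes:>5} nodes: {ratio:.1f}x")
```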
Execution Time
execution_time =
per_job_overhead +
N * ( per_record_exec + memcmp_exec )
Spark: 24 threads / node
Impala: 1 thread / node
Current Status
• Support for all Impala UDF data types (IntVal, StringVal, etc.)
• Support for casts to/from primitive types
  • Any operations on primitives should work on Impala types
• Support for NULL types as Python None
• Proof-of-principle support for Python string module
  • len
  • split
  • Concatenation
• Call out to any extern C functions
• Proposed directions
  • Array handling
  • NumPy support
  • What else?
UDFs with Impala + Numba
• Simplicity of Python interface/syntax
• Performance of compiled language like C++
• Developed at: https://github.com/cloudera/impyla
• Please try it and tell us what features would be useful
• Please contribute!
pip install impyla

All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
Why Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile EvolutionWhy Agile? - A handbook behind Agile Evolution
Why Agile? - A handbook behind Agile Evolution
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)

  • 1. Compiled Python UDFs for Impala. Uri Laserson, 20 May 2014
  • 2. Impala User-defined Functions (UDFs)
      • Tuple => Scalar value
          • Substring
          • sin, cos, pow, …
          • Machine-learning models
      • Supports Hive UDFs (Java)
          • Relatively unpleasurable
          • Slower
      • Impala (native) UDFs
          • C++ interface designed for efficiency
          • Similar to Postgres UDFs
          • Runs any LLVM-compiled code
  • 4. LLVM: C++ example
      bool StringEq(FunctionContext* context,
                    const StringVal& arg1,
                    const StringVal& arg2) {
        if (arg1.is_null != arg2.is_null)
          return false;
        if (arg1.is_null)
          return true;
        if (arg1.len != arg2.len)
          return false;
        return (arg1.ptr == arg2.ptr) ||
            memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
      }
  • 5. LLVM: IR output
      ; ModuleID = '<stdin>'
      target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
      target triple = "x86_64-apple-macosx10.7.0"
      %"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
      %"class.impala::FunctionContextImpl" = type opaque
      %"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
      %"struct.impala_udf::AnyVal" = type { i8 }
      ; Function Attrs: nounwind readonly ssp uwtable
      define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
      entry:
        %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
        %0 = load i8* %is_null, align 1, !tbaa !0, !range !3
        %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
        %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
        %cmp = icmp eq i8 %0, %1
        br i1 %cmp, label %if.end, label %return
      if.end: ; preds = %entry
        %tobool = icmp eq i8 %0, 0
        br i1 %tobool, label %if.end7, label %return
      if.end7: ; preds = %if.end
        %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
        %2 = load i32* %len, align 4, !tbaa !4
        %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
        %3 = load i32* %len8, align 4, !tbaa !4
        %cmp9 = icmp eq i32 %2, %3
        br i1 %cmp9, label %if.end11, label %return
      if.end11: ; preds = %if.end7
        %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
        %4 = load i8** %ptr, align 8, !tbaa !5
        %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
        %5 = load i8** %ptr12, align 8, !tbaa !5
        %cmp13 = icmp eq i8* %4, %5
        br i1 %cmp13, label %return, label %lor.rhs
      lor.rhs: ; preds = %if.end11
        %conv17 = sext i32 %2 to i64
        %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
        %cmp18 = icmp eq i32 %call, 0
        br label %return
  • 7. Data type compatibility
      C++:
        struct AnyVal {
          bool is_null;
        };
        struct StringVal : public AnyVal {
          int len;
          uint8_t* ptr;
        };
      LLVM IR:
        %AnyVal = type { i8 }
        %StringVal = type { %AnyVal, i32, i8* }
        ; or
        %StringVal = type { { i8 }, i32, i8* }
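The field-by-field correspondence on this slide can be poked at from Python. Below is a minimal sketch using the standard `ctypes` module, assuming the base struct's member is laid out ahead of the derived members (which is what the IR type `{ %AnyVal, i32, i8* }` shows). The class names mirror the slide; the field name `base` and the sample bytes are hypothetical.

```python
import ctypes

class AnyVal(ctypes.Structure):
    # %AnyVal = type { i8 }: one byte holding the is_null flag
    _fields_ = [("is_null", ctypes.c_uint8)]

class StringVal(ctypes.Structure):
    # %StringVal = type { %AnyVal, i32, i8* }
    _fields_ = [
        ("base", AnyVal),                         # embedded base struct (hypothetical field name)
        ("len", ctypes.c_int32),                  # i32
        ("ptr", ctypes.POINTER(ctypes.c_uint8)),  # i8*
    ]

# Build a non-NULL StringVal that points at a small byte buffer.
data = b"impala"
buf = (ctypes.c_uint8 * len(data)).from_buffer_copy(data)
val = StringVal(AnyVal(0), len(data), ctypes.cast(buf, ctypes.POINTER(ctypes.c_uint8)))

# Read the bytes back through the pointer, using len as the bound.
recovered = bytes(val.ptr[: val.len])
print(StringVal.len.offset, StringVal.ptr.offset, recovered)
```

The one-byte flag is followed by padding so that the 32-bit length is aligned, which is why code generated outside C++ has to agree on the whole struct layout and not just on the field order.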
  • 8. Register and execute the function
      CREATE FUNCTION StringEq(STRING, STRING) RETURNS BOOLEAN
      LOCATION '/path/to/bitcode.ll' SYMBOL='StringEq';
      SELECT StringEq(a, b) FROM mytable;
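When the registration statement is issued from a script, it has to be assembled with plain ASCII single quotes. The following is a minimal sketch of that; `create_function_sql` is a hypothetical helper (not part of Impala or impyla) and it does no escaping, so it is only suitable for trusted inputs.

```python
def create_function_sql(name, arg_types, return_type, location, symbol):
    """Build a CREATE FUNCTION statement in the form shown on the slide."""
    args = ", ".join(arg_types)
    return (
        f"CREATE FUNCTION {name}({args}) RETURNS {return_type} "
        f"LOCATION '{location}' SYMBOL='{symbol}'"
    )

stmt = create_function_sql(
    "StringEq", ["STRING", "STRING"], "BOOLEAN",
    "/path/to/bitcode.ll", "StringEq",
)
print(stmt)
# With a live cluster, a DB API cursor would run it: cursor.execute(stmt)
```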
  • 10. Impyla: Python Library for Impala
      • pip install impyla
      • DB API v2.0 (PEP 249) compatible
      • Prototype sklearn API for Impala ML
      • Numba integration (described here)
      • See blog post: http://blog.cloudera.com/blog/2014/04/a-new-python-client-for-impala/
  • 11. LLVM: Python example
      @udf(IntVal(FunctionContext, StringVal))
      def hour_from_weird_date_format(context, date):
          return int(split(date, '-')[1])
      ship_udf(cursor, hour_from_weird_date_format,
               '/path/to/store/udf.ll', 'my.impala.host')
      cursor.execute('SELECT hour_from_weird_date_format(date) ' +
                     'AS hour FROM mytable LIMIT 100')
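Before compiling and shipping a UDF like this, its logic can be checked locally in ordinary Python. Here is a minimal sketch, assuming the hour is the second `-`-separated field as in the slide's function body. The function name with the `_py` suffix, the sample input, and the `None` handling are all hypothetical additions.

```python
def hour_from_weird_date_format_py(date):
    """Plain-Python reference: return the second '-'-separated field as an int."""
    if date is None:
        # Assumption: a NULL input should give a NULL (None) result.
        return None
    return int(date.split('-')[1])

# Hypothetical sample value; the real column format is not given on the slide.
sample = '20140520-14-30'
hour = hour_from_weird_date_format_py(sample)
print(hour)
```

A reference version like this also gives a set of expected values to compare against the results returned by the compiled UDF.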
  • 12. Model Scoring: BigML on Census Data (MLaaS)
  • 13. Model Scoring: BigML on Census Data
  • 14. Example: 100 Node Decision Tree
      def predict_income(impala_function_context, age, workclass, final_weight,
                         education, education_num, marital_status, occupation,
                         relationship, race, sex, hours_per_week, native_country,
                         income):
          if (marital_status is None):
              return '<=50K'
          if (marital_status == 'Married-civ-spouse'):
              if (education_num is None):
                  return '<=50K'
              if (education_num > 12):
                  if (hours_per_week is None):
                      return '>50K'
                  if (hours_per_week > 31):
                      if (age is None):
                          return '>50K'
                      if (age > 28):
                          if (education_num > 13):
                              if (age > 58):
                                  return '>50K'
                              if (age <= 58):
                                  return '>50K'
                          if (education_num <= 13):
                              if (occupation is None):
                                  return '>50K'
                              if (occupation == 'Exec-managerial'):
                                  return '>50K'
                              if (occupation != 'Exec-managerial'):
                                  return '>50K'
                      if (age <= 28):
                          if (age > 24):
                              if (occupation is None):
                                  return '<=50K'
                              if (occupation == 'Tech-support'):
                                  return '>50K'
                              if (occupation != 'Tech-support'):
                                  return '<=50K'
                          if (age <= 24):
                              if (final_weight is None):
                                  return '<=50K'
                              if (final_weight > 492053):
                                  return '>50K'
                              if (final_weight <= 492053):
                                  return '<=50K'
                  if (hours_per_week <= 31):
                      if (sex is None):
                          return '<=50K'
                      if (sex == 'Male'):
  • 15. Batch Scoring with PySpark
      # parse the text data
      observations = sc.textFile('/path/to/census_data').map(parse_obs)
      # perform batch scoring
      predictions = observations.map(lambda tup: predict_income(*tup))
      # trigger computation
      distinct = predictions.distinct().collect()
  • 16. Batch Scoring with Impala
      # compile the scoring function
      predict_income = udf(signature)(predict_income)
      ship_udf(cursor, predict_income, ...)
      # perform batch scoring
      cursor.execute('SELECT DISTINCT predict_income(age, ... ) ' +
                     'FROM census_text')
      distinct = cursor.fetchall()
  • 17. Execution Time
      execution_time = per_job_overhead + N * ( per_record_exec + memcmp_exec )
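The cost model above can be turned into a small function to see how the ratio between two engines behaves. This is a minimal sketch, with every number below made up for illustration (none are measurements from the talk). It assumes both engines pay the same `memcmp_exec` per record and differ only in job overhead and per-record execution cost.

```python
def execution_time(per_job_overhead, n_records, per_record_exec, memcmp_exec):
    """execution_time = per_job_overhead + N * (per_record_exec + memcmp_exec)"""
    return per_job_overhead + n_records * (per_record_exec + memcmp_exec)

N = 10_000_000  # hypothetical record count

ratios = []
for memcmp_exec in (0.0, 1e-6, 1e-5):  # hypothetical shared string-compare cost (s/record)
    slow = execution_time(150.0, N, 1e-6, memcmp_exec)  # hypothetical slower engine
    fast = execution_time(5.0, N, 1e-7, memcmp_exec)    # hypothetical faster engine
    ratios.append(slow / fast)
    print(f"memcmp_exec={memcmp_exec:g}: fold difference {slow / fast:.1f}x")
```

Under these assumptions the fold difference shrinks toward 1 as the shared `memcmp_exec` term grows, because a cost that both engines pay equally dilutes whatever advantage one has in the other terms.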
  • 18. PySpark vs. Impala Performance
      Tree size   Spark execution   Impala execution   Fold         Impala compilation   Bytecode size   Percent memcmp
      (nodes)     time (s)          time (s)           difference   time (s)             (bytes)         nodes
      0           160               9                  17x          0                    4
      100         175               22                 8x           1                    2254            22%
      500         178               27                 7x           4                    9803            35%
      1000        184               32                 6x           16                   23495           34%
      1500        188               35                 5x           18                   28301           34%
      2000        196               37                 5x           31                   42442           33%
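The fold-difference column in the table above is the ratio of the two execution times. A minimal sketch that recomputes it from the Spark and Impala columns follows; the timings are copied from the table, while the variable names are hypothetical and the computed ratios may differ slightly from the slide's whole-number figures.

```python
# (tree size in nodes, Spark execution time in s, Impala execution time in s)
rows = [
    (0, 160, 9),
    (100, 175, 22),
    (500, 178, 27),
    (1000, 184, 32),
    (1500, 188, 35),
    (2000, 196, 37),
]

# Fold difference = Spark time / Impala time for each tree size.
folds = {size: spark / impala for size, spark, impala in rows}

for size, fold in folds.items():
    print(f"{size:>5} nodes: {fold:.1f}x")
```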
  • 19. Execution Time
      execution_time = per_job_overhead + N * ( per_record_exec + memcmp_exec )
      Spark: 24 threads / node
      Impala: 1 thread / node
  • 21. Current Status
      • Support for all Impala UDF data types (e.g., IntVal, StringVal, etc.)
      • Support for casts to/from primitive types:
          • Any operations on primitives should work on Impala types
      • Support for NULL types as Python None
      • Proof-of-principle support for Python string module
          • len
          • split
          • Concatenation
      • Call out to any extern C functions
      • Proposed directions
          • Array handling
          • Numpy support
          • What else?
  • 22. UDFs with Impala + Numba
      • Simplicity of Python interface/syntax
      • Performance of compiled language like C++
      • Developed at: https://github.com/cloudera/impyla
      • Please try it and tell us what features would be useful
      • Please contribute!
      • pip install impyla

Editor's Notes

  1. Much easier than the C++ workflow. The UDF is in principle available to others.
  2. Significant fold change. The fold change gets closer to 1 and stabilizes as memcmp dominates execution. Compilation time is linear in the size of the bytecode, but we are working to improve it. Every extra 500 nodes adds roughly a few seconds of work (wall-clock time).