Ibis:
Seamless Transition Between
Pandas and Spark
● Modeling Platform and Tools @ Two Sigma
● Apache Spark, Apache Arrow, Ibis
About Me: Li Jin
● Software engineer at Two Sigma
● Distributed compute, Spark
● Running, hiking, traveling
About Me: Hyonjee Joo
Outline
● Introduction: A common data science problem
● Ibis: A high level introduction
● Ibis Language
● Pyspark backend for Ibis
● Demo
● Conclusion
df = pd.read_parquet('s3://…')
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature'] = df.groupby('key')['feature'] \
    .rolling(7, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg({'feature': {'max', 'min'}})
A common data science problem...
● Pandas is good for medium-sized data (< 10 GB)
● Your machine doesn’t have enough RAM to hold all the data
● You want to use more CPUs
You are happy until... the code runs too slow
● Use a bigger machine (you can get machines with 1 TB of RAM these days)
● Pros:
○ Low human cost: no code change
● Cons:
○ Not scalable
○ Same software limits (e.g., single thread)
○ Can be difficult depending on your organization
Try a few things...
● Use a generic way to distribute pandas code:
○ e.g.: sparkContext.parallelize(range(2000, 2020)).map(lambda year: compute_for_year(year)).collect()
○ dask.delayed (a short sketch follows this slide)
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
Try a few things...
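As a rough illustration of the same per-year pattern with dask.delayed (a sketch only; compute_for_year stands in for the user-defined pandas routine from the example above):

import dask
from dask import delayed

# Build one lazy task per year, then run the tasks in parallel.
tasks = [delayed(compute_for_year)(year) for year in range(2000, 2020)]
results = dask.compute(*tasks)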
Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ …
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
Take a step back...
df = pd.read_parquet('s3://…')
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(7, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg({'feature2': {'max', 'min'}})
Separation of API and execution
● The problem:
○ The software and hardware we use (pandas on a single machine) cannot
execute the computation we want to do
● Can we separate “what we want to do” from “how it is done”?
Separation of API and execution
● Yes!
● SQL is a way to express a computation that is independent of the
software and hardware that execute it
● Can we have something that is like SQL, but for Python?
Don’t limit yourself to what you can use today
● Current “DataFrame”:
○ Pandas
○ Spark
○ Dask
● Future “DataFrame”:
○ Arrow Dataset
○ Cudf / dask-cudf
○ Ray / Modin
○ ...
Ibis: A high-level introduction
Ibis: Python Data Analysis Productivity Framework
● Open source
● Started in 2014 by Wes McKinney
● Worked on by top committers of Pandas:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
Ibis components
● ibis “language”
○ The language used to express the computation
● ibis “backends”
○ Modules that translate ibis expressions into something executable by
different computation engines
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ...
Ibis language
● Table API
○ Projection
○ Filtering
○ Join
○ Groupby
○ Sort
○ Window
○ Aggregation
○ …
○ AsofJoin
○ UDFs
● Ibis expressions (intermediate representation)
Ibis language example
● Table API:
○ table = table.mutate(v3=table['v1'] + table['v2'])
○ type(table) = ibis.TableExpr
● Ibis expressions:
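A minimal sketch of building and inspecting such an expression (the schema, table name, and exact repr are illustrative; output formatting differs between Ibis versions):

import ibis

# An unbound table with just a schema; nothing is executed yet.
t = ibis.table([('v1', 'double'), ('v2', 'double')], name='t')
expr = t.mutate(v3=t.v1 + t.v2)

type(expr)   # a TableExpr
expr         # printing the expression shows the underlying expression tree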
Ibis backends
● Ibis expressions -> Backend specific expressions
● table.mutate(v3=table['v1'] + table['v2'])
○ Pandas: df = df.assign(v3=df['v1'] + df['v2'])
○ PySpark: df = df.withColumn('v3', df['v1'] + df['v2'])
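For instance, the same expression can be built once and executed on the pandas backend; a sketch, assuming a small in-memory DataFrame and the connect()/execute() calls of the Ibis API (details may vary by version):

import pandas as pd
import ibis

df = pd.DataFrame({'v1': [1.0, 2.0], 'v2': [3.0, 4.0]})

# Pandas backend: hand Ibis a dict of DataFrames keyed by table name.
con = ibis.pandas.connect({'t': df})
t = con.table('t')
expr = t.mutate(v3=t.v1 + t.v2)
result = expr.execute()   # a pandas DataFrame with the new column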
Ibis Example
Recall our earlier example in Pandas
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(7, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg({'feature2': {'max', 'min'}})
Basic expression and column selection
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
Basic expression and column selection
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(3, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(3, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(3, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
metrics = [table.feature2.min(),
table.feature2.max()]
table.group_by('key').aggregate(metrics)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(3, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Final translation
Ibis expression
table = table.mutate(
feature=(table.v1 + table.v2)/2
)
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
metrics = [table.feature2.min(),
table.feature2.max()]
table.group_by('key').aggregate(metrics)
Pandas expression
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = df.groupby('key')['feature'] \
    .rolling(3, min_periods=1).mean() \
    .sort_index(level=1) \
    .reset_index(drop=True)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis: PySpark backend
A distributed backend for Ibis with PySpark
Ibis expression tree
Initialize Ibis PySpark client
Ibis expression
session = SparkSession.builder.getOrCreate()
client = ibis.pyspark.connect(session)
Ibis expression
table = client.table('table')
Access table in Ibis
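Putting the pieces together, a minimal end-to-end sketch with the PySpark backend might look like the following (assumes a table named 'table' is available in the Spark session; execute() returning a pandas DataFrame follows the usual Ibis convention, but treat the details as assumptions):

import ibis
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
client = ibis.pyspark.connect(session)

table = client.table('table')
table = table.mutate(feature=(table.v1 + table.v2) / 2)

w = ibis.window(group_by='key', order_by='v1', preceding=2, following=0)
table = table.mutate(feature2=table.feature.mean().over(w))

metrics = [table.feature2.min(), table.feature2.max()]
result = table.group_by('key').aggregate(metrics).execute()  # pandas DataFrame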
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
t is a PySparkTranslator
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
t is a PySparkTranslator
It has a translate() method that
translates an Ibis expression into a
PySpark object.
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
expr is the ibis expression to
translate
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
scope is a dict that caches results
of previously translated Ibis
expressions.
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
left and right are
PySpark columns
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
PySpark column division
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
left and right are
PySpark columns
PySpark column addition
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
Basic expression and column selection
Ibis expression
table.mutate(
feature=(table.v1 + table.v2)/2
)
Code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
    op = expr.op()
    column_name = op.name
    pyspark_df = t.translate(op.table, scope)
    return pyspark_df[column_name]
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.WindowOp)
def compile_window_op(t, expr, scope, **kwargs):
    op = expr.op()
    ibis_window = op.window
    operand = op.expr
    ...
    # pull out grouping_keys, ordering_keys, start
    # and end from ibis_window
    pyspark_window = Window.partitionBy(grouping_keys) \
        .orderBy(ordering_keys) \
        .rowsBetween(start, end)
    return t.translate(operand, scope,
                       window=pyspark_window)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.WindowOp)
def compile_window_op(t, expr, scope, **kwargs):
    op = expr.op()
    ibis_window = op.window
    operand = op.expr
    ...
    # pull out grouping_keys, ordering_keys, start
    # and end from ibis_window
    pyspark_window = Window.partitionBy(grouping_keys) \
        .orderBy(ordering_keys) \
        .rowsBetween(start, end)
    return t.translate(operand, scope,
                       window=pyspark_window)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.WindowOp)
def compile_window_op(t, expr, scope, **kwargs):
    op = expr.op()
    ibis_window = op.window
    operand = op.expr
    ...
    # pull out grouping_keys, ordering_keys, start
    # and end from ibis_window
    pyspark_window = Window.partitionBy(grouping_keys) \
        .orderBy(ordering_keys) \
        .rowsBetween(start, end)
    return t.translate(operand, scope,
                       window=pyspark_window)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.WindowOp)
def compile_window_op(t, expr, scope, **kwargs):
    op = expr.op()
    ibis_window = op.window
    operand = op.expr
    ...
    # pull out grouping_keys, ordering_keys, start
    # and end from ibis_window
    pyspark_window = Window.partitionBy(grouping_keys) \
        .orderBy(ordering_keys) \
        .rowsBetween(start, end)
    return t.translate(operand, scope,
                       window=pyspark_window)
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.Mean)
def compile_mean(t, expr, scope, context=None, **kwargs):
    def fn(col):
        if 'window' in kwargs:
            window = kwargs['window']
            return F.mean(col).over(window)
        else:
            return F.mean(col)
    return compile_aggregator(
        t, expr, scope, fn, context, **kwargs
    )
Group-by and windowed aggregation
Ibis expression
w = ibis.window(group_by='key', order_by='v1',
preceding=2, following=0)
table = table.mutate(
feature2=table.feature.mean().over(w)
)
Code
@compiles(ops.Mean)
def compile_mean(t, expr, scope, context=None, **kwargs):
    def fn(col):
        if 'window' in kwargs:
            window = kwargs['window']
            return F.mean(col).over(window)
        else:
            return F.mean(col)
    return compile_aggregator(
        t, expr, scope, fn, context, **kwargs
    )
where F is pyspark.sql.functions
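For reference, the PySpark this translation produces for the windowed mean is roughly the following (a sketch using the column names from the example; preceding=2, following=0 maps to a three-row window ending at the current row):

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('key').orderBy('v1').rowsBetween(-2, 0)
df = df.withColumn('feature2', F.mean(df['feature']).over(w))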
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Code
@compiles(ops.Aggregation)
def compile_aggregation(t, expr, scope, **kwargs):
    op = expr.op()
    pyspark_df = t.translate(op.table, scope)
    if op.by:
        context = AggregationContext.GROUP
        aggs = [_compile_agg(...) for m in op.metrics]
        bys = [t.translate(b, ...) for b in op.by]
        return pyspark_df.groupby(*bys).agg(*aggs)
    else:
        context = AggregationContext.ENTIRE
        aggs = [_compile_agg(...) for m in op.metrics]
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Code
@compiles(ops.Aggregation)
def compile_aggregation(t, expr, scope, **kwargs):
    op = expr.op()
    pyspark_df = t.translate(op.table, scope)
    if op.by:
        context = AggregationContext.GROUP
        aggs = [_compile_agg(...) for m in op.metrics]
        bys = [t.translate(b, ...) for b in op.by]
        return pyspark_df.groupby(*bys).agg(*aggs)
    else:
        context = AggregationContext.ENTIRE
        aggs = [_compile_agg(...) for m in op.metrics]
List of
PySpark
columns/
column
expressions
Composite aggregation
Ibis expression
metrics = [
table.feature2.min(),
table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Code
@compiles(ops.Aggregation)
def compile_aggregation(t, expr, scope, **kwargs):
    op = expr.op()
    pyspark_df = t.translate(op.table, scope)
    if op.by:
        context = AggregationContext.GROUP
        aggs = [_compile_agg(...) for m in op.metrics]
        bys = [t.translate(b, ...) for b in op.by]
        return pyspark_df.groupby(*bys).agg(*aggs)
    else:
        context = AggregationContext.ENTIRE
        aggs = [_compile_agg(...) for m in op.metrics]
PySpark
groupby
and
aggregate
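The resulting PySpark is roughly the following (a sketch assuming the example column names; the real backend builds the aggregate columns via _compile_agg):

import pyspark.sql.functions as F

result = df.groupby('key').agg(
    F.min('feature2').alias('min'),
    F.max('feature2').alias('max'),
)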
Demo
Conclusion
Conclusion
● Separate API from execution
● Don’t limit yourself to what you can use today; think about what you
can use in the future
○ Arrow Dataset
○ Modin
○ cudf / dask-cudf
○ ...
Thanks!
ice.xelloss@gmail.com (@icexelloss)
hyonjee.joo@twosigma.com (@hjoo)

