2. About Me: Li Jin
● Modeling Platform and Tools @ Two Sigma
● Apache Spark, Apache Arrow, Ibis
3. About Me: Hyonjee Joo
● Software engineer at Two Sigma
● Distributed compute, Spark
● Running, hiking, traveling
4. Outline
● Introduction: A common data science problem
● Ibis: A high level introduction
● Ibis Language
● Pyspark backend for Ibis
● Demo
● Conclusion
5. A common data science problem...
df = pd.read_parquet('s3://…')
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature'] = (df.groupby('key')['feature']
                   .rolling(7, min_periods=1).mean()
                   .sort_index(level=1)
                   .reset_index(drop=True))
df.groupby('key').agg({'feature': ['min', 'max']})
6. You are happy until... the code runs too slow
● Pandas is good for medium-sized data (<10 GB)
● Your machine doesn’t have enough RAM to hold all the data
● You want to use more CPUs
7. Try a few things...
● Use a bigger machine (you can get machines with 1 TB of RAM these days)
● Pros:
○ Low human cost: no code change
● Cons:
○ Not scalable
○ Same software limits (e.g., single-threaded)
○ Can be difficult depending on your organization
8. Try a few things...
● Use a generic way to distribute pandas code:
○ e.g.: sparkContext.parallelize(range(2000, 2020)).map(lambda year: compute_for_year(year)).collect()
○ dask.delayed (see the sketch after this slide)
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
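As a minimal sketch of the dask.delayed approach (not from the talk; compute_for_year is a hypothetical user-defined function standing in for a per-year pandas computation):

# Minimal sketch (not from the talk): distributing an embarrassingly
# parallel, per-year pandas computation with dask.delayed.
import dask

@dask.delayed
def compute_for_year(year):
    # Placeholder: read one year of data and compute features with pandas.
    return year  # stand-in for a real pandas result

# One lazy task per year; dask.compute runs them in parallel.
tasks = [compute_for_year(year) for year in range(2000, 2020)]
results = dask.compute(*tasks)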
9. Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ …
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
11. Separation of API and execution
● The problem:
○ The software and hardware we use (pandas on a single machine) cannot execute the computation we want to do.
● Can we separate “what we want to do” from “how it is done”?
12. Separation of API and execution
● Yes!
● SQL is a way to express computation that is independent of the software and hardware
● Can we have something that is like SQL, but for Python? (See the sketch below.)
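To make the analogy concrete, here is an illustrative pairing (not from the slides; the table and column names are hypothetical): the same grouped aggregation written as SQL and as an Ibis expression in Python.

# Illustrative only. The SQL version, for comparison:
#
#   SELECT key, MIN(feature2) AS min_f, MAX(feature2) AS max_f
#   FROM events
#   GROUP BY key
#
import ibis

table = ibis.table([('key', 'string'), ('feature2', 'double')], name='events')
expr = table.group_by('key').aggregate(
    min_f=table.feature2.min(),
    max_f=table.feature2.max(),
)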
13. Don’t limit yourself to what you can use today
● Current “DataFrame”:
○ Pandas
○ Spark
○ Dask
● Future “DataFrame”:
○ Arrow Dataset
○ cuDF / dask-cudf
○ Ray / Modin
○ ...
15. Ibis: Python Data Analysis Productivity Framework
● Open source
● Started in 2014 by Wes McKinney
● Worked on by top committers of Pandas:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
16. Ibis components
● ibis “language”
○ The ibis language is used to express the computation
● ibis “backends”
○ Modules that translate ibis expressions into something executable by different computation engines (a minimal usage sketch follows this slide)
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ...
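As a hedged sketch of how the two components fit together (the exact API differs across Ibis versions), an expression built with the Ibis language can be executed by a backend such as ibis.pandas:

# Minimal sketch; API details may vary by Ibis version.
import ibis
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'v1': [1.0, 2.0, 3.0],
                   'v2': [4.0, 5.0, 6.0]})

# Backend: the pandas backend executes expressions against in-memory DataFrames.
con = ibis.pandas.connect({'t': df})
table = con.table('t')

# Language: the expression describes *what* to compute...
expr = table.mutate(feature=(table.v1 + table.v2) / 2)

# ...and the backend decides *how* to execute it.
result = expr.execute()  # a pandas DataFrame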
36-43. Basic expression and column selection
Ibis expression
table.mutate(
    feature=(table.v1 + table.v2) / 2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
● t is a PySparkTranslator. It has a translate() method that translates an Ibis expression into a PySpark object.
● expr is the Ibis expression to translate.
● scope is a dict that caches the results of previously translated Ibis expressions.
● left and right are PySpark columns; the final line is a PySpark column division.
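The @compiles decorator above registers one translation function per Ibis operation type. As an illustrative sketch only (not Ibis's actual implementation), such a dispatch mechanism could look roughly like this:

# Illustrative sketch: a registry mapping each Ibis operation type to the
# function that compiles it into a PySpark object.
_registry = {}

def compiles(op_class):
    """Register a compile rule for the given Ibis operation type."""
    def decorator(fn):
        _registry[op_class] = fn
        return fn
    return decorator

class PySparkTranslator:
    def translate(self, expr, scope, **kwargs):
        # Look up the rule for this expression's operation and apply it,
        # passing the translator so the rule can translate sub-expressions.
        rule = _registry[type(expr.op())]
        return rule(self, expr, scope, **kwargs)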
44-46. Basic expression and column selection
Ibis expression
table.mutate(
    feature=(table.v1 + table.v2) / 2
)
Code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
● left and right are PySpark columns; the final line is a PySpark column addition.
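Putting these rules together, the mutate expression can be compiled against the PySpark backend. The following is a hedged usage sketch (connect()/compile() details may differ between Ibis versions; 'events' is a hypothetical table name):

# Usage sketch; API details may vary by Ibis version.
import ibis
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
con = ibis.pyspark.connect(spark_session)
table = con.table('events')  # hypothetical table registered in the session

expr = table.mutate(feature=(table.v1 + table.v2) / 2)

# compile() walks the expression tree through rules such as compile_add and
# compile_divide and returns a PySpark DataFrame.
spark_df = expr.compile()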
62-64. Composite aggregation
Ibis expression
metrics = [
    table.feature2.min(),
    table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Code
@compiles(ops.Aggregation)
def compile_aggregation(t, expr, scope, **kwargs):
    op = expr.op()
    pyspark_df = t.translate(op.table, scope)
    if op.by:
        context = AggregationContext.GROUP
        aggs = [_compile_agg(...) for m in op.metrics]
        bys = [t.translate(b, ...) for b in op.by]
        return pyspark_df.groupby(*bys).agg(*aggs)
    else:
        context = AggregationContext.ENTIRE
        aggs = [_compile_agg(...) for m in op.metrics]
● aggs is a list of PySpark columns / column expressions
● pyspark_df.groupby(*bys).agg(*aggs) is a PySpark groupby and aggregate
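For intuition, here is a rough sketch (not from the slides) of the PySpark call the grouped branch produces for the example expression above, where pyspark_df is the translated PySpark DataFrame:

# Rough PySpark equivalent of the grouped branch for the example expression.
import pyspark.sql.functions as F

result = pyspark_df.groupby('key').agg(
    F.min('feature2'),
    F.max('feature2'),
)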
67. Conclusion
● Separate API from execution
● Don’t limit yourself to what you can use today; think about what you can use in the future
○ Arrow Dataset
○ Modin
○ cuDF / dask-cudf
○ ...