2. About Me: Li Jin
● Modeling Platform and Tools @ Two Sigma
● Apache Spark, Apache Arrow, Ibis
3. About Me: Hyonjee Joo
● Software engineer at Two Sigma
● Distributed compute, Spark
● Running, hiking, traveling
4. Outline
● Introduction: A common data science problem
● Ibis: A high level introduction
● Ibis Language
● Pyspark backend for Ibis
● Demo
● Conclusion
5. A common data science problem...
df = pd.read_parquet('s3://…')
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature'] = (df.groupby('key')['feature']
                   .rolling(7, min_periods=1).mean()
                   .sort_index(level=1)
                   .reset_index(drop=True))
df.groupby('key').agg({'feature': ['min', 'max']})
6. You are happy until... the code runs too slow
● Pandas is good for medium-sized data (<10 GB)
● Your machine doesn’t have enough RAM to hold all the data
● You want to use more CPUs
7. Try a few things...
● Use a bigger machine (you can get machines with 1 TB of RAM these days)
● Pros:
○ Low human cost: no code change
● Cons:
○ Not scalable
○ Same software limits (e.g., single-threaded)
○ Can be difficult depending on your organization
8. Try a few things...
● Use a generic way to distribute pandas code:
○ e.g.: sparkContext.parallelize(range(2000, 2020)).map(lambda year: compute_for_year(year)).collect()
○ dask.delayed (see the sketch after this slide)
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
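As a minimal sketch of the dask.delayed approach (not from the talk; compute_for_year is a hypothetical user-defined function standing in for a per-year pandas computation):

# Minimal sketch (not from the talk): distributing an embarrassingly
# parallel, per-year pandas computation with dask.delayed.
import dask

@dask.delayed
def compute_for_year(year):
    # Placeholder: read one year of data and compute features with pandas.
    return year  # stand-in for a real pandas result

# One lazy task per year; dask.compute runs them in parallel.
tasks = [compute_for_year(year) for year in range(2000, 2020)]
results = dask.compute(*tasks)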
9. Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ …
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
11. Separation of API and execution
● The problem:
○ The software and hardware we use (pandas on a single machine) cannot execute the computation we want to do.
● Can we separate “what we want to do” from “how it is done”?
12. Separation of API and execution
● Yes!
● SQL is a way to express computation that is independent of the software and hardware
● Can we have something that is like SQL, but for Python? (See the sketch below.)
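To make the analogy concrete, here is an illustrative pairing (not from the slides; the table and column names are hypothetical): the same grouped aggregation written as SQL and as an Ibis expression in Python.

# Illustrative only. The SQL version, for comparison:
#
#   SELECT key, MIN(feature2) AS min_f, MAX(feature2) AS max_f
#   FROM events
#   GROUP BY key
#
import ibis

table = ibis.table([('key', 'string'), ('feature2', 'double')], name='events')
expr = table.group_by('key').aggregate(
    min_f=table.feature2.min(),
    max_f=table.feature2.max(),
)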
13. Don’t limit yourself to what you can use today
● Current “DataFrame”:
○ Pandas
○ Spark
○ Dask
● Future “DataFrame”:
○ Arrow Dataset
○ cuDF / dask-cudf
○ Ray / Modin
○ ...
15. Ibis: Python Data Analysis Productivity Framework
● Open source
● Started in 2014 by Wes McKinney
● Worked on by top committers of Pandas:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
16. Ibis components
● ibis “language”
○ The ibis language is used to express the computation
● ibis “backends”
○ Modules that translate ibis expressions into something executable by different computation engines (a minimal usage sketch follows this slide)
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ...
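As a hedged sketch of how the two components fit together (the exact API differs across Ibis versions), an expression built with the Ibis language can be executed by a backend such as ibis.pandas:

# Minimal sketch; API details may vary by Ibis version.
import ibis
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'v1': [1.0, 2.0, 3.0],
                   'v2': [4.0, 5.0, 6.0]})

# Backend: the pandas backend executes expressions against in-memory DataFrames.
con = ibis.pandas.connect({'t': df})
table = con.table('t')

# Language: the expression describes *what* to compute...
expr = table.mutate(feature=(table.v1 + table.v2) / 2)

# ...and the backend decides *how* to execute it.
result = expr.execute()  # a pandas DataFrame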
36-43. Basic expression and column selection
Ibis expression
table.mutate(
    feature=(table.v1 + table.v2) / 2
)
Code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
● t is a PySparkTranslator. It has a translate() method that translates an Ibis expression into a PySpark object.
● expr is the Ibis expression to translate.
● scope is a dict that caches the results of previously translated Ibis expressions.
● left and right are PySpark columns; the final line is a PySpark column division.
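The @compiles decorator above registers one translation function per Ibis operation type. As an illustrative sketch only (not Ibis's actual implementation), such a dispatch mechanism could look roughly like this:

# Illustrative sketch: a registry mapping each Ibis operation type to the
# function that compiles it into a PySpark object.
_registry = {}

def compiles(op_class):
    """Register a compile rule for the given Ibis operation type."""
    def decorator(fn):
        _registry[op_class] = fn
        return fn
    return decorator

class PySparkTranslator:
    def translate(self, expr, scope, **kwargs):
        # Look up the rule for this expression's operation and apply it,
        # passing the translator so the rule can translate sub-expressions.
        rule = _registry[type(expr.op())]
        return rule(self, expr, scope, **kwargs)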
44-46. Basic expression and column selection
Ibis expression
table.mutate(
    feature=(table.v1 + table.v2) / 2
)
Code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
● left and right are PySpark columns; the final line is a PySpark column addition.
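Putting these rules together, the mutate expression can be compiled against the PySpark backend. The following is a hedged usage sketch (connect()/compile() details may differ between Ibis versions; 'events' is a hypothetical table name):

# Usage sketch; API details may vary by Ibis version.
import ibis
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
con = ibis.pyspark.connect(spark_session)
table = con.table('events')  # hypothetical table registered in the session

expr = table.mutate(feature=(table.v1 + table.v2) / 2)

# compile() walks the expression tree through rules such as compile_add and
# compile_divide and returns a PySpark DataFrame.
spark_df = expr.compile()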
62-64. Composite aggregation
Ibis expression
metrics = [
    table.feature2.min(),
    table.feature2.max()
]
result = table.group_by('key').aggregate(metrics)
Code
@compiles(ops.Aggregation)
def compile_aggregation(t, expr, scope, **kwargs):
    op = expr.op()
    pyspark_df = t.translate(op.table, scope)
    if op.by:
        context = AggregationContext.GROUP
        aggs = [_compile_agg(...) for m in op.metrics]
        bys = [t.translate(b, ...) for b in op.by]
        return pyspark_df.groupby(*bys).agg(*aggs)
    else:
        context = AggregationContext.ENTIRE
        aggs = [_compile_agg(...) for m in op.metrics]
● aggs is a list of PySpark columns / column expressions
● pyspark_df.groupby(*bys).agg(*aggs) is a PySpark groupby and aggregate
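For intuition, here is a rough sketch (not from the slides) of the PySpark call the grouped branch produces for the example expression above, where pyspark_df is the translated PySpark DataFrame:

# Rough PySpark equivalent of the grouped branch for the example expression.
import pyspark.sql.functions as F

result = pyspark_df.groupby('key').agg(
    F.min('feature2'),
    F.max('feature2'),
)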
67. Conclusion
● Separate API from execution
● Don’t limit yourself to what you can use today; think about what you can use in the future
○ Arrow Dataset
○ Modin
○ cuDF / dask-cudf
○ ...