WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Bill Chambers, Databricks
Tactical Data Science Tips:
Python and Spark Together
#UnifiedDataAnalytics #SparkAISummit
Overview of this talk
Set Context for the talk
Introductions
Discuss Spark + Python + ML
5 Ways to Process Data with Spark & Python
2 Data Science Use Cases and how we implemented them
Setting Context: You
Data Scientists vs Data Engineers?
Years with Spark?
1/3/5+
Number of Spark Summits?
1/3/5+
Understanding of catalyst optimizer?
Yes/No
Years with pandas?
1/3/5+
Models/Data Science use cases in production?
1/10/100+
Setting Context: Me
4 years at Databricks
~0.5 yr as Solutions Architect,
2.5 in Product,
1 in Data Science
Wrote a book on Spark
Master’s in Information Systems
History undergrad
Setting Context: My Biases
Spark is an awesome (but sometimes complex) tool
Information organization is a key to success
Bias for practicality and action
5 ways of processing data
with Spark and Python
5 Ways of Processing with Python
RDDs
DataFrames
Koalas
UDFs
pandasUDFs
Resilient Distributed Datasets (RDDs)
rdd = sc.parallelize(range(1000), 5)
rdd.take(10)
rdd.map(lambda x: (x, x * 10)).take(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[(0, 0), (1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60), (7, 70), (8, 80), (9, 90)]
Resilient Distributed Datasets (RDDs)
What that requires…
[Diagram: JVM serializes each row → a Python worker process deserializes the row, performs the operation, and serializes the result → the JVM deserializes the row back]
Key Points
Expensive to operate (starting and shutting down
processes, pickle serialization)
Majority of operations can be performed using
DataFrames (next processing method)
Don’t use RDDs in Python
DataFrames
df = spark.range(1000)
print(df.limit(10).collect())
df = df.withColumn("col2", df.id * 10)
print(df.limit(10).collect())
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4), Row(id=5), Row(id=6),
Row(id=7), Row(id=8), Row(id=9)]
[Row(id=0, col2=0), Row(id=1, col2=10), Row(id=2, col2=20), Row(id=3,
col2=30), Row(id=4, col2=40), Row(id=5, col2=50), Row(id=6, col2=60),
Row(id=7, col2=70), Row(id=8, col2=80), Row(id=9, col2=90)]
DataFrames
What that requires…
[Diagram: data stays in Spark's Catalyst internal row format end to end; no Python worker or per-row serialization is needed]
Key Points
Provides numerous operations (nearly anything
you’d find in SQL)
By using an internal format, DataFrames give
Python the same performance profile as Scala
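
To make the "nearly anything you'd find in SQL" point concrete, here is a minimal sketch (not from the talk; the column expressions are made up) of filtering, grouping, and aggregating a DataFrame:

from pyspark.sql import functions as F

df = spark.range(1000).withColumn("col2", F.col("id") * 10)
(df.where(F.col("col2") > 100)                    # SQL-style WHERE
   .groupBy((F.col("id") % 10).alias("bucket"))   # SQL-style GROUP BY
   .agg(F.avg("col2").alias("avg_col2"))          # SQL-style aggregate
   .orderBy("bucket")                             # SQL-style ORDER BY
   .show(5))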
Koalas DataFrames
import databricks.koalas as ks
kdf = ks.DataFrame(spark.range(1000))
kdf['col2'] = kdf.id * 10 # note pandas syntax
kdf.head(10) # returns pandas dataframe
Koalas: Background
Use Koalas - a library that aims
to make the pandas API
available on Spark.
pip install koalas
Koalas DataFrames
What that requires…
[Diagram: as with Spark DataFrames, data stays in Spark's Catalyst internal row format end to end]
Key Points
Koalas gives some API consistency gains between PySpark + pandas
It's never going to match either pandas or PySpark fully
Try it and see if it covers your use case; if not, move to DataFrames (see the sketch below)
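
A minimal sketch of that fallback path (assuming the databricks.koalas package shown above): convert to a plain Spark DataFrame when Koalas doesn't cover an operation, then convert back when it does.

import databricks.koalas as ks

kdf = ks.DataFrame(spark.range(1000))       # pandas-style API on Spark
sdf = kdf.to_spark()                        # drop down to a Spark DataFrame...
sdf = sdf.withColumn("col2", sdf.id * 10)   # ...for any PySpark-only operation
kdf = sdf.to_koalas()                       # and come back to the pandas-style API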
Key gap between RDDs + DataFrames
No way to run “custom” code on a row or a subset of the data
The next two processing methods are “user-defined functions,” or “UDFs”
DataFrame UDFs
from pyspark.sql.functions import udf

df = spark.range(1000)
df = df.withColumn("col2", df.id * 10)  # col2 as created on the earlier DataFrame slides

@udf  # note: a bare @udf defaults to a StringType return
def regularPyUDF(value):
    return value * 10

df = df.withColumn("col3_udf_", regularPyUDF(df.col2))
DataFrame UDFs
What it requires…
[Diagram: JVM serializes each row → a Python worker process deserializes the row, performs the operation, and serializes the result → the JVM deserializes the row back (the same path as RDDs)]
Key Points
Legacy Python UDFs are essentially the same as
RDDs
Suffer from nearly all the same inadequacies
Should never be used in place of PandasUDFs
(next processing method)
DataFrames + PandasUDFs
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.range(1000)
df = df.withColumn("col2", df.id * 10)  # col2 as created on the earlier DataFrame slides

@pandas_udf('integer', PandasUDFType.SCALAR)
def pandasPyUDF(pandas_series):
    # receives a whole batch as a pandas Series
    # (the input could also be a pandas DataFrame if multiple rows/columns are passed)
    return pandas_series.multiply(10)

df = df.withColumn("col3_pandas_udf_", pandasPyUDF(df.col2))
DataFrames + PandasUDFs
[Diagram: JVM serializes Catalyst rows to Arrow (optimized with Apache Arrow) → Python deserializes the Arrow batch as a pandas DataFrame or Series → performs the operation (optimized with pandas + NumPy) → serializes back to Arrow → JVM deserializes back to the Catalyst format]
Key Points
Performance won't match pure DataFrames
Extremely flexible + follows best practices (pandas); see the grouped sketch below
Limitations:
Going to be challenging when working with GPUs + Deep Learning Frameworks (connecting to hardware = a challenge)
Each batch passed to pandas must fit in the executor's memory
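
One example of that flexibility (a sketch, not the talk's code): the grouped flavor of pandasUDFs, where each group arrives as a whole pandas DataFrame rather than a Series of values.

from pyspark.sql.functions import col, pandas_udf, PandasUDFType

df = spark.range(1000).withColumn("bucket", col("id") % 10)

@pandas_udf("id long, bucket long, id_scaled long", PandasUDFType.GROUPED_MAP)
def scale_bucket(pdf):
    # pdf is a pandas DataFrame holding one entire bucket
    pdf["id_scaled"] = pdf["id"] * 10
    return pdf

df.groupBy("bucket").apply(scale_bucket).show(5)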
Conclusion from 5 ways of processing
Use Koalas if it works for you, but Spark
DataFrames are the “safest” option
Use pandasUDFs for user-defined functions
2 Data Science use cases
and how we implemented
them
2 Data Science Use Cases
Growth Forecasting
2 methods for implementation:
a. Use information about a single
customer
– Low n, low k, high m
b. Use information about all
customers
– High n, low k, low m
Churn Prediction
Low n, low k, low m
3 Key Dimensions
How many input rows do you have? [n]
Large (10M) vs small (10K)
How many input features do you have? [k]
Large (1M) vs small (100)
How many models do you need to produce? [m]
Large (1K) vs small (1)
Growth Forecasting (variation a)
(low n, low k, high m)
“Historical” Approach:
• Collect to driver
• Use RDD pipe to try and send it to another process or other RDD code

“Our” Approach:
• pandasUDF for distributed training (of small models)
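
A hedged sketch of what that pandasUDF pattern could look like. Nothing here is the talk's actual code: customers_df, its customer_id/week/usage columns, and the scikit-learn model are illustrative stand-ins. The idea is simply one small model fit per customer group, returning only the fitted parameters.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.linear_model import LinearRegression

@pandas_udf("customer_id long, slope double, intercept double",
            PandasUDFType.GROUPED_MAP)
def fit_growth_model(pdf):
    # pdf holds every row for a single customer (low n per group, so it fits in memory)
    model = LinearRegression().fit(pdf[["week"]], pdf["usage"])
    return pd.DataFrame({"customer_id": [pdf["customer_id"].iloc[0]],
                         "slope": [model.coef_[0]],
                         "intercept": [model.intercept_]})

# customers_df: hypothetical Spark DataFrame with customer_id, week, usage columns
fits = customers_df.groupBy("customer_id").apply(fit_growth_model)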
Growth Forecasting (variation b)
(high n, low k, low m)
“Historical” Approach:
• Use Spark’s MLlib; distributed algorithms to approach the problem

Our Approach:
• MLlib
• Horovod
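
For the high-n variant, a minimal MLlib sketch (the DataFrame and column names are placeholders, and the Horovod path is omitted): a single model trained in a distributed fashion over all customers' rows.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# all_customers_df: hypothetical Spark DataFrame with numeric features and a "growth" label
assembler = VectorAssembler(inputCols=["feat1", "feat2", "feat3"],
                            outputCol="features")
train_df = assembler.transform(all_customers_df)

lr = LinearRegression(featuresCol="features", labelCol="growth")
model = lr.fit(train_df)   # the fit itself is distributed across the cluster
print(model.coefficients, model.intercept)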
Churn Prediction
(low n, low k, low m)
“Historical” Approach:
• Tune on a single machine
• Complex process to distribute training

Our Approach:
• pandasUDF for distributed hyperparameter tuning (of small models)
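
A hedged sketch of that tuning pattern (the churn_pdf training set, its columns, the model, and the grid are all illustrative): the small dataset rides along in the UDF's closure, and Spark fans one hyperparameter combination out to each task.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# churn_pdf: hypothetical small pandas DataFrame of labeled churn examples
X = churn_pdf[["tenure", "usage", "tickets"]]
y = churn_pdf["churned"]

# one row per hyperparameter combination to evaluate
grid = [(i, d, n) for i, (d, n) in
        enumerate([(d, n) for d in [3, 5, 10] for n in [50, 100]])]
params = spark.createDataFrame(grid, ["trial_id", "max_depth", "n_estimators"])

@pandas_udf("trial_id long, max_depth long, n_estimators long, score double",
            PandasUDFType.GROUPED_MAP)
def evaluate_trial(pdf):
    # X and y are captured in the closure and shipped to each task (fine for small data)
    row = pdf.iloc[0]
    clf = RandomForestClassifier(max_depth=int(row.max_depth),
                                 n_estimators=int(row.n_estimators))
    score = cross_val_score(clf, X, y, cv=3).mean()
    return pd.DataFrame({"trial_id": [row.trial_id],
                         "max_depth": [row.max_depth],
                         "n_estimators": [row.n_estimators],
                         "score": [score]})

results = params.groupBy("trial_id").apply(evaluate_trial)
results.orderBy("score", ascending=False).show()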
Code walkthrough of each method
See gist.github.com/anabranch for the code
The Final Gap
Each of these methods operates slightly differently
Some are distributed, some are not
Consistency in production is essential to success
Know your inputs and outputs, features, etc.
MLflow is key to our success
Allows tracking of all inputs to the model + results:
- Inputs (data + hyperparameters)
- Models (trained and untrained)
- Rebuild everything after the fact
See gist.github.com/anabranch for the code
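
A minimal sketch of what that tracking could look like with the standard mlflow API (the parameters, file name, model, and X_train/X_val variables are illustrative, not the talk's code):

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

with mlflow.start_run():
    params = {"max_depth": 5, "n_estimators": 100}
    mlflow.log_params(params)                                    # hyperparameters
    mlflow.log_param("training_data", "churn_training.parquet")  # pointer to the input data

    clf = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", clf.score(X_val, y_val))   # results
    mlflow.sklearn.log_model(clf, "model")                       # the trained model itself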
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
