Intro to PySpark: Python Data Analysis at scale in the Cloud

Welcome to ServerlessToronto.org
“Home of Less IT Mess”
1
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…

Serverless is not just about the Tech:
2
Serverless is New Agile & Mindset
Serverless Dev (gluing
other people’s APIs and
managed services)
We're obsessed to
creating business value
(meaningful MVPs,
products), by helping
Startups & empowering
Business users!
We build bridges
between Serverless
Community (“Dev leg”),
and Front-end & Voice-
First folks (“UX leg”),
and empower UX
developers
Achieve agility NOT by
“sprinting” faster (like in
Scrum), but by working
smarter (by using
bigger building blocks
and less Ops)

Upcoming #ServerlessTO Online Meetups
3
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry
Sr. Product Marketing Manager, and Karla Nussbaumer, Head of
Technical Marketing at Twilio **JULY 9 @ 6pm **
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?

Feature Talk
Jonathan Rioux, Head Data Scientist at
EPAM Systems & author of Manning book
PySpark in Action
4

Getting acquainted
with PySpark
1/49

If you have not ﬁlled the Meetup survey, now is the
time to do it!
(Also copied in the chat)
https://forms.gle/6cyWGVY4L4GJvsXh7
2/49

Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
3/49

Hi! I'm Jonathan
Head of DS @ EPAM Canada
3/49

Hi! I'm Jonathan
Author of PySpark in Action →
3/49

Hi! I'm Jonathan
Author of PySpark in Action →
<3 Spark, <3 <3 Python
3/49

Goals of this presentation
6/49

Share my love of (Py)Spark
6/49

Explain where PySpark shines
6/49

Introduce the Python + Spark interop
6/49

Get you excited about using PySpark
6/49

Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
6/49

What I expect from you
You know a little bit of Python
7/49

You know what SQL is
7/49

You know what SQL is
You won't hesitate to ask questions :-)
7/49

What is Spark
Spark is a uniﬁed analytics engine for large-scale
data processing
8/49

What is Spark (bis)
Spark can be thought of a data factory that you
(mostly) program like a cohesive computer.
9/49

Spark as an analytics factory
11/49

Data manipulation uses the
same vocabulary as SQL
(
my_table
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
.groupby("age")
.count("*")
)
13/49

(
my_table
.groupby("age")
.count("*")
)
select
13/49

(
my_table
.groupby("age")
.count("*")
)
where
13/49

.groupby("age")
(
my_table
.count("*")
)
group by
13/49

.count("*")
(
my_table
.groupby("age")
)
count
13/49

I mean, you can legitimately use
SQL
spark.sql("""
select count(*) from (
select id, first_name, last_name, age
from my_table
where age > 21
)
group by age""")
14/49

Data manipulation and machine
learning with a uent API
results = (
spark.read.text("./data/Ch02/1342-0.txt")
.select(F.split(F.col("value"), " ").alias("line"))
.select(F.explode(F.col("line")).alias("word"))
.select(F.lower(F.col("word")).alias("word"))
.select(F.regexp_extract(F.col("word"),
"[a-z']*", 0).alias("word"))
.where(F.col("word") != "")
.groupby(F.col("word"))
.count()
)
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Read a text ﬁle
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Select the column value, where each element is
splitted (space as a separator). Alias to line.
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Explode each element of line into its own record.
Alias to word.
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Lower-case each word
15/49

"[a-z']*", 0).alias("word"))
results = (
.count()
)
Extract only the ﬁrst group of lower-case letters from
each word.
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Keep only the records where the word is not the
empty string.
15/49

results = (
"[a-z']*", 0).alias("word"))
.count()
)
Group by word
15/49

.count()
results = (
"[a-z']*", 0).alias("word"))
)
Count the number of records in each group
15/49

   
Scala is not the only player in
town
16/49

Summoning PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
"spark.jars.packages",
("com.google.cloud.spark:"
"spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()
19/49

Summoning PySpark

).getOrCreate()
A SparkSession is your entry point to distributed data manipulation
19/49

Summoning PySpark
).getOrCreate()

We create our SparkSession with an optional library to access BigQuery as a data source.
19/49

Reading data
from functools import reduce
from pyspark.sql import DataFrame

def read_df_from_bq(year):
return (
spark.read.format("bigquery")
.option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
.option("credentialsFile", "bq-key.json")
.load()
)

gsod = (
reduce(
DataFrame.union, [read_df_from_bq(year)
for year in range(2010, 2020)]
20/49

Reading data
return (
.load()
)
from functools import reduce
from pyspark.sql import DataFrame

gsod = (
reduce(
We create a helper function to read our code from BigQuery.
20/49

Reading data
gsod = (
reduce(
)
)

return (
.load()
)

A DataFrame is a regular Python object.
20/49

Using the power of the schema
# root
# |-- stn: string (nullable = true)
# |-- wban: string (nullable = true)
# |-- year: string (nullable = true)
# |-- mo: string (nullable = true)
# |-- da: string (nullable = true)
# |-- temp: double (nullable = true)
# |-- count_temp: long (nullable = true)
# |-- dewp: double (nullable = true)
# |-- count_dewp: long (nullable = true)
# |-- slp: double (nullable = true)
# |-- count_slp: long (nullable = true)
# |-- stp: double (nullable = true)
# |-- count_stp: long (nullable = true)
# |-- visib: double (nullable = true)
# [...]
gsod.printSchema()
The schema will give us the column names and their types.
21/49

And showing data
gsod = gsod.select("stn", "year", "mo", "da", "temp")

gsod.show(5)

# Approximately 5 seconds waiting
# +------+----+---+---+----+
# | stn|year| mo| da|temp|
# +------+----+---+---+----+
# |359250|2010| 02| 25|25.2|
# |359250|2010| 05| 25|65.0|
# |386130|2010| 02| 19|35.4|
# |386130|2010| 03| 15|52.2|
# |386130|2010| 01| 21|37.9|
# +------+----+---+---+----+
# only showing top 5 rows
22/49

What happens behind the scenes?
23/49

Any data frame transformation will be stored until we need the
data.
Then, when we trigger an action, (Py)Spark will go and optimize
the query plan, select the best physical plan and apply the
transformation on the data.
24/49

Transformations
select
Actions
25/49

Transformations
select
ﬁlter
Actions
25/49

Transformations
select
ﬁlter
group by
Actions
25/49

Transformations
select
ﬁlter
group by
partition
Actions
25/49

Transformations
select
ﬁlter
group by
partition
Actions
write
25/49

Transformations
select
ﬁlter
group by
partition
Actions
write
show
25/49

Transformations
select
ﬁlter
group by
partition
Actions
write
show
count
25/49

Transformations
select
ﬁlter
group by
partition
Actions
write
show
count
toPandas
25/49

Something a little more complex
import pyspark.sql.functions as F

stations = (
.option("table", f"bigquery-public-data.noaa_gsod.stations")
.load()
)

# We want to get the "hottest Countries" that have at least 60 measures
answer = (
gsod.join(stations, gsod["stn"] == stations["usaf"])
.where(F.col("country").isNotNull())
.groupBy("country")
.agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
).where(F.col("count") > 12 * 5)
read, join, where, groupby, avg/count, where, orderby, show
26/49

27/49

28/49

Python or SQL?
gsod.createTempView("gsod")
stations.createTempView("stations")

spark.sql("""
select country, avg(temp) avg_temp, count(*) count
from gsod
inner join stations
on gsod.stn = stations.usaf
where country is not null
group by country
having count > (12 * 5)
order by avg_temp desc
""").show(5)
29/49

Python or SQL?

spark.sql("""
from gsod
inner join stations
group by country
""").show(5)
We register the data frames as Spark SQL tables.
29/49

Python or SQL?

spark.sql("""
from gsod
inner join stations
group by country
""").show(5)
We then can query using SQL without leaving Python!
29/49

Python and SQL!
(
spark.sql(
"""
from gsod
inner join stations
group by country"""
)
.where("country is not null")
.where("count > (12 * 5)")
.orderby("avg_temp", ascending=False)
.show(5)
)
30/49

Scalar UDF
import pandas as pd
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
"""Transforms Farhenheit to Celcius."""
return (degrees - 32) * 5 / 9

f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
34/49

Scalar UDF
import pandas as pd


# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
PySpark types are objects in the pyspark.sql.types modules.
34/49

Scalar UDF
import pandas as pd


# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
We promote a regular Python function to a User Deﬁned Function via a
decorator.
34/49

Scalar UDF
import pandas as pd


# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
A simple function on pandas Series
34/49

Scalar UDF
import pandas as pd


# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
Still unit-testable :-)
34/49

Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
35/49

Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
A UDF can be used like any PySpark function.
35/49

Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.

If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
36/49

Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
"""Returns a simple normalization of the temperature for a site.

If the temperature is constant for the whole window,
defaults to 0.5.
"""
temp = temp_by_day.temp
answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
if temp.min() == temp.max():
return answer.assign(temp_norm=0.5)
return answer.assign(
temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
)
A regular, fun, harmless function on (pandas) DataFrames
36/49

Grouped Map UDF
scale_temp_schema = (
"stn string, year string, mo string, "
"da string, temp double, temp_norm double"
)

gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
scale_temperature, schema=scale_temp_schema
)

gsod.show(5, False)

# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
37/49

Grouped Map UDF
)

)

gsod.show(5, False)

# +------+----+---+---+----+------------------+
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
We provide PySpark the schema we expect our function to return
37/49

Grouped Map UDF
)
)

gsod.show(5, False)

# +------+----+---+---+----+------------------+
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
We just have to partition (using group), and then applyInPandas!
37/49

You are not limited library-wise
from sklearn.linear_model import LinearRegression

def rate_of_change_temperature(
day: pd.Series,
temp: pd.Series
) -> float:
"""Returns the slope of the daily temperature
for a given period of time."""
return (
LinearRegression()
.fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
.coef_[0]
)
39/49

result = gsod.groupby("stn", "year", "mo").agg(
rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias(
"rt_chg_temp"
)
)

result.show(5, False)
# +------+----+---+---------------------+
# |stn |year|mo |rt_chg_temp |
# +------+----+---+---------------------+
# |010250|2018|12 |-0.01014397905759162 |
# |011120|2018|11 |-0.01704736746691528 |
# |011150|2018|10 |-0.013510329829648423|
# |011510|2018|03 |0.020159116598556657 |
# |011800|2018|06 |0.012645501680677372 |
# +------+----+---+---------------------+
40/49

From the README.md
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame(
{
'x':range(3),
'y':['a','b','b'],
'z':['a','b','b'],
}
)

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']
44/49

"Serverless" Spark?
Cost-eﬀective for sporadic
runs
47/49

"Serverless" Spark?
runs
Scales easily
47/49

"Serverless" Spark?
runs
Scales easily
Simpliﬁed maintenance
47/49

"Serverless" Spark?
runs
Scales easily
Easy to become expensive
47/49

"Serverless" Spark?
runs
Scales easily
Sometimes confusing
pricing model
47/49

"Serverless" Spark?
runs
Scales easily
Sometimes confusing
pricing model
Uneven documentation
47/49

Join www.ServerlessToronto.org
Home of “Less IT Mess”

Intro to PySpark: Python Data Analysis at scale in the Cloud

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Intro to PySpark: Python Data Analysis at scale in the Cloud

Similar to Intro to PySpark: Python Data Analysis at scale in the Cloud (20)

More from Daniel Zivkovic

More from Daniel Zivkovic (20)

Recently uploaded

Recently uploaded (20)

Intro to PySpark: Python Data Analysis at scale in the Cloud