Welcome to
“Home of Less IT Mess”
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…
Serverless is not just about the Tech:
Serverless is New Agile & Mindset
Serverless Dev (gluing
other people’s APIs and
managed services)
We're obsessed with
creating business value
(meaningful MVPs,
products), by helping
Startups & empowering
Business users!
We build bridges
between Serverless
Community (“Dev leg”),
and Front-end & Voice-
First folks (“UX leg”),
and empower UX
Achieve agility NOT by
“sprinting” faster (like in
Scrum), but by working
smarter (by using
bigger building blocks
and less Ops)
Upcoming #ServerlessTO Online Meetups
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry
Sr. Product Marketing Manager, and Karla Nussbaumer, Head of
Technical Marketing at Twilio **JULY 9 @ 6pm **
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
Feature Talk
Jonathan Rioux, Head Data Scientist at
EPAM Systems & author of Manning book
PySpark in Action
Getting acquainted
with PySpark
If you have not filled out the Meetup survey, now is the
time to do it!
(Also copied in the chat)
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
Head of DS @ EPAM Canada
Author of PySpark in Action →
<3 Spark, <3 <3 Python
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
What I expect from you
You know a little bit of Python
You know what SQL is
You won't hesitate to ask questions :-)
What is Spark
Spark is a unified analytics engine for large-scale
data processing
What is Spark (bis)
Spark can be thought of as a data factory that you
(mostly) program like a cohesive computer.
Spark under the hood
Spark as an analytics factory
Why is PySpark
Data manipulation uses the
same vocabulary as SQL
.select("id", "first_name", "last_name", "age")
.where(col("age") > 21)
I mean, you can legitimately use SQL:
spark.sql("""
    select count(*) from (
        select id, first_name, last_name, age
        from my_table
        where age > 21
    )
    group by age""")
Data manipulation and machine
learning with a fluent API
results = (spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count())
Step by step:
1. Read a text file.
2. Select the column value, splitting each element on spaces. Alias to line.
3. Explode each element of line into its own record. Alias to word.
4. Lower-case each word.
5. Extract only the first group of lower-case letters (and apostrophes) from each word.
6. Keep only the records where the word is not the empty string.
7. Group by word.
8. Count the number of records in each group.
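To make each step of the word count concrete, here is a rough plain-Python equivalent that runs on a couple of in-memory lines. It is not PySpark and will not scale; the `count_words` helper and the sample sentences are made up for this sketch.

```python
import re
from collections import Counter


def count_words(lines):
    """Count words the way the pipeline above does, but in memory."""
    counts = Counter()
    for value in lines:                    # one record per line of text
        for token in value.split(" "):     # split on spaces, then "explode"
            word = token.lower()           # lower-case each word
            # Keep the first run of lower-case letters/apostrophes (may be empty).
            word = re.search(r"[a-z']*", word).group(0)
            if word != "":                 # drop empty strings
                counts[word] += 1          # group by word and count
    return counts


# Made-up sample input: two "records" of text.
sample = ["It is a truth universally acknowledged,", "that a single man"]
counts = count_words(sample)
print(counts.most_common(1))
```

The trailing comma in "acknowledged," disappears because the regular expression stops at the first character that is not a lower-case letter or an apostrophe.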
         
Scala is not the only player in
Let's code!
Summoning PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(
A SparkSession is your entry point to distributed data manipulation
We create our SparkSession with an optional library to access BigQuery as a data source.
Reading data
from functools import reduce
from pyspark.sql import DataFrame
def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )
gsod = reduce(
    DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)]
)
We create a helper function to read our data from BigQuery.
A DataFrame is a regular Python object.
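Because a data frame is an ordinary Python object, it can be handed to helpers such as `functools.reduce`. The sketch below shows the folding pattern with a tiny stand-in class rather than a real Spark data frame; `FakeFrame` and `read_year` are hypothetical names used only for illustration.

```python
from functools import reduce


class FakeFrame:
    """Hypothetical stand-in for a data frame: it only holds rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        # Like a data frame union: stack the rows of both frames.
        return FakeFrame(self.rows + other.rows)


def read_year(year):
    """Pretend to read one table per year (two made-up rows each)."""
    return FakeFrame([(year, "station_a"), (year, "station_b")])


# reduce calls FakeFrame.union(accumulated, next_frame), left to right.
combined = reduce(FakeFrame.union, [read_year(y) for y in range(2010, 2020)])
print(len(combined.rows))
```

Ten yearly "tables" of two rows each fold into a single object, which is the same shape as the yearly union on the slide.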
Using the power of the schema
gsod.printSchema()
# root
# |-- stn: string (nullable = true)
# |-- wban: string (nullable = true)
# |-- year: string (nullable = true)
# |-- mo: string (nullable = true)
# |-- da: string (nullable = true)
# |-- temp: double (nullable = true)
# |-- count_temp: long (nullable = true)
# |-- dewp: double (nullable = true)
# |-- count_dewp: long (nullable = true)
# |-- slp: double (nullable = true)
# |-- count_slp: long (nullable = true)
# |-- stp: double (nullable = true)
# |-- count_stp: long (nullable = true)
# |-- visib: double (nullable = true)
# [...]
The schema will give us the column names and their types.
And showing data
gsod = gsod.select("stn", "year", "mo", "da", "temp")
gsod.show(5)
# Approximately 5 seconds waiting
# +------+----+---+---+----+
# | stn|year| mo| da|temp|
# +------+----+---+---+----+
# |359250|2010| 02| 25|25.2|
# |359250|2010| 05| 25|65.0|
# |386130|2010| 02| 19|35.4|
# |386130|2010| 03| 15|52.2|
# |386130|2010| 01| 21|37.9|
# +------+----+---+---+----+
# only showing top 5 rows
What happens behind the scenes?
Any data frame transformation will be stored until we need the data.
Then, when we trigger an action, (Py)Spark will optimize
the query plan, select the best physical plan, and apply the
transformations to the data.
[Diagram: Transformations vs. Actions]
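The split between transformations and actions can be mimicked with a toy class in plain Python. This is only a sketch of the deferral idea: `LazyNumbers` is a hypothetical name, and unlike Spark it does no query optimization at all.

```python
class LazyNumbers:
    """Hypothetical toy: records transformations, runs them only on an action."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan          # tuple of pending (kind, function) steps

    def where(self, predicate):    # transformation: nothing is computed yet
        return LazyNumbers(self._data, self._plan + (("filter", predicate),))

    def select(self, func):        # transformation: nothing is computed yet
        return LazyNumbers(self._data, self._plan + (("map", func),))

    def collect(self):             # action: run the whole recorded plan
        rows = iter(self._data)
        for kind, func in self._plan:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)                # side effect lets us see when work happens
    return x * 2


pending = LazyNumbers(range(5)).where(lambda x: x > 1).select(double)
work_before_action = len(calls)    # no work has happened: only the plan exists
result = pending.collect()         # the action triggers the computation
```

Until `collect` is called, `double` never runs; the chained calls only grow the stored plan.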
Something a little more complex
import pyspark.sql.functions as F
stations = (spark.read.format("bigquery")
    .option("table", "bigquery-public-data.noaa_gsod.stations")
    .option("credentialsFile", "bq-key.json")
    .load())
# We want to get the "hottest countries" that have more than 60 (12 * 5) measures
answer = (
    gsod.join(stations, gsod["stn"] == stations["usaf"])
    .where(F.col("country").isNotNull())
    .groupby("country")
    .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
    .where(F.col("count") > 12 * 5)
    .orderBy("avg_temp", ascending=False)
)
answer.show()
read, join, where, groupby, avg/count, where, orderby, show
Python or SQL?
gsod.createOrReplaceTempView("gsod")
stations.createOrReplaceTempView("stations")
spark.sql("""
    select country, avg(temp) avg_temp, count(*) count
    from gsod
    inner join stations
    on gsod.stn = stations.usaf
    where country is not null
    group by country
    having count > (12 * 5)
    order by avg_temp desc
""")
We register the data frames as Spark SQL tables.
We then can query using SQL without leaving Python!
Python and SQL!
(
    spark.sql("""
        select country, avg(temp) avg_temp, count(*) count
        from gsod
        inner join stations
        on gsod.stn = stations.usaf
        group by country""")
    .where("country is not null")
    .where("count > (12 * 5)")
    .orderBy("avg_temp", ascending=False)
)
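To try the shape of this query without a cluster, the sketch below runs a similar statement against Python's built-in `sqlite3`, used here purely as a stand-in for Spark SQL. The two tables hold a handful of made-up rows, the count column is renamed `n_obs`, and the threshold is lowered from `12 * 5` to 2 so the tiny sample returns something.

```python
import sqlite3

# In-memory database with made-up rows; sqlite3 stands in for Spark SQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table gsod (stn text, temp real);
    create table stations (usaf text, country text);
    insert into stations values ('A', 'CA'), ('B', 'FR'), ('C', null);
    insert into gsod values
        ('A', 10.0), ('A', 20.0), ('A', 30.0), ('B', 50.0), ('C', 99.0);
""")

rows = conn.execute("""
    select country, avg(temp) as avg_temp, count(*) as n_obs
    from gsod
    inner join stations
    on gsod.stn = stations.usaf
    where country is not null      -- drop stations without a country
    group by country
    having count(*) > 2            -- assumption: 2 instead of 12 * 5
    order by avg_temp desc
""").fetchall()
conn.close()
print(rows)
```

Only the country with more than two observations survives the `having` clause, and the station without a country never reaches the grouping step.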
Python ⇄ Spark
Scalar UDF
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9
f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333
PySpark types are objects in the pyspark.sql.types module.
We promote a regular Python function to a User Defined Function via a decorator.
A simple function on pandas Series
Still unit-testable :-)
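A minimal sketch of what "unit-testable" can look like, assuming only pandas is installed. The function is repeated here without any Spark decorator so the block runs on its own; `test_f_to_c` is a hypothetical test name.

```python
import pandas as pd


def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9


def test_f_to_c():
    # Freezing and boiling points of water are easy reference values.
    result = f_to_c(pd.Series([32, 212]))
    assert result.tolist() == [0.0, 100.0]


test_f_to_c()

# -40 is the point where both scales agree.
celsius = f_to_c(pd.Series([32, 212, -40]))
```

With the decorated version, the same check would go through `f_to_c.func`, as the slide does.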
Scalar UDF
gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)
# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows
A UDF can be used like any PySpark function.
Grouped Map UDF
def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.
    If the temperature is constant for the whole window,
    defaults to 0.5.
    """
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )
A regular, fun, harmless function on (pandas) DataFrames
Grouped Map UDF
scale_temp_schema = (
    "stn string, year string, mo string, "
    "da string, temp double, temp_norm double"
)
gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
    scale_temperature, schema=scale_temp_schema
)
gsod.show(5, False)
# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|
We provide PySpark the schema we expect our function to return.
We just have to partition (using group), and then applyInPandas!
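Since the grouped function only sees a pandas DataFrame, it can be tried on a single hand-made group before handing it to Spark. The sketch below assumes only pandas; the station id and readings are made up.

```python
import pandas as pd


def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale temp within one group; constant groups get 0.5."""
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )


# Made-up readings for one station-month, i.e. one "group".
one_group = pd.DataFrame({
    "stn": ["000001"] * 3,
    "year": ["2010"] * 3,
    "mo": ["07"] * 3,
    "da": ["01", "02", "03"],
    "temp": [10.0, 20.0, 30.0],
})
scaled = scale_temperature(one_group)

# A group whose temperature never changes falls back to 0.5.
flat = scale_temperature(one_group.assign(temp=15.0))
```

The returned columns line up with the schema string passed to `applyInPandas`: the five input columns plus `temp_norm`.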
You are not limited library-wise
from sklearn.linear_model import LinearRegression
@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(
    day: pd.Series,
    temp: pd.Series
) -> float:
    """Returns the slope of the daily temperature
    for a given period of time."""
    return (
        LinearRegression()
        .fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
        .coef_[0]
    )
result = gsod.groupby("stn", "year", "mo").agg(
    rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias(
        "rt_chg_temp"
    )
)
result.show(5, False)
# +------+----+---+---------------------+
# |stn |year|mo |rt_chg_temp |
# +------+----+---+---------------------+
# |010250|2018|12 |-0.01014397905759162 |
# |011120|2018|11 |-0.01704736746691528 |
# |011150|2018|10 |-0.013510329829648423|
# |011510|2018|03 |0.020159116598556657 |
# |011800|2018|06 |0.012645501680677372 |
# +------+----+---+---------------------+
# only showing top 5 rows
Not fan of the
From the
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame(
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
(Py)Spark in the
"Serverless" Spark?
"Serverless" Spark?
Cost-effective for sporadic
Scales easily
Simplified maintenance
Easy to become expensive
Sometimes confusing
pricing model
Uneven documentation
Thank you!
Home of “Less IT Mess”

Spark Summit East NYC Meetup 02-16-2016
Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowling
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015

Intro to PySpark: Python Data Analysis at scale in the Cloud

  • 1. Welcome to “Home of Less IT Mess”. Introduce Yourself ☺ - Why are you here? - Looking for work? - Offering work? Our feature presentation “Intro to PySpark” starts at 6:20pm…
  • 2. Serverless is not just about the Tech: Serverless is New Agile & Mindset. Serverless Dev (gluing other people’s APIs and managed services). We're obsessed with creating business value (meaningful MVPs, products), by helping Startups & empowering Business users! We build bridges between Serverless Community (“Dev leg”), and Front-end & Voice-First folks (“UX leg”), and empower UX developers. Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (by using bigger building blocks and less Ops).
  • 3. Upcoming #ServerlessTO Online Meetups: 1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio **JULY 9 @ 6pm** 2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
  • 4. Feature Talk: Jonathan Rioux, Head Data Scientist at EPAM Systems & author of Manning book PySpark in Action
  • 6. If you have not filled out the Meetup survey, now is the time to do it! (Also copied in the chat)
  • 11. Hi! I'm Jonathan. Data Scientist, Engineer, Enthusiast. Head of DS @ EPAM Canada. Author of PySpark in Action. <3 Spark, <3 <3 Python
  • 19. Goals of this presentation: share my love of (Py)Spark; explain where PySpark shines; introduce the Python + Spark interop; get you excited about using PySpark; 36,000 ft overview: Managed Spark in the Cloud.
  • 23. What I expect from you: you know a little bit of Python; you know what SQL is; you won't hesitate to ask questions :-)
  • 24. What is Spark? Spark is a unified analytics engine for large-scale data processing.
  • 25. What is Spark (bis)? Spark can be thought of as a data factory that you (mostly) program like a cohesive computer.
  • 26. Spark under the hood
  • 27. Spark as an analytics factory
  • 29. Data manipulation uses the same vocabulary as SQL (select, where, group by, count):
        from pyspark.sql.functions import col

        (
            my_table
            .select("id", "first_name", "last_name", "age")
            .where(col("age") > 21)
            .groupby("age")
            .count()  # count() on grouped data takes no arguments
        )
  • 34. I mean, you can legitimately use SQL:
        spark.sql("""
            select count(*)
            from (
                select id, first_name, last_name, age
                from my_table
                where age > 21
            )
            group by age
        """)
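The query on this slide is ordinary SQL, so its select / where / group by / count logic can be checked without a cluster. The following is a minimal sketch, assuming an in-memory SQLite table as a stand-in for `my_table`; it is not `spark.sql`. The sample rows are invented for illustration, and the `age` column and `order by` are added to the output only to make the result readable.

```python
import sqlite3

# Hypothetical sample rows (id, first_name, last_name, age), invented for illustration.
rows = [
    (1, "Ada", "Lovelace", 36),
    (2, "Alan", "Turing", 41),
    (3, "Grace", "Hopper", 36),
    (4, "Linus", "Torvalds", 19),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table my_table (id integer, first_name text, last_name text, age integer)"
)
conn.executemany("insert into my_table values (?, ?, ?, ?)", rows)

# Same shape as the slide's query: filter in the subquery, then group and count.
query = """
    select age, count(*) as count
    from (
        select id, first_name, last_name, age
        from my_table
        where age > 21
    )
    group by age
    order by age
"""
counts_by_age = conn.execute(query).fetchall()
print(counts_by_age)  # [(36, 2), (41, 1)]
```

The row with age 19 is dropped by the `where` clause before grouping, which is the same order of operations as the chained `.where(...).groupby(...).count()` calls.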
  • 35. Data manipulation and machine learning with a uent API results = ("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) 15/49
  • 36. Data manipulation and machine learning with a uent API"./data/Ch02/1342-0.txt") results = ( .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Read a text file 15/49
  • 37. Data manipulation and machine learning with a uent API .select(F.split(F.col("value"), " ").alias("line")) results = ("./data/Ch02/1342-0.txt") .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Select the column value, where each element is splitted (space as a separator). Alias to line. 15/49
  • 38. Data manipulation and machine learning with a uent API .select(F.explode(F.col("line")).alias("word")) results = ("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Explode each element of line into its own record. Alias to word. 15/49
  • 39. Data manipulation and machine learning with a uent API .select(F.lower(F.col("word")).alias("word")) results = ("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Lower-case each word 15/49
  • 40. Data manipulation and machine learning with a uent API .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) results = ("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Extract only the first group of lower-case letters from each word. 15/49
  • 41. Data manipulation and machine learning with a uent API .where(F.col("word") != "") results = ("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .groupby(F.col("word")) .count() ) Keep only the records where the word is not the empty string. 15/49
  • 42. Data manipulation and machine learning with a fluent API
      results = (
          spark.read.text("./data/Ch02/1342-0.txt")
          .select(F.split(F.col("value"), " ").alias("line"))
          .select(F.explode(F.col("line")).alias("word"))
          .select(F.lower(F.col("word")).alias("word"))
          .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
          .where(F.col("word") != "")
          .groupby(F.col("word"))
          .count()
      )
    Group by word.
  • 43. Data manipulation and machine learning with a fluent API
      results = (
          spark.read.text("./data/Ch02/1342-0.txt")
          .select(F.split(F.col("value"), " ").alias("line"))
          .select(F.explode(F.col("line")).alias("word"))
          .select(F.lower(F.col("word")).alias("word"))
          .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
          .where(F.col("word") != "")
          .groupby(F.col("word"))
          .count()
      )
    Count the number of records in each group.
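To make each step of the chain concrete, here is a minimal plain-Python sketch of the same word count, with no Spark involved. The function name `count_words` and the sample lines are made up for illustration; the sketch only mirrors the semantics of the steps (split on a space, explode, lower-case, regex extract, drop empties, group and count) on an in-memory list.

```python
import re
from collections import Counter


def count_words(lines):
    """Plain-Python stand-in for the PySpark word-count chain (hypothetical helper)."""
    # split + explode: one record per space-separated token
    tokens = [token for line in lines for token in line.split(" ")]
    # lower: normalize case
    tokens = [token.lower() for token in tokens]
    # regexp_extract(..., "[a-z']*", 0): keep the first match of the pattern,
    # which is empty when the token starts with any other character
    words = [re.search(r"[a-z']*", token).group(0) for token in tokens]
    # where(word != ""): drop empty results
    words = [word for word in words if word != ""]
    # groupby + count: number of records per word
    return Counter(words)


sample = ["It is a truth universally acknowledged,", "that a single man"]
print(count_words(sample).most_common(1))
```

Note that a token starting with punctuation (for example an opening quote) yields an empty match and is dropped by the filter, which is also how the pattern behaves in the chain on the slides.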
  • 44. Scala is not the only player in town
  • 48. Summoning PySpark
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.config(
          "spark.jars.packages",
          ("" "spark-bigquery-with-dependencies_2.12:0.16.1"),
      ).getOrCreate()
    A SparkSession is your entry point to distributed data manipulation.
  • 49. Summoning PySpark
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.config(
          "spark.jars.packages",
          ("" "spark-bigquery-with-dependencies_2.12:0.16.1"),
      ).getOrCreate()
    We create our SparkSession with an optional library to access BigQuery as a data source.
  • 51. Reading data
      from functools import reduce
      from pyspark.sql import DataFrame

      def read_df_from_bq(year):
          return (
              spark.read.format("bigquery")
              .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
              .option("credentialsFile", "bq-key.json")
              .load()
          )

      gsod = (
          reduce(
              DataFrame.union,
              [read_df_from_bq(year) for year in range(2010, 2020)]
          )
      )
    We create a helper function to read our data from BigQuery.
  • 52. Reading data
      from functools import reduce
      from pyspark.sql import DataFrame

      def read_df_from_bq(year):
          return (
              spark.read.format("bigquery")
              .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
              .option("credentialsFile", "bq-key.json")
              .load()
          )

      gsod = (
          reduce(
              DataFrame.union,
              [read_df_from_bq(year) for year in range(2010, 2020)]
          )
      )
    A DataFrame is a regular Python object, so a list of them can be folded with reduce and DataFrame.union.
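The `reduce` call folds the list pairwise, combining the first two yearly frames, then the result with the third, and so on. A minimal sketch of that folding, using plain lists and a hypothetical `union` helper as a stand-in for `DataFrame.union` (no Spark or BigQuery involved, and the rows are fake):

```python
from functools import reduce


def union(left, right):
    """Stand-in for DataFrame.union: append the rows of right to left (hypothetical helper)."""
    return left + right


def read_rows(year):
    """Pretend reader returning one fake row per year (hypothetical, replaces the BigQuery read)."""
    return [{"year": str(year), "temp": 50.0}]


# Same shape as the slide: one "frame" per year from 2010 to 2019, folded into one.
rows = reduce(union, [read_rows(year) for year in range(2010, 2020)])
print(len(rows), rows[0]["year"], rows[-1]["year"])
```

Because `range(2010, 2020)` excludes its upper bound, the fold covers ten years, 2010 through 2019.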
  • 54. Using the power of the schema
      gsod.printSchema()
      # root
      # |-- stn: string (nullable = true)
      # |-- wban: string (nullable = true)
      # |-- year: string (nullable = true)
      # |-- mo: string (nullable = true)
      # |-- da: string (nullable = true)
      # |-- temp: double (nullable = true)
      # |-- count_temp: long (nullable = true)
      # |-- dewp: double (nullable = true)
      # |-- count_dewp: long (nullable = true)
      # |-- slp: double (nullable = true)
      # |-- count_slp: long (nullable = true)
      # |-- stp: double (nullable = true)
      # |-- count_stp: long (nullable = true)
      # |-- visib: double (nullable = true)
      # [...]
    The schema gives us the column names and their types.
  • 55. And showing data
      gsod = gsod.select("stn", "year", "mo", "da", "temp")
      gsod.show(5)

      # Approximately 5 seconds waiting
      # +------+----+---+---+----+
      # |   stn|year| mo| da|temp|
      # +------+----+---+---+----+
      # |359250|2010| 02| 25|25.2|
      # |359250|2010| 05| 25|65.0|
      # |386130|2010| 02| 19|35.4|
      # |386130|2010| 03| 15|52.2|
      # |386130|2010| 01| 21|37.9|
      # +------+----+---+---+----+
      # only showing top 5 rows
  • 56. What happens behind the scenes?
  • 57. Any data frame transformation is recorded, not executed, until we need the data. When we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data.
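The split between transformations and actions can be illustrated with a toy model. The `LazyFrame` class below is hypothetical and is not how Spark is implemented: it skips the plan-optimization step entirely and only shows that transformations are recorded and nothing runs until an action is called.

```python
class LazyFrame:
    """Toy model of lazy evaluation (hypothetical class, not Spark's implementation)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet applied

    def where(self, predicate):
        # Transformation: record the step and return a new object.
        return LazyFrame(self._rows, self._plan + (("where", predicate),))

    def select(self, func):
        # Transformation: also only recorded.
        return LazyFrame(self._rows, self._plan + (("select", func),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        rows = self._rows
        for kind, func in self._plan:
            if kind == "where":
                rows = [row for row in rows if func(row)]
            else:
                rows = [func(row) for row in rows]
        return rows


calls = []


def noisy_double(x):
    """Records every call so we can see when the work actually happens."""
    calls.append(x)
    return x * 2


frame = LazyFrame([1, 2, 3, 4]).where(lambda x: x % 2 == 0).select(noisy_double)
print(calls)            # still empty: building the chain ran nothing
print(frame.collect())  # the recorded plan runs here
```

In real PySpark, the recorded plan is additionally rewritten by the optimizer before execution, which this sketch does not attempt to model.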
  • 67. Something a little more complex
      import pyspark.sql.functions as F

      stations = (
          spark.read.format("bigquery")
          .option("table", "bigquery-public-data.noaa_gsod.stations")
          .option("credentialsFile", "bq-key.json")
          .load()
      )

      # We want to get the "hottest countries" that have more than 60 measures
      answer = (
          gsod.join(stations, gsod["stn"] == stations["usaf"])
          .where(F.col("country").isNotNull())
          .groupBy("country")
          .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
      ).where(F.col("count") > 12 * 5)
    read, join, where, groupby, avg/count, where, orderby, show
  • 71. Python or SQL?
      gsod.createTempView("gsod")
      stations.createTempView("stations")

      spark.sql("""
          select country, avg(temp) avg_temp, count(*) count
          from gsod
          inner join stations
          on gsod.stn = stations.usaf
          where country is not null
          group by country
          having count > (12 * 5)
          order by avg_temp desc
      """).show(5)
    We register the data frames as Spark SQL temporary views.
  • 72. Python or SQL?
      gsod.createTempView("gsod")
      stations.createTempView("stations")

      spark.sql("""
          select country, avg(temp) avg_temp, count(*) count
          from gsod
          inner join stations
          on gsod.stn = stations.usaf
          where country is not null
          group by country
          having count > (12 * 5)
          order by avg_temp desc
      """).show(5)
    We can then query using SQL without leaving Python!
  • 73. Python and SQL!
      (
          spark.sql(
              """
              select country, avg(temp) avg_temp, count(*) count
              from gsod
              inner join stations
              on gsod.stn = stations.usaf
              group by country"""
          )
          .where("country is not null")
          .where("count > (12 * 5)")
          .orderBy("avg_temp", ascending=False)
          .show(5)
      )
  • 78. Scalar UDF
      import pandas as pd
      import pyspark.sql.types as T

      @F.pandas_udf(T.DoubleType())
      def f_to_c(degrees: pd.Series) -> pd.Series:
          """Transforms Fahrenheit to Celsius."""
          return (degrees - 32) * 5 / 9

      f_to_c.func(pd.Series(range(32, 213)))
      # 0       0.000000
      # 1       0.555556
      # 2       1.111111
      # 3       1.666667
      # 4       2.222222
      # ...
      # 176    97.777778
      # 177    98.333333
    PySpark types are objects in the pyspark.sql.types module.
  • 79. Scalar UDF
      import pandas as pd
      import pyspark.sql.types as T

      @F.pandas_udf(T.DoubleType())
      def f_to_c(degrees: pd.Series) -> pd.Series:
          """Transforms Fahrenheit to Celsius."""
          return (degrees - 32) * 5 / 9

      f_to_c.func(pd.Series(range(32, 213)))
      # 0       0.000000
      # 1       0.555556
      # 2       1.111111
      # 3       1.666667
      # 4       2.222222
      # ...
      # 176    97.777778
      # 177    98.333333
    We promote a regular Python function to a User Defined Function via a decorator.
  • 80. Scalar UDF
      import pandas as pd
      import pyspark.sql.types as T

      @F.pandas_udf(T.DoubleType())
      def f_to_c(degrees: pd.Series) -> pd.Series:
          """Transforms Fahrenheit to Celsius."""
          return (degrees - 32) * 5 / 9

      f_to_c.func(pd.Series(range(32, 213)))
      # 0       0.000000
      # 1       0.555556
      # 2       1.111111
      # 3       1.666667
      # 4       2.222222
      # ...
      # 176    97.777778
      # 177    98.333333
    A simple function on pandas Series
  • 81. Scalar UDF
      import pandas as pd
      import pyspark.sql.types as T

      @F.pandas_udf(T.DoubleType())
      def f_to_c(degrees: pd.Series) -> pd.Series:
          """Transforms Fahrenheit to Celsius."""
          return (degrees - 32) * 5 / 9

      f_to_c.func(pd.Series(range(32, 213)))
      # 0       0.000000
      # 1       0.555556
      # 2       1.111111
      # 3       1.666667
      # 4       2.222222
      # ...
      # 176    97.777778
      # 177    98.333333
    Still unit-testable :-)
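Since the conversion logic is an ordinary function, it can be checked without a cluster. Below is a minimal sketch of such a check, written against a plain-Python copy of the formula. The name `f_to_c_plain` is made up here, and single floats stand in for the pandas Series so that the snippet has no dependencies; a real test would more likely call `f_to_c.func` on a Series, as the slide does.

```python
import math


def f_to_c_plain(degrees):
    """Same arithmetic as f_to_c, on a single float (hypothetical helper)."""
    return (degrees - 32) * 5 / 9


def test_known_points():
    # Freezing and boiling points of water convert exactly.
    assert f_to_c_plain(32) == 0
    assert f_to_c_plain(212) == 100
    # Other values carry floating-point error, so compare with a tolerance.
    assert math.isclose(f_to_c_plain(71.6), 22.0, rel_tol=1e-12)


test_known_points()
```

The tolerance matters in practice: results such as 71.6 °F come out a hair away from the round Celsius value, so an exact equality check would be brittle.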
  • 83. Scalar UDF
      gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
      gsod.select("temp", "temp_c").distinct().show(5)

      # +-----+-------------------+
      # | temp|             temp_c|
      # +-----+-------------------+
      # | 37.2| 2.8888888888888906|
      # | 85.9| 29.944444444444443|
      # | 53.5| 11.944444444444445|
      # | 71.6| 21.999999999999996|
      # |-27.6|-33.111111111111114|
      # +-----+-------------------+
      # only showing top 5 rows
    A UDF can be used like any PySpark function.
  • 85. Grouped Map UDF
      def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
          """Returns a simple normalization of the temperature for a site.

          If the temperature is constant for the whole window, defaults to 0.5.
          """
          temp = temp_by_day.temp
          answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
          if temp.min() == temp.max():
              return answer.assign(temp_norm=0.5)
          return answer.assign(
              temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
          )
    A regular, fun, harmless function on (pandas) DataFrames
  • 88. Grouped Map UDF
      scale_temp_schema = (
          "stn string, year string, mo string, "
          "da string, temp double, temp_norm double"
      )

      gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
          scale_temperature, schema=scale_temp_schema
      )
      gsod.show(5, False)

      # +------+----+---+---+----+------------------+
      # |stn   |year|mo |da |temp|temp_norm         |
      # +------+----+---+---+----+------------------+
      # |008268|2010|07 |22 |87.4|0.0               |
      # |008268|2010|07 |21 |89.6|1.0               |
      # |008401|2011|11 |01 |68.2|0.7960000000000003|
    We provide PySpark with the schema we expect our function to return.
  • 89. Grouped Map UDF
      scale_temp_schema = (
          "stn string, year string, mo string, "
          "da string, temp double, temp_norm double"
      )

      gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
          scale_temperature, schema=scale_temp_schema
      )
      gsod.show(5, False)

      # +------+----+---+---+----+------------------+
      # |stn   |year|mo |da |temp|temp_norm         |
      # +------+----+---+---+----+------------------+
      # |008268|2010|07 |22 |87.4|0.0               |
      # |008268|2010|07 |21 |89.6|1.0               |
      # |008401|2011|11 |01 |68.2|0.7960000000000003|
    We just have to partition (using groupby), and then applyInPandas!
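The grouped map pattern is split, apply, combine: rows are partitioned by the grouping keys, the function runs once per group, and the results are stitched back together. A minimal plain-Python sketch of that pattern follows, with dictionaries standing in for DataFrame rows. The helper names `scale` and `apply_by_group` are made up for illustration, and the sample rows borrow values from the slide output; the third station has a single row here, so it falls into the constant-group case.

```python
from collections import defaultdict


def scale(temps):
    """Min-max scale a list of temperatures; constant groups default to 0.5."""
    low, high = min(temps), max(temps)
    if low == high:
        return [0.5 for _ in temps]
    return [(t - low) / (high - low) for t in temps]


def apply_by_group(rows, key_fields):
    """Split rows by key, scale each group's temp, and recombine (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[field] for field in key_fields)].append(row)
    result = []
    for group in groups.values():
        scaled = scale([row["temp"] for row in group])
        result.extend({**row, "temp_norm": norm} for row, norm in zip(group, scaled))
    return result


rows = [
    {"stn": "008268", "year": "2010", "mo": "07", "da": "22", "temp": 87.4},
    {"stn": "008268", "year": "2010", "mo": "07", "da": "21", "temp": 89.6},
    {"stn": "008401", "year": "2011", "mo": "11", "da": "01", "temp": 68.2},
]
for row in apply_by_group(rows, ["stn", "year", "mo"]):
    print(row["stn"], row["da"], row["temp_norm"])
```

The guard for constant groups avoids a division by zero when the minimum and maximum coincide, which is the same reason the function on the slides returns 0.5 in that case.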
  • 92. You are not limited library-wise
      from sklearn.linear_model import LinearRegression

      @F.pandas_udf(T.DoubleType())
      def rate_of_change_temperature(
          day: pd.Series, temp: pd.Series
      ) -> float:
          """Returns the slope of the daily temperature for a given period of time."""
          return (
              LinearRegression()
              .fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
              .coef_[0]
          )
  • 93.
      result = gsod.groupby("stn", "year", "mo").agg(
          rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias(
              "rt_chg_temp"
          )
      )
      result.show(5, False)

      # +------+----+---+---------------------+
      # |stn   |year|mo |rt_chg_temp          |
      # +------+----+---+---------------------+
      # |010250|2018|12 |-0.01014397905759162 |
      # |011120|2018|11 |-0.01704736746691528 |
      # |011150|2018|10 |-0.013510329829648423|
      # |011510|2018|03 |0.020159116598556657 |
      # |011800|2018|06 |0.012645501680677372 |
      # +------+----+---+---------------------+
      # only showing top 5 rows
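With a single feature and an intercept, the coefficient returned by a least-squares linear regression is the covariance of day and temperature divided by the variance of day. A minimal dependency-free sketch of that computation follows; the `slope` helper and the sample values are made up for illustration and are not taken from the data set.

```python
def slope(days, temps):
    """Ordinary least-squares slope of temps against days (hypothetical helper)."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(temps) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, temps))
    variance = sum((x - mean_x) ** 2 for x in days)
    return covariance / variance


# "da" is a string column in the schema, hence the cast to int,
# mirroring day.astype("int") in the UDF.
days = [int(d) for d in ["01", "02", "03", "04"]]
temps = [0.1, 0.3, 0.5, 0.7]
print(slope(days, temps))
```

This sketch raises `ZeroDivisionError` for a group where every row has the same day (for example a single measurement), so a production version would need to decide what to return in that case.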
  • 95. Not a fan of the syntax?
  • 97. From the
      import databricks.koalas as ks
      import pandas as pd

      pdf = pd.DataFrame(
          {
              'x': range(3),
              'y': ['a', 'b', 'b'],
              'z': ['a', 'b', 'b'],
          }
      )

      # Create a Koalas DataFrame from pandas DataFrame
      df = ks.from_pandas(pdf)

      # Rename the columns
      df.columns = ['x', 'y', 'z1']
  • 106. "Serverless" Spark?
      Cost-effective for sporadic runs
      Scales easily
      Simplified maintenance
      Easy to become expensive
      Sometimes confusing pricing model
      Uneven documentation