Welcome to ServerlessToronto.org
“Home of Less IT Mess”
1
Introduce Yourself ☺
- Why are you here?
- Looking for work?
- Offering work?
Our feature presentation “Intro to PySpark” starts at 6:20pm…
Serverless is not just about the Tech:
2
Serverless is New Agile & Mindset
Serverless Dev (gluing other people's APIs and managed services)
We're obsessed with creating business value (meaningful MVPs, products) by helping Startups & empowering Business users!
We build bridges between the Serverless Community ("Dev leg") and Front-end & Voice-First folks ("UX leg"), and empower UX developers.
Achieve agility NOT by "sprinting" faster (like in Scrum), but by working smarter (using bigger building blocks and less Ops).
Upcoming #ServerlessTO Online Meetups
3
1. Accelerating with a Cloud Contact Center – Patrick Kolencherry, Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio ** JULY 9 @ 6pm **
2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE? **
Feature Talk
Jonathan Rioux, Head Data Scientist at EPAM Systems & author of the Manning book PySpark in Action
4
Getting acquainted
with PySpark
1/49
If you have not filled out the Meetup survey, now is the time to do it!
(Also copied in the chat)
https://forms.gle/6cyWGVY4L4GJvsXh7
2/49
Hi! I'm Jonathan
Data Scientist, Engineer, Enthusiast
Head of DS @ EPAM Canada
Author of PySpark in Action →
<3 Spark, <3 <3 Python
3/49
4/49
5/49
Goals of this presentation
Share my love of (Py)Spark
Explain where PySpark shines
Introduce the Python + Spark interop
Get you excited about using PySpark
36,000 ft overview: Managed Spark in the Cloud
6/49
What I expect from you
You know a little bit of Python
You know what SQL is
You won't hesitate to ask questions :-)
7/49
What is Spark
Spark is a unified analytics engine for large-scale
data processing
8/49
What is Spark (bis)
Spark can be thought of as a data factory that you (mostly) program like a cohesive computer.
9/49
Spark under the hood
10/49
Spark as an analytics factory
11/49
Why is PySpark cool?
12/49
Data manipulation uses the same vocabulary as SQL

(
    my_table
    .select("id", "first_name", "last_name", "age")  # select
    .where(col("age") > 21)                          # where
    .groupby("age")                                  # group by
    .count()                                         # count(*)
)
13/49
I mean, you can legitimately use SQL

spark.sql("""
    select count(*) from (
        select id, first_name, last_name, age
        from my_table
        where age > 21
    )
    group by age""")
14/49
Data manipulation and machine learning with a fluent API

results = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)

Step by step: read a text file; select the column value, split on spaces, and alias it to line; explode each element of line into its own record, aliased to word; lower-case each word; extract only the first group of lower-case letters from each word; keep only the records where the word is not the empty string; group by word; count the number of records in each group.
15/49
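As a quick follow-up (my addition, not on the original slide), you could sort the result and peek at the most frequent words:

results.orderBy("count", ascending=False).show(10)  # top 10 words by frequency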
         
Scala is not the only player in
town
16/49
Let's code!
17/49
18/49
Summoning PySpark

from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    "spark.jars.packages",
    ("com.google.cloud.spark:"
     "spark-bigquery-with-dependencies_2.12:0.16.1")
).getOrCreate()

A SparkSession is your entry point to distributed data manipulation.
We create our SparkSession with an optional library to access BigQuery as a data source.
19/49
Reading data

from functools import reduce
from pyspark.sql import DataFrame


def read_df_from_bq(year):
    return (
        spark.read.format("bigquery")
        .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}")
        .option("credentialsFile", "bq-key.json")
        .load()
    )


gsod = (
    reduce(
        DataFrame.union,
        [read_df_from_bq(year) for year in range(2010, 2020)]
    )
)

We create a helper function to read our data from BigQuery.
A DataFrame is a regular Python object.
20/49
Using the power of the schema

gsod.printSchema()
# root
# |-- stn: string (nullable = true)
# |-- wban: string (nullable = true)
# |-- year: string (nullable = true)
# |-- mo: string (nullable = true)
# |-- da: string (nullable = true)
# |-- temp: double (nullable = true)
# |-- count_temp: long (nullable = true)
# |-- dewp: double (nullable = true)
# |-- count_dewp: long (nullable = true)
# |-- slp: double (nullable = true)
# |-- count_slp: long (nullable = true)
# |-- stp: double (nullable = true)
# |-- count_stp: long (nullable = true)
# |-- visib: double (nullable = true)
# [...]

The schema will give us the column names and their types.
21/49
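As a side note (my addition, not from the deck), the same schema is available programmatically, which is handy when you need to build column lists:

gsod.schema.fieldNames()   # ['stn', 'wban', 'year', 'mo', 'da', 'temp', ...]
dict(gsod.dtypes)["temp"]  # 'double'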
And showing data
gsod = gsod.select("stn", "year", "mo", "da", "temp")
 
gsod.show(5)
 
# Approximately 5 seconds waiting
# +------+----+---+---+----+
# | stn|year| mo| da|temp|
# +------+----+---+---+----+
# |359250|2010| 02| 25|25.2|
# |359250|2010| 05| 25|65.0|
# |386130|2010| 02| 19|35.4|
# |386130|2010| 03| 15|52.2|
# |386130|2010| 01| 21|37.9|
# +------+----+---+---+----+
# only showing top 5 rows
22/49
What happens behind the scenes?
23/49
Any data frame transformation is stored until we actually need the data.
Then, when we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data (see the sketch after the list of transformations and actions below).
24/49
Transformations
select
filter
group by
partition
Actions
write
show
count
toPandas
25/49
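As a rough sketch of this laziness (my example, not from the slides): transformations only build a plan, and nothing touches the data until an action runs.

# Transformations: nothing is read or computed yet.
lazy = gsod.select("stn", "temp").where("temp > 32.0")

lazy.explain()  # inspect the plan Spark has built so far
lazy.count()    # an action: Spark now actually runs the job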
Something a little more complex
import pyspark.sql.functions as F

stations = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.noaa_gsod.stations")
    .option("credentialsFile", "bq-key.json")
    .load()
)

# We want to get the "hottest Countries" that have at least 60 measures
answer = (
    gsod.join(stations, gsod["stn"] == stations["usaf"])
    .where(F.col("country").isNotNull())
    .groupBy("country")
    .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count"))
).where(F.col("count") > 12 * 5)
read, join, where, groupby, avg/count, where, orderby, show
26/49
read, join, where, groupby, avg/count, where, orderby, show
27/49
read, join, where, groupby, avg/count, where, orderby, show
28/49
Python or SQL?

gsod.createTempView("gsod")
stations.createTempView("stations")

spark.sql("""
    select country, avg(temp) avg_temp, count(*) count
    from gsod
    inner join stations
        on gsod.stn = stations.usaf
    where country is not null
    group by country
    having count > (12 * 5)
    order by avg_temp desc
""").show(5)

We register the data frames as Spark SQL tables.
We can then query using SQL without leaving Python!
29/49
Python and SQL!

(
    spark.sql(
        """
        select country, avg(temp) avg_temp, count(*) count
        from gsod
        inner join stations
            on gsod.stn = stations.usaf
        group by country"""
    )
    .where("country is not null")
    .where("count > (12 * 5)")
    .orderBy("avg_temp", ascending=False)
    .show(5)
)
30/49
Python ⇄ Spark
31/49
32/49
33/49
Scalar UDF

import pandas as pd
import pyspark.sql.types as T


@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Transforms Fahrenheit to Celsius."""
    return (degrees - 32) * 5 / 9


f_to_c.func(pd.Series(range(32, 213)))
# 0 0.000000
# 1 0.555556
# 2 1.111111
# 3 1.666667
# 4 2.222222
# ...
# 176 97.777778
# 177 98.333333

PySpark types are objects in the pyspark.sql.types module.
We promote a regular Python function to a User Defined Function via a decorator.
A simple function on pandas Series.
Still unit-testable :-)
34/49
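Because pandas_udf keeps the underlying function available on .func (as used above), a plain unit test is possible. A hypothetical sketch, not from the slides:

import pandas as pd
from pandas.testing import assert_series_equal

def test_f_to_c():
    # 32°F -> 0°C, 212°F -> 100°C
    got = f_to_c.func(pd.Series([32.0, 212.0]))
    assert_series_equal(got, pd.Series([0.0, 100.0]))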
Scalar UDF

gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp")))
gsod.select("temp", "temp_c").distinct().show(5)

# +-----+-------------------+
# | temp| temp_c|
# +-----+-------------------+
# | 37.2| 2.8888888888888906|
# | 85.9| 29.944444444444443|
# | 53.5| 11.944444444444445|
# | 71.6| 21.999999999999996|
# |-27.6|-33.111111111111114|
# +-----+-------------------+
# only showing top 5 rows

A UDF can be used like any PySpark function.
35/49
Grouped Map UDF

def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame:
    """Returns a simple normalization of the temperature for a site.

    If the temperature is constant for the whole window,
    defaults to 0.5.
    """
    temp = temp_by_day.temp
    answer = temp_by_day[["stn", "year", "mo", "da", "temp"]]
    if temp.min() == temp.max():
        return answer.assign(temp_norm=0.5)
    return answer.assign(
        temp_norm=(temp - temp.min()) / (temp.max() - temp.min())
    )

A regular, fun, harmless function on (pandas) DataFrames.
36/49
Grouped Map UDF

scale_temp_schema = (
    "stn string, year string, mo string, "
    "da string, temp double, temp_norm double"
)

gsod = gsod.groupby("stn", "year", "mo").applyInPandas(
    scale_temperature, schema=scale_temp_schema
)

gsod.show(5, False)

# +------+----+---+---+----+------------------+
# |stn |year|mo |da |temp|temp_norm |
# +------+----+---+---+----+------------------+
# |008268|2010|07 |22 |87.4|0.0 |
# |008268|2010|07 |21 |89.6|1.0 |
# |008401|2011|11 |01 |68.2|0.7960000000000003|

We provide PySpark the schema we expect our function to return.
We just have to partition (using group), and then applyInPandas!
37/49
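scale_temperature itself is just a pandas function, so (my illustration, not from the slides) you can exercise it locally before handing it to applyInPandas:

sample = pd.DataFrame(
    {"stn": ["x"] * 3, "year": ["2010"] * 3, "mo": ["07"] * 3,
     "da": ["01", "02", "03"], "temp": [50.0, 60.0, 70.0]}
)
scale_temperature(sample)["temp_norm"].tolist()  # [0.0, 0.5, 1.0]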
38/49
You are not limited library-wise
from sklearn.linear_model import LinearRegression


@F.pandas_udf(T.DoubleType())
def rate_of_change_temperature(
    day: pd.Series,
    temp: pd.Series
) -> float:
    """Returns the slope of the daily temperature
    for a given period of time."""
    return (
        LinearRegression()
        .fit(X=day.astype("int").values.reshape(-1, 1), y=temp)
        .coef_[0]
    )
39/49
result = gsod.groupby("stn", "year", "mo").agg(
    rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias(
        "rt_chg_temp"
    )
)

result.show(5, False)
# +------+----+---+---------------------+
# |stn |year|mo |rt_chg_temp |
# +------+----+---+---------------------+
# |010250|2018|12 |-0.01014397905759162 |
# |011120|2018|11 |-0.01704736746691528 |
# |011150|2018|10 |-0.013510329829648423|
# |011510|2018|03 |0.020159116598556657 |
# |011800|2018|06 |0.012645501680677372 |
# +------+----+---+---------------------+
# only showing top 5 rows
40/49
41/49
Not a fan of the syntax?
42/49
43/49
From the README.md
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame(
    {
        'x': range(3),
        'y': ['a', 'b', 'b'],
        'z': ['a', 'b', 'b'],
    }
)

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']
44/49
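From there (my example, not taken from the README), the object keeps behaving like a pandas DataFrame while Spark does the work underneath:

df['x2'] = df.x * df.x  # pandas-style column assignment, executed by Spark
print(df.head())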
(Py)Spark in the
cloud
45/49
46/49
"Serverless" Spark?
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
Sometimes confusing
pricing model
47/49
"Serverless" Spark?
Cost-effective for sporadic
runs
Scales easily
Simplified maintenance
Easy to become expensive
Sometimes confusing
pricing model
Uneven documentation
47/49
Thank you!
48/49
Join www.ServerlessToronto.org
Home of “Less IT Mess”
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 

Intro to PySpark: Python Data Analysis at scale in the Cloud

  • 1. Welcome to ServerlessToronto.org “Home of Less IT Mess” 1 Introduce Yourself ☺ - Why are you here? - Looking for work? - Offering work? Our feature presentation “Intro to PySpark” starts at 6:20pm…
  • 2. Serverless is not just about the Tech: 2 Serverless is New Agile & Mindset Serverless Dev (gluing other people’s APIs and managed services) We're obsessed to creating business value (meaningful MVPs, products), by helping Startups & empowering Business users! We build bridges between Serverless Community (“Dev leg”), and Front-end & Voice- First folks (“UX leg”), and empower UX developers Achieve agility NOT by “sprinting” faster (like in Scrum), but by working smarter (by using bigger building blocks and less Ops)
  • 3. Upcoming #ServerlessTO Online Meetups 3 1. Accelerating with a Cloud Contact Center – Patrick Kolencherry Sr. Product Marketing Manager, and Karla Nussbaumer, Head of Technical Marketing at Twilio **JULY 9 @ 6pm ** 2. Your Presentation ☺ ** WHY NOT SHARE THE KNOWLEDGE?
  • 4. Feature Talk Jonathan Rioux, Head Data Scientist at EPAM Systems & author of Manning book PySpark in Action 4
  • 6. If you have not filled the Meetup survey, now is the time to do it! (Also copied in the chat) https://forms.gle/6cyWGVY4L4GJvsXh7 2/49
  • 8. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast 3/49
  • 9. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada 3/49
  • 10. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → 3/49
  • 11. Hi! I'm Jonathan Data Scientist, Engineer, Enthusiast Head of DS @ EPAM Canada Author of PySpark in Action → <3 Spark, <3 <3 Python 3/49
  • 12. 4/49
  • 13. 5/49
  • 14. Goals of this presentation 6/49
  • 15. Goals of this presentation Share my love of (Py)Spark 6/49
  • 16. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines 6/49
  • 17. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop 6/49
  • 18. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 6/49
  • 19. Goals of this presentation Share my love of (Py)Spark Explain where PySpark shines Introduce the Python + Spark interop Get you excited about using PySpark 36,000 ft overview: Managed Spark in the Cloud 6/49
  • 20. What I expect from you 7/49
  • 21. What I expect from you You know a little bit of Python 7/49
  • 22. What I expect from you You know a little bit of Python You know what SQL is 7/49
  • 23. What I expect from you You know a little bit of Python You know what SQL is You won't hesitate to ask questions :-) 7/49
  • 24. What is Spark Spark is a unified analytics engine for large-scale data processing 8/49
  • 25. What is Spark (bis) Spark can be thought of as a data factory that you (mostly) program like a cohesive computer. 9/49
  • 26. Spark under the hood 10/49
  • 27. Spark as an analytics factory 11/49
  • 29. Data manipulation uses the same vocabulary as SQL ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") .count() ) 13/49
  • 30. Data manipulation uses the same vocabulary as SQL .select("id", "first_name", "last_name", "age") ( my_table .where(col("age") > 21) .groupby("age") .count() ) select 13/49
  • 31. Data manipulation uses the same vocabulary as SQL .where(col("age") > 21) ( my_table .select("id", "first_name", "last_name", "age") .groupby("age") .count() ) where 13/49
  • 32. Data manipulation uses the same vocabulary as SQL .groupby("age") ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .count() ) group by 13/49
  • 33. Data manipulation uses the same vocabulary as SQL .count() ( my_table .select("id", "first_name", "last_name", "age") .where(col("age") > 21) .groupby("age") ) count 13/49
  • 34. I mean, you can legitimately use SQL spark.sql(""" select count(*) from ( select id, first_name, last_name, age from my_table where age > 21 ) group by age""") 14/49
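A minimal, self-contained sketch (not from the slides) showing the two styles side by side on a hypothetical toy table; the rows and values are made up for illustration, and note that GroupedData.count() takes no argument.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for my_table
my_table = spark.createDataFrame(
    [(1, "Ada", "Lovelace", 36), (2, "Alan", "Turing", 41), (3, "Kid", "Coder", 12)],
    ["id", "first_name", "last_name", "age"],
)

# DataFrame API: filter, group, count
my_table.where(F.col("age") > 21).groupby("age").count().show()

# The same query through Spark SQL
my_table.createOrReplaceTempView("my_table")
spark.sql("select age, count(*) from my_table where age > 21 group by age").show()
```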
  • 35. Data manipulation and machine learning with a fluent API results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) 15/49
  • 36. Data manipulation and machine learning with a fluent API spark.read.text("./data/Ch02/1342-0.txt") results = ( .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Read a text file 15/49
  • 37. Data manipulation and machine learning with a fluent API .select(F.split(F.col("value"), " ").alias("line")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Select the column value, where each element is split (space as a separator). Alias to line. 15/49
  • 38. Data manipulation and machine learning with a fluent API .select(F.explode(F.col("line")).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Explode each element of line into its own record. Alias to word. 15/49
  • 39. Data manipulation and machine learning with a fluent API .select(F.lower(F.col("word")).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Lower-case each word 15/49
  • 40. Data manipulation and machine learning with a fluent API .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) .count() ) Extract only the first group of lower-case letters from each word. 15/49
  • 41. Data manipulation and machine learning with a fluent API .where(F.col("word") != "") results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .groupby(F.col("word")) .count() ) Keep only the records where the word is not the empty string. 15/49
  • 42. Data manipulation and machine learning with a fluent API .groupby(F.col("word")) results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .count() ) Group by word 15/49
  • 43. Data manipulation and machine learning with a fluent API .count() results = ( spark.read.text("./data/Ch02/1342-0.txt") .select(F.split(F.col("value"), " ").alias("line")) .select(F.explode(F.col("line")).alias("word")) .select(F.lower(F.col("word")).alias("word")) .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word")) .where(F.col("word") != "") .groupby(F.col("word")) ) Count the number of records in each group 15/49
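The chain above, collected into one runnable block with the imports the slides assume (a SparkSession and pyspark.sql.functions as F); the path is the book's sample text file, so point it at any plain-text file you have on hand.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

results = (
    spark.read.text("./data/Ch02/1342-0.txt")                 # any plain-text file works
    .select(F.split(F.col("value"), " ").alias("line"))        # each line -> array of words
    .select(F.explode(F.col("line")).alias("word"))            # one word per record
    .select(F.lower(F.col("word")).alias("word"))              # normalize case
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))  # keep letters/apostrophes
    .where(F.col("word") != "")                                # drop empty strings
    .groupby(F.col("word"))
    .count()
)

results.orderBy(F.desc("count")).show(10)  # ten most frequent words
```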
  • 44. Scala is not the only player in town 16/49
  • 46. 18/49
  • 47. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() 19/49
  • 48. Summoning PySpark from pyspark.sql import SparkSession   spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() A SparkSession is your entry point to distributed data manipulation 19/49
  • 49. Summoning PySpark spark = SparkSession.builder.config( "spark.jars.packages", ("com.google.cloud.spark:" "spark-bigquery-with-dependencies_2.12:0.16.1") ).getOrCreate() from pyspark.sql import SparkSession   We create our SparkSession with an optional library to access BigQuery as a data source. 19/49
  • 50. Reading data from functools import reduce from pyspark.sql import DataFrame     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] 20/49
  • 51. Reading data def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() ) from functools import reduce from pyspark.sql import DataFrame     gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] We create a helper function to read our data from BigQuery. 20/49
  • 52. Reading data gsod = ( reduce( DataFrame.union, [read_df_from_bq(year) for year in range(2010, 2020)] ) )     def read_df_from_bq(year): return ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.gsod{year}") .option("credentialsFile", "bq-key.json") .load() )     A DataFrame is a regular Python object. 20/49
  • 53. Using the power of the schema gsod.printSchema() # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] 21/49
  • 54. Using the power of the schema # root # |-- stn: string (nullable = true) # |-- wban: string (nullable = true) # |-- year: string (nullable = true) # |-- mo: string (nullable = true) # |-- da: string (nullable = true) # |-- temp: double (nullable = true) # |-- count_temp: long (nullable = true) # |-- dewp: double (nullable = true) # |-- count_dewp: long (nullable = true) # |-- slp: double (nullable = true) # |-- count_slp: long (nullable = true) # |-- stp: double (nullable = true) # |-- count_stp: long (nullable = true) # |-- visib: double (nullable = true) # [...] gsod.printSchema() The schema will give us the column names and their types. 21/49
  • 55. And showing data gsod = gsod.select("stn", "year", "mo", "da", "temp")   gsod.show(5)   # Approximately 5 seconds waiting # +------+----+---+---+----+ # | stn|year| mo| da|temp| # +------+----+---+---+----+ # |359250|2010| 02| 25|25.2| # |359250|2010| 05| 25|65.0| # |386130|2010| 02| 19|35.4| # |386130|2010| 03| 15|52.2| # |386130|2010| 01| 21|37.9| # +------+----+---+---+----+ # only showing top 5 rows 22/49
  • 56. What happens behind the scenes? 23/49
  • 57. Any data frame transformation is stored until we need the data. Then, when we trigger an action, (Py)Spark optimizes the query plan, selects the best physical plan, and applies the transformations to the data. 24/49
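A tiny illustration of that laziness, assuming the gsod frame from the previous slides and pyspark.sql.functions imported as F: transformations only build a plan, and nothing runs until an action such as count() or show().

```python
# Transformations: nothing executes yet, Spark only records the plan
filtered = gsod.select("stn", "temp").where(F.col("temp") > 60.0)

filtered.explain()  # prints the optimized logical and physical plans
filtered.count()    # an action: the plan is now executed on the data
```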
  • 67. Something a little more complex import pyspark.sql.functions as F   stations = ( spark.read.format("bigquery") .option("table", f"bigquery-public-data.noaa_gsod.stations") .option("credentialsFile", "bq-key.json") .load() )   # We want to get the "hottest Countries" that have at least 60 measures answer = ( gsod.join(stations, gsod["stn"] == stations["usaf"]) .where(F.col("country").isNotNull()) .groupBy("country") .agg(F.avg("temp").alias("avg_temp"), F.count("*").alias("count")) ).where(F.col("count") > 12 * 5) read, join, where, groupby, avg/count, where, orderby, show 26/49
  • 68. read, join, where, groupby, avg/count, where, orderby, show 27/49
  • 69. read, join, where, groupby, avg/count, where, orderby, show 28/49
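The slide's chain stops at the second where; the last two steps in the list above (orderBy and show) would look like this, assuming the answer frame defined on the previous slide.

```python
# Hottest countries first, limited to the top five rows
answer.orderBy("avg_temp", ascending=False).show(5)
```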
  • 70. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) 29/49
  • 71. Python or SQL? gsod.createTempView("gsod") stations.createTempView("stations")   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) We register the data frames as Spark SQL tables. 29/49
  • 72. Python or SQL?   spark.sql(""" select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf where country is not null group by country having count > (12 * 5) order by avg_temp desc """).show(5) gsod.createTempView("gsod") stations.createTempView("stations") We then can query using SQL without leaving Python! 29/49
  • 73. Python and SQL! ( spark.sql( """ select country, avg(temp) avg_temp, count(*) count from gsod inner join stations on gsod.stn = stations.usaf group by country""" ) .where("country is not null") .where("count > (12 * 5)") .orderBy("avg_temp", ascending=False) .show(5) ) 30/49
  • 75. 32/49
  • 76. 33/49
  • 77. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 34/49
  • 78. Scalar UDF import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 PySpark types are objects in the pyspark.sql.types module. 34/49
  • 79. Scalar UDF @F.pandas_udf(T.DoubleType()) import pandas as pd import pyspark.sql.types as T   def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 We promote a regular Python function to a User Defined Function via a decorator. 34/49
  • 80. Scalar UDF def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9 import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType())   f_to_c.func(pd.Series(range(32, 213))) # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 A simple function on pandas Series 34/49
  • 81. Scalar UDF f_to_c.func(pd.Series(range(32, 213))) import pandas as pd import pyspark.sql.types as T   @F.pandas_udf(T.DoubleType()) def f_to_c(degrees: pd.Series) -> pd.Series: """Transforms Fahrenheit to Celsius.""" return (degrees - 32) * 5 / 9   # 0 0.000000 # 1 0.555556 # 2 1.111111 # 3 1.666667 # 4 2.222222 # ... # 176 97.777778 # 177 98.333333 Still unit-testable :-) 34/49
  • 82. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows 35/49
  • 83. Scalar UDF gsod = gsod.withColumn("temp_c", f_to_c(F.col("temp"))) gsod.select("temp", "temp_c").distinct().show(5)   # +-----+-------------------+ # | temp| temp_c| # +-----+-------------------+ # | 37.2| 2.8888888888888906| # | 85.9| 29.944444444444443| # | 53.5| 11.944444444444445| # | 71.6| 21.999999999999996| # |-27.6|-33.111111111111114| # +-----+-------------------+ # only showing top 5 rows A UDF can be used like any PySpark function. 35/49
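Because f_to_c.func exposes the undecorated pandas function (as the slides show), it stays easy to unit test; here is a hypothetical pytest-style check whose name and values are my own.

```python
import pandas as pd
import pandas.testing as pdt

def test_f_to_c():
    degrees = pd.Series([32.0, 212.0])
    expected = pd.Series([0.0, 100.0])  # freezing and boiling points in Celsius
    pdt.assert_series_equal(f_to_c.func(degrees), expected)
```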
  • 84. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
  • 85. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) A regular, fun, harmless function on (pandas) DataFrames 36/49
  • 86. Grouped Map UDF def scale_temperature(temp_by_day: pd.DataFrame) -> pd.DataFrame: """Returns a simple normalization of the temperature for a site.   If the temperature is constant for the whole window, defaults to 0.5. """ temp = temp_by_day.temp answer = temp_by_day[["stn", "year", "mo", "da", "temp"]] if temp.min() == temp.max(): return answer.assign(temp_norm=0.5) return answer.assign( temp_norm=(temp - temp.min()) / (temp.max() - temp.min()) ) 36/49
  • 87. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
  • 88. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We provide PySpark the schema we expect our function to return 37/49
  • 89. Grouped Map UDF gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema ) scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )     gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| We just have to partition (using group), and then applyInPandas! 37/49
  • 90. Grouped Map UDF scale_temp_schema = ( "stn string, year string, mo string, " "da string, temp double, temp_norm double" )   gsod = gsod.groupby("stn", "year", "mo").applyInPandas( scale_temperature, schema=scale_temp_schema )   gsod.show(5, False)   # +------+----+---+---+----+------------------+ # |stn |year|mo |da |temp|temp_norm | # +------+----+---+---+----+------------------+ # |008268|2010|07 |22 |87.4|0.0 | # |008268|2010|07 |21 |89.6|1.0 | # |008401|2011|11 |01 |68.2|0.7960000000000003| 37/49
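scale_temperature is a plain pandas function, so you can sanity-check it locally before handing it to applyInPandas; the toy frame below reuses the two station-008268 measurements shown in the output above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "stn": ["008268", "008268"],
        "year": ["2010", "2010"],
        "mo": ["07", "07"],
        "da": ["21", "22"],
        "temp": [89.6, 87.4],
    }
)

print(scale_temperature(toy))  # temp_norm comes out as 1.0 and 0.0
```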
  • 91. 38/49
  • 92. You are not limited library-wise from sklearn.linear_model import LinearRegression     @F.pandas_udf(T.DoubleType()) def rate_of_change_temperature( day: pd.Series, temp: pd.Series ) -> float: """Returns the slope of the daily temperature for a given period of time.""" return ( LinearRegression() .fit(X=day.astype("int").values.reshape(-1, 1), y=temp) .coef_[0] ) 39/49
  • 93. result = gsod.groupby("stn", "year", "mo").agg( rate_of_change_temperature(gsod["da"], gsod["temp_norm"]).alias( "rt_chg_temp" ) )   result.show(5, False) # +------+----+---+---------------------+ # |stn |year|mo |rt_chg_temp | # +------+----+---+---------------------+ # |010250|2018|12 |-0.01014397905759162 | # |011120|2018|11 |-0.01704736746691528 | # |011150|2018|10 |-0.013510329829648423| # |011510|2018|03 |0.020159116598556657 | # |011800|2018|06 |0.012645501680677372 | # +------+----+---+---------------------+ # only showing top 5 rows 40/49
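The same unit-testing trick works here: rate_of_change_temperature.func is the plain Python function, so a quick local check with made-up data confirms the slope (the values below are hypothetical).

```python
import pandas as pd

day = pd.Series(["1", "2", "3", "4", "5"])
temp_norm = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5])  # perfectly linear warm-up

print(rate_of_change_temperature.func(day, temp_norm))  # slope of roughly 0.1
```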
  • 94. 41/49
  • 95. Not a fan of the syntax? 42/49
  • 96. 43/49
  • 97. From the README.md import databricks.koalas as ks import pandas as pd   pdf = pd.DataFrame( { 'x':range(3), 'y':['a','b','b'], 'z':['a','b','b'], } )   # Create a Koalas DataFrame from pandas DataFrame df = ks.from_pandas(pdf)   # Rename the columns df.columns = ['x', 'y', 'z1'] 44/49
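Since this talk, Koalas has been folded into PySpark itself as the pandas API on Spark (Spark 3.2+); assuming a recent PySpark, the equivalent of the README snippet looks like this.

```python
import pyspark.pandas as ps

# Build a pandas-on-Spark DataFrame directly
psdf = ps.DataFrame({"x": list(range(3)), "y": ["a", "b", "b"], "z": ["a", "b", "b"]})

# Rename the columns, pandas-style
psdf.columns = ["x", "y", "z1"]
print(psdf.head())
```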
  • 99. 46/49
  • 102. "Serverless" Spark? Cost-effective for sporadic runs Scales easily 47/49
  • 103. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance 47/49
  • 104. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive 47/49
  • 105. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model 47/49
  • 106. "Serverless" Spark? Cost-effective for sporadic runs Scales easily Simplified maintenance Easy to become expensive Sometimes confusing pricing model Uneven documentation 47/49