Apache Spark v3 is a new milestone for the Big Data framework. In this session, you will (re)discover what Spark is, learn about the new features in its third major version, and go through a complete end-to-end project.
I like to call Spark an Analytics Operating System: it offers far more than just a framework or a library, and I will explain why. Spark v3 is its latest major evolution; released in mid-June 2020, it adds impressive new features. After looking at them from a high level, I will detail a few of my favorites.
Finally, as we all like code (well, at least I do), I will demonstrate a complete data & AI pipeline looking at Covid-19 data.
Key takeaways: Spark as an Analytics OS, Spark v3 highlights, building data/AI pipelines/models with Spark.
Audience: software engineers, data engineers, architects, data scientists.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
A Developer’s View into Spark's Memory Model with Wenchen Fan (Databricks)
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection.
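The on-heap/off-heap distinction the abstract refers to can be illustrated in plain Java with NIO buffers: an on-heap buffer is a byte[]-backed object the garbage collector tracks, while a direct (off-heap) buffer lives in native memory the GC never scans. A generic JVM sketch (class name and sizes are mine, not Spark's internal memory manager):

```java
import java.nio.ByteBuffer;

public class OffHeapSketch {
  // Writes a long at offset 0 and reads it back.
  static long roundTrip(ByteBuffer buf, long value) {
    buf.putLong(0, value);
    return buf.getLong(0);
  }

  public static void main(String[] args) {
    // On-heap: backed by a byte[] that the garbage collector manages and may move.
    ByteBuffer onHeap = ByteBuffer.allocate(64);
    // Off-heap: native memory outside the GC heap; no per-object header,
    // and the GC never scans this data.
    ByteBuffer offHeap = ByteBuffer.allocateDirect(64);

    System.out.println(roundTrip(onHeap, 42L));   // 42
    System.out.println(roundTrip(offHeap, 42L));  // 42
    System.out.println(offHeap.isDirect());       // true
  }
}
```

Tungsten uses similar native allocation (via Unsafe) to lay out records compactly in binary form, which is how it sidesteps JVM object-model and GC overheads.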
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal... (thelabdude)
My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high availability using sharding and replication with Zookeeper, 2) Operational concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and production-ready.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Cross Datacenter Replication, aka CDCR, has been a long-requested feature in Apache Solr. In this talk, we will discuss CDCR as released in Apache Solr 6.0 and beyond to understand its use cases, limitations, setup, and performance. We will also take a quick look at future enhancements that can further simplify and scale this feature.
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch (MapR Technologies)
In this talk, we will provide an overview of Elasticsearch for Apache Hadoop (ES-Hadoop), which includes integrations between the various Hadoop libraries, whether batch (Map/Reduce, Pig, Hive) or stream oriented (such as Apache Spark). We will also cover the YARN support and the HDFS snapshot/restore plugin available as part of ES-Hadoop. We will talk about the upcoming ES-Hadoop 2.1 GA release and near-term roadmap.
Owning time series with team apache - Strata San Jose 2015 (Patrick McFadin)
Break out your laptops: this hands-on tutorial is geared toward understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and why it can be a perfect fit for time series, then add in Apache Spark as the perfect analytics companion. There will be coding as part of the hands-on tutorial; the goal is to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process, and store high-speed time series data.
DataSource V2 and Cassandra – A Whole New World (Databricks)
Data Source V2 has arrived for the Spark Cassandra Connector, but what does this mean for you? Speed, flexibility, and usability improvements abound, and we'll walk you through some of the biggest highlights and how you can take advantage of them today.
An overview of building and serving Lucene indexes on a Hadoop cluster with Solr for text and parametric searching, as presented at Cleveland Hadoop User Group on 13 January 2014.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
Don’t Forget About Your Past - Optimizing Apache Druid Performance With Neil Buesing | Current 2022 (HostedbyConfluent)
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
We’ll start with how to run Apache Druid locally in a containerized development environment. While real-time events stream from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We’ll then explore the performance implications when the real-time stream of events contains historical data, and the techniques that prevent those issues, leaving a high-performance analytics platform that supports both real-time and historical processing.
You’ll leave with the tools to do real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
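Rollup is simply pre-aggregation at ingestion time: raw events are grouped by a truncated timestamp plus their dimensions, and the metrics are summed, so queries scan far fewer rows. A minimal plain-Java sketch of the idea (class, field names, and the hour granularity are mine; Druid performs this natively at ingestion):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RollupSketch {
  static final long HOUR_MS = 3_600_000L;

  // Truncate each event's timestamp to the hour, key by (hour, page),
  // and sum the metric: ingestion-time rollup in miniature.
  static Map<String, Long> rollup(long[] tsMillis, String[] page, long[] views) {
    Map<String, Long> rolled = new LinkedHashMap<>();
    for (int i = 0; i < tsMillis.length; i++) {
      String key = (tsMillis[i] / HOUR_MS) * HOUR_MS + "|" + page[i];
      rolled.merge(key, views[i], Long::sum);
    }
    return rolled;
  }

  public static void main(String[] args) {
    long[] ts = {0L, 10_000L, HOUR_MS + 5L};      // two events in hour 0, one in hour 1
    String[] page = {"home", "home", "home"};
    long[] views = {3L, 4L, 5L};
    // Three raw rows roll up to two aggregated rows.
    System.out.println(rollup(ts, page, views));  // {0|home=7, 3600000|home=5}
  }
}
```

Fewer stored rows per segment is exactly why better rollup reduces both storage cost and query latency.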
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project (K. Mohamed Faizal)
Infrastructure as a Service (IaaS) offers features that can be leveraged to host a SharePoint 2013 farm. Learn how to set it up, things to consider when you configure VPN, storage, and cloud services, and how to set up load-balanced endpoints. The speaker will share his real-world experience and tips and tricks.
* Use cases of MySQL, as well as edge cases of MySQL topologies, using real-life examples and "war" stories
* How scalability and proxy wars make MySQL topologies more robust for webscale shops
* Open-source tools, utilities, and the surrounding MySQL ecosystem
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of columnar storage and Dremel.
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications.
Drilling Cyber Security Data With Apache Drill (Charles Givre)
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
(DAT402) Amazon RDS PostgreSQL: Lessons Learned & New Features (Amazon Web Services)
Learn the specifics of Amazon RDS for PostgreSQL’s capabilities and the extensions that make it powerful. This session begins with a brief overview of the RDS PostgreSQL service and how it provides high availability and durability, then dives deep into the new features we have released since re:Invent 2014, including major version upgrades and newly added PostgreSQL extensions. During the session, we will also discuss lessons learned running a large fleet of PostgreSQL instances, including specific recommendations. In addition, we will present benchmarking results looking at differences between the 9.3, 9.4, and 9.5 releases.
SQL on Hadoop benchmarks using TPC-DS query set (Kognitio)
Sharon Kirkham, VP Analytics & Consulting at Kognitio, ran the TPC-DS query set using Impala, SparkSQL, and Kognitio to test speed, reliability, and concurrency across different SQL-on-Hadoop solutions. Standard Hive was originally investigated as part of this benchmark, but limited SQL support and poor single-thread performance meant it was removed.
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu (NETWAYS)
Monica is the co-creator of Elastic Beats. Before inventing Beats, she worked as a core developer at IPTEGO, a Berlin start-up offering a complete monitoring and troubleshooting solution for VoIP networks. The product was sold worldwide and is currently used by major telecommunications companies.
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu (NETWAYS)
Beats are a friendly army of lightweight agents that, installed on your servers, capture operational data and ship it to Elasticsearch for analysis.
They collect your servers' log data and gather statistics on CPU, disk, and memory usage. Through regular polling they collect metrics from external systems such as MySQL, Docker, and Zookeeper, and they can visualize communication between servers by sniffing the corresponding network connections.
This talk explains how you can combine Beats with Elasticsearch and Kibana into a complete open source monitoring solution, and how they help you monitor and troubleshoot your sprawling infrastructure.
Streaming ETL - from RDBMS to Dashboard with KSQL (Bjoern Rost)
Apache Kafka is a massively scalable message queue that is being used at more and more places connecting more and more data sources. This presentation will introduce Kafka from the perspective of a mere mortal DBA and share the experience of (and challenges with) getting events from the database to Kafka using Kafka connect including poor-man’s CDC using flashback query and traditional logical replication tools. To demonstrate how and why this is a good idea, we will build an end-to-end data processing pipeline. We will discuss how to turn changes in database state into events and stream them into Apache Kafka. We will explore the basic concepts of streaming transformations using windows and KSQL before ingesting the transformed stream in a dashboard application.
An introduction to data engineering & data science using Apache Spark and Java.
Get Spark in Action 2e, at http://jgp.ai/sia.
In this presentation, I start by loading a few CSV files into Spark (ingestion) and displaying them with the help of this new tool I built, dṛṣṭi.
As you would expect, I clean the data, join it, transform it, and continue to visualize it through dṛṣṭi.
I use Delta Lake to create a cache for my data, explain what imputation is, and show how I can use imputation on my datasets to fill in the missing datapoints.
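Imputation replaces missing datapoints with values inferred from the observed ones. As a minimal illustration of the idea, a plain-Java mean-imputation sketch (class and method names are mine; the talk itself does this on Spark dataframes, not raw arrays):

```java
import java.util.Arrays;

public class ImputationSketch {
  // Replaces NaN entries with the mean of the observed values.
  static double[] imputeMean(double[] col) {
    double sum = 0;
    int n = 0;
    for (double v : col) {
      if (!Double.isNaN(v)) { sum += v; n++; }
    }
    double mean = n == 0 ? 0 : sum / n;
    double[] out = col.clone();
    for (int i = 0; i < out.length; i++) {
      if (Double.isNaN(out[i])) out[i] = mean;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] cases = {10, Double.NaN, 30};                   // one missing datapoint
    System.out.println(Arrays.toString(imputeMean(cases)));  // [10.0, 20.0, 30.0]
  }
}
```

Other strategies (median, interpolation between neighbors) follow the same shape: compute a fill value from what you have, then patch the gaps.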
I then use Spark to run simple linear regressions to predict/forecast data.
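A simple linear regression fits y = a + b·x by least squares and extrapolates the line to forecast future values. A self-contained sketch of just the math (names and data are mine; the pipeline in the talk uses Spark ML rather than hand-rolled code):

```java
public class TrendSketch {
  // Ordinary least squares for y = a + b*x; returns {intercept, slope}.
  static double[] fit(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i]; sy += y[i];
      sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] {intercept, slope};
  }

  public static void main(String[] args) {
    // Daily case counts for days 0..3; forecast day 5 by extending the line.
    double[] day = {0, 1, 2, 3};
    double[] cases = {100, 110, 120, 130};
    double[] model = fit(day, cases);
    double forecast = model[0] + model[1] * 5;
    System.out.println(forecast);  // 150.0
  }
}
```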
dṛṣṭi is open source (Apache 2 license) and is available at: https://github.com/jgperrin/ai.jgp.drsti.
All the labs are available at https://github.com/jgperrin/ai.jgp.drsti-spark.
"Big Data made easy with a Spark" is the presentation I gave for Open Source 101 in Columbia, SC, on April 18th, 2019. It is a hands-on tutorial on Apache Spark with Java, walking through 3 different labs.
"Big Data made easy with a Spark" is the presentation I gave for ATO (AllThingsOpen) 2018.
In this hands-on session, you will learn how to do a full Big Data scenario from ingestion to publication. You will see how we can use Java and Apache Spark to ingest data, perform some transformations, save the data. You will then perform a second lab where you will run your very first Machine Learning algorithm!
Those slides were used for NC Tech's lunch and learn on Aug. 22 2018.
In this lunch and learn, hosted by Veracity Solutions, you will learn how Spark can help your business build a pragmatic technology roadmap to AI (Artificial Intelligence), Machine Learning, and Big Data analytics. Apache Spark is a wonderful platform for distributed data processing and analytics, but how is it used by different organizations? How difficult is it to onboard a team? What technology do they need to master beforehand? Do they have to master Scala, or can they simply use their Java skills? You will find answers to those questions, get a realistic perspective on the platform, and see code (because we are all a bit geeky, right?).
Full link to the event: https://www.nctech.org/events/event/2018/lunch-and-learn-august22.html.
Spark Summit Europe Wrap Up and TASM State of the Community (Jean-Georges Perrin)
On 12/12, we held our Spark meetup, called Winter 3x30, at IBM. These are the slides I used both to introduce the state of our community, TASM (Triangle Apache Spark Meetup), and to wrap up Spark Summit Europe.
I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we are going through the basics of Spark as well as 2 examples: a basic ingestion and an analytics example based on joins & group by. Follow me @jgperrin.
As I went to Spark Summit in San Francisco, early June, I wanted to share key takeaways from the conference with my local friends of the Triangle Apache Spark Meetup.
Used for teaching HTML to middle school children (6th, 7th, and 8th graders) in a "game way" with some immediate gratification. Feedback much appreciated: jgp@jgp.net.
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe... (Jean-Georges Perrin)
On July 9th, 2015, 2CRSI announced its latest storage system, the 2U24NVMe, which features 24 NVMe SSD drives that are individually 10 to 12 times faster than SATA/SAS SSDs. Jean Georges Perrin, 2CRSI Corporation's COO, introduces you to this wonderful solution... and more. This presentation was first given on July 13th, 2015 at the ISC HPC conference in Frankfurt, Germany.
A strategic vision for using (Open)Data in the enterprise (Jean-Georges Perrin)
A vision for an OpenData usage strategy, with a definition, the ecosystem, the obstacles, and possible solutions to remove those obstacles.
A proposal to create a consortium of private and public actors.
Presented by Jean Georges Perrin, GreenIvory (http://greenivory.fr/), as part of a Rhenatic workshop (http://www.rhenatic.eu/).
Presentation done for the AdriaUG on May 23rd 2012 in Zagreb, Croatia.
This is an updated version of the presentation done in 2010 at the IIUG conference in Overland Park, KS, USA.
Version of the presentation used for the DCF (Dirigeants Commerciaux de France) on January 9th, 2012, near Colmar, Alsace.
Adapted from the presentation given at the CCI Alsace in Strasbourg in October 2011.
A talk given at the CCI de Strasbourg on October 11th, 2011, illustrating how to better use your website to sell more.
The examples are projects built with GreenIvory's technologies.
Discover GreenIvory:
http://greenivory.fr/
Discover our success stories:
http://greenivory.fr/success-stories.html
Discovering new web trends (Mulhouse Edition) (Jean-Georges Perrin)
Talk by Jean-Georges Perrin (GreenIvory) at the CCI SAM (Sud Alsace - Mulhouse), organized by Martine Zussy.
Topics covered: the social web, search engine optimization (SEO), SMO...
MashupXFeed and editorial strategy - Workshop Activis - GreenIvory (Jean-Georges Perrin)
Presentation by Jean-Georges Perrin (CEO of GreenIvory) on putting an editorial strategy in place, with other examples of MashupXFeed usage. Details on content farms.
MashupXFeed and SEO - Workshop Activis - GreenIvory
Presentation by Xavier-Noël Cullmann (technical sales, Activis) on the benefits of MashupXFeed when used for SEO. Focus on duplicate content.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance; the final ranks of chain nodes are easy to calculate, so this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
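The "skip converged vertices" idea above can be sketched in a few lines: once a vertex's rank stops changing beyond a tolerance, it is frozen and excluded from further updates. A plain-Java power-iteration sketch (simplified: no dangling-node, in-identical, chain, or SCC handling, all of which STICD adds on top):

```java
public class PageRankSketch {
  // Pull-based power iteration: each vertex recomputes its rank from its
  // in-neighbors; vertices whose rank has already converged are skipped.
  static double[] pageRank(int[][] inLinks, int[] outDegree,
                           double damping, double tol, int maxIter) {
    int n = inLinks.length;
    double[] rank = new double[n];
    java.util.Arrays.fill(rank, 1.0 / n);
    boolean[] converged = new boolean[n];
    for (int iter = 0; iter < maxIter; iter++) {
      double[] next = rank.clone();
      boolean allDone = true;
      for (int v = 0; v < n; v++) {
        if (converged[v]) continue;                // frozen: no work this iteration
        double sum = 0;
        for (int u : inLinks[v]) sum += rank[u] / outDegree[u];
        next[v] = (1 - damping) / n + damping * sum;
        if (Math.abs(next[v] - rank[v]) < tol) converged[v] = true;
        else allDone = false;
      }
      rank = next;
      if (allDone) break;
    }
    return rank;
  }

  public static void main(String[] args) {
    // Tiny 3-vertex cycle 0 -> 1 -> 2 -> 0: by symmetry each rank is 1/3.
    int[][] in = {{2}, {0}, {1}};
    int[] outDeg = {1, 1, 1};
    double[] r = pageRank(in, outDeg, 0.85, 1e-10, 100);
    System.out.println(Math.abs(r[0] - 1.0 / 3) < 1e-6);  // true
  }
}
```

The sketch shows only the convergence-skipping ingredient; the other techniques (chain short-circuiting, topological SCC scheduling) restructure the graph before or around this loop.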
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables the ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
11. Data Engineer vs. Data Scientist
Data Engineer:
• Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
Data Scientist:
• Clean, massage, and organize data.
• Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Sources: adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
14. Python rules in notebooks
Sources:
• Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
• Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
15. A few more figures
Who does not like performance figures?
• Databricks:
• Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2)
• >90% of all Spark API calls are Spark SQL, regardless of the language used
• Community:
• Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB benchmark
Sources:
• Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
• Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
• Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
23. Always a soup
SQL
• Finally a reference guide: http://jgp.ai/sparksql
• EXPLAIN can be FORMATTED
• Proleptic Gregorian calendar, based on Java 8
• Overflow check
• ANSI compatibility through a configuration flag
24. Ingestion
Who needs a push down?
• Already available in databases
• Allows you to filter what you ingest, before you ingest it
• Equivalent to, but easier than, ingesting and then filtering
25.
String sqlQuery =
    "select actor.first_name, actor.last_name, film.title, "
        + "film.description "
        + "from actor, film_actor, film "
        + "where actor.actor_id = film_actor.actor_id "
        + "and film_actor.film_id = film.film_id";
Dataset<Row> df = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    "(" + sqlQuery + ") actor_film_alias",
    props);
Will only ingest the result of the MySQL query.
/jgperrin/net.jgp.books.spark.ch08 - Chapter 8, Lab #310
26. +---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language?
An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
…
.load("data/books.csv")
.filter("authorId = 1”);
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
27. Migration tips
Yes, they are needed
• Compilation will detect some issues (new exceptions in Structured Streaming)
• Runtime will throw you off:
• Parsing dates
• Data sources (v2 on the way)
• Reference: https://spark.apache.org/docs/latest/migration-guide.html
28.
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .getOrCreate();
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");

or:

SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate();

/jgperrin/net.jgp.books.spark.ch03 - Chapter 3, Labs #320 and #321
30. The lakehouse is a full ecosystem
Or is it an operating system?
[Diagram: data sources (files, streams, systems, other databases) feed the processing & storage layer (Delta Lake & Delta Engine), which serves business, data science, and data engineering outcomes.]
31. Takeaways
• Apache Spark v3 is a major update: 3,400+ patches
• Foundation for a rich data ecosystem
• Python is increasingly popular and beats Scala
• Cornerstone of the lakehouse concept
34. Credits
• World of Watson by Jean-Georges Perrin, CC BY-SA 4.0
• Digital Garage by Jean-Georges Perrin, CC BY-SA 4.0
• Figs, grapes and rosehips by Marco Verch, Professional Photographer and Speaker, Flickr
• Soup by Valeria Boltneva, from Pexels