Jean-Georges Perrin • @jgperrin
Apache Spark v3.0.0
What’s new? A very personal view.
Jean-Georges Perrin
Software since 1983 $>0 1995
Big Data since 1984 $>0 2006
AI since 1994 $>0 2010
x12
@jgperrin • http://jgp.ai
Sources:

Wikipedia: https://en.wikipedia.org/wiki/Big_data

IBM: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
• volume 

• variety 

• velocity 

• variability

• value
3
V4
5
Biiiiiiiig Data
Data is considered big when it needs more than one computer to be processed
An analytics operating system?
[Diagram: layered stack, bottom to top: Hardware and OS (one per machine), Distributed OS, Analytics OS, Apps]
Data engineer
Develop, build, test, and operationalize datastores and large-scale processing systems.
DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.
Data scientist
Clean, massage, and organize data.
Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Sources:

Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA
Engineer
DATA
Scientist
SQL
IBM Watson Studio
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
Python rules in
Notebooks
A few more figures
Who does not like performance figures?
• Databricks:

• Processes >5T records/day with Structured Streaming (introduced in Spark
v2.0, stable in Spark v2.2)

• >90% of all Spark API calls go through Spark SQL, regardless of the language used

• Community:

• Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB
benchmark
5,000,000,000,001
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
What’s new in v3?
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
3400+ Jira tickets
Highlights in a nutshell
• Python

• Python v3 only (so long Python v2)

• Better error handling

• Koalas offers better pandas support (close to 80% of the pandas API)

• SQL

• Better ANSI SQL compliance

• Core

• Adaptive query execution and dynamic partition pruning

• Java v11 support, Scala v2.12 only

• Hydrogen - hardware & accelerator aware scheduler
GPU support for model training
Optimizing the optimizer
Got to love Catalyst
• v1.x: rule

• v2.x: rule & cost (thanks to IBM)

• v3.x: rule & cost & runtime (thanks to Databricks & Intel)
• Dynamically coalescing shuffle partitions

• Dynamically switching join strategies

• Dynamically optimizing skew joins
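The first of these three runtime optimizations can be pictured with a toy model. The sketch below is plain Java, not Spark code: it assumes a simple greedy rule that merges adjacent small partitions until a target size would be exceeded. The class name, method name, and sizes are made up for illustration, and Spark's real logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of merging small adjacent shuffle partitions; not Spark's code. */
public class CoalesceSketch {

  /**
   * Greedily packs adjacent partitions into groups whose total size stays
   * at or below targetBytes. A partition larger than the target ends up
   * alone in its own group.
   */
  public static List<Long> coalesce(long[] partitionBytes, long targetBytes) {
    List<Long> groups = new ArrayList<>();
    long current = 0;
    boolean open = false;
    for (long size : partitionBytes) {
      // Close the current group if adding this partition would overshoot.
      if (open && current + size > targetBytes) {
        groups.add(current);
        current = 0;
      }
      current += size;
      open = true;
    }
    if (open) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    long[] sizes = {10, 20, 30, 100, 5, 5}; // hypothetical partition sizes
    System.out.println(coalesce(sizes, 64)); // prints [60, 100, 10]
  }
}
```

Six partitions become three: fewer, larger tasks, decided from sizes that are only known once the shuffle has actually run.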
Sources:

Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
Adaptive query execution (AQE)
Yet another 3-letter acronym
SparkSession spark = SparkSession.builder()
    .appName("Join using AQE")
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", useAqe)
    .getOrCreate();
…
Dataset<Row> institPerCountyDf = higherEdDf.join(
    countyZipDf,
    higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
    "inner");
institPerCountyDf = institPerCountyDf.join(
    censusDf,
    institPerCountyDf.col("county").equalTo(censusDf.col("countyId")),
    "left");
For the entire session
/jgperrin/net.jgp.books.spark.ch12
Chapter 12
Lab #302
In a free companion book
• sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133)

• any, every, some (SPARK-19851)

• bit_and, bit_or (SPARK-27879)

• bit_count (SPARK-29491)

• bit_xor (SPARK-29545)

• bool_and, bool_or (SPARK-30184)

• count_if (SPARK-27425)

• date_part (SPARK-28690)

• extract (SPARK-23903)

• forall (SPARK-27905)

• from_csv (SPARK-25393)

• make_date (SPARK-28432)

• make_interval (SPARK-29393)

• make_timestamp (SPARK-28459)

• map_entries (SPARK-23935)

• map_filter (SPARK-23937)

• map_zip_with (SPARK-23938)

• max_by, min_by (SPARK-27653)

• schema_of_csv (SPARK-25672)

• to_csv (SPARK-25638)

• transform_keys (SPARK-23939)

• transform_values (SPARK-23940)

• typeof (SPARK-29961)

• version (SPARK-29554)

• xxhash64 (SPARK-27099)
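For readers who have not met some of these aggregates, here is a rough picture of what a few of them compute, written as plain Java streams rather than Spark SQL. It is only a sketch of the intent of count_if, every (bool_and), max_by, and min_by. The Book class and its values are hypothetical, and SQL details such as NULL handling and empty inputs behave differently in Spark.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java picture of a few aggregate functions; not Spark code. */
public class NewFunctionsSketch {

  /** Hypothetical row type used only for this illustration. */
  public static final class Book {
    public final String title;
    public final int authorId;
    public final int pages;

    public Book(String title, int authorId, int pages) {
      this.title = title;
      this.authorId = authorId;
      this.pages = pages;
    }
  }

  /** Like count_if(condition): number of rows matching the condition. */
  public static long countIf(List<Book> rows, Predicate<Book> condition) {
    return rows.stream().filter(condition).count();
  }

  /** Like every(condition) or bool_and(condition): true if all rows match. */
  public static boolean every(List<Book> rows, Predicate<Book> condition) {
    return rows.stream().allMatch(condition);
  }

  /** Like max_by(title, pages): the title of the row with the most pages. */
  public static String titleMaxByPages(List<Book> rows) {
    return rows.stream()
        .max(Comparator.comparingInt((Book b) -> b.pages))
        .map(b -> b.title)
        .orElse(null);
  }

  /** Like min_by(title, pages): the title of the row with the fewest pages. */
  public static String titleMinByPages(List<Book> rows) {
    return rows.stream()
        .min(Comparator.comparingInt((Book b) -> b.pages))
        .map(b -> b.title)
        .orElse(null);
  }

  public static void main(String[] args) {
    List<Book> rows = Arrays.asList(
        new Book("Short", 1, 100),
        new Book("Long", 1, 300),
        new Book("Medium", 2, 200));
    System.out.println(countIf(rows, b -> b.authorId == 1)); // prints 2
    System.out.println(every(rows, b -> b.pages > 50));      // prints true
    System.out.println(titleMaxByPages(rows));               // prints Long
    System.out.println(titleMinByPages(rows));               // prints Short
  }
}
```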
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
New static functions
http://jgp.ai/sia
Always a soup
• Finally a reference guide

• http://jgp.ai/sparksql

• EXPLAIN can be FORMATTED

• Proleptic Gregorian calendar,
based on Java 8

• Overflow check

• ANSI compatibility through
configuration flag
SQL
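The "overflow check" bullet is easier to appreciate with the two behaviors side by side. The sketch below is plain Java, not Spark: it contrasts silent wrap-around with an addition that fails loudly, which is the kind of difference the ANSI flag (`spark.sql.ansi.enabled` in Spark v3.0) is meant to control for integer arithmetic. The class and method names are made up for illustration.

```java
/** Wrapping versus checked integer addition; an analogy, not Spark code. */
public class OverflowSketch {

  /** Plain int addition: silently wraps around on overflow. */
  public static int wrappingAdd(int a, int b) {
    return a + b;
  }

  /** Checked addition: throws ArithmeticException on overflow. */
  public static int checkedAdd(int a, int b) {
    return Math.addExact(a, b);
  }

  public static void main(String[] args) {
    // Without a check, the result is wrong but no error is raised.
    System.out.println(wrappingAdd(Integer.MAX_VALUE, 1)); // prints -2147483648

    // With a check, the problem surfaces immediately.
    try {
      checkedAdd(Integer.MAX_VALUE, 1);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected"); // prints overflow detected
    }
  }
}
```

Turning the check on trades a silently wrong number for an error at runtime, which is usually what you want in a data pipeline.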
Ingestion
Who needs a push down?
• Already available in databases

• Lets you filter what you ingest before you ingest it

• Equivalent to, but easier than, ingesting first and filtering afterward
String sqlQuery =
    "select actor.first_name, actor.last_name, film.title, "
        + "film.description "
        + "from actor, film_actor, film "
        + "where actor.actor_id = film_actor.actor_id "
        + "and film_actor.film_id = film.film_id";
Dataset<Row> df = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    "(" + sqlQuery + ") actor_film_alias",
    props);
Will only ingest the result of the MySQL query
/jgperrin/net.jgp.books.spark.ch08
Chapter 8
Lab #310
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
    …
    .load("data/books.csv")
    .filter("authorId = 1");
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
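Both ingestion examples rest on the same idea: the filter runs before rows are stored, not after. The sketch below is a toy model in plain Java, not Spark code, that compares the two strategies on a few made-up CSV lines (id, authorId, title). The class names and sample lines are hypothetical. It only shows that the result is identical while the number of rows held in memory differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy comparison of "ingest then filter" and "filter while ingesting". */
public class PushdownSketch {

  /** Rows kept after filtering, plus how many rows had to be held in memory. */
  public static final class Result {
    public final List<String[]> rows;
    public final int rowsHeld;

    Result(List<String[]> rows, int rowsHeld) {
      this.rows = rows;
      this.rowsHeld = rowsHeld;
    }
  }

  /** Loads every line first, then filters the loaded rows. */
  public static Result ingestThenFilter(List<String> lines, Predicate<String[]> keep) {
    List<String[]> all = new ArrayList<>();
    for (String line : lines) {
      all.add(line.split(","));
    }
    List<String[]> kept = new ArrayList<>();
    for (String[] row : all) {
      if (keep.test(row)) {
        kept.add(row);
      }
    }
    return new Result(kept, all.size());
  }

  /** Applies the filter while reading, so rejected rows are never stored. */
  public static Result filterWhileIngesting(List<String> lines, Predicate<String[]> keep) {
    List<String[]> kept = new ArrayList<>();
    for (String line : lines) {
      String[] row = line.split(",");
      if (keep.test(row)) {
        kept.add(row);
      }
    }
    return new Result(kept, kept.size());
  }

  public static void main(String[] args) {
    // Made-up sample lines: id, authorId, title.
    List<String> lines = Arrays.asList("1,1,Book A", "5,2,Book B", "7,3,Book C");
    Predicate<String[]> authorIsOne = row -> row[1].equals("1");

    System.out.println(ingestThenFilter(lines, authorIsOne).rowsHeld);     // prints 3
    System.out.println(filterWhileIngesting(lines, authorIsOne).rowsHeld); // prints 1
  }
}
```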
Migration tips
Yes, they are needed
• Compilation will catch some issues (for example, a new exception in Structured Streaming)

• Runtime will throw you off:

• Parsing dates

• Data sources (v2 on the way)

• Reference

• https://spark.apache.org/docs/latest/migration-guide.html
org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can
set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before
Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .getOrCreate();
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");
or:
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate();
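Why would '2015-10-6' suddenly fail? Spark v3.0 moved its date handling to the Java 8 java.time classes, which are stricter than the older SimpleDateFormat. The sketch below reproduces that difference in plain Java, without Spark. The pattern yyyy-MM-dd is an assumption, since the pattern used in the lab is not shown here, and the class and method names are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Strict java.time parsing versus lenient SimpleDateFormat; not Spark code. */
public class DateParsingSketch {

  /** True when java.time accepts the text with the given pattern. */
  public static boolean parsesWithJavaTime(String text, String pattern) {
    try {
      LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  /** True when the older SimpleDateFormat accepts the text with the given pattern. */
  public static boolean parsesWithSimpleDateFormat(String text, String pattern) {
    try {
      new SimpleDateFormat(pattern).parse(text);
      return true;
    } catch (ParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String text = "2015-10-6"; // single-digit day, as in the exception message

    // The older parser tolerates the missing leading zero.
    System.out.println(parsesWithSimpleDateFormat(text, "yyyy-MM-dd")); // prints true

    // java.time insists on two digits for "dd".
    System.out.println(parsesWithJavaTime(text, "yyyy-MM-dd")); // prints false

    // A pattern with single letters accepts one or two digits.
    System.out.println(parsesWithJavaTime(text, "yyyy-M-d")); // prints true
  }
}
```

If the data really contains single-digit days or months, adjusting the pattern may be a cleaner long-term fix than falling back to the LEGACY policy.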
Chapter 3
Lab #320
Lab #321
/jgperrin/net.jgp.books.spark.ch03
Mi-figue, mi-raisin
Mixed bags
The lakehouse is a full ecosystem
Or is it an operating system?
[Diagram: three layers, top to bottom]
Outcome: Business, Data science, Data engineering
Processing & Storage: Delta Lake & Delta Engine
Data sources: Streams, Systems, Files, Other databases, TBA?
Takeaways
• Apache Spark v3 is a major update, 3400+ patches

• Foundation for a rich data ecosystem

• Python increasingly popular, beats Scala

• Cornerstone for the lakehouse concept
Thank you! http://jgp.ai/sia
Join me for DataFriday:
http://jgp.ai/datafriday
Backup
Credits
• World of Watson by Jean-Georges Perrin CC BY-SA 4.0

• Digital Garage by Jean-Georges Perrin CC BY-SA 4.0

• Figs, grapes and rosehips by Marco Verch Professional Photographer and
Speaker, Flickr

• Soup by Valeria Boltneva from Pexels

More Related Content

What's hot

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
Spark Summit
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
Shalin Shekhar Mangar
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
MapR Technologies
 
Data stax academy
Data stax academyData stax academy
Data stax academy
Duyhai Doan
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
Anurag Patel
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
Shalin Shekhar Mangar
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
Danny Yuan
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
Databricks
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
Uwe Printz
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Lucidworks
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Lucidworks
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
Alex Moundalexis
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
Yahoo Developer Network
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
Matthias Niehoff
 

What's hot (20)

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 

Similar to Apache Spark v3.0.0

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Talavant Data Lake Analytics
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics
Sean Forgatch
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
HostedbyConfluent
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
K.Mohamed Faizal
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Kristofferson A
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020MySQL Ecosystem in 2020
MySQL Ecosystem in 2020
Alkin Tezuysal
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
tieleman
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
Charles Givre
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudMongoDB
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
Amazon Web Services
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
Kognitio
 
HotSpotコトハジメ
HotSpotコトハジメHotSpotコトハジメ
HotSpotコトハジメ
Yasumasa Suenaga
 
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
European Collaboration Summit
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
NETWAYS
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
NETWAYS
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQL
Bjoern Rost
 

Similar to Apache Spark v3.0.0 (20)

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Talavant Data Lake Analytics
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020MySQL Ecosystem in 2020
MySQL Ecosystem in 2020
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal Cloud
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
 
HotSpotコトハジメ
HotSpotコトハジメHotSpotコトハジメ
HotSpotコトハジメ
 
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQL
 

More from Jean-Georges Perrin

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the world
Jean-Georges Perrin
 
Big data made easy with a Spark
Big data made easy with a SparkBig data made easy with a Spark
Big data made easy with a Spark
Jean-Georges Perrin
 
Why i love Apache Spark?
Why i love Apache Spark?Why i love Apache Spark?
Why i love Apache Spark?
Jean-Georges Perrin
 
Big Data made easy with a Spark
Big Data made easy with a SparkBig Data made easy with a Spark
Big Data made easy with a Spark
Jean-Georges Perrin
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentions
Jean-Georges Perrin
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the Community
Jean-Georges Perrin
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)
Jean-Georges Perrin
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
Jean-Georges Perrin
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
Jean-Georges Perrin
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Jean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
Jean-Georges Perrin
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
Jean-Georges Perrin
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
Jean-Georges Perrin
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
Jean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
Jean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
Jean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
Jean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
Jean-Georges Perrin
 

Apache Spark v3.0.0

  • 1. Jean-Georges Perrin • @jgperrin Apache Spark v3.0.0 What’s new? A very personal view.
  • 2. Jean-Georges Perrin Software since 1983 $>0 1995 Big Data since 1984 $>0 2006 AI since 1994 $>0 2010 x12 @jgperrin • http://jgp.ai
  • 8. An analytics operating system? [Diagram: layered stacks with the labels Hardware, OS, Distributed OS, Analytics OS, and Apps]
  • 9. An analytics operating system? [Same diagram, with a brace annotation]
  • 11. DATA Engineer and DATA Scientist
      • DATA Engineer: Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
          • Match architecture with business needs.
          • Develop processes for data modeling, mining, and pipelines.
          • Improve data reliability and quality.
      • DATA Scientist: Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
          • Prepare data for predictive models.
          • Explore data to find hidden gems and patterns.
          • Tell stories to key stakeholders.
      Sources: Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
  • 14. Python rules in Notebooks
      Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc; Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
  • 15. A few more figures: Who does not like performance figures?
      • Databricks:
          • Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2); the slide shows 5,000,000,000,001
          • >90% of all Spark API calls are Spark SQL, regardless of the language used
      • Community:
          • Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB benchmark
      Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc; Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html; Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
  • 17. 3400+ Jira tickets
      Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html; Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
  • 18. Highlights in a nutshell
      • Python
          • Python v3 only (so long Python v2)
          • Better error handling
          • Koalas offers better Pandas support (close to 80%)
      • SQL
          • Better ANSI SQL compliance
      • Core
          • Adaptive query execution, including partition pruning
          • Java v11 support, Scala v2.12 only
          • Hydrogen: hardware & accelerator aware scheduler, GPU support for model training
  • 19. Optimizing the optimizer: Got to love Catalyst
      • v1.x: rule
      • v2.x: rule & cost (thanks to IBM)
      • v3.x: rule & cost & runtime (thanks to Databricks & Intel)
  • 20. Adaptive query execution (AQE): Yet another 3-letter acronym
      • Dynamically coalescing shuffle partitions
      • Dynamically switching join strategies
      • Dynamically optimizing skew joins
      Sources: Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
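The first AQE technique on slide 20, coalescing shuffle partitions, can be pictured with a small JDK-only sketch. This is an illustration of the idea, not Spark's implementation: the class `CoalesceSketch`, the method `coalesce`, and the greedy "merge neighbors up to a target size" rule are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of partition coalescing; not Spark code.
class CoalesceSketch {

  /**
   * Greedily merges adjacent partitions. A group is closed as soon as
   * adding the next partition would push it past targetBytes.
   * Returns the total size of each merged group, in order.
   */
  static List<Long> coalesce(long[] partitionBytes, long targetBytes) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    boolean open = false; // true once the current group holds a partition
    for (long size : partitionBytes) {
      if (open && current + size > targetBytes) {
        merged.add(current);
        current = 0;
      }
      current += size;
      open = true;
    }
    if (open) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    long[] sizes = {10, 20, 5, 70, 3, 2};
    // Six small shuffle partitions collapse into three groups.
    System.out.println(coalesce(sizes, 64)); // prints [35, 70, 5]
  }
}
```

A partition that is already larger than the target (70 here) stays alone in its group; the sketch never splits partitions, it only merges neighbors.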
  • 21. For the entire session (Chapter 12, Lab #302, /jgperrin/net.jgp.books.spark.ch12)

      // AQE is switched on or off for the entire session
      SparkSession spark = SparkSession.builder()
          .appName("Join using AQE")
          .master("local[*]")
          .config("spark.sql.adaptive.enabled", useAqe)
          .getOrCreate();
      …
      Dataset<Row> institPerCountyDf = higherEdDf.join(
          countyZipDf,
          higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
          "inner");
      institPerCountyDf = institPerCountyDf.join(
          censusDf,
          institPerCountyDf.col("county").equalTo(censusDf.col("countyId")),
          "left");
  • 22. New static functions: In a free companion book (http://jgp.ai/sia)
      • sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133)
      • any, every, some (SPARK-19851)
      • bit_and, bit_or (SPARK-27879)
      • bit_count (SPARK-29491)
      • bit_xor (SPARK-29545)
      • bool_and, bool_or (SPARK-30184)
      • count_if (SPARK-27425)
      • date_part (SPARK-28690)
      • extract (SPARK-23903)
      • forall (SPARK-27905)
      • from_csv (SPARK-25393)
      • make_date (SPARK-28432)
      • make_interval (SPARK-29393)
      • make_timestamp (SPARK-28459)
      • map_entries (SPARK-23935)
      • map_filter (SPARK-23937)
      • map_zip_with (SPARK-23938)
      • max_by, min_by (SPARK-27653)
      • schema_of_csv (SPARK-25672)
      • to_csv (SPARK-25638)
      • transform_keys (SPARK-23939)
      • transform_values (SPARK-23940)
      • typeof (SPARK-29961)
      • version (SPARK-29554)
      • xxhash64 (SPARK-27099)
      Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
  • 23. SQL: Always a soup
      • Finally a reference guide
          • http://jgp.ai/sparksql
      • EXPLAIN can be FORMATTED
      • Proleptic Gregorian calendar, based on Java 8
      • Overflow check
      • ANSI compatibility through configuration flag
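The "overflow check" bullet on slide 23 is easiest to picture in plain Java. The sketch below is only an analogy for what an overflow check means: it contrasts silent integer wraparound with checked arithmetic from the JDK. The class and method names are hypothetical, and the sketch does not claim to show how Spark implements the check behind its ANSI flag.

```java
// Hypothetical analogy for an overflow check; not Spark code.
class OverflowDemo {

  /** Plain int addition: silently wraps around on overflow. */
  static int wrappingAdd(int a, int b) {
    return a + b;
  }

  /** Returns true when a + b does not fit in an int. */
  static boolean checkedAddOverflows(int a, int b) {
    try {
      Math.addExact(a, b); // throws ArithmeticException on overflow
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Without a check, MAX_VALUE + 1 quietly becomes MIN_VALUE.
    System.out.println(wrappingAdd(Integer.MAX_VALUE, 1) == Integer.MIN_VALUE); // prints true
    // With a check, the same addition is reported as an error.
    System.out.println(checkedAddOverflows(Integer.MAX_VALUE, 1)); // prints true
  }
}
```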
  • 24. Ingestion: Who needs a push down?
      • Already available in databases
      • Lets you filter what you ingest, before you ingest it
      • Equivalent to, but easier than, ingesting first and filtering afterwards
  • 25. Will only ingest the result of the MySQL query (Chapter 8, Lab #310, /jgperrin/net.jgp.books.spark.ch08)

      // The join runs in MySQL; Spark reads the result as if it were a table
      String sqlQuery =
          "select actor.first_name, actor.last_name, film.title, "
          + "film.description "
          + "from actor, film_actor, film "
          + "where actor.actor_id = film_actor.actor_id "
          + "and film_actor.film_id = film.film_id";
      Dataset<Row> df = spark.read().jdbc(
          "jdbc:mysql://localhost:3306/sakila",
          "(" + sqlQuery + ") actor_film_alias",
          props);
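The trick on slide 25 is that the second argument of `jdbc(...)` is a parenthesized query followed by an alias, so the database sees a derived table. A minimal sketch of that wrapping step follows; the helper `asDerivedTable` and the class `PushdownQuery` are hypothetical names introduced for this example, and stripping a trailing semicolon is an added precaution, not something the slide does.

```java
// Hypothetical helper; only the "(query) alias" shape comes from the slide.
class PushdownQuery {

  /** Wraps a SELECT statement as "(select ...) alias" for use as a table name. */
  static String asDerivedTable(String sqlQuery, String alias) {
    String query = sqlQuery.trim();
    // A trailing semicolon would be invalid inside the parentheses.
    if (query.endsWith(";")) {
      query = query.substring(0, query.length() - 1).trim();
    }
    return "(" + query + ") " + alias;
  }

  public static void main(String[] args) {
    String table = asDerivedTable(
        "select first_name, last_name from actor;", "actor_alias");
    System.out.println(table);
    // prints (select first_name, last_name from actor) actor_alias
  }
}
```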
  • 26. Will only ingest books where authorId is 1 (Chapter 7, Lab #201, /jgperrin/net.jgp.books.spark.ch07)

      +---+--------+----------------------------------------------------------------------+-----------+----------------------+
      | id|authorId|                                                                 title|releaseDate|                  link|
      +---+--------+----------------------------------------------------------------------+-----------+----------------------+
      |  1|       1|      Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
      |  2|       1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
      |  3|       1|         The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
      |  4|       1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
      |  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
      |  6|       2| Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1|
      |  7|       3|                                        Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
      …

      Dataset<Row> df = spark.read().format("csv")
          …
          .load("data/books.csv")
          .filter("authorId = 1");

      +---+--------+----------------------------------------------------------------------+-----------+----------------------+
      | id|authorId|                                                                 title|releaseDate|                  link|
      +---+--------+----------------------------------------------------------------------+-----------+----------------------+
      |  1|       1|      Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
      |  2|       1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
      |  3|       1|         The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
      |  4|       1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
      +---+--------+----------------------------------------------------------------------+-----------+----------------------+
  • 27. Migration tips: Yes, they are needed
      • Compilation will detect some (new Exception in structured streaming)
      • Runtime will throw you off:
          • Parsing dates
          • Data sources (v2 on the way)
      • Reference
          • https://spark.apache.org/docs/latest/migration-guide.html
  • 28. Chapter 3, Lab #320, Lab #321, /jgperrin/net.jgp.books.spark.ch03

      org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

      SparkSession spark = SparkSession.builder()
          .appName("CSV to dataframe to Dataset<Book> and back")
          .master("local")
          .getOrCreate();
      spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");

      or:

      SparkSession spark = SparkSession.builder()
          .appName("CSV to dataframe to Dataset<Book> and back")
          .master("local")
          .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
          .getOrCreate();
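Why does '2015-10-6' trip the new parser on slide 28? The Spark migration guide explains that versions up to 2.4 relied on `java.text.SimpleDateFormat`, while v3.0 builds on the Java 8 `java.time` formatter. The JDK-only sketch below reproduces that difference outside Spark. The pattern `yyyy-MM-dd` is an assumption (the slide does not show the pattern used in the lab), and the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical demo of the two JDK parsers; not Spark code.
class DateParserDemo {

  // Assumed pattern; the slide only shows the failing value.
  static final String PATTERN = "yyyy-MM-dd";

  /** Legacy-style parsing: tolerant of a one-digit day. */
  static boolean parsesWithSimpleDateFormat(String text) {
    try {
      new SimpleDateFormat(PATTERN).parse(text);
      return true;
    } catch (ParseException e) {
      return false;
    }
  }

  /** Java 8 parsing: "dd" demands exactly two digits. */
  static boolean parsesWithJavaTime(String text) {
    try {
      LocalDate.parse(text, DateTimeFormatter.ofPattern(PATTERN));
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String value = "2015-10-6"; // the value from the exception message
    System.out.println(parsesWithSimpleDateFormat(value)); // prints true
    System.out.println(parsesWithJavaTime(value));         // prints false
  }
}
```

Setting the policy to LEGACY, as on the slide, keeps the old tolerant behavior; cleaning the data so that days are zero-padded ('2015-10-06') is the alternative.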
  • 30. The lakehouse is a full ecosystem: Or is it an operating system? [Diagram with three layers: Data sources, Processing & Storage, Outcome. Labels: Streams, Systems, Files, Other databases (shown twice), TBA?, Business, Data science, Data engineering, Delta Lake & Delta Engine]
  • 31. Takeaways
      • Apache Spark v3 is a major update, 3400+ patches
      • Foundation for a rich data ecosystem
      • Python increasingly popular, beats Scala
      • Cornerstone for the lakehouse concept
  • 32. Thank you! http://jgp.ai/sia Join me for DataFriday: http://jgp.ai/datafriday
  • 34. Credits
      • World of Watson by Jean-Georges Perrin, CC BY-SA 4.0
      • Digital Garage by Jean-Georges Perrin, CC BY-SA 4.0
      • Figs, grapes and rosehips by Marco Verch Professional Photographer and Speaker, Flickr
      • Soup by Valeria Boltneva from Pexels