SlideShare a Scribd company logo
1 of 37
Download to read offline
New Developments
in Open Source Ecosystem:
Apache Spark 3.0, Koalas, Delta Lake
李 潇 (@gatorsmile)
2019-09-26 @ Hangzhou
Dr. Matei Zaharia
Spark 起源于加州伯克利分校
Spark 原创团队成立 Databricks
Spark 3.0
Spark 被捐赠给 Apache Foundation
Spark 开源
Spark 成为 Apache 顶级项目
2009 2010 2011 2012 2013 2015 2017 20182014 20192016
Spark 2.0 发布
10 Years
Stackoverflow: Q & A for professional and enthusiast programmers, 9.4 m visits/day
Source: https://insights.stackoverflow.com/trends?tags=pyspark%2Capache-flink%2Chadoop%2Capache-spark%2Chbase%2Capache-kafka%2Capache-beam%2Chive%2Ccassandra
Apache Spark
Apache Hadoop
PySpark
Apache Kafka
Apache Hive
Apache Cassandra
Apache Flink
Apache Hbase
Apache Beam
10年 累计排名
Apache Spark
PySpark
Apache Kafka
Apache Hive
Apache Hadoop
Apache Cassandra
Apache Beam
Apache Flink
Apache Hbase
当前 排名
88,615 responses
最受欢迎的平台之一:
开 发 者 调 查 报 告
3.0
Will resolve 2000+ JIRAs
Data Source API with
Catalog Supports
Adaptive Query
Execution
Spark Graph Accelerator-aware
Scheduling
Dynamic Partition
Pruning
ANSI SQL
Compliance
SQL Hints
Spark on
Kubernetes
JDK 11 Hadoop 3Vectorization in
SparkR
Scala 2.12
Query Optimization in Spark 2.X
Missing Statistics (一次性计算 ETL workloads)
Out-of-date Statistics (存储与计算分离)
Misestimated Costs (多样部署环境 + UDFs)
Query Optimization
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
Dynamic Partition Pruning
Project
Join
Filter
Scan Scan
t1: a large fact table
with many partitions
t2.id < 2
t2: a dimension
table with a filter
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
t1.pKey = t2.pKey
Project
Join
Filter + Scan Scan
t1.pKey = t2.pKey
Scan all the partitions
t2.id < 2
Filter pushdown
Filter
t1.pkey IN (
SELECT t2.pKey
FROM t2
WHERE t2.id < 2)
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
Dynamic Partition Pruning
Project
Join
Filter + Scan Filter + Scan
Scan the affected partitions
t2.id < 2
Filter pushdown
t1.pKey in DPPFilterResult
Speed: 33 X faster Scan: 90+% less
Dynamic Partition Pruning
t1.pKey = t2.pKey
Adaptive Query Execution
adaptive planning
Adaptive Query Execution
SELECT *
FROM t1
JOIN t2
ON t1.key = t2.key
AND t2.col2 LIKE '9999%'
SortMergeJoin
Exchange
Scan Filter + Scan
t2.col2 LIKE
‘9999%’
t1.key = t2.key
SortSort
Exchange
QueryStrage0QueryStrage1
Adaptive Query Execution
SELECT *
FROM t1
JOIN t2
ON t1.key = t2.key
AND t2.col2 LIKE '9999%'
BroadcastHashJoin
Exchange
Scan Filter + Scan
t2.col2 LIKE
‘9999%’
t1.key = t2.key
BroadcastExchange
Exchange
QueryStrage0QueryStrage1
QueryStrage2
生 态
Dr. Ion Stoica & Dr. Matei Zaharia
18
BUSINESS INTELLIGENCE BIG DATA ANALYTICS UNIFYING DATA + AI
Autonomous Decisions
Advanced ML / AI - Make the
decision for me
Spark as a unifying technology -
Data processing at scale +
analytics + AI
Data Warehouses
Reactive Analytics - What just
happened?
Predictive Analytics
Proactive Analytics - Predict
what’s going to happen
数 据 科 学数 据 工 程
Accelerate innovation by unifying data
science and engineering
数 据 科 学数 据 工 程
数 据 科 学
Image source: Stack Overflow
Traffic for major programming languages
Image source: pypistats.org
koalas
koalas
PySpark
Column df['col'] df['col']
Mutability Mutable Immutable
Add a
column
df['c'] = df['a'] + df['b'] df.withColumn('c', df['a'] + df['b'])
Rename
columns
df.columns = ['a','b'] df.select(df['c1'].alias('a'), df['c2'].alias('b'))
Value
count
df['col'].value_counts()
df.groupBy(df['col']).count()
.orderBy('count', ascending = False)
pandas
DataFrame
PySpark
DataFrame
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
df = (spark.read
.option("inferSchema", "true")
.option("comment", True)
.csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x*df.x)
pandas PySpark
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
数 据 工 程
A New Standard for Building Data Lakes
Open Format Based on Parquet
With Transactions
Apache Spark™ APIs
Instead of parquet … … simply say delta
dataframe
.write
.format("parquet")
.save("/data")
dataframe
.write
.format("delta")
.save("/data")
Delta Lake ensures data reliability
Transactional
Log
Parquet Files
Streaming
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel/Data Snapshots
Key Features
High Quality & Reliable Data
always ready for analytics
Batch
Updates
Deletes
Merges
Improved reliability:
Petabyte-scale jobs
10x lower compute:
640 服 务 器 → 64!
Simpler, faster ETL:
84 jobs → 3 jobs
halved data latency
32
Easier transactional
updates:
No downtime or
consistency issues!
Simple CDC:
Easy with MERGE
Improved performance: Queries run faster
大 于 一 小 时 → 少 于 六 秒
33
Databricks Cluster
Data consistency and
integrity:
not available before
Increased data quality:
name match accuracy up
80% → 95%
Faster data loads:
一 天 → 二十分钟
Databricks Delta in Numbers!
日均 38+ PB
在 客 户 生 产 环 境
处 理 数 据 量
日均 400+ TB
某 一 大 型 客 户 的 生 产 环 境
新 入 库 数 据 量
koalas
Thank You
李潇 (lixiao@databricks.com)

More Related Content

What's hot

Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovDatabricks
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypetPyData
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumPyData
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in ActionIMC Institute
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionPatrick McFadin
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)DataStax Academy
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Russell Spitzer
 
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Attila Szucs
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team ApachePatrick McFadin
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesRussell Spitzer
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 

What's hot (20)

Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypet
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with Palladium
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 

Similar to New Developments in Open Source Ecosystem: Apache Spark 3.0, Koalas, Delta Lake

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingMark Kerzner
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileDatabricks
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Provectus
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Humoyun Ahmedov
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 

Similar to New Developments in Open Source Ecosystem: Apache Spark 3.0, Koalas, Delta Lake (20)

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-Mobile
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 

Recently uploaded

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 

Recently uploaded (20)

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 

New Developments in Open Source Ecosystem: Apache Spark 3.0, Koalas, Delta Lake

  • 1. New Developments in Open Source Ecosystem: Apache Spark 3.0, Koalas, Delta Lake 李 潇 (@gatorsmile) 2019-09-26 @ Hangzhou
  • 3. Spark 起源于加州伯克利分校 Spark 原创团队成立 Databricks Spark 3.0 Spark 被捐赠给 Apache Foundation Spark 开源 Spark 成为 Apache 顶级项目 2009 2010 2011 2012 2013 2015 2017 20182014 20192016 Spark 2.0 发布 10 Years
  • 4. Stackoverflow: Q & A for professional and enthusiast programmers, 9.4 m visits/day Source: https://insights.stackoverflow.com/trends?tags=pyspark%2Capache-flink%2Chadoop%2Capache-spark%2Chbase%2Capache-kafka%2Capache-beam%2Chive%2Ccassandra Apache Spark Apache Hadoop PySpark Apache Kafka Apache Hive Apache Cassandra Apache Flink Apache Hbase Apache Beam 10年 累计排名 Apache Spark PySpark Apache Kafka Apache Hive Apache Hadoop Apache Cassandra Apache Beam Apache Flink Apache Hbase 当前 排名
  • 6. 3.0
  • 7. Will resolve 2000+ JIRAs Data Source API with Catalog Supports Adaptive Query Execution Spark Graph Accelerator-aware Scheduling Dynamic Partition Pruning ANSI SQL Compliance SQL Hints Spark on Kubernetes JDK 11 Hadoop 3Vectorization in SparkR Scala 2.12
  • 8. Query Optimization in Spark 2.X Missing Statistics (一次性计算 ETL workloads) Out-of-date Statistics (存储与计算分离) Misestimated Costs (多样部署环境 + UDFs)
  • 9. Query Optimization Spark 1.x, Rule Spark 2.x, Rule + Cost Spark 3.0, Rule + Cost + Runtime
  • 10. Dynamic Partition Pruning Project Join Filter Scan Scan t1: a large fact table with many partitions t2.id < 2 t2: a dimension table with a filter SELECT t1.id, t2.pKey FROM t1 JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2 t1.pKey = t2.pKey
  • 11. Project Join Filter + Scan Scan t1.pKey = t2.pKey Scan all the partitions t2.id < 2 Filter pushdown Filter t1.pkey IN ( SELECT t2.pKey FROM t2 WHERE t2.id < 2) SELECT t1.id, t2.pKey FROM t1 JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2 Dynamic Partition Pruning
  • 12. Project Join Filter + Scan Filter + Scan Scan the affected partitions t2.id < 2 Filter pushdown t1.pKey in DPPFilterResult Speed: 33 X faster Scan: 90+% less Dynamic Partition Pruning t1.pKey = t2.pKey
  • 14. Adaptive Query Execution SELECT * FROM t1 JOIN t2 ON t1.key = t2.key AND t2.col2 LIKE '9999%' SortMergeJoin Exchange Scan Filter + Scan t2.col2 LIKE ‘9999%’ t1.key = t2.key SortSort Exchange QueryStrage0QueryStrage1
  • 15. Adaptive Query Execution SELECT * FROM t1 JOIN t2 ON t1.key = t2.key AND t2.col2 LIKE '9999%' BroadcastHashJoin Exchange Scan Filter + Scan t2.col2 LIKE ‘9999%’ t1.key = t2.key BroadcastExchange Exchange QueryStrage0QueryStrage1 QueryStrage2
  • 17. Dr. Ion Stoica & Dr. Matei Zaharia
  • 18. 18 BUSINESS INTELLIGENCE BIG DATA ANALYTICS UNIFYING DATA + AI Autonomous Decisions Advanced ML / AI - Make the decision for me Spark as a unifying technology - Data processing at scale + analytics + AI Data Warehouses Reactive Analytics - What just happened? Predictive Analytics Proactive Analytics - Predict what’s going to happen
  • 19. 数 据 科 学数 据 工 程
  • 20. Accelerate innovation by unifying data science and engineering 数 据 科 学数 据 工 程
  • 21. 数 据 科
  • 22. Image source: Stack Overflow Traffic for major programming languages
  • 24. Column df['col'] df['col'] Mutability Mutable Immutable Add a column df['c'] = df['a'] + df['b'] df.withColumn('c', df['a'] + df['b']) Rename columns df.columns = ['a','b'] df.select(df['c1'].alias('a'), df['c2'].alias('b')) Value count df['col'].value_counts() df.groupBy(df['col']).count() .orderBy('count', ascending = False) pandas DataFrame PySpark DataFrame
  • 25. import pandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x df = (spark.read .option("inferSchema", "true") .option("comment", True) .csv("my_data.csv")) df = df.toDF('x', 'y', 'z1') df = df.withColumn('x2', df.x*df.x) pandas PySpark
  • 26. import databricks.koalas as ks df = ks.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x import pandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x
  • 27. 数 据 工
  • 28. A New Standard for Building Data Lakes Open Format Based on Parquet With Transactions Apache Spark™ APIs
  • 29. Instead of parquet … … simply say delta dataframe .write .format("parquet") .save("/data") dataframe .write .format("delta") .save("/data")
  • 30. Delta Lake ensures data reliability Transactional Log Parquet Files Streaming ● ACID Transactions ● Schema Enforcement ● Unified Batch & Streaming ● Time Travel/Data Snapshots Key Features High Quality & Reliable Data always ready for analytics Batch Updates Deletes Merges
  • 31. Improved reliability: Petabyte-scale jobs 10x lower compute: 640 服 务 器 → 64! Simpler, faster ETL: 84 jobs → 3 jobs halved data latency
  • 32. 32 Easier transactional updates: No downtime or consistency issues! Simple CDC: Easy with MERGE Improved performance: Queries run faster 大 于 一 小 时 → 少 于 六 秒
  • 33. 33 Databricks Cluster Data consistency and integrity: not available before Increased data quality: name match accuracy up 80% → 95% Faster data loads: 一 天 → 二十分钟
  • 34. Databricks Delta in Numbers! 日均 38+ PB 在 客 户 生 产 环 境 处 理 数 据 量 日均 400+ TB 某 一 大 型 客 户 的 生 产 环 境 新 入 库 数 据 量
  • 35.