SlideShare a Scribd company logo
New Developments
in Open Source Ecosystem:
Apache Spark 3.0, Koalas, Delta Lake
李 潇 (@gatorsmile)
2019-09-26 @ Hangzhou
Dr. Matei Zaharia
Spark 起源于加州伯克利分校
Spark 原创团队成立 Databricks
Spark 3.0
Spark 被捐赠给 Apache Foundation
Spark 开源
Spark 成为 Apache 顶级项目
2009 2010 2011 2012 2013 2015 2017 20182014 20192016
Spark 2.0 发布
10 Years
Stackoverflow: Q & A for professional and enthusiast programmers, 9.4 m visits/day
Source: https://insights.stackoverflow.com/trends?tags=pyspark%2Capache-flink%2Chadoop%2Capache-spark%2Chbase%2Capache-kafka%2Capache-beam%2Chive%2Ccassandra
Apache Spark
Apache Hadoop
PySpark
Apache Kafka
Apache Hive
Apache Cassandra
Apache Flink
Apache Hbase
Apache Beam
10年 累计排名
Apache Spark
PySpark
Apache Kafka
Apache Hive
Apache Hadoop
Apache Cassandra
Apache Beam
Apache Flink
Apache Hbase
当前 排名
88,615 responses
最受欢迎的平台之一:
开 发 者 调 查 报 告
3.0
Will resolve 2000+ JIRAs
Data Source API with
Catalog Supports
Adaptive Query
Execution
Spark Graph Accelerator-aware
Scheduling
Dynamic Partition
Pruning
ANSI SQL
Compliance
SQL Hints
Spark on
Kubernetes
JDK 11 Hadoop 3Vectorization in
SparkR
Scala 2.12
Query Optimization in Spark 2.X
Missing Statistics (一次性计算 ETL workloads)
Out-of-date Statistics (存储与计算分离)
Misestimated Costs (多样部署环境 + UDFs)
Query Optimization
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
Dynamic Partition Pruning
Project
Join
Filter
Scan Scan
t1: a large fact table
with many partitions
t2.id < 2
t2: a dimension
table with a filter
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
t1.pKey = t2.pKey
Project
Join
Filter + Scan Scan
t1.pKey = t2.pKey
Scan all the partitions
t2.id < 2
Filter pushdown
Filter
t1.pkey IN (
SELECT t2.pKey
FROM t2
WHERE t2.id < 2)
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
Dynamic Partition Pruning
Project
Join
Filter + Scan Filter + Scan
Scan the affected partitions
t2.id < 2
Filter pushdown
t1.pKey in DPPFilterResult
Speed: 33 X faster Scan: 90+% less
Dynamic Partition Pruning
t1.pKey = t2.pKey
Adaptive Query Execution
adaptive planning
Adaptive Query Execution
SELECT *
FROM t1
JOIN t2
ON t1.key = t2.key
AND t2.col2 LIKE '9999%'
SortMergeJoin
Exchange
Scan Filter + Scan
t2.col2 LIKE
‘9999%’
t1.key = t2.key
SortSort
Exchange
QueryStrage0QueryStrage1
Adaptive Query Execution
SELECT *
FROM t1
JOIN t2
ON t1.key = t2.key
AND t2.col2 LIKE '9999%'
BroadcastHashJoin
Exchange
Scan Filter + Scan
t2.col2 LIKE
‘9999%’
t1.key = t2.key
BroadcastExchange
Exchange
QueryStrage0QueryStrage1
QueryStrage2
生 态
Dr. Ion Stoica & Dr. Matei Zaharia
18
BUSINESS INTELLIGENCE BIG DATA ANALYTICS UNIFYING DATA + AI
Autonomous Decisions
Advanced ML / AI - Make the
decision for me
Spark as a unifying technology -
Data processing at scale +
analytics + AI
Data Warehouses
Reactive Analytics - What just
happened?
Predictive Analytics
Proactive Analytics - Predict
what’s going to happen
数 据 科 学数 据 工 程
Accelerate innovation by unifying data
science and engineering
数 据 科 学数 据 工 程
数 据 科 学
Image source: Stack Overflow
Traffic for major programming languages
Image source: pypistats.org
koalas
koalas
PySpark
Column df['col'] df['col']
Mutability Mutable Immutable
Add a
column
df['c'] = df['a'] + df['b'] df.withColumn('c', df['a'] + df['b'])
Rename
columns
df.columns = ['a','b'] df.select(df['c1'].alias('a'), df['c2'].alias('b'))
Value
count
df['col'].value_counts()
df.groupBy(df['col']).count()
.orderBy('count', ascending = False)
pandas
DataFrame
PySpark
DataFrame
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
df = (spark.read
.option("inferSchema", "true")
.option("comment", True)
.csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x*df.x)
pandas PySpark
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
数 据 工 程
A New Standard for Building Data Lakes
Open Format Based on Parquet
With Transactions
Apache Spark™ APIs
Instead of parquet … … simply say delta
dataframe
.write
.format("parquet")
.save("/data")
dataframe
.write
.format("delta")
.save("/data")
Delta Lake ensures data reliability
Transactional
Log
Parquet Files
Streaming
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel/Data Snapshots
Key Features
High Quality & Reliable Data
always ready for analytics
Batch
Updates
Deletes
Merges
Improved reliability:
Petabyte-scale jobs
10x lower compute:
640 服 务 器 → 64!
Simpler, faster ETL:
84 jobs → 3 jobs
halved data latency
32
Easier transactional
updates:
No downtime or
consistency issues!
Simple CDC:
Easy with MERGE
Improved performance: Queries run faster
大 于 一 小 时 → 少 于 六 秒
33
Databricks Cluster
Data consistency and
integrity:
not available before
Increased data quality:
name match accuracy up
80% → 95%
Faster data loads:
一 天 → 二十分钟
Databricks Delta in Numbers!
日均 38+ PB
在 客 户 生 产 环 境
处 理 数 据 量
日均 400+ TB
某 一 大 型 客 户 的 生 产 环 境
新 入 库 数 据 量
koalas
Thank You
李潇 (lixiao@databricks.com)

More Related Content

What's hot

Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Databricks
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypet
PyData
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with Palladium
PyData
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
IMC Institute
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
Julian Hyde
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
Miguel González-Fierro
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
Josef Adersberger
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
DataStax Academy
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
Russell Spitzer
 
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Attila Szucs
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
Russell Spitzer
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
Slim Bouguerra
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
Uwe Printz
 

What's hot (20)

Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypet
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with Palladium
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
WattGo: Analyses temps-réél de series temporelles avec Spark et Solr (Français)
 
Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0Cassandra Fundamentals - C* 2.0
Cassandra Fundamentals - C* 2.0
 
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
Distributing C# Applications with Apache Spark (TechEd 2017, Prague)
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 

Similar to New developments in open source ecosystem spark3.0 koalas delta lake

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
Databricks
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
dhiguero
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
Mark Kerzner
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Databricks
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-Mobile
Databricks
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Provectus
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
Вадим Челышов
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
Dori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
Dori Waldman
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
Humoyun Ahmedov
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 

Similar to New developments in open source ecosystem spark3.0 koalas delta lake (20)

Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Advertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-MobileAdvertising Fraud Detection at Scale at T-Mobile
Advertising Fraud Detection at Scale at T-Mobile
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...
 
Mist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache SparkMist - Serverless proxy to Apache Spark
Mist - Serverless proxy to Apache Spark
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 

Recently uploaded

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 

Recently uploaded (20)

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 

New developments in open source ecosystem spark3.0 koalas delta lake

  • 1. New Developments in Open Source Ecosystem: Apache Spark 3.0, Koalas, Delta Lake 李 潇 (@gatorsmile) 2019-09-26 @ Hangzhou
  • 3. Spark 起源于加州伯克利分校 Spark 原创团队成立 Databricks Spark 3.0 Spark 被捐赠给 Apache Foundation Spark 开源 Spark 成为 Apache 顶级项目 2009 2010 2011 2012 2013 2015 2017 20182014 20192016 Spark 2.0 发布 10 Years
  • 4. Stackoverflow: Q & A for professional and enthusiast programmers, 9.4 m visits/day Source: https://insights.stackoverflow.com/trends?tags=pyspark%2Capache-flink%2Chadoop%2Capache-spark%2Chbase%2Capache-kafka%2Capache-beam%2Chive%2Ccassandra Apache Spark Apache Hadoop PySpark Apache Kafka Apache Hive Apache Cassandra Apache Flink Apache Hbase Apache Beam 10年 累计排名 Apache Spark PySpark Apache Kafka Apache Hive Apache Hadoop Apache Cassandra Apache Beam Apache Flink Apache Hbase 当前 排名
  • 6. 3.0
  • 7. Will resolve 2000+ JIRAs Data Source API with Catalog Supports Adaptive Query Execution Spark Graph Accelerator-aware Scheduling Dynamic Partition Pruning ANSI SQL Compliance SQL Hints Spark on Kubernetes JDK 11 Hadoop 3Vectorization in SparkR Scala 2.12
  • 8. Query Optimization in Spark 2.X Missing Statistics (一次性计算 ETL workloads) Out-of-date Statistics (存储与计算分离) Misestimated Costs (多样部署环境 + UDFs)
  • 9. Query Optimization Spark 1.x, Rule Spark 2.x, Rule + Cost Spark 3.0, Rule + Cost + Runtime
  • 10. Dynamic Partition Pruning Project Join Filter Scan Scan t1: a large fact table with many partitions t2.id < 2 t2: a dimension table with a filter SELECT t1.id, t2.pKey FROM t1 JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2 t1.pKey = t2.pKey
  • 11. Project Join Filter + Scan Scan t1.pKey = t2.pKey Scan all the partitions t2.id < 2 Filter pushdown Filter t1.pkey IN ( SELECT t2.pKey FROM t2 WHERE t2.id < 2) SELECT t1.id, t2.pKey FROM t1 JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2 Dynamic Partition Pruning
  • 12. Project Join Filter + Scan Filter + Scan Scan the affected partitions t2.id < 2 Filter pushdown t1.pKey in DPPFilterResult Speed: 33 X faster Scan: 90+% less Dynamic Partition Pruning t1.pKey = t2.pKey
  • 14. Adaptive Query Execution SELECT * FROM t1 JOIN t2 ON t1.key = t2.key AND t2.col2 LIKE '9999%' SortMergeJoin Exchange Scan Filter + Scan t2.col2 LIKE ‘9999%’ t1.key = t2.key SortSort Exchange QueryStrage0QueryStrage1
  • 15. Adaptive Query Execution SELECT * FROM t1 JOIN t2 ON t1.key = t2.key AND t2.col2 LIKE '9999%' BroadcastHashJoin Exchange Scan Filter + Scan t2.col2 LIKE ‘9999%’ t1.key = t2.key BroadcastExchange Exchange QueryStrage0QueryStrage1 QueryStrage2
  • 17. Dr. Ion Stoica & Dr. Matei Zaharia
  • 18. 18 BUSINESS INTELLIGENCE BIG DATA ANALYTICS UNIFYING DATA + AI Autonomous Decisions Advanced ML / AI - Make the decision for me Spark as a unifying technology - Data processing at scale + analytics + AI Data Warehouses Reactive Analytics - What just happened? Predictive Analytics Proactive Analytics - Predict what’s going to happen
  • 19. 数 据 科 学数 据 工 程
  • 20. Accelerate innovation by unifying data science and engineering 数 据 科 学数 据 工 程
  • 21. 数 据 科
  • 22. Image source: Stack Overflow Traffic for major programming languages
  • 24. Column df['col'] df['col'] Mutability Mutable Immutable Add a column df['c'] = df['a'] + df['b'] df.withColumn('c', df['a'] + df['b']) Rename columns df.columns = ['a','b'] df.select(df['c1'].alias('a'), df['c2'].alias('b')) Value count df['col'].value_counts() df.groupBy(df['col']).count() .orderBy('count', ascending = False) pandas DataFrame PySpark DataFrame
  • 25. import pandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x df = (spark.read .option("inferSchema", "true") .option("comment", True) .csv("my_data.csv")) df = df.toDF('x', 'y', 'z1') df = df.withColumn('x2', df.x*df.x) pandas PySpark
  • 26. import databricks.koalas as ks df = ks.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x import pandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x
  • 27. 数 据 工
  • 28. A New Standard for Building Data Lakes Open Format Based on Parquet With Transactions Apache Spark™ APIs
  • 29. Instead of parquet … … simply say delta dataframe .write .format("parquet") .save("/data") dataframe .write .format("delta") .save("/data")
  • 30. Delta Lake ensures data reliability Transactional Log Parquet Files Streaming ● ACID Transactions ● Schema Enforcement ● Unified Batch & Streaming ● Time Travel/Data Snapshots Key Features High Quality & Reliable Data always ready for analytics Batch Updates Deletes Merges
  • 31. Improved reliability: Petabyte-scale jobs 10x lower compute: 640 服 务 器 → 64! Simpler, faster ETL: 84 jobs → 3 jobs halved data latency
  • 32. 32 Easier transactional updates: No downtime or consistency issues! Simple CDC: Easy with MERGE Improved performance: Queries run faster 大 于 一 小 时 → 少 于 六 秒
  • 33. 33 Databricks Cluster Data consistency and integrity: not available before Increased data quality: name match accuracy up 80% → 95% Faster data loads: 一 天 → 二十分钟
  • 34. Databricks Delta in Numbers! 日均 38+ PB 在 客 户 生 产 环 境 处 理 数 据 量 日均 400+ TB 某 一 大 型 客 户 的 生 产 环 境 新 入 库 数 据 量
  • 35.