New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas

In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides a pandas-like API on top of Spark, help data scientists gain insights from their data more quickly.
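To make the idea of adaptive query optimization concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the function names and the `BROADCAST_THRESHOLD_ROWS` constant are assumptions invented for illustration. It shows the decision the Adaptive Query Execution slide hints at ("No expensive Sort!"): once the actual size of a join input is known at run time, a small input can be broadcast and hash-joined instead of sorting both sides.

```python
# Toy model of adaptive join selection. Plain Python, NOT Spark's implementation.
BROADCAST_THRESHOLD_ROWS = 3  # hypothetical threshold, chosen only for this demo


def broadcast_hash_join(small, large, key):
    """Build a hash table on the small side and probe it with the large side."""
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [{**l, **s} for l in large for s in table.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Sort both inputs on the key, then merge runs of equal keys."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def adaptive_join(left, right, key):
    """Choose the join strategy from row counts observed at run time."""
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    if len(small) <= BROADCAST_THRESHOLD_ROWS:
        return "broadcast", broadcast_hash_join(small, large, key)
    return "sort_merge", sort_merge_join(left, right, key)


sales = [
    {"item_id": 1, "qty": 5},
    {"item_id": 2, "qty": 1},
    {"item_id": 1, "qty": 2},
    {"item_id": 3, "qty": 7},
]
# Imagine this side shrank to two rows after a selective filter.
items = [{"item_id": 1, "name": "a"}, {"item_id": 2, "name": "b"}]

strategy, rows = adaptive_join(sales, items, "item_id")
print(strategy, len(rows))
```

A static planner has to guess the size of `items` before the filter runs; the sketch defers the choice until the real row count is available, which is the general idea behind making optimization decisions during query execution.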

  1. Urs Hölzle (@uhoelzle), 2019: "@JeffDean @GCPcloud R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good. MR was a seminal idea in 2003 but we've learned a lot since then. [There are new systems that] express pipelines more naturally with less code, and you get both batch and streaming from the same code."
  2. ❌ Separate Compute & Storage AI & More than SQL Open Source at Scale Data Warehouse Hadoop M/R ❌ ❌ ✔ ✔ ✔
  3. Traditional RDBMS Opinion 2008
  4. SQL & Optimization Data Model & Catalog ACID Transactions ✔ ✔ ✔ ❌ ❌ ❌ ❌ Separate Compute & Storage More than SQL (i.e., ML) Open Source at Scale Data Warehouse Hadoop M/R ❌ ❌ ✔ ✔ ✔ ✔ ✔ ✔ ✔ Data Science at Scale ❌❌ ✔ 3.0
  5. The Growing Apache Spark Ecosystem. 3.0: Improved Optimizer and Catalog. ACID Transactions. Bringing Spark's Scale to Pandas.
  6. 3.0: Improved Optimizer and Catalog
  7. Spark 3.0: Pluggable Data Catalog. DataSourceV2: • Pluggable catalog integration • Improved pushdown • Unified APIs for streaming and batch. Example: df.writeTo("catalog.db.table").overwrite($"year" === "2019")
  8. Spark 3.0: Adaptive Query Execution. Make better optimization decisions during query execution. Diagram labels: Sort, Join, Broadcast. No expensive Sort!
  9. Spark 3.0: Powerful Optimization. Diagram labels: Scan, Filter, Join. Dynamic partition pruning speeds up expensive joins. Talk later today!
  10. World Class Performance for Warehousing. Spark 3.0 Improves TPC-DS Performance by as much as 17x! Spark wins TPC-DS performance top spot! [Chart: SpeedUp, axis 0 to 20; bars labeled 25-v2.4, 17-v2.4, 15-v2.4, 42-v2.4, 6-v2.4, 58-v2.4, 56-v2.4, 54-v2.4, 71-v2.4, 33-v2.4, 60-v2.4, 55-v2.4, 52-v2.4]
  11. And Much More… 3.0 PREVIEW COMING SOON!
  12. Spark on ACID
  13. Evolution of a Cutting-Edge Data Lake. Diagram labels: Events, ???, Streaming Analytics, AI & Reporting.
  14. Easy, right? Diagram labels: Events, Streaming Analytics, AI & Reporting.
  15. Challenge #1: Historical Queries? Diagram labels: Events, Data Lake, λ-arch, Streaming Analytics, AI & Reporting.
  16. Challenge #2: Messy Data? Diagram labels: Events, Data Lake, λ-arch, Validation, Streaming Analytics, AI & Reporting.
  17. Challenge #3: Mistakes and Failures? Diagram labels: Events, Data Lake, λ-arch, Validation, Reprocessing, Partitioned, Streaming Analytics, AI & Reporting.
  18. Challenge #4: Updates? Diagram labels: Events, Data Lake, λ-arch, Validation, Reprocessing, Partitioned, Updates, UPDATE & MERGE, Scheduled to Avoid Modifications, Streaming Analytics, AI & Reporting.
  19. Wasting Time & Money Solving Systems Problems Instead of Extracting Value From Data
  20. Let’s try it instead with [logo]
  21. Challenges of the Data Lake. Diagram labels: Events, Data Lake, λ-arch, Validation, Reprocessing, Partitioned, Updates, UPDATE & MERGE, Scheduled to Avoid Modifications, Streaming Analytics, AI & Reporting.
  22. The Architecture. Diagram labels: CSV, JSON, TXT…, Kinesis; Data Lake; Bronze (Raw Ingestion), Silver (Filtered, Cleaned, Augmented), Gold (Business-level Aggregates); Quality; Streaming Analytics; AI & Reporting. Data Quality Levels: Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
  23. The Architecture (same diagram as slide 22). Full ACID Transactions: focus on your data flow, instead of worrying about failures.
  24. The Architecture (same diagram as slide 22). Powered by [logo]. Unifies streaming / batch. Convert existing jobs with minimal modifications.
  25. The Architecture (same diagram as slide 22). Streams move data through the Delta Lake: • Low-latency or manually triggered • Eliminates management of schedules and jobs
  26. The Architecture (same diagram as slide 22). Delta Lake also supports batch jobs and standard DML: INSERT, UPDATE, DELETE, MERGE, OVERWRITE. • Retention • Corrections • Change data capture
  27. Delta Lake Community. ~2+ Exabytes of Delta Read/Writes. 3700+ Orgs using Delta. [Chart: Downloads by month, March through September; axis 0 to 20,000]
  28. Delta Lake beyond Spark
  29. Announcing: Delta Lake Joins the Linux Foundation!
  30. Demo
  31. Bringing the Power of Apache Spark to Pandas
  32. pandas: import pandas as pd; df = pd.read_csv('my_data.csv'); df.columns = ['x', 'y', 'z1']; df['x2'] = df.x * df.x. This works great on my laptop… but what if I have more data? Koalas: import databricks.koalas as ks; df = ks.read_delta('/lake/data'); df.columns = ['x', 'y', 'z1']; df['x2'] = df.x * df.x
  33. Growing Koalas Ecosystem. 10,000+ downloads per day. 204,452 downloads this Sept. ~100% month-over-month download growth. 21 bi-weekly releases.
  34. How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas. Challenge: increasing scale and complexity of data operations. Struggling with the “Spark switch” from pandas. More than 10X faster with less than 1% code changes.
  35. Getting Started with Koalas. pip install koalas OR conda install koalas. Docs and updates on github.com/databricks/koalas. Project docs are published on koalas.readthedocs.io.
  36. Demo
  37. The Spark Ecosystem is Exploding. Bringing the best characteristics of the Data Lake and Traditional Relational Databases together: Tomorrow:
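
The Bronze, Silver, and Gold quality levels from the architecture slides can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the Delta Lake API: the function names, the event fields (`id`, `user`, `amount`), and the validation rule are all hypothetical. It illustrates raw ingestion, then validation plus a MERGE-style upsert for corrections, then a business-level aggregate.

```python
# Toy model of the Bronze -> Silver -> Gold flow. Plain Python, NOT the Delta Lake API.
import json
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze: keep every ingested record as-is, tagged with its arrival position."""
    return [{"offset": i, "raw": line} for i, line in enumerate(raw_lines)]


def to_silver(bronze):
    """Silver: parse, drop invalid records, and keep the latest row per event id."""
    latest = {}
    for rec in bronze:
        try:
            event = json.loads(rec["raw"])
        except json.JSONDecodeError:
            continue  # messy data: not valid JSON
        if not isinstance(event, dict):
            continue
        if not all(field in event for field in ("id", "user", "amount")):
            continue  # hypothetical validation rule: required fields only
        # MERGE-style upsert: a later record with the same id replaces the earlier one.
        latest[event["id"]] = event
    return list(latest.values())


def to_gold(silver):
    """Gold: business-level aggregate, here the total amount per user."""
    totals = defaultdict(int)
    for event in silver:
        totals[event["user"]] += event["amount"]
    return dict(totals)


raw = [
    '{"id": 1, "user": "ann", "amount": 10}',
    'not json',
    '{"id": 2, "user": "bob", "amount": 5}',
    '{"id": 1, "user": "ann", "amount": 12}',  # correction for event 1
    '{"id": 3, "user": "bob"}',                # missing amount
]

bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because the Bronze level keeps every raw record, the Silver and Gold levels can be rebuilt from it if the validation rule changes, which is one way to read the reprocessing concern raised in the challenge slides.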