
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas

In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 that make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides a pandas-like API on top of Spark, help data scientists gain insights from their data more quickly.

Published in: Data & Analytics

  1. 1. Urs Hölzle @uhoelzle @JeffDean @GCPcloud R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good. MR was a seminal idea in 2003 but we've learned a lot since then. [There are new systems that] express pipelines more naturally with less code, and you get both batch and streaming from the same code. 2019
  2. 2. ❌ Separate Compute & Storage AI & More than SQL Open Source at Scale Data Warehouse Hadoop M/R ❌ ❌ ✔ ✔ ✔
  3. 3. Traditional RDBMS Opinion 2008
  4. 4. SQL & Optimization Data Model & Catalog ACID Transactions ✔ ✔ ✔ ❌ ❌ ❌ ❌ Separate Compute & Storage More than SQL (i.e., ML) Open Source at Scale Data Warehouse Hadoop M/R ❌ ❌ ✔ ✔ ✔ ✔ ✔ ✔ ✔ Data Science at Scale ❌❌ ✔ 3.0
  5. 5. The Growing Apache Spark Ecosystem 3.0 Improved Optimizer and Catalog ACID Transactions Bringing Spark's Scale to Pandas
  6. 6. 3.0 Improved Optimizer and Catalog
  7. 7. Spark 3.0: Pluggable Data Catalog DataSourceV2 • Pluggable catalog integration • Improved pushdown • Unified APIs for streaming and batch df.writeTo("catalog.db.table").overwrite($"year" === "2019")
  8. 8. Spark 3.0: Adaptive Query Execution Make better optimization decisions during query execution. Sort Join Sort Join Broadcast No expensive Sort!
  9. 9. Spark 3.0: Powerful Optimization Scan Filter Join Scan Filter Join Dynamic partition pruning speeds up expensive joins. Talk later today!
  10. 10. World Class Performance for Warehousing Spark 3.0 Improves TPC-DS Performance by as much as 17x! Spark wins TPC-DS performance top spot! [Chart: SpeedUp over v2.4 for selected TPC-DS queries]
  11. 11. And Much More… 3.0 PREVIEW COMING SOON!
  12. 12. Spark on ACID
  13. 13. Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ???
  14. 14. Easy, right? Events AI & Reporting Streaming Analytics
  15. 15. Challenge #1: Historical Queries? Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events λ-arch1 1 1
  16. 16. Challenge #2: Messy Data? Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation 1 21 1 2
  17. 17. Challenge #3: Mistakes and Failures? Reprocessing Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Partitioned 1 2 3 1 1 3 2
  18. 18. Challenge #4: Updates? Reprocessing Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Updates Partitioned UPDATE & MERGE Scheduled to Avoid Modifications 1 2 3 1 1 3 4 4 4 2
  19. 19. Wasting Time & Money Solving Systems Problems Instead of Extracting Value From Data
  20. 20. Let’s try it instead with
  21. 21. Challenges of the Data Lake Reprocessing Data Lake λ-arch λ-arch Streaming Analytics AI & Reporting Events Validation λ-arch Validation Reprocessing Updates Partitioned UPDATE & MERGE Scheduled to Avoid Modifications 1 2 3 1 1 3 4 4 4 2
  22. 22. The Architecture Data Lake AI & Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Quality Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption. (*Data Quality Levels)
  23. 23. The Architecture Data Lake AI & Reporting Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Streaming Analytics Full ACID Transactions Focus on your data flow, instead of worrying about failures.
  24. 24. The Architecture Data Lake AI & Reporting Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Streaming Analytics Powered by Unifies streaming / batch. Convert existing jobs with minimal modifications.
  25. 25. The Architecture Data Lake AI & Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Streams move data through the Delta Lake • Low-latency or manually triggered • Eliminates management of schedules and jobs
  26. 26. The Architecture Data Lake AI & Reporting Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold CSV, JSON, TXT… Kinesis Delta Lake also supports batch jobs and standard DML: INSERT, UPDATE, DELETE, MERGE, OVERWRITE • Retention • Corrections • Change data capture
  27. 27. Delta Lake Community ~2+ Exabytes of Delta Read/Writes 3700+ Orgs using Delta [Chart: Downloads per month, March through September]
  28. 28. Delta Lake beyond Spark
  29. 29. Announcing: + Delta Lake Joins the Linux Foundation!
  30. 30. Demo
  31. 31. Bringing the Power of Apache Spark to Pandas
  32. 32. This works great on my laptop… import pandas as pd df = pd.read_csv('my_data.csv') df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x … but what if I have more data? import databricks.koalas as ks df = ks.read_delta('/lake/data') df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x
  33. 33. 10,000+ Downloads per day 204,452 Downloads this Sept ~100% Month-over-month download growth 21 Bi-weekly releases Growing Koalas Ecosystem
  34. 34. Challenge: increasing scale and complexity of data operations Struggling with the “Spark switch” from pandas More than 10X faster with less than 1% code changes How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
  35. 35. Getting Started with Koalas Docs and updates on github.com/databricks/koalas Project docs are published on koalas.readthedocs.io pip install koalas OR conda install koalas
  36. 36. Demo
  37. 37. The Spark Ecosystem is Exploding Bringing the best characteristics of the Data Lake and Traditional Relational Databases together: Tomorrow:
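
The adaptive query execution idea from slide 8 (replacing a sort-based join with a broadcast join once real sizes are known) can be illustrated with a toy model. This is a sketch of the decision only, not Spark's implementation: `JoinSide`, `choose_join`, `plan_statically`, and `replan_adaptively` are hypothetical names, the table sizes are made up, and the one Spark-specific detail assumed is the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`.

```python
from dataclasses import dataclass

# Assumption: mirrors Spark's default spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class JoinSide:
    """One input of a join with the planner's size estimate (hypothetical type)."""
    name: str
    estimated_bytes: int


def choose_join(left_bytes: int, right_bytes: int) -> str:
    """Pick a join strategy from whatever size information is available."""
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD_BYTES:
        # The small side fits in memory: ship it to every task, no sort needed.
        return "broadcast_hash_join"
    # Otherwise both sides are shuffled and sorted before merging.
    return "sort_merge_join"


def plan_statically(left: JoinSide, right: JoinSide) -> str:
    """Static planning: only pre-execution estimates are known."""
    return choose_join(left.estimated_bytes, right.estimated_bytes)


def replan_adaptively(observed_left_bytes: int, observed_right_bytes: int) -> str:
    """Adaptive step: decide again once upstream stages report real output sizes."""
    return choose_join(observed_left_bytes, observed_right_bytes)


# Made-up example: the optimizer cannot tell how selective a filter on
# `customers` is, so it estimates 2 GB and plans a sort-merge join.
orders = JoinSide("orders", 50 * 1024**3)
customers = JoinSide("customers", 2 * 1024**3)
static_plan = plan_statically(orders, customers)

# At runtime the filtered `customers` stage turns out to produce only 3 MB.
adaptive_plan = replan_adaptively(50 * 1024**3, 3 * 1024**2)

print(static_plan, "->", adaptive_plan)
```

In this toy run the static plan is a sort-merge join and the adaptive plan switches to a broadcast hash join, which is the "No expensive Sort!" outcome the slide describes.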

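Dynamic partition pruning from slide 9 can be sketched in the same spirit. The snippet below is a plain-Python illustration under stated assumptions, not Spark code: the fact and dimension tables, the `date_id` partition key, and the function `join_with_pruning` are all hypothetical. It shows the core idea that the filter on the small table is evaluated first and its surviving join keys decide which partitions of the large table are scanned at all.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fact table, stored as partitions keyed by date_id.
fact_partitions: Dict[int, List[dict]] = {
    1: [{"date_id": 1, "amount": 10}, {"date_id": 1, "amount": 5}],
    2: [{"date_id": 2, "amount": 7}],
    3: [{"date_id": 3, "amount": 20}],
}

# Hypothetical small dimension table.
dim_dates: List[dict] = [
    {"date_id": 1, "year": 2018},
    {"date_id": 2, "year": 2019},
    {"date_id": 3, "year": 2019},
]


def join_with_pruning(
    fact: Dict[int, List[dict]],
    dim: List[dict],
    predicate: Callable[[dict], bool],
) -> Tuple[List[dict], List[int]]:
    """Join fact and dim on date_id, scanning only partitions that can match."""
    # Step 1: apply the filter to the small dimension table first.
    kept_dim = {row["date_id"]: row for row in dim if predicate(row)}
    # Step 2: the surviving join keys become a runtime partition filter.
    scanned = sorted(pid for pid in fact if pid in kept_dim)
    # Step 3: read only those partitions and join their rows.
    joined = [
        {**row, "year": kept_dim[row["date_id"]]["year"]}
        for pid in scanned
        for row in fact[pid]
    ]
    return joined, scanned


rows, scanned = join_with_pruning(
    fact_partitions, dim_dates, lambda d: d["year"] == 2019
)
print("partitions scanned:", scanned)
print("total amount:", sum(r["amount"] for r in rows))
```

With the 2019 filter, partition 1 is never read. In a real engine the saving comes from skipping file reads for the pruned partitions, which this in-memory toy can only hint at.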