Delta Lake: Open Source Reliability w/ Apache Spark

As presented: Sajith Appukuttan, Solution Architect, Databricks
Sept 12, 2019 at Vancouver Spark Meetup

Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.


  1. Delta Lake: Open Source Reliability with Apache Spark. Sajith Appukuttan
  2. The Promise of the Data Lake: 1. Collect Everything; 2. Store it all in the Data Lake; 3. Data Science & Machine Learning. Use cases: Recommendation Engines; Risk, Fraud Detection; IoT & Predictive Maintenance; Genomics & DNA Sequencing. Garbage In, Garbage Stored, Garbage Out.
  3. What does a typical data lake project look like?
  4. Evolution of a Cutting-Edge Data Lake. Diagram labels: Events, ?, Data Lake, Streaming Analytics, AI & Reporting.
  5. Evolution of a Cutting-Edge Data Lake. Diagram labels: Events, Data Lake, Streaming Analytics, AI & Reporting.
  6. Challenge #1: Historical Queries? Diagram labels: Events, λ-arch, Data Lake, Streaming Analytics, AI & Reporting.
  7. Challenge #2: Messy Data? Diagram labels: Events, λ-arch, Validation, Data Lake, Streaming Analytics, AI & Reporting.
  8. Challenge #3: Mistakes and Failures? Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Data Lake, Streaming Analytics, AI & Reporting.
  9. Challenge #4: Updates? Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Updates, UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting.
  10. Wasting Time & Money Solving Systems Problems Instead of Extracting Value From Data
  11. Data Lake Distractions: No atomicity means failed production jobs leave data in a corrupt state, requiring tedious recovery. No quality enforcement creates inconsistent and unusable data. No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming.
  12. Let’s try it instead with Delta Lake
  13. Challenges of the Data Lake. Diagram labels: Events, λ-arch, Validation, Reprocessing, Partitioned, Updates, UPDATE & MERGE, Scheduled to Avoid Modifications, Data Lake, Streaming Analytics, AI & Reporting.
  14. The Architecture. Diagram labels: CSV, JSON, TXT…; Kinesis; Data Lake; Streaming Analytics; AI & Reporting.
  15. The Architecture. Full ACID Transactions on your Big Data: focus on your data flow, instead of worrying about failures.
  16. The Architecture. Open Standards, Open Source (Apache License): store petabytes of data without worries of lock-in. Growing community including Presto, Spark and more.
  17. The Architecture. Powered by Apache Spark: unifies streaming / batch. Convert existing jobs with minimal modifications.
  18. Data Quality Levels: Bronze (Raw Ingestion), Silver (Filtered, Cleaned, Augmented), Gold (Business-level Aggregates). Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption. Diagram labels: CSV, JSON, TXT…; Kinesis; Data Lake; Streaming Analytics; AI & Reporting.
  19. Bronze (Raw Ingestion): dumping ground for raw data; often with long retention (years); avoid error-prone parsing.
  20. Silver (Filtered, Cleaned, Augmented): intermediate data with some cleanup applied. Queryable for easy debugging!
  21. Gold (Business-level Aggregates): clean data, ready for consumption. Read with Spark or Presto* (*Coming Soon).
  22. Streams move data through the Delta Lake: low-latency or manually triggered; eliminates management of schedules and jobs.
  23. Delta Lake also supports batch jobs and standard DML*: INSERT, UPDATE, DELETE, MERGE, OVERWRITE. Uses: Retention, Corrections, GDPR. (*DML released in 0.3.0.)
  24. Easy to recompute when business logic changes: clear tables (DELETE); restart streams.
  25. Who is using Delta Lake?
  26. Used by 1000s of organizations worldwide; > 1 exabyte processed last month alone.
  28. How do I use Delta Lake?
  29. Get Started with Delta using Spark APIs. Instead of parquet: dataframe.write.format("parquet").save("/data") … simply say delta: dataframe.write.format("delta").save("/data"). Add the Spark package: pyspark --packages io.delta:delta-core_2.12:0.1.0 or bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0 (also available via Maven).
  30. How does Delta Lake work?
  31. Sign up for Databricks Community Edition: go to databricks.com/try and choose Community Edition.
  32. Notebooks: 01 - Delta Lake Primer: https://dbricks.co/dlw-01; 02 - Delta Lake - Introducing ML: https://dbricks.co/dlw-02; 03 - Delta Lake - XGBoost 0.81: https://dbricks.co/dlw-03
  33. Join the Delta Lake Community! Slack Channel | Mailing List
  34. Tracks: Apache Spark™ (Use Cases, Research, Technical Deep Dives); AI (Productionizing ML, Deep Learning); Fields (Data Science, Data Engineering, Enterprise). 1700+ attendees. Practitioners: Data Scientists, Data Engineers, Analysts, Architects. Leaders: Engineering Management, VPs, Heads of Analytics & Data, CxOs. databricks.com/sparkaisummit/europe. Code: Databricks20
  35. Build your own Delta Lake at https://delta.io
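
The "Get Started" slide (29) swaps the parquet format for delta in an ordinary DataFrame write. A minimal Python sketch of that idea follows. The helper names (delta_package, save_as_delta, read_delta, demo) and the /tmp/delta-demo path are hypothetical and not from the talk; only the format("delta") calls and the package coordinates come from the slide. Running demo() assumes PySpark is installed, the Delta package can be downloaded, and the Spark build is compatible with that Delta release.

```python
def delta_package(scala_version="2.12", delta_version="0.1.0"):
    """Build the Spark package coordinate shown on the slide.

    Hypothetical helper; the defaults reproduce the slide's string
    io.delta:delta-core_2.12:0.1.0.
    """
    return f"io.delta:delta-core_{scala_version}:{delta_version}"


def save_as_delta(dataframe, path, mode="append"):
    """Write a DataFrame as a Delta table.

    Same call chain as a parquet write, with "delta" as the format.
    """
    dataframe.write.format("delta").mode(mode).save(path)


def read_delta(spark, path):
    """Read a Delta table back into a DataFrame."""
    return spark.read.format("delta").load(path)


def demo(path="/tmp/delta-demo"):
    """End-to-end round trip; the path is an assumption for illustration."""
    # Imported here so the helpers above can be used without PySpark loaded.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-demo")
        # Equivalent to passing --packages on the pyspark command line.
        .config("spark.jars.packages", delta_package())
        .getOrCreate()
    )
    numbers = spark.range(0, 5)  # small DataFrame with a single "id" column
    save_as_delta(numbers, path, mode="overwrite")
    return read_delta(spark, path).count()
```

Calling demo() writes five rows and reads them back. Because only the format string changes, an existing parquet job can be pointed at Delta with a one-word edit, which is the "minimal modifications" claim on slide 17.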
