Data Reproducibility, Audits, Immediate Rollbacks, and Other Applications of Time Travel with Delta Lake


Time travel is now possible with Delta Lake! We will uncover how Delta Lake makes time travel possible and why it matters to you. Through presentation, notebooks, and code, we will showcase several common applications and how they can improve your modern data engineering pipelines. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark(TM). It provides snapshot isolation for concurrent reads and writes, and it enables efficient upserts, deletes, and immediate rollbacks. It also allows background file optimization through compaction and Z-Order partitioning, achieving up to 100x performance improvements. In this presentation you will learn what challenges Delta Lake solves, how Delta Lake works under the hood, and how the new Delta Time Travel capability can be applied.
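To make the "under the hood" idea concrete, here is a minimal sketch of how an append-only transaction log over immutable data files can support time travel. It is a toy model in plain Python, not Delta Lake's implementation or API: the names `Commit`, `ToyTable`, `commit`, and `snapshot`, and the file names, are hypothetical and chosen only for illustration. The assumption is that each commit records which files it adds and which it logically removes, so any past version can be rebuilt by replaying the log up to that point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One entry in the toy transaction log (hypothetical structure)."""
    version: int
    added: frozenset    # data files added by this commit
    removed: frozenset  # data files logically removed by this commit


@dataclass
class ToyTable:
    """Immutable data files plus an append-only log of commits."""
    log: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Append a commit to the log and return its version number."""
        version = len(self.log)
        self.log.append(Commit(version, frozenset(added), frozenset(removed)))
        return version

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live files."""
        if version is None:
            version = len(self.log) - 1
        if not 0 <= version < len(self.log):
            raise ValueError(f"unknown version {version}")
        live = set()
        for entry in self.log[: version + 1]:
            live -= entry.removed
            live |= entry.added
        return live


table = ToyTable()
v0 = table.commit(added={"part-0.parquet"})
v1 = table.commit(added={"part-1.parquet"})
# An update rewrites part-0 as part-2. The old file is only removed
# logically, so readers of earlier versions can still resolve it.
v2 = table.commit(added={"part-2.parquet"}, removed={"part-0.parquet"})

print(sorted(table.snapshot()))    # files visible in the latest version
print(sorted(table.snapshot(v1)))  # files visible when reading version 1
```

Because old files are never overwritten in this model, a reader pinned to an earlier version sees a consistent snapshot even while newer commits land, which is the intuition behind snapshot isolation in the abstract above.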


  1. WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Applications of Time Travel with Delta Lake. Kyle Weller, Product Manager, Azure Databricks, Microsoft. #UnifiedDataAnalytics #SparkAISummit
  3. Common Data Challenges. Gartner estimates that more than 65% of big data projects fail. Data sources: customer data, click streams, unstructured data, sensors (IoT), etc. Why?
  4. Interactive Poll
  5. Complexities Spark Solves. Other Spark challenges: Concurrency (multiple readers and writers; ensuring atomic transactions, consistency, and isolation); Updates & Rollbacks (GDPR user delete requests or other upserts; data rollback or snapshots for audits); The Small Files Problem (performance degradation; complex cleanup often incurs downtime). Solved: Complex Data (diverse data formats such as json, avro, binary, …; data can be dirty, late, out-of-order); Complex Systems (diverse storage systems such as Kafka, Azure Storage, Event Hubs, SQL DW, …; system failures); Complex Workloads (combining streaming with interactive queries; machine learning).
  6. Reliable Data Lakes at Scale. Delta Table = Parquet + Transaction Log + Indexes/Stats (diagram: Delta Table, versioned Parquet files, indexes & stats, Delta log). ACID transaction guarantees: atomic, consistent, isolated, durable. Versioned Parquet files with transaction log: snapshot isolation for multiple concurrent reads/writes; immediate rollback capabilities. Efficient upserts (updates + inserts) with the MERGE command: GDPR DSR requests; change data capture. Time Travel.
  7. Easy to Use: parquet, parquet, delta, delta
  8. Time Travel Applications Include: audit data changes; data reproducibility; data pipeline debugging; immediate rollback capabilities. Delta Table = Parquet + Transaction Log + Indexes/Stats (diagram: Delta Table, versioned Parquet files, indexes & stats, Delta log).
  9. Time Travel, Audit Applications. Audit data changes: the history of all operations is recorded for auditing; audit operation types, userIds, clusterIds, notebookIds, timestamps, and versions.
  10. Time Travel, Data Reproducibility. Reproduce query results and reports; go back to the exact same data that was used to train an ML model version in the past.
  11. Time Travel, Rollbacks
  12. Delta at scale in the cloud
  13. Azure Databricks: Introduction. Fast, easy, and collaborative Apache Spark™-based analytics platform. Increase productivity; build on a secure, trusted cloud; scale without limits. Built with your needs in mind: role-based access controls, effortless autoscaling, live collaboration, enterprise-grade SLAs, best-in-class notebooks, simple job scheduling. Seamlessly integrated with the Azure portfolio.
  14. Azure Databricks: Delta Lake at Scale on Azure. Stages: Ingest, Store, Process, Serve. Sources: sensors and IoT (unstructured), logs (unstructured), media (unstructured), files (unstructured), business/custom apps (structured). Components: Azure Data Factory, Azure Event Hub, Azure IoT Hub, Kafka, Azure Data Lake Storage (raw format + Delta format), Azure Databricks, Cosmos DB, Azure SQL Data Warehouse, Power BI, apps.
  15. Azure Databricks: Delta Lake at Scale on Azure. Step 1: Load raw data to Azure Data Lake Storage (raw format). Step 2: Use Azure Databricks to (1) combine streaming and batch and (2) save data as Delta format (Bronze table). Step 3: Use Azure Databricks to (1) join, enrich, clean, and transform data and (2) develop, train, and score ML models with Azure ML + MLflow; results are stored in Azure Data Lake Storage as Delta format (Silver table) and Delta format (Gold table). Step 4: Load data into serving layers like (1) SQL Data Warehouse for enterprise BI scenarios and (2) Cosmos DB for real-time apps. Sources: sensors and IoT (unstructured), logs (unstructured), media (unstructured), files (unstructured), business/custom apps (structured). Other components shown: Azure Data Factory, Azure Event Hub, Azure IoT Hub, Kafka, Polybase, Power BI, apps.
  16. Demo
  17. Learn More: http://bit.ly/adbrelnote | https://docs.azuredatabricks.net/ | https://delta.io | https://aka.ms/AzureDatabricksBestPractices
  18. Don't forget to rate and review the sessions. Search: Spark + AI Summit
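The audit and rollback applications named in slides 9 to 11 can be illustrated with a small sketch. This is a toy model in plain Python under stated assumptions, not Delta Lake's API: `HistoryEntry`, `AuditedTable`, and their methods are hypothetical names, and each version stores a full copy of the rows purely to keep the example short. It shows two ideas from the slides: every operation is recorded with a version, timestamp, user, and operation type, and a rollback is itself recorded as a new version, so the audit trail is never rewritten.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HistoryEntry:
    """One audited operation (hypothetical structure for illustration)."""
    version: int
    timestamp: datetime
    user_id: str
    operation: str
    rows: tuple  # full table contents at this version (toy simplification)


class AuditedTable:
    """Toy table that keeps every version and records who did what."""

    def __init__(self):
        self._history = []

    def _commit(self, user_id, operation, rows):
        entry = HistoryEntry(
            version=len(self._history),
            timestamp=datetime.now(timezone.utc),
            user_id=user_id,
            operation=operation,
            rows=tuple(rows),
        )
        self._history.append(entry)
        return entry.version

    def read(self, version=None):
        """Return the rows as of `version` (latest if omitted)."""
        if not self._history:
            return ()
        if version is None:
            version = len(self._history) - 1
        if not 0 <= version < len(self._history):
            raise ValueError(f"unknown version {version}")
        return self._history[version].rows

    def write(self, user_id, rows):
        """Overwrite the table contents as a new version."""
        return self._commit(user_id, "WRITE", rows)

    def delete_where(self, user_id, predicate):
        """Remove rows matching `predicate` as a new version."""
        kept = [row for row in self.read() if not predicate(row)]
        return self._commit(user_id, "DELETE", kept)

    def restore(self, user_id, version):
        """Roll back by committing the old contents as a new version."""
        return self._commit(user_id, f"RESTORE to v{version}", self.read(version))

    def history(self):
        """Audit view: (version, user, operation) for every commit."""
        return [(e.version, e.user_id, e.operation) for e in self._history]


table = AuditedTable()
table.write("alice", [("u1", 10), ("u2", 20)])     # version 0
table.delete_where("bob", lambda r: r[0] == "u2")  # version 1, a mistaken delete
table.restore("alice", 0)                          # version 2, rollback

for record in table.history():
    print(record)
print(table.read())   # contents after the rollback
print(table.read(1))  # the mistaken state is still readable for debugging
```

Recording the rollback as a new version, rather than deleting the bad one, keeps the mistaken delete visible in the history, which is useful for the pipeline-debugging application listed on slide 8.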
