Data Time Travel by Delta Time Machine
Burak Yavuz | Software Engineer
Vini Jaiswal | Customer Success Engineer
Who are we?
Burak Yavuz
● Software Engineer @ Databricks
“We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Turkey
Vini Jaiswal
● Customer Success Engineer @ Databricks
“Making Customers Successful with their data and ML/AI use cases”
● Data Science Lead - Citi | Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
Agenda
Intro to Time Travel
Time Travel Use Cases
▪ Data Archiving
▪ Rollbacks
▪ Governance
▪ Reproducing ML experiments
Solving with Delta
Demo - Riding the time machine
Introduction to Time Travel
What might time travel look like?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Time Travel Use Cases
▪ Data Archiving
▪ Governance
▪ Rollbacks
▪ Reproduce Experiments
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
● May need to store data for many years (7+)
Governance
● What if records need to be forgotten in response to a data subject request?
● And, at the same time, how do you stay in compliance with international regulations?
[Diagram: Flights (JSON, events per second, via Kinesis), Planes (CSV, slowly changing, on S3), and Weather (JSON, a new dump on S3 every 5 minutes) feed a "Delays per airplane" table]
Rollbacks
● What if a new job is deployed that accidentally specifies .mode("overwrite")?
● New job with .mode("overwrite"): all historic data gone
[Diagram: Flights (JSON, events per second, via Event Hubs), Planes (CSV, slowly changing, on Blob), and Weather (JSON, a new dump on Blob every 5 minutes) feed a "Delays per airplane" table]
Reproduce Experiments
● Reproducibility is the cornerstone of all scientific inquiry
● In order for a machine learning model to be improved, a data scientist
must first reproduce the results of the model.
Solving with Delta
For more info check out
Diving Into Delta Lake:
Unpacking the Transaction Log
Wednesday (Nov 11) 15:00 GMT
Transaction Protocol
▪ Serializable ACID Writes
▪ Snapshot Isolation
▪ Scalability to billions of partitions or files
▪ Incremental processing
Computing Delta’s State
000000.json
000001.json
000002.json
000003.json
000004.json
000005.json
000006.json
000007.json
listFrom version 0
Cache version 7
Update Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Set Transaction – records an idempotent txn id
Change Protocol – upgrades the version of the txn protocol
Result: Current Metadata, List of Files, List of Txns, Version
Table = Result of a set of actions
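The idea that a table is the result of a set of actions can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of actions, the field names, and the `TableState` and `replay` names are assumptions made for this example, not Delta's actual log format or API.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """Result of replaying the log: metadata, files, transactions, version."""
    version: int = -1
    metadata: dict = field(default_factory=dict)
    protocol: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)  # file path -> optional statistics
    txns: dict = field(default_factory=dict)   # application id -> last txn version


def replay(commits):
    """Apply every action of every commit, in order, to an empty state."""
    state = TableState()
    for version, actions in enumerate(commits):
        for kind, payload in actions:
            if kind == "update_metadata":
                state.metadata = payload
            elif kind == "change_protocol":
                state.protocol = payload
            elif kind == "add_file":
                state.files[payload["path"]] = payload.get("stats")
            elif kind == "remove_file":
                state.files.pop(payload["path"], None)
            elif kind == "set_transaction":
                state.txns[payload["app_id"]] = payload["version"]
            else:
                raise ValueError(f"unknown action: {kind}")
        state.version = version
    return state


# Three hypothetical commits (versions 0, 1 and 2).
commits = [
    [("change_protocol", {"min_reader": 1, "min_writer": 2}),
     ("update_metadata", {"name": "flights", "partition_by": ["year"]}),
     ("add_file", {"path": "part-0.parquet", "stats": {"rows": 100}})],
    [("add_file", {"path": "part-1.parquet"}),
     ("set_transaction", {"app_id": "stream-1", "version": 5})],
    [("remove_file", {"path": "part-0.parquet"}),
     ("add_file", {"path": "part-2.parquet"})],
]

state = replay(commits)
print(state.version, sorted(state.files), state.txns)
```

Replaying only a prefix of the commits, for example `replay(commits[:2])`, gives the table as of an earlier version, which is the basis for time travel.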
Computing Delta’s State
000000.json
...
000007.json
000008.json
000009.json
000010.json
000010.checkpoint.parquet
000011.json
000012.json
listFrom version 0
Cache version 12
Computing Delta’s State
000010.checkpoint.parquet
000011.json
000012.json
000013.json
000014.json
listFrom version 10
Cache version 14
Time Travelling by version
SELECT * FROM my_table VERSION AS OF 1071;
SELECT * FROM my_table@v1071 -- no backticks to specify @
spark.read.option("versionAsOf", 1071).load("/some/path")
spark.read.load("/some/path@v1071")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
spark.read.load("/some/path@14921028000000000")
deltaLog.getSnapshotAt(1071)
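The `@v1071` and `@yyyyMMddHHmmssSSS` suffixes shown on these slides can be told apart with a small parser. The sketch below only mirrors the two formats as written on the slides; the function name `parse_time_travel_suffix` and its return shape are hypothetical, and this is not how Delta itself parses identifiers.

```python
import re
from datetime import datetime

# Either "@v<digits>" (a version) or "@<17 digits>" (yyyyMMddHHmmssSSS).
SUFFIX = re.compile(r"^(?P<base>.+)@(?:v(?P<version>\d+)|(?P<ts>\d{17}))$")


def parse_time_travel_suffix(identifier):
    """Return (base name, version or None, timestamp or None)."""
    match = SUFFIX.match(identifier)
    if match is None:
        return identifier, None, None  # no time travel requested
    if match.group("version") is not None:
        return match.group("base"), int(match.group("version")), None
    ts = match.group("ts")
    parsed = datetime(
        int(ts[0:4]), int(ts[4:6]), int(ts[6:8]),     # yyyy MM dd
        int(ts[8:10]), int(ts[10:12]), int(ts[12:14]),  # HH mm ss
        int(ts[14:17]) * 1000,                         # SSS -> microseconds
    )
    return match.group("base"), None, parsed


print(parse_time_travel_suffix("/some/path@v1071"))
print(parse_time_travel_suffix("my_table@14921028000000000"))
print(parse_time_travel_suffix("/some/path"))
```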
Time Travelling by timestamp
Commit timestamps come from storage system modification timestamps
Commit       Timestamp
001070.json  375-01-01
001071.json  1453-05-29
001072.json  1923-10-29
001073.json  1920-04-23
Time Travelling by timestamp
Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp.
Commit       Storage timestamp  Adjusted timestamp
001070.json  375-01-01          375-01-01
001071.json  1453-05-29         1453-05-29
001072.json  1923-10-29         1923-10-29
001073.json  1920-04-23         1923-10-29 00:00:00.001
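The adjustment can be sketched as a single pass over the commits in version order. The function name `adjust_commit_timestamps` is hypothetical, and treating an equal timestamp the same way as an earlier one is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def adjust_commit_timestamps(commits):
    """Make (version, timestamp) pairs strictly increasing in version order."""
    adjusted = []
    for version, ts in commits:
        if adjusted and ts <= adjusted[-1][1]:
            # Out of order: use the previous commit's timestamp plus 1 ms.
            ts = adjusted[-1][1] + timedelta(milliseconds=1)
        adjusted.append((version, ts))
    return adjusted


# The four commits from the slide, with their storage timestamps.
commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1920, 4, 23)),  # older than the previous commit
]

adjusted = adjust_commit_timestamps(commits)
for version, ts in adjusted:
    print(version, ts.isoformat(sep=" ", timespec="milliseconds"))
```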
Time Travelling by timestamp
Price is right rules: Pick closest commit with timestamp that doesn’t exceed the user’s timestamp.
Commit       Adjusted timestamp
001070.json  375-01-01
001071.json  1453-05-29
001072.json  1923-10-29
001073.json  1923-10-29 00:00:00.001
User timestamp 1492-10-28 resolves to deltaLog.getSnapshotAt(1071)
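Once the commit timestamps are strictly increasing, the "price is right" rule is a binary search for the last commit whose timestamp does not exceed the requested one. The function name `version_as_of` is hypothetical, and a real implementation may apply further checks (for example, on timestamps later than the newest commit).

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commits, user_ts):
    """Return the version of the last commit with timestamp <= user_ts."""
    timestamps = [ts for _, ts in commits]
    index = bisect_right(timestamps, user_ts) - 1
    if index < 0:
        raise ValueError("timestamp is before the earliest available commit")
    return commits[index][0]


# Adjusted (strictly increasing) commit timestamps from the slide.
commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1923, 10, 29, 0, 0, 0, 1000)),
]

print(version_as_of(commits, datetime(1492, 10, 28)))
```

The requested timestamp 1492-10-28 falls between commits 1071 and 1072, so the earlier of the two is chosen.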
Back to the Use Cases
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
○ Should you be storing changes (CDC) or the latest snapshot?
● May need to store data for many years (7+)
○ How do you make it cost efficient?
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1926'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1972'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1880'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '2018'
0.82
Better to save data by year and query with a predicate instead of using time travel.
Slowly Changing Dimensions (SCD)
- Type 1: Only keep latest data
First Name  Last Name  Date of Birth       City         Last Updated
Henrik      Larsson    September 20, 1971  Helsingborg  2012
After an update, the row is overwritten:
First Name  Last Name  Date of Birth       City         Last Updated
Henrik      Larsson    September 20, 1971  Barcelona    2020
To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
Problems with SCD Type 1 + Time Travel
● Trade-off between data recency, query performance, and storage costs
○ Data recency requires many frequent updates
○ Better query performance requires regular compaction of the data
○ The two above lead to many copies of the data
○ Many copies of the data lead to prohibitive storage costs
● Time Travel requires older copies of the data to exist
Slowly Changing Dimensions (SCD)
- Type 2: Insert row for each change
Before the change:
First Name  Last Name  Date of Birth       City         Last Updated  Latest
Henrik      Larsson    September 20, 1971  Helsingborg  2012          Y
After the change:
First Name  Last Name  Date of Birth       City         Last Updated  Latest
Henrik      Larsson    September 20, 1971  Helsingborg  2012          N
Henrik      Larsson    September 20, 1971  Barcelona    2020          Y
To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
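A small in-memory sketch of the Type 2 pattern: each change retires the current row and appends a new one, older data is a plain filter, and the "view" is a filter on the latest flag. The function names `apply_change` and `latest_view` and the dictionary layout are assumptions for illustration; in a Delta table the same effect would typically be achieved with SQL.

```python
def apply_change(rows, key, changes, last_updated):
    """Return a new row list with the change recorded as an extra row."""
    result, current = [], None
    for row in rows:
        if (row["first_name"], row["last_name"]) == key and row["latest"] == "Y":
            current = row
            result.append({**row, "latest": "N"})  # retire the old version
        else:
            result.append(dict(row))
    # Start from the retired row if there was one, else from the key alone.
    base = current or {"first_name": key[0], "last_name": key[1]}
    result.append({**base, **changes, "last_updated": last_updated, "latest": "Y"})
    return result


def latest_view(rows):
    """Equivalent of a VIEW that shows only the latest state."""
    return [row for row in rows if row["latest"] == "Y"]


rows = [{
    "first_name": "Henrik", "last_name": "Larsson",
    "date_of_birth": "1971-09-20", "city": "Helsingborg",
    "last_updated": 2012, "latest": "Y",
}]
rows = apply_change(rows, ("Henrik", "Larsson"), {"city": "Barcelona"}, 2020)

# Older data is a simple predicate, no time travel needed.
cities_before_2015 = [r["city"] for r in rows if r["last_updated"] <= 2015]
print(cities_before_2015)
print([r["city"] for r in latest_view(rows)])
```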
Governance
DESCRIBE HISTORY my_table
Rollbacks
● Undoing work (restoring an old version of the table)
RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
● Replaying Structured Streaming Pipelines
RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
spark.readStream.format("delta")
.option("startingTimestamp", "2020-11-10")
.load(path)
// fix logic
.writeStream
.option("checkpointLocation", "<new_location>")
.table("target_table")
Rollbacks
Roll back accidental bad writes, for example by re-inserting rows that were deleted by mistake:
INSERT INTO my_table
SELECT * FROM my_table
TIMESTAMP AS OF
date_sub(current_date(), 1)
Fix incorrect updates as follows:
MERGE INTO my_table target
USING my_table TIMESTAMP AS OF
date_sub(current_date(), 1) source
ON source.userId = target.userId
WHEN MATCHED THEN UPDATE SET *
Reproduce Experiments
● Use Time Travel to ensure all experiments run on the same snapshot
of the table
○ SELECT * FROM my_table VERSION AS OF 1071;
○ SELECT * FROM my_table@v1071
● Archive a blessed snapshot using CLONE
○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
Reproduce Experiments & reports with MLflow
Reproduce Experiments & reports
SELECT count(*) FROM events
TIMESTAMP AS OF timestamp
SELECT count(*) FROM events
VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
Time Series Analytics
To find out how many new customers were added over the last week:
SELECT
  count(distinct userId) - (
    SELECT count(distinct userId)
    FROM my_table
    TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
DEMO - Riding the time machine

More Related Content

What's hot

Sample - Data Warehouse Requirements
Sample -  Data Warehouse RequirementsSample -  Data Warehouse Requirements
Sample - Data Warehouse Requirements
David Walker
 
Mdm
MdmMdm
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Web Intelligence - Tutorial1
Web Intelligence - Tutorial1Web Intelligence - Tutorial1
Web Intelligence - Tutorial1
Obily W
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
SAP BW - Master data load via flat file
SAP BW - Master data load via flat fileSAP BW - Master data load via flat file
SAP BW - Master data load via flat file
Yasmin Ashraf
 
Lo extraction part 4 update methods
Lo extraction   part 4 update methodsLo extraction   part 4 update methods
Lo extraction part 4 update methodsJNTU University
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
Manish Kumar
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
Databricks
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
Databricks
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
Hybrid provider based on dso using real time data acquisition in sap bw 7.30
Hybrid provider based on dso using real time data acquisition in sap bw 7.30Hybrid provider based on dso using real time data acquisition in sap bw 7.30
Hybrid provider based on dso using real time data acquisition in sap bw 7.30
Sabyasachi Das
 
Intro for Power BI
Intro for Power BIIntro for Power BI
Intro for Power BI
Martin X
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
Nikkia Carter
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Databricks
 
Introduction to power BI
Introduction to power BIIntroduction to power BI
Introduction to power BI
Ramar Bose
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Lo extraction part 1 sd overview
Lo extraction   part 1 sd overviewLo extraction   part 1 sd overview
Lo extraction part 1 sd overviewJNTU University
 
Analysis process designer (apd) part 1
Analysis process designer (apd) part   1Analysis process designer (apd) part   1
Analysis process designer (apd) part 1dejavee
 

What's hot (20)

Sample - Data Warehouse Requirements
Sample -  Data Warehouse RequirementsSample -  Data Warehouse Requirements
Sample - Data Warehouse Requirements
 
Mdm
MdmMdm
Mdm
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Web Intelligence - Tutorial1
Web Intelligence - Tutorial1Web Intelligence - Tutorial1
Web Intelligence - Tutorial1
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
SAP BW - Master data load via flat file
SAP BW - Master data load via flat fileSAP BW - Master data load via flat file
SAP BW - Master data load via flat file
 
Lo extraction part 4 update methods
Lo extraction   part 4 update methodsLo extraction   part 4 update methods
Lo extraction part 4 update methods
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Hybrid provider based on dso using real time data acquisition in sap bw 7.30
Hybrid provider based on dso using real time data acquisition in sap bw 7.30Hybrid provider based on dso using real time data acquisition in sap bw 7.30
Hybrid provider based on dso using real time data acquisition in sap bw 7.30
 
Intro for Power BI
Intro for Power BIIntro for Power BI
Intro for Power BI
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Introduction to power BI
Introduction to power BIIntroduction to power BI
Introduction to power BI
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Lo extraction part 1 sd overview
Lo extraction   part 1 sd overviewLo extraction   part 1 sd overview
Lo extraction part 1 sd overview
 
Analysis process designer (apd) part 1
Analysis process designer (apd) part   1Analysis process designer (apd) part   1
Analysis process designer (apd) part 1
 

Similar to Data Time Travel by Delta Time Machine

Data Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineData Time Travel by Delta Time Machine
Data Time Travel by Delta Time Machine
Databricks
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
Carlo Carandang
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
Databricks
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Jason L Brugger
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
Rajit Saha
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
Jeremy Schneider
 
Dataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspectiveDataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspective
Marcin Szymaniuk
 
Spark3
Spark3Spark3
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
Vengata Guruswamy
 
On Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataOn Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed Data
Shima Zahmatkesh
 
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...MongoDB
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
InfluxData
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Urban flood prediction digital ocean august edition
Urban flood prediction   digital ocean august editionUrban flood prediction   digital ocean august edition
Urban flood prediction digital ocean august edition
transight
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
Big Data Value Association
 
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
The Statistical and Applied Mathematical Sciences Institute
 
IBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql FeaturesIBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql Features
Keshav Murthy
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
DataStax
 

Similar to Data Time Travel by Delta Time Machine (20)

Data Time Travel by Delta Time Machine
Data Time Travel by Delta Time MachineData Time Travel by Delta Time Machine
Data Time Travel by Delta Time Machine
 
Air Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and PredictionsAir Pollution in Nova Scotia: Analysis and Predictions
Air Pollution in Nova Scotia: Analysis and Predictions
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
 
Dataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspectiveDataframes in Spark - Data Analysts' perspective
Dataframes in Spark - Data Analysts' perspective
 
Spark3
Spark3Spark3
Spark3
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
On Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed DataOn Relevant Query Answering over Streaming and Distributed Data
On Relevant Query Answering over Streaming and Distributed Data
 
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
MongoDB for Time Series Data: Analyzing Time Series Data Using the Aggregatio...
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Urban flood prediction digital ocean august edition
Urban flood prediction   digital ocean august editionUrban flood prediction   digital ocean august edition
Urban flood prediction digital ocean august edition
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
CLIM Undergraduate Workshop: (Attachment) Performing Extreme Value Analysis (...
 
IBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql FeaturesIBM Informix dynamic server 11 10 Cheetah Sql Features
IBM Informix dynamic server 11 10 Cheetah Sql Features
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Data Time Travel by Delta Time Machine

  • 1. Data Time Travel by Delta Time Machine Burak Yavuz | Software Engineer Vini Jaiswal | Customer Success Engineer
  • 2. Who are we?
      Burak Yavuz ● Software Engineer @ Databricks “We make your streams come true” ● Apache Spark Committer ● MS in Management Science & Engineering - Stanford University ● BS in Mechanical Engineering - Bogazici University, Turkey
      Vini Jaiswal ● Customer Success Engineer @ Databricks “Making Customers Successful with their data and ML/AI use cases” ● Data Science Lead - Citi | Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas ● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
  • 3. Agenda Intro to Time Travel Time Travel Use Cases ▪ Data Archiving ▪ Rollbacks ▪ Governance ▪ Reproducing ML experiments Solving with Delta Demo - Riding the time machine
  • 5. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      ● 1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      ● 1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      ● 1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      ● 2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 7. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ● May need to store data for many years (7+)
  • 8. Governance ● What if records need to be forgotten in response to a Data Subject request? ● And, at the same time, how do you stay in compliance with international regulations? Diagram: Flights (JSON, events per second, Kinesis), Planes (CSV, slow changing, S3), and Weather (JSON, a new dump on S3 every 5 minutes) feed the "Delays per airplane" table.
  • 9. Rollbacks What if a new job is deployed that accidentally specifies .mode("overwrite")? All historic data is gone. Diagram: Flights (JSON, events per second, Event Hubs), Planes (CSV, slow changing, Blob), and Weather (JSON, a new dump on Blob every 5 minutes) feed the "Delays per airplane" table, which the new job with .mode("overwrite") then overwrites.
  • 10. Reproduce Experiments ● Reproducibility is the cornerstone of all scientific inquiry ● In order for a machine learning model to be improved, a data scientist must first reproduce the results of the model.
  • 12. For more info check out Diving Into Delta Lake: Unpacking the Transaction Log Wednesday (Nov 11) 15:00 GMT
  • 13. Transaction Protocol ▪ Serializable ACID Writes ▪ Snapshot Isolation ▪ Scalability to billions of partitions or files ▪ Incremental processing
  • 15. Table = Result of a set of actions
      ● Update Metadata – name, schema, partitioning, etc.
      ● Add File – adds a file (with optional statistics)
      ● Remove File – removes a file
      ● Set Transaction – records an idempotent txn id
      ● Change Protocol – upgrades the version of the txn protocol
      Result: Current Metadata, List of Files, List of Txns, Version
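The idea that a table is the result of a set of actions can be sketched in plain Python. This is only an illustration: the dictionary shape of each action (`type`, `path`, `app_id`, and so on) and the `replay` helper are hypothetical and are not the on-disk Delta log format. Replaying only a prefix of the commits is what makes reading an older version possible.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    metadata: dict = field(default_factory=dict)
    files: set = field(default_factory=set)
    txns: dict = field(default_factory=dict)
    protocol: int = 1
    version: int = -1  # -1 means "no commit replayed yet"


def replay(commits):
    """Fold an ordered list of commits (each a list of action dicts) into a state."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["type"]  # hypothetical field names, for illustration only
            if kind == "metadata":
                state.metadata = action["metadata"]
            elif kind == "add":
                state.files.add(action["path"])
            elif kind == "remove":
                state.files.discard(action["path"])
            elif kind == "txn":
                state.txns[action["app_id"]] = action["version"]
            elif kind == "protocol":
                state.protocol = action["version"]
        state.version = version
    return state


commits = [
    [{"type": "metadata", "metadata": {"name": "my_table"}},
     {"type": "add", "path": "a.parquet"}],
    [{"type": "add", "path": "b.parquet"}],
    [{"type": "remove", "path": "a.parquet"},
     {"type": "add", "path": "c.parquet"}],
]

latest = replay(commits)        # state after all three commits
as_of_v1 = replay(commits[:2])  # "time travel" to version 1
```

Here `latest` no longer lists `a.parquet`, while `as_of_v1` still does, which is why the removed file has to stay in storage for the older version to remain readable.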
  • 18. Time Travelling by version
      ● SELECT * FROM my_table VERSION AS OF 1071;
      ● SELECT * FROM my_table@v1071 -- no backticks to specify @
      ● spark.read.option("versionAsOf", 1071).load("/some/path")
      ● spark.read.load("/some/path@v1071")
      ● deltaLog.getSnapshotAt(1071)
  • 19. Time Travelling by timestamp
      ● SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
      ● SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
      ● spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
      ● spark.read.load("/some/path@14921028000000000")
      ● deltaLog.getSnapshotAt(1071)
  • 20. Time Travelling by timestamp Commit timestamps come from storage system modification timestamps: 001070.json → 375-01-01, 001071.json → 1453-05-29, 001072.json → 1923-10-29, 001073.json → 1920-04-23
  • 21. Time Travelling by timestamp Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp. Before: 001070.json → 375-01-01, 001071.json → 1453-05-29, 001072.json → 1923-10-29, 001073.json → 1920-04-23. After: 001070.json → 375-01-01, 001071.json → 1453-05-29, 001072.json → 1923-10-29, 001073.json → 1923-10-29 00:00:00.001
  • 22. Time Travelling by timestamp Price is right rules: Pick the closest commit with a timestamp that doesn’t exceed the user’s timestamp. Commits: 001070.json → 375-01-01, 001071.json → 1453-05-29, 001072.json → 1923-10-29, 001073.json → 1923-10-29 00:00:00.001. The user’s timestamp 1492-10-28 resolves to 001071.json: deltaLog.getSnapshotAt(1071)
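A minimal sketch of the adjustment on slide 21, using the slide's four commit dates. The helper name `adjust_timestamps` is made up for this example, and treating a tie the same as an earlier timestamp is an assumption rather than something the slide states.

```python
from datetime import datetime, timedelta


def adjust_timestamps(raw):
    """Return commit timestamps forced into strictly increasing order."""
    adjusted = []
    for ts in raw:
        # Assumption: an equal timestamp counts as out of order too.
        if adjusted and ts <= adjusted[-1]:
            ts = adjusted[-1] + timedelta(milliseconds=1)
        adjusted.append(ts)
    return adjusted


raw = [
    datetime(375, 1, 1),     # 001070.json
    datetime(1453, 5, 29),   # 001071.json
    datetime(1923, 10, 29),  # 001072.json
    datetime(1920, 4, 23),   # 001073.json, earlier than its predecessor
]
adjusted = adjust_timestamps(raw)
```

Only the last commit changes: it becomes 1923-10-29 00:00:00.001, one millisecond after 001072.json, so the sequence can be searched as a sorted list.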
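The "Price is right" rule can be sketched as a binary search over the adjusted commit timestamps. The function `version_as_of` and its error handling are assumptions made for this illustration, not the Delta Lake implementation.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commit_timestamps, first_version, user_ts):
    """Pick the latest commit whose timestamp does not exceed user_ts.

    commit_timestamps must be sorted ascending (see the adjustment step).
    """
    idx = bisect_right(commit_timestamps, user_ts) - 1
    if idx < 0:
        raise ValueError("timestamp is before the earliest available commit")
    return first_version + idx


# Adjusted timestamps for 001070.json .. 001073.json from the slide.
timestamps = [
    datetime(375, 1, 1),
    datetime(1453, 5, 29),
    datetime(1923, 10, 29),
    datetime(1923, 10, 29, 0, 0, 0, 1000),
]

resolved = version_as_of(timestamps, 1070, datetime(1492, 10, 28))
```

With the slide's values, 1492-10-28 falls between commits 1071 and 1072, so `resolved` is 1071, matching `deltaLog.getSnapshotAt(1071)`.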
  • 23. Back to the Use Cases
  • 24. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ○ Should you be storing changes (CDC) or the latest snapshot? ● May need to store data for many years (7+) ○ How do you make it cost efficient?
  • 25. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      ● 1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      ● 1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      ● 1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      ● 2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 26. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      ● 1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      ● 1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      ● 1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      ● 2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 27. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      ● 1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1926' → -0.09
      ● 1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1972' → 0.02
      ● 1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1880' → -0.18
      ● 2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '2018' → 0.82
      Better to save data by year and query with a predicate instead of using time travel.
  • 28. Slowly Changing Dimensions (SCD) - Type 1: Only keep latest data
      Before:
        First Name | Last Name | Date of Birth      | City        | Last Updated
        Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012
      After:
        First Name | Last Name | Date of Birth      | City        | Last Updated
        Henrik     | Larsson   | September 20, 1971 | Barcelona   | 2020
      To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
  • 29. Problems with SCD Type 1 + Time Travel ● Trade-off between data recency, query performance, and storage costs ○ Data recency requires many frequent updates ○ Better query performance requires regular compaction of the data ○ The two above lead to many copies of the data ○ Many copies of the data lead to prohibitive storage costs ● Time Travel requires older copies of the data to exist
  • 30. Slowly Changing Dimensions (SCD) - Type 2: Insert row for each change
      Before:
        First Name | Last Name | Date of Birth      | City        | Last Updated | Latest
        Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012         | Y
      After:
        First Name | Last Name | Date of Birth      | City        | Last Updated | Latest
        Henrik     | Larsson   | September 20, 1971 | Helsingborg | 2012         | N
        Henrik     | Larsson   | September 20, 1971 | Barcelona   | 2020         | Y
      To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
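The Type 2 pattern can be sketched with a plain Python list standing in for the table. The `apply_change` helper and the lowercase field names are hypothetical; the row values come from the slide.

```python
def apply_change(rows, key, new_row):
    """Flag the current row for `key` as not latest, then append the new row."""
    for row in rows:
        if (row["first_name"], row["last_name"]) == key and row["latest"] == "Y":
            row["latest"] = "N"
    rows.append({**new_row, "latest": "Y"})
    return rows


rows = [
    {"first_name": "Henrik", "last_name": "Larsson",
     "city": "Helsingborg", "last_updated": 2012, "latest": "Y"},
]

apply_change(
    rows,
    ("Henrik", "Larsson"),
    {"first_name": "Henrik", "last_name": "Larsson",
     "city": "Barcelona", "last_updated": 2020},
)

# Equivalent of a VIEW that shows only the latest state.
latest_view = [r for r in rows if r["latest"] == "Y"]

# Equivalent of a WHERE query for older data; no time travel involved.
history_2012 = [r for r in rows if r["last_updated"] == 2012]
```

Because the old row is kept in the current version of the table, history stays queryable even after old table versions are cleaned up.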
  • 32. Rollbacks
      ● Undoing work (restoring an old version of the table):
        RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
      ● Replaying Structured Streaming pipelines:
        RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
        spark.readStream.format("delta")
          .option("startingTimestamp", "2020-11-10")
          .load(path)
          // fix logic
          .writeStream
          .option("checkpointLocation", "<new_location>")
          .table("target_table")
  • 33. Rollbacks
      ● Rollback accidental bad writes:
        INSERT INTO my_table
        SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
      ● Fix incorrect updates as follows:
        MERGE INTO my_table target
        USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
        ON source.userId = target.userId
        WHEN MATCHED THEN UPDATE SET *
  • 34. Reproduce Experiments ● Use Time Travel to ensure all experiments run on the same snapshot of the table ○ SELECT * FROM my_table VERSION AS OF 1071; ○ SELECT * FROM my_table@v1071 ● Archive a blessed snapshot using CLONE ○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
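The MERGE on slide 33 can be read as: for every current row whose userId also exists in yesterday's snapshot, take yesterday's values. A small Python sketch of that semantics follows; the function name and the sample rows are invented for illustration.

```python
def merge_update_matched(target, source, key="userId"):
    """Model of MERGE ... WHEN MATCHED THEN UPDATE SET *.

    Matched target rows take the source row's values; unmatched target
    rows are kept unchanged; source-only rows are not inserted.
    """
    source_by_key = {row[key]: row for row in source}
    return [dict(source_by_key.get(row[key], row)) for row in target]


# Made-up rows: yesterday's snapshot (source) and today's table (target).
yesterday = [
    {"userId": 1, "city": "Helsingborg"},
    {"userId": 2, "city": "Barcelona"},
]
today = [
    {"userId": 1, "city": "WRONG"},      # bad update to undo
    {"userId": 2, "city": "Barcelona"},
    {"userId": 3, "city": "Glasgow"},    # new since yesterday, left untouched
]

repaired = merge_update_matched(today, yesterday)
```

Rows added after the snapshot, such as userId 3 here, survive the repair, which is the practical difference from restoring the whole table.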
  • 35. Reproduce Experiments & reports with MLflow
  • 36. Reproduce Experiments & reports
      ● SELECT count(*) FROM events TIMESTAMP AS OF timestamp
      ● SELECT count(*) FROM events VERSION AS OF version
      ● spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
  • 37. Time Series Analytics If you want to find out how many new customers were added over the last week:
        SELECT count(distinct userId) - (
          SELECT count(distinct userId)
          FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
        FROM my_table
  • 38. DEMO - Riding the time machine
  • 39. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.