Delta Lake is an open source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.
delta.io | Documentation | GitHub | Delta Lake on Databricks
WITH SPARK SQL

DELTA LAKE DDL/DML: UPDATE, DELETE, INSERT, ALTER TABLE

Update rows that match a predicate condition
UPDATE tableName SET event = 'click' WHERE event = 'clk'
Delete rows that match a predicate condition
DELETE FROM tableName WHERE date < '2017-01-01'
Insert with Deduplication using MERGE
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
THEN INSERT *
-- Add "Not null" constraint:
ALTER TABLE tableName CHANGE COLUMN col_name SET NOT NULL
-- Add "Check" constraint:
ALTER TABLE tableName
ADD CONSTRAINT dateWithinRange CHECK (date > "1900-01-01")
-- Drop constraint:
ALTER TABLE tableName DROP CONSTRAINT dateWithinRange
Alter table schema — add columns
ALTER TABLE tableName ADD COLUMNS (
col_name data_type
[FIRST|AFTER colA_name])
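For example, a minimal instance of the template above, assuming a hypothetical "events" table with an eventType column:
ALTER TABLE events ADD COLUMNS (
eventSource STRING AFTER eventType)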
Upsert (update + insert) using MERGE
MERGE INTO target
USING updates
ON target.Id = updates.Id
WHEN MATCHED AND target.delete_flag = "true" THEN
DELETE
WHEN MATCHED THEN
UPDATE SET * -- star notation means all columns
WHEN NOT MATCHED THEN
INSERT (date, Id, data) -- or, use INSERT *
VALUES (date, Id, data)
Insert values directly into table
INSERT INTO tableName VALUES
(8003, "Kim Jones", "2020-12-18", 3.875),
(8004, "Tim Jones", "2020-12-20", 3.750);
-- Insert using SELECT statement
INSERT INTO tableName SELECT * FROM sourceTable
-- Atomically replace all data in table with new values
INSERT OVERWRITE loan_by_state_delta VALUES (...)
UTILITY METHODS

View table details
DESCRIBE DETAIL tableName
DESCRIBE FORMATTED tableName
Modify data retention settings for Delta Lake table
-- logRetentionDuration -> how long transaction log history
-- is kept; deletedFileRetentionDuration -> how long ago a file
-- must have been deleted before being a candidate for VACUUM.
ALTER TABLE tableName
SET TBLPROPERTIES(
delta.logRetentionDuration = "interval 30 days",
delta.deletedFileRetentionDuration = "interval 7 days"
);
SHOW TBLPROPERTIES tableName;
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
-- Read name-based table from Hive metastore into DataFrame
df = spark.table("tableName")
-- Read path-based table into DataFrame
df = spark.read.format("delta").load("/path/to/delta_table")
Clone a Delta Lake table
-- Deep clones copy data from source, shallow clones don't.
CREATE TABLE [dbName.]targetName
[SHALLOW | DEEP] CLONE sourceName [VERSION AS OF 0]
[LOCATION "path/to/table"]
-- specify location only for path-based tables
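For example, a sketch that deep-clones a hypothetical "events" table at version 0 into a new managed table (names are illustrative):
CREATE TABLE eventsBackup
DEEP CLONE events VERSION AS OF 0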
Delete old files with Vacuum
VACUUM tableName [RETAIN num HOURS] [DRY RUN]
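For example (hypothetical table name; DRY RUN previews which files would be deleted without removing them):
VACUUM events RETAIN 168 HOURS DRY RUN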
PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
OPTIMIZE tableName
[ZORDER BY (colNameA, colNameB)]
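For example, assuming a hypothetical "events" table that is filtered most often on a non-partition eventType column:
OPTIMIZE events
ZORDER BY (eventType)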
Auto-optimize tables
*Databricks Delta Lake feature
ALTER TABLE [table_name | delta.`path/to/delta_table`]
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
CACHE SELECT * FROM tableName
-- or:
CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0
TIME TRAVEL

View transaction log (aka Delta Log)
DESCRIBE HISTORY tableName

Query historical versions of Delta Lake tables
SELECT * FROM tableName VERSION AS OF 0
SELECT * FROM tableName@v0 -- equivalent to VERSION AS OF 0
SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"

Find changes between 2 versions of table
SELECT * FROM tableName VERSION AS OF 12
EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11

Rollback a table to an earlier version
-- RESTORE requires Delta Lake version 0.7.0+ & DBR 7.4+.
RESTORE tableName VERSION AS OF 0
RESTORE tableName TIMESTAMP AS OF "2020-12-18"
CREATE AND QUERY DELTA TABLES

Create and use managed database
-- Managed database is saved in the Hive metastore.
-- Default database is named "default".
DROP DATABASE IF EXISTS dbName;
CREATE DATABASE dbName;
USE dbName -- lets you refer to tables as tableName
-- instead of dbName.tableName
/* You can refer to Delta Tables by table name, or by
path. Table name is the preferred way, since named tables
are managed in the Hive Metastore (i.e., when you DROP a
named table, the data is dropped also — not the case for
path-based tables.) */
Query Delta Lake table by table name (preferred)
SELECT * FROM [dbName.]tableName
Create Delta Lake table as SELECT * with no upfront schema definition
CREATE TABLE [dbName.]tableName
USING DELTA
AS SELECT * FROM tableName | parquet.`path/to/data`
[LOCATION `/path/to/table`]
-- using location = unmanaged table
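For example, a sketch that assumes an existing Parquet directory at a hypothetical path:
CREATE TABLE dbName.events
USING DELTA
AS SELECT * FROM parquet.`/data/events_parquet`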
Convert Parquet table to Delta Lake format in place
-- by table name
CONVERT TO DELTA [dbName.]tableName
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]
-- path-based tables
CONVERT TO DELTA parquet.`/path/to/table` -- note backticks
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]
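For example, converting a hypothetical Parquet table partitioned by a date column:
CONVERT TO DELTA parquet.`/data/events_parquet`
PARTITIONED BY (date DATE)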
Query Delta Lake table by path
SELECT * FROM delta.`path/to/delta_table` -- note backticks
Create table, define schema explicitly with SQL DDL
CREATE TABLE [dbName.]tableName (
id INT [NOT NULL],
name STRING,
date DATE,
int_rate FLOAT)
USING DELTA
[PARTITIONED BY (date)] -- optional
Copy new data into Delta Lake table (with idempotent retries)
COPY INTO [dbName.]targetTable
FROM (SELECT * FROM "/path/to/table")
FILEFORMAT = DELTA -- or CSV, Parquet, ORC, JSON, etc.
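For example (hypothetical names and path; re-running the same COPY INTO skips files that were already loaded):
COPY INTO dbName.events
FROM (SELECT * FROM "/data/events_staging")
FILEFORMAT = PARQUET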
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
spark.sql("DESCRIBE HISTORY tableName")
deltaTable.vacuum() # vacuum files older than default
retention period (7 days)
deltaTable.vacuum(100) # vacuum files not required by
versions more than 100 hours old
deltaTable.clone(target="/path/to/delta_table/",
isShallow=True, replace=True)
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
UTILITY METHODS
WITH PYTHON
Convert Parquet table to Delta Lake format in place
Run Spark SQL queries in Python
Compact old files with Vacuum
Clone a Delta Lake table
Get DataFrame representation of a Delta Lake table
Run SQL queries on Delta Lake tables
TIME TRAVEL

View transaction log (aka Delta Log)
spark.sql("DESCRIBE HISTORY tableName")
fullHistoryDF = deltaTable.history()

Query historical versions of Delta Lake tables
# choose only one option: versionAsOf or timestampAsOf
df = (spark.read.format("delta")
  .option("versionAsOf", 0)
  .option("timestampAsOf", "2020-12-18")
  .load("/path/to/delta_table"))
PERFORMANCE OPTIMIZATIONS

Compact data files with Optimize and Z-Order
*Databricks Delta Lake feature
spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]")

Auto-optimize tables
*Databricks Delta Lake feature. For existing tables:
spark.sql("ALTER TABLE [table_name | delta.`path/to/delta_table`] "
  "SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")
# To enable auto-optimize for all new Delta Lake tables:
spark.sql("SET spark.databricks.delta.properties."
  "defaults.autoOptimize.optimizeWrite = true")

Cache frequently queried data in Delta Cache
*Databricks Delta Lake feature
spark.sql("CACHE SELECT * FROM tableName")
# or:
spark.sql("CACHE SELECT colA, colB FROM tableName "
  "WHERE colNameA > 0")
WORKING WITH DELTA TABLES

# A DeltaTable is the entry point for interacting with
# tables programmatically in Python — for example, to
# perform updates or deletes.
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "tableName")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta_table")
CONVERT PARQUET TO DELTA LAKE
deltaTable = DeltaTable.convertToDelta(spark,
"parquet.`/path/to/parquet_table`")
partitionedDeltaTable = DeltaTable.convertToDelta(spark,
"parquet.`/path/to/parquet_table`", "part int")
df1 = spark.read.format("delta").load(pathToTable)
df2 = spark.read.format("delta").option("versionAsOf",
2).load("/path/to/delta_table")
df1.exceptAll(df2).show()
deltaTable.restoreToVersion(0)
deltaTable.restoreToTimestamp('2020-12-01')
Find changes between 2 versions of a table
Rollback a table by version or timestamp
READS AND WRITES WITH DELTA LAKE

Read data from pandas DataFrame
df = spark.createDataFrame(pdf)
# where pdf is a pandas DF
# then save DataFrame in Delta Lake format as shown below
Read data using Apache Spark™
# read by path
df = (spark.read.format("parquet"|"csv"|"json"|etc.)
.load("/path/to/delta_table"))
# read table from Hive metastore
df = spark.table("events")
Streaming reads (Delta table as streaming source)
# by path or by table name
df = (spark.readStream
.format("delta")
.schema(schema)
.table("events") | .load("/delta/events")
)
Streaming writes (Delta table as a sink)
streamingQuery = (
df.writeStream.format("delta")
.outputMode("append"|"update"|"complete")
.option("checkpointLocation", "/path/to/checkpoints")
.trigger(once=True|processingTime="10 seconds")
.table("events") | .start("/delta/events")
)
(df.write.format("delta")
.mode("append"|"overwrite")
.partitionBy("date") # optional
.option("mergeSchema", "true") # option - evolve schema
.saveAsTable("events") | .save("/path/to/delta_table")
)
DELTA LAKE DDL/DML: UPDATES, DELETES, INSERTS, MERGES

Delete rows that match a predicate condition
from pyspark.sql.functions import col, lit  # used in predicates below
# predicate using SQL formatted string
deltaTable.delete("date < '2017-01-01'")
# predicate using Spark SQL functions
deltaTable.delete(col("date") < "2017-01-01")
Upsert (update + insert) using MERGE
# Available options for merges [see documentation for details]:
# .whenMatchedUpdate(...) | .whenMatchedUpdateAll(...) |
# .whenNotMatchedInsert(...) | .whenMatchedDelete(...)
(deltaTable.alias("target").merge(
source = updatesDF.alias("updates"),
condition = "target.eventId = updates.eventId")
.whenMatchedUpdateAll()
.whenNotMatchedInsert(
values = {
"date": "updates.date",
"eventId": "updates.eventId",
"data": "updates.data",
"count": 1
}
).execute()
)
(deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatchedInsertAll()
.execute()
)
Update rows that match a predicate condition
# predicate using SQL formatted string
deltaTable.update(condition = "eventType = 'clk'",
set = { "eventType": "'click'" } )
# predicate using Spark SQL functions
deltaTable.update(condition = col("eventType") == "clk",
set = { "eventType": lit("click") } )
Get DataFrame representation of a Delta Lake table
df = deltaTable.toDF()