Getting Started With Delta Lake
Presented By: Raviyanshu Singh, Software Consultant
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality
Join the session 5 minutes prior to the start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for every session; it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode, and feel free to step out of the session if you need to take an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Our Agenda
01 Why Delta Lake?
02 Data Warehouse
03 Data Lake
04 Possible Solution
05 Delta Lake
06 Demo
Why Delta Lake?
Streaming Systems
Data arrives through streaming systems such as Apache Kafka or Amazon Kinesis.
Data Lakes
Data is stored for long periods in a data lake, which is optimized for large scale and low cost.
Data Warehouse
The most valuable data is stored in a data warehouse, which is in turn optimized for high concurrency and reliability.
A modern data architecture blends at least these three different types of systems.
Data Warehouse
● A data management system that stores current and historical data from multiple sources in a business-friendly manner for easier insights and reporting.
● Data warehouses are typically used for business intelligence (BI), reporting, and data analysis.
Limitations
➔ No support for video, audio, or text
➔ No support for data science & ML
➔ Limited support for streaming
➔ Closed & proprietary formats
Diagram: Data Source → ETL (Extract, Transform, Load) → Data Warehouse
Data Lake
● A central location that holds a large amount of data in its native, raw format.
● Stores unstructured and semi-structured data like photos, video, audio, and documents, which is essential for today's machine learning and advanced analytics use cases.
Limitations
➔ Poor BI support
➔ Complex to set up
➔ Poor performance
➔ Lack of security features
➔ Reliability issues
What’s the Solution?
A combination of DW & DL
Diagram: structured & unstructured data flows through ETL into a data lake; a metadata, caching & indexing layer with data validation sits on top, feeding data warehousing workloads and reports, BI & data science.
Data Lakehouse
A system that merges the flexibility, low cost, and scale of a data lake with the data management and ACID transactions of a data warehouse, addressing the limitations of both.
Benefits
➔ No need to keep one copy of data in a data lake and another copy in a data warehouse
➔ Cost savings, both in infrastructure and in staff and consulting overhead
➔ Scalability through the underlying cloud storage
➔ Reliability through ACID transactions
What is Delta Lake?
● Delta Lake is a file-based, open-source metadata layer that enables building a Lakehouse architecture on top of data lakes.
● It can run on existing data lakes and is fully compatible with processing engines like Apache Spark.
With Delta Lake you get:
➔ Scalable metadata handling
➔ ACID transactions
➔ Streaming and batch unification
➔ Time travel (query an older snapshot of a Delta table)
➔ Schema enforcement
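To make these capabilities concrete, here is a minimal PySpark sketch (assuming the delta-spark pip package is installed; the table path is just an illustration):

  from delta import configure_spark_with_delta_pip
  from pyspark.sql import SparkSession

  # Configure a local Spark session with the Delta Lake extensions.
  builder = (
      SparkSession.builder.appName("hello-delta")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  )
  spark = configure_spark_with_delta_pip(builder).getOrCreate()

  path = "/tmp/delta/events"  # hypothetical location

  # Write a Delta table: ordinary Parquet files plus a _delta_log directory.
  spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

  # Append more rows; each commit is atomic (ACID).
  spark.range(5, 10).write.format("delta").mode("append").save(path)

  # Read the current state of the table.
  spark.read.format("delta").load(path).show()

  # Time travel: query the first snapshot (version 0) of the table.
  spark.read.format("delta").option("versionAsOf", 0).load(path).show()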
The Medallion Architecture
Ingestion Tables (Bronze)
● No business rules or transformations of any kind
● Should be fast and easy to get new data to this layer
Refined Tables (Silver)
● Prioritize speed to market and write performance: just enough transformations
● Quality data expected
Feature/Agg Data Store (Gold)
● Prioritize business use cases and user experience
● Precalculated, business-specific transformations
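As an illustration, a hedged sketch of promoting data from a bronze ingestion table to a silver refined table, reusing the spark session from the earlier sketch (paths, source file, and column names are all hypothetical):

  from pyspark.sql import functions as F

  bronze_path = "/tmp/delta/bronze/events"  # hypothetical
  silver_path = "/tmp/delta/silver/events"  # hypothetical

  # Bronze: land raw data as-is, no business rules.
  raw = spark.read.json("/tmp/raw/events.json")  # hypothetical source
  raw.write.format("delta").mode("append").save(bronze_path)

  # Silver: just enough transformations; quality data expected.
  (spark.read.format("delta").load(bronze_path)
      .dropDuplicates(["event_id"])
      .filter(F.col("event_ts").isNotNull())
      .write.format("delta").mode("append").save(silver_path))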
Features of Delta Lake
01 ACID Transactions
Data lake transactions performed through a processing engine are committed durably and exposed to other readers in an atomic fashion.
02 Audit History
The transaction log enables a full audit trail of any changes made to the data.
03 Schema Enforcement
Automatically enforces the schema when writing data to and reading data from the lake.
04 Unification of Batch and Streaming
A table in Delta Lake is a batch table as well as a streaming source and sink.
05 Full DML Support
Supports DML operations like deletes and updates, as well as complex merge and upsert scenarios (see the sketch after this list).
06 Metadata Support & Scaling
Leverages Spark's distributed processing power to handle the metadata of petabyte-scale tables with billions of files with ease.
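A minimal sketch of such an upsert with the DeltaTable API (the table path is hypothetical, and the target table is assumed to already exist with columns id and status):

  from delta.tables import DeltaTable

  target = DeltaTable.forPath(spark, "/tmp/delta/customers")  # hypothetical path
  updates = spark.createDataFrame([(1, "updated"), (99, "new")], ["id", "status"])

  # Upsert: update matching rows, insert the rest, all in one atomic commit.
  (target.alias("t")
      .merge(updates.alias("u"), "t.id = u.id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())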
Getting Started With Delta Lake
1 Delta Lake with Spark-Shell
2 Delta Lake in PySpark
3 Delta Lake on Databricks
4 Hello Delta Lake
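For the first two options, a hedged example of launching the shells with Delta Lake enabled (the artifact version must match your Spark and Scala versions; delta-core 2.12:2.4.0 pairs with Spark 3.4, and Delta Lake 3.x renamed the artifact to delta-spark). On Databricks, Delta Lake is available out of the box.

  # Scala shell with Delta Lake
  spark-shell --packages io.delta:delta-core_2.12:2.4.0 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

  # PySpark shell with Delta Lake (same flags)
  pyspark --packages io.delta:delta-core_2.12:2.4.0 \
    --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"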
Demo
Delta Lake Best Practices
Choose the right partition column
If the cardinality of a column is very high, do not use that column for partitioning. Also consider the amount of data in each partition: if partitions would hold less than 1 GB, partitioning likely hurts more than it helps. A sketch follows this item.
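A minimal sketch of writing a partitioned Delta table (the date column and path are illustrative):

  from pyspark.sql import functions as F

  # A date column is a typical low-cardinality partition key.
  events_df = spark.range(100).withColumn("date", F.current_date())

  (events_df.write.format("delta")
      .partitionBy("date")
      .mode("overwrite")
      .save("/tmp/delta/events_by_date"))  # hypothetical path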
Improve performance of Delta Lake merge
Compact Files
A large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction (see the sketch after this item).
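A hedged sketch of manual compaction using the documented repartition-and-overwrite pattern (the target of 16 files is arbitrary):

  path = "/tmp/delta/events"  # hypothetical table location

  # Rewrite many small files into 16 larger ones without changing the data;
  # dataChange=false lets streaming readers safely skip this commit.
  (spark.read.format("delta").load(path)
      .repartition(16)
      .write.format("delta")
      .option("dataChange", "false")
      .mode("overwrite")
      .save(path))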
Enhanced checkpoints for low-latency queries
Replace the content or schema of a table
Sometimes you may want to replace a Delta table; overwriting it in place is simpler and safer than deleting and recreating it. A sketch follows this item.
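A sketch of replacing a table's content and schema in place (overwriteSchema is the documented writer option; the data and path are illustrative):

  from pyspark.sql import functions as F

  new_df = spark.range(10).withColumn("status", F.lit("reloaded"))  # replacement data

  (new_df.write.format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")  # allow the schema to change
      .save("/tmp/delta/events"))         # hypothetical path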
Spark Caching
Difference between Delta Lake and Parquet on Apache Spark
Thank You!