We will explore how to use the Databricks Lakehouse Platform to productionize ETL pipelines. Along the way, we will learn how to use Delta Live Tables with Spark SQL and PySpark to define and schedule pipelines that incrementally process new data from a variety of sources into the Lakehouse, how to orchestrate tasks with Databricks Workflows, and how to promote code with Databricks Repos.
KnolX Etiquettes
A lack of etiquette and manners is a huge turn-off.
Punctuality
Join the session five minutes before the start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for every session; it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session if you need to take an urgent call.
Avoid Disturbance
Avoid unnecessary chit-chat during the session.
Agenda
1. Get Started with Databricks Data Science and Engineering
2. Transform Data with Spark (SQL/PySpark)
3. Manage Data with Delta Lake
4. Build Data Pipelines with Delta Live Tables (SQL/PySpark)
5. Deploy Workloads with Databricks Workflows
6. Manage Data Access for Analytics with Unity Catalog
Compute Resources: Cluster Configuration
Single Node
Needs only one VM instance, which hosts the driver; there are no worker instances in this configuration.
Multi Node (Standard)
General-purpose configuration consisting of one VM instance hosting the driver and at least one additional instance for the workers.
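As a rough illustration of the two configurations, here is a minimal sketch that creates a single-node and a multi-node cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and node type are placeholder assumptions; substitute the values that apply to your workspace.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption: your workspace URL
TOKEN = "<personal-access-token>"                        # assumption: a valid PAT
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Single node: no workers; the driver runs Spark locally.
single_node = {
    "cluster_name": "single-node-demo",
    "spark_version": "13.3.x-scala2.12",   # assumption: any supported runtime
    "node_type_id": "i3.xlarge",           # assumption: a node type in your cloud
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# Multi node (standard): one driver plus at least one worker.
multi_node = {
    "cluster_name": "multi-node-demo",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

for config in (single_node, multi_node):
    resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=config)
    print(resp.json())  # returns {"cluster_id": "..."} on success
```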
Types of Cluster
All-Purpose Clusters
An all-purpose cluster is started and terminated manually and can be shared by multiple users and jobs.
Used for interactive, collaborative work such as ad hoc analysis, data exploration, and development.
Job Clusters
A job cluster is created when a job or task starts and terminated when the job or task ends.
Used specifically for running automated workloads.
Delta Lake
Delta Lake is an open-source storage layer that enables building a data lakehouse on top of existing cloud object storage.
Delta Lake brings ACID transactions to object storage.
Delta Lake is the default format for tables created in Databricks.
ACID Properties of Delta Lake
Atomicity
Consistency
Isolation
Durability
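Because Delta is the default table format on Databricks, creating a Delta table needs no special options. The following PySpark sketch (the table and column names are made up for illustration) writes a DataFrame as a Delta table and reads it back through the same transactional log:

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` is provided for you; the builder is only needed elsewhere.
spark = SparkSession.builder.getOrCreate()

# Hypothetical data: two columns, a handful of rows.
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],
    ["id", "name"],
)

# Delta is the default format, so saveAsTable creates a Delta table.
df.write.mode("overwrite").saveAsTable("default.users")

# Reads go through the same ACID-compliant transaction log.
spark.table("default.users").show()
```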
Introducing Delta Live Tables
Operate with agility: declarative tools to build batch and streaming data pipelines.
Trust your data: DLT has built-in declarative quality controls; declare quality expectations and the actions to take when they are violated.
Scale with reliability: easily scale infrastructure alongside your data.
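To make "declarative" concrete, here is a minimal DLT pipeline sketch in PySpark. The `dlt` decorators and the `expect_or_drop` expectation are the DLT Python API; the source path, table names, and columns are assumptions for illustration:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw files with Auto Loader.
# `spark` is provided by the DLT runtime.
@dlt.table(comment="Raw orders ingested incrementally")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/demo/orders")  # assumption: your landing location
    )

# Silver: a declarative quality expectation; rows failing it are dropped.
@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read_stream("orders_raw").withColumn(
        "ingested_at", F.current_timestamp()
    )
```

DLT infers the dependency between the two tables from the `dlt.read_stream` call, so there is no separate scheduling code to write.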
Databricks Workflows
A fully managed, cloud-based task orchestration service designed specifically for the Lakehouse.
Workflows caters to data engineers, data scientists, and analysts, empowering them to build dependable workflows for data, analytics, and AI across cloud platforms.
Databricks has two main task orchestration services:
Workflows
Delta Live Tables (DLT)
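As a sketch of what orchestration looks like in practice, the snippet below submits a two-task job through the Jobs API 2.1, where the second task depends on the first and each runs on a job cluster created for the run. The notebook paths, names, and node types are placeholder assumptions:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                        # assumption

job_cluster = {  # a job cluster: created when the task starts, terminated when it ends
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

job_config = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/ingest"},  # assumption
            "new_cluster": job_cluster,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "notebook_task": {"notebook_path": "/Repos/demo/transform"},
            "new_cluster": job_cluster,
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_config,
)
print(resp.json())  # {"job_id": ...} on success
```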
DLT vs Workflow Jobs
Source: DLT supports notebooks only; Workflow Jobs support JARs, notebooks, DLT pipelines, and applications written in Scala, Java, or Python.
Dependencies: automatically determined by DLT; manually set in Workflow Jobs.
Cluster: self-provisioned in DLT; self-provisioned or existing in Workflow Jobs.
Timeouts and retries: not supported in DLT; supported in Workflow Jobs.
Import libraries: not supported in DLT; supported in Workflow Jobs.
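To illustrate the timeouts-and-retries row, a Workflow Job task can carry retry and timeout settings directly in its definition. The field names below come from the Jobs API task spec; the values and notebook path are arbitrary examples:

```python
# Per-task resilience settings available in Workflow Jobs (not in DLT):
task = {
    "task_key": "transform",
    "notebook_task": {"notebook_path": "/Repos/demo/transform"},  # assumption
    "timeout_seconds": 3600,             # fail the task if it runs longer than an hour
    "max_retries": 2,                    # retry up to twice on failure
    "min_retry_interval_millis": 60000,  # wait a minute between attempts
    "retry_on_timeout": False,           # do not retry when the timeout itself fires
}
```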