Data Engineering
with Databricks
Purva Agrawal
Anshika Agrawal
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
• Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
• Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
• Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session
in case you need to attend an urgent call.
• Avoid Disturbance
Avoid unwanted chit-chat during the session.
1. Get Started with Databricks Data Science and
Engineering
2. Transform Data with Spark (SQL/PySpark)
3. Manage Data with Delta Lake
4. Build Data Pipelines with Delta Live Tables
(SQL/PySpark)
5. Deploy Workloads with Databricks Workflows
6. Manage Data Access for Analytics with Unity
Catalog
Databricks Architecture
Compute Resources
• Single Node
Needs only one VM instance hosting the driver, with no
worker instances in this configuration.
• Multi Node (Standard)
General-purpose configuration consisting of a VM instance
hosting the driver and at least one additional instance for the
workers.
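As a sketch, the two configurations above map onto Clusters API payloads; the runtime version and node type below are illustrative placeholders, not recommendations:

```json
{
  "cluster_name": "single-node-demo",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": { "ResourceClass": "SingleNode" }
}
```

A multi-node (standard) cluster drops the `singleNode` profile settings and sets `num_workers` to 1 or more.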
Cluster Configuration
Types of Cluster
All-Purpose Clusters
• An all-purpose cluster is a shared cluster that is
manually started and terminated and can be
shared by multiple users and jobs.
• Used for interactive development.
• Designed for collaborative work such as ad hoc
analysis, data exploration, and development.
Job Clusters
• A job cluster is created when the job or task starts
and terminated when the job or task ends.
• Used for running automated workloads.
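A job cluster is typically declared inline in the job definition, so it is provisioned at run start and torn down at run end. A minimal Jobs API sketch (notebook path, node type, and runtime version are illustrative):

```json
{
  "name": "nightly-etl",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": { "notebook_path": "/Repos/etl/ingest" },
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      },
      "timeout_seconds": 3600,
      "max_retries": 2
    }
  ]
}
```

Because the cluster exists only for the run, compute is paid for only while the job executes.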
Databricks Notebooks
Multi-Language
Collaborative
Ideal For Exploration
Reproducible
Get to Production Faster
Enterprise Ready
Adaptable
02
Data objects in the Lakehouse
03
• Delta Lake is an open-source storage
layer that enables building a data
lakehouse on top of existing cloud
object storage.
• Delta Lake brings ACID transactions to
object storage.
• Delta Lake is the default format for
tables created in Databricks.
ACID Properties of Delta Lake
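Because Delta is the default format, a plain CREATE TABLE in Databricks SQL produces a Delta table; the table and column names below are illustrative:

```sql
-- Creates a Delta table; no USING DELTA clause is required,
-- though it can be stated explicitly.
CREATE TABLE orders (
  order_id BIGINT,
  amount   DOUBLE,
  ts       TIMESTAMP
);

-- Equivalent explicit form:
-- CREATE TABLE orders (...) USING DELTA;

-- Delta's transaction log also enables time travel:
SELECT * FROM orders VERSION AS OF 0;
```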
Atomicity
Consistency
Isolation
Durability
04
Medallion Architecture in the Lakehouse
Multi-hop in the Lakehouse
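The multi-hop (bronze → silver → gold) pattern can be sketched as successive Delta tables; paths and names here are illustrative:

```sql
-- Bronze: raw ingested records
CREATE TABLE bronze_orders AS
SELECT * FROM json.`/mnt/raw/orders/`;

-- Silver: cleaned and validated
CREATE TABLE silver_orders AS
SELECT order_id, CAST(amount AS DOUBLE) AS amount, ts
FROM bronze_orders
WHERE order_id IS NOT NULL;

-- Gold: business-level aggregates
CREATE TABLE gold_daily_revenue AS
SELECT DATE(ts) AS day, SUM(amount) AS revenue
FROM silver_orders
GROUP BY DATE(ts);
```

Each hop incrementally improves data quality, so downstream consumers read only from curated layers.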
Trust your data · Scale with reliability · Operate with agility
• Declarative tools to build
batch and streaming data
pipelines
• DLT has built-in declarative
quality controls
• Declare quality expectations
and actions to take
• Easily scale infrastructure
alongside your data
Introducing Delta Live Tables
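In DLT SQL, the declarative quality controls above take the form of expectations attached to a table definition; a minimal sketch, with illustrative table and constraint names:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE silver_orders (
  CONSTRAINT valid_order_id  EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT positive_amount EXPECT (amount > 0)
)
COMMENT "Cleaned orders with quality expectations"
AS SELECT * FROM STREAM(LIVE.bronze_orders);
```

Rows violating `valid_order_id` are dropped; rows violating `positive_amount` are kept but recorded in the pipeline's quality metrics.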
05
• A comprehensive task orchestration
service, fully managed and cloud-based,
designed specifically for Lakehouse
environments.
• Workflows caters to data engineers,
data scientists, and analysts,
empowering them to construct
dependable workflows for data,
analytics, and AI across various cloud
platforms.
• Databricks has two main task orchestration
services: Workflows and Delta Live Tables (DLT).
DLT vs Workflow Jobs

|                      | Delta Live Tables        | Workflow Jobs                                                      |
|----------------------|--------------------------|--------------------------------------------------------------------|
| Source               | Notebooks only           | JARs, notebooks, DLTs, applications written in Scala, Java, Python |
| Dependencies         | Automatically determined | Manually set                                                       |
| Cluster              | Self-provisioned         | Self-provisioned or existing                                       |
| Timeouts and Retries | Not supported            | Supported                                                          |
| Import Libraries     | Not supported            | Supported                                                          |
06
Data Governance
Data Access Control
Data Lineage
Data Access Audit
Data Discovery
Unified data and AI assets
Unified existing catalogs
Unified governance across clouds
Unity Catalog Overview
Unity Catalog Metastore Elements
Unity Catalog Architecture
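Unity Catalog addresses objects through a three-level namespace (catalog.schema.table) and centralizes access control in SQL; the catalog, schema, table, and group names below are illustrative:

```sql
-- Three-level namespace
SELECT * FROM main.sales.orders;

-- Centralized access control
GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`;

-- Discovery: browse objects in a schema
SHOW TABLES IN main.sales;
```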