Azure+Databricks+Course+Slide+Deck+V4.pdf

About Me
Ramesh Retnasamy
https://www.linkedin.com/in/ramesh-retnasamy/
Data Engineer/ Machine Learning Engineer

About this course
Azure Databricks
PySpark
Spark SQL
Delta Lake

About this course
Azure Data Lake Storage Gen2
Azure Data Factory
PowerBI
Azure Databricks
Azure Key Vault

Formula1 Cloud Data Platform
Report
Ingest
Ergast API ADLS
Ingested
Layer
Transform Analyze
ADLS
Raw Layer
ADLS
Presentation
Layer
ADF Pipelines

Who is this course for
University students
IT Developers from other disciplines
Data Architects
AWS/ GCP/ On-prem Data Engineers

Who is this course not for
You are not interested in hands-on learning approach
Your only focus is Azure Data Engineering Certification
You want to Learn Spark ML or Streaming
You want to Learn Scala or Java

Pre-requisites
All code and step-by-step instructions provided
Basic SQL knowledge required
Azure Account
Basic Python programming knowledge required
Cloud fundamentals will be beneficial, but not mandatory

Our Commitments
Ask Questions, I will answer J
Keeping the course up to date
Udemy life time access
Udemy 30 day money back guarantee

Databricks
2.Azure Portal
23.Azure Data Factory
Course Structure
Overviews
3.Azure Databricks
4. Clusters
5. Notebooks
8. Databricks Mounts
Spark (Python)
11. Data Ingestion 1
9.Project Overview 10.Spark Overview
14. Jobs
15. Transformation
16. Aggregations
Spark (SQL)
17. Temp Views
18. DDL
19. DML
20. Analysis
Delta Lake
22.Delta Lake
19.Incremental Load
24.Connecting Other Tools
Orchestration
21.Incremental Load
6. Data Lake Access
7. Securing Access

Introduction to Azure Databricks

Microsoft Azure
Databricks
Azure Databricks

Apache Spark
Distributed computing Platform
In-memory processing engine
Unified engine which supports SQL, streaming, ML and
graph processing
Apache Spark is a lightning-fast unified analytics engine for big
data processing and machine learning
100% Open source under Apache License
Simple and easy to use APIs

Spark Core
Apache Spark Architecture
Scala Python Java R
Spark
SQL
Spark
Streaming
Spark ML Spark Graph
Resilient Distributed Dataset (RDD)
Spark Standalone, YARN, Apache Mesos, Kubernetes
DataFrame / Dataset APIs
Spark SQL Engine
Catalyst Optimizer Tungsten

Databricks
SQL
Analytics
Clusters
Databases/
Tables
Workspace/
Notebook
Administrati
on Controls
Optimized
Spark (5x
faster)
Delta Lake
MLFlow

Databricks
Azure Databricks
MLFlow
Databases/
Tables
Optimized
Spark (5x
faster)
Clusters
Workspace
/ Notebook
Administra
tion
Controls
Unified Azure Portal +
Unified Billing
Azure Active Directory
Power BI
Azure ML
Azure Data Factory
Azure Dev Ops
Azure Data Lake
Azure Blob Storage
Azure Cosmos DB
Azure SQL Database
Azure Synapse
Data Services
Azure IoT Hub
Azure Event Hub
Messaging Services
Delta Lake
SQL
Analytics

Microsoft Azure
Databricks
Azure Databricks
Data
Services
Messaging
Services
Unified Azure
Portal + Unified
Billing
Azure Active
Directory
Power BI
Azure ML
Azure Data
Factory
Azure Dev
Ops
MLFlow
Databases/
Tables
Optimized
Spark (5x
faster)
Clusters
Workspace
/ Notebook
Administra
tion
Controls
Delta Lake
SQL
Analytics

Creating Azure Databricks Service

Control Plane (Databricks
Subscription)
Databricks UX
Azure Resource Manager
Data Plane (Customer Subscription)
Databricks
Workspace
VNet
VM
VM
VM
VM
Azure Blob Storage
Clusters
DBFS
Azure AD
Databricks Cluster Manager
Azure Databricks Architecture

Notebooks
Clusters
Jobs
Data
Models
Databricks Workspace Components

Databricks Clusters
What is Databricks Cluster
Cluster Types
Cluster Configuration
Creating a cluster
Cluster Pools
Pricing
Cost Control
Cluster Policy

Databricks Cluster
VM
VM
VM
VM
Driver
Worker Worker
Worker

Cluster Types
All Purpose
Created manually
Cheaper to run
Isolated just for the job
Suitable for automated workloads
Terminated at the end of the job
Created by Jobs
Job Cluster
Persistent
Suitable for interactive workloads
Shared among many users
Expensive to run

Multi Node
Single Node
Single/ Multi Node

Single User
Only One User Access
Supports Python, SQL, Scala, R
Access Mode
Shared
Multiple User Access
Only available in Premium. Supports Python, SQL
No Isolation Shared
Multiple User Access
Supports Python, SQL
Custom
Legacy Configuration
Single/ Multi Node

Databricks Runtime ML
Photon Runtime
Access Mode
Databricks Runtime
Spark Delta Lake
Ubuntu
Libraries
Scala, Java,
Python, R
GPU
Libraries
Other Databricks Services
Everything from
Databricks runtime
Popular ML Libraries (PyTorch, Keras,
TensorFlow, XGBoost etc)
Photon Engine
Databricks Runtime Light
Runtime option for only jobs not requiring advanced features
Everything from
Databricks runtime
Databricks Runtime
Single/ Multi Node

Access Mode
Databricks Runtime
Auto Termination Auto Termination
• Terminates the cluster after X minutes of inactivity
• Default value for Single Node and Standard clusters is 120 minutes
• Users can specify a value between 10 and 10000 mins as the duration
Single/ Multi Node

Access Mode
Databricks Runtime
Auto Termination
Auto Scaling Auto Scaling
• User specifies the min and max work nodes
• Auto scales between min and max based on the workload
• Not recommended for streaming workloads
Single/ Multi Node

Cluster VM Type/ Size
Memory Optimized
Storage Optimized
Compute Optimized
General Purpose
GPU Accelerated
Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node

Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node
Cluster Policy

Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node
Cluster Policy
Simplifies the user interface
Enables standard users to create clusters
Achieves cost control
Only available on premium tier

Azure Databricks Cluster
Pricing

Azure Databricks Pricing Factors
Workload (All Purpose/ Jobs/ SQL/ Photon)
Tier (Premium/ Standard)
VM Type (General Purpose/ GPU/ Optimized)
Purchase Plan (Pay As You Go/ Pre-Purchase)
VM
VM
VM
VM
Driver
Worker Worker
Worker
Databricks Cluster

Azure Databricks Pricing Calculation
A Databricks Unit (DBU) is a normalized unit of processing power on the Databricks Lakehouse Platform used for
measurement and pricing purposes
VM
VM
VM
VM
Driver
Worker Worker
Worker
Databricks Cluster

No of
DBU
Price based on
workload/ tier
X = DBU Cost
1 (Driver) Price of VM
X =
Driver
Node Cost
No of
workers
Price of VM
X =
Worker
Node Cost
=
+
+
Total Cost of
the Cluster

VM
Driver
Single Node Cluster
Workload - All Purpose
Pricing Tier - Premium
VM Type - General Purpose Standard DS3_V2
Purchase Plan - Pay As You Go

https://azure.microsoft.com/en-gb/pricing/details/databricks/

https://azure.microsoft.com/en-gb/pricing/details/virtual-machines/linux/#pricing

No of
DBU
Price based on
workload/ tier
X = DBU Cost
0.75 0.55 0.4125
1 (Driver) Price of VM
X =
Driver
Node Cost
1 0.3510
0.3510
=
+
+
Total Cost of
the Cluster
$0.7635
per hour
No of
workers
Price of VM
X =
Worker
Node Cost
0 0.00

Estimated cost for doing the course
Azure Data Lake Storage
Azure Data Factory
Azure Databricks Job Cluster
Azure Databricks Cluster Pool
Azure Databricks All Purpose Cluster
Depends on the number of hours in use
Past student experience – 20 to 30 hours to complete the course
Past Pay As You Go student experience – $15 to $25
Within the credit offered by Free / Student Subscription
$0.76 per hour for a small single node cluster on premium tier

Cost Control
Azure Data Factory
Azure Databricks Job Cluster
Azure Databricks Cluster Pool
Azure Databricks All Purpose Cluster
None – Cost Negligible
None - Billed only for execution of the pipeline
None - Destroyed once the job completes
Delete the cluster pool at end of the lesson
Set Auto Termination to 20 minutes
Service Action to be taken
Set budget alerts on your subscription

Cluster Pools
Cluster 1
Pool
(idle instance 1 & max instance 2)
VM1
Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node
Cluster Policy
Cluster Pool

Cluster 1
VM1
Pool
(idle instances 1 & max instances 2)
VM2
Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node
Cluster Policy
Cluster Pool
Cluster Pools

Cluster 2
Cluster 1
VM1
Pool
(idle instances 1 & max instances 2)
VM2
Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Single/ Multi Node
Cluster Policy
Cluster Pool
Cluster Pools

Cluster Policy
Admin User
Policy
Hide Attributes
Set Default Values
Fix Values
Config
GUI
Cluster
Simple User Interface
Standardize Cluster Configs
Achieve Cost Control
Empowers standard users

Cluster Policy
Admin User
Policy
Only available on Premium Tier
Public Preview (December 2022)
Config
GUI
Cluster

Databricks Notebooks
What’s a notebook
Creating a notebook
Magic Commands
Databricks Utilities
Import Project Solution Notebooks

Databricks Utilities
File System Utilities
Notebook Workflow Utilities
Widget Utilities
Secrets Utilities

Access Azure Data Lake - Section Overview
Cluster Notebook
ADLS
Gen2
Service Principal
Storage Access Keys
Shared Access Signature (SAS Token)
Service Principal

Cluster Notebook
ADLS
Gen2
Storage Access Keys
Service Principal
Service Principal
Session Scoped Authentication

Cluster Notebook
ADLS
Gen2
Storage Access Keys
Service Principal
Service Principal
Cluster Scoped Authentication

Cluster Notebook
ADLS
Gen2
User Account
Service Principal
AAD Passthrough Authentication

Cluster Notebook
ADLS
Gen2
User Account
Service Principal
Unity Catalog
Unity Catalog

Create Azure Data Lake Gen2 Storage
Access Data Lake using SAS Token
Access Data Lake using Access Keys
Access Data Lake using Service Principal
Using Cluster Scoped Authentication
Access Data Lake using AAD Credential Pass-through
Recommended approach for the course

Create Azure Data Lake Gen2 Storage
(ADLS Gen2)

Access Azure Data Lake Gen2
using
Access Keys

Authenticate Using Access Keys
Gives full Access to the storage account
Keys can be rotated (regenerated)
Each storage account comes with 2 keys
ADLS Gen2

spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", "<access key>")
Azure Data
Lake Gen2
Azure
Databricks
Access Key
Access Keys – Spark Configuration

Optimized for big data analytics
Offers better security
abfs (Azure Blob File System) driver
Microsoft Documentation -https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver
Azure Data
Lake Gen2
Azure
Databricks
Access Keys – abfs driver

abfs[s]:// container @storage_account_name.dfs.core.windows.net/ folder_path/ file_name
abfss://demo@formula1dl.dfs.core.windows.net/test/circuits.csv
abfss://demo@formula1dl.dfs.core.windows.net/test/
abfss://demo@formula1dl.dfs.core.windows.net/
Azure Data
Lake Gen2
Azure
Databricks
Access Keys – abfs driver

Azure Data
Lake Gen2
Azure
Databricks
spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net", "<access key>")
dbutils.fs.ls("abfss://demo@formula1dl.dfs.core.windows.net/")
Authenticate Using Access Keys

Access Azure Data Lake
using

Shared Access Signature
Restrict access to specific resource types/ services
ADLS Gen2
Provides fine grained access to the storage
Allow specific permissions
Restrict access to specific time period
Limit access to specific IP addresses
Recommended access pattern for external clients
Microsoft Documentation - https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview

Shared Access Signature
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net",
"SAS")
Azure Data
Lake Gen2
Azure
Databricks
Shared Access
Signature (SAS)

Access Azure Data Lake
using
Service Principal

Service Principal
User Account Service Principal
Azure Active Directory Azure Subscription
Owner
Contributor
Reader
…
…
Custom
RBAC

Service Principal
User Account Service Principal
Azure Active Directory Azure Subscription
Owner
Contributor
Reader
…
…
Custom
RBAC
Azure
Databricks
1. Register Azure AD Application / Service Principal
2. Generate a secret/ password for the Application
4. Assign Role Storage Blob Data Contributor to the Data Lake
3. Set Spark Config with App/ Client Id, Directory/ Tenant Id & Secret

Session Scoped Authentication
Cluster
Notebook ADLS Gen2
User
Secrets

Cluster Scoped Authentication
Cluster
Notebook ADLS Gen2
User
Secrets

AAD Credential Passthrough
AAD User

Azure Subscription
AAD Credential Passthrough
AAD User
Owner
Contributor
Reader
…
…
Custom
RBAC
Azure
Databricks
Azure Data Lake
Gen2
Only available on Premium Workspaces

Recommended Access Pattern
For The Course
Azure Data
Lake Gen2
Azure
Databricks

Access Data Lake using SAS Token
Access Data Lake using Access Keys
Access Data Lake using Service Principal
Access Data Lake using Cluster Scoped Authentication
Access Data Lake using AAD Credential Pass-through
Securing Credentials & Secrets Databricks Mounts

Student Subscription
Company Subscription
without Access to AAD
Using Cluster Scoped Authentication (via
Access Keys)
Free Subscription
Pay-as-you-go
Subscription
Any other Subscription
with Access to AAD
Using Service Principal

Securing Secrets
Databricks Secret Scope Azure Key Vault

Secret scopes help store the credentials securely and reference them in
notebooks, clusters and jobs when required
Databricks backed Secret Scope Azure Key-vault backed Secret Scope
Securing Secrets – Section Overview

Azure Key-Vault
Databricks Secret Scope
Notebooks/Clusters/Jobs
Add secrets to the key vault
Create Databricks secret scope
Get secrets using dbutils.secrets.get

Create Azure Key Vault
Databricks Secrets Utility Overview
Create Databricks Secret Scope
Implement Secrets to Notebooks
Implement Secrets to Clusters
Azure Key-Vault
Databricks Secret Scope
Notebooks/Clusters/Jobs

Databricks Secrets Utility
(dbutils.secrets)

Implement Secrets Utility in

(Assignment)

Databricks Clusters

Databricks Mounts
What is Databricks File System (DBFS)
What are Databricks mounts
Mount ADLS container to Databricks

Databricks File System (DBFS)
Subscription)
Databricks UX
Databricks
Workspace
VNet
VM
VM
VM
VM
Azure Blob Storage
Clusters
DBFS
Azure AD
Databricks File System (DBFS) is a distributed file system mounted on the Databricks workspace
DBFS Root is the default storage for a Databricks workspace created during workspace deployment

DBFS Root
Default storage location, but not recommended for user data
Can be accessed via the web UI
Query results are stored here
Default storage location for managed tables
Backed by the default Azure Blob Storage

Subscription)
Databricks UX
Databricks
Workspace
VNet
VM
VM
VM
VM
Azure Blob Storage
Clusters
DBFS
Azure AD
Databricks Mounts
Customer Subscription
Azure Blob Storage
/mnt

Access data without requiring credentials
Access files using file semantics rather than storage URLs (e.g. /mnt/storage1)
Stores files to object storage (e.g. Azure Blob), so you get all the benefits from Azure
Benefits
Databricks Mounts
Recommended solution for access Azure Storage until the introduction
of Unity Catalog (Generally Available from end of 2022)

DBFS
Azure Data Lake Gen2
Databricks Mounts

DBFS
Service Principal
Provide
Required
Access
Create mount using
Credentials
/mnt/storage1
Databricks Mounts

DBFS
Databricks Mounts
/mnt/storage1

Mounting Azure Data Lake Storage Gen2

Project Overview
What is formula1
Formula1 data source & datasets
Prepare the data for the project
Project Requirements
Solution Architecture

Seasons
Race Weekend
Race Circuits
Teams/
Constructors
Drivers
Practice
Qualifying
Race
Qualifying Results
Laps
Pit Stops
Race Results
Driver Standings
Constructor
Standings
Formula1 Overview

Formula1 Data Source
http://ergast.com/mrd/

Ingest All 8 files into the data lake
Ingested data must have audit columns
Ingested data must be stored in columnar
format (i.e., Parquet)
Ingested data must have the schema applied
Must be able to analyze the ingested data via
SQL
Ingestion logic must be able to handle
incremental load
Data Ingestion Requirements

Join the key information required for reporting to
create a new table.
Transformed tables must have audit columns
Transformed data must be stored in columnar
format (i.e., Parquet)
Must be able to analyze the transformed data
via SQL
Transformation logic must be able to handle
incremental load
Data Transformation Requirements
Join the key information required for Analysis to
create a new table.

Driver Standings
Reporting Requirements
Constructor Standings

Dominant Drivers
Analysis Requirements
Dominant Teams
Visualize the outputs
Create Databricks Dashboards

Scheduled to run every Sunday 10PM
Scheduling Requirements
Ability to monitor pipelines
Ability to re-run failed pipelines
Ability to set-up alerts on failures

Ability to delete individual records
Other Non-Functional Requirements
Ability to see history and time travel
Ability to roll back to a previous version

Solution Architecture Overview

Report
Ingest
Ergast API ADLS
Ingested
Layer
Transform Analyze
Solution Architecture
ADLS
Raw Layer
ADLS
Presentation
Layer
ADF Pipelines

Azure Databricks Modern Analytics Architecture
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-
architecture

Databricks Architecture
https://databricks.com/solutions/data-pipelines

Spark Architecture
Driver Node
VM
Core
Core
Core
Core
Worker Node
VM
Core
Core
Core
Core
Worker Node
VM
Core
Core
Core
Core

Spark Architecture
Driver Node
Driver Program
Worker Node
Executor
Slot
Slot
Slot
Slot
Worker Node
Executor
Slot
Slot
Slot
Slot

Spark Architecture
Driver Node
Driver Program
Worker Node
Executor
Slot
Slot
Slot
Slot
Application
Job
Job
Stage
Stage
Stage
Task
Task
Task
Task
Worker Node
Executor
Slot
Slot
Slot
Slot

Spark Architecture
Driver Node
Driver Program
Worker Node
Executor
Task
Task
Task
Task
Application
Job
Job
Stage
Stage
Stage
Task
Task
Task
Task
Worker Node
Executor
Task
Task
Task
Task

Spark Architecture – Cluster Scaling
Driver Node
Driver Program
Worker Node
Executor
Core
Core
Core
Core
Worker Node
Executor
Core
Core
Core
Core
Core
Core Core
Core
Worker Node
Executor
Core
Core
Core
Core
Core
Core

Spark DataFrame
Source: https://www.bbc.co.uk/sport/formula1/drivers-world-championship/standings

Spark DataFrame
Read Trans Write
Data Source
(CSV, JSON etc)
Data Sink
(ORC, Parquet
etc)
DataFrame DataFrame
Data Sources API
- DataFrame
Reader Methods
DataFrame API Data Sources API
- DataFrame
Writer Methods

Data Ingestion Overview
Read Data Transform Data Write Data
DataFrameReader API DataFrameWriter API
Data Types
Row
Column
Functions
Window
Grouping
CSV
JSON
XML
JDBC
…
Parquet
Avro
Delta
JDBC
….
DataFrame API

Data Ingestion - Circuits
CSV Parquet

Data Ingestion – Races (Assignment)
CSV Parquet
withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))

Data Ingestion – Races (Partition By)
CSV Parquet
withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))
Partition By
race_year

Data Ingestion - Constructors
JSON Parquet

Data Ingestion - Drivers
JSON Parquet

Data Ingestion – Results (Assignment)
JSON Parquet
Partition By
race_id

Data Ingestion - Pitstops
Multi Line JSON Parquet

Data Ingestion - Laptimes
Multiple CSV Parquet

Data Ingestion – Qualifying (Assignment)
Multiple Multi
Line JSON
Parquet

Databricks Workflows
Include notebook
Defining notebook parameters
Notebook workflow
Databricks Jobs

Filter/ Join Transformations
Filter Transformation
Join Transformations
Apply Transformations to F1 Project

Join Transformation
Race Results

Set-up Environment
Presentation Layer

Aggregations
Simple Aggregations
Grouped Aggregations
Window Functions
Apply Aggregations to F1 Project

Hive Meta Store
Azure Data Lake
Hive Meta Store
Spark SQL
Databricks Default
External Meta Store (Azure SQL,
MySQL etc.)

Spark Databases/ Tables/ Views
Database
Table Views
Databricks
Workspace
Managed External

Report
Ingest
Ergast API ADLS
Ingested
Layer
Transform Analyze
Managed/ External Tables
ADLS
Raw Layer
ADLS
Presentation
Layer
ADF Pipelines
External
Tables
Managed
Tables

Spark SQL Introduction
SQL Basics
Simple Functions
Aggregate Functions
Window Functions
Joins

Dominant Drivers/ Teams Analysis
Create a table with the data required
Granularity of the data – race_year, driver, team
Rank the dominant drivers of all time/ last decade etc
Rank the dominant teams of all time/ last decade etc
Visualization of dominant drivers
Visualization of dominant teams

Incremental Load
Data Loading Patterns
F1 Project Load Pattern
Implementation

Data Load Types
Full Load Incremental Load

Day 1
Full Dataset
Race 1
Day 2
Race 1
Race 2
Day 3
Race 1
Race 2
Race 3
Day 4
Race 1
Race 2
Race 3
Race 4

Full Refresh/ Load – Day 1
Day 1
Race 1
Ingestion Transformation
Data Lake Data Lake
Race 1 Race 1

Day 1
Race 1
Day 2
Race 1
Race 2
Data Lake Data Lake
Race 1
Race 1
Race 2
Race 1
Race 1
Race 2

Day 1
Race 1
Day 2
Race 1
Race 2
Day 3
Race 1
Race 2
Race 3
Data Lake Data Lake
Race 1
Race 2
Race 1
Race 2
Race 3
Race 1
Race 2
Race 1
Race 2
Race 3

Day 1
Incremental Dataset
Race 1
Day 2
Race 2
Day 3
Race 3
Day 4
Race 4

Incremental Load – Day 1
Day 1
Race 1
Data Lake Data Lake
Race 1 Race 1

Day 1
Race 1
Day 2
Race 2
Data Lake Data Lake
Race 1
Race 2
Race 1
Race 2

Day 1
Race 1
Day 2
Race 2
Day 3
Race 3
Data Lake Data Lake
Race 1
Race 2
Race 3
Race 1
Race 2
Race 3

Hybrid Scenarios
Full Dataset received, but data loaded & transformed incrementally
Incremental dataset received, but data loaded & transformed in full
Data received contains both full and incremental files
Incremental data received. Ingested incrementally & transformed in full

Formula1 Scenario
Day 2
Race 1052
Day 3
Race 1053
Day 1
Race 1
Race 2
….
Race 1047

Formula1 Scenario / Solution 1
Day 2
Race 1052
Day 3
Race 1053
Day 1
Race 1
Race 2
….
Race 1047
Full Refresh Incremental Load
Incremental Load

Formula1 Scenario / Solution 2
Day 2
Race 1052
Day 3
Race 1053
Day 1
Race 1
Race 2
….
Race 1047
Incremental Load Incremental Load
Incremental Load

Current Solution
Ingestion
Overwrite the
data ingested
Transformation
Overwrite the
data
transformed
Old Data
New Data
Data Lake
Folder - raw
Data Lake
Folder - processed
Data Lake
Folder - presentation
Old Data
New Data
New Data

New Solution
Ingestion
1) Process the
data from a
particular sub
folder
2) Delete the
races for we
have
received data
from
ingested
3) Load the new
data received
Transformation
1) Identify the
dates for
which we
have
received new
data
2) Delete the
correspondin
g races from
the
presentation
layer
3) Process the
new data
received
Data Lake
Folder - raw
Data Lake
Folder - processed
Data Lake
Folder - presentation
race_date_1
race_date_2
race_date_3
Sub folders
Races 1-1047
Race 1052
Race 1053

Delta Lake
Pitfalls of Data Lakes
Lakehouse Architecture
Delta Lake Capabilities
Convert F1 project to Delta Lake

Data Warehouse
ETL
Data Consumers
Data Sources
Operational
Data
External
Data Data Warehouse
/ Mart

Data Warehouse
Lack of support for unstructured data
Longer to ingest new data
Proprietary data formats
Scalability
Expensive to store data
Lack of support for ML/ AI workloads

Data Lake
Ingest
Data Sources
Operational
Data
External
Data
Transform
Data Science/ ML
workloads
Data Lake Data Lake
BI Reports
Data Warehouse /
Mart

Data Lake
No support for ACID transactions
Failed jobs leave partial files
Inconsistent reads
Lack of ability remove data for GDPR etc
Unable to handle corrections to data
Unable to roll back any data.
Poor performance
Poor BI support
Complex to set-up
No history or versioning
Lambda architecture for streaming workloads

Data Lakehouse
Ingest
Data Sources
Operational
Data
External
Data
Transform
Data Science/ ML
workloads
Delta Lake Delta Lake
BI Reports

Data Lakehouse
Handles all types of data
Cheap cloud object storage
Uses open source format
Support for all types of workloads
Ability to use BI tools directly
ACID support
History & Versioning
Better performance
Simple architecture

Delta Lake
BI
Parquet Files
Transaction Log
Delta Engine
Spark
Delta Table
SQL Streaming
ML/ AI
Open file format
History, Versioning, ACID, Time Travel
Optimized engine for performance
Data security, Governance, Integrity, Perf.
Transformations, ML, Streaming etc.
Simple Architecture
Spark

Azure Data Factory
Overview
Creating Data Factory Service
Data Factory Components
Creating Pipelines
Creating Triggers

What is Azure Data Factory
A fully managed, serverless data integration solution for ingesting,
preparing and transforming all of your data at scale.

SaaS Apps
Multi Cloud
On Premises
The Data Problem
Data Formats

Ingest
Transform/
Analyze
Publish
Data Consumers
Data Sources
The Data Problem

Ingest
Transform/
Analyze
Publish
Data Consumers
Data Sources
The Data Problem
Transformation

What is Azure Data Factory
Fully Managed Service
Serverless
Data Integration Service
Data Transformation Service
Data Orchestration Service
A fully managed, serverless data integration solution for ingesting, preparing and
transforming all of your data at scale.

What Azure Data Factory Is Not
Data Migration Tool
Data Streaming Service
Suitable for Complex Data Transformations
Data Storage Service

Storage
ADLS
SQL
Database
Linked Service
Compute
Azure
Databricks
Azure
HDInsight
Linked Service
Dataset
Activity
Pipeline
Activity
Trigger
Data Factory Components

Unity Catalog - Introduction
Unity Catalog
Unity Catalog Overview
Enable Unity Catalog Metastore
Cluster Configurations
Unity Catalog Object Model
Access External Data Lake

Unity Catalog
Unity Catalog is a Databricks offered unified
solution for implementing data governance in
the Data Lakehouse
Unity Catalog

Data Governance
Is the process of managing the availability, usability, integrity and
security of the data present in an enterprise
Ensures that the data is trust worthy and not misused
Controls access to the data for the users
Helps implement privacy regulations such as GDPR, CCPA etc

Data Governance
Unity Catalog
Data Access
Control
Data Lineage
Data Audit
Data
Discoverability

Without Unity Catalog
Users/ Groups
Hive
Metastore
Compute
Tables/ Views
Databricks Workspace 1
Users/
Groups
Hive
Metastore
Compute
Tables/ Views
Databricks Workspace 2
Data Lake Cloud Storage (ADLS Gen2)

Without Unity Catalog
User Management
Metastore
Compute
Databricks
Workspace 1
User Management
Metastore
Compute
Databricks
Workspace 2
User Management
Metastore
Compute
Databricks
Workspace n

With Unity Catalog
User Management
Metastore
Compute
Databricks
Workspace 1
User Management
Metastore
Compute
Databricks
Workspace 2
User Management Metastore
Databricks Unity Catalog
Without Unity Catalog With Unity Catalog
Compute
Databricks
Workspace 1
Compute
Databricks
Workspace 2
Databricks Account Databricks Account

Unity Catalog
Unity Catalog
Metastore
User Management
Data Lineage
Data Explorer
Audit Logs
Access Control

Unity Catalog
Unity Catalog
Data Access
Control
Data Audit
Data Lineage
Data
Discoverability
Unity Catalog is a Databricks offered unified solution for implementing data governance in the Data Lakehouse

Unity Catalog Set-up
User Management Metastore
Compute
Databricks
Workspace 1
Compute
Databricks
Workspace 2

Metastore
Compute
Databricks
Workspace 1
Compute
Databricks
Workspace 2

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace

Default Storage
(ADLS Gen2 Container)
Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Managed Identity/
Service Principal

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Access Connector for
Databricks

Databricks
Workspace
Metastore
Databricks Unity
Catalog
Compute
Storage Blob Data
Contributor Role
Default Storage
Databricks

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Databricks
Create Access Connector
Create Azure Data Lake Gen2
Add role Storage Blob Data Contributor
Create Unit Catalog Metastore
Enable Databricks Workspace for Unity Catalog
Create Azure Databricks Workspace

Unity Catalog
Pre-requisites

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Databricks
Create Azure Data Lake Gen2
Add role Storage Blob Data Contributor
Create Unity Catalog Metastore
Create Databricks Workspace
Enable Databricks Workspace for Unity Catalog

Unity Catalog

Metastore
Top level container
Only one per region
Paired with Default ADLS Storage

Metastore
Catalog
Newly added to Unity Catalog
Logical container within Metastore

Metastore
Catalog
Schema (Database)
Next level container within Catalogs
Schemas and Databases are the same
Use Schema instead of Database

Metastore
Catalog
Schema (Database)
Table
View Function

Metastore
Catalog
Schema (Database)
Table
View Function
External Managed
Managed tables can only be Delta format
Stored in the default storage
Deleted data retained for 30 days
Benefits from automatic maintenance and
performance improvements

Metastore
Catalog
Schema (Database)
Table
View Function
External Managed
SELECT * FROM catalog.schema.table;
SELECT * FROM schema.table;

Metastore
Catalog
Schema (Database)
Table
View Function
External Managed
Storage Credential External Location Share Recipient Provider

Dev
Silver
Tables
Gold
Bronze
Tables Tables
Test
Silver
Tables
Gold
Bronze
Tables Tables
Prod
Silver
Tables
Gold
Bronze
Tables Tables
Formula1
Metastore
Catalog
Schema

Sports
Analytics
F1_Dev
Silver
Tables
Gold
Bronze
Tables Tables
F1_Test
Silver
Tables
Gold
Bronze
Tables Tables
F1_Prod
Silver
Tables
Gold
Bronze
Tables Tables
Metastore
Catalog
Schema

Sports
Analytics
F1_Test
Schemas
F1_Prod
F1_Dev
Schemas Schemas
Football_
Test
Schemas
Football_
Prod
Football_
Dev
Schemas Schemas
Cricket_Te
st
Schemas
Cricket_Pr
od
Cricket_D
ev
Schemas Schemas
Metastore
Catalog
Tables
Tables Tables Tables
Tables Tables

Sports
Analytics
F1_Sandb
ox
Schemas
Football_
Sandbox
Schemas
Cricket_S
andbox
Schemas
Metastore
Catalog

Sports
Analytic
s
F1_Test
Schemas
F1_Prod
F1_Dev
Schemas Schemas
Football
_Test
Schemas
Football
_Prod
Football
_Dev
Schemas Schemas
Cricket_
Test
Schemas
Cricket_
Prod
Cricket_
Dev
Schemas Schemas
Metastore
Catalog
Tables
Tables Tables
F1_Sand
box
Schemas
Tables
Football
_Sandbo
x
Schemas
Tables
Cricket_
Sandbox
Schemas
Tables

Accessing External Locations
Unity Catalog

Unity Catalog Configuration
Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Databricks
User

Metastore
Databricks Unity
Catalog
Compute
Databricks
Workspace
Storage Blob Data
Contributor Role
Default Storage
Databricks
User
External Data Lake
(ADLS Gen2)

Metastore
Storage Credential External Location
Storage Credential/ External Location
Credential object stored centrally in the metastore
Used on user’s behalf to access cloud storage
Created using Managed Identity/ Service Principal
Can be assigned access control policies
Object stored centrally at the metastore
Combines the ADLS path with the storage credential
Subjected to access control policies.

Databricks Unity
Catalog
Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)

Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)
External Location
Storage Credential
Managed Identity/
Service Principal

Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)
External Location
Storage Credential
Databricks
Storage Blob Data
Contributor Role
Storage Credential
ADLS Storage Container

Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)
External Location
Storage Credential
Databricks
Storage Blob Data
Contributor Role
Storage Credential
Create Azure Data Lake Storage
Assign Storage Blob Data
Contributor role
Create Storage Credential
Create External Location

Creating Storage Credential
Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)
External Location
Storage Credential
Databricks
Storage Blob Data
Contributor Role
Storage Credential
Contributor role

Creating External Location
Compute
Databricks
Workspace
User
External Data Lake
(ADLS Gen2)
External Location
Storage Credential
Databricks
Storage Blob Data
Contributor Role
Storage Credential
Contributor role

Bronze Silver Gold
Workflows
Notebooks Notebooks Notebooks
Mini Project - Overview
Drivers
Results
Drivers
Results
Driver_Wins

Mini Project - Overview
Create External Locations
Create Silver Tables (Managed)
Create Bronze Tables (External)
Create Catalog & Schemas
Create Gold Table (Managed)
Create Databricks Workflow
Create Storage Containers

Mini Project – Create External Locations

Mini Project – Create Catalogs & Schemas

Metastore
Catalog
Schema (Database)
Table
External Managed

Metastore
Catalog
Schema (Database)
Table
External Managed
Managed Location
(Mandatory)
Managed Location
(Optional)
Managed Location
(Optional)
Default Storage
External Data Lake
External Data Lake
1
2
3

Bronze Silver Gold
Workflows
Drivers
Results
Drivers
Results
Driver_Wins
External Managed Managed

Metastore
Catalog
Schema (Database)
Table
External Managed
Managed Location
(Mandatory)
Managed Location
(Optional)
Managed Location
(Optional)
Default Storage
External Data Lake
External Data Lake
1
2
3
formula1_dev
bronze silver gold

Metastore
Catalog
Schema (Database)
Table
External Managed
Managed Location
(Mandatory)
Managed Location
(Optional)
Managed Location
(Optional)
Default Storage
External Data Lake
1
2
3
formula1_dev
bronze silver gold

Mini Project – Create External Tables

Mini Project – Create Managed Tables

Mini Project – Create Workflow

Unity Catalog – Key capabilities
Unity Catalog
Data Discovery
Data Audit
Data Lineage
Data Access Control

Unity Catalog – Data Discoverability
Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage

Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage
Unity Catalog – Data Discoverability

Unity Catalog – Data Audit
Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage

Unity Catalog – Data Lineage
Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage

Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage
Unity Catalog – Data Lineage

Data Lineage
Data Lineage is the process of following/ tracking the journey of data within a pipeline.
What is the origin
How has it changed/ transformed
What is the destination

Bronze Silver Gold
Workflows
Data Lineage

Data Lineage - Benefits
Root cause analysis
Better compliance
Trust & Reliability
Improved impact analysis

Data Lineage - Limitations
Available only for the last 30 days
Limited column level lineage
Only available for tables registered in Unity Catalog Metastore

Unity Catalog – Data Access Control
Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage

Unity Catalog
Data Access
Control
Data Audit
Data
Discoverability
Data Lineage
Unity Catalog – Data Access Control

Unity Catalog - Securable Objects

Unity Catalog - Security Model
Secure
Identity Securable Object
Users
Service Principal
Groups

Security Model – Role Based Access
Secure
Users
Service Principal
Groups
Roles
Account Admin
Metastore Admin
Object Owner
Object User

Security Model – Access Control List
Secure
Users
Service Principal
Groups
Privileges
CREATE
USAGE
…
SELECT

Metastore
Catalog
Schema (Database)
Table
View Function
External Managed
Storage Credential External Location Share Recipient Provider

Metastore
Catalog
Schema (Database)
Table
View Function
External Managed

Metastore
Catalog
Schema (Database)
Table
View Function
SELECT
SELECT
MODIFY
EXECUTE
USE SCHEMA
USE CATALOG

Metastore
Catalog
Schema (Database)
Table
View Function
SELECT
SELECT
MODIFY
EXECUTE
USE SCHEMA
CREATE TABLE/
FUNCTION
USE CATALOG

SELECT
Metastore
Catalog
Schema (Database)
Table
View Function
SELECT
SELECT
MODIFY
EXECUTE
USE SCHEMA
CREATE TABLE/
FUNCTION
USE CATALOG
CREATE SCHEMA
CREATE CATALOG
USE CATALOG
USE SCHEMA

Metastore
Catalog
Schema (Database)
Table
View Function
Access Control List – Inheritance Model
SELECT
SELECT
MODIFY
EXECUTE
USE SCHEMA
CREATE TABLE/
FUNCTION
USE CATALOG
CREATE SCHEMA
SELECT
MODIFY
EXECUTE
USE SCHEMA
CREATE TABLE/
FUNCTION
SELECT
MODIFY
EXECUTE
ALL PRIVILEGES
ALL PRIVILEGES
ALL PRIVILEGES
CREATE CATALOG

Metastore
READ FILES
WRITE FILES
CREATE EXTERNAL TABLE
CREATE EXTERNAL LOCATION
READ FILES
WRITE FILES
CREATE EXTERNAL TABLE
CREATE MANAGED STORAGE
ALL PRIVILEGES ALL PRIVILEGES

Version Date Description
1 July 2021 Initial Version
2 Dec 2022 Sections 3 to 5 updated for UI Changes
3 Mar 2023 Revamped Databricks Mounts and Access to ADLS sections as a whole
4 May 2023 Addition of Unity Catalog
Document History

Azure+Databricks+Course+Slide+Deck+V4.pdf

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Azure+Databricks+Course+Slide+Deck+V4.pdf

Similar to Azure+Databricks+Course+Slide+Deck+V4.pdf (20)

Recently uploaded

Recently uploaded (20)

Azure+Databricks+Course+Slide+Deck+V4.pdf