11. Who is this course for
University students
IT Developers from other disciplines
Data Architects
AWS/ GCP/ On-prem Data Engineers
12. Who is this course not for
You are not interested in a hands-on learning approach
Your only focus is Azure Data Engineering Certification
You want to learn Spark ML or Streaming
You want to learn Scala or Java
13. Pre-requisites
All code and step-by-step instructions provided
Basic SQL knowledge required
Azure Account
Basic Python programming knowledge required
Cloud fundamentals will be beneficial, but not mandatory
14. Our Commitments
Ask questions, I will answer :)
Keeping the course up to date
Udemy lifetime access
Udemy 30 day money back guarantee
15. Course Structure
Overviews: 2. Azure Portal, 3. Azure Databricks, 9. Project Overview, 10. Spark Overview
Databricks: 4. Clusters, 5. Notebooks, 6. Data Lake Access, 7. Securing Access, 8. Databricks Mounts
Spark (Python): 11. Data Ingestion 1, 12. Data Ingestion 2, 13. Data Ingestion 3, 14. Jobs, 15. Transformation, 16. Aggregations
Spark (SQL): 17. Temp Views, 18. DDL, 19. DML, 20. Analysis
Delta Lake: 21. Incremental Load, 22. Delta Lake
Orchestration: 23. Azure Data Factory, 24. Connecting Other Tools
18. Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big data processing and machine learning
Distributed computing platform
In-memory processing engine
Unified engine which supports SQL, streaming, ML and graph processing
100% open source under the Apache License
Simple and easy to use APIs
24. Azure Databricks Architecture
[Architecture diagram]
Control Plane (Databricks Subscription): Databricks UX, Databricks Cluster Manager
Data Plane (Customer Subscription): Databricks Workspace with a VNet of cluster VMs, and Azure Blob Storage exposed as DBFS
Also involved: Azure Resource Manager, Azure AD
28. Cluster Types
All Purpose: created manually; persistent; suitable for interactive workloads; shared among many users; expensive to run
Job Cluster: created by jobs; terminated at the end of the job; suitable for automated workloads; isolated just for the job; cheaper to run
31. Cluster Configuration – Access Mode
Single User: only one user access; supports Python, SQL, Scala, R
Shared: multiple user access; only available in Premium; supports Python, SQL
No Isolation Shared: multiple user access; supports Python, SQL, Scala, R
Custom: legacy configuration
32. Cluster Configuration – Databricks Runtime
Databricks Runtime: Spark, Delta Lake, Ubuntu, language libraries (Scala, Java, Python, R), GPU libraries, other Databricks services
Databricks Runtime ML: everything from Databricks Runtime, plus popular ML libraries (PyTorch, Keras, TensorFlow, XGBoost etc)
Photon Runtime: everything from Databricks Runtime, plus the Photon engine
Databricks Runtime Light: runtime option for only jobs not requiring advanced features
33. Cluster Configuration – Auto Termination
• Terminates the cluster after X minutes of inactivity
• Default value for Single Node and Standard clusters is 120 minutes
• Users can specify a value between 10 and 10000 minutes as the duration
34. Cluster Configuration – Auto Scaling
• User specifies the min and max worker nodes
• Auto scales between min and max based on the workload
• Not recommended for streaming workloads
35. Cluster Configuration – Cluster VM Type/ Size
Memory Optimized
Storage Optimized
Compute Optimized
General Purpose
GPU Accelerated
38. Cluster Configuration – Cluster Policy
Simplifies the user interface
Enables standard users to create clusters
Achieves cost control
Only available on the premium tier
41. Azure Databricks Pricing Factors
Workload (All Purpose/ Jobs/ SQL/ Photon)
Tier (Premium/ Standard)
VM Type (General Purpose/ GPU/ Optimized)
Purchase Plan (Pay As You Go/ Pre-Purchase)
[Diagram: a Databricks cluster with one driver VM and three worker VMs]
42. Azure Databricks Pricing Calculation
A Databricks Unit (DBU) is a normalized unit of processing power on the Databricks Lakehouse Platform, used for measurement and pricing purposes
[Diagram: a Databricks cluster with one driver VM and three worker VMs]
43. Azure Databricks Pricing Calculation
DBU Cost = No of DBUs x price based on workload/ tier
Driver Node Cost = 1 (Driver) x price of VM
Worker Node Cost = No of workers x price of VM
Total Cost of the Cluster = DBU Cost + Driver Node Cost + Worker Node Cost
44. Azure Databricks Pricing Calculation
[Diagram: a single node cluster – one driver VM, no workers]
Single Node Cluster
Workload – All Purpose
Pricing Tier – Premium
VM Type – General Purpose, Standard_DS3_v2
Purchase Plan – Pay As You Go
47. Azure Databricks Pricing Calculation
DBU Cost = 0.75 DBUs x $0.55 (All Purpose, Premium) = $0.4125
Driver Node Cost = 1 (Driver) x $0.3510 (price of VM) = $0.3510
Worker Node Cost = 0 workers x price of VM = $0.00
Total Cost of the Cluster = $0.4125 + $0.3510 + $0.00 = $0.7635 per hour
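The same calculation expressed as a small Python helper. The function is purely illustrative; the rates are the example figures from this slide, not current Azure prices.

def cluster_cost_per_hour(dbu_count, dbu_rate, vm_rate, num_workers):
    """Total hourly cost = DBU cost + driver node cost + worker node cost."""
    dbu_cost = dbu_count * dbu_rate        # DBUs consumed x price per DBU
    driver_cost = 1 * vm_rate              # one driver VM
    worker_cost = num_workers * vm_rate    # zero workers on a single node cluster
    return dbu_cost + driver_cost + worker_cost

# Single node, All Purpose, Premium, Standard_DS3_v2, Pay As You Go
print(cluster_cost_per_hour(0.75, 0.55, 0.3510, 0))   # 0.7635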
48. Estimated Cost for Doing the Course
Services used: Azure Data Lake Storage, Azure Data Factory, Azure Databricks Job Cluster, Azure Databricks Cluster Pool, Azure Databricks All Purpose Cluster
Cost depends on the number of hours in use – about $0.76 per hour for a small single node cluster on the premium tier
Past student experience – 20 to 30 hours to complete the course
Past Pay As You Go student experience – $15 to $25, within the credit offered by a Free/ Student Subscription
49. Cost Control
Service – Action to be taken
Azure Data Lake Storage – None, cost negligible
Azure Data Factory – None, billed only for execution of the pipeline
Azure Databricks Job Cluster – None, destroyed once the job completes
Azure Databricks Cluster Pool – Delete the cluster pool at the end of the lesson
Azure Databricks All Purpose Cluster – Set Auto Termination to 20 minutes
Also: set budget alerts on your subscription
50. Cluster Pools
[Diagram: a pool (idle instances 1 & max instances 2) holds idle VM1; Cluster 1 is created from the pool and takes VM1]
51. Cluster Pools
[Diagram: with Cluster 1 running on VM1, the pool provisions VM2 to keep one idle instance available]
52. Cluster Pools
[Diagram: Cluster 2 is created from the pool and takes VM2; the pool is now at its max of 2 instances, with none idle]
63. Access Azure Data Lake – Section Overview
[Diagram: a notebook running on a cluster accesses ADLS Gen2]
Authentication options: Storage Access Keys, Shared Access Signature (SAS Token), Service Principal (registered in Azure Active Directory)
64. Access Azure Data Lake – Section Overview
Session Scoped Authentication – credentials are set in the notebook and last for the session
65. Access Azure Data Lake – Section Overview
Cluster Scoped Authentication – credentials are set in the cluster's Spark config and apply to every notebook on the cluster
66. Access Azure Data Lake – Section Overview
AAD Passthrough Authentication – the user's own Azure Active Directory account is passed through to the storage
67. Access Azure Data Lake – Section Overview
Unity Catalog – access is governed centrally through Unity Catalog
Create Azure Data Lake Gen2 Storage
Access Data Lake using SAS Token
Access Data Lake using Access Keys
Access Data Lake using Service Principal
Using Cluster Scoped Authentication
Access Data Lake using AAD Credential Pass-through
Recommended approach for the course
71. Authenticate Using Access Keys
Each storage account comes with 2 keys
Gives full access to the storage account
Keys can be rotated (regenerated)
73. Access Keys – abfs driver
abfs (Azure Blob File System) driver
Optimized for big data analytics
Offers better security
Microsoft Documentation – https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver
[Diagram: Azure Databricks connecting to Azure Data Lake Gen2 via the abfs driver]
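A minimal sketch of session-scoped authentication with an access key, run from a Databricks notebook (spark and display are provided by the notebook; the storage account "formula1dl", container "demo" and file name are hypothetical placeholders):

storage_account = "formula1dl"   # hypothetical storage account name
account_key = "<access-key>"     # paste a key while experimenting; secret scopes (covered later) are the better home for this

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key)

# abfss://<container>@<account>.dfs.core.windows.net/<path>
display(spark.read
        .option("header", True)
        .csv(f"abfss://demo@{storage_account}.dfs.core.windows.net/circuits.csv"))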
77. Shared Access Signature (SAS Token)
Provides fine grained access to the storage
Restrict access to specific resource types/ services
Allow specific permissions
Restrict access to a specific time period
Limit access to specific IP addresses
Recommended access pattern for external clients
Microsoft Documentation – https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview
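A comparable sketch for SAS token authentication, using the Hadoop FixedSASTokenProvider documented for the abfs driver (account and container names are hypothetical):

storage_account = "formula1dl"
sas_token = "<sas-token>"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{storage_account}.dfs.core.windows.net", sas_token)

display(dbutils.fs.ls(f"abfss://demo@{storage_account}.dfs.core.windows.net/"))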
80. Service Principal
[Diagram: User Accounts and Service Principals both live in Azure Active Directory and can be assigned RBAC roles (Owner, Contributor, Reader, Custom, …) on an Azure Subscription]
81. Service Principal
Steps to access the Data Lake from Azure Databricks:
1. Register an Azure AD Application/ Service Principal
2. Generate a secret/ password for the Application
3. Set the Spark Config with the App/ Client Id, Directory/ Tenant Id & Secret
4. Assign the role Storage Blob Data Contributor on the Data Lake
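Step 3 in code: a sketch of the OAuth Spark config for a service principal (the storage account and the three ids/secrets are placeholders for the values from your app registration):

storage_account = "formula1dl"          # hypothetical
tenant_id = "<directory-tenant-id>"
client_id = "<application-client-id>"
client_secret = "<client-secret>"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")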
86. AAD Credential Passthrough
[Diagram: the AAD user's own identity is passed from Azure Databricks through to Azure Data Lake Gen2; access is governed by the RBAC roles (Owner, Contributor, Reader, Custom, …) assigned on the Azure Subscription]
Only available on Premium Workspaces
88. Recommended Access Pattern
Access Data Lake using Access Keys
Access Data Lake using SAS Token
Access Data Lake using Service Principal
Access Data Lake using Cluster Scoped Authentication
Access Data Lake using AAD Credential Pass-through
Also covered: Securing Credentials & Secrets, Databricks Mounts
89. Recommended Access Pattern
Free Subscription, Student Subscription, Pay-as-you-go Subscription, or a Company Subscription without access to AAD – use Cluster Scoped Authentication (via Access Keys)
Any other Subscription with access to AAD – use a Service Principal
91. Securing Secrets – Section Overview
Secret scopes help store credentials securely and reference them in notebooks, clusters and jobs when required
Two types: Databricks backed Secret Scope, Azure Key-vault backed Secret Scope
92. Securing Secrets – Section Overview
Add secrets to the Azure Key-Vault
Create a Databricks secret scope backed by the Key-Vault
Get secrets in notebooks/ clusters/ jobs using dbutils.secrets.get
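In a notebook the secrets are read with dbutils.secrets; the scope and key names below are hypothetical:

# Values returned by get are redacted in notebook output
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

# Inspect what a scope contains (names only, never the values)
dbutils.secrets.list("formula1-scope")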
100. Databricks Mounts
What is Databricks File System (DBFS)
What are Databricks mounts
Mount ADLS container to Databricks
101. Databricks File System (DBFS)
[Architecture diagram: the control plane (Databricks subscription) and data plane (customer subscription) as before, with DBFS backed by the Azure Blob Storage in the data plane]
Databricks File System (DBFS) is a distributed file system mounted on the Databricks workspace
DBFS Root is the default storage for a Databricks workspace, created during workspace deployment
102. DBFS Root
Default storage location, but not recommended for user data
Can be accessed via the web UI
Query results are stored here
Default storage location for managed tables
Backed by the default Azure Blob Storage
104. Databricks Mounts
[Architecture diagram: the workspace's control plane and data plane as before, with external storage accounts in the customer subscription (Azure Blob Storage, Azure Data Lake Storage) mounted into the workspace under /mnt]
105. Databricks Mounts – Benefits
Access data without requiring credentials
Access files using file semantics rather than storage URLs (e.g. /mnt/storage1)
Stores files to object storage (e.g. Azure Blob), so you get all the benefits of Azure storage
Recommended solution for accessing Azure Storage until the introduction of Unity Catalog (generally available from the end of 2022)
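A sketch of mounting an ADLS Gen2 container with dbutils.fs.mount, reusing the service principal configs from earlier (account, container, mount point and credential placeholders are hypothetical):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

dbutils.fs.mount(
    source="abfss://demo@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/demo",
    extra_configs=configs)

display(dbutils.fs.mounts())   # list all mount points in the workspace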
119. Data Ingestion Requirements
Ingest all 8 files into the data lake
Ingested data must have audit columns
Ingested data must be stored in columnar format (i.e., Parquet)
Ingested data must have the schema applied
Must be able to analyze the ingested data via SQL
Ingestion logic must be able to handle incremental load
120. Data Transformation Requirements
Join the key information required for reporting to create a new table
Join the key information required for analysis to create a new table
Transformed tables must have audit columns
Transformed data must be stored in columnar format (i.e., Parquet)
Must be able to analyze the transformed data via SQL
Transformation logic must be able to handle incremental load
123. Scheduling Requirements
Scheduled to run every Sunday at 10PM
Ability to monitor pipelines
Ability to re-run failed pipelines
Ability to set up alerts on failures
124. Other Non-Functional Requirements
Ability to delete individual records
Ability to see history and time travel
Ability to roll back to a previous version
138. Spark DataFrame
[Diagram: Data Source (CSV, JSON etc) → Read → DataFrame → Transform → DataFrame → Write → Data Sink (ORC, Parquet etc)]
Read – Data Sources API (DataFrameReader methods)
Transform – DataFrame API
Write – Data Sources API (DataFrameWriter methods)
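A minimal read → transform → write round trip in PySpark; the mount path and the selected columns (from the project's circuits file) are assumptions:

# DataFrameReader: source file -> DataFrame
df = (spark.read
      .option("header", True)
      .csv("/mnt/formula1dl/raw/circuits.csv"))

# DataFrame API: transformations return a new DataFrame
transformed_df = df.select("circuitId", "name", "country")

# DataFrameWriter: DataFrame -> sink in columnar format
(transformed_df.write
 .mode("overwrite")
 .parquet("/mnt/formula1dl/processed/circuits"))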
141. Data Ingestion Requirements
Ingest all 8 files into the data lake
Ingested data must have audit columns
Ingested data must be stored in columnar format (i.e., Parquet)
Ingested data must have the schema applied
Must be able to analyze the ingested data via SQL
Ingestion logic must be able to handle incremental load
142. Data Ingestion Overview
Read Data – DataFrameReader API: CSV, JSON, XML, JDBC, …
Transform Data – DataFrame API: Data Types, Row, Column, Functions, Window, Grouping
Write Data – DataFrameWriter API: Parquet, Avro, Delta, JDBC, …
144. Data Ingestion – Circuits
Read Data (CSV) → Transform Data → Write Data (Parquet)
145. Data Ingestion – Races (Assignment)
Read Data (CSV) → Transform Data → Write Data (Parquet)
withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))
146. Data Ingestion – Races (Partition By)
Read Data (CSV) → Transform Data → Write Data (Parquet, partitioned by race_year)
withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))
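Putting the withColumn expression above into a runnable sketch (paths are hypothetical, and the rename of year to race_year is an assumption so the partition column exists):

from pyspark.sql.functions import col, concat, lit, to_timestamp, current_timestamp

races_df = spark.read.option("header", True).csv("/mnt/formula1dl/raw/races.csv")

races_final_df = (races_df
    .withColumnRenamed("year", "race_year")
    .withColumn("race_timestamp",
                to_timestamp(concat(col("date"), lit(" "), col("time")), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("ingestion_date", current_timestamp()))   # audit column

(races_final_df.write
 .mode("overwrite")
 .partitionBy("race_year")
 .parquet("/mnt/formula1dl/processed/races"))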
147. Data Ingestion – Constructors
Read Data (JSON) → Transform Data → Write Data (Parquet)
148. Data Ingestion – Drivers
Read Data (JSON) → Transform Data → Write Data (Parquet)
149. Data Ingestion – Results (Assignment)
Read Data (JSON) → Transform Data → Write Data (Parquet, partitioned by race_id)
150. Data Ingestion – Pitstops
Read Data (Multi Line JSON) → Transform Data → Write Data (Parquet)
151. Data Ingestion – Laptimes
Read Data (Multiple CSV) → Transform Data → Write Data (Parquet)
152. Data Ingestion – Qualifying (Assignment)
Read Data (Multiple Multi Line JSON) → Transform Data → Write Data (Parquet)
173. Dominant Drivers/ Teams Analysis
Create a table with the data required
Granularity of the data – race_year, driver, team
Rank the dominant drivers of all time/ last decade etc
Rank the dominant teams of all time/ last decade etc
Visualization of dominant drivers
Visualization of dominant teams
176. Full Dataset
Day 1: Race 1
Day 2: Race 1, Race 2
Day 3: Race 1, Race 2, Race 3
Day 4: Race 1, Race 2, Race 3, Race 4
177. Full Refresh/ Load – Day 1
Day 1 source file contains Race 1; Ingestion writes Race 1 to the data lake and Transformation writes Race 1 to the data lake
178. Full Refresh/ Load – Day 2
Day 2 source file contains Races 1-2; Ingestion overwrites the data lake with Races 1-2 and Transformation overwrites with Races 1-2
179. Full Refresh/ Load – Day 3
Day 3 source file contains Races 1-3; Ingestion overwrites the data lake with Races 1-3 and Transformation overwrites with Races 1-3
181. Incremental Load – Day 1
Day 1 source file contains Race 1; Ingestion loads Race 1 and Transformation processes Race 1
182. Incremental Load – Day 2
Day 2 source file contains only Race 2; Ingestion appends Race 2, so the data lake holds Races 1-2 in both layers
183. Incremental Load – Day 3
Day 3 source file contains only Race 3; Ingestion appends Race 3, so the data lake holds Races 1-3 in both layers
184. Hybrid Scenarios
Full Dataset received, but data loaded & transformed incrementally
Incremental dataset received, but data loaded & transformed in full
Data received contains both full and incremental files
Incremental data received. Ingested incrementally & transformed in full
189. Current Solution
[Diagram: the raw folder receives both old and new data; Ingestion overwrites the processed folder and Transformation overwrites the presentation folder]
Every run re-ingests and re-transforms the full dataset, old data included
190. New Solution
[Diagram: the raw folder is organised into sub folders per race date (race_date_1, race_date_2, race_date_3); the downstream layers hold Races 1-1047 plus the newly received Race 1052 and Race 1053]
Ingestion:
1) Process the data from a particular sub folder
2) Delete the races for which we have received new data
3) Load the new data received
Transformation:
1) Identify the dates for which we have received new data
2) Delete the corresponding races from the presentation layer
3) Process the new data received
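For a Parquet table partitioned by race_id, one way to express "delete the races we received new data for, then load the new data" is Spark's dynamic partition overwrite; a sketch assuming a pre-existing f1_processed.results table whose final column is the partition column:

# Only partitions present in results_df are replaced; all other race_id
# partitions in the table are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(results_df.write
 .mode("overwrite")
 .insertInto("f1_processed.results"))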
191. Delta Lake
Pitfalls of Data Lakes
Lakehouse Architecture
Delta Lake Capabilities
Convert F1 project to Delta Lake
194. Data Warehouse
Lack of support for unstructured data
Takes longer to ingest new data
Proprietary data formats
Limited scalability
Expensive to store data
Lack of support for ML/ AI workloads
196. Data Lake
No support for ACID transactions
Failed jobs leave partial files
Inconsistent reads
Lack of ability to remove data for GDPR etc.
Unable to handle corrections to data
Unable to roll back data
No history or versioning
Poor performance
Poor BI support
Complex to set up
Lambda architecture required for streaming workloads
198. Data Lakehouse
Handles all types of data
Cheap cloud object storage
Uses open source format
Support for all types of workloads
Ability to use BI tools directly
ACID support
History & Versioning
Better performance
Simple architecture
199. Delta Lake
[Diagram: a Delta Table is Parquet files plus a transaction log; the Delta Engine runs on Spark and serves BI, SQL, streaming and ML/ AI workloads]
Open file format (Parquet)
History, versioning, ACID, time travel (via the transaction log)
Optimized engine for performance (Delta Engine)
Data security, governance, integrity, performance (Delta Table)
Transformations, ML, streaming etc. (Spark)
Simple architecture
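A short illustration of these capabilities with the Delta APIs (the path and the driver_id column are hypothetical):

from delta.tables import DeltaTable

# A Delta table is Parquet files plus a transaction log
(results_df.write
 .format("delta")
 .mode("overwrite")
 .save("/mnt/formula1dl/demo/results"))

# ACID delete of individual records (e.g. a GDPR request)
delta_table = DeltaTable.forPath(spark, "/mnt/formula1dl/demo/results")
delta_table.delete("driver_id = 1")

# Time travel: read the data as it was before the delete
old_df = (spark.read
          .format("delta")
          .option("versionAsOf", 0)
          .load("/mnt/formula1dl/demo/results"))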
203. What is Azure Data Factory
A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
210. What is Azure Data Factory
Fully Managed Service
Serverless
Data Integration Service
Data Transformation Service
Data Orchestration Service
A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
211. What Azure Data Factory Is Not
Data Migration Tool
Data Streaming Service
Suitable for Complex Data Transformations
Data Storage Service
217. Unity Catalog
Unity Catalog is Databricks' unified solution for implementing data governance in the Data Lakehouse
219. Data Governance
Data governance is the process of managing the availability, usability, integrity and security of the data present in an enterprise
Ensures that the data is trustworthy and not misused
Controls access to the data for the users
Helps implement privacy regulations such as GDPR, CCPA etc.
221. Without Unity Catalog
[Diagram: each Databricks workspace has its own users/ groups, Hive metastore, compute and tables/ views, all pointing at the same data lake cloud storage (ADLS Gen2)]
222. Without Unity Catalog
[Diagram: user management, metastore and compute are duplicated in every workspace – Workspace 1, Workspace 2, … Workspace n]
223. With Unity Catalog
[Diagram: without Unity Catalog, each workspace in the Databricks account carries its own user management, metastore and compute; with Unity Catalog, user management and the metastore move up into Databricks Unity Catalog at the account level, and each workspace keeps only its compute]
226. Unity Catalog
Unity Catalog is Databricks' unified solution for implementing data governance in the Data Lakehouse
[Pillar diagram: Data Access Control, Data Audit, Data Lineage, Data Discoverability]
244. Unity Catalog Object Model
Metastore → Catalog → Schema (Database)
Catalogs are the top-level containers; Schemas are the next level container within Catalogs
Schemas and Databases are the same – use Schema instead of Database
246. Unity Catalog Object Model
A Schema contains Tables, Views and Functions; Tables are either Managed or External
Managed tables can only be Delta format
Stored in the default storage
Deleted data retained for 30 days
Benefit from automatic maintenance and performance improvements
247. Unity Catalog Object Model
Tables are referenced through the three-level namespace:
SELECT * FROM catalog.schema.table;
SELECT * FROM schema.table;
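The same three-level namespace from a notebook, with a hypothetical catalog demo and schema f1:

spark.sql("USE CATALOG demo")    # set the default catalog ...
spark.sql("USE SCHEMA f1")       # ... and the default schema

spark.sql("SELECT * FROM results").show()           # resolves to demo.f1.results
spark.sql("SELECT * FROM demo.f1.results").show()   # fully qualified always works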
258. Storage Credential/ External Location
Storage Credential: a credential object stored centrally in the metastore; used on the user's behalf to access cloud storage; created using a Managed Identity/ Service Principal; can be assigned access control policies
External Location: an object stored centrally in the metastore; combines the ADLS path with the storage credential; subject to access control policies
260. Accessing External Locations
[Diagram: a user's compute in the Databricks workspace goes through Unity Catalog, where the External Location combines the ADLS path with a Storage Credential; the credential uses the Access Connector for Databricks (a managed identity/ service principal) that holds the Storage Blob Data Contributor role on the ADLS storage container of the external data lake (ADLS Gen2)]
Setup steps:
Create Azure Data Lake Storage
Create Access Connector for Databricks
Assign the Storage Blob Data Contributor role
Create Storage Credential
Create External Location
263. Creating Storage Credential
265. Creating External Location
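With the storage credential in place, the external location itself can be created in SQL; a sketch with hypothetical names (the credential f1_credential is assumed to exist already, created from the Access Connector):

spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS f1_ext_location
    URL 'abfss://demo@formula1dlext.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL f1_credential)
""")

spark.sql("SHOW EXTERNAL LOCATIONS").show()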
279. Unity Catalog – Key capabilities
Unity Catalog
Data Discovery
Data Audit
Data Lineage
Data Access Control
280. Unity Catalog – Data Discoverability
[Pillar diagram – Data Access Control, Data Audit, Data Lineage, Data Discoverability – with Data Discoverability highlighted]
282. Unity Catalog – Data Audit
[Pillar diagram with Data Audit highlighted]
284. Unity Catalog – Data Lineage
[Pillar diagram with Data Lineage highlighted]
286. Data Lineage
Data Lineage is the process of following/ tracking the journey of data within a pipeline:
What is the origin?
How has it changed/ transformed?
What is the destination?
288. Data Lineage - Benefits
Root cause analysis
Better compliance
Trust & Reliability
Improved impact analysis
289. Data Lineage - Limitations
Available only for the last 30 days
Limited column level lineage
Only available for tables registered in Unity Catalog Metastore
290. Unity Catalog – Data Access Control
[Pillar diagram with Data Access Control highlighted]
293. Unity Catalog - Security Model
Identities (Users, Service Principals, Groups) are secured against Securable Objects
294. Security Model – Role Based Access
Roles: Account Admin, Metastore Admin, Object Owner, Object User
295. Security Model – Access Control List
Privileges granted on securable objects: CREATE, USAGE, SELECT, …
302. Access Control List – Inheritance Model
Privileges granted on an object are inherited by every object contained within it.
Metastore: CREATE CATALOG
Catalog: USE CATALOG, CREATE SCHEMA, ALL PRIVILEGES
Schema (Database): USE SCHEMA, CREATE TABLE/ FUNCTION, SELECT, MODIFY, EXECUTE, ALL PRIVILEGES
Table: SELECT, MODIFY
View: SELECT
Function: EXECUTE
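Illustrative grants at each level (the catalog, schema, table and group names are hypothetical):

# A group needs USE CATALOG and USE SCHEMA on the containers
# before SELECT on a table inside them takes effect
spark.sql("GRANT USE CATALOG ON CATALOG demo TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA demo.f1 TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE demo.f1.results TO `data_engineers`")

# Or grant everything below an object in one statement
spark.sql("GRANT ALL PRIVILEGES ON CATALOG demo TO `admins`")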
303. Security Model – Access Control List
Storage Credential: CREATE EXTERNAL LOCATION, CREATE EXTERNAL TABLE, READ FILES, WRITE FILES, ALL PRIVILEGES
External Location: CREATE EXTERNAL TABLE, CREATE MANAGED STORAGE, READ FILES, WRITE FILES, ALL PRIVILEGES
308. Document History
Version | Date | Description
1 | July 2021 | Initial Version
2 | Dec 2022 | Sections 3 to 5 updated for UI changes
3 | Mar 2023 | Revamped Databricks Mounts and Access to ADLS sections as a whole
4 | May 2023 | Addition of Unity Catalog