We will explore how to use the Databricks Lakehouse Platform to productionize ETL pipelines. Along the way, we will learn how to use Delta Live Tables with Spark SQL and PySpark to define and schedule pipelines that incrementally process new data from a variety of sources into the Lakehouse, how to orchestrate tasks with Databricks Workflows, and how to promote code with Databricks Repos.
KnolX Etiquettes
A lack of etiquette and manners is a huge turn-off.
Punctuality
Join the session five minutes before the start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for every session; it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session if you need to take an urgent call.
Avoid Disturbance
Avoid unnecessary chit-chat during the session.
Agenda
1. Get Started with Databricks Data Science and Engineering
2. Transform Data with Spark (SQL/PySpark)
3. Manage Data with Delta Lake
4. Build Data Pipelines with Delta Live Tables (SQL/PySpark)
5. Deploy Workloads with Databricks Workflows
6. Manage Data Access for Analytics with Unity Catalog
Compute Resources: Cluster Configuration
Single Node
Needs only one VM instance, which hosts the driver; there are no worker instances in this configuration.
Multi Node (Standard)
General-purpose configuration consisting of one VM instance hosting the driver and at least one additional instance for the workers.
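As a rough illustration of the two configurations, here is a minimal sketch that creates a single-node and a multi-node cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and node type are placeholder assumptions; substitute the values that apply to your workspace.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption: your workspace URL
TOKEN = "<personal-access-token>"                        # assumption: a valid PAT
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Single node: no workers; the driver runs Spark locally.
single_node = {
    "cluster_name": "single-node-demo",
    "spark_version": "13.3.x-scala2.12",   # assumption: any supported runtime
    "node_type_id": "i3.xlarge",           # assumption: a node type in your cloud
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# Multi node (standard): one driver plus at least one worker.
multi_node = {
    "cluster_name": "multi-node-demo",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

for config in (single_node, multi_node):
    resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=config)
    print(resp.json())  # returns {"cluster_id": "..."} on success
```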
Types of Cluster
All-Purpose Clusters
An all-purpose cluster is started and terminated manually and can be shared by multiple users and jobs.
Used for interactive, collaborative work such as ad hoc analysis, data exploration, and development.
Job Clusters
A job cluster is created when a job or task starts and terminated when the job or task ends.
Used specifically for running automated workloads.
Delta Lake
Delta Lake is an open-source storage layer that enables building a data lakehouse on top of existing cloud object storage.
Delta Lake brings ACID transactions to object storage.
Delta Lake is the default format for tables created in Databricks.
ACID Properties of Delta Lake
Atomicity
Consistency
Isolation
Durability
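Because Delta is the default table format on Databricks, creating a Delta table needs no special options. The following PySpark sketch (the table and column names are made up for illustration) writes a DataFrame as a Delta table and reads it back through the same transactional log:

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` is provided for you; the builder is only needed elsewhere.
spark = SparkSession.builder.getOrCreate()

# Hypothetical data: two columns, a handful of rows.
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],
    ["id", "name"],
)

# Delta is the default format, so saveAsTable creates a Delta table.
df.write.mode("overwrite").saveAsTable("default.users")

# Reads go through the same ACID-compliant transaction log.
spark.table("default.users").show()
```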
Introducing Delta Live Tables
Operate with agility: declarative tools to build batch and streaming data pipelines.
Trust your data: DLT has built-in declarative quality controls; declare quality expectations and the actions to take when they are violated.
Scale with reliability: easily scale infrastructure alongside your data.
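To make "declarative" concrete, here is a minimal DLT pipeline sketch in PySpark. The `dlt` decorators and the `expect_or_drop` expectation are the DLT Python API; the source path, table names, and columns are assumptions for illustration:

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw files with Auto Loader.
# `spark` is provided by the DLT runtime.
@dlt.table(comment="Raw orders ingested incrementally")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/demo/orders")  # assumption: your landing location
    )

# Silver: a declarative quality expectation; rows failing it are dropped.
@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read_stream("orders_raw").withColumn(
        "ingested_at", F.current_timestamp()
    )
```

DLT infers the dependency between the two tables from the `dlt.read_stream` call, so there is no separate scheduling code to write.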
Databricks Workflows
A fully managed, cloud-based task orchestration service designed specifically for the Lakehouse.
Workflows caters to data engineers, data scientists, and analysts, empowering them to build dependable workflows for data, analytics, and AI across cloud platforms.
Databricks has two main task orchestration services:
Workflows
Delta Live Tables (DLT)
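As a sketch of what orchestration looks like in practice, the snippet below submits a two-task job through the Jobs API 2.1, where the second task depends on the first and each runs on a job cluster created for the run. The notebook paths, names, and node types are placeholder assumptions:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                        # assumption

job_cluster = {  # a job cluster: created when the task starts, terminated when it ends
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

job_config = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/ingest"},  # assumption
            "new_cluster": job_cluster,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "notebook_task": {"notebook_path": "/Repos/demo/transform"},
            "new_cluster": job_cluster,
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_config,
)
print(resp.json())  # {"job_id": ...} on success
```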
DLT vs Workflow Jobs
Source: DLT supports notebooks only; Workflow Jobs support JARs, notebooks, DLT pipelines, and applications written in Scala, Java, or Python.
Dependencies: automatically determined by DLT; manually set in Workflow Jobs.
Cluster: self-provisioned in DLT; self-provisioned or existing in Workflow Jobs.
Timeouts and retries: not supported in DLT; supported in Workflow Jobs.
Import libraries: not supported in DLT; supported in Workflow Jobs.
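To illustrate the timeouts-and-retries row, a Workflow Job task can carry retry and timeout settings directly in its definition. The field names below come from the Jobs API task spec; the values and notebook path are arbitrary examples:

```python
# Per-task resilience settings available in Workflow Jobs (not in DLT):
task = {
    "task_key": "transform",
    "notebook_task": {"notebook_path": "/Repos/demo/transform"},  # assumption
    "timeout_seconds": 3600,             # fail the task if it runs longer than an hour
    "max_retries": 2,                    # retry up to twice on failure
    "min_retry_interval_millis": 60000,  # wait a minute between attempts
    "retry_on_timeout": False,           # do not retry when the timeout itself fires
}
```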