Data Engineering:
A Deep Dive into
Databricks
Presenter:
Mohika Rastogi
Sant Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode, feel free to move out of session
in case you need to attend an urgent call.
 Avoid Disturbance
Avoid unwanted chit chat during the session.
1. Introduction
o What is Data Engineering
o Data Engineer vs Analyst vs Scientist
2. Central Repository
o Data Warehouse
o Data Lake
o Data Lakehouse
3. Databricks
o What is Databricks?
o Use cases
o Managed Integration
o Delta Lake
o Delta Sharing
4. Apache Spark
5. Databricks Workspace
o Workspace Terminologies
6. Demo
Data Engineering
 Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale.
 Data engineering is the complex task of making raw data usable to data scientists and groups
within an organization.
Data Engineer vs Data Analyst vs Data Scientist
Data Scientist
A data scientist is someone who
uses their knowledge of statistics,
machine learning, and
programming to extract meaning
from data. They use their skills to
solve complex problems, identify
trends, and make predictions.
Data Analyst
A data analyst is someone who
collects, cleans, and analyzes
data to help businesses make
better decisions. They use
their skills to identify patterns
in data, and to create reports and
visualizations that help others
understand the data.
Data Engineer
A data engineer is someone who
builds and maintains the systems
that data scientists and data
analysts use to collect, store, and
analyze data. They use their
skills to design and build
data pipelines, and to ensure
that data is stored in a secure
and efficient way.
Central Repositories
Data Warehouse
A data warehouse is a central repository of business
data stored in a structured format to help organizations
gain insights. The schema must be known before writing
data into a warehouse (schema-on-write).
Data Lake
A data lake is a large store that can hold structured,
semi-structured, and raw data. The schema of the data
need not be known up front, as a data lake is schema-on-read.
Data Lakehouse
 A data lakehouse is a relatively new architecture that combines the best of both worlds:
data warehouses and data lakes.
 It serves as a single platform for data warehousing and data lakes, offering data management
features such as ACID transactions from the warehouse side and low-cost storage
like a data lake.
Databricks
A unified, open analytics platform for
building, deploying, sharing, and
maintaining enterprise-grade data,
analytics, and AI solutions at scale.
Databricks
 An Interactive Analytics platform that enables Data Engineers, Data Scientists, and Businesses to
collaborate and work closely on notebooks, experiments, models, data, libraries, and jobs.
 Databricks was founded by the creators of Apache Spark in 2013.
 A one-stop product for all data requirements, such as storage and analysis.
 Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
What is Databricks used for?
The Databricks workspace provides a unified interface and tools for most data tasks, including:
 Data processing workflow scheduling and management
 Working in SQL
 Generating dashboards and visualizations
 Data ingestion
 Managing security, governance, and HA/DR
 Data discovery, annotation, and exploration
 Compute management
 Machine learning (ML) modeling and tracking
 ML model serving
 Source control with Git
Databricks for Data Engineering
 Simplified data ingestion
 Automated ETL processing
 Reliable workflow orchestration
 End-to-end observability and monitoring
 Next-generation data processing engine
 Foundation of governance, reliability and performance
Databricks excels in data engineering with
its unified platform, leveraging Apache
Spark for efficient processing and
scalability.
Managed integration with open source
The following technologies are open source projects created by Databricks employees:
 Delta Lake
− Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks
Lakehouse Platform.
 Delta Sharing
− An open standard for secure data sharing.
 Apache Spark
− Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on
single-node machines or clusters.
 MLflow
− MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment,
and a central model registry.
Delta Lake
 Delta Lake is the default storage format for all operations on Databricks.
 Delta Lake is open source software that extends Parquet data files with a file-based transaction
log for ACID transactions and scalable metadata handling.
 Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration
with Structured Streaming, allowing you to easily use a single copy of data for both batch and
streaming operations and providing incremental processing at scale.
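The file-based transaction log is the mechanism behind Delta Lake's ACID guarantees: each commit is written as the next numbered JSON file of actions, and readers replay the log in order to reconstruct which data files are currently part of the table. A stdlib-only toy sketch of that idea (not Delta Lake itself; the action format here is deliberately simplified):

```python
import json
import os
import tempfile

# Toy illustration of a file-based transaction log, in the spirit of Delta
# Lake's _delta_log/ directory. NOT Delta Lake itself: real commits carry
# richer actions (add/remove with stats, metadata, protocol versions).

def commit(log_dir, actions):
    """Append one commit as the next zero-padded, numbered JSON file."""
    version = len(os.listdir(log_dir))  # next commit number
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return version

def current_files(log_dir):
    """Replay all commits in order to find the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live

log_dir = tempfile.mkdtemp()
commit(log_dir, [{"add": "part-0000.parquet"}])
commit(log_dir, [{"add": "part-0001.parquet"}])
commit(log_dir, [{"remove": "part-0000.parquet"}, {"add": "part-0002.parquet"}])
print(sorted(current_files(log_dir)))  # ['part-0001.parquet', 'part-0002.parquet']
```

Because the log is append-only and commits are totally ordered, a reader always sees a consistent snapshot: either a commit's file exists and is replayed in full, or it is not visible at all.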
Delta Sharing
 Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to
share data with other organizations regardless of which computing platforms they use.
 Databricks and the Linux Foundation developed Delta Sharing to provide the first open source
approach to data sharing across data, analytics and AI. Customers can share live data across
platforms, clouds and regions with strong security and governance.
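In practice, a provider gives a recipient a small JSON "profile" file containing the sharing server endpoint and a bearer token; any Delta Sharing client can then read the shared tables from it. A sketch of such a profile (the field names follow the open Delta Sharing protocol, while the endpoint and token below are placeholders, not real credentials):

```python
import json
import os
import tempfile

# Sketch of a Delta Sharing recipient profile. Field names follow the open
# protocol; the endpoint URL and bearer token here are placeholders.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token-issued-by-the-data-provider>",
}

# The provider hands this small file to the recipient, who points a Delta
# Sharing client (e.g. the delta-sharing Python connector) at it.
profile_path = os.path.join(tempfile.mkdtemp(), "example.share")
with open(profile_path, "w") as f:
    json.dump(profile, f, indent=2)
```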
Apache Spark
 Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop
MapReduce model and extends it to efficiently support more types of computation, including
interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
 PySpark: PySpark is an interface for Apache Spark in Python. With PySpark, you can write
Python and SQL-like commands to manipulate and analyze data in a distributed processing
environment.
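Part of what makes Spark fast is lazy evaluation: transformations such as map and filter only describe a computation, and nothing runs until an action such as collect forces it, which lets Spark plan and pipeline the work. A stdlib-only toy sketch of that model (not PySpark itself):

```python
# Stdlib-only sketch of Spark's lazy transformation/action model.
# NOT PySpark: a real RDD is partitioned across a cluster; this toy
# version only demonstrates that transformations are deferred.

class ToyRDD:
    def __init__(self, data):
        self._data = data  # in Spark this would be a distributed dataset

    def map(self, fn):
        # Transformation: returns a new lazy dataset; nothing is computed yet.
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: also lazy, just chains another generator.
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces the whole chained pipeline to actually execute.
        return list(self._data)

rdd = ToyRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The equivalent PySpark pipeline would look much the same, e.g. `sc.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()`, but executed in parallel across the cluster's partitions.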
Databricks Workspace
The Databricks workspace is an environment for accessing all of your Databricks assets.
The workspace organizes objects such as notebooks, libraries, and experiments
into folders, and provides access to data and computational resources such as
clusters and jobs.
The Databricks workspace can be managed using:
1. Workspace UI
2. Databricks CLI
3. Databricks REST API
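The REST API route can be sketched with nothing but the standard library. The `/api/2.0/workspace/list` endpoint is part of the Databricks Workspace API; the host and token below are placeholders for your own deployment:

```python
import json
import urllib.request

# Sketch of managing the workspace over the Databricks REST API.
# Host and token are placeholders: substitute your workspace URL and a
# personal access token.
host = "https://example.cloud.databricks.com"
token = "<personal-access-token>"

req = urllib.request.Request(
    f"{host}/api/2.0/workspace/list?path=/Users",
    headers={"Authorization": f"Bearer {token}"},
)

# Actually sending the request requires a live workspace and valid token:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["objects"])
```

The same operation is available as `databricks workspace list` in the CLI, which wraps this API.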
Databricks Workspace Terminology
Cluster
A cluster is a set of computational resources and configurations on which an
organization's data engineering workloads are run.
Jobs
A job is a way of running a notebook or a JAR, either immediately or on a
scheduled basis.
Hive Metastore
Every Databricks deployment has a central Hive metastore, accessible by all
clusters, to persist table metadata.
Notebooks
A notebook is a web-based interface composed of a group of cells that allow
you to execute code.
DBFS
DBFS is a distributed file system mounted into each Databricks workspace.
DBFS contains directories, which in turn contain data files, libraries, and
other directories.
Delta Table
By default, all tables created in Databricks are Delta tables. Delta tables
are based on the Delta Lake open source project.