Data Engineering:
A Deep Dive into
Databricks
Presenter:
Mohika Rastogi
Sant Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode, feel free to move out of session
in case you need to attend an urgent call.
 Avoid Disturbance
Avoid unwanted chit chat during the session.
1. Introduction
o What is Data Engineering
o Data Engineer vs Analyst vs Scientist
2. Central Repository
o Data Warehouse
o Data Lake
o Data Lakehouse
3. Databricks
o What is Databricks?
o Use cases
o Managed Integration
o Delta Lake
o Delta Sharing
4. Apache Spark
5. Databricks Workspace
o Workspace Terminologies
6. Demo
Data Engineering
 Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale.
 Data engineering is the complex task of making raw data usable to data scientists and groups
within an organization.
Data Engineer vs Data Analyst vs Data Scientist
Data Scientist
A data scientist is someone who
uses their knowledge of statistics,
machine learning, and
programming to extract meaning
from data. They use their skills to
solve complex problems, identify
trends, and make predictions.
Data Analyst
A data analyst is someone who
collects, cleans, and analyzes
data to help businesses make
better decisions. They use
their skills to identify patterns
in data, and to create reports and
visualizations that help others
understand the data.
Data Engineer
A data engineer is someone who
builds and maintains the systems
that data scientists and data
analysts use to collect, store, and
analyze data. They use their
skills to design and build
data pipelines, and to ensure
that data is stored in a secure
and efficient way.
Central Repositories
Data Warehouse
A data warehouse is a central repository of business
data stored in a structured format to help organizations
gain insights. The schema must be known before writing
data into a warehouse (schema-on-write).
Data Lake
A data lake is a large store that can hold structured,
semi-structured, and raw data. The schema of the data
need not be known up front, as a data lake is schema-on-read.
Data Lakehouse
 A data lakehouse is a relatively new architecture that combines the best of both worlds:
data warehouses and data lakes.
 It serves as a single platform for data warehousing and data lakes, offering data management
features such as ACID transactions from the warehouse side and low-cost storage
like a data lake.
Databricks
A unified, open analytics platform for
building, deploying, sharing, and
maintaining enterprise-grade data,
analytics, and AI solutions at scale.
Databricks
 An Interactive Analytics platform that enables Data Engineers, Data Scientists, and Businesses to
collaborate and work closely on notebooks, experiments, models, data, libraries, and jobs.
 Databricks was founded by the creators of Apache Spark in 2013.
 A one-stop product for all data requirements, such as storage and analysis.
 Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
What is Databricks used for?
The Databricks workspace provides a unified interface and tools for most data tasks, including:
 Data processing workflow scheduling and management
 Working in SQL
 Generating dashboards and visualizations
 Data ingestion
 Managing security, governance, and HA/DR
 Data discovery, annotation, and exploration
 Compute management
 Machine learning (ML) modeling and tracking
 ML model serving
 Source control with Git
Databricks for Data Engineering
 Simplified data ingestion
 Automated ETL processing
 Reliable workflow orchestration
 End-to-end observability and monitoring
 Next-generation data processing engine
 Foundation of governance, reliability and performance
Databricks excels in data engineering with
its unified platform, leveraging Apache
Spark for efficient processing and
scalability.
Managed integration with open source
The following technologies are open source projects created by Databricks employees:
 Delta Lake
− Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks
Lakehouse Platform.
 Delta Sharing
− An open standard for secure data sharing.
 Apache Spark
− Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on
single-node machines or clusters.
 MLflow
− MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment,
and a central model registry.
Delta Lake
 Delta Lake is the default storage format for all operations on Databricks.
 Delta Lake is open source software that extends Parquet data files with a file-based transaction
log for ACID transactions and scalable metadata handling.
 Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration
with Structured Streaming, allowing you to easily use a single copy of data for both batch and
streaming operations and providing incremental processing at scale.
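The file-based transaction log is the mechanism behind Delta Lake's ACID guarantees: each commit is written as the next numbered JSON file of actions, and readers replay the log in order to reconstruct which data files are currently part of the table. A stdlib-only toy sketch of that idea (not Delta Lake itself; the action format here is deliberately simplified):

```python
import json
import os
import tempfile

# Toy illustration of a file-based transaction log, in the spirit of Delta
# Lake's _delta_log/ directory. NOT Delta Lake itself: real commits carry
# richer actions (add/remove with stats, metadata, protocol versions).

def commit(log_dir, actions):
    """Append one commit as the next zero-padded, numbered JSON file."""
    version = len(os.listdir(log_dir))  # next commit number
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return version

def current_files(log_dir):
    """Replay all commits in order to find the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live

log_dir = tempfile.mkdtemp()
commit(log_dir, [{"add": "part-0000.parquet"}])
commit(log_dir, [{"add": "part-0001.parquet"}])
commit(log_dir, [{"remove": "part-0000.parquet"}, {"add": "part-0002.parquet"}])
print(sorted(current_files(log_dir)))  # ['part-0001.parquet', 'part-0002.parquet']
```

Because the log is append-only and commits are totally ordered, a reader always sees a consistent snapshot: either a commit's file exists and is replayed in full, or it is not visible at all.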
Delta Sharing
 Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to
share data with other organizations regardless of which computing platforms they use.
 Databricks and the Linux Foundation developed Delta Sharing to provide the first open source
approach to data sharing across data, analytics and AI. Customers can share live data across
platforms, clouds and regions with strong security and governance.
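In practice, a provider gives a recipient a small JSON "profile" file containing the sharing server endpoint and a bearer token; any Delta Sharing client can then read the shared tables from it. A sketch of such a profile (the field names follow the open Delta Sharing protocol, while the endpoint and token below are placeholders, not real credentials):

```python
import json
import os
import tempfile

# Sketch of a Delta Sharing recipient profile. Field names follow the open
# protocol; the endpoint URL and bearer token here are placeholders.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token-issued-by-the-data-provider>",
}

# The provider hands this small file to the recipient, who points a Delta
# Sharing client (e.g. the delta-sharing Python connector) at it.
profile_path = os.path.join(tempfile.mkdtemp(), "example.share")
with open(profile_path, "w") as f:
    json.dump(profile, f, indent=2)
```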
Apache Spark
 Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop
MapReduce model and extends it to efficiently support more types of computation, including
interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
 PySpark: PySpark is an interface for Apache Spark in Python. With PySpark, you can write
Python and SQL-like commands to manipulate and analyze data in a distributed processing
environment.
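Part of what makes Spark fast is lazy evaluation: transformations such as map and filter only describe a computation, and nothing runs until an action such as collect forces it, which lets Spark plan and pipeline the work. A stdlib-only toy sketch of that model (not PySpark itself):

```python
# Stdlib-only sketch of Spark's lazy transformation/action model.
# NOT PySpark: a real RDD is partitioned across a cluster; this toy
# version only demonstrates that transformations are deferred.

class ToyRDD:
    def __init__(self, data):
        self._data = data  # in Spark this would be a distributed dataset

    def map(self, fn):
        # Transformation: returns a new lazy dataset; nothing is computed yet.
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: also lazy, just chains another generator.
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces the whole chained pipeline to actually execute.
        return list(self._data)

rdd = ToyRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The equivalent PySpark pipeline would look much the same, e.g. `sc.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()`, but executed in parallel across the cluster's partitions.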
Databricks Workspace
The Databricks workspace is an environment for accessing all of your Databricks assets.
The workspace organizes objects such as notebooks, libraries, and experiments
into folders, and provides access to data and computational resources such as
clusters and jobs.
The Databricks workspace can be managed using:
1. Workspace UI
2. Databricks CLI
3. Databricks REST API
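The REST API route can be sketched with nothing but the standard library. The `/api/2.0/workspace/list` endpoint is part of the Databricks Workspace API; the host and token below are placeholders for your own deployment:

```python
import json
import urllib.request

# Sketch of managing the workspace over the Databricks REST API.
# Host and token are placeholders: substitute your workspace URL and a
# personal access token.
host = "https://example.cloud.databricks.com"
token = "<personal-access-token>"

req = urllib.request.Request(
    f"{host}/api/2.0/workspace/list?path=/Users",
    headers={"Authorization": f"Bearer {token}"},
)

# Actually sending the request requires a live workspace and valid token:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["objects"])
```

The same operation is available as `databricks workspace list` in the CLI, which wraps this API.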
Databricks Workspace Terminology
Cluster
A cluster is a set of computational resources and configurations on which an
organization's data engineering workloads are run.
Jobs
A job is a way of running a notebook or a JAR, either immediately or on a
scheduled basis.
Hive Metastore
Every Databricks deployment has a central Hive metastore, accessible by all
clusters, to persist table metadata.
Notebooks
A notebook is a web-based interface composed of a group of cells that allow
you to execute code.
DBFS
DBFS is a distributed file system mounted into each Databricks workspace.
DBFS contains directories, which in turn contain data files, libraries, and
other directories.
Delta Table
By default, all tables created in Databricks are Delta tables. Delta tables
are based on the Delta Lake open source project.