SlideShare a Scribd company logo
1 of 112
Download to read offline
©2022 Databricks Inc. — All rights reserved
Apache Spark
Programming
With Databricks
1
©2022 Databricks Inc. — All rights reserved
2
Welcome
©2022 Databricks Inc. — All rights reserved 3
Course Agenda
Here’s where we’re headed
DataFrames Introduction
Databricks
Platform
Spark SQL
Reader
& Writer
DataFrame
& Column
Transformations Aggregation Datetimes
Complex
Types
Additional
Functions
User-Defined
Functions
Performance
Query
Optimization
Partitioning
Spark
Architecture
Structured Streaming
and Delta
Streaming
Query
Delta Lake
Aggregating
Streams
©2022 Databricks Inc. — All rights reserved 4
Lesson Objectives
By the end of this course, you should be able to:
Identify core features of Spark and Databricks
Describe how DataFrames are created and evaluated in Spark
Apply the DataFrame API to process and analyze data
Demonstrate how Spark is optimized and executed on a cluster
Apply Delta Lake and Structured Streaming to process data
1
2
3
4
5
©2022 Databricks Inc. — All rights reserved
5
Introductions
Module 1 Apache Spark Programming with Databricks
©2022 Databricks Inc. — All rights reserved
Welcome!
Let’s get to know you
▪ Name
▪ Role and team
▪ Programing experience
▪ Motivation for attending
▪ Personal interest
©2022 Databricks Inc. — All rights reserved
Module 2 Apache Spark Programming with Databricks
Spark Core
©2022 Databricks Inc. — All rights reserved
Spark Core
Databricks Ecosystem
Spark Overview
Spark SQL
Reader & Writer
DataFrame & Column
8
©2022 Databricks Inc. — All rights reserved
9
Databricks
Ecosystem
©2022 Databricks Inc. — All rights reserved
10
10
Lakehouse
One simple platform to unify all of
your data, analytics, and AI workloads
Customers
Original creators of:
5000+
across the globe
©2021 Databricks Inc. — All rights reserved
Data
Warehouse
Data
Lake
Lakehouse
One platform to unify all of
your data, analytics, and AI
workloads
©2021 Databricks Inc. — All rights reserved
Data
Warehouse
Data
Lake
An open approach to bringing
data management and
governance to data lakes
Better reliability with transactions
48x faster data processing with
indexing
Data governance at scale with
fine-grained access control lists
©2021 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Simple
Platform Security & Administration
Open Data Lake
Data Management and Governance
Data
Engineering
BI and SQL
Analytics
Data Science
and ML
Real-Time Data
Applications
Databricks Lakehouse Platform
Open
Collaborative
✓
✓
✓
Unstructured, semi-structured, structured, and streaming data
©2021 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Simple
Platform Security & Administration
Open Data Lake
Data Management and Governance
Data
Engineering
BI and SQL
Analytics
Data Science
and ML
Real-Time Data
Applications
Databricks Lakehouse Platform
✓
Unstructured, semi-structured, structured, and streaming data
Unify your data, analytics,
and AI on one common
platform for all data use
cases
©2021 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Open
✓
Unify your data ecosystem
with open source standards
and formats.
Built on the innovation of
some of the most
successful open source
data projects in the world
30 Million+
Monthly downloads
©2021 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Open
✓
Unify your data ecosystem
with open source standards
and formats.
Partners across the
data landscape
450+
Azure Data
Factory
Visual ETL & Data Ingestion
Data Providers
Amazon
Redshift
Azure
Synapse
Lakehouse Platform
Business Intelligence
Google
BigQuery
Amazon
SageMaker
Azure Machine
Learning
Machine Learning
Google
AI Platform
AWS
Glue
Centralized Governance
Top Consulting & SI Partners
©2021 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Collaborative
✓
Unify your data teams to
collaborate across the
entire data and AI workflow
Data Analysts
Data Engineers Data Scientists
Models
Dashboards
Notebooks
Datasets
©2022 Databricks Inc. — All rights reserved
TPC-DS
Databricks SQL set official data warehousing performance record -
outperformed the previous record by 2.2x.
©2022 Databricks Inc. — All rights reserved 19
Spark Overview
©2022 Databricks Inc. — All rights reserved 20
• De-facto standard unified analytics engine for big
data processing
• Largest open-source project in data processing
• Technology created by the founders of Databricks
©2022 Databricks Inc. — All rights reserved
Spark Benefits
Fast Easy to Use Unified
©2022 Databricks Inc. — All rights reserved
Spark API
Spark SQL +
DataFrames
Streaming MLlib
SQL Python Scala Java
R
Spark Core API
©2022 Databricks Inc. — All rights reserved
Spark
application
Spark Execution
Job
Job
Job
Stage 1
Stage 2
Task 1
Task 2
©2022 Databricks Inc. — All rights reserved
Driver
Worker Worker Worker Worker
Executor Executor Executor Executor
Core Core Core Core Core Core Core Core
Spark Cluster
Task Task Task Task Task
©2022 Databricks Inc. — All rights reserved
Bonus: Magic commands in Notebook cells
Magic commands allow you to override default languages as well as a few
auxiliary commands that run utilities/commands. For example:
1. %python, %r, %scala, %sql Switch languages in a command cell
2. %sh Run shell code (runs only on Spark Driver, and not the Workers)
3. %fs Shortcut for dbutils filesystem commands
4. %md Markdown for styling the display
5. %run Execute a remote Notebook from a Notebook
6. %pip Install new Python libraries
©2022 Databricks Inc. — All rights reserved
CSV
Comma Separated Values
item_id, name, price, qty
M_PREM_Q, Premium Queen Mattress, 1795, 35
M_STAN_F, Standard Full Mattress, 945, 24
M_PREM_F, Premium Full Mattress, 1695, 45
M_PREM_T, Premium Twin Mattress, 1095, 18
©2022 Databricks Inc. — All rights reserved
Parquet
Name Score ID
Kit 4.2 1 Alex 4.5 2
Kit Alex Terry 4.2 4.5 4.1
Terry 4.1 3
1 2 3
Row-Oriented data on disk
Column-Oriented data on disk
A columnar storage format
©2022 Databricks Inc. — All rights reserved
Technology designed to be used with Apache Spark to build robust data lakes
Delta Lake
©2022 Databricks Inc. — All rights reserved
Open-source Storage Layer
©2022 Databricks Inc. — All rights reserved
Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with Apache Spark API
©2022 Databricks Inc. — All rights reserved
Module 3 Apache Spark Programming with Databricks
Functions
©2022 Databricks Inc. — All rights reserved
Functions
Aggregation
Datetimes
Complex Types
Additional Functions
User-Defined Functions
32
©2022 Databricks Inc. — All rights reserved
Module 4 Apache Spark Programming with Databricks
Performance
©2022 Databricks Inc. — All rights reserved
Spark Architecture
Query Optimization
Partitioning
Performance
34
©2022 Databricks Inc. — All rights reserved
35
Spark
Architecture
Spark Cluster
Spark Execution
©2022 Databricks Inc. — All rights reserved
Scenario 1: Filter out brown pieces from candy
bags
©2022 Databricks Inc. — All rights reserved
Driver
Executo
r
Core
Cluster
©2022 Databricks Inc. — All rights reserved
Partitio
n
Data
©2022 Databricks Inc. — All rights reserved
We need filter out brown pieces
from these candy bags
©2022 Databricks Inc. — All rights reserved
A B C D E F
G H I J K L
1
2 3
4
5
8
12
9
10
11
6 7
©2022 Databricks Inc. — All rights reserved
1
2 3
4
5
8
12
9
10
11
6 7
Student A get bag # 1
Student B get bag # 2
Student C get bag # 3
...
A B C D E F
G H I J K L
©2022 Databricks Inc. — All rights reserved
A
B C D
E
F
G
I
L
K
J H
©2022 Databricks Inc. — All rights reserved
A B C D E F
G H I J K L
©2022 Databricks Inc. — All rights reserved
Eliminate the brown candy pieces
and pile the rest in the corner
A B C D E F
G H I J K L
©2022 Databricks Inc. — All rights reserved
Eliminate the brown candy pieces
and pile the rest in the corner
A B C D E F
G H I J K L
G H I J K L
©2022 Databricks Inc. — All rights reserved
A
B C D
E
F
G I K L
G
H
I
J
K L
©2022 Databricks Inc. — All rights reserved
B C D F
G I K L
G I K L
A E
H J
©2022 Databricks Inc. — All rights reserved
B C D F
G I K L
G I K L
A E
H J
Students A, E, H, J,
get bags 13, 14, 15, 16
©2022 Databricks Inc. — All rights reserved
B
C
D
F
G I
K
L
A
E
H
J
©2022 Databricks Inc. — All rights reserved
C
B D F
G I K L
A E
H J
©2022 Databricks Inc. — All rights reserved
B D F
G I K L
A E
H J
C
©2022 Databricks Inc. — All rights reserved
B D F
G I K L
A
E
H
J
C
©2022 Databricks Inc. — All rights reserved
A B C D E F
G H I J K L
All done!
©2022 Databricks Inc. — All rights reserved
Scenario 2: Count total pieces in candy bags
©2022 Databricks Inc. — All rights reserved
Stage 1: Local Count
©2022 Databricks Inc. — All rights reserved
We need to count the total pieces
in these candy bags
A B C D E F
G H I J K L
Stage 1: Local Count
©2022 Databricks Inc. — All rights reserved
Students B, E, I, L,
get these four bags
A B C D E F
G H I J K L
Stage 1: Local Count
©2022 Databricks Inc. — All rights reserved
A D F
G H J K
B
I L
Stage 1: Local Count
C E
©2022 Databricks Inc. — All rights reserved
A D F
G H J K
B
I L
Stage 1: Local Count
C E
©2022 Databricks Inc. — All rights reserved
Students B, E, I, L,
commit your findings
A D F
G H J K
5
4 5
B
I L
6
E
C
5 6
4 5
Stage 1: Local Count
©2022 Databricks Inc. — All rights reserved
5 6
4 5
Stage 1 is complete!
A D F
G H J K
B
I L
E
C
Stage 1: Local Count
©2022 Databricks Inc. — All rights reserved
Stage 2: Global
Count
©2022 Databricks Inc. — All rights reserved
5 6
4 5
Student G, fetch counts from
students B, E, I, L
A D F
G H J K
Stage 2: Global Count
B
I L
E
C
©2022 Databricks Inc. — All rights reserved
5
5
6
4
A D F
H J K
B
I L
E
C
G
Stage 2: Global Count
©2022 Databricks Inc. — All rights reserved
20
A D F
H J K
B
I L
E
C
G
Stage 2: Global Count
©2022 Databricks Inc. — All rights reserved
A D F
H J K
B
I L
E
C
20
G
Stage 2: Global Count
©2022 Databricks Inc. — All rights reserved
Stage 2: Global Count
Stage 1: Local Count
A D F
H J K
B
I L
E
C
20
G
5 6
4 5
A D F
G H J K
B
I L
E
C
©2022 Databricks Inc. — All rights reserved
68
Query
Optimization
Catalyst Optimizer
Adaptive Query Execution
©2022 Databricks Inc. — All rights reserved
Query Optimization
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
COST BASED OPTIMIZATION
WHOLE-STAGE
CODE GENERATION
Metadata
Catalog
Cost
Model
Selected
Physical Plan
Physical
Plans
Physical
Plans
Physical
Plans
Query
Catalyst
Catalog
ANALYSIS
LOGICAL OPTIMIZATION
PHYSICAL
PLANNING
RDDs
©2022 Databricks Inc. — All rights reserved
Query Optimization with AQE
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Physical
Plans
COST BASED OPTIMIZATION
WHOLE-STAGE
CODE GENERATION
Metadata
Catalog
Cost
Model
Selected
Physical Plan
Physical
Plans
Physical
Plans
Query
Catalyst
Catalog
ANALYSIS
LOGICAL OPTIMIZATION
PHYSICAL
PLANNING
RDDs
Runtime Statistics
ADAPTIVE QUERY EXECUTION
New in Spark 3.0, enabled by default as of Spark 3.2
©2022 Databricks Inc. — All rights reserved
Module 5 Apache Spark Programming with Databricks
Structured Streaming
©2022 Databricks Inc. — All rights reserved
Streaming Query
Stream Aggregations
Structured
Streaming
72
©2022 Databricks Inc. — All rights reserved
73
Streaming Query
Advantages
Use Cases
Sources
©2022 Databricks Inc. — All rights reserved
Stream Processing
Batch
Processing
©2022 Databricks Inc. — All rights reserved
Advantages of Stream Processing
Lower latency
Automatic bookkeeping on new data
Efficient Updates
©2022 Databricks Inc. — All rights reserved
Stream Processing Use Cases
Incremental ETL
Real-time
reporting
Online ML
Update data to
serve in
real-time
Real-time
decision making
Notifications
©2022 Databricks Inc. — All rights reserved
Micro-Batch Processing
©2022 Databricks Inc. — All rights reserved
Micro-Batch Processing
©2022 Databricks Inc. — All rights reserved
Structured Streaming
Micro-batches
= new rows
appended to
unbounded table
©2022 Databricks Inc. — All rights reserved
FOR TESTING
Input Sources
Sockets Generator
Files
Kafka Event Hubs
©2022 Databricks Inc. — All rights reserved
Sinks
FOR DEBUGGING
Console Memory
Foreach
Files
Kafka Event Hubs
©2022 Databricks Inc. — All rights reserved
Output Modes
APPEND
Add new records
only
UPDATE
Update changed
records in place
COMPLETE
Rewrite full output
©2022 Databricks Inc. — All rights reserved
Trigger Types
Default
Process each micro-batch as soon as the previous
one has been processed
Fixed interval
Micro-batch processing kicked off at the
user-specified interval
One-time
Process all of the available data as a single
micro-batch and then automatically stop the query
Continuous
Processing
Long-running tasks that continuously read,
process, and write data as soon events are
available
*Experimental See Structured Streaming Programming Guide
©2022 Databricks Inc. — All rights reserved
Guaranteed in Structured Streaming by
Checkpointing and write-ahead logs
Idempotent sinks
Replayable data sources
End-to-end fault tolerance
©2022 Databricks Inc. — All rights reserved
85
Stream
Aggregations
Aggregations
Windows
Watermarking
©2022 Databricks Inc. — All rights reserved
Errors in IoT data by device type
Real-time Aggregations
Anomalous behavior in server log files by country
Behavior analysis on messages by hashtags
©2022 Databricks Inc. — All rights reserved
No window overlap
Any given event gets
aggregated into only one
window group
e.g. 1:00–2:00 am, 2:00–3:00
am, 3:00-4:00 am, ...
Time-Based Windows
Windows overlap
Any given event gets
aggregated into multiple
window groups
e.g. 1:00-2:00 am, 1:30–2:30 am,
2:00–3:00 am, ...
Tumbling Windows Sliding Windows
©2022 Databricks Inc. — All rights reserved
Sliding Windows Example
©2022 Databricks Inc. — All rights reserved
Windowing
(streamingDF
.groupBy(col("device"),
window(col("time"), "1 hour"))
.count())
Why are we seeing 200
tasks for this stage?
©2022 Databricks Inc. — All rights reserved
Control the Shuffle Repartitioning
spark.conf.set("spark.sql.shuffle.partitions",spark.sparkContext.defaultParallelism)
©2022 Databricks Inc. — All rights reserved
Event-Time Processing
WATERMARKS
Handle late data and limit how
long to remember old data
EVENT-TIME DATA
Process based on event-time
(time fields embedded in data)
rather than receipt time
©2022 Databricks Inc. — All rights reserved
Handling Late Data and Watermarking
©2022 Databricks Inc. — All rights reserved
Watermarking
(streamingDF
.withWatermark("time", "2 hours")
.groupBy(col("device"),
window(col("time"), "1 hour"))
.count()
)
©2022 Databricks Inc. — All rights reserved
Module 6 Apache Spark Programming with Databricks
Delta Lake
©2022 Databricks Inc. — All rights reserved
Using Spark with Delta Lake
Delta
Lake
95
©2022 Databricks Inc. — All rights reserved
96
Delta Lake
Delta Lake Concepts
©2022 Databricks Inc. — All rights reserved
What is Delta Lake?
▪ Technology designed to be used with Apache Spark to build robust
data lakes
▪ Open source project at delta.io
▪ Databricks Delta Lake documentation
©2022 Databricks Inc. — All rights reserved
Delta Lake features
▪ ACID transactions on Spark
▪ Scalable metadata handling
▪ Streaming and batch unification
▪ Schema enforcement
▪ Time travel
▪ Upserts and deletes
▪ Fully configurable/optimizable
▪ Structured streaming support
©2022 Databricks Inc. — All rights reserved
Delta Lake components
Delta Lake
storage
layer
Delta tables Delta Engine
©2022 Databricks Inc. — All rights reserved
Delta Lake Storage Layer
▪ Highly performant and persistent
▪ Low-cost, easily scalable object storage
▪ Ensures consistency
▪ Allows for flexibility
©2022 Databricks Inc. — All rights reserved
Delta tables
▪ Contain data in Parquet files that are kept in object storage
▪ Keep transaction logs in object storage
▪ Can be registered in a metastore (optional)
©2022 Databricks Inc. — All rights reserved
Delta Engine
▪ File management optimizations
▪ Auto-optimized writes
▪ Performance optimization via Delta caching
Databricks edge feature; not available in OS Apache Spark
©2022 Databricks Inc. — All rights reserved
What is the Delta transaction log?
▪ Ordered record of the transactions performed on a Delta table
▪ Single source of truth for that table
▪ Mechanism that the Delta Engine uses to guarantee atomicity
©2022 Databricks Inc. — All rights reserved
How does the transaction log work?
▪ Delta Lake breaks operations down into one or more of these steps:
▪ Add file
▪ Remove file
▪ Update metadata
▪ Set transaction
▪ Change protocol
▪ Commit info
©2022 Databricks Inc. — All rights reserved
Delta transaction log at the file level
©2022 Databricks Inc. — All rights reserved
Adding commits to the transaction log
©2022 Databricks Inc. — All rights reserved
▪ Returns the metadata of an
existing table
▪ Ex. Column names, data
types, comments
▪ Returns a more complete set
of metadata for a Delta table
▪ Operation, user,
operation metrics
DESCRIBE HISTORY Command
DESCRIBE Command
©2022 Databricks Inc. — All rights reserved
Delta Lake Time Travel
▪ Query an older snapshot of a Delta table
▪ Re-creating analysis, reports or outputs
▪ Writing complex temporal queries
▪ Fixing mistakes in data
▪ Providing snapshot isolation
©2022 Databricks Inc. — All rights reserved
Time Travel SQL syntax
SELECT * FROM events TIMESTAMP AS OF timestamp_expression
SELECT * FROM events VERSION AS OF version
©2022 Databricks Inc. — All rights reserved
Time Travel SQL syntax
SELECT * FROM events TIMESTAMP AS OF timestamp_expression
▪ timestamp_expression can be:
▪ String that can be cast to a timestamp
▪ String that can be cast to a date
▪ Explicit timestamp type (the result of casting)
▪ Simple time or date expressions (see the Databricks
documentation)
©2022 Databricks Inc. — All rights reserved
Time Travel SQL syntax
SELECT * FROM events VERSION AS OF version
▪ Version can be obtained from the output of DESCRIBE HISTORY
events.
©2022 Databricks Inc. — All rights reserved
Thank you! Congrats!

More Related Content

Similar to apache-spark-programming-with-databricks.pdf

VisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyVisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyLeonid Nekhymchuk
 
Threat Modelling - It's not just for developers
Threat Modelling - It's not just for developersThreat Modelling - It's not just for developers
Threat Modelling - It's not just for developersMITRE ATT&CK
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeKent Graziano
 
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)Agile Data Engineering: Introduction to Data Vault 2.0 (2018)
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)Kent Graziano
 
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
Pi Day 2022 -  from IoT to MySQL HeatWave Database ServicePi Day 2022 -  from IoT to MySQL HeatWave Database Service
Pi Day 2022 - from IoT to MySQL HeatWave Database ServiceFrederic Descamps
 
Fun with Fabric in 15
Fun with Fabric in 15Fun with Fabric in 15
Fun with Fabric in 15Neo4j
 
Idera live 2021: Managing Digital Transformation on a Budget by Bert Scalzo
Idera live 2021:  Managing Digital Transformation on a Budget by Bert ScalzoIdera live 2021:  Managing Digital Transformation on a Budget by Bert Scalzo
Idera live 2021: Managing Digital Transformation on a Budget by Bert ScalzoIDERA Software
 
Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Ed Thewlis
 
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceOptimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceThousandEyes
 
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceEMEA Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceThousandEyes
 
Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfIlham31574
 
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceOptimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceThousandEyes
 
Intermediate Cypher.pdf
Intermediate Cypher.pdfIntermediate Cypher.pdf
Intermediate Cypher.pdfNeo4j
 
Kent-Graziano-Intro-to-Datavault_short.pdf
Kent-Graziano-Intro-to-Datavault_short.pdfKent-Graziano-Intro-to-Datavault_short.pdf
Kent-Graziano-Intro-to-Datavault_short.pdfabhaybansal43
 
ROAD TO NODES - Intro to Neo4j + NeoDash.pdf
ROAD TO NODES - Intro to Neo4j + NeoDash.pdfROAD TO NODES - Intro to Neo4j + NeoDash.pdf
ROAD TO NODES - Intro to Neo4j + NeoDash.pdfNeo4j
 
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data Applications
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data ApplicationsDaniel Myers (Snowflake) - Developer Journey_ The Evolution of Data Applications
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data ApplicationsTechsylvania
 
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...IDERA Software
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyesThousandEyes
 

Similar to apache-spark-programming-with-databricks.pdf (20)

VisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyVisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case study
 
Threat Modelling - It's not just for developers
Threat Modelling - It's not just for developersThreat Modelling - It's not just for developers
Threat Modelling - It's not just for developers
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)Agile Data Engineering: Introduction to Data Vault 2.0 (2018)
Agile Data Engineering: Introduction to Data Vault 2.0 (2018)
 
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
Pi Day 2022 -  from IoT to MySQL HeatWave Database ServicePi Day 2022 -  from IoT to MySQL HeatWave Database Service
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
 
Fun with Fabric in 15
Fun with Fabric in 15Fun with Fabric in 15
Fun with Fabric in 15
 
Idera live 2021: Managing Digital Transformation on a Budget by Bert Scalzo
Idera live 2021:  Managing Digital Transformation on a Budget by Bert ScalzoIdera live 2021:  Managing Digital Transformation on a Budget by Bert Scalzo
Idera live 2021: Managing Digital Transformation on a Budget by Bert Scalzo
 
Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)Relational Database to Apache Spark (and sometimes back again)
Relational Database to Apache Spark (and sometimes back again)
 
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceOptimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
 
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceEMEA Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
EMEA Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
 
Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid WorkforceOptimizing and Troubleshooting Digital Experience for a Hybrid Workforce
Optimizing and Troubleshooting Digital Experience for a Hybrid Workforce
 
Intermediate Cypher.pdf
Intermediate Cypher.pdfIntermediate Cypher.pdf
Intermediate Cypher.pdf
 
Kent-Graziano-Intro-to-Datavault_short.pdf
Kent-Graziano-Intro-to-Datavault_short.pdfKent-Graziano-Intro-to-Datavault_short.pdf
Kent-Graziano-Intro-to-Datavault_short.pdf
 
ROAD TO NODES - Intro to Neo4j + NeoDash.pdf
ROAD TO NODES - Intro to Neo4j + NeoDash.pdfROAD TO NODES - Intro to Neo4j + NeoDash.pdf
ROAD TO NODES - Intro to Neo4j + NeoDash.pdf
 
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data Applications
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data ApplicationsDaniel Myers (Snowflake) - Developer Journey_ The Evolution of Data Applications
Daniel Myers (Snowflake) - Developer Journey_ The Evolution of Data Applications
 
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...
Idera live 2021: Will Data Vault add Value to Your Data Warehouse? 3 Signs th...
 
Industrial IoT bootcamp
Industrial IoT bootcampIndustrial IoT bootcamp
Industrial IoT bootcamp
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 

Recently uploaded

如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一hwhqz6r1y
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxStephen266013
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证ppy8zfkfm
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一w7jl3eyno
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证ju0dztxtn
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...Amil baba
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeralNABLAS株式会社
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证dq9vz1isj
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Valters Lauzums
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一0uyfyq0q4
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...ssuserf63bd7
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunksgmuir1066
 

Recently uploaded (20)

如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
123.docx. .
123.docx.                                 .123.docx.                                 .
123.docx. .
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 

apache-spark-programming-with-databricks.pdf

  • 1. ©2022 Databricks Inc. — All rights reserved Apache Spark Programming With Databricks 1
  • 2. ©2022 Databricks Inc. — All rights reserved 2 Welcome
  • 3. ©2022 Databricks Inc. — All rights reserved 3 Course Agenda Here’s where we’re headed DataFrames Introduction Databricks Platform Spark SQL Reader & Writer DataFrame & Column Transformations Aggregation Datetimes Complex Types Additional Functions User-Defined Functions Performance Query Optimization Partitioning Spark Architecture Structured Streaming and Delta Streaming Query Delta Lake Aggregating Streams
  • 4. ©2022 Databricks Inc. — All rights reserved 4 Lesson Objectives By the end of this course, you should be able to: Identify core features of Spark and Databricks Describe how DataFrames are created and evaluated in Spark Apply the DataFrame API to process and analyze data Demonstrate how Spark is optimized and executed on a cluster Apply Delta Lake and Structured Streaming to process data 1 2 3 4 5
  • 5. ©2022 Databricks Inc. — All rights reserved 5 Introductions Module 1 Apache Spark Programming with Databricks
  • 6. ©2022 Databricks Inc. — All rights reserved Welcome! Let’s get to know you ▪ Name ▪ Role and team ▪ Programing experience ▪ Motivation for attending ▪ Personal interest
  • 7. ©2022 Databricks Inc. — All rights reserved Module 2 Apache Spark Programming with Databricks Spark Core
  • 8. ©2022 Databricks Inc. — All rights reserved Spark Core Databricks Ecosystem Spark Overview Spark SQL Reader & Writer DataFrame & Column 8
  • 9. ©2022 Databricks Inc. — All rights reserved 9 Databricks Ecosystem
  • 10. ©2022 Databricks Inc. — All rights reserved 10 10 Lakehouse One simple platform to unify all of your data, analytics, and AI workloads Customers Original creators of: 5000+ across the globe
  • 11. ©2021 Databricks Inc. — All rights reserved Data Warehouse Data Lake Lakehouse One platform to unify all of your data, analytics, and AI workloads
  • 12. ©2021 Databricks Inc. — All rights reserved Data Warehouse Data Lake An open approach to bringing data management and governance to data lakes Better reliability with transactions 48x faster data processing with indexing Data governance at scale with fine-grained access control lists
  • 13. ©2021 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Simple Platform Security & Administration Open Data Lake Data Management and Governance Data Engineering BI and SQL Analytics Data Science and ML Real-Time Data Applications Databricks Lakehouse Platform Open Collaborative ✓ ✓ ✓ Unstructured, semi-structured, structured, and streaming data
  • 14. ©2021 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Simple Platform Security & Administration Open Data Lake Data Management and Governance Data Engineering BI and SQL Analytics Data Science and ML Real-Time Data Applications Databricks Lakehouse Platform ✓ Unstructured, semi-structured, structured, and streaming data Unify your data, analytics, and AI on one common platform for all data use cases
  • 15. ©2021 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Open ✓ Unify your data ecosystem with open source standards and formats. Built on the innovation of some of the most successful open source data projects in the world 30 Million+ Monthly downloads
  • 16. ©2021 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Open ✓ Unify your data ecosystem with open source standards and formats. Partners across the data landscape 450+ Azure Data Factory Visual ETL & Data Ingestion Data Providers Amazon Redshift Azure Synapse Lakehouse Platform Business Intelligence Google BigQuery Amazon SageMaker Azure Machine Learning Machine Learning Google AI Platform AWS Glue Centralized Governance Top Consulting & SI Partners
  • 17. ©2021 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Collaborative ✓ Unify your data teams to collaborate across the entire data and AI workflow Data Analysts Data Engineers Data Scientists Models Dashboards Notebooks Datasets
  • 18. ©2022 Databricks Inc. — All rights reserved TPC-DS Databricks SQL set official data warehousing performance record - outperformed the previous record by 2.2x.
  • 19. ©2022 Databricks Inc. — All rights reserved 19 Spark Overview
  • 20. ©2022 Databricks Inc. — All rights reserved 20 • De-facto standard unified analytics engine for big data processing • Largest open-source project in data processing • Technology created by the founders of Databricks
  • 21. ©2022 Databricks Inc. — All rights reserved Spark Benefits Fast Easy to Use Unified
  • 22. ©2022 Databricks Inc. — All rights reserved Spark API Spark SQL + DataFrames Streaming MLlib SQL Python Scala Java R Spark Core API
  • 23. ©2022 Databricks Inc. — All rights reserved Spark application Spark Execution Job Job Job Stage 1 Stage 2 Task 1 Task 2
  • 24. ©2022 Databricks Inc. — All rights reserved Driver Worker Worker Worker Worker Executor Executor Executor Executor Core Core Core Core Core Core Core Core Spark Cluster Task Task Task Task Task
  • 25. ©2022 Databricks Inc. — All rights reserved Bonus: Magic commands in Notebook cells Magic commands allow you to override default languages as well as a few auxiliary commands that run utilities/commands. For example: 1. %python, %r, %scala, %sql Switch languages in a command cell 2. %sh Run shell code (runs only on Spark Driver, and not the Workers) 3. %fs Shortcut for dbutils filesystem commands 4. %md Markdown for styling the display 5. %run Execute a remote Notebook from a Notebook 6. %pip Install new Python libraries
  • 26. ©2022 Databricks Inc. — All rights reserved CSV Comma Separated Values item_id, name, price, qty M_PREM_Q, Premium Queen Mattress, 1795, 35 M_STAN_F, Standard Full Mattress, 945, 24 M_PREM_F, Premium Full Mattress, 1695, 45 M_PREM_T, Premium Twin Mattress, 1095, 18
  • 27. ©2022 Databricks Inc. — All rights reserved Parquet Name Score ID Kit 4.2 1 Alex 4.5 2 Kit Alex Terry 4.2 4.5 4.1 Terry 4.1 3 1 2 3 Row-Oriented data on disk Column-Oriented data on disk A columnar storage format
  • 28. ©2022 Databricks Inc. — All rights reserved Technology designed to be used with Apache Spark to build robust data lakes Delta Lake
  • 29. ©2022 Databricks Inc. — All rights reserved Open-source Storage Layer
  • 30. ©2022 Databricks Inc. — All rights reserved Delta Lake’s Key Features ▪ ACID transactions ▪ Time travel (data versioning) ▪ Schema enforcement and evolution ▪ Audit history ▪ Parquet format ▪ Compatible with Apache Spark API
  • 31. ©2022 Databricks Inc. — All rights reserved Module 3 Apache Spark Programming with Databricks Functions
  • 32. ©2022 Databricks Inc. — All rights reserved Functions Aggregation Datetimes Complex Types Additional Functions User-Defined Functions 32
  • 33. ©2022 Databricks Inc. — All rights reserved Module 4 Apache Spark Programming with Databricks Performance
  • 34. ©2022 Databricks Inc. — All rights reserved Spark Architecture Query Optimization Partitioning Performance 34
  • 35. ©2022 Databricks Inc. — All rights reserved 35 Spark Architecture Spark Cluster Spark Execution
  • 36. ©2022 Databricks Inc. — All rights reserved Scenario 1: Filter out brown pieces from candy bags
  • 37. ©2022 Databricks Inc. — All rights reserved Driver Executo r Core Cluster
  • 38. ©2022 Databricks Inc. — All rights reserved Partitio n Data
  • 39. ©2022 Databricks Inc. — All rights reserved We need filter out brown pieces from these candy bags
  • 40. ©2022 Databricks Inc. — All rights reserved A B C D E F G H I J K L 1 2 3 4 5 8 12 9 10 11 6 7
  • 41. ©2022 Databricks Inc. — All rights reserved 1 2 3 4 5 8 12 9 10 11 6 7 Student A get bag # 1 Student B get bag # 2 Student C get bag # 3 ... A B C D E F G H I J K L
  • 42. ©2022 Databricks Inc. — All rights reserved A B C D E F G I L K J H
  • 43. ©2022 Databricks Inc. — All rights reserved A B C D E F G H I J K L
  • 44. ©2022 Databricks Inc. — All rights reserved Eliminate the brown candy pieces and pile the rest in the corner A B C D E F G H I J K L
  • 45. ©2022 Databricks Inc. — All rights reserved Eliminate the brown candy pieces and pile the rest in the corner A B C D E F G H I J K L G H I J K L
  • 46. ©2022 Databricks Inc. — All rights reserved A B C D E F G I K L G H I J K L
  • 47. ©2022 Databricks Inc. — All rights reserved B C D F G I K L G I K L A E H J
  • 48. ©2022 Databricks Inc. — All rights reserved B C D F G I K L G I K L A E H J Students A, E, H, J, get bags 13, 14, 15, 16
  • 49. ©2022 Databricks Inc. — All rights reserved B C D F G I K L A E H J
  • 50. ©2022 Databricks Inc. — All rights reserved C B D F G I K L A E H J
  • 51. ©2022 Databricks Inc. — All rights reserved B D F G I K L A E H J C
  • 52. ©2022 Databricks Inc. — All rights reserved B D F G I K L A E H J C
  • 53. ©2022 Databricks Inc. — All rights reserved A B C D E F G H I J K L All done!
  • 54. ©2022 Databricks Inc. — All rights reserved Scenario 2: Count total pieces in candy bags
  • 55. ©2022 Databricks Inc. — All rights reserved Stage 1: Local Count
  • 56. ©2022 Databricks Inc. — All rights reserved We need to count the total pieces in these candy bags A B C D E F G H I J K L Stage 1: Local Count
  • 57. ©2022 Databricks Inc. — All rights reserved Students B, E, I, L, get these four bags A B C D E F G H I J K L Stage 1: Local Count
  • 58. ©2022 Databricks Inc. — All rights reserved A D F G H J K B I L Stage 1: Local Count C E
  • 59. ©2022 Databricks Inc. — All rights reserved A D F G H J K B I L Stage 1: Local Count C E
  • 60. ©2022 Databricks Inc. — All rights reserved Students B, E, I, L, commit your findings A D F G H J K 5 4 5 B I L 6 E C 5 6 4 5 Stage 1: Local Count
  • 61. ©2022 Databricks Inc. — All rights reserved 5 6 4 5 Stage 1 is complete! A D F G H J K B I L E C Stage 1: Local Count
  • 62. ©2022 Databricks Inc. — All rights reserved Stage 2: Global Count
  • 63. ©2022 Databricks Inc. — All rights reserved 5 6 4 5 Student G, fetch counts from students B, E, I, L A D F G H J K Stage 2: Global Count B I L E C
  • 64. ©2022 Databricks Inc. — All rights reserved 5 5 6 4 A D F H J K B I L E C G Stage 2: Global Count
  • 65. ©2022 Databricks Inc. — All rights reserved 20 A D F H J K B I L E C G Stage 2: Global Count
  • 66. ©2022 Databricks Inc. — All rights reserved A D F H J K B I L E C 20 G Stage 2: Global Count
  • 67. ©2022 Databricks Inc. — All rights reserved Stage 2: Global Count Stage 1: Local Count A D F H J K B I L E C 20 G 5 6 4 5 A D F G H J K B I L E C
  • 68. ©2022 Databricks Inc. — All rights reserved 68 Query Optimization Catalyst Optimizer Adaptive Query Execution
  • 69. ©2022 Databricks Inc. — All rights reserved Query Optimization Unresolved Logical Plan Logical Plan Optimized Logical Plan COST BASED OPTIMIZATION WHOLE-STAGE CODE GENERATION Metadata Catalog Cost Model Selected Physical Plan Physical Plans Physical Plans Physical Plans Query Catalyst Catalog ANALYSIS LOGICAL OPTIMIZATION PHYSICAL PLANNING RDDs
  • 70. ©2022 Databricks Inc. — All rights reserved Query Optimization with AQE Unresolved Logical Plan Logical Plan Optimized Logical Plan Physical Plans COST BASED OPTIMIZATION WHOLE-STAGE CODE GENERATION Metadata Catalog Cost Model Selected Physical Plan Physical Plans Physical Plans Query Catalyst Catalog ANALYSIS LOGICAL OPTIMIZATION PHYSICAL PLANNING RDDs Runtime Statistics ADAPTIVE QUERY EXECUTION New in Spark 3.0, enabled by default as of Spark 3.2
  • 71. ©2022 Databricks Inc. — All rights reserved Module 5 Apache Spark Programming with Databricks Structured Streaming
  • 72. ©2022 Databricks Inc. — All rights reserved Streaming Query Stream Aggregations Structured Streaming 72
  • 73. ©2022 Databricks Inc. — All rights reserved 73 Streaming Query Advantages Use Cases Sources
  • 74. ©2022 Databricks Inc. — All rights reserved Stream Processing Batch Processing
  • 75. ©2022 Databricks Inc. — All rights reserved Advantages of Stream Processing Lower latency Automatic bookkeeping on new data Efficient Updates
  • 76. ©2022 Databricks Inc. — All rights reserved Stream Processing Use Cases Incremental ETL Real-time reporting Online ML Update data to serve in real-time Real-time decision making Notifications
  • 77. ©2022 Databricks Inc. — All rights reserved Micro-Batch Processing
  • 78. ©2022 Databricks Inc. — All rights reserved Micro-Batch Processing
  • 79. ©2022 Databricks Inc. — All rights reserved Structured Streaming Micro-batches = new rows appended to unbounded table
  • 80. ©2022 Databricks Inc. — All rights reserved FOR TESTING Input Sources Sockets Generator Files Kafka Event Hubs
  • 81. ©2022 Databricks Inc. — All rights reserved Sinks FOR DEBUGGING Console Memory Foreach Files Kafka Event Hubs
  • 82. ©2022 Databricks Inc. — All rights reserved Output Modes APPEND Add new records only UPDATE Update changed records in place COMPLETE Rewrite full output
  • 83. ©2022 Databricks Inc. — All rights reserved Trigger Types Default Process each micro-batch as soon as the previous one has been processed Fixed interval Micro-batch processing kicked off at the user-specified interval One-time Process all of the available data as a single micro-batch and then automatically stop the query Continuous Processing Long-running tasks that continuously read, process, and write data as soon events are available *Experimental See Structured Streaming Programming Guide
  • 84. ©2022 Databricks Inc. — All rights reserved Guaranteed in Structured Streaming by Checkpointing and write-ahead logs Idempotent sinks Replayable data sources End-to-end fault tolerance
  • 85. ©2022 Databricks Inc. — All rights reserved 85 Stream Aggregations Aggregations Windows Watermarking
  • 86. ©2022 Databricks Inc. — All rights reserved Errors in IoT data by device type Real-time Aggregations Anomalous behavior in server log files by country Behavior analysis on messages by hashtags
  • 87. ©2022 Databricks Inc. — All rights reserved No window overlap Any given event gets aggregated into only one window group e.g. 1:00–2:00 am, 2:00–3:00 am, 3:00-4:00 am, ... Time-Based Windows Windows overlap Any given event gets aggregated into multiple window groups e.g. 1:00-2:00 am, 1:30–2:30 am, 2:00–3:00 am, ... Tumbling Windows Sliding Windows
  • 88. ©2022 Databricks Inc. — All rights reserved Sliding Windows Example
  • 89. ©2022 Databricks Inc. — All rights reserved Windowing (streamingDF .groupBy(col("device"), window(col("time"), "1 hour")) .count()) Why are we seeing 200 tasks for this stage?
  • 90. ©2022 Databricks Inc. — All rights reserved Control the Shuffle Repartitioning spark.conf.set("spark.sql.shuffle.partitions",spark.sparkContext.defaultParallelism)
  • 91. ©2022 Databricks Inc. — All rights reserved Event-Time Processing WATERMARKS Handle late data and limit how long to remember old data EVENT-TIME DATA Process based on event-time (time fields embedded in data) rather than receipt time
  • 92. ©2022 Databricks Inc. — All rights reserved Handling Late Data and Watermarking
  • 93. ©2022 Databricks Inc. — All rights reserved Watermarking (streamingDF .withWatermark("time", "2 hours") .groupBy(col("device"), window(col("time"), "1 hour")) .count() )
  • 94. ©2022 Databricks Inc. — All rights reserved Module 6 Apache Spark Programming with Databricks Delta Lake
  • 95. ©2022 Databricks Inc. — All rights reserved Using Spark with Delta Lake Delta Lake 95
  • 96. ©2022 Databricks Inc. — All rights reserved 96 Delta Lake Delta Lake Concepts
  • 97. ©2022 Databricks Inc. — All rights reserved What is Delta Lake? ▪ Technology designed to be used with Apache Spark to build robust data lakes ▪ Open source project at delta.io ▪ Databricks Delta Lake documentation
  • 98. ©2022 Databricks Inc. — All rights reserved Delta Lake features ▪ ACID transactions on Spark ▪ Scalable metadata handling ▪ Streaming and batch unification ▪ Schema enforcement ▪ Time travel ▪ Upserts and deletes ▪ Fully configurable/optimizable ▪ Structured streaming support
  • 99. ©2022 Databricks Inc. — All rights reserved Delta Lake components Delta Lake storage layer Delta tables Delta Engine
  • 100. ©2022 Databricks Inc. — All rights reserved Delta Lake Storage Layer ▪ Highly performant and persistent ▪ Low-cost, easily scalable object storage ▪ Ensures consistency ▪ Allows for flexibility
  • 101. ©2022 Databricks Inc. — All rights reserved Delta tables ▪ Contain data in Parquet files that are kept in object storage ▪ Keep transaction logs in object storage ▪ Can be registered in a metastore (optional)
  • 102. ©2022 Databricks Inc. — All rights reserved Delta Engine ▪ File management optimizations ▪ Auto-optimized writes ▪ Performance optimization via Delta caching Databricks edge feature; not available in OS Apache Spark
  • 103. ©2022 Databricks Inc. — All rights reserved What is the Delta transaction log? ▪ Ordered record of the transactions performed on a Delta table ▪ Single source of truth for that table ▪ Mechanism that the Delta Engine uses to guarantee atomicity
  • 104. ©2022 Databricks Inc. — All rights reserved How does the transaction log work? ▪ Delta Lake breaks operations down into one or more of these steps: ▪ Add file ▪ Remove file ▪ Update metadata ▪ Set transaction ▪ Change protocol ▪ Commit info
  • 105. ©2022 Databricks Inc. — All rights reserved Delta transaction log at the file level
  • 106. ©2022 Databricks Inc. — All rights reserved Adding commits to the transaction log
  • 107. ©2022 Databricks Inc. — All rights reserved ▪ Returns the metadata of an existing table ▪ Ex. Column names, data types, comments ▪ Returns a more complete set of metadata for a Delta table ▪ Operation, user, operation metrics DESCRIBE HISTORY Command DESCRIBE Command
  • 108. ©2022 Databricks Inc. — All rights reserved Delta Lake Time Travel ▪ Query an older snapshot of a Delta table ▪ Re-creating analysis, reports or outputs ▪ Writing complex temporal queries ▪ Fixing mistakes in data ▪ Providing snapshot isolation
  • 109. ©2022 Databricks Inc. — All rights reserved Time Travel SQL syntax SELECT * FROM events TIMESTAMP AS OF timestamp_expression SELECT * FROM events VERSION AS OF version
  • 110. ©2022 Databricks Inc. — All rights reserved Time Travel SQL syntax SELECT * FROM events TIMESTAMP AS OF timestamp_expression ▪ timestamp_expression can be: ▪ String that can be cast to a timestamp ▪ String that can be cast to a date ▪ Explicit timestamp type (the result of casting) ▪ Simple time or date expressions (see the Databricks documentation)
  • 111. ©2022 Databricks Inc. — All rights reserved Time Travel SQL syntax SELECT * FROM events VERSION AS OF version ▪ Version can be obtained from the output of DESCRIBE HISTORY events.
  • 112. ©2022 Databricks Inc. — All rights reserved Thank you! Congrats!