Building Data Lakes
with Azure Databricks
Dustin Vannoy
dustinvannoy.com
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
Agenda
● Data Lake Defined
● Querying the Data Lake
● Reference Architecture
○ Spark with Databricks
○ Data Lake Storage
○ Event Hubs for Kafka
Data Lake Defined
Big Data Capable
Store first, evaluate and model later
Data Zones
1. Raw
2. Enriched
3. Certified / Curated
Ready for Analysts
Query layer and other analytic tools access the data
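As a concrete illustration of moving data between zones, here is a minimal PySpark sketch; the storage account, container, folder layout, and column names are placeholders (not from the deck), and spark is the session a Databricks notebook provides.

# Hypothetical zone paths in ADLS Gen 2
raw_path = "abfss://datalake@myaccount.dfs.core.windows.net/raw/trips/2020/04/"
enriched_path = "abfss://datalake@myaccount.dfs.core.windows.net/enriched/trips/"

# Read raw CSV as-is, apply light cleanup, land it in the enriched zone as Parquet
df = spark.read.option("header", "true").csv(raw_path)
df_clean = df.dropDuplicates().filter("trip_id IS NOT NULL")
df_clean.write.mode("append").parquet(enriched_path)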
Store Everything
Why Data Lakes?
Reason #1
CSV, JSON, Parquet, Avro, Text
No schema on write
Cheaper storage
Massive Scale (Big Data)
Why Data Lakes?
Reason #2
Scale up easily
Span hot and cold storage
Pay only for what you need
Storage + Compute Separate
Why Data Lakes?
Reason #3
Multiple analytics tools / same data
Cost savings
Data Warehouse Defined
Structured Data
Processed and modeled for analytics use
Interactive query
Analysts can get answers to questions quickly
BI tool support
Reporting tools can query efficiently
Querying the Data Lake
Azure Databricks
Unified Analytics Platform
Managed Apache Spark
Performance optimizations
Auto-scale and auto-terminate
Interfaces for Databricks
Notebooks
IDE (PyCharm, VS Code)
REST API
Azure Data Factory
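As one example of the REST API interface, here is a minimal sketch (the workspace URL and token are placeholders) that lists the clusters in a workspace; jobs and run submissions follow the same request pattern.

import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace
token = "<personal-access-token>"

# List clusters in the workspace via the Databricks REST API
resp = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.json())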
Demo: Data Lake Querying + Databricks Workspace
Reference Architecture
Databricks Data Lake – Simple
Sources → Azure Databricks → Azure Data Lake Storage
Databricks Data Lake – Ingest Options
Sources → Data Factory / Event Hubs → Azure Databricks → Azure Data Lake Storage
Apache Spark
Why Spark?
Big data and the cloud changed our mindset: we want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
Benefit of horizontal scaling
Traditional vs. Distributed (Parallel)
What is Spark?
● Fast, general-purpose data processing
● Simple code, parallel compute
● Programming API
○ Scala, Python, Java
○ Spark Core, SQL, Streaming, ML
● Execution Engine
○ Spark Cluster / YARN / Local
Simple code, parallel compute
Controller → Worker, Worker, Worker
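To make "simple code, parallel compute" concrete, here is a minimal sketch (the file path and column name are hypothetical, and spark is the session a Databricks notebook provides); the same few lines run unchanged whether the cluster has two workers or fifty.

# The DataFrame API reads like single-machine code, but Spark splits the work across workers
df = spark.read.parquet("abfss://datalake@myaccount.dfs.core.windows.net/enriched/trips/")
summary = df.groupBy("vendor_id").count()   # aggregation runs in parallel across partitions
summary.show()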
Spark Example - Read
Streaming:
df = (spark.readStream
    .format("kafka")
    .options(**config)
    .load())

Batch:
df = (spark.read
    .format("kafka")
    .options(**config)
    .load())
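A complementary sketch, not from the slide, of what typically follows the streaming read: persist the stream into the data lake (the paths and checkpoint location are hypothetical).

# Hypothetical continuation: write the Kafka stream to the data lake in Delta format
query = (df.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://datalake@myaccount.dfs.core.windows.net/checkpoints/trips/")
    .start("abfss://datalake@myaccount.dfs.core.windows.net/raw/trips/"))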
Data Lake Storage
Data Lake Storage, Gen 2
• Built on Azure Blob Storage
• Hadoop-compatible access
• Optimized for cloud analytics
• Low cost: $$
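One common way to point Databricks at Data Lake Storage Gen 2 is an abfss:// path plus an account key read from a secret scope. A minimal sketch with placeholder account, container, and secret names; mounting or service principal auth are alternatives.

# Placeholder names; keep the account key in a Databricks secret scope rather than in code
storage_account = "myaccount"
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="datalake", key="storage-account-key"),
)
df = spark.read.parquet(f"abfss://datalake@{storage_account}.dfs.core.windows.net/enriched/trips/")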
Event Hubs for Kafka
Why Event Hubs?
Reliable place to stream events; decoupled from destination.
Event Hubs is a scalable message broker, keeping up with producers and persisting data for all consumers.
Hub for streaming data
Trip Data and Vendor Data flow into Azure Event Hubs, which feeds the Data Lake, a User Dashboard, and a real-time report.
Event Hubs key concepts
● Namespace = container to hold multiple Event Hubs
● Event Hub = Topic
● Partitions and Consumer Groups
○ Same concepts as Kafka
○ Minor differences in implementation
● Throughput Units define level of scalability
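Because Event Hubs exposes a Kafka-compatible endpoint, the config dict used in the earlier read example can point at it. A minimal sketch with placeholder namespace, event hub, and secret names; the kafkashaded prefix in the JAAS config applies when running on Databricks.

# Placeholder names; the connection string should come from a secret scope
connection_string = dbutils.secrets.get(scope="streaming", key="eventhubs-connection-string")
config = {
    "kafka.bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    ),
    "subscribe": "trip-data",          # the Event Hub name, used like a Kafka topic
    "startingOffsets": "earliest",
}
df = spark.readStream.format("kafka").options(**config).load()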
Event Hubs tier: Standard, not Basic (the Kafka endpoint is not available on the Basic tier)
Demo: Loading the Data Lake
File System Best Practices
Storage Best Practices
Azure Data Lake Storage, Gen 2
Partition folders
Parquet or Delta format (not CSV)
Use splittable compression
Small files are a problem (< 128 MB)
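A minimal PySpark sketch of those practices (the path and partition columns are hypothetical): write Delta rather than CSV, partition into folders, and reduce the number of tiny output files.

# Hypothetical path and columns
output_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"

(df.coalesce(8)                              # fewer, larger files instead of many small ones
   .write.format("delta")
   .partitionBy("trip_year", "trip_month")   # partition folders for pruning on common filters
   .mode("overwrite")
   .save(output_path))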
Spark is powerful, but...
● Not ACID compliant – too easy to get corrupted data
● Schema mismatches – no validation on write
● Small files written, not efficient for reading
● Reads too much data (no indexes, only partitions)
Delta Lake addresses
● ACID compliance
● Schema enforcement
● Compacting files
● Performance optimizations
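A short sketch of those points on a Databricks cluster (the table path is hypothetical): writes are ACID, a write with a mismatched schema fails unless mergeSchema is enabled, and OPTIMIZE compacts small files.

# Hypothetical Delta table path
delta_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"

df.write.format("delta").mode("append").save(delta_path)   # transactional (ACID) append
# Appending a DataFrame whose schema does not match fails unless .option("mergeSchema", "true") is set
spark.sql(f"OPTIMIZE delta.`{delta_path}`")                 # compact small files into larger ones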
Delta Log
“The transaction log is the mechanism through which Delta Lake is able to
offer the guarantee of atomicity.”
Reference: Databricks Blog: Unpacking the Transaction Log
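To see the transaction log at work, Delta exposes the table history; a one-line sketch (the path is hypothetical).

# Every committed write appears as a versioned entry backed by the transaction log
delta_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"
spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`").show(truncate=False)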
Final thoughts
References
Notebooks from demos: https://github.com/datakickstart/databricks-notebooks
Pluralsight Databricks + PySpark Training:
https://app.pluralsight.com/channels/details/0418df96-d33b-43bc-8a77-1d437d3c53e2?s=1
LinkedIn Learning: https://www.linkedin.com/learning/apache-spark-essential-training
Delta Lake:
https://www.youtube.com/watch?v=F91G4RoA8is
My YouTube Data Lake and Spark intros:
https://youtu.be/YOu2OZ2Y2mI
https://youtu.be/Ud6luYCkkMk
More links at bottom of this blog post:
https://dustinvannoy.com/2020/04/26/journey-of-a-data-engineer-part-2/
Website: dustinvannoy.com
Twitter: @dustinvannoy
YouTube: Dustin Vannoy on YouTube
More Content
Thank you!
