Building Data Lakes
with Azure Databricks
Dustin Vannoy
dustinvannoy.com
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
Agenda
● Data Lake Defined
● Querying the Data Lake
● Reference Architecture
○ Spark with Databricks
○ Data Lake Storage
○ Event Hubs for Kafka
Data Lake Defined
Big Data Capable
Store first, evaluate and model later
Data Zones
1. Raw
2. Enriched
3. Certified / Curated
Ready for Analysts
Query layer and other analytic tools access the data
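As a concrete illustration of moving data between zones, here is a minimal PySpark sketch; the storage account, container, folder layout, and column names are placeholders (not from the deck), and spark is the session a Databricks notebook provides.

# Hypothetical zone paths in ADLS Gen 2
raw_path = "abfss://datalake@myaccount.dfs.core.windows.net/raw/trips/2020/04/"
enriched_path = "abfss://datalake@myaccount.dfs.core.windows.net/enriched/trips/"

# Read raw CSV as-is, apply light cleanup, land it in the enriched zone as Parquet
df = spark.read.option("header", "true").csv(raw_path)
df_clean = df.dropDuplicates().filter("trip_id IS NOT NULL")
df_clean.write.mode("append").parquet(enriched_path)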
Store Everything
Why Data Lakes?
Reason #1
CSV, JSON, Parquet, Avro, Text
No schema on write
Cheaper storage
Massive Scale (Big Data)
Why Data Lakes?
Reason #2
Scale up easily
Span hot and cold storage
Pay only for what you need
Storage + Compute Separate
Why Data Lakes?
Reason #3
Multiple analytics tools / same data
Cost savings
Data Warehouse Defined
Structured Data
Processed and modeled for analytics use
Interactive query
Analysts can get answers to questions quickly
BI tool support
Reporting tools can query efficiently
Querying the Data Lake
Azure Databricks
Unified Analytics Platform
Managed Apache Spark
Performance optimizations
Auto-scale and auto-terminate
Interfaces for Databricks
Notebooks
IDE (PyCharm, VS Code)
REST API
Azure Data Factory
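As one example of the REST API interface, here is a minimal sketch (the workspace URL and token are placeholders) that lists the clusters in a workspace; jobs and run submissions follow the same request pattern.

import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace
token = "<personal-access-token>"

# List clusters in the workspace via the Databricks REST API
resp = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.json())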
Demo: Data Lake Querying + Databricks Workspace
Reference Architecture
Databricks Data Lake – Simple
Sources → Azure Databricks → Azure Data Lake Storage
Databricks Data Lake – Ingest Options
Sources → Data Factory / Event Hubs → Azure Databricks → Azure Data Lake Storage
Apache Spark
Why Spark?
Big data and the cloud changed our mindset: we want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than MapReduce.
Benefit of horizontal scaling
Traditional vs. Distributed (Parallel)
What is Spark?
● Fast, general-purpose data processing
● Simple code, parallel compute
● Programming API
○ Scala, Python, Java
○ Spark Core, SQL, Streaming, ML
● Execution Engine
○ Spark Cluster / YARN / Local
Simple code, parallel compute
Controller → Worker, Worker, Worker
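To make "simple code, parallel compute" concrete, here is a minimal sketch (the file path and column name are hypothetical, and spark is the session a Databricks notebook provides); the same few lines run unchanged whether the cluster has two workers or fifty.

# The DataFrame API reads like single-machine code, but Spark splits the work across workers
df = spark.read.parquet("abfss://datalake@myaccount.dfs.core.windows.net/enriched/trips/")
summary = df.groupBy("vendor_id").count()   # aggregation runs in parallel across partitions
summary.show()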
Spark Example - Read
Streaming:
df = (spark.readStream
    .format("kafka")
    .options(**config)
    .load())

Batch:
df = (spark.read
    .format("kafka")
    .options(**config)
    .load())
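A complementary sketch, not from the slide, of what typically follows the streaming read: persist the stream into the data lake (the paths and checkpoint location are hypothetical).

# Hypothetical continuation: write the Kafka stream to the data lake in Delta format
query = (df.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://datalake@myaccount.dfs.core.windows.net/checkpoints/trips/")
    .start("abfss://datalake@myaccount.dfs.core.windows.net/raw/trips/"))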
Data Lake Storage
Data Lake Storage, Gen 2
• Built on Azure Blob Storage
• Hadoop-compatible access
• Optimized for cloud analytics
• Low cost: $$
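One common way to point Databricks at Data Lake Storage Gen 2 is an abfss:// path plus an account key read from a secret scope. A minimal sketch with placeholder account, container, and secret names; mounting or service principal auth are alternatives.

# Placeholder names; keep the account key in a Databricks secret scope rather than in code
storage_account = "myaccount"
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="datalake", key="storage-account-key"),
)
df = spark.read.parquet(f"abfss://datalake@{storage_account}.dfs.core.windows.net/enriched/trips/")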
Event Hubs for Kafka
Why Event Hubs?
Reliable place to stream events; decoupled from destination.
Event Hubs is a scalable message broker, keeping up with producers and persisting data for all consumers.
Hub for streaming data
Trip Data and Vendor Data flow into Azure Event Hubs, which feeds the Data Lake, a User Dashboard, and a real-time report.
Event Hubs key concepts
● Namespace = container to hold multiple Event Hubs
● Event Hub = Topic
● Partitions and Consumer Groups
○ Same concepts as Kafka
○ Minor differences in implementation
● Throughput Units define level of scalability
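Because Event Hubs exposes a Kafka-compatible endpoint, the config dict used in the earlier read example can point at it. A minimal sketch with placeholder namespace, event hub, and secret names; the kafkashaded prefix in the JAAS config applies when running on Databricks.

# Placeholder names; the connection string should come from a secret scope
connection_string = dbutils.secrets.get(scope="streaming", key="eventhubs-connection-string")
config = {
    "kafka.bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    ),
    "subscribe": "trip-data",          # the Event Hub name, used like a Kafka topic
    "startingOffsets": "earliest",
}
df = spark.readStream.format("kafka").options(**config).load()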
Event Hubs tier: Standard, not Basic (the Kafka endpoint is not available on the Basic tier)
Demo: Loading the Data Lake
File System Best Practices
Storage Best Practices
Azure Data Lake Storage, Gen 2
Partition folders
Parquet or Delta format (not CSV)
Use splittable compression
Small files are a problem (< 128 MB)
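A minimal PySpark sketch of those practices (the path and partition columns are hypothetical): write Delta rather than CSV, partition into folders, and reduce the number of tiny output files.

# Hypothetical path and columns
output_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"

(df.coalesce(8)                              # fewer, larger files instead of many small ones
   .write.format("delta")
   .partitionBy("trip_year", "trip_month")   # partition folders for pruning on common filters
   .mode("overwrite")
   .save(output_path))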
Spark is powerful, but...
● Not ACID compliant – too easy to get corrupted data
● Schema mismatches – no validation on write
● Small files written, not efficient for reading
● Reads too much data (no indexes, only partitions)
Delta Lake addresses
● ACID compliance
● Schema enforcement
● Compacting files
● Performance optimizations
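A short sketch of those points on a Databricks cluster (the table path is hypothetical): writes are ACID, a write with a mismatched schema fails unless mergeSchema is enabled, and OPTIMIZE compacts small files.

# Hypothetical Delta table path
delta_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"

df.write.format("delta").mode("append").save(delta_path)   # transactional (ACID) append
# Appending a DataFrame whose schema does not match fails unless .option("mergeSchema", "true") is set
spark.sql(f"OPTIMIZE delta.`{delta_path}`")                 # compact small files into larger ones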
Delta Log
“The transaction log is the mechanism through which Delta Lake is able to
offer the guarantee of atomicity.”
Reference: Databricks Blog: Unpacking the Transaction Log
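To see the transaction log at work, Delta exposes the table history; a one-line sketch (the path is hypothetical).

# Every committed write appears as a versioned entry backed by the transaction log
delta_path = "abfss://datalake@myaccount.dfs.core.windows.net/curated/trips/"
spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`").show(truncate=False)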
Final thoughts
References
Notebooks from demos: https://github.com/datakickstart/databricks-notebooks
Pluralsight Databricks + PySpark Training:
https://app.pluralsight.com/channels/details/0418df96-d33b-43bc-8a77-1d437d3c53e2?s=1
LinkedIn Learning: https://www.linkedin.com/learning/apache-spark-essential-training
Delta Lake:
https://www.youtube.com/watch?v=F91G4RoA8is
My YouTube Data Lake and Spark intros:
https://youtu.be/YOu2OZ2Y2mI
https://youtu.be/Ud6luYCkkMk
More links at bottom of this blog post:
https://dustinvannoy.com/2020/04/26/journey-of-a-data-engineer-part-2/
Website: dustinvannoy.com
Twitter: @dustinvannoy
YouTube: Dustin Vannoy on YouTube
More Content
Thank you!
