Accelerating Data Ingestion with Databricks Autoloader
Simon Whiteley
Director of Engineering, Advancing Analytics
Agenda
▪ Why Incremental is Hard
▪ Autoloader Components
▪ Implementation
▪ Evolution
▪ Lessons
Why Incremental is Hard
Incremental Ingestion
(Diagram: files arrive in LANDING – how do we move them incrementally into BRONZE and on to SILVER?)
Incremental Ingestion
▪ Only Read New Files
▪ Don’t Miss Files
▪ Trigger Immediately
▪ Repeatable Pattern
▪ Fast over Large Directories
Existing Patterns – 1) ETL Metadata
etl batch read
{"lastRead": "2021/05/26"}
Contents:
• /2021/05/24/file 1
• /2021/05/25/file 2
• /2021/05/26/file 3
• /2021/05/27/file 4
.load(f"/{loadDate}/")
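A minimal sketch of this pattern, assuming a hypothetical watermark file and a date-partitioned landing area (all paths and names here are illustrative, not from the slides):

import json
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

state_path = "/dbfs/mnt/etl/state/last_read.json"   # hypothetical watermark file

# Read the watermark left by the previous batch run
with open(state_path) as f:
    last_read = datetime.strptime(json.load(f)["lastRead"], "%Y/%m/%d")

# Load only the next day's folder from the date-partitioned landing area
load_date = (last_read + timedelta(days=1)).strftime("%Y/%m/%d")
df = spark.read.json(f"/mnt/landing/{load_date}/")

# ... transform and write df ...

# On success, advance the watermark so the next run reads the following day
with open(state_path, "w") as f:
    json.dump({"lastRead": load_date}, f)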
Existing Patterns – 2) Spark File Streaming
file stream read
Contents:
• File 1
• File 2
• File 3
• File 4
Checkpoint:
• File 1
• File 2
• File 3
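For comparison, a minimal sketch of the plain Structured Streaming file source, where the checkpoint tracks which files have already been seen (paths and schema are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The file source needs an explicit schema for streaming reads
schema = StructType([
    StructField("ID", IntegerType()),
    StructField("ProductName", StringType()),
])

# Spark lists the directory on each trigger and records processed files in
# the checkpoint – this is what slows down as the folder grows
df = spark.readStream.schema(schema).json("/mnt/landing/")

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/bronze/_checkpoints/products")
   .start("/mnt/bronze/products"))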
Existing Patterns – 3) DIY
triggered batch read
(Diagram: Blob File Trigger → Logic App → Azure Function → Databricks Job API)
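A rough sketch of the glue code such a setup needs – here a function body that calls the Databricks Jobs API run-now endpoint when a blob-created event fires (the host, token, job id and parameter names are all assumptions for illustration):

import os
import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-....azuredatabricks.net
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # token stored in app settings / Key Vault
JOB_ID = int(os.environ["DATABRICKS_JOB_ID"])

def trigger_ingest_job(file_path: str) -> dict:
    # Kick off an existing Databricks job, passing the arrived file path through
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": JOB_ID, "notebook_params": {"file_path": file_path}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()   # contains the run_id of the triggered run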
Incremental Ingestion Approaches
Approach         | Good At               | Bad At
Metadata ETL     | Repeatable            | Not immediate, requires polling
File Streaming   | Repeatable, Immediate | Slows down over large directories
DIY Architecture | Immediate Triggering  | Not Repeatable
Databricks Autoloader
Prakash Chockalingam
Databricks Engineering Blog
Auto Loader is an optimized cloud file source for Apache Spark that loads data continuously and efficiently from cloud storage as new data arrives.
What is Autoloader?
Essentially, Autoloader combines our three approaches of:
• Storing Metadata about what has been read
• Using Structured Streaming for immediate processing
• Utilising Cloud-Native Components to optimise identifying arriving files
There are two parts to the Autoloader job:
• CloudFiles DataReader
• CloudNotification Services (optional)
Cloudfiles Reader
(Diagram: Blob Storage holds File 1.json through File 4.json; the Blob Storage Queue receives a message such as {"fileAdded": "/landing/file 4.json"}; the CloudFiles reader checks the queue for new messages and reads just those files from source into a DataFrame)
CloudFiles DataReader
df = (spark
      .readStream
      .format("cloudFiles")                           # tells Spark to use Autoloader
      .option("cloudFiles.format", "json")            # tells Autoloader to expect JSON files
      .option("cloudFiles.useNotifications", "true")  # use the notification queue rather than directory listing
      .schema(mySchema)
      .load("/mnt/landing/"))
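The reader is only half the job; a hedged sketch of writing that stream out to a bronze Delta table, with the checkpoint recording which files have been ingested (paths are illustrative):

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/bronze/_checkpoints/landing")
   .start("/mnt/bronze/landing"))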
Cloud Notification Services - Azure
(Diagram: a Blob Storage account publishes to an Event Grid Topic; each Autoloader stream has its own Event Grid Subscription feeding its own Blob Storage Queue)
Cloud Notification Services - Azure
(Diagram: a new file arrives in Blob Storage and triggers the Event Grid Topic; the Subscription checks its message filters and inserts a message such as {"fileAdded": "/file 4/"} into the queue)
NotificationServices Config
cloudFiles
.useNotifications – Directory Listing vs Notification Queue
.queueName – Use an Existing Queue
.connectionString – Queue Storage Connection
.subscriptionId / .resourceGroup / .tenantId / .clientId / .clientSecret – Service Principal details for Queue Creation
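A hedged sketch of wiring these options together, assuming the stream runs in a Databricks notebook (so dbutils is available) and the service principal credentials sit in a hypothetical secret scope called "ingest":

# With these service principal details Autoloader can create the Event Grid
# Subscription and Storage Queue on your behalf
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.subscriptionId", dbutils.secrets.get("ingest", "subscription-id"))
      .option("cloudFiles.resourceGroup", "rg-data-landing")
      .option("cloudFiles.tenantId", dbutils.secrets.get("ingest", "tenant-id"))
      .option("cloudFiles.clientId", dbutils.secrets.get("ingest", "client-id"))
      .option("cloudFiles.clientSecret", dbutils.secrets.get("ingest", "client-secret"))
      .schema(mySchema)
      .load("/mnt/landing/"))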
Implementing Autoloader
▪ Setup Steps
▪ Reading New Files
▪ A Basic ETL Setup
Delta Implementation
Practical Implementations
(Diagram: Autoloader picks files up from LANDING and streams them into BRONZE, ready for onward processing to SILVER)
Low Frequency Streams
(Diagram: with only one file arriving per day, a continuously running Autoloader stream still needs a 24/7 cluster)
Low Frequency Streams
(Diagram: the same one-file-per-day feed run as a scheduled, triggered job, so the cluster is only up for a fraction of the day rather than 24/7)

(df
 .writeStream
 .trigger(once=True)
 .start(path))

Autoloader can be combined with Trigger.Once – each run finds only the files that have not been processed since the last run.
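Putting the pieces together, a hedged sketch of a scheduled Trigger.Once run (paths are illustrative; on newer runtimes trigger(availableNow=True) plays the same role):

# The job runs on a schedule, processes whatever is new, then the cluster stops
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(mySchema)
      .load("/mnt/landing/"))

(df.writeStream
   .trigger(once=True)
   .option("checkpointLocation", "/mnt/bronze/_checkpoints/daily")
   .start("/mnt/bronze/daily"))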
Delta Merge
(Diagram: how do we merge an Autoloader stream into a Delta table?)
Delta Merge
def runThis(df, batchId):
    (df
     .write
     .save(path))

(df
 .writeStream
 .foreachBatch(runThis)
 .start())
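The slide leaves the body of runThis as a plain write; a hedged sketch of what a Delta merge inside foreachBatch might look like, assuming a delta-spark environment and an illustrative target table keyed on ID:

from delta.tables import DeltaTable

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into the target Delta table, keyed on ID
    target = DeltaTable.forPath(spark, "/mnt/silver/products")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.ID = s.ID")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(df.writeStream
   .foreachBatch(merge_batch)
   .option("checkpointLocation", "/mnt/silver/_checkpoints/products")
   .start())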
Delta Implementation
▪ Batch ETL Pattern
▪ Merge Statements
▪ Logging State
Evolving Schemas
New Features since Databricks Runtime 8.2
What is Schema Evolution?
{"ID": 1, "ProductName": "Belt"}
{"ID": 2, "ProductName": "T-Shirt", "Size": "XL"}
{"ID": 3, "ProductName": "Shirt", "Size": "14", "Care": {"DryClean": "Yes", "MachineWash": "Don't you dare"}}
How do we handle Evolution?
1. Fail the Stream
2. Manually Intervene
3. Automatically Evolve
In order to manage schema evolution, we need to know:
• What the schema is expected to be
• What the schema is now
• How we want to handle any changes in schema
Schema Inference
In Databricks Runtime 8.2 onwards, simply don't provide a schema to enable schema inference. This infers the schema once when the stream is started and stores it as metadata.
cloudFiles
.schemaLocation – where to store the inferred schema
.inferColumnTypes – sample the data to infer column types
.schemaHints – manually specify data types for certain columns
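A hedged sketch of an inference-driven reader combining these three options (paths and the hinted column are illustrative):

# No .schema() call: the schema is inferred on first start and persisted
# under cloudFiles.schemaLocation
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/products")
      .option("cloudFiles.inferColumnTypes", "true")  # sample the data to infer types
      .option("cloudFiles.schemaHints", "ID long")    # override the inferred type for ID
      .load("/mnt/landing/"))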
Schema Metastore
_schemas
On first read, the incoming file
{"ID": 1, "ProductName": "Belt"}
is used to write schema version 0 to the metastore (all columns inferred as strings by default):
{
  "type": "struct",
  "fields": [
    {"name": "ID", "type": "string", "nullable": true, "metadata": {}},
    {"name": "ProductName", "type": "string", "nullable": true, "metadata": {}}
  ]
}
Schema Metastore – DataType Inference
_schemas
With .option("cloudFiles.inferColumnTypes", "true") set, the same first read of
{"ID": 1, "ProductName": "Belt"}
writes schema version 0 with sampled data types:
{
  "type": "struct",
  "fields": [
    {"name": "ID", "type": "int", "nullable": true, "metadata": {}},
    {"name": "ProductName", "type": "string", "nullable": true, "metadata": {}}
  ]
}
Schema Metastore – Schema Hints
_schemas
With .option("cloudFiles.schemaHints", "ID long") set, the same first read of
{"ID": 1, "ProductName": "Belt"}
writes schema version 0 with the hinted type applied:
{
  "type": "struct",
  "fields": [
    {"name": "ID", "type": "long", "nullable": true, "metadata": {}},
    {"name": "ProductName", "type": "string", "nullable": true, "metadata": {}}
  ]
}
Schema Evolution
To allow for schema evolution, we can include a schema evolution mode option:
cloudFiles.schemaEvolutionMode
• addNewColumns – Fail the job, update the schema metastore
• failOnNewColumns – Fail the job, no updates made
• rescue – Do not fail, pull all unexpected data into _rescued_data
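For example, a hedged sketch of a reader set to rescue mode (path is illustrative); swapping the mode to addNewColumns would instead fail the stream and update the schema metastore so a restart picks up the new column:

# Unexpected columns are not lost – they arrive as JSON in _rescued_data
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/products")
      .option("cloudFiles.schemaEvolutionMode", "rescue")
      .load("/mnt/landing/"))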
Evolution Reminder
1: {"ID": 1, "ProductName": "Belt"}
2: {"ID": 2, "ProductName": "T-Shirt", "Size": "XL"}
3: {"ID": 3, "ProductName": "Shirt", "Size": "14", "Care": {"DryClean": "Yes", "MachineWash": "Don't you dare"}}
Schema Evolution - Rescue
In rescue mode, each file lands against the original two-column schema, with anything unexpected captured as JSON in _rescued_data:

ID | ProductName | _rescued_data
1  | Belt        |
2  | T-Shirt     | {"Size": "XL"}
3  | Shirt       | {"Size": "14", "Care": {"DryClean": "Yes", "MachineWash": "Don't you dare"}}
Schema Evolution – Add New Columns
_schemas
On arrival of file 2:
{"ID": 2, "ProductName": "T-Shirt", "Size": "XL"}
the stream fails and schema version 1 is written alongside version 0, with the new column added:
{
  "type": "struct",
  "fields": [
    {"name": "ID", "type": "string"},
    {"name": "ProductName", "type": "string"},
    {"name": "Size", "type": "string"}
  ]
}
On restart, the stream picks up the evolved schema and continues.
Schema Evolution
▪ Inference & The Schema
Metastore
▪ Schema Hints
▪ Schema Evolution
Lessons from an Autoloader Life
Autoloader Lessons
▪ EventGrid Quotas &
Settings
▪ Streaming Best
Practices
▪ Batching Best Practices
EventGrid Quota Lessons
• The system topic for a single storage account is capped at 500 event subscriptions, so that is the ceiling for notification-mode streams per account
• Deleting a checkpoint resets the stream ID and creates a new Subscription/Queue, leaving an orphaned set behind
• Use the CloudNotification libraries to manage this more closely with custom topics
Streaming Optimisation
• maxBytesPerTrigger / maxFilesPerTrigger – manage the size of the streaming micro-batch
• fetchParallelism – manages the workload on your queue
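A hedged sketch of applying these options (the values shown are illustrative, not recommendations):

# Cap the micro-batch size and raise the queue fetch parallelism
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.maxFilesPerTrigger", 1000)   # at most 1,000 files per micro-batch
      .option("cloudFiles.maxBytesPerTrigger", "10g")  # soft cap on bytes per micro-batch
      .option("cloudFiles.fetchParallelism", 4)        # threads pulling messages from the queue
      .schema(mySchema)
      .load("/mnt/landing/"))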
Batch Lessons – Look for Lost Messages
The default message retention is 7 days! If a triggered batch doesn't run within that window, queued notifications expire and those files are silently missed.
Databricks Autoloader
▪ Reduces complexity of ingesting files
▪ Has some quirks in implementing ETL processes
▪ Growing number of schema evolution features
Simon Whiteley
Director of
Engineering
hello@advancinganalytics.co.uk
@MrSiWhiteley
www.youtube.com/c/AdvancingAnalytics
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Tracking which incoming files have been processed has always required thought and design when implementing an ETL framework. The Autoloader feature of Databricks looks to simplify this, taking away the pain of file watching and queue management. However, there can also be a lot of nuance and complexity in setting up Autoloader and managing the process of ingesting data using it. After implementing an automated data loading process in a major US CPMG, Simon has some lessons to share from the experience.

This session will run through the initial setup and configuration of Autoloader in a Microsoft Azure environment, looking at the components used and what is created behind the scenes. We’ll then look at some of the limitations of the feature, before walking through the process of overcoming these limitations. We will build out a practical example that tackles evolving schemas, applying transformations to your stream, extracting telemetry from the process and finally, how to merge the incoming data into a Delta table.

After this session you will be better equipped to use Autoloader in a data ingestion platform, simplifying your production workloads and accelerating the time to realise value in your data!
