Realtime Structured
Streaming in Azure
Databricks
Brian Steele - Principal Consultant
bsteele@pragmaticworks.com
• You currently have high volume data that you are
processing in a batch format
• You are trying to get real-time insights from your
data
• You have great knowledge of your data, but limited
knowledge of Azure Databricks or other Spark systems
Your Current Situation
Prior Architecture
Source System → Azure Data Factory (Daily File Extract) → Batch Processing

New Architecture (Bypass Source System)
Realtime Message Streaming to Event Hubs → Structured Streaming → Realtime Transaction Processing
• Azure Databricks is an Apache Spark-based analytics platform
optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated
with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data
scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based
analytics service.
Why Azure Databricks?
• For a big data pipeline, the data (raw or structured) is ingested into
Azure through Azure Data Factory in batches, or streamed near real-
time using Kafka, Event Hub, or IoT Hub.
• This data lands in a data lake for long term persisted storage, in Azure
Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data
from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and
turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including
Azure Active Directory integration, role-based controls, and SLAs that
protect your data and your business.
• Structured Streaming is the Apache Spark API that lets you express
computation on streaming data in the same way you express a batch
computation on static data.
• The Spark SQL engine performs the computation incrementally and
continuously updates the result as streaming data arrives.
• Databricks maintains the current checkpoint of the data processed,
making restart after failure nearly seamless.
• Can bring impactful insights to the users in almost real-time.
Advantages of Structured Streaming
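As a quick illustration of that batch/stream symmetry, here is a minimal sketch (the Delta path and column name are hypothetical): the same expression works on a static read and a streaming read.

```python
# Same source, read two ways (path is a placeholder).
static_df = spark.read.format("delta").load("/mnt/delta/transactions")
stream_df = spark.readStream.format("delta").load("/mnt/delta/transactions")

# The identical transformation applies to both; on the streaming DataFrame,
# the Spark SQL engine computes it incrementally as new records arrive.
counts_static = static_df.groupBy("customerId").count()
counts_stream = stream_df.groupBy("customerId").count()
```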
Streaming Data Sources/Sinks

Sources:
• Azure Event Hubs/IoT Hubs
• Azure Data Lake Gen2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
• Databricks Delta Tables

Sinks:
• Databricks Delta Tables
• Almost any sink using foreachBatch
• Source Parameters
• Source Format/Location
• Batch/File Size
• Transformations
• Streaming data can be transformed in the
same ways as static data
• Output Parameters
• Output Format/Location
• Checkpoint Location
Structured Streaming
EVENT HUB → Structured Streaming
DEMO
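A minimal sketch of the demo pipeline in a Databricks notebook follows. All paths, the connection string, and the payload schema are hypothetical placeholders, and it assumes the Azure Event Hubs Spark connector library is attached to the cluster:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# The connector expects the connection string to be encrypted.
conn = "Endpoint=sb://...;EntityPath=transactions"  # placeholder
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

# Source: read the Event Hubs stream; the payload arrives in the binary `body` column.
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Transformation: parse the JSON payload exactly as you would on a static DataFrame.
schema = (StructType()
          .add("transactionId", StringType())
          .add("customerId", StringType())
          .add("amount", DoubleType())
          .add("timeStamp", TimestampType()))
parsed = (raw.select(from_json(col("body").cast("string"), schema).alias("t"))
             .select("t.*"))

# Sink: write to a Delta table; the checkpoint records the per-partition offsets
# so the stream can restart almost seamlessly after a failure.
(parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/transactions")  # placeholder
    .start("/mnt/delta/transactions"))
```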
Join Operations
• Join Types
• Inner
• Left
• Not Stateful by default
Stream-Static Joins
EVENT HUB + STATIC FILE → Structured Streaming
DEMO
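A sketch of a stream-static join (paths and the itemId join key are hypothetical): the streaming side is enriched from a static dimension table, and the join itself adds nothing to streaming state.

```python
# Static dimension table (placeholder path).
items = spark.read.format("delta").load("/mnt/delta/items")

# Streaming fact data (placeholder path).
transactions = spark.readStream.format("delta").load("/mnt/delta/transactions")

# Inner and left outer joins are supported against a static right side;
# the join is not stateful - each micro-batch is matched to the static data.
enriched = transactions.join(items, on="itemId", how="left")

(enriched.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/enriched")  # placeholder
    .start("/mnt/delta/enriched_transactions"))
```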
• Join Types
• Inner (Watermark and Time Constraint optional)
• Left Outer (Watermark and Time Constraint required)
• Right Outer (Watermark and Time Constraint required)
• You can also join static Tables/Files into your Stream-Stream Join
Stream-Stream Joins
EVENT HUB + EVENT HUB (+ STATIC FILE) → Structured Streaming → Micro Batch
• Watermark – How late a record can arrive, and after what time it can be removed from the state
• Time Constraint – How long the records will be kept in state in relation to the other stream
• Only used in stateful operations
• Ignored in non-stateful streaming queries and batch queries
Watermark vs. Time Constraint
EVENT HUB + EVENT HUB → Structured Streaming
Transactions (Watermark: 10 minutes):
Transaction 1/Customer 1/Item 1
Transaction 2/Customer 2/Item 1
Transaction 3/Customer 1/Item 2

Views (Watermark: 5 minutes):
View 1/Customer 1/Item 1
View 2/Customer 2/Item 2
View 3/Customer 3/Item 3
View 4/Customer 1/Item 2
Time constraint:
View.timeStamp >= Transaction.timeStamp
and View.timeStamp <= Transaction.timeStamp + interval 5 minutes
Timeline example (10:00–10:15):
• Transaction watermark window: 10:00–10:10
• Time constraint window: 10:00–10:05
• View watermark window (View 6): 10:00–10:05
• 10:00 – Transaction 1 occurs; received at 10:01
• 10:00 – View 6
• 10:02 – View 1
• 10:03 – View 2
• 10:04 – View 3 occurs; received at 10:08
• 10:04 – View 5 occurs; received at 10:12
• 10:06 – View 4
DEMO
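A sketch of this stream-stream join (paths and column names are hypothetical), applying the watermarks and time constraint from the scenario above:

```python
from pyspark.sql.functions import expr

# Hypothetical streaming sources; any streaming DataFrame with an event-time
# column named timeStamp would work the same way.
transactions = (spark.readStream.format("delta")
                .load("/mnt/delta/transactions")      # placeholder path
                .withWatermark("timeStamp", "10 minutes"))

views = (spark.readStream.format("delta")
         .load("/mnt/delta/views")                    # placeholder path
         .withWatermark("timeStamp", "5 minutes"))

# Left outer join so transactions with no view within 5 minutes still appear.
# The watermarks plus this time constraint tell Spark when state can be
# dropped, which is why they are required for outer stream-stream joins.
joined = transactions.alias("t").join(
    views.alias("v"),
    expr("""
        v.customerId = t.customerId AND
        v.timeStamp >= t.timeStamp AND
        v.timeStamp <= t.timeStamp + interval 5 minutes
    """),
    "leftOuter")
```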
• Allows batch-type processing to be performed on streaming data
• Perform processes without adding to state
• dropDuplicates
• Aggregating data
• Perform a merge/upsert with existing static data
• Write data to multiple sinks/destinations
• Write data to sinks not supported in Structured Streaming
foreachBatch
DEMO
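A sketch of foreachBatch (table names and paths are hypothetical): each micro-batch arrives as an ordinary static DataFrame, so normal batch operations such as deduplication, a Delta MERGE, and writes to multiple sinks can all be applied without adding to streaming state.

```python
from delta.tables import DeltaTable

stream = spark.readStream.format("delta").load("/mnt/delta/transactions_raw")  # placeholder

def process_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch without a stateful dropDuplicates.
    deduped = batch_df.dropDuplicates(["transactionId"])

    # Merge/upsert into an existing Delta table.
    target = DeltaTable.forPath(spark, "/mnt/delta/transactions_current")
    (target.alias("t")
        .merge(deduped.alias("s"), "t.transactionId = s.transactionId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Write the same batch to a second sink - here a SQL table via JDBC,
    # a sink type not directly supported by writeStream.
    (deduped.write.format("jdbc")
        .option("url", "jdbc:sqlserver://myserver...")  # placeholder
        .option("dbtable", "dbo.Transactions")
        .mode("append")
        .save())

(stream.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/foreachbatch")  # placeholder
    .start())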
• Spark Shuffle Partitions
• Set equal to the number of cores on the cluster
• Maximum Records per Micro-Batch
• File Source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
• Event Hubs – maxEventsPerTrigger
• Limit Stateful Operations – limits state size and memory errors
• Watermarking
• MERGE/Join/Aggregation
• Broadcast Joins
• Output Tables – influence downstream streams
• Manually re-partition
• Delta Lake – Auto-Optimize
Going to Production
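A sketch of some of the settings above (all values and paths are hypothetical placeholders to tune for your own cluster and source):

```python
# Shuffle partitions: match the core count of the cluster to limit disk spill.
spark.conf.set("spark.sql.shuffle.partitions", "8")   # e.g. an 8-core cluster

# Cap the micro-batch size at the source (file/Delta sources; Event Hubs uses
# maxEventsPerTrigger in its connector configuration instead).
stream = (spark.readStream.format("delta")
          .option("maxFilesPerTrigger", 100)
          .option("maxBytesPerTrigger", "1g")
          .load("/mnt/delta/transactions"))

# Enable Delta Auto-Optimize on an output table so streaming writes are
# compacted into fewer, larger files for downstream readers.
spark.sql("""
    ALTER TABLE delta.`/mnt/delta/enriched_transactions`
    SET TBLPROPERTIES (
        delta.autoOptimize.optimizeWrite = true,
        delta.autoOptimize.autoCompact = true)
""")
```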
Conclusion
Have Any Questions?
Editor's Notes
1. - Question responses from the polls. For the last year or so I have been working very heavily in Databricks, specifically in using it for big data processing with structured streaming. So what we are going to look at today is for the user who: maybe has played a little with Databricks; has used Spark in some other form in the past; has at least an idea of, or need for, big data processing, specifically in a real-time solution.
2. So why Azure Databricks? I have worked with many big data systems over the years on several different platforms, and I had used Spark before. But as more of a data architect and developer, I was always put off by what seemed like the overcomplexity of the Spark ecosystem. There were a lot of elements, it took a lot of "under the hood" setup and tuning, and I would just always rather use something else – especially as we moved to Azure and the cloud, where I could just throw a never-ending amount of processing at my big data problems. With Databricks I now get the best of both worlds: a simple-to-set-up, simple-to-maintain, easy-to-scale Spark-based system with all the development and processing benefits and without all the technical and administrative overhead. So with Azure Databricks you get Spark – directly from the people that invented it – but in a fast, easy, and collaborative cloud service.
3. You also get great integration with all the other Azure elements – Event Hubs, Key Vaults, Data Lakes, Azure SQL, data warehouse, Data Factory, and even Azure DevOps. Then you overlay your existing Azure security model with Active Directory right on top to provide a completely integrated security model.
4. Structured streaming then allows you to take all of that integration and processing power and apply it to a stream of big data to gain near real-time processing capabilities. So you can process through large amounts of messages/events/files as they are received and perform the same computations on the data that you could with a static dataset. At the same time, Databricks automatically keeps a record of the data as it is processed, allowing almost seamless restarts if a failure were to occur in the process. This allows you to generate datasets in near real time, providing marketable insights to your business.
5. There are several different source and sink locations that can be used with streaming in Databricks. Within the Azure ecosystem, Azure Event Hubs and Databricks Delta tables in Azure Data Lake are the most popular, but other source streams like Apache Kafka or Amazon Kinesis are also popular. You can also use the file queue in Data Lake Gen2 with Auto Loader to load blob files as they are saved to a file location. And you can use almost anything as a sink by using the foreachBatch method, which we will take a look at later.
6. So a typical structured streaming pipeline is made up of three parts: the source, any transformations, and the output sink or destination. In our first example we will look at the source being an Event Hub message stream, add some minor transformations, and then sink the results to a Databricks Delta table. Each source has some specific options or parameters, such as format, connection information, file location, etc. The transformations can be any transformation you can perform on a static dataset. And the output can again have specific options and formats based on the type, including the destination location or partitioning information. The key element that makes the sink of a streaming data source different is the checkpoint location. This checkpoint allows the stream to keep track of which messages have been read from the source and, if the stream is interrupted, where to pick up on restart. In the case of the Event Hub queue, the checkpoint keeps track of the specific message offset on each partition. Also note that to use an Event Hub source you must add the Azure Event Hubs library to your cluster and import the microsoft.azure.eventhubs library into your notebook.
  7. TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
8. Structured Streaming supports joining a streaming Dataset or DataFrame with a static Dataset or DataFrame – such as binding our transactional table to other dimensional information, like sales info to an item table, customer information, or sales territories. It also supports joining to another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally as the micro-batches are executed, and looks similar to the results of our previous streaming aggregations example. So in the upcoming demonstrations we will look at a few of these examples and see how the join types (i.e. inner, outer, etc.) are handled. In all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it were with a static Dataset/DataFrame containing the same data as the stream.
9. When a streaming dataset and a static dataset are used, only an inner join and a left outer join are supported. Right outer joins and full outer joins are not supported. Inner joins and left outer joins on streaming and static datasets don't have to be stateful, which improves your performance: the records in any single micro-batch can be matched against a static set of records.
  10. TASK – need data and example code
11. Stream-to-stream joins support inner, left, and right joins, but with differing requirements. While watermarking is not required on an inner join, it is best to use it unless you can be sure both records will exist at some point; otherwise you may have records that stay in state indefinitely and are never cleaned up.
12. It's very important to understand the difference between a watermark and a time constraint. Watermarking a stream decides how delayed a record can arrive and gives a time after which records can be dropped. For example, if you set a watermark of 30 minutes, then records older than 30 minutes will be dropped/ignored. A time constraint decides how long the record will be retained in Spark's state in correlation to the other stream.
13. So in our scenario we are going to receive our transaction data, and in addition we are going to get view data from our website. We want to analyze: for Customer X, after buying Item Y, how many other items did they view in the next 5 minutes? Another thing to remember that often trips people up is that the watermark is not measured from the "current time"; it is measured from the last event time the system saw. So if you have not received new messages in the stream, it will not apply.
14. We have several possible outcomes. The transaction may be late, so how long do we want to keep that record? This can depend on the volume of records and the source system: if you have a large volume but few late records, you can make this timeframe shorter. The views may be late, or may even arrive before the transaction, so again, how long do we want to keep those records in memory? It has to be >= 5 minutes, since that is our time constraint. And the customer may not view anything else, so if we want to know that, we need to use a left join so we can get transactions that have no view data within 5 minutes.
  15. TASK – Need data elements and code. Databricks environment Can all be in the same command, can be in as many commands as you want
16. The last element of structured streaming that we are going to review is foreachBatch. What foreachBatch really lets you do is "cheat" on your streaming: you can take the streaming micro-batch, put it in the foreachBatch method, then perform anything you could normally do in standard batch processing. One of the key uses is to perform normally "stateful" processing – a great example of this is dropping duplicates. As you get into more complex data structures you might also need to perform aggregations on the micro-batch itself. So if you had a complex structure like a sales ticket that contained multiple individual sale items, you might want to aggregate those by item or department before saving them; in foreachBatch you can perform the aggregation, then save the data. Another great use is when you need to save the same streaming data to multiple sinks – this might be to update a summary dataset and to save the detailed record at the same time. This method can also be used to write data to sinks that are not supported in streaming, such as a SQL database table.
  17. TASK – need data and example code
18. This topic could really be its own webinar, but I did want to touch on some of the items you will want to look at when you get ready to move your stream to production. There is a really good session from the Spark + AI Summit 2020 that does a very good job of covering what types of issues to look for, and I will put that in the chat. https://databricks.com/session_na20/performant-streaming-in-production-preventing-common-pitfalls-when-productionizing-streaming-jobs But some of the items we want to watch for that are harder to fix once you have started to run a process in production are things like the shuffle partition setting, which can limit the disk shuffle and greatly increase performance. Once that is set, the value is saved in the delta metadata and is hard to change if you need to scale the number of cores on your cluster up or down. Another is the "auto-optimize" setting on your delta tables. By default, if you write streaming data to delta you will get a lot of very small files. You can set up a job to optimize the tables periodically, but in a real-time environment it is best to let the system optimize as data is processed. You can set your delta tables to auto-optimize, which will reduce the number of files and increase the size of the files to help downstream performance. You can also manipulate the size of the micro-batch by changing the number of events/files/bytes that are consumed, depending on your source. This again helps keep your processing from having to use disk for the shuffle partitions. Finally, as you design your streaming environment, try to limit the number of stateful processes you bring into the streams. By limiting things like deduplication of the stream itself, the number of aggregations, and the length of any watermarking, or by using the broadcast join hint on smaller static tables, you can greatly increase your record throughput and reduce memory usage and errors.