2. • You currently have high-volume data that you are
processing in a batch format
• You are trying to get real-time insights from your
data
• You have great knowledge of your data, but limited
knowledge of Azure Databricks or other Spark systems
Your Current Situation
4. New Architecture
[Diagram: real-time messages stream to Event Hubs, bypassing the source system → Structured Streaming → real-time transaction processing]
5. • Azure Databricks is an Apache Spark-based analytics platform
optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated
with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data
scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based
analytics service.
Why Azure Databricks?
6. • For a big data pipeline, the data (raw or structured) is ingested into
Azure through Azure Data Factory in batches, or streamed in near
real-time using Kafka, Event Hubs, or IoT Hub.
• This data lands in a data lake for long term persisted storage, in Azure
Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data
from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and
turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including
Azure Active Directory integration, role-based controls, and SLAs that
protect your data and your business.
7. • Structured Streaming is the Apache Spark API that lets you express
computation on streaming data in the same way you express a batch
computation on static data.
• The Spark SQL engine performs the computation incrementally and
continuously updates the result as streaming data arrives.
• Databricks maintains the current checkpoint of the data processed,
making restart after failure nearly seamless.
• Can bring impactful insights to the users in almost real-time.
Advantages of Structured Streaming
8. Streaming Data Sources/Sinks
Sources:
• Azure Event Hubs / IoT Hubs
• Azure Data Lake Gen2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
• Databricks Delta Tables
Sinks:
• Databricks Delta Tables
• Almost any sink using foreachBatch
9. • Source Parameters
• Source Format/Location
• Batch/File Size
• Transformations
• Streaming data can be transformed in the
same ways as static data
• Output Parameters
• Output Format/Location
• Checkpoint Location
Structured Streaming
[Diagram: Event Hub → Structured Streaming]
14. • Join Types
• Inner (Watermark and Time
Constraint Optional)
• Left Outer (Watermark and Time
Constraint Required)
• Right Outer (Watermark and Time
Constraint Required)
• You can also join static
tables/files into your stream-
stream join
Stream-Stream Joins
[Diagram: two Event Hub streams joined in Structured Streaming, with a static file input, producing micro-batches]
15. • Watermark – How late a record can
arrive and after what time can it be
removed from the state.
• Time Constraint – How long the
records will be kept in state in
relation to the other stream
• Only used in stateful operation
• Ignored in non-stateful streaming
queries and batch queries
Watermark vs. Time Constraint
19. • Allows Batch Type Processing to be performed on Streaming Data
• Perform Processes without adding to state
• dropDuplicates
• Aggregating data
• Perform a Merge/Upsert with Existing Static Data
• Write Data to multiple sinks/destinations
• Write Data to sinks not supported in Structured Streaming
foreachBatch
21. • Spark Shuffle Partitions –
• Equal to the number of cores on the Cluster
• Maximum Records per Micro-Batch
• File Source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
• EventHubs – maxEventsPerTrigger
• Limit Stateful Operations – reduces state size and memory errors
• Watermarking
• MERGE/Join/Aggregation
• Broadcast Joins
• Output Tables – Influences downstream streams
• Manually re-partition
• Delta Lake – Auto-Optimize
Going to Production
- Question responses from the polls
For the last year or so I have been working very heavily in Databricks – specifically in using it for big data processing with structured streaming.
So what we are going to look at today is for the user who:
maybe has played a little with Databricks
has used Spark in some other format in the past
has at least an idea of, or need for, big data processing, specifically in a real-time solution
So why Azure Databricks?
I have worked with many big data systems over the years on several different platforms.
I had also used Spark before.
But as more of a data architect and developer, I was always put off by what seemed like over complexity of the spark ecosystem.
There were a lot of elements, it took a lot of “under the hood” setup and tuning, and I would just always rather use something else.
Especially as we moved to Azure and the cloud, where I could just throw a never-ending amount of processing at my big data problems.
With Databricks I now get the best of both worlds.
A simple to setup, simple to maintain, easy to scale spark based system with all the development and processing benefits without all the technical and administrative overhead.
So with Azure Databricks you get Spark – directly from the people that invented it – but just in a fast, easy and collaborative cloud service.
You also get great integration with all the other Azure elements – Event Hubs, Key Vault, Data Lakes, Azure SQL, SQL Data Warehouse, Data Factory and even Azure DevOps.
Then you overlay your existing Azure security model with Active Directory right over it to provide a completely integrated security model.
Structured streaming then allows you to take all of that integration and processing power and apply it to a stream of big data to gain near real-time processing capabilities.
So you can process through large amounts of messages/events/files as they are received and perform the same computations on the data that you could with a static dataset.
At the same time Databricks automatically keeps a record of the data as it is processed, allowing almost seamless restarts if a failure were to occur in the process.
This allows you to generate datasets in near real-time – providing marketable insights to your business.
There are several different source and sink locations that can be used with streaming in Databricks.
Within the Azure ecosystem, Azure Event Hubs and Databricks Delta tables in Azure Data Lake are the most popular, but other sources like Apache Kafka or Amazon Kinesis are also common.
You can also use the file queue in Data Lake Gen2 with Auto Loader to load blob files as they are saved to a file location.
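A minimal Auto Loader sketch of the idea above: pick up files from a Data Lake Gen2 location as they land, using the `cloudFiles` format. This assumes a Databricks runtime (Auto Loader is not in open-source Spark), and the paths are hypothetical placeholders.

```python
# Hypothetical mount-point paths - replace with your own locations.
INPUT_PATH = "/mnt/datalake/landing/transactions/"
CHECKPOINT_PATH = "/mnt/datalake/checkpoints/autoloader/"
OUTPUT_PATH = "/mnt/datalake/delta/transactions_raw/"


def start_autoloader_stream(spark):
    """Stream newly landed JSON blobs into a Delta table as they arrive."""
    stream = (spark.readStream
              .format("cloudFiles")                 # Auto Loader source
              .option("cloudFiles.format", "json")  # format of the landed files
              .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
              .load(INPUT_PATH))

    # The checkpoint tracks which files have already been ingested.
    return (stream.writeStream
            .format("delta")
            .option("checkpointLocation", CHECKPOINT_PATH)
            .start(OUTPUT_PATH))
```

The same pattern works for CSV or Parquet drops by changing `cloudFiles.format`.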
You can use almost anything as a sink by using the foreachBatch method which we will take a look at later.
So a typical structured streaming pipeline is made up of 3 parts: the source, any transformations, and the output sink or destination.
In our first example we will look at the source being an event hub message stream, add some minor transformations, and then sink the results to a Databricks Delta table.
Each source has some specific options or parameters, such as format, connection information, file location, etc.
The transformations can be any transformation you can perform on a static dataset.
And the output can again have specific options and formats based on the type, including the destination location or partitioning information.
The key element that makes the sink of a streaming data source different is the checkpoint location.
This checkpoint allows the stream to keep state on which messages have been read from the source and, if the stream is interrupted, where to pick up on restart.
In the case of the Event Hub queue the checkpoint keeps track of the specific message offset on each partition.
Also note that to use an Event Hub source you must add the azure event hubs library to your cluster and import the microsoft.azure.eventhubs library into your notebook.
TASK – Need data elements and code. Databricks environment
Can all be in the same command, can be in as many commands as you want
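The source → transform → Delta sink pipeline described above could be sketched as below. This is a hedged sketch, not the webinar's actual demo code: it assumes a Databricks cluster with the Azure Event Hubs Spark connector installed, and the connection string, schema, and paths are hypothetical.

```python
# Hypothetical Event Hubs source options. In a real notebook you would pull
# the connection string from Key Vault rather than hard-coding it.
EH_CONF = {
    "eventhubs.connectionString": "Endpoint=sb://<namespace>.servicebus.windows.net/...",
    "eventhubs.consumerGroup": "$Default",
}

CHECKPOINT_PATH = "/mnt/datalake/checkpoints/transactions"  # hypothetical
OUTPUT_PATH = "/mnt/datalake/delta/transactions"            # hypothetical


def start_transaction_stream(spark):
    """Read the Event Hub stream, apply light transformations, sink to Delta."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Hypothetical schema for the JSON message body.
    schema = StructType([
        StructField("transactionId", StringType()),
        StructField("customerId", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = (spark.readStream
           .format("eventhubs")
           .options(**EH_CONF)
           .load())

    # The body arrives as binary: cast it to string, then parse the JSON.
    parsed = (raw
              .select(F.col("body").cast("string").alias("body"),
                      F.col("enqueuedTime").alias("eventTime"))
              .select(F.from_json("body", schema).alias("t"), "eventTime")
              .select("t.*", "eventTime"))

    # The checkpoint location is what lets the stream restart where it left off.
    return (parsed.writeStream
            .format("delta")
            .outputMode("append")
            .option("checkpointLocation", CHECKPOINT_PATH)
            .start(OUTPUT_PATH))
```

As the note above says, the read, transform, and write can live in one notebook command or be split across several.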
Structured Streaming supports joining a streaming Dataset or DataFrame with a static Dataset or DataFrame – such as binding our transactional table to other dimensional information – like sales info to an item table, customer information or sale territories.
It also supports joining to another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally as the micro-batches are executed and looks similar to the results of our previous streaming aggregation example.
So in the upcoming demonstrations we will look at a few of these examples and see how the different join types (i.e. inner, outer, etc.) are handled.
In all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it were with a static Dataset/DataFrame containing the same data as the stream.
When a streaming dataset and a static dataset are used, then only an inner join and a left outer join are supported. Right outer joins and full outer joins are not supported.
Inner joins and left outer joins on streaming and static datasets don’t have to be stateful, which improves your performance. The records in any single micro batch can be matched with a static set of records.
TASK – need data and example code
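A stream-static join of the kind described above might look like the sketch below: the streaming transactions are enriched with a static item dimension table. The table path and column names are hypothetical.

```python
def enrich_transactions(transactions_stream, spark):
    """Inner-join a streaming DataFrame to a static Delta table.

    Because the static side is re-read per micro-batch, this join is not
    stateful - no watermark is needed, which keeps performance high.
    """
    # Hypothetical static dimension table (e.g. item descriptions).
    items = spark.read.format("delta").load("/mnt/datalake/delta/items")

    # "inner" and "left" are the join types supported for stream-static joins
    # with the stream on the left side.
    return transactions_stream.join(items, on="itemId", how="inner")
```

Swapping `how="inner"` for `how="left"` keeps transactions that have no matching item row.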
Stream to Stream joins support inner, left and right joins, but with differing requirements.
While watermarking is not required on an inner join, unless you can be sure both records will exist at some point, it is best to use it. Otherwise you may have records that stay in state indefinitely and are never cleaned up.
It’s very important to understand the difference between watermark and time constraint.
Watermarking a stream decides how delayed a record can arrive and gives a time when records can be dropped. For example, if you set a watermark for 30 minutes, then records older than 30 minutes will be dropped/ignored.
Time constraints decide how long the records will be retained in Spark's state in relation to the other stream.
So in our scenario we are going to receive our transaction data, and in addition we are going to get View data from our website.
So we want to analyze: for customer X, after buying item Y, how many other items did they view in the next 5 minutes?
Another thing to remember that often gets people is that the watermark is not from the “current time”, it is from the last event time that the system saw. So if you have not received new messages in the stream, it will not apply.
We have several possible outcomes.
The transaction may be late, so how long do we want to keep that record? – this can depend on the volume of records and the source system. If you have a large volume, but few late records you can make this timeframe shorter.
The views may be late, or even before the transaction – so again how long do we want to keep those records in memory – it has to be >= 5 minutes since that is our time constraint.
They may not view anything else, so if we want to know that, we need to use a left join so we can get transactions that have no view data within 5 minutes.
TASK – Need data elements and code. Databricks environment
Can all be in the same command, can be in as many commands as you want
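The transaction/view scenario above could be sketched as a stream-stream join like the one below. Column names and watermark lengths are illustrative assumptions; as discussed, you would tune the watermarks to how late each source can realistically be, keeping the view-side retention at least as long as the 5-minute time constraint.

```python
# The time constraint lives in the join condition itself: only views that
# happen within 5 minutes after the transaction qualify.
JOIN_CONDITION = """
    transactions.customerId = views.customerId
    AND views.viewTime >= transactions.txnTime
    AND views.viewTime <= transactions.txnTime + interval 5 minutes
"""


def join_transactions_and_views(transactions, views):
    from pyspark.sql import functions as F

    # Watermarks: how late a record may arrive before its state is dropped.
    # Remember the watermark advances with observed event time, not the clock.
    txn = transactions.withWatermark("txnTime", "30 minutes").alias("transactions")
    vw = views.withWatermark("viewTime", "10 minutes").alias("views")

    # Left outer join so transactions with *no* views in the window still
    # appear (with nulls on the view side). Outer stream-stream joins
    # require both the watermark and the time constraint.
    return txn.join(vw, F.expr(JOIN_CONDITION), "leftOuter")
```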
The last element of structured streaming that we are going to review is foreachBatch.
What foreachBatch really lets you do is “cheat” on your streaming. You can take the streaming micro-batch, pass it to the foreachBatch method, and then perform anything you could normally do in standard batch processing.
One of the key things to do is to perform normally “stateful” processing – a great example of this is dropping duplicates
As you get into more complex data structures you might also have need to perform aggregations on the micro batch itself. So if you had a complex structure like a sales ticket, that contained multiple individual sale items, you might want to aggregate those by item or department before saving them. In the foreachBatch you could perform the aggregation, then save the data.
Another great use is when you need to save the same streaming data to multiple sinks. This might be to update a summary dataset and to save the detailed record at the same time.
This method can also be used to write data to sinks that are not supported in streaming – such as an SQL database table.
TASK – need data and example code
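A foreachBatch sketch covering the uses above: de-duplicate the micro-batch, MERGE it into an existing Delta table, and also append it to a second sink (an Azure SQL table over JDBC) that streaming cannot write to directly. Table names, paths, and the JDBC URL are hypothetical.

```python
# Upsert statement run against each micro-batch, which is first registered
# as a temp view named "batch_updates".
MERGE_SQL = """
    MERGE INTO sales_summary AS target
    USING batch_updates AS source
    ON target.transactionId = source.transactionId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""


def upsert_to_sinks(batch_df, batch_id):
    # Dedup within the batch only - no global state is kept across batches.
    deduped = batch_df.dropDuplicates(["transactionId"])

    # 1) Merge/upsert into the existing Delta summary table.
    deduped.createOrReplaceTempView("batch_updates")
    batch_df.sparkSession.sql(MERGE_SQL)

    # 2) Also append the detail rows to a non-streaming sink over JDBC.
    (deduped.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://<server>;database=<db>")  # hypothetical
     .option("dbtable", "dbo.Transactions")
     .mode("append")
     .save())


def start_stream(stream_df, checkpoint="/mnt/checkpoints/foreachbatch"):
    return (stream_df.writeStream
            .foreachBatch(upsert_to_sinks)
            .option("checkpointLocation", checkpoint)
            .start())
```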
This topic could really be its own webinar, but I did want to touch on some of the items you will want to look at when you get ready to move to production with your stream.
There is a really good session from the Spark + AI Summit 2020 that covers what types of issues to look for, and I will put that in the chat.
https://databricks.com/session_na20/performant-streaming-in-production-preventing-common-pitfalls-when-productionizing-streaming-jobs
But some of the items we want to watch for, that are harder to fix once you have started to run a process in production, are things like the shuffle partition setting, which can limit disk shuffle and greatly increase performance. Once that is set, the value is saved in the stream's checkpoint metadata and is hard to change if you need to scale the number of cores on your cluster up or down.
Another is the “auto-optimize” setting on your delta tables. By default, if you write streaming data to delta you will get a lot of very small files. You can setup a job to optimize the tables periodically, but in a real-time environment it is best to let the system optimize as data is processed. You can set your delta tables to auto-optimize which will reduce your number of files and increase the size of the files to help downstream performance.
You can also manipulate the size of the micro batch by changing the number of events/files/bytes that are consumed – depending on your source. This again is to help keep your processing from having to use disk for the shuffle partitions.
Finally, as you design your streaming environment, try to limit the number of stateful processes you bring into the streams. By limiting things like deduplication of the stream itself, the number of aggregations, and the length of any watermarking, or by using the broadcast join hint on smaller static tables, you can greatly increase your record throughput and reduce memory usage and errors.
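The production knobs from this section could be gathered in one place like the sketch below. The values are illustrative starting points, not recommendations; size them to your own cluster and volume, and note the Delta auto-optimize option names are the Databricks ones.

```python
PROD_SETTINGS = {
    # Match shuffle partitions to cluster cores; for stateful streams this
    # value is pinned by the checkpoint, so choose it before going live.
    "spark.sql.shuffle.partitions": "8",
    # Delta auto-optimize: fewer, larger files written as data is processed,
    # which helps downstream reads without a separate OPTIMIZE job.
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
}


def apply_prod_settings(spark):
    """Apply the session-level settings above before starting any streams."""
    for key, value in PROD_SETTINGS.items():
        spark.conf.set(key, value)


# Cap micro-batch size at the source so shuffles stay in memory. The option
# name depends on the source: maxFilesPerTrigger / maxBytesPerTrigger for
# file and Delta sources, maxEventsPerTrigger for Event Hubs.
SOURCE_LIMITS = {"maxFilesPerTrigger": 100}  # illustrative value
```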