Building Continuous Application with Structured Streaming and Real-Time Data Source with Arijit Tarafdar and Nan Zhu


One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains: how to connect it with various continuous, real-time data sources while guaranteeing the responsiveness and reliability of data applications.

In this talk, Nan and Arijit will summarize the experience gained from serving real-time Spark-based data analytic solutions on Azure HDInsight. Their solution seamlessly integrates Spark with Azure Event Hubs, a hyper-scale telemetry ingestion service that lets users ingest massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.

They’ll cover three topics: bridging the gap between the data communication models of Spark and the data source, adapting Spark to the rate control and message addressing of the data source, and the co-design of fault-tolerance mechanisms. This talk shares insights on how to build continuous data applications with Spark and encourages the availability of more connectors between Spark and different real-time data sources.

Published in: Data & Analytics

  1. Arijit Tarafdar Nan Zhu Azure HDInsight @ Microsoft Building Structured Streaming Connector for Continuous Applications
  2. Who Are We? • Nan Zhu – Software Engineer @ Azure HDInsight. Works on the Spark Streaming/Structured Streaming service in Azure. Committee member of XGBoost@DMLC and Apache MXNet (incubating). Spark contributor. • Arijit Tarafdar – Software Engineer @ Azure HDInsight. Works on Spark/Spark Streaming on Azure. Previously worked with other distributed platforms like DryadLINQ and MPI. Also worked on graph coloring algorithms that were contributed to ADOL-C (https://projects.coin-or.org/ADOL-C).
  3. Agenda • What is/Why Continuous Application? – What are the challenges when we already have Spark in hand? • Introduction of Azure Event Hubs and Structured Streaming • Key Design Considerations in Building a Structured Streaming Connector • Testing a Structured Streaming Connector • Summary
  4. Continuous Application Architecture and Role of Spark Connectors (1) • Not only is the size of data increasing, but also its velocity – sensors, IoT devices, social networks and online transactions are all generating data that needs to be monitored constantly and acted upon quickly.
  5. Continuous Application Architecture and Role of Spark Connectors (2) Processing Engine Real-time Data Source Acting Components Continuous Raw Data Continuous Insights Block fraudulent transactions, broadcast alerts, etc. Sensors, IoT devices, log collectors, etc. Spark Streaming, Structured Streaming How to connect Spark and real-time data sources?
  6. Building a Connector for a Real-time Data Source and Structured Streaming, taking Azure Event Hubs as an example (comparing with the Kafka connector)
  7. What is Azure Event Hubs? Platform as a Service
  8. What is Structured Streaming? • A relatively new component in Spark – Introduced in Spark 2.0 – Has kept evolving from Spark 2.0 to 2.2 • Streaming analytic engine for structured data – Shares the same set of APIs with the Spark SQL batch engine – Runs computation over continuously arriving data and keeps updating results
  9. Structured Streaming Abstraction in Structured Streaming Input Batch X, X+1, … Batch X, X+1, … (Spark 2.1 implementation) Source FS Source EventHubs Source … Describe the data source: 1. getOffset: offset of the last message to be processed in the next batch 2. getBatch: build the DataFrame for the next batch Processing Spark SQL Core getOffset/getBatch User-Defined Operations Define the data-transforming task with Spark SQL APIs (Structured Streaming internal) Output Sink FS Sink Console Sink … Describe the data sink: 1. addBatch: evaluate the DataFrame and save the data to the target system Still follows the micro-batch streaming model
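The Source/Sink contract on the slide above can be sketched in plain Scala. `Offset` and `DataFrame` below are simplified stand-ins (Spark's real traits live in org.apache.spark.sql.execution.streaming and use Spark's own types), and `ListSource` is a toy source invented purely for illustration:

```scala
// Stand-in types: NOT Spark's real classes.
final case class Offset(json: String)
final case class DataFrame(rows: Seq[String])

// Shape of the Source contract described on the slide.
trait Source {
  def getOffset: Option[Offset]                               // offset of the last message of the next batch
  def getBatch(start: Option[Offset], end: Offset): DataFrame // build the data for the next batch
}

// Shape of the Sink contract described on the slide.
trait Sink {
  def addBatch(batchId: Long, data: DataFrame): Unit          // evaluate and persist the batch
}

// A toy in-memory source over a fixed message list.
class ListSource(messages: Seq[String]) extends Source {
  def getOffset: Option[Offset] =
    if (messages.isEmpty) None else Some(Offset(messages.size.toString))
  def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.json.toInt).getOrElse(0) // None means "from the beginning"
    DataFrame(messages.slice(from, end.json.toInt))
  }
}
```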
  10. Implementation of a “Source” Description Implementation getOffset() Return the last offset of the next batch: EndOffsetOfLastBatch “+” RateControlFunc(…) getBatch(startOffset, endOffset) Build the DataFrame for the next batch; SS internals pass in startOffset and endOffset (results of getOffset) 1. How to represent the offset 2. How to define the rate (to avoid tracking the size of each message, rate is usually defined as # of messages)
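A minimal, Spark-free sketch of the getOffset() formula above (EndOffsetOfLastBatch “+” RateControlFunc). Names such as `maxRatePerBatch` and `PartitionOffset` are illustrative assumptions, not the connector's actual API:

```scala
object RateControlSketch {
  // Offsets here are per-partition message sequence numbers.
  final case class PartitionOffset(partition: Int, offset: Long)

  // getOffset(): advance each partition by at most `maxRatePerBatch`
  // messages, but never past the highest offset available on the server.
  def nextBatchEndOffsets(
      endOffsetsOfLastBatch: Seq[PartitionOffset],
      highestAvailable: Map[Int, Long],
      maxRatePerBatch: Long): Seq[PartitionOffset] =
    endOffsetsOfLastBatch.map { case PartitionOffset(p, last) =>
      PartitionOffset(p, math.min(last + maxRatePerBatch, highestAvailable(p)))
    }
}
```

Defining the rate as a message count (rather than bytes) is exactly the simplification the slide mentions: it avoids tracking the size of each message.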
  12. Various Forms of Message Offset • Consecutive numbers – 0, 1, 2, 3, … – Examples: Kafka, MQTT – Server-side index mapping an integer to the actual offset of the message in disk space • Real offset – 0, sizeOf(msg0), sizeOf(msg0 + msg1), … – Examples: Azure Event Hubs – No server-side index; the offset is passed to the user as part of the message
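The two offset forms can be made concrete in a few lines of plain Scala: a consecutive-number offset is just the message index, while a "real offset" is the prefix sum of the preceding message sizes, so it cannot be derived from the index alone:

```scala
object OffsetForms {
  // Kafka-style: server-side index, offsets are simply 0, 1, 2, ...
  def consecutiveOffsets(numMessages: Int): Seq[Long] =
    (0 until numMessages).map(_.toLong)

  // Event Hubs-style: 0, sizeOf(msg0), sizeOf(msg0)+sizeOf(msg1), ...
  // Computing the offset of message i requires the sizes of all earlier
  // messages (or receiving the message itself, which carries its offset).
  def realOffsets(messageSizes: Seq[Long]): Seq[Long] =
    messageSizes.scanLeft(0L)(_ + _).init
}
```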
  13. How does it make a difference? – Kafka Driver Executor Executor Kafka Batch 0: Mapping a partition of the DataFrame to an offset range on the server side: (endOffsetOfLastBatch: 0, # of msgs: 1000) Batch 1: How many messages are to be processed in the next batch, and where to start? (endOffsetOfLastBatch: 999, # of msgs: 1000)
  14. How does it make a difference? – Event Hubs Driver Executor Executor Batch 0: How many messages are to be processed in the next batch, and where to start? (endOffsetOfLastBatch: 0, # of msgs: 1000) Azure Event Hubs Batch 1: How many messages are to be processed in the next batch, and where to start? (endOffsetOfLastBatch: ?, endOffset: 1000) What’s the offset of the 1000th message??? The answer appears on the executor side (when the task receives the message, with the offset as part of the message). Build a channel to pass information from executors to the driver!!!
  15. Difference between KafkaSource and EventHubsSource • getOffset – Kafka: EndOffsetOfLastBatch is known before the batch is finished; calculate the targetOffset of the next batch. Event Hubs: collect the ending offsets of the last batch from the executors. • getBatch(startOffset, endOffset) – Kafka: startOffset is EndOffsetOfLastBatch (the value passed in from SS internals). Event Hubs: startOffset is the collected values. More details about the two-layered index that works with OffsetSeqLog in SS are available for offline discussion or Q&A.
  16. How does it make a difference? – Event Hubs Driver Executor Executor Batch 0: How many messages are to be processed in the next batch, and where to start? (startOffset: 0, # of msgs: 1000) Azure Event Hubs Batch 1: How many messages are to be processed in the next batch, and where to start? (startOffset: ?, endOffset: 1000) What’s the offset of the 1000th message??? The answer appears on the executor side (when the task receives the message, with the offset as part of the message). Build a channel to pass information from executors to the driver!!! HOW?
  17. HDFS-Based Channel LastOffset_P1_Batch_i LastOffset_PN_Batch_i Source DataFrame .map() Transformed DataFrame What’s the next step??? Simply let the driver-side logic read the files? No!!! • APIs like DataFrame.head(x) evaluate only some of the partitions… Batch 0 generates 3 files, and Batch 1 generates 5 files… • You have to merge the latest files with the historical results, commit, and then direct the driver-side logic to read
  18. HDFS-Based Channel LastOffset_P1_Batch_i LastOffset_PN_Batch_i EventHubsRDD Tasks .map() MapPartitionsRDD Tasks • APIs like DataFrame.head(x) evaluate only some of the partitions… Batch 0 generates 3 files, and Batch 1 generates 5 files… • You have to merge the latest files with the historical results and commit… • Ensure that all streams’ offsets are committed transactionally • Discard the partially merged/committed results to rerun the batch
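A sketch of the channel's commit protocol, using local files in place of HDFS. The file naming follows the LastOffset_P*_Batch_* pattern on the slide, while the function names are invented for illustration; an atomic rename stands in for the transactional commit, so a partially written batch can simply be discarded and rerun:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object OffsetChannelSketch {
  // Executor side: the task for partition `partition` records the last
  // offset it saw in batch `batch`.
  def writeTaskProgress(dir: Path, batch: Long, partition: Int, lastOffset: Long): Unit =
    Files.write(dir.resolve(s"LastOffset_P${partition}_Batch_$batch"),
      lastOffset.toString.getBytes(StandardCharsets.UTF_8))

  // Driver side: merge every partition's progress file for the batch and
  // commit the merged result atomically via rename; readers only ever see
  // fully committed progress files.
  def commitBatch(dir: Path, batch: Long, partitions: Seq[Int]): Map[Int, Long] = {
    val merged = partitions.map { p =>
      val bytes = Files.readAllBytes(dir.resolve(s"LastOffset_P${p}_Batch_$batch"))
      p -> new String(bytes, StandardCharsets.UTF_8).toLong
    }.toMap
    val tmp = dir.resolve(s"progress_$batch.tmp")
    Files.write(tmp, merged.map { case (p, o) => s"$p:$o" }.mkString("\n")
      .getBytes(StandardCharsets.UTF_8))
    Files.move(tmp, dir.resolve(s"progress_$batch"), StandardCopyOption.ATOMIC_MOVE)
    merged
  }
}
```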
  19. Takeaways (1) • The data source is not designed only for Structured Streaming • Accommodate the connector implementation to the data source design • Message addressing (offset) on the data source side is the key design factor to be considered – It may require an additional information-sharing channel between executors and the driver • Design to ensure the correctness of the channel
  20. Structured Streaming Unit Testing Framework • “Action”-based structured streaming test flow – StartStream, AddData, CheckAnswer, AdvanceClock, StopStream, etc. • Kafka source unit tests use a Kafka micro-service. • The unit test framework is not usable as is.
  21. Unit Testing Event Hubs Source • No Kafka-type micro-service available for Event Hubs. • Two simulated clients – the Event Hubs AMQP and Event Hubs REST clients. • Replace both clients before the first call to the Event Hubs source.
  22. Issues With Unit Test Framework • No separation between test setup and test execution. • Unreliable update of the Event Hubs clients before the asynchronous call to StreamingQueryManager.startQuery(). • Task serialization issue with AddData as part of StreamTest.
  23. What Is The Asynchrony Problem? StreamingQueryManager.startQuery() : StreamingQuery → StreamingQueryManager.createQuery() : StreamingQueryWrapper → StreamingQueryWrapper.StreamExecution.start() | EventHubsSource.setEventHubsClient(), EventHubsSource.setEventHubsReceiver() (EventHubs test code vs. Spark source code; Thread 2 spawned by Thread 1)
  24. One Way To Solve The Asynchrony Problem StreamingQueryManager.startQuery() : StreamingQuery → StreamingQueryManager.createQuery() : StreamingQueryWrapper → StreamingQueryWrapper.StreamExecution.start() | EventHubsSource.setEventHubsClient(), EventHubsSource.setEventHubsReceiver() (EventHubs test code vs. Spark source code; Thread 2 spawned by Thread 1)
  25. Solving Asynchrony in Client Substitutions (1)
      • Invoke createQuery
        val createQueryMethod = sparkSession.streams.getClass.getDeclaredMethods.filter(m => m.getName == "createQuery").head
        createQueryMethod.setAccessible(true)
        currentStream = createQueryMethod.invoke(…)
      • Set as active query
        val activeQueriesField = sparkSession.streams.getClass.getDeclaredFields.filter(f => f.getName == "org$apache$spark$sql$streaming$StreamingQueryManager$$activeQueries").head
        activeQueriesField.setAccessible(true)
        val activeQueries = activeQueriesField.get(sparkSession.streams).asInstanceOf[mutable.HashMap[UUID, StreamingQuery]]
        activeQueries += currentStream.id -> currentStream
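The reflection pattern on this slide (unlocking a private method, then mutating a private field) can be demonstrated in a self-contained way. `QueryManager` below is a simplified stand-in invented for illustration, not Spark's StreamingQueryManager:

```scala
// Stand-in for a class we do not control, with private internals.
class QueryManager {
  private val activeQueries = scala.collection.mutable.HashMap.empty[Int, String]
  private def createQuery(id: Int, name: String): String = s"query-$name"
  def isActive(id: Int): Boolean = activeQueries.contains(id)
}

object ReflectionSketch {
  def registerViaReflection(mgr: QueryManager, id: Int, name: String): String = {
    // Find and unlock the private method, as in the test code above.
    val m = mgr.getClass.getDeclaredMethods.filter(_.getName == "createQuery").head
    m.setAccessible(true)
    val query = m.invoke(mgr, Int.box(id), name).asInstanceOf[String]
    // Find and unlock the private field, then mutate the map it holds.
    val f = mgr.getClass.getDeclaredFields.filter(_.getName.endsWith("activeQueries")).head
    f.setAccessible(true)
    f.get(mgr).asInstanceOf[scala.collection.mutable.HashMap[Int, String]] += id -> query
    query
  }
}
```

Splitting createQuery from start() this way lets the test substitute its simulated clients in between, before the query thread is spawned.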
  26. Solving Asynchrony in Client Substitutions (2)
      • Substitute the Event Hubs clients in the source
        val eventHubsSource = sources.head
        eventHubsSource.setEventHubClient(…)
        eventHubsSource.setEventHubsReceiver(…)
      • Start StreamExecution
        currentStream.start()
  27. What Is The Task Serialization Issue? SimulatedEventHubs StreamTest AddEventHubsData AddData TestEventHubsReceiver (×5) Partition (×5) EventHubsRDD
  28. What Is The StreamTest Inheritance Model? SparkFunSuite StreamTest PlanTest QueryTest FunSuite
  29. Solving the Task Serialization Issue • Separate StreamTest into two traits: – EventHubsStreamTest: trait EventHubsStreamTest extends QueryTest with BeforeAndAfter with SharedSQLContext with Timeouts with Serializable { } – EventHubsAddData: trait EventHubsAddData extends StreamAction with Serializable {}
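Why the Serializable split matters can be shown with a plain Java-serialization round-trip, which is effectively the check Spark performs before shipping a task closure to executors. The types below are simplified stand-ins for the real StreamTest ones:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// An action is shipped to executors inside a task closure, so everything
// it references must be java.io.Serializable.
trait StreamAction extends Serializable

// Holding only serializable state keeps the whole action serializable.
case class AddSimulatedData(partition: Int, payload: Seq[String]) extends StreamAction

object SerializationSketch {
  // Round-trip check: can this object be written to an object stream?
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj); out.close(); true
    } catch { case _: java.io.NotSerializableException => false }
}
```

An action mixing in a non-serializable suite (the original StreamTest inheritance) fails this check at task-submission time, which is exactly the issue the trait separation fixes.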
  30. After All These Changes… testStream(sourceQuery)( StartStream(…), CheckAnswer(…), AddEventHubsData(…), AdvanceManualClock(…), CheckAnswer(…), StopStream, StartStream(…), CheckAnswer(…), AddEventHubsData(…), AdvanceManualClock(…), AdvanceManualClock(…), CheckAnswer(…))
  31. Summary/Takeaways (2) • We have a production-grade Structured Streaming connector for the widely used Event Hubs streaming source on Azure (https://github.com/hdinsight/spark-eventhubs) • Design, implementation and testing of Structured Streaming connectors: o Accommodate the connector implementation to the data source design (message addressing) o Unit test with a simulated service o Asynchrony – the biggest enemy of an easy test o Clear boundary between test setup and test execution
  32. Contributing Back To Community • Failed recovery from checkpoint caused by a multi-thread issue in the Spark Streaming scheduler: https://issues.apache.org/jira/browse/SPARK-19280 – One realistic example of its impact: you can get wrong data when you use Kafka with reduceByWindow and recover from a failure • Data loss caused by improper post-batch-completed processing: https://issues.apache.org/jira/browse/SPARK-18905 • Inconsistent behavior of Spark Streaming checkpoint: https://issues.apache.org/jira/browse/SPARK-19233
  33. Thank You!!! https://github.com/hdinsight/spark-eventhubs https://azure.microsoft.com/en-us/services/hdinsight/
