Blue Pill/Red Pill: The Matrix of Thousands of Data Streams

Designing a streaming application that has to process data from 1 or 2 streams is easy. Any streaming framework that provides scalability, high throughput, and fault tolerance would work. But when the number of streams grows into the 100s or 1000s, managing them can be daunting. How would you share resources among 1000s of streams, all of them running 24×7? How would you manage their state, apply advanced streaming operations, and add or delete streams without restarting? This talk explains common scenarios and shows techniques that can handle thousands of streams using Spark Structured Streaming.

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Knoldus Inc. Blue Pill / Red Pill: The Matrix of Thousands of Data Streams #UnifiedDataAnalytics #SparkAISummit
  3. About Me ● My name is Himanshu Gupta ● Lead Consultant at Knoldus Inc. ● Twitter: @himanshug735 ● LinkedIn: https://www.linkedin.com/in/himanshu-gupta-25189629/
  4. Agenda ● The Need ● Challenges ● Our Solution ● Future Work
  5. The Need: Make Better Use of Real-Time Data
  6. The Need: Make Better Use of Real-Time Data
  7. Benefits of Real-Time Data ● In 2014, real-time data analysis reduced the crude mortality rate from 7.75% to 6.42% at Queen Alexandra Hospital in Portsmouth and University Hospital Coventry. ● The world's largest hedge fund, Bridgewater, uses Twitter for real-time economic modeling.
  8. Solution: One Platform. An end-to-end real-time data platform that can analyze and prepare data in a single platform-as-a-service.
  9. Challenge ● Collecting data from 1000s of streams is difficult. ● Using each stream for a different purpose makes processing harder. ● Managing data of mission-critical value is a challenge.
  10. How to overcome the challenges?
  11. Stream Data ● Streaming data into 1000s of streams is a resource-intensive process. ● Since streaming requires dedicated resources, the number of streams a system can support is limited by the resources available. ● However, if combined, streams can be managed much more efficiently. ● Also, starting/stopping a stream becomes easy since data is managed by group.
  12. Group Data: For example, consider a power plant that has 100s of devices emitting data in real time. The data contains information about different parameters of each device, such as temperature and speed. Since the data comes from one source (the power plant), it is a good candidate for grouping into one stream.
  13. Output: Because Kafka is used, the result of combining data from different streams into one looks like the output shown on the slide, where one key represents one device of the power plant from the previous example.
  14. Analyze Data ● Analyzing combined/grouped data has many challenges. ● For example, applying different analytics to different data sources. ● Or, managing the state of each data source.
  15. Use Spark: Since the introduction of Structured Streaming in Apache Spark 2.0, the way streams are processed has changed a lot, as it brought many features that were previously unheard of.
  16. Why Spark? ● Provides support for ad-hoc queries, i.e., helps in applying different analytics to different data sources. ● Manages the state of each data source via Arbitrary Stateful Operations.
  17. Store Data ● Storing data might look like an easy task, but it is not. ● After multiple data sources have been analyzed, it is difficult to materialize the results and save them in different locations. ● The data also has to be saved in a way that makes retrieval easy.
  18. Again! Use Spark: Apache Spark comes to the rescue here as well. Spark Structured Streaming has support for 6 different types of output sinks.
  19. Result: The data is saved in a hierarchical, file-system-like layout in AWS S3. Each sub-file represents a device/data source in the power plant example.
  21. Future Work
  22. DON’T FORGET TO RATE AND REVIEW THE SESSIONS | SEARCH SPARK + AI SUMMIT
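
The flow described in slides 11 to 19 (group many device streams into one keyed stream, apply a different analytic per data source while keeping per-source state, then store results in a per-device hierarchy) can be illustrated with a minimal sketch. It is plain Python and does not use the Kafka or Spark APIs. The `Reading` type, the device names, the two analytics, and the path layout are all assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from itertools import chain


@dataclass(frozen=True)
class Reading:
    device_id: str   # hypothetical record key: one key per device
    parameter: str   # e.g. "temperature" or "speed"
    value: float


def group_streams(*streams):
    """Combine many per-device streams into one stream of (key, reading) pairs."""
    return ((r.device_id, r) for r in chain.from_iterable(streams))


def running_mean(state, value):
    """State is (count, total); returns (new_state, current mean)."""
    count, total = state if state is not None else (0, 0.0)
    count, total = count + 1, total + value
    return (count, total), total / count


def running_max(state, value):
    """State is the largest value seen so far; returns (new_state, current max)."""
    best = value if state is None else max(state, value)
    return best, best


# Hypothetical mapping: a different analytic for each device key.
ANALYTICS = {"turbine-1": running_mean, "pump-7": running_max}


def analyze(keyed_stream, analytics, default=running_mean):
    """Apply each key's analytic, keeping separate state per device and parameter."""
    states = {}
    for key, reading in keyed_stream:
        update = analytics.get(key, default)
        slot = (key, reading.parameter)
        states[slot], result = update(states.get(slot), reading.value)
        yield key, reading.parameter, result


def output_path(prefix, plant, key):
    """Hypothetical hierarchical layout: one sub-path per device under the plant."""
    return "/".join((prefix, plant, key))


if __name__ == "__main__":
    turbine = [Reading("turbine-1", "temperature", 70.0),
               Reading("turbine-1", "temperature", 74.0)]
    pump = [Reading("pump-7", "speed", 1200.0),
            Reading("pump-7", "speed", 1100.0)]
    for key, parameter, result in analyze(group_streams(turbine, pump), ANALYTICS):
        print(output_path("plant-data", "plant-a", key), parameter, result)
```

In this sketch a device that is not listed in `ANALYTICS` is simply a new key handled by the default analytic, which mirrors the point on slide 11 that managing data by group makes starting and stopping individual streams easier. In a real deployment, Kafka message keys and Spark's arbitrary stateful operations would presumably take the place of the dictionary-based state used here.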
