Data Con LA 2022 Keynote

  1. Next Generation Apache Spark Structured Streaming. Karthik Ramasamy, Head of Streaming, Databricks. Project #Lightspeed
  2. Stream Processing: data from DBMS/CDC, apps, collection agents, and IoT devices lands in a message bus (e.g. Pulsar, Kafka) or files; streaming transformations (window aggregation, pattern detection, enrichment, routing) process the data continuously and incrementally as it appears, feeding triggers and alerts, real-time analytics, applications, and operational applications.
  3. Explosion of streaming: trillions of rows of data processed from thousands of sources. Industries: manufacturing, retail, financial services, healthcare, energy, gaming, technology & software, media & entertainment. Use cases: fraud detection, personalization, Covid-19 response, predictive maintenance, smart pricing, player interaction analytics, connected cars, smart homes, content recommendations.
  4. Growth of Spark Structured Streaming: >150% YoY streaming job growth; the most downloaded streaming engine from Maven Central.
  5. 1200+ customers using Structured Streaming on the Lakehouse; 9x growth in usage in 3 years.
  6. Spark Structured Streaming powers thousands of your everyday applications today. Unified batch & streaming APIs: developers use the same business logic across batch and stream processing. Fault tolerance & recovery: automatic checkpointing and failure recovery allow for reliable operations. Performance and throughput: handles >14M events/sec (1.2T events per day) for the most challenging workloads. Flexible operations: arbitrary logic and operations on the output of a streaming query. Stateful processing: support for stateful aggregations and joins, along with watermarks for bounded state.
  7. New streaming applications: proactive maintenance in oil drilling, elevator dispatch, tracing microservices. Their requirements: (1) consistent sub-second latency, (2) ease of expressing processing logic for complex use cases, (3) integrations with new cloud source and sink systems.
  8. Structured Streaming needs to evolve to satisfy these new requirements.
  9. Project Lightspeed: the next generation of Spark Structured Streaming.
  10. Project Lightspeed: faster and simpler stream processing. Predictable low latency: target reduction in tail latency by up to 2x. Enhanced functionality: advanced capabilities for processing data with new operators and easy-to-use APIs. Operations & troubleshooting: simplifying deployment, operations, monitoring, and troubleshooting. Connectors & ecosystem: improving ecosystem support for connectors, authentication, and authorization features.
  11. Project Lightspeed - Predictable low latency: faster bookkeeping through offset management. Today each micro-batch sequentially persists its offset ranges to external storage, processes the data, and then marks the batch done in external storage. With the improvement, micro-batch processing is overlapped with asynchronously persisting the offset ranges, and the separate "mark batch done" write goes away; latency drops from 440 ms to 120 ms, a 73% improvement for stateless pipelines.
  12. Project Lightspeed - Python as a first-class citizen, with broad API coverage (a windowing sketch follows after the slide list): input & output (csv(), json(), parquet(), orc(), schema(), text(), foreach(), foreachBatch()), aggregation & grouping (agg(), count(), min(), max(), mean(), groupby(), orderby()), filtering (select(), selectExpr(), distinct(), where()), transformations (map(), mapValues(), flatMap(), flatMapValues()), query management (awaitTermination(), exception(), explain(), status, stop()), joins, etc. (crossJoin(), crosstab(), join(), union(), unionAll()), DDL operations (createGlobalTempView(), createTempView(), drop(), drop_duplicates(), registerTempTable()), windowing (window(), session_window()), and arbitrary stateful processing (mapGroupWithState(), flatMapGroupWithState()).
  13. Project Lightspeed - Improved debuggability: visualize the pipeline as a data flow; provide a timeline view of metrics for operators; group operator metrics by executor; incorporate source- and sink-specific metrics.
  14. …and many more.
  15. Interested in collaboration? SPARK-39585 - Multiple Stateful Operators in Structured Streaming; SPARK-39586 - Advanced Windowing in Structured Streaming; SPARK-39587 - Schema Evolution for Stateful Pipelines; SPARK-39589 - Asynchronous I/O Support; SPARK-39590 - Python API for Arbitrary Stateful Processing; SPARK-39591 - Offset Management Improvements; SPARK-39592 - Asynchronous State Checkpointing; SPARK-39593 - Configurable State Checkpointing Frequency.
  16. Karthik Ramasamy, Head of Streaming. Thank you.
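
The windowing APIs listed on slide 12 already exist in PySpark; the following is a minimal sketch of tumbling-window and session-window aggregations on a stream. The rate source, column names, and durations are placeholders chosen for this sketch, not details from the deck.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, session_window, window

spark = SparkSession.builder.appName("windowing-sketch").getOrCreate()

# Synthetic stream so the sketch is self-contained: "users" emitting clicks.
clicks = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .select(col("timestamp").alias("event_time"),
                  expr("concat('user-', value % 10)").alias("user_id")))

# Tumbling 5-minute windows, with a watermark to bound the state that is kept.
per_window = (clicks
              .withWatermark("event_time", "10 minutes")
              .groupBy(window("event_time", "5 minutes"), "user_id")
              .count())

# Session windows that close after 5 minutes of inactivity per user.
per_session = (clicks
               .withWatermark("event_time", "10 minutes")
               .groupBy(session_window("event_time", "5 minutes"), "user_id")
               .count())

query = per_window.writeStream.outputMode("update").format("console").start()
```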

Editor's Notes

  • <TRANSITION TO KARTHIK>
    Over the last 6-9 months we have invested heavily in building a strong streaming team that is going to take Structured Streaming and elevate it to the next level.
    Presenting this talk is Karthik, who led the company behind Apache Pulsar and, before that, built a very popular streaming engine that many of you may have used…
    Today we are very excited to introduce Karthik to share our vision for growing Structured Streaming to the next level.
  • We have seen an explosion of streaming applications across all industries.
    In fact, data streaming is part of your everyday life and is reshaping and transforming every industry you can imagine:
    in finance… in retail… in healthcare… in manufacturing…


  • KARTHIK…
    Thank you, Ali.
    We are very data-driven at Databricks and we have been looking at the metrics; of all the numbers we have seen, this is the most surprising statistic I have come across at Databricks.
    And we have not even done much here: we developed Structured Streaming many years ago, not much investment went into it since, and still the growth is 160% on top of a large base. This is a significant portion of our revenue.
    Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer community. The majority of streaming workloads we saw were customers migrating their batch workloads to take advantage of the lower latency, fault tolerance, and support for incremental processing that streaming has to offer. The result is that we have seen tremendous adoption from streaming customers for both open source Spark and Databricks. The graph shows the weekly number of streaming jobs on Databricks over the past three years, which has grown from thousands to 3+ million and is still accelerating.

    Per Matei: update this not to use the graph, but to say that a double-digit percentage of our workflows is streaming, give a number here, and note that we see that increasing over time. X many trillions of records per day.


  • …and many of our customers, from enterprises to startups, have adopted and are continuing to adopt streaming in the lakehouse…

  • Why do I believe Spark Structured Streaming is growing? Several properties of Structured Streaming have made it popular; here are the top five.

    Unification - The foremost advantage of Structured Streaming is that it uses the same API as batch processing, making the transition from batch to real-time processing much simpler.
    Fault Tolerance & Recovery - Structured Streaming checkpoints state automatically at every stage of processing. When a failure occurs, it automatically recovers from the previous state. Failure recovery is very fast since it is restricted to the failed tasks, as opposed to restarting the entire streaming pipeline as in other systems. Structured Streaming can also run on spot instances, making streaming cost-effective.
    Performance - Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations in the Spark SQL engine.
    Flexible Operations - The ability to apply arbitrary logic and operations on the output of a streaming query using foreachBatch. This enables developers to perform operations like upserts, writes to multiple sinks, and interaction with external data sources (see the sketch after this list). Over 40% of our users on Databricks take advantage of this feature.
    Stateful Processing - Support for stateful aggregations and joins, along with watermarks for bounded state and handling late, out-of-order data. In addition, arbitrary stateful operations with [flat]mapGroupsWithState, backed by a RocksDB state store, are provided for efficient and fault-tolerant state management (as of Spark 3.2).
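
As a concrete illustration of the unified API, watermarking, and foreachBatch described above, here is a minimal PySpark sketch. The Kafka broker, topic, schema fields, and output paths are placeholders for this sketch, not details from the deck.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType())
          .add("event_time", TimestampType()))

# The same transformation works on a batch DataFrame (spark.read) or a
# streaming DataFrame (spark.readStream) -- only the source/sink calls differ.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "sensor-events")               # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Stateful windowed aggregation; the watermark bounds how much state is kept
# by dropping data that arrives more than 10 minutes late.
per_device = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(window("event_time", "5 minutes"), "device_id")
              .avg("temperature"))

# foreachBatch allows arbitrary per-micro-batch logic: upserts, writes to
# multiple sinks, calls to external systems, and so on.
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").format("parquet").save("/tmp/device_averages")  # placeholder path

query = (per_device.writeStream
         .outputMode("update")
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/checkpoints/device_averages")  # enables automatic recovery
         .start())
```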
  • As Structured Streaming grew in leaps and bounds, developers started using it for emerging new applications such as…
    Monitoring expensive drill bits continuously and stopping them from hitting rock surfaces
    Continuously monitoring data from elevators for emergencies and quickly alerting dispatch
    Stitching together the requests and responses from the logs of microservices that serve a web request, for tracing and troubleshooting
    These exposed some of the shortcomings of Structured Streaming, such as…
    I think if we can address all of these, we will be able to increase adoption and see skyrocketing growth.
    So,

  • What are we doing about it?
  • I am very excited to announce that we are launching Project Lightspeed to take Structured Streaming into the next generation.


  • Project Lightspeed advances Structured Streaming across four pillars…
    In the next few slides, I will give a glimpse of some of the Lightspeed features.

  • Structured Streaming performs several bookkeeping operations per micro-batch, such as persisting the planned offset ranges before processing and marking the batch done afterwards. Today these writes go to durable storage sequentially, which increases latency.
    With the default trigger, Lightspeed eliminates the "mark batch done" write and overlaps micro-batch execution with asynchronously persisting the offset ranges (a configuration sketch follows below).
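
A minimal sketch of turning this on, assuming the asyncProgressTrackingEnabled writer option that later shipped with this work (Spark 3.4+); the option name, its availability, and the Kafka broker/topic/paths are assumptions for this sketch, not something stated on the slides.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("async-offsets-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")                # placeholder broker
         .option("topic", "out")                                           # placeholder topic
         .option("checkpointLocation", "/tmp/checkpoints/async-offsets")   # placeholder path
         # Persist offset ranges asynchronously instead of blocking each micro-batch.
         .option("asyncProgressTrackingEnabled", "true")
         .start())
```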
  • Structured Streaming pipelines can be programmed in multiple languages: Java, Scala, Python, and SQL. Python is a popular choice and already provides many APIs, but there is a gap: arbitrary stateful processing, which is needed for things like an exponentially weighted average (see the sketch below). The key challenge with this API is executing arbitrary Python code in a JVM-based system.
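
The gap described here was later addressed by applyInPandasWithState (SPARK-39590, shipped in Spark 3.4). The sketch below is a hedged illustration of an exponentially weighted average per device using that API; the rate source, device naming, smoothing factor, and exact API details are assumptions based on the later release, not content from the deck.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("ewma-sketch").getOrCreate()

# Synthetic source so the sketch is self-contained: a few "devices" emitting readings.
readings = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
            .select(expr("concat('device-', value % 3)").alias("device_id"),
                    (col("value") % 100).cast("double").alias("reading")))

ALPHA = 0.3  # EWMA smoothing factor (placeholder)

def update_ewma(key, pdf_iter, state: GroupState):
    # One running average per device, kept in the state store across micro-batches.
    (avg,) = state.get if state.exists else (None,)
    for pdf in pdf_iter:
        for reading in pdf["reading"]:
            avg = float(reading) if avg is None else ALPHA * float(reading) + (1 - ALPHA) * avg
    state.update((avg,))
    yield pd.DataFrame({"device_id": [key[0]], "ewma": [avg]})

ewma = readings.groupBy("device_id").applyInPandasWithState(
    update_ewma,
    outputStructType="device_id string, ewma double",
    stateStructType="ewma double",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = ewma.writeStream.format("console").outputMode("update").start()
```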
  • Streaming pipelines are brittle. There can be several reasons: a surge in the data to be processed, inadequately provisioned resources, or a bug in user code. Structured Streaming provides tons of metrics and logs at the micro-batch level (a small sketch of pulling them follows below).
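
For reference, the per-micro-batch metrics mentioned here are already exposed on the query handle in PySpark; a small sketch, assuming query is any started streaming query such as the ones sketched above:

```python
import json
import time

# Assuming `query` is a started pyspark.sql.streaming.StreamingQuery
# (for example, one of the queries started in the sketches above).
time.sleep(10)  # give a few micro-batches time to complete

print(query.status)  # whether data is available and whether a trigger is active

if query.lastProgress is not None:
    # Per-micro-batch metrics: input rows/sec, batch durations, state store sizes, ...
    print(json.dumps(query.lastProgress, indent=2))

# recentProgress keeps the latest micro-batch reports -- the raw material
# for the timeline views of operator metrics described above.
for progress in query.recentProgress:
    print(progress["batchId"], progress["durationMs"])
```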
