
Structured Streaming in Spark


Apache Spark has been gaining steam rapidly, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily because it keeps data in memory within its own processing framework. One of the top real-world industry use cases for Apache Spark is processing streaming data.

Published in: Education

  1. Structured Streaming in Spark ● Vikram Agrawal, Qubole
  2. About Me ● Studied Computer Science and Engineering at IIT Delhi ● Co-founded a web conferencing solution company before joining Qubole ● In the last 5 years at Qubole, I have worn multiple hats and worked across stacks to provide big-data solutions in the cloud ● Currently leading the Streaming Team at Qubole
  3. Who should watch this? ● Big Data Engineer (DevOps, Architect, Software Engineer, Admin) ● Data Platform Manager ● Big Data Enthusiast (Consultant, Executive, Data User, Analyst)
  4. How is streaming used in production? ● Identifying sessions based on user behavior from real-time activity streams ● Anomaly and fraud detection: running ML predictions on incoming data and keeping the model updated continuously as new data arrives ● Time-based window aggregations: using window functions to do associative aggregations and compute real-time stats
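
The time-based window aggregation use case can be illustrated with a minimal plain-Scala sketch. It is not Spark code: `Event`, `TumblingWindows` and all method names are hypothetical and chosen only for illustration, and timestamps are assumed to be non-negative milliseconds. It shows why associativity matters: partial counts computed on separate chunks of the stream can be merged by addition and give the same answer as one pass over all events.

```scala
// Hypothetical event type for illustration; not part of any Spark API.
final case class Event(userId: String, timestampMs: Long)

object TumblingWindows {
  /** Start of the fixed-size (tumbling) window containing `timestampMs`.
    * Assumes non-negative timestamps. */
  def windowStart(timestampMs: Long, windowMs: Long): Long =
    timestampMs - (timestampMs % windowMs)

  /** Count events per (window start, user) over one chunk of events. */
  def countByWindow(events: Seq[Event], windowMs: Long): Map[(Long, String), Long] =
    events
      .groupBy(e => (windowStart(e.timestampMs, windowMs), e.userId))
      .map { case (key, group) => key -> group.size.toLong }

  /** Merge two partial results by adding the counts of matching keys.
    * This works because counting is an associative aggregation. */
  def merge(
      a: Map[(Long, String), Long],
      b: Map[(Long, String), Long]
  ): Map[(Long, String), Long] =
    b.foldLeft(a) { case (acc, (key, n)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }
}
```

For example, counting two chunks separately with `countByWindow` and combining them with `merge` yields the same map as counting the concatenated events once.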
  5. Data Processing Architecture
  6. Data Processing Architecture
  7. Streaming Paradigm ● Stream In, Stream Out ○ Low Latency - How low? ○ Complexity of Analytics ○ Volume - How high? ● Stream In, Batch Out ○ No Tight Latency Constraint ○ Higher Ingestion Rate ○ Aggregation / Data or Schema Transformation / Data Enrichment ○ Downstream ETL Operation
  8. Why use Spark Streaming? ● No ultra-low latency requirement ○ Processing time of a few seconds is acceptable ● Scalable and mature processing engine ● Higher-level API abstraction ○ Ease of code reuse from batch jobs ○ Simple and modular ● Vibrant community ○ Active development of new features
  9. Spark’s Functionality
  10. Structured Streaming - under the hood ● Abstraction of Repeated Queries ○ A data stream is treated as an unbounded table ○ A streaming query is a batch-like operation on this table
  11. Structured Streaming - under the hood ● Query Planning & Execution ○ In batch execution, the planner creates a code- and memory-optimized execution plan ○ For a streaming query, the planner converts the streaming logical plan into a series of incremental execution plans, each of which processes the next chunk of data ● [Diagram labels: DataFrame, Logical Plan, Planner, Execution Plan; Planner, Incremental Execution 1, Incremental Execution 2, Incremental Execution 3]
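
The idea of an unbounded table processed by a series of incremental executions can be sketched in plain Scala. This is only a conceptual model under simplifying assumptions, not how Spark's planner is implemented; `IncrementalCount` and its methods are hypothetical names. Each incremental step folds only the newly arrived rows into the previous state, and after every step the result matches what a batch query over the whole table so far would return.

```scala
object IncrementalCount {
  type Counts = Map[String, Long]

  /** Batch-style query over the whole table: count rows per value. */
  def batchCount(table: Seq[String]): Counts =
    table.groupBy(identity).map { case (value, rows) => value -> rows.size.toLong }

  /** One incremental execution: fold only the new chunk into the previous state. */
  def incrementalStep(state: Counts, newRows: Seq[String]): Counts =
    newRows.foldLeft(state) { (acc, value) =>
      acc.updated(value, acc.getOrElse(value, 0L) + 1L)
    }

  /** Run one incremental execution per chunk and return the result after each. */
  def run(chunks: Seq[Seq[String]]): Seq[Counts] =
    chunks.scanLeft(Map.empty[String, Long])(incrementalStep).tail
}
```

The state carried between steps is what lets each execution avoid rescanning earlier data; in this sketch it is simply the running count map.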
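  12. Programming Paradigm ● Start with a Spark Session ● Specify the data source, schema and other options (create the input DataFrame) ● Write your incremental query to generate output ● Specify the data sink and other options to export your data ● Example (brokers and topics are assumed to be defined elsewhere):
      import org.apache.spark.sql.SparkSession
      val spark = SparkSession.builder.appName("Kafka Streaming Example").getOrCreate()
      import spark.implicits._  // encoders needed for .as[(String, String)]
      // Input: subscribe to the Kafka topics and read key/value as strings
      val ds = spark.readStream.format("kafka").option("kafka.bootstrap.servers", brokers).option("subscribe", topics).load().selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
      // Incremental query: running count per distinct value
      val counts = ds.groupBy("value").count()
      // Output: keep the full result in an in-memory table named "aggregates"
      counts.writeStream.queryName("aggregates").format("memory").outputMode("complete").start()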
  13. Productionizing Streaming Application ● Monitoring ○ Throughput ○ Latency ○ Time Lag ● Fault Tolerance ○ Checkpointing ○ Exactly Once or At Least Once
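
The link between checkpointing and the two delivery guarantees can be illustrated with a small plain-Scala simulation. It is a conceptual sketch, not Spark's implementation: `DeliverySemantics`, `Checkpoint`, `AppendSink` and `IdempotentSink` are hypothetical names, and the crash is simulated as happening after a batch is written to the sink but before the checkpoint advances. On restart the batch is replayed from the checkpointed offset; an append-only sink then sees duplicates (at least once), while a sink that overwrites by batch id ends up with each record exactly once. In Spark itself, the checkpoint directory is configured with the `checkpointLocation` option on `writeStream`.

```scala
import scala.annotation.tailrec
import scala.collection.mutable

object DeliverySemantics {
  /** Replayable source: records can be re-read from any offset. */
  val source: Vector[String] = Vector("r0", "r1", "r2", "r3")

  /** Progress that survives a restart: next batch id and next source offset. */
  final case class Checkpoint(nextBatchId: Long, offset: Int)

  trait Sink {
    def write(batchId: Long, batch: Seq[String]): Unit
    def rows: Seq[String]
  }

  /** Appends blindly, so a replayed batch is stored twice. */
  final class AppendSink extends Sink {
    private val buffer = mutable.ArrayBuffer.empty[String]
    def write(batchId: Long, batch: Seq[String]): Unit = buffer ++= batch
    def rows: Seq[String] = buffer.toList
  }

  /** Overwrites by batch id, so replaying a batch has no extra effect. */
  final class IdempotentSink extends Sink {
    private val byBatch = mutable.LinkedHashMap.empty[Long, Seq[String]]
    def write(batchId: Long, batch: Seq[String]): Unit = byBatch(batchId) = batch
    def rows: Seq[String] = byBatch.values.flatten.toList
  }

  /** Process batches starting at `cp`. If `crashAtBatch` matches the current
    * batch, stop after the sink write without advancing the checkpoint. */
  @tailrec
  def runFrom(cp: Checkpoint, sink: Sink, batchSize: Int, crashAtBatch: Option[Long]): Checkpoint =
    if (cp.offset >= source.length) cp
    else {
      val batch = source.slice(cp.offset, cp.offset + batchSize)
      sink.write(cp.nextBatchId, batch)
      if (crashAtBatch.contains(cp.nextBatchId)) cp // simulated crash before commit
      else runFrom(Checkpoint(cp.nextBatchId + 1, cp.offset + batch.length), sink, batchSize, crashAtBatch)
    }
}
```

Running `runFrom` once with a crash at batch 1 and then again from the returned checkpoint replays batch 1; comparing the `rows` of the two sink types afterwards shows the difference between the guarantees.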
  14. Q&A