More Related Content

Slideshows for you(20)

Similar to Spark (Structured) Streaming vs. Kafka Streams - two stream processing platforms compared(20)


Spark (Structured) Streaming vs. Kafka Streams - two stream processing platforms compared

  1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Spark (Structured) Streaming vs. Kafka Streams Two stream processing platforms compared Guido Schmutz 23.10.2018 @gschmutz
  2. Guido Schmutz Working at Trivadis for more than 21 years Oracle Groundbreaker Ambassador & Oracle ACE Director Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: Blog: Slideshare: Twitter: gschmutz
  3. Agenda 1. Introducing Stream Processing 2. Spark Streaming vs. Kafka Streams – Overview 3. Spark Structured Streaming vs. Kafka Streams – in Action 4. Summary
  4. Introducing Stream Processing
  5. “Data at Rest” vs. “Data in Motion” Data at Rest Data in Motion Store Act Analyze StoreAct Analyze 11101 01010 10110 11101 01010 10110 Architekturen von Big Data Anwendungen
  6. When to use Stream Processing / When not? Constant low Milliseconds & under Low milliseconds to seconds, delay in case of failures 10s of seconds of more, Re-run in case of failures Real-Time Near-Real-Time Batch Source: adapted from Cloudera
  7. Hadoop Clusterd Hadoop Cluster Big Data Reference Architecture for Data Analytics Solutions Service BI Tools Enterprise Data Warehouse Search / Explore File Import / SQL Import Event Hub Data Flow Data Flow Change DataCapture Parallel Processing Storage Storage RawRefined SQL Export Microservice State { } API Event Stream Event Stream Search Service Microservices Enterprise Apps Logic { } API Edge Node Rules Event Hub Storage Bulk Source Event Source Location DB Extract File IoT Data Mobile Apps Social Event Stream Telemetry Stream Processor State { } API Stream Analytics Results DB
  8. Two Types of Stream Processing (from Gartner) Stream Data Integration • Primarily cover streaming ETL • Integration of data source and data sinks • Filter and transform data • (Enrich data) • Route data Stream Analytics • calculating aggregates & detecting patterns to generate higher-level, more relevant summary information (complex events => used to be CEP) • Complex events may signify threats or opportunities that require a response
  9. Stream Processing & Analytics Ecosystem Stream Analytics Event Hub Open Source Closed Source Stream Data Integration Source: adapted from Tibco Edge Introduction to Stream Processing
  10. Stream Processing & Analytics Ecosystem Stream Analytics Event Hub Open Source Closed Source Stream Data Integration Source: adapted from Tibco Edge Introduction to Stream Processing
  11. Example Use Case Truck-2 Truck-1 Truck-3 truck_ position detect_danger ous_driving Truck Driver jdbc-source join_dangerous_driv ing_driver dangerous_dri ving_driver Count By Event Type Window (1m, 30s) count_by_event _type
  12. Spark Streaming vs. Kafka Streams - Overview
  13. Spark (Structured) Streaming Spark Streaming • 1st generation • one of the first APIs to enable stream processing using high-level functional operators like map and reduce • Like RDD API the DStreams API is based on relatively low-level operations on Java/Python objects • Used by many organizations in production Spark Structured Streaming • 2nd generation • Structured API through DataFrames / Datasets rather than RDDs • Easier code reuse between batch and streaming • marked production ready in Spark 2.2.0 • Support for Java, Scala, Python, R and SQL • Focus of this talk
  14. Apache Spark Streaming as part of Spark Stack Spark (Structured) Streaming Resilient Distributed Dataset (RDD) Spark Standalone MESOS / Kubernetes YARN HDFS Elastic Search NoSQL S3 Libraries Low Level API Cluster Resource Managers Data Stores Advanced Analytics Libraries & Ecosystem Data Frame Structured API Datasets SQL Distributed Variables
  15. Kafka Streams – part of Kafka Core • Designed as a simple and lightweight library in Apache Kafka • no external dependencies on systems other than Apache Kafka • Part of open source Apache Kafka, introduced in 0.10+ • Leverages Kafka as its internal messaging layer • Support for Java and SQL (KSQL)
  16. Apache Kafka – Event Hub & Streaming Platform High-Level Architecture Distributed Log at the Core Scale-Out Architecture Logs do not (necessarily) forget
  17. Spark Structured Streaming vs. Kafka Streams – in Action
  18. Infrastructure • Runs as part of a full Spark stack • Cluster can be either Spark Standalone, YARN-based or container-based • Many cloud options • Just a Java library • Runs anyware Java runs: Web Container, Java Application, Container- based …
  19. Main Abstractions Dataset/Data Frame API • DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data • Use readStream() instead of read() Transformation & Actions • Almost all transformations from Spark bounded data processing (Batch) are also usable for streaming Input Sources and Sinks Triggers • triggers define when data is output • As soon as last group is finished • Fixed interval between micro-batches • One-time micro-batch Output Mode • Define how data is output • Append – only add new records to output • Update – update changed records in place • Complete – rewrite full output
  20. Main Abstractions Topologyval schema = new StructType() .add(...) val inputDf = spark .readStream .format(...) .option(...) .load() val filteredDf = inputDf.where(...) val query = filteredDf .writeStream .format(...) .option(...) .start() I F O
  21. Main Abstractions Stream Processing Application • program that uses Kafka Streams library Application Instance • running instance of application Topology • logic that needs to be performed by stream processing • functional DSL or low-level Processor API Stream Processor • a node in the processor topology KStream • Abstraction of a stream of records • Interpreted as events KTable • Abstraction of a change log stream • Interpreted as update of same record (by key) GlobalKTable • Like KTable, but not partitioned => all data is available on all parallel application instances
  22. Main Abstractions Topologypublic static void main(String[] args) { Properties streamsConfiguration = new Properties(); streamsConfiguration.put(...); final StreamsBuilder builder = new StreamsBuilder(); KStream<..,..> stream =; KStream<..,..> filtered = stream.filter(…) KafkaStreams streams = new KafkaStreams(,streamsConfiguration); streams.start(); } I F O
  23. Streaming Data Sources • File Source • Reads files as a stream of data • Supports text, csv, json, orc parquet • Files must be atomically placed • Kafka Source • Reads from Kafka Topic • Supports Kafka broker > 0.10.x • Socket Source (for testing) • Reads UTF8 text from socket connection • Rate Source (for testing) • Generate data at specified number of rows per second val rawDf = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "broker-1:9092") .option("subscribe", "truck_position") .load()
  24. Streaming Data Sources "Kafka only" KStream from Topic KTable from Topic Use Kafka Connect s reading other data sources into Kafka first KStream<String, TruckPosition> positions ="truck_position" , Consumed.with(Serdes.String() , truckPositionSerde)); KTable<String, Driver> driver = builder.table("trucking_driver" , Consumed.with(Serdes.String() , driverSerde) ,"driver-store"));
  25. Streaming Sinks • File Sink – stores output to a directory • Kafka Sink – publishes to Kafka • Foreach Sink - Runs arbitrary computation on the records in the output • Console Sink – for debugging, prints output to console • Memory Sink – for debugging, stores output in-memory table val query = jsonTruckPlusDriverDf .selectExpr("to_json(struct(*)) AS value") .writeStream .format("kafka") .option("kafka.bootstrap.servers", "broker-1:9092") .option("topic","dangerous_driving ") .option("checkpointLocation", "/tmp") .start()
  26. Streaming Sinks "Kafka only" For testing only: Use Kafka Connect for writing out to other targets KStream<String, TruckPosition> posDriver = .."dangerous_driving" ,Produced.with(Serdes.String() , truckPositionDriverSerde)); KStream<String, TruckPosition> posDriver = .. // print to system output posDriver.print(Printed.toSysOut()) // shortcut for posDriver.foreach((key,value) -> System.out.println(key + "=" + value))
  27. Processing Model: Event-at-a-time vs. Micro Batch Introduction to Stream Processing Micro-Batch Processing • Splits incoming stream in small batches • Higher latency • Fault tolerance easier Event-at-a-time Processing • Events processed as they arrive • low-latency • fault tolerance expensive
  28. Stateless Operations – Selection & Projection Most common operations on DataFrame/Dataset are supported for streaming as well select, filter, map, flatMap, … KStream and KTable interfaces support variety of transformation operations filter, filterNot, map, mapValues, flatMap, flatMapValues, branch, selectKey, groupByKey … val filteredDf = truckPosDf.where( "eventType !='Normal'") KStream<> filtered = positions.filter((key,value) -> !value.eventType.equals("Normal") )
  29. Stateful Operations – Aggregations Held in distributed memory with option to spill to disk (fault tolerant through checkpointing to Hadoop-like FS) Output modes: Complete, Append, Update count, sum, mapGroupsWithState, flatMapGroupsWithState, reduce ... Require state store which can be in- memory, RocksDB or custom impl (fault tolerant through Kafka topics) Result of Aggregation is a KTable count, sum, avg, reduce, aggregate ... val c = source .withWatermark("timestamp" , "10 minutes") .groupBy() .count() KTable<..> c = stream .groupByKey(..) .count(...);
  30. Stateful Operations – Time Abstraction Clock Event Time Processing Time Ingestion Time 1 2 3 4 5 adapted from Matthias Niehoff (Codecentric)
  31. Stateful Operations – Time Abstraction Event Time • New with Spark Structured Streaming • Extracted from the message (payload) Ingestion Time • for sources which capture ingestion time Processing Time • “Old” Spark Streaming only supported processing time • generate the timestamp upon processing Event Time • Point in time when event occurred • Extracted from the message (payload or header) Ingestion Time • Point in time when event is stored in Kafka (sent in message header) Processing Time • Point in time when event happens to be processed by stream processing applicationdf.withColumn("processingTime" ,current_timestamp()) .option("includeTimestamp", true)
  32. Stateful Operations - Windowing streams are unbounded need some meaningful time frames to do computations (i.e. aggregations) Computations over events done using windows of data Windows are tracked per unique key Fixed Window Sliding Window Session Window Time Stream of Data Window of Data
  33. Stateful Operations - Windowing Support for Tumbling & Hopping (Sliding) Time Windows Handling Late Data with Watermarking val c = source .withWatermark("timestamp" , "10 minutes") .groupBy(window($"eventTime" , "1 minutes" , "30 seconds") , $"word") .count() Data older than watermark not expected / get discarded event time Trailing gap of 10 mins max event time watermark 12:20 12:10 12:25 Trailing gap of 10 mins processing time
  34. Stateful Operations - Windowing Support for Tumbling & Hopping Windows Support for Session Windows Handling Late Data with Data Retention (optional) KTable<..> c = stream .groupByKey(...) .windowedBy( SessionWindows .with(5 * 60 * 1000) ).count(); KTable<..> c = stream .groupByKey(..) .windowedBy( TimeWindows.of(60 * 1000) .advanceBy(30 * 1000) .until(10 * 60 * 1000) ).count(...); Data older than watermark not expected / get discarded event time Trailing gap of 10 mins max event time Data Retention 12:20 12:10 12:25 Trailing gap of 10 mins processing time
  35. Stateful Operations - Joins Introduction to Stream Processing Challenges of joining streams 1. Data streams need to be aligned as they come because they have different timestamps 2. since streams are never-ending, the joins must be limited; otherwise join will never end 3. join needs to produce results continuously as there is no end to the data Stream to Static (Table) Join Stream to Stream Join (one window join) Stream to Stream Join (two window join) Stream-to- Static Join Stream-to- Stream Join Stream-to- Stream Join Time Time Time
  36. Stateful Operations - Joins Stream-to-Static and Stream-to-Stream (since 2.3) Joins on Dataset/DataFrame Watermarking helps Spark to know for how long to retain data • Optional for Inner Joins • Mandatory for Outer Joins val jsonTruckPlusDriverDf = jsonFilteredDf.join(driverDf , Seq("driverId") , "left") Source: Spark Documentation
  37. Supports following joins • KStream-to-KStream • KTable-to-KTable • KStream-to-KTable • KStream-to-GlobalKTable • KTable-to-GlobalKTable Stateful Operations - Joins KStream<String, TruckPositionDriver> joined = filteredRekeyed.leftJoin(driver , (left,right) -> new TruckPositionDriver(left , StringUtils.defaultIfEmpty(right.first_name,"") , StringUtils.defaultIfEmpty(right.last_name,"")) , Joined.with(Serdes.String() , truckPositionSerde , driverSerde)); Source: Confluent Documentation
  38. There is more …. • Streaming Deduplication • Run-Once Trigger / fixed Interval Micro-Batching • Continuous Trigger with fixed checkpoint interval (experimental in 2.3) • Streaming Machine Learning • REPL • Queryable State • Processor API • Exactly Once Processing • Microservices with Kafka Streams • Automatic Scale-up / Scale-Down • Stand-by replica of local state • Streaming SQL
  39. There is more … Streaming SQL with KSQL • Enables stream processing with zero coding required • The simplest way to process (structured) streams of data in real- time • Powered by Kafka Streams • KSQL server with REST API • Spark SQL also offers SQL on streaming data, but not as a “first- class citizen” ksql> CREATE STREAM truck_position_s (timestamp BIGINT, truckId BIGINT, driverId BIGINT, routeId BIGINT, eventType VARCHAR, latitude DOUBLE, longitude DOUBLE) WITH (kafka_topic='truck_position', value_format='JSON'); ksql> SELECT * FROM truck_position_s; 1506922133306 | "truck/13/position0 | 2017-10- 02T07:28:53 | 31 | 13 | 371182829 | Memphis to Little Rock | Normal | 41.76 | -89.6 | - 2084263951914664106 ksql> SELECT * FROM truck_position_s WHERE eventType != 'Normal';
  40. Summary
  41. Spark Structured Streaming vs. Kafka Streams • Runs on top of a Spark cluster • Reuse your investments into Spark (knowledge and maybe code) • A HDFS like file system needs to be available • Higher latency due to micro-batching • Multi-Language support: Java, Python, Scala, R • Supports ad-hoc, notebook-style development/environment • Available as a Java library • Can be the implementation choice of a microservice • Can only work with Kafka for both input and output • low latency due to continuous processing • Currently only supports Java, Scala support available soon • KSQL abstraction provides SQL on top of Kafka Streams
  42. Comparison Kafka Streams Spark Streaming Spark Structured Streaming Language Options Java (KIP for Scala), KSQL Scala, Java, Python, R, SQL Scala, Java, Python, R, SQL Processing Model Continuous Streaming Micro-Batching Micro-Batching Core Abstraction KStream / KTable DStream (RDD) Data Frame / Dataset Programming Model Declarative/Imperative Declarative Declarative Time Support Event / Ingestion / Processing Processing Event / Ingestion/ Processing State Support Memory / RocksDB + Kafka Memory / Disk Memory / Disk Time Window Support Fixed, Sliding, Session Fixed, Sliding Fixed, Sliding Join Stream-Static, Stream-Stream Stream-Static Stream-Static, Stream-Stream (2.3) Event Pattern detection No No No Query Language Support KSQL No Spark SQL (limited) Queryable State Interactive Queries No No Scalability & Reliability Yes Yes Yes Guarantees At Least Once/Exactly Once At Least Once/Exactly Once (partial) At Least Once/Exactly Once (partial) Latency Sub-second seconds seconds Deployment Java Library Cluster (with HDFS like FS) Cluster (with HDFS like FS)
  43. Technology on its own won't help you. You need to know how to use it properly.