Structured Streaming with Kafka
A deeper look into the integration of Kafka and Spark
https://github.com/Shasidhar/kafka-streaming
Agenda
● Data collection vs Data ingestion
● Why are they key?
● Streaming data sources
● Kafka overview
● Integration of Kafka and Spark
● Checkpointing
● Kafka as Sink
● Delivery semantics
● What next?
Data collection and Data ingestion
Data Collection
● Happens where data is created
● Varies for different types of workloads: batch vs streaming
● Different modes of data collection: pull vs push
Data Ingestion
● Receives and stores data
● Coupled with input sources
● Helps in routing data
Data collection vs Data ingestion
[Diagram: data sources feed an input data store, which feeds a data processing engine and then an analytical engine; the three stages are labelled Data Collection, Data Ingestion, and Data Processing]
Why is data collection/ingestion key?
[Diagram: the same pipeline as above (data sources, input data store, data processing engine, analytical engine), showing that collection and ingestion feed every downstream stage]
Data collection tools
● rsyslog
○ One of the oldest data collectors
○ Streaming mode
○ Installed by default and widely known
● Flume
○ Distributed data collection service
○ Solution for data collection of all formats
○ Initially designed to transfer log data into HDFS frequently and reliably
○ Written and maintained by Cloudera
○ Still popular for data collection in the Hadoop ecosystem
Data collection tools cont..
● Logstash
○ Pluggable architecture
○ Popular choice in the ELK stack
○ Written in JRuby
○ Multiple inputs / multiple outputs
○ Centralizes logs: collect, parse, and store/forward
● Fluentd
○ Plugin architecture
○ Built-in HA architecture
○ Lightweight multi-source, multi-destination log routing
○ Offered as a service inside Google Cloud
Data Ingestion tools
● RabbitMQ
○ Written in Erlang
○ Implements AMQP (Advanced Message Queuing Protocol)
○ Has a pluggable architecture and provides an HTTP extension
○ Provides strong delivery guarantees for messages
Kafka Overview
● High-throughput publish-subscribe messaging system
● Distributed, partitioned, and replicated commit log
● Messages are persisted in the system as topics
● Uses ZooKeeper for cluster management
● Written in Scala, but client APIs exist for many languages: Java, Ruby, Python, etc.
● Developed at LinkedIn, now backed by Confluent
High Level Architecture
Terminology
● Brokers: the servers that make up the Kafka cluster
● Producers: processes that publish messages to a topic
● Consumers: processes that subscribe to a topic and read its messages
● Consumer Group: a set of consumers sharing a common group id to consume a topic's data
● Topics: where messages are maintained and partitioned
○ Partition: an ordered, immutable sequence of messages, i.e. a commit log
○ Offset: a sequence id given to each message to track its position within a topic partition
Anatomy of Kafka Topic
Spark vs Kafka compatibility

Kafka Version    Spark Streaming    Spark Structured Streaming    Spark Kafka Sink
Below 0.10       Yes                No                            No
0.10 and above   Yes                Yes                           Yes

● Consumer semantics changed in Kafka 0.10
● A timestamp was added to the message format
● Reduced client dependency on ZooKeeper (offsets are stored in a Kafka topic)
● Transport encryption (SSL/TLS) and ACLs were introduced
Kafka with Spark Structured Streaming
● Kafka is becoming the de facto streaming source
● Direct integration supported from Spark 2.1.0 (see the sketch below), configured with:
○ Broker
○ Topic
○ Partitions
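A minimal sketch of the direct integration, assuming a broker at localhost:9092 and a hypothetical topic named "events" (broker, topic, and partitions are all passed as source options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("KafkaSourceExample")
  .getOrCreate()

// Subscribe to one topic; the connector also accepts "subscribePattern"
// and "assign" (explicit partitions)
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns plus metadata
// (topic, partition, offset, timestamp, timestampType)
val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")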
Kafka Wordcount
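The original word-count slide is an image; a rough equivalent sketch, again assuming the "events" topic from above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

// Read Kafka message values as lines of text
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Split each line into words and keep a running count per word
val wordCounts = lines
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()

// Print the full updated result table after every trigger
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()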
Kafka ingestion time Wordcount
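For the ingestion-time variant, the Kafka source already exposes a timestamp column (broker-side ingestion time on Kafka 0.10+), which can drive a windowed count; a sketch under the same assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KafkaIngestionTimeWordCount").getOrCreate()

// Keep the message value and the Kafka-provided timestamp column
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS line", "timestamp")

// Count words per 10-second ingestion-time window
val windowedCounts = records
  .select(explode(split(col("line"), " ")).as("word"), col("timestamp"))
  .groupBy(window(col("timestamp"), "10 seconds"), col("word"))
  .count()

windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()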
Starting offsets in Streaming Query
● Ways to start reading Kafka data with respect to offsets:
○ Earliest: start from the beginning of the topic, excluding data that has already been deleted
○ Latest: process only new data that arrives after the query has started
○ Assign: specify the precise offset to start from for every partition
Kafka read from offset
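For example (a sketch reusing the spark session from above; topic name and offsets are made up, and in the JSON form -2 means earliest and -1 means latest):

// Start from the earliest available offsets of a subscribed topic
val fromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

// Assign specific partitions and start each one from a precise offset
val fromOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("assign", """{"events":[0,1]}""")
  .option("startingOffsets", """{"events":{"0":23,"1":-2}}""")
  .load()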
Checkpointing and write-ahead logs
● We still have both of these in Structured Streaming
● Used to track the progress of a query and to persist intermediate state to the filesystem
● For Kafka, the offset ranges and the data processed in each trigger are tracked
● The checkpoint location has to be an HDFS-compatible path and should be specified as an option on the DataStreamWriter
○ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
● You can modify the application code and simply restart the query; it will resume from the offsets where it stopped earlier
Kafka Checkpointing and recovering
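A sketch of enabling recovery, reusing the wordCounts query from the earlier word-count sketch and an assumed HDFS path:

// The checkpoint directory stores the Kafka offsets and state for each
// trigger; restarting the query with the same path resumes from there
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/kafka-wordcount")
  .start()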
Kafka Sink
● Kafka sink introduced in Spark 2.2.0 (topic and broker options)
● Currently at-least-once semantics are supported
● To get effectively exactly-once results, you can include a unique key in the output data
● While reading that data back, run deduplication logic so each record is processed exactly once, e.g.:
val streamingDf = spark.readStream. ... // columns: guid, eventTime, ...
// Without watermark using guid column
streamingDf.dropDuplicates("guid")
// With watermark using guid and eventTime columns
streamingDf
.withWatermark("eventTime", "10 seconds")
.dropDuplicates("guid", "eventTime")
Kafka Sink example
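A sketch of writing a stream back to Kafka, assuming the kafkaDf from the source sketch and a hypothetical output topic "events-copy"; the sink expects string or binary key/value columns:

kafkaDf
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events-copy")                        // default topic for all rows
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()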
Kafka Sink update mode example
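In update mode only the rows that changed in a trigger are written out; a sketch that publishes the running word counts (from the earlier word-count sketch) to an assumed "wordcounts" topic:

wordCounts
  .selectExpr("word AS key", "CAST(count AS STRING) AS value")
  .writeStream
  .outputMode("update")                                  // emit only updated counts
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "wordcounts")
  .option("checkpointLocation", "/tmp/kafka-update-checkpoint")
  .start()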
Kafka Source
Delivery semantics
● Types of delivery semantics:
○ At-least once
■ Results are delivered at least once; there may be duplicates in the end
○ At-most once
■ Results are delivered at most once; some results may be missed
○ Exactly once
■ Each record is processed once and the corresponding results are produced exactly once
Spark delivery semantics
● Depends on the type of source/sink
● Streaming sinks are designed to be idempotent to handle reprocessing
● Together, replayable sources and idempotent sinks let Structured Streaming ensure end-to-end exactly-once semantics under any failure
● Currently Spark supports exactly-once semantics for the file output sink

Input source (replayable source) -> Spark -> Output store (idempotent sink)
Structured Streaming write semantics
File Sink Example
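A sketch of the file sink, which only supports append mode and, combined with checkpointed offsets, gives exactly-once output; it reuses the kafkaDf from earlier, and the paths are made up:

kafkaDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/tmp/kafka-output")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint")
  .outputMode("append")                                  // the only mode the file sink supports
  .start()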
What Kafka has in v0.11
● Idempotent producer
○ Exactly-once semantics on the input side
○ https://issues.apache.org/jira/browse/KAFKA-4815
● Transactional producer
○ Atomic writes across multiple partitions
● Exactly-once stream processing
○ Transactional read-process-write-commit operations
○ https://issues.apache.org/jira/browse/KAFKA-4923
What Kafka has in v0.8
● At-least-once guarantees
[Diagram: the producer sends a (K, V) message to the Kafka broker; the broker appends the data to the topic and sends back an ack]
What Kafka has in v0.11
● Exactly-once guarantees
[Diagram: the producer sends the message as (K, V, Seq, Pid); the broker appends the data to the topic and acks, using the producer id and sequence number to de-duplicate retries]
Idempotent producer: enable.idempotence = true
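Enabling this on a plain Kafka producer (0.11+ client) is a one-line config; a sketch with made-up broker and topic names:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("enable.idempotence", "true")   // broker de-duplicates retried sends via (Pid, Seq)

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("events", "key-1", "value-1"))
producer.close()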
Atomic multi-partition writes
Transactional producer: transactional.id = "unique-id"
Atomic multi-partition writes
Transactional consumer: isolation.level = "read_committed"
Exactly once stream processing
● Based on the transactional read-process-write-commit pattern (sketched below)
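A sketch of that pattern with the plain Kafka 0.11+ clients (topic names, group id, and transactional id are made up; the collection converters assume Scala 2.11/2.12):

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition

// Consumer that only sees committed transactional data
val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "localhost:9092")
consumerProps.put("group.id", "rpw-group")
consumerProps.put("enable.auto.commit", "false")
consumerProps.put("isolation.level", "read_committed")
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](consumerProps)
consumer.subscribe(Collections.singletonList("input-topic"))

// Transactional (and therefore idempotent) producer
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "localhost:9092")
producerProps.put("transactional.id", "rpw-unique-id")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProps)
producer.initTransactions()

while (true) {
  val records = consumer.poll(500L)
  if (!records.isEmpty) {
    producer.beginTransaction()
    // "process" step: here we simply uppercase each value
    records.asScala.foreach { r =>
      producer.send(new ProducerRecord("output-topic", r.key, r.value.toUpperCase))
    }
    // Commit the consumed offsets inside the same transaction so the
    // read-process-write cycle either happens entirely or not at all
    val offsets = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
    records.partitions.asScala.foreach { tp =>
      val last = records.records(tp).asScala.last.offset
      offsets.put(tp, new OffsetAndMetadata(last + 1))
    }
    producer.sendOffsetsToTransaction(offsets, "rpw-group")
    producer.commitTransaction()
  }
}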
What’s coming in Future
● Spark essentially will support the new semantics from Kafka
● JIRA to follow
○ SPARK - https://issues.apache.org/jira/browse/SPARK-18057
○ Blocking JIRA from KAFKA - https://issues.apache.org/jira/browse/KAFKA-4879
● Kafka to make idempotent producer behaviour as default in latest versions
○ https://issues.apache.org/jira/browse/KAFKA-5795
● Structured Streaming continuous processing mode
https://issues.apache.org/jira/browse/SPARK-20928
References
● https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
● https://databricks.com/session/introducing-exactly-once-semantics-in-apache-kafka
● https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
● http://shashidhare.com/spark,/kafka/2017/03/23/spark-structured-streaming-with-kafka-advanced.html
● http://shashidhare.com/spark,/kafka/2017/01/14/spark-structured-streaming-with-kafka-basic.html
● Shashidhar E S
● Lead Solution Engineer at Databricks
● www.shashidhare.com
