Structured Streaming with Kafka
A deeper look into the integration of Kafka and Spark
https://github.com/Shasidhar/kafka-streaming
Agenda
● Data collection vs Data ingestion
● Why are they key?
● Streaming data sources
● Kafka overview
● Integration of Kafka and Spark
● Checkpointing
● Kafka as Sink
● Delivery semantics
● What next?
Data collection and Data ingestion
Data Collection
● Happens where data is created
● Varies for different types of workloads: batch vs streaming
● Different modes of data collection: pull vs push
Data Ingestion
● Receives and stores data
● Coupled with input sources
● Helps in routing data
Data collection vs Data ingestion
[Diagram: data sources feed an input data store, which feeds a data processing engine and then an analytical engine; the three stages are labelled Data Collection, Data Ingestion, and Data Processing]
Why is data collection/ingestion key?
[Diagram: the same pipeline as above (data sources, input data store, data processing engine, analytical engine), showing that collection and ingestion feed every downstream stage]
Data collection tools
● rsyslog
○ One of the oldest data collectors
○ Streaming mode
○ Installed by default and widely known
● Flume
○ Distributed data collection service
○ Solution for data collection of all formats
○ Initially designed to transfer log data into HDFS frequently and reliably
○ Written and maintained by Cloudera
○ Still popular for data collection in the Hadoop ecosystem
Data collection tools cont..
● Logstash
○ Pluggable architecture
○ Popular choice in the ELK stack
○ Written in JRuby
○ Multiple inputs / multiple outputs
○ Centralizes logs: collect, parse, and store/forward
● Fluentd
○ Plugin architecture
○ Built-in HA architecture
○ Lightweight multi-source, multi-destination log routing
○ Offered as a service inside Google Cloud
Data Ingestion tools
● RabbitMQ
○ Written in Erlang
○ Implements AMQP (Advanced Message Queuing Protocol)
○ Has a pluggable architecture and provides an HTTP extension
○ Provides strong delivery guarantees for messages
Kafka Overview
● High-throughput publish-subscribe messaging system
● Distributed, partitioned, and replicated commit log
● Messages are persisted in the system as topics
● Uses ZooKeeper for cluster management
● Written in Scala, but client APIs exist for many languages: Java, Ruby, Python, etc.
● Developed at LinkedIn, now backed by Confluent
High Level Architecture
Terminology
● Brokers: the servers that make up the Kafka cluster
● Producers: processes that publish messages to a topic
● Consumers: processes that subscribe to a topic and read its messages
● Consumer Group: a set of consumers sharing a common group id to consume a topic's data
● Topics: where messages are maintained and partitioned
○ Partition: an ordered, immutable sequence of messages, i.e. a commit log
○ Offset: a sequence id given to each message to track its position within a topic partition
Anatomy of Kafka Topic
Spark vs Kafka compatibility

Kafka Version    Spark Streaming    Spark Structured Streaming    Spark Kafka Sink
Below 0.10       Yes                No                            No
0.10 and above   Yes                Yes                           Yes

● Consumer semantics changed in Kafka 0.10
● A timestamp was added to the message format
● Reduced client dependency on ZooKeeper (offsets are stored in a Kafka topic)
● Transport encryption (SSL/TLS) and ACLs were introduced
Kafka with Spark Structured Streaming
● Kafka is becoming the de facto streaming source
● Direct integration supported from Spark 2.1.0 (see the sketch below), configured with:
○ Broker
○ Topic
○ Partitions
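A minimal sketch of the direct integration, assuming a broker at localhost:9092 and a hypothetical topic named "events" (broker, topic, and partitions are all passed as source options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("KafkaSourceExample")
  .getOrCreate()

// Subscribe to one topic; the connector also accepts "subscribePattern"
// and "assign" (explicit partitions)
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns plus metadata
// (topic, partition, offset, timestamp, timestampType)
val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")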
Kafka Wordcount
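The original word-count slide is an image; a rough equivalent sketch, again assuming the "events" topic from above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

// Read Kafka message values as lines of text
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Split each line into words and keep a running count per word
val wordCounts = lines
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()

// Print the full updated result table after every trigger
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()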
Kafka ingestion time Wordcount
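For the ingestion-time variant, the Kafka source already exposes a timestamp column (broker-side ingestion time on Kafka 0.10+), which can drive a windowed count; a sketch under the same assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KafkaIngestionTimeWordCount").getOrCreate()

// Keep the message value and the Kafka-provided timestamp column
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS line", "timestamp")

// Count words per 10-second ingestion-time window
val windowedCounts = records
  .select(explode(split(col("line"), " ")).as("word"), col("timestamp"))
  .groupBy(window(col("timestamp"), "10 seconds"), col("word"))
  .count()

windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()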
Starting offsets in Streaming Query
● Ways to start reading Kafka data with respect to offsets:
○ Earliest: start from the beginning of the topic, excluding data that has already been deleted
○ Latest: process only new data that arrives after the query has started
○ Assign: specify the precise offset to start from for every partition
Kafka read from offset
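For example (a sketch reusing the spark session from above; topic name and offsets are made up, and in the JSON form -2 means earliest and -1 means latest):

// Start from the earliest available offsets of a subscribed topic
val fromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

// Assign specific partitions and start each one from a precise offset
val fromOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("assign", """{"events":[0,1]}""")
  .option("startingOffsets", """{"events":{"0":23,"1":-2}}""")
  .load()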
Checkpointing and write-ahead logs
● We still have both of these in Structured Streaming
● Used to track the progress of a query and to persist intermediate state to the filesystem
● For Kafka, the offset ranges and the data processed in each trigger are tracked
● The checkpoint location has to be an HDFS-compatible path and should be specified as an option on the DataStreamWriter
○ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
● You can modify the application code and simply restart the query; it will resume from the offsets where it stopped earlier
Kafka Checkpointing and recovering
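A sketch of enabling recovery, reusing the wordCounts query from the earlier word-count sketch and an assumed HDFS path:

// The checkpoint directory stores the Kafka offsets and state for each
// trigger; restarting the query with the same path resumes from there
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/kafka-wordcount")
  .start()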
Kafka Sink
● Kafka sink introduced in Spark 2.2.0 (topic and broker options)
● Currently at-least-once semantics are supported
● To get effectively exactly-once results, you can include a unique key in the output data
● While reading that data back, run deduplication logic so each record is processed exactly once, e.g.:
val streamingDf = spark.readStream. ... // columns: guid, eventTime, ...
// Without watermark using guid column
streamingDf.dropDuplicates("guid")
// With watermark using guid and eventTime columns
streamingDf
.withWatermark("eventTime", "10 seconds")
.dropDuplicates("guid", "eventTime")
Kafka Sink example
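A sketch of writing a stream back to Kafka, assuming the kafkaDf from the source sketch and a hypothetical output topic "events-copy"; the sink expects string or binary key/value columns:

kafkaDf
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events-copy")                        // default topic for all rows
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()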
Kafka Sink update mode example
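In update mode only the rows that changed in a trigger are written out; a sketch that publishes the running word counts (from the earlier word-count sketch) to an assumed "wordcounts" topic:

wordCounts
  .selectExpr("word AS key", "CAST(count AS STRING) AS value")
  .writeStream
  .outputMode("update")                                  // emit only updated counts
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "wordcounts")
  .option("checkpointLocation", "/tmp/kafka-update-checkpoint")
  .start()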
Kafka Source
Delivery semantics
● Types of delivery semantics:
○ At-least once
■ Results are delivered at least once; there may be duplicates in the end
○ At-most once
■ Results are delivered at most once; some results may be missed
○ Exactly once
■ Each record is processed once and the corresponding results are produced exactly once
Spark delivery semantics
● Depends on the type of source/sink
● Streaming sinks are designed to be idempotent to handle reprocessing
● Together, replayable sources and idempotent sinks let Structured Streaming ensure end-to-end exactly-once semantics under any failure
● Currently Spark supports exactly-once semantics for the file output sink

Input source (replayable source) -> Spark -> Output store (idempotent sink)
Structured Streaming write semantics
File Sink Example
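A sketch of the file sink, which only supports append mode and, combined with checkpointed offsets, gives exactly-once output; it reuses the kafkaDf from earlier, and the paths are made up:

kafkaDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/tmp/kafka-output")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint")
  .outputMode("append")                                  // the only mode the file sink supports
  .start()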
What Kafka has in v0.11
● Idempotent producer
○ Exactly-once semantics on the input side
○ https://issues.apache.org/jira/browse/KAFKA-4815
● Transactional producer
○ Atomic writes across multiple partitions
● Exactly-once stream processing
○ Transactional read-process-write-commit operations
○ https://issues.apache.org/jira/browse/KAFKA-4923
What Kafka has in v0.8
● At-least-once guarantees
[Diagram: the producer sends a (K, V) message to the Kafka broker; the broker appends the data to the topic and sends back an ack]
What Kafka has in v0.11
● Exactly-once guarantees
[Diagram: the producer sends the message as (K, V, Seq, Pid); the broker appends the data to the topic and acks, using the producer id and sequence number to de-duplicate retries]
Idempotent producer: enable.idempotence = true
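Enabling this on a plain Kafka producer (0.11+ client) is a one-line config; a sketch with made-up broker and topic names:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("enable.idempotence", "true")   // broker de-duplicates retried sends via (Pid, Seq)

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("events", "key-1", "value-1"))
producer.close()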
Atomic multi-partition writes
Transactional producer: transactional.id = "unique-id"
Atomic multi-partition writes
Transactional consumer: isolation.level = "read_committed"
Exactly once stream processing
● Based on the transactional read-process-write-commit pattern (sketched below)
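A sketch of that pattern with the plain Kafka 0.11+ clients (topic names, group id, and transactional id are made up; the collection converters assume Scala 2.11/2.12):

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition

// Consumer that only sees committed transactional data
val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "localhost:9092")
consumerProps.put("group.id", "rpw-group")
consumerProps.put("enable.auto.commit", "false")
consumerProps.put("isolation.level", "read_committed")
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](consumerProps)
consumer.subscribe(Collections.singletonList("input-topic"))

// Transactional (and therefore idempotent) producer
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "localhost:9092")
producerProps.put("transactional.id", "rpw-unique-id")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProps)
producer.initTransactions()

while (true) {
  val records = consumer.poll(500L)
  if (!records.isEmpty) {
    producer.beginTransaction()
    // "process" step: here we simply uppercase each value
    records.asScala.foreach { r =>
      producer.send(new ProducerRecord("output-topic", r.key, r.value.toUpperCase))
    }
    // Commit the consumed offsets inside the same transaction so the
    // read-process-write cycle either happens entirely or not at all
    val offsets = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
    records.partitions.asScala.foreach { tp =>
      val last = records.records(tp).asScala.last.offset
      offsets.put(tp, new OffsetAndMetadata(last + 1))
    }
    producer.sendOffsetsToTransaction(offsets, "rpw-group")
    producer.commitTransaction()
  }
}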
What’s coming in Future
● Spark essentially will support the new semantics from Kafka
● JIRA to follow
○ SPARK - https://issues.apache.org/jira/browse/SPARK-18057
○ Blocking JIRA from KAFKA - https://issues.apache.org/jira/browse/KAFKA-4879
● Kafka to make idempotent producer behaviour as default in latest versions
○ https://issues.apache.org/jira/browse/KAFKA-5795
● Structured Streaming continuous processing mode
https://issues.apache.org/jira/browse/SPARK-20928
References
● https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
● https://databricks.com/session/introducing-exactly-once-semantics-in-apache-kafka
● https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
● http://shashidhare.com/spark,/kafka/2017/03/23/spark-structured-streaming-with-kafka-advanced.html
● http://shashidhare.com/spark,/kafka/2017/01/14/spark-structured-streaming-with-kafka-basic.html
● Shashidhar E S
● Lead Solution Engineer at Databricks
● www.shashidhare.com
