Spark Streaming with
Kafka
IRAJ HEDAYATI
Kafka as source of Spark streaming
Receive data from Kafka
There are two different approaches to receiving data from Kafka:
◦ Receiver-based (introduced in Spark 1.2): use the Kafka high-level consumer API (a Receiver) to consume messages and pass them to the executors.
◦ Direct (introduced in Spark 1.3): Spark itself fetches the offsets and reads the corresponding ranges of messages, without a receiver.
In the receiver-based mode there is a possibility of losing data under failures, so the Write-Ahead Log (WAL) must be enabled. The WAL keeps a copy of the data received from Kafka on HDFS, so that after a failure the stream can be recovered from the log.
To use this approach, call the KafkaUtils.createStream() function.
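A minimal sketch of the receiver-based approach (not part of the original slides): it assumes the older spark-streaming-kafka-0-8 module, a ZooKeeper quorum at localhost:2181 and a topic named "topicA", all of which are illustrative choices only.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReceiverBasedExample")
  // Enable the Write-Ahead Log so received data can be recovered after a failure.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))

// Consume "topicA" with one receiver thread; messages arrive as (key, value) string pairs.
val receiverStream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",          // ZooKeeper quorum (assumed address)
  "example-consumer-group",  // consumer group id (assumed name)
  Map("topicA" -> 1)         // topic -> number of receiver threads
)

receiverStream.map(_._2).print()
ssc.start()
ssc.awaitTermination()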
Direct approach
Simplified parallelism: there is one RDD partition per Kafka partition (a 1:1 mapping).
Efficiency: there is no need to keep write-ahead logs (which duplicate messages). Data can always be recovered from Kafka because Spark tracks the offsets.
Exactly-once semantics: the receiver-based API uses ZooKeeper to keep track of consumed messages, and an inconsistency between Spark and ZooKeeper can cause records to be processed twice. With the direct approach, Spark itself tracks the offsets, so each message is processed exactly once.
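To make the offset bookkeeping concrete, here is a hedged sketch (not from the original slides) using the 0-10 integration: it reads the offset ranges covered by each micro-batch and commits them back to Kafka only after processing, assuming a direct stream named stream created as shown later in this deck.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges covered by this micro-batch, one per Kafka partition.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Commit the offsets back to Kafka only once the work has succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}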
Integration with Kafka version
Spark provides two integration modules: spark-streaming-kafka-0-8, which supports both the receiver-based and the direct stream, and spark-streaming-kafka-0-10, which supports only the direct stream and is the module used in the rest of this deck.
Implementation
IMPLEMENT SPARK STREAMING WITH KAFKA
Library dependency
Group ID = org.apache.spark
Artifact ID = spark-streaming-kafka-0-10_2.11
Version = 2.2.0
SBT
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"
Maven
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
In the artifact coordinates, 0-10 refers to the Kafka client API version, 2.11 to the Scala version, and 2.2.0 to the Spark version.
Import
In addition to the import statements needed to create a DStream, a few more are required to stream Kafka
topics.
Consume messages and deserialize them:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization._
Create a DStream from Kafka topics:
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
Kafka consumer properties
The Kafka properties are a map from configuration keys (strings) to their values. For the available
configuration keys, check the Kafka documentation for your Kafka version.
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
Subscribe to Kafka topics
To subscribe, create a direct stream from the streaming context, a location strategy, and a consumer strategy built from the list of topics and the consumer properties:
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
The two type parameters of createDirectStream are the key class and the value class (both String here). The final map converts each record of type ConsumerRecord into a (key, value) pair for further processing.
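Putting the pieces together, a minimal end-to-end sketch (the broker address, group id and topic name are assumptions for illustration) looks roughly like this:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "word-count-example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("topicA"), kafkaParams))

    // Count words in the message values of each micro-batch and print the counts.
    stream.map(_.value)
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}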
Location strategies
The new Kafka consumer API pre-fetches messages into buffers, so for performance it is important to cache consumers on the executors rather than recreating them for every batch. The location strategy controls where the cached consumers, and hence the partitions, are scheduled:
PreferConsistent: distributes partitions evenly across the available executors (use it by default).
PreferBrokers: schedules each partition on the Kafka leader for that partition (use it only if the executors run on the same hosts as the brokers).
PreferFixed: specifies an explicit mapping of partitions to hosts (use it if there is significant skew in load across partitions).
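As a hedged illustration of PreferFixed (the host names and topic are assumptions), heavily loaded partitions can be pinned to specific executor hosts:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Pin the two heaviest partitions of "topicA" to dedicated hosts;
// partitions not listed here fall back to a consistent placement.
val hostMap = Map(
  new TopicPartition("topicA", 0) -> "executor-host-1",
  new TopicPartition("topicA", 1) -> "executor-host-2"
)

val locationStrategy = LocationStrategies.PreferFixed(hostMap)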
Consumer strategies
The Kafka consumer API supports different strategies for specifying which topics to read:
Subscribe: subscribe to a fixed collection of topics.
SubscribePattern: use a regex to specify the topics of interest.
Assign: consume a fixed collection of partitions.
It is also possible to extend the abstract ConsumerStrategy class for special cases.
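A brief sketch of the other two built-in strategies (the regex pattern and partition list are assumptions for illustration, and kafkaParams is the map defined earlier):

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe to every topic whose name starts with "events-".
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("events-.*"), kafkaParams)

// Consume only partitions 0 and 1 of "topicA".
val byAssignment = ConsumerStrategies.Assign[String, String](
  Seq(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1)),
  kafkaParams)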
References
“Spark Fundamentals I”, BD0211EN, IBM Analytics.
Tathagata Das, Matei Zaharia and Patrick Wendell, “Diving into Apache Spark Streaming’s Execution Model”, https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html