Spark Streaming & Kafka-The Future of Stream Processing

|8/21/20
15
Jack Gudenkauf
VP Big Data
scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x=>print(x)) }.iterator).collect; println()
https://twitter.com/_JG

2
PLAYTIKA
 Founded in 2010
 Social Casino global category leader
 10 games
 13 platforms
 1000+ employees

3© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Spark + Kafka:
Future of Streaming Processing

Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness it data in real-time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

From Volume and Variety to Velocity
Present
Batch + Stream Processing
Time to Insight of Seconds
Big-Data = Volume + Variety
Big-Data = Volume + Variety + Velocity
Past
Present
Hadoop Ecosystem evolves as well…
Past
Big Data has evolved
Batch Processing
Time to insight of Hours

Key Components of Streaming Architectures
Data Ingestion
& Transportation
Service
Real-Time Stream
Processing Engine
Kafka Flume
System Management
Security
Data Management & Integration
Real-Time
Data Serving

Canonical Stream Processing Architecture
Kafka
Data Ingest
App 1
App 2
.
.
.
Kafka Flume
HDFS
HBase
Data
Sources

Spark: Easy and Fast Big Data
•Easy to Develop
•Rich APIs in Java, Scala,
Python
•Interactive shell
•Fast to Run
•General execution graphs
•In-memory storage
2-5× less code
Up to 10× faster on disk,
100× in memory

Spark Architecture
Driver
Worker
Worker
Worker
Data
RAM
Data
RAM
Data
RAM

RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• Operations on one RDD creates a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage

Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput

Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2tweets DStream
hashTags DStream
Stream composed of
small (1-10s) batch
computations
“Micro-batch” Architecture

Use DStreams for Windowing Functions

Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, Zero MQ….
• Easy to write “Receivers” for custom messaging systems.

Sharing Code between Batch and Streaming
def filterErrors (rdd: RDD[String]): RDD[String] = {
rdd.filter(s => s.contains(“ERROR”))
}
Library that filters “ERRORS”
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as
well

Sharing Code between Batch and Streaming
val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)
Spark:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.foreachRDD((rdd: RDD[String], time: Time) => {
filterErrors(rdd)
}))
filtered.saveAsTextFiles(…)
Spark Streaming:

Reliability
• Received data automatically persisted to HDFS Write Ahead Log to prevent data
loss
• set spark.streaming.receiver.writeAheadLog.enable=true in spark conf
• When AM dies, the application is restarted by YARN
• Received, ack-ed and unprocessed data replayed from WAL (data that made it
into blocks)
• Reliable Receivers can replay data from the original source, if required
• Un-acked data replayed from source.
• Kafka, Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = No data loss on driver or receiver failure!

Reliable Kafka DStream
• Stores received data to Write Ahead Log on HDFS for replay – no data loss!
• Stable and supported!
• Uses a reliable receiver to pull data from Kafka
• Application-controlled parallelism
• Create as many receivers as you want to parallelize
• Remember – each receiver is a task and holds one executor hostage, no
processing happens on that executor.
• Tricky to do this efficiently, so is controlling ordering (everything needs to be
done explicitly

Reliable Kafka Dstream - Issues
• Kafka can replay messages if processing failed for some reason
• So WAL is overkill – causes unnecessary performance hit
• In addition, the Reliable Stream causes a lot of network traffic due
to unneeded HDFS writes etc.
• Receivers hold executors hostage – which could otherwise be
used for processing
• How can we solve these issues?

Direct Kafka DStream
• No long-running receiver = no executor hogging!
• Communicates with Kafka via the “low-level API”
• 1 Spark partition Kafka partition
• At the end of every batch:
• The first message after the last batch to the current latest message in partition
• If max rate is configured, then rate x batch interval is downloaded & processed
• Checkpoint contains the starting and ending offset in the current RDD
• Recovering from checkpoint is simple – last offset + 1 is least offset of next
batch

Direct Kafka DStream
• (Almost) Exactly once processing
• At the end of each interval, the RDD can provide information about the starting
and ending offset
• These offsets can be persisted, so even on failure – recover from there
• Edge cases are possible and can cause duplicates
• Failure in the middle of HDFS writes -> duplicates!
• Failure after processing but before offsets getting persisted -> duplicates!
• More likely!
• Writes to Kafka also can cause duplicates, so do reads from Kafka
• Fix: You app should really be resilient to duplicates

Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.

What is coming?
• Better Monitoring and alerting
• Batch-level and task-level monitoring
• SQL on Streaming
• Run SQL-like queries on top of Streaming (medium – long term)
• Python!
• Limited support already available, but more detailed support coming
• ML
• More real-time ML algorithms

Current Spark project status
• 400+ contributors and 50+ companies contributing
• Includes: Databricks, Cloudera, Intel, Huawei, Yahoo! etc
• Dozens of production deployments
• Spark Streaming Survived Netflix Chaos Monkey – production ready!
• Included in CDH!

More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-
docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark

Thank you
hshreedharan@cloudera.com
@harisr1234

Spark Streaming & Kafka-The Future of Stream Processing

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Spark Streaming & Kafka-The Future of Stream Processing

Similar to Spark Streaming & Kafka-The Future of Stream Processing (20)

Recently uploaded

Recently uploaded (20)

Spark Streaming & Kafka-The Future of Stream Processing