Spark Streaming & Kafka: The Future of Stream Processing, by Hari Shreedharan of Cloudera

8/21/2015
Jack Gudenkauf
VP Big Data
scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x=>print(x)) }.iterator).collect; println()
https://twitter.com/_JG

PLAYTIKA
• Founded in 2010
• Social Casino global category leader
• 10 games
• 13 platforms
• 1000+ employees

© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Spark + Kafka:
Future of Streaming Processing

Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

From Volume and Variety to Velocity
Big Data has evolved; the Hadoop ecosystem evolves as well…
Past:
• Big-Data = Volume + Variety
• Batch Processing
• Time to insight of hours
Present:
• Big-Data = Volume + Variety + Velocity
• Batch + Stream Processing
• Time to insight of seconds

Key Components of Streaming Architectures
• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• System Management
• Security
• Data Management & Integration

Canonical Stream Processing Architecture
[Diagram components: Data Sources; Data Ingest (Kafka, Flume); Kafka; App 1, App 2, ...; HDFS; HBase]

Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• 2-5× less code
• Fast to Run
• General execution graphs
• In-memory storage
• Up to 10× faster on disk, 100× in memory

Spark Architecture
[Diagram: one Driver and three Workers; each Worker holds Data in RAM]

RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• An operation on one RDD creates a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage
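
The RDD properties above can be seen in a few lines of Scala. This is a minimal sketch, assuming Spark core is on the classpath and a local master; the object name `RddBasics`, the sample log lines, and the choice of `MEMORY_AND_DISK` are illustrative assumptions, not taken from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object RddBasics {
  // Returns the number of lines containing "ERROR".
  def errorCount(sc: SparkContext): Long = {
    // Assumed sample data; a real job would read from stable storage such as HDFS.
    val lines: RDD[String] =
      sc.parallelize(Seq("INFO start", "ERROR disk full", "ERROR timeout"))

    // filter is a transformation: it returns a new RDD and leaves `lines` untouched.
    // Nothing is computed yet (lazy materialization).
    val errors: RDD[String] = lines.filter(_.contains("ERROR"))

    // Keep computed partitions in memory, spilling to disk if they do not fit.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    // The lineage Spark would replay to rebuild a lost partition.
    println(errors.toDebugString)

    // count is an action: it triggers the actual computation.
    errors.count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(errorCount(sc)) finally sc.stop()
  }
}
```

With the three sample lines, `errorCount` returns 2.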

Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput

Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming

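// Requires the spark-streaming-twitter artifact
import org.apache.spark.streaming.twitter.TwitterUtils
val tweets = TwitterUtils.createStream(ssc, None)  // None: credentials come from twitter4j system properties
val hashTags = tweets.flatMap(status => getTags(status))  // getTags: user-supplied helper
hashTags.saveAsTextFiles("hdfs://...")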
“Micro-batch” Architecture: stream composed of small (1-10s) batch computations
[Diagram: the tweets DStream is split into batches @ t, t+1, t+2; flatMap turns each batch into a batch of the hashTags DStream, which is then saved]

Use DStreams for Windowing Functions
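
A windowed count over a DStream might look like the sketch below. It assumes a `DStream[String]` of hashtags like the one in the earlier slide; the object and method names, the 30-second window, and the 10-second slide are illustrative assumptions. Both durations must be multiples of the batch interval of the `StreamingContext`.

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
// On Spark releases before 1.3, also import
// org.apache.spark.streaming.StreamingContext._ to get the pair-DStream operations.

object WindowedCounts {
  // Counts each hashtag over the last 30 seconds, recomputed every 10 seconds.
  def tagCountsByWindow(hashTags: DStream[String]): DStream[(String, Int)] =
    hashTags
      .map(tag => (tag, 1))
      // Explicit parameter types are needed because reduceByKeyAndWindow is overloaded.
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
}
```

Each RDD produced by the returned DStream covers the three most recent 10-second batches.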

Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, ...
• Easy to write “Receivers” for custom messaging systems.

Sharing Code between Batch and Streaming
Library that filters “ERRORS”:
import org.apache.spark.rdd.RDD

// Keeps only the lines that contain "ERROR"
def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well

Sharing Code between Batch and Streaming
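Spark:
val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
// Flume events carry a byte payload; decode the body into a String first
val lines = dStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
// transform (unlike foreachRDD, which returns Unit) yields a new DStream
val filtered = lines.transform((rdd: RDD[String], time: Time) => {
  filterErrors(rdd)
})
filtered.saveAsTextFiles(…)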

Reliability
• Received data is automatically persisted to a Write Ahead Log (WAL) on HDFS to prevent data loss
• Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark configuration
• When the YARN ApplicationMaster (AM) dies, the application is restarted by YARN
• Received, ack-ed and unprocessed data is replayed from the WAL (data that made it into blocks)
• Reliable Receivers can replay data from the original source, if required
• Un-acked data replayed from source.
• Kafka, Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = No data loss on driver or receiver failure!
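
Enabling the WAL in application code could look like the following sketch. Only the configuration property comes from the slide; the object name, the checkpoint path, the 5-second batch interval, and the use of `StreamingContext.getOrCreate` for restart recovery are assumptions added for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalSetup {
  // Hypothetical location; any fault-tolerant file system path would do.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("wal-example")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL is stored under the checkpoint directory, so one must be set.
    ssc.checkpoint(checkpointDir)
    // Define input DStreams and at least one output operation here;
    // start() fails if no output operation is registered.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // After a driver restart, rebuild the context from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```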

Reliable Kafka DStream
• Stores received data to Write Ahead Log on HDFS for replay – no data loss!
• Stable and supported!
• Uses a reliable receiver to pull data from Kafka
• Application-controlled parallelism
• Create as many receivers as you want to parallelize
• Remember: each receiver is a task and holds one executor hostage; no processing happens on that executor.
• Tricky to do this efficiently, and so is controlling ordering (everything needs to be done explicitly)
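
Application-controlled parallelism with the receiver-based stream is usually written as several receivers whose streams are unioned. The sketch below assumes the older `spark-streaming-kafka` (Kafka 0.8) artifact that matches the era of these slides; the object and method names and all parameters are placeholders supplied by the caller.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelReceivers {
  // Creates numReceivers receiver-based Kafka streams and merges them into one.
  def kafkaLines(ssc: StreamingContext,
                 zkQuorum: String,
                 groupId: String,
                 topic: String,
                 numReceivers: Int): DStream[String] = {
    val streams = (1 to numReceivers).map { _ =>
      // Each call starts one long-running receiver task on some executor.
      // With the WAL enabled, an unreplicated storage level is sufficient.
      KafkaUtils.createStream(
        ssc, zkQuorum, groupId, Map(topic -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    // Merge the per-receiver streams and keep only the message value.
    ssc.union(streams).map(_._2)
  }
}
```

Every receiver created this way occupies a core for the lifetime of the job, which is the cost the slide warns about.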

Reliable Kafka DStream - Issues
• Kafka can replay messages if processing failed for some reason
• So the WAL is overkill: it causes an unnecessary performance hit
• In addition, the Reliable Stream causes a lot of network traffic due to unneeded HDFS writes etc.
• Receivers hold executors hostage, which could otherwise be used for processing
• How can we solve these issues?

Direct Kafka DStream
• No long-running receiver = no executor hogging!
• Communicates with Kafka via the “low-level API”
• 1 Spark partition ↔ 1 Kafka partition
• At the end of every batch interval, the next batch is defined as:
• Everything from the first message after the last batch up to the current latest message in each partition
• If a max rate is configured, then at most rate × batch interval messages are downloaded & processed
• Checkpoint contains the starting and ending offset in the current RDD
• Recovering from checkpoint is simple: last offset + 1 is the least offset of the next batch
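
The offset arithmetic in the bullets above can be modelled in plain Scala. This is a sketch of the idea only, not Spark's implementation: `DirectBatchPlanner`, its case class, and its parameters are hypothetical names, and ranges use an exclusive end offset, so the next batch starts at `untilOffset` rather than at "last offset + 1".

```scala
object DirectBatchPlanner {
  // Offsets for one Kafka partition in one batch: [fromOffset, untilOffset).
  final case class OffsetRange(partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  // nextOffsets: per partition, the first offset not yet processed.
  // latest:      per partition, one past the newest message currently in Kafka.
  // maxPerBatch: optional cap per partition, i.e. max rate x batch interval.
  def plan(nextOffsets: Map[Int, Long],
           latest: Map[Int, Long],
           maxPerBatch: Option[Long]): Seq[OffsetRange] =
    nextOffsets.toSeq.sortBy(_._1).map { case (partition, from) =>
      val newest = latest.getOrElse(partition, from)
      val capped = maxPerBatch.fold(newest)(cap => math.min(newest, from + cap))
      // Never plan a negative range, even if `latest` is stale.
      OffsetRange(partition, from, math.max(from, capped))
    }

  // What a checkpoint needs to remember so the following batch resumes correctly.
  def advance(ranges: Seq[OffsetRange]): Map[Int, Long] =
    ranges.map(r => r.partition -> r.untilOffset).toMap
}
```

For example, `plan(Map(0 -> 100L, 1 -> 40L), Map(0 -> 250L, 1 -> 45L), Some(100L))` yields the range [100, 200) for partition 0, which hits the cap, and [40, 45) for partition 1, which takes everything available.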

Direct Kafka DStream
• (Almost) Exactly once processing
• At the end of each interval, the RDD can provide information about the starting
and ending offset
• These offsets can be persisted, so even on failure – recover from there
• Edge cases are possible and can cause duplicates
• Failure in the middle of HDFS writes -> duplicates!
• Failure after processing but before offsets getting persisted -> duplicates!
• More likely!
• Writes to Kafka also can cause duplicates, so do reads from Kafka
• Fix: Your app should really be resilient to duplicates
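
One common way to be resilient to duplicates is to make writes idempotent by keying each record on its Kafka coordinates. The plain-Scala sketch below uses an in-memory map as a stand-in for an external store with keyed upserts; `IdempotentSink`, `MessageId`, and `Store` are hypothetical names, and this is one possible approach rather than something the slides prescribe.

```scala
import scala.collection.mutable

object IdempotentSink {
  // A message is uniquely identified by where it sits in Kafka.
  final case class MessageId(topic: String, partition: Int, offset: Long)

  // In-memory stand-in for an external store that supports keyed upserts.
  final class Store {
    private val rows = mutable.Map.empty[MessageId, String]
    def upsert(id: MessageId, value: String): Unit = rows.update(id, value)
    def size: Int = rows.size
    def get(id: MessageId): Option[String] = rows.get(id)
  }

  // Writing the same batch twice leaves the store unchanged the second time.
  def writeBatch(store: Store, batch: Seq[(MessageId, String)]): Unit =
    batch.foreach { case (id, value) => store.upsert(id, value) }
}
```

If a failure causes a batch to be replayed, the replayed records overwrite themselves instead of creating extra rows.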

Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.

What is coming?
• Better Monitoring and alerting
• Batch-level and task-level monitoring
• SQL on Streaming
• Run SQL-like queries on top of Streaming (medium – long term)
• Python!
• Limited support already available, but more detailed support coming
• ML
• More real-time ML algorithms

Current Spark project status
• 400+ contributors and 50+ companies contributing
• Includes: Databricks, Cloudera, Intel, Huawei, Yahoo! etc
• Dozens of production deployments
• Spark Streaming Survived Netflix Chaos Monkey – production ready!
• Included in CDH!

More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark

Thank you
hshreedharan@cloudera.com
@harisr1234
