Spark Streaming Tips for Devs and Ops

Spark Streaming Tips
for
Devs & Ops

WHO ARE WE?
Fede Fernández
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@fede_fdz
Fran Pérez
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@FPerezP

Overview
Spark Streaming
Spark + Kafka
groupByKey vs reduceByKey
Table Joins
Serializer
Tunning

Spark Streaming
Real-time data processing
Continuous Data Flow
RDD
RDD
RDD
DStream
Output Data

Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once

Spark + Kafka
● Receiver-based Approach

groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce
operation.

groupByKey VS reduceByKey
sc.textFile(“hdfs://….”)
.flatMap(_.split(“ “))
.map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
sc.textFile(“hdfs://….”)
.flatMap(_.split(“ “))
.map((_, 1)).reduceByKey(_ + _)

groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)

groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle

groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(c, 1)
(c, 1)
(c, 1)
(c, 1)
shuffle shuffle

groupByKey
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 1)
(j, 5)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 1)
(s, 6)
(c, 1)
(c, 1)
(c, 1)
(c, 1)
(c, 4)
shuffle shuffle

reduceByKey
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)
(s, 1)
(s, 1)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 1)

reduceByKey
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)

reduceByKey
(j, 2)
(j, 1)
(j, 1)
(j, 1)
(s, 2)
(s, 2)
(s, 1)
(s, 1)
(c, 1)
(c, 2)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle

reduceByKey
(j, 2)
(j, 1)
(j, 1)
(j, 1)
(j, 5)
(s, 2)
(s, 2)
(s, 1)
(s, 1)
(s, 6)
(c, 1)
(c, 2)
(c, 1)
(c, 4)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
(j, 2)
(c, 1)
(s, 2)
(j, 1)
(s, 1)
(s, 1)
(j, 1)
(s, 2)
(s, 1)
(c, 1)
(c, 1)
(j, 1)
(j, 1)
(c, 2)
(s, 1)
(c, 1)
(s, 1)
(j, 1)
(j, 1)
(c, 1)
(s, 1)
shuffle shuffle

reduce VS group
● Improve performance
● Can’t always be used
● Out of Memory Exceptions
● aggregateByKey, foldByKey, combineByKey

Table Joins
● Typical operations that can be improved
● Need a previous analysis
● There are no silver bullets

Table Joins: Medium - Large
FILTER
No Shuffle

Table Joins: Small - Large
...
Shuffled Hash Join
sqlContext.sql("explain <select>").collect.mkString(“n”)
[== Physical Plan ==]
[Project]
[+- SortMergeJoin]
[ :- Sort]
[ : +- TungstenExchange hashpartitioning]
[ : +- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Sort]
[ +- TungstenExchange hashpartitioning]
[ +- ConvertToUnsafe]
[ +- Scan ExistingRDD]

Table Joins: Small - Large
Broadcast Hash Join
sqlContext.sql("explain <select>").collect.mkString(“n”)
[== Physical Plan ==]
[Project]
[+- BroadcastHashJoin]
[ :- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Scan ParquetRelation]
No shuffle!
By default from Spark 1.4 when using DataFrame API
Prior Spark 1.4
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
Broadcast

Serializers
● Java’s ObjectOutputStream framework. (Default)
● Custom serializers: extends Serializable & Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Special care to JodaTime.

Tuning
Garbage Collector
blockInterval
Partitioning
Storage

Tuning: Garbage Collector
• Applications which rely heavily on memory consumption.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores per partition.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.

blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000

Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: time size to split data before storing into
Spark.

Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000/ 40,000=
30

Tuning: Storage
• Default (MEMORY_ONLY)
• MEMORY_ONLY_SER with Serialization Library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated _2
• OFF_HEAP (Tachyon/Alluxio)

Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel

QUESTIONS
Fede Fernández
@fede_fdz
fede.f@47deg.com
Fran Pérez
@FPerezP
fran.p@47deg.com
Thanks!

Spark Streaming Tips for Devs and Ops

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Spark Streaming Tips for Devs and Ops

Similar to Spark Streaming Tips for Devs and Ops (20)

Recently uploaded

Recently uploaded (20)

Spark Streaming Tips for Devs and Ops