Spark Streaming Tips for Devs & Ops
WHO ARE WE?
Fede Fernández
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@fede_fdz
Fran Pérez
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@FPerezP
Overview
Spark Streaming
Spark + Kafka
groupByKey vs reduceByKey
Table Joins
Serializer
Tuning
Spark Streaming
Real-time data processing
Continuous Data Flow
[Diagram: input data flow → DStream (a sequence of RDDs) → Output Data]
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once
Spark + Kafka
● Receiver-based Approach
Spark + Kafka
● Direct API
groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce
operation.
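groupByKey VS reduceByKey
// Word count with groupByKey: group all pairs, then sum each group
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
// Word count with reduceByKey: sum the pairs while grouping
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).reduceByKey(_ + _)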
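To make the cost difference between the two word counts concrete, here is a minimal plain-Scala sketch that needs no Spark. It only models how many records would cross the shuffle in each style. The partition contents and the names `ShuffleSketch`, `sumByKey`, `shuffledByGroup` and `shuffledByReduce` are hypothetical and chosen for illustration.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Hypothetical input: three partitions of (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("s" -> 1, "j" -> 1, "j" -> 1, "c" -> 1, "s" -> 1),
    Seq("j" -> 1, "s" -> 1, "s" -> 1, "c" -> 1, "c" -> 1),
    Seq("j" -> 1, "c" -> 1, "s" -> 1, "j" -> 1, "s" -> 1)
  )

  // Sums the values of each key in a collection of pairs.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, values) => key -> values.map(_._2).sum }

  // groupByKey style: every pair crosses the shuffle unchanged.
  def shuffledByGroup(ps: Seq[Seq[Pair]]): Seq[Pair] = ps.flatten

  // reduceByKey style: each partition is combined locally first,
  // so at most one record per key and partition is shuffled.
  def shuffledByReduce(ps: Seq[Seq[Pair]]): Seq[Pair] =
    ps.flatMap(partition => sumByKey(partition).toSeq)
}
```

With this input, the groupByKey style shuffles all 15 pairs while the reduceByKey style shuffles 9 partial sums, and applying `sumByKey` to either shuffled collection yields the same totals.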
groupByKey
[Diagram: all 15 (key, 1) pairs are shuffled across the network unchanged, and only then summed per key: (j, 5), (s, 6), (c, 4)]
reduceByKey
[Diagram: pairs are first combined per key inside each partition, e.g. (j, 2), (c, 1), (s, 2); only these partial sums are shuffled and then merged: (j, 5), (s, 6), (c, 4)]
reduce VS group
● reduceByKey improves performance: less data is shuffled
● reduceByKey can’t always be used
● groupByKey can cause Out of Memory exceptions
● Alternatives: aggregateByKey, foldByKey, combineByKey
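When the result type differs from the value type, a plain reduce no longer fits, which is where combineByKey comes in. The sketch below is a plain-Scala model of the three functions that Spark's combineByKey takes (createCombiner, mergeValue, mergeCombiners); it is not Spark's implementation. The object name, the `timings` data and the per-key average are assumptions made for illustration.

```scala
object CombineByKeySketch {
  // Plain-Scala model of combineByKey over pre-partitioned data.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C
  ): Map[K, C] = {
    // Map side: fold each partition into one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v))
        acc.updated(k, combined)
      }
    }
    // Reduce side: merge the per-partition combiners of each key.
    perPartition.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      val merged = acc.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c)
      acc.updated(k, merged)
    }
  }

  // Hypothetical data: two partitions of (key, measurement) pairs.
  val timings: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 10, "b" -> 20, "a" -> 30),
    Seq("a" -> 20, "b" -> 40)
  )

  // Average per key: the combiner is a (sum, count) pair, not an Int.
  val averages: Map[String, Double] =
    combineByKey[String, Int, (Int, Int)](
      timings,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    ).map { case (k, (sum, count)) => k -> (sum.toDouble / count) }
}
```

As with reduceByKey, only one combiner per key and partition would need to be shuffled, so the map-side saving is kept even though the result type changes.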
Table Joins
● Typical operations that can be improved
● Require prior analysis of the data
● There are no silver bullets
Table Joins: Medium - Large
FILTER
No Shuffle
Table Joins: Small - Large
...
Shuffled Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- SortMergeJoin]
[ :- Sort]
[ : +- TungstenExchange hashpartitioning]
[ : +- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Sort]
[ +- TungstenExchange hashpartitioning]
[ +- ConvertToUnsafe]
[ +- Scan ExistingRDD]
Table Joins: Small - Large
Broadcast Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- BroadcastHashJoin]
[ :- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Scan ParquetRelation]
No shuffle!
By default from Spark 1.4 when using the DataFrame API
Prior to Spark 1.4:
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
Broadcast
Table Joins: Small - Large
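The idea behind a broadcast hash join can be sketched in plain Scala without Spark: the small table is held as an in-memory hash map that every partition of the large table probes locally, so no rows of the large table need to move. The tables, the `Order` type and the function name below are hypothetical, and the sketch models an inner join only.

```scala
object BroadcastJoinSketch {
  final case class Order(userId: Int, amount: Double)

  // Hypothetical small table: assumed to fit in memory on every executor.
  val users: Map[Int, String] = Map(1 -> "ana", 2 -> "luis")

  // Hypothetical large table, already split into partitions.
  val orderPartitions: Seq[Seq[Order]] = Seq(
    Seq(Order(1, 9.5), Order(3, 1.0)),
    Seq(Order(2, 4.0), Order(1, 2.5))
  )

  // Inner join: each partition looks up its rows in its own copy of
  // the small table; rows without a match are dropped.
  def broadcastHashJoin(
      small: Map[Int, String],
      large: Seq[Seq[Order]]
  ): Seq[Seq[(String, Double)]] =
    large.map { partition =>
      partition.flatMap(o => small.get(o.userId).map(name => (name, o.amount)))
    }
}
```

The partitioning of the large table is unchanged in the result, which is the property the "No shuffle!" note refers to.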
Serializers
● Java serialization: Java’s ObjectOutputStream framework (default).
● Custom serializers: extend Serializable or Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Take special care with JodaTime.
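Anything shipped between the driver and the executors has to be serialized, so it helps to see what a custom serializer on top of Java's ObjectOutputStream looks like. The following is a minimal sketch using only the JDK; the `Event` class and the `SerializationSketch` helpers are hypothetical names, and the round trip happens inside a single JVM for simplicity.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, Externalizable,
  ObjectInput, ObjectInputStream, ObjectOutput, ObjectOutputStream}

// Hypothetical event type with hand-written serialization.
class Event(var id: Long, var name: String) extends Externalizable {
  // Externalizable requires a public no-argument constructor.
  def this() = this(0L, "")

  // Writes only the fields we choose, in a fixed order.
  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeLong(id)
    out.writeUTF(name)
  }

  // Reads the fields back in the same order they were written.
  override def readExternal(in: ObjectInput): Unit = {
    id = in.readLong()
    name = in.readUTF()
  }
}

object SerializationSketch {
  // Serializes a value to bytes with Java's ObjectOutputStream.
  def toBytes(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Reads the value back, as the receiving side would.
  def fromBytes[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

A round trip such as `SerializationSketch.fromBytes[Event](SerializationSketch.toBytes(new Event(42L, "click")))` is a cheap way to check, before deploying, that a class survives serialization.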
Tuning
Garbage Collector
blockInterval
Partitioning
Storage
Tuning: Garbage Collector
• Applications which rely heavily on memory consumption.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores per partition.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
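The formula above can be written as a small helper, shown here as a sketch in plain Scala. The object and parameter names are hypothetical; `totalCores` stands for CAT. The result would typically be passed to Spark through the `spark.streaming.blockInterval` setting.

```scala
object BlockIntervalSketch {
  // blockInterval = (bi * consumers) / (pf * sc), with sc = CAT - consumers.
  def blockIntervalMillis(
      batchIntervalMillis: Long,
      consumers: Int,
      totalCores: Int,
      partitionFactor: Int
  ): Long = {
    // Cores left for processing once each consumer has taken one.
    val sparkCores = totalCores - consumers
    require(sparkCores > 0, "totalCores must exceed the number of consumers")
    require(partitionFactor > 0, "partitionFactor must be positive")
    (batchIntervalMillis * consumers) / (partitionFactor.toLong * sparkCores)
  }
}
```

Calling `BlockIntervalSketch.blockIntervalMillis(600000L, 20, 120, 3)` reproduces the 40,000 ms of the example.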
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: interval, in milliseconds, at which received data is
split into blocks before being stored in Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000 / 40,000 =
300
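As a quick check of the arithmetic, here is a sketch of the partition formula in plain Scala; the object and parameter names are hypothetical. With the example values it gives 300 partitions per batch, which matches partitionFactor * sparkCores = 3 * 100 from the blockInterval example.

```scala
object PartitionsSketch {
  // partitions = consumers * bi / blockInterval
  def partitionsPerBatch(
      consumers: Int,
      batchIntervalMillis: Long,
      blockIntervalMillis: Long
  ): Long = {
    require(blockIntervalMillis > 0, "blockInterval must be positive")
    // Each consumer produces one block, and so one partition, per blockInterval.
    consumers * batchIntervalMillis / blockIntervalMillis
  }
}
```

For instance, `PartitionsSketch.partitionsPerBatch(20, 600000L, 40000L)` returns 300.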
Tuning: Storage
• Default (MEMORY_ONLY)
• MEMORY_ONLY_SER with Serialization Library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated _2
• OFF_HEAP (Tachyon/Alluxio)
Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel
QUESTIONS
Fede Fernández
@fede_fdz
fede.f@47deg.com
Fran Pérez
@FPerezP
fran.p@47deg.com
Thanks!

Spark Streaming Tips for Devs and Ops

  • 2.
  • 3. WHO ARE WE? Fede Fernández Scala Software Engineer at 47 Degrees Spark Certified Developer @fede_fdz Fran Pérez Scala Software Engineer at 47 Degrees Spark Certified Developer @FPerezP
  • 4. Overview Spark Streaming Spark + Kafka groupByKey vs reduceByKey Table Joins Serializer Tunning
  • 5. Spark Streaming Real-time data processing Continuous Data Flow RDD RDD RDD DStream Output Data
  • 6. Spark + Kafka ● Receiver-based Approach ○ At least once (with Write Ahead Logs) ● Direct API ○ Exactly once
  • 7. Spark + Kafka ● Receiver-based Approach
  • 8. Spark + Kafka ● Direct API
  • 9. groupByKey VS reduceByKey ● groupByKey ○ Groups pairs of data with the same key. ● reduceByKey ○ Groups and combines pairs of data based on a reduce operation.
  • 10. groupByKey VS reduceByKey sc.textFile(“hdfs://….”) .flatMap(_.split(“ “)) .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum)) sc.textFile(“hdfs://….”) .flatMap(_.split(“ “)) .map((_, 1)).reduceByKey(_ + _)
  • 11. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 12. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 13. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 14. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (c, 1) (c, 1) (c, 1) (c, 1) shuffle shuffle
  • 15. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 5) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 6) (c, 1) (c, 1) (c, 1) (c, 1) (c, 4) shuffle shuffle
  • 16-20. reduceByKey (animated diagram, five frames) ● Input: the same 15 pairs of the form (key, 1). ● Before the shuffle, each partition combines its own pairs by key, producing partial sums such as (j, 2), (s, 2) and (c, 2). ● Shuffle: only the partially combined pairs cross the network; the diagram shows 11 of them: (j, 2) (j, 1) (j, 1) (j, 1), (s, 2) (s, 2) (s, 1) (s, 1), (c, 1) (c, 2) (c, 1). ● Result after the final reduce: (j, 5) (s, 6) (c, 4)
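The difference between the two diagrams can be sketched with plain Scala collections, without Spark. This is only a simulation: the `partitions` layout (three partitions of five pairs) and the `sumByKey` helper are assumptions made for illustration, so the number of shuffled records need not match the slides' diagram.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Assumed partition layout: the 15 input pairs split into three partitions.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("s" -> 1, "j" -> 1, "j" -> 1, "c" -> 1, "s" -> 1),
    Seq("j" -> 1, "s" -> 1, "s" -> 1, "s" -> 1, "c" -> 1),
    Seq("c" -> 1, "j" -> 1, "c" -> 1, "s" -> 1, "j" -> 1)
  )

  // Hypothetical helper: sums the values of each key.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }

  // groupByKey style: every pair is shuffled as-is, then summed.
  val groupShuffled: Seq[Pair] = partitions.flatten
  val groupResult: Map[String, Int] = sumByKey(groupShuffled)

  // reduceByKey style: each partition is combined locally before the shuffle.
  val reduceShuffled: Seq[Pair] = partitions.flatMap(p => sumByKey(p).toSeq)
  val reduceResult: Map[String, Int] = sumByKey(reduceShuffled)

  def main(args: Array[String]): Unit = {
    println(s"groupByKey:  ${groupShuffled.size} pairs shuffled, result = $groupResult")
    println(s"reduceByKey: ${reduceShuffled.size} pairs shuffled, result = $reduceResult")
  }
}
```

Both paths produce the same totals; only the number of pairs that would cross the network changes (15 versus 9 under the assumed layout).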
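  • 21. reduce VS group ● reduceByKey improves performance by combining values before the shuffle ● Can’t always be used: the reduce function must be associative and return the same type as the values ● groupByKey can cause Out of Memory Exceptions, since all values for a key must fit in memory ● Alternatives: aggregateByKey, foldByKey, combineByKey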
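When the result type differs from the value type, `aggregateByKey` still combines before the shuffle. The following is a minimal sketch of a per-key average, assuming Spark core is on the classpath and a local master; the object name, the sample data and the `averages` helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  // Per-key average: the accumulator is (sum, count), not a single Double,
  // so reduceByKey alone would not fit.
  def averages(sc: SparkContext): Map[String, Double] = {
    val scores = sc.parallelize(
      Seq("s" -> 4.0, "j" -> 1.0, "s" -> 2.0, "c" -> 3.0, "j" -> 5.0))

    scores
      .aggregateByKey((0.0, 0L))(
        (acc, value) => (acc._1 + value, acc._2 + 1L), // merge a value within a partition
        (a, b) => (a._1 + b._1, a._2 + b._2))          // merge accumulators across partitions
      .mapValues { case (sum, count) => sum / count }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("aggregate-by-key-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(averages(sc)) finally sc.stop()
  }
}
```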
  • 22. Table Joins ● Typical operations that can be improved ● Require prior analysis of the data ● There are no silver bullets
  • 24. Table Joins: Medium - Large FILTER No Shuffle
  • 25. Table Joins: Small - Large ... Shuffled Hash Join sqlContext.sql("explain <select>").collect.mkString("\n") [== Physical Plan ==] [Project] [+- SortMergeJoin] [ :- Sort] [ : +- TungstenExchange hashpartitioning] [ : +- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Sort] [ +- TungstenExchange hashpartitioning] [ +- ConvertToUnsafe] [ +- Scan ExistingRDD]
  • 26. Table Joins: Small - Large Broadcast Hash Join sqlContext.sql("explain <select>").collect.mkString("\n") [== Physical Plan ==] [Project] [+- BroadcastHashJoin] [ :- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Scan ParquetRelation] ● No shuffle! ● By default from Spark 1.4 when using the DataFrame API ● Prior to Spark 1.4: ANALYZE TABLE small_table COMPUTE STATISTICS noscan
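A broadcast join can also be requested explicitly with the `broadcast` hint from `org.apache.spark.sql.functions`. The sketch below assumes Spark SQL 2.x or later (it uses `SparkSession` rather than the `sqlContext` shown in the slides); the table contents, column names and the `joined` helper are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a (conceptually) large table with a small one, hinting that the
  // small side should be broadcast to every executor.
  def joined(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val large = Seq((1, "click"), (2, "view"), (1, "view"), (3, "click"))
      .toDF("user_id", "event")
    val small = Seq((1, "ES"), (2, "US"), (3, "UK"))
      .toDF("user_id", "country")

    large.join(broadcast(small), "user_id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val df = joined(spark)
      df.explain() // inspect the physical plan for the chosen join strategy
      df.show()
    } finally spark.stop()
  }
}
```

Without the hint, Spark decides based on table size statistics and the `spark.sql.autoBroadcastJoinThreshold` setting.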
  • 28. Serializers ● Java’s ObjectOutputStream framework (default). ● Custom serializers: extend Serializable & Externalizable. ● KryoSerializer: register your custom classes. ● Where is our code being run? ● Take special care with JodaTime.
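Switching to Kryo and registering classes is done on the `SparkConf`. A minimal sketch, assuming Spark core is on the classpath; the `Event` case class and the application name are hypothetical placeholders for your own types.

```scala
import org.apache.spark.SparkConf

// Hypothetical domain class that will travel between driver and executors.
final case class Event(id: Long, payload: String)

object KryoConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-sketch")
    // Replace the default Java serialization with Kryo.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registering classes lets Kryo write a short id instead of the full class name.
    .registerKryoClasses(Array(classOf[Event]))
}
```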
  • 30. Tuning: Garbage Collector ● Applications which rely heavily on memory consumption. ● GC Strategies ○ Concurrent Mark Sweep (CMS) GC ○ ParallelOld GC ○ Garbage-First GC ● Tuning steps: ○ Review your logic and object management ○ Try Garbage-First ○ Activate and inspect the logs ● Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
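The "try Garbage-First" and "activate the logs" steps can be expressed as executor JVM options. This is a sketch only: the flag set below is an assumption that targets Java 8-era JVMs (newer JVMs replace the `Print*` flags with `-Xlog:gc*`), and the application name is hypothetical.

```scala
import org.apache.spark.SparkConf

object GcConfSketch {
  // Assumed Java 8-era flags: enable G1 and verbose GC logging on executors.
  val gcOptions: String =
    "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

  val conf: SparkConf = new SparkConf()
    .setAppName("gc-tuning-sketch")
    .set("spark.executor.extraJavaOptions", gcOptions)
}
```

The resulting GC output ends up in each executor's stdout/stderr logs, which is where the "inspect the logs" step takes place.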
  • 31. Tuning: blockInterval blockInterval = (bi * consumers) / (pf * sc) ● CAT: Total cores per partition. ● bi: Batch Interval time in milliseconds. ● consumers: number of streaming consumers. ● pf (partitionFactor): number of partitions per core. ● sc (sparkCores): CAT - consumers.
  • 32. blockInterval: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● CAT = 120 ● sparkCores = 120 - 20 = 100 ● partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
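The formula and the example can be checked with a few lines of plain Scala. The function and parameter names are illustrative choices, not part of Spark; the computed value would then be applied through the `spark.streaming.blockInterval` property.

```scala
object BlockIntervalSketch {
  // blockInterval = (bi * consumers) / (pf * sc), where sc = CAT - consumers.
  def blockIntervalMillis(
      batchIntervalMillis: Long,
      consumers: Int,
      totalCores: Int,        // CAT in the slides
      partitionFactor: Int): Long = {
    val sparkCores = totalCores - consumers
    require(sparkCores > 0, "total cores must exceed the number of consumers")
    (batchIntervalMillis * consumers) / (partitionFactor.toLong * sparkCores)
  }

  def main(args: Array[String]): Unit =
    // (600,000 * 20) / (3 * 100) = 40,000
    println(blockIntervalMillis(600000L, consumers = 20, totalCores = 120, partitionFactor = 3))
}
```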
  • 33. Tuning: Partitioning partitions = consumers * bi / blockInterval ● consumers: number of streaming consumers. ● bi: Batch Interval time in milliseconds. ● blockInterval: interval, in milliseconds, at which received data is split into blocks before being stored in Spark.
  • 34. Partitioning: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● blockInterval = 40,000 partitions = consumers * bi / blockInterval = 20 * 600,000 / 40,000 = 300
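Evaluating the partitioning formula with these inputs gives 12,000,000 / 40,000 = 300, which equals partitionFactor * sparkCores (3 * 100) from the blockInterval example. A small plain-Scala check, with illustrative function and parameter names:

```scala
object PartitionsSketch {
  // partitions = consumers * bi / blockInterval
  def partitions(consumers: Int, batchIntervalMillis: Long, blockIntervalMillis: Long): Long = {
    require(blockIntervalMillis > 0, "blockInterval must be positive")
    consumers * batchIntervalMillis / blockIntervalMillis
  }

  def main(args: Array[String]): Unit =
    // 20 * 600,000 / 40,000 = 300
    println(partitions(consumers = 20, batchIntervalMillis = 600000L, blockIntervalMillis = 40000L))
}
```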
  • 35. Tuning: Storage ● Default (MEMORY_ONLY) ● MEMORY_ONLY_SER with Serialization Library ● MEMORY_AND_DISK & DISK_ONLY ● Replicated variants (_2 suffix) ● OFF_HEAP (Tachyon/Alluxio)
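A storage level is chosen per RDD with `persist`. The sketch below assumes Spark core on the classpath and a local master; the object name, the sample data and the `cachedWords` helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Marks the RDD to be cached in memory in serialized form, trading CPU for space.
  def cachedWords(sc: SparkContext): RDD[String] =
    sc.parallelize(Seq("spark", "streaming", "tips"))
      .persist(StorageLevel.MEMORY_ONLY_SER)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("storage-level-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = cachedWords(sc)
      println(words.count())                      // first action materializes the cache
      println(words.getStorageLevel.description)  // human-readable storage level
    } finally sc.stop()
  }
}
```

Serialized levels such as MEMORY_ONLY_SER benefit most from a compact serializer, which is where the Kryo setting comes in.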
  • 36. Where to find more information? ● Spark Official Documentation ● Databricks Blog ● Databricks Spark Knowledge Base ● Spark Notebook - By Andy Petrella ● Databricks YouTube Channel