Spark Streaming Tips for Devs & Ops
WHO ARE WE?
Fede Fernández
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@fede_fdz
Fran Pérez
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@FPerezP
Overview
Spark Streaming
Spark + Kafka
groupByKey vs reduceByKey
Table Joins
Serializer
Tuning
Spark Streaming
Real-time data processing
[Diagram: a continuous data flow is split into a sequence of RDDs (a DStream), which is processed into output data.]
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once
groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce
operation.
groupByKey VS reduceByKey
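// Word count with groupByKey: every (word, 1) pair is shuffled, then summed
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
// Word count with reduceByKey: pairs are combined per partition before the shuffle
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).reduceByKey(_ + _)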
groupByKey
[Diagram: the 15 (word, 1) pairs for the keys c, s and j are spread across partitions. Every pair is shuffled to the partition that owns its key, and only then are the values summed into (j, 5), (s, 6) and (c, 4).]
reduceByKey
[Diagram: pairs with the same key are first combined within each partition, for example into (j, 2), (s, 2) and (c, 2). Only these partial results are shuffled, and the final reduce yields (j, 5), (s, 6) and (c, 4).]
reduce VS group
● reduceByKey improves performance by combining values before the shuffle
● reduceByKey can’t always be used
● groupByKey can cause Out of Memory exceptions
● Alternatives: aggregateByKey, foldByKey, combineByKey
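The difference in shuffle volume can be sketched without a cluster. The following is a minimal simulation on plain Scala collections, not Spark code: the three-partition layout, the object name `ShuffleSketch` and its helper functions are assumptions made for illustration. It counts how many records would cross the shuffle when every pair is sent as-is (groupByKey style) versus when each partition combines its pairs first (reduceByKey style).

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Hypothetical layout: three partitions holding (word, 1) pairs for keys c, s, j.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("c" -> 1, "s" -> 1, "j" -> 1, "s" -> 1, "c" -> 1),
    Seq("c" -> 1, "j" -> 1, "j" -> 1, "s" -> 1, "s" -> 1),
    Seq("s" -> 1, "j" -> 1, "j" -> 1, "c" -> 1, "s" -> 1)
  )

  // Sum the values of all pairs that share a key.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, values) => key -> values.map(_._2).sum }

  // groupByKey style: every pair crosses the shuffle, then values are summed.
  // Returns (number of shuffled records, final counts).
  def groupStyle(parts: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val shuffled = parts.flatten
    (shuffled.size, sumByKey(shuffled))
  }

  // reduceByKey style: each partition combines locally, only partial sums are shuffled.
  def reduceStyle(parts: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val shuffled = parts.flatMap(part => sumByKey(part).toSeq)
    (shuffled.size, sumByKey(shuffled))
  }

  def main(args: Array[String]): Unit = {
    println(groupStyle(partitions))
    println(reduceStyle(partitions))
  }
}
```

Both strategies produce the same counts; only the number of records that has to travel between partitions changes, which is where the performance gain comes from.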
Table Joins
● Typical operations that can be improved
● Need a previous analysis
● There are no silver bullets
Table Joins: Medium - Large
FILTER
No Shuffle
Table Joins: Small - Large
...
Shuffled Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- SortMergeJoin]
[ :- Sort]
[ : +- TungstenExchange hashpartitioning]
[ : +- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Sort]
[ +- TungstenExchange hashpartitioning]
[ +- ConvertToUnsafe]
[ +- Scan ExistingRDD]
Table Joins: Small - Large
Broadcast Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- BroadcastHashJoin]
[ :- TungstenExchange RoundRobinPartitioning]
[ : +- ConvertToUnsafe]
[ : +- Scan ExistingRDD]
[ +- Scan ParquetRelation]
No shuffle!
Default since Spark 1.4 when using the DataFrame API
Prior to Spark 1.4:
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
Broadcast
Table Joins: Small - Large
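The idea behind a broadcast hash join can be sketched in plain Scala. This is a conceptual illustration only, not Spark's implementation: the case classes `Country` and `Visit`, the object name and both helper functions are hypothetical. The small table is turned into a hash map once (the part Spark would ship to every executor), and each partition of the large table is then joined locally by probing that map, so the large table never needs to be shuffled.

```scala
object BroadcastJoinSketch {
  // Hypothetical row types for the small and the large table.
  final case class Country(code: String, name: String)
  final case class Visit(userId: Long, countryCode: String)

  // Build the lookup table once from the small side.
  def buildLookup(small: Seq[Country]): Map[String, String] =
    small.map(c => c.code -> c.name).toMap

  // Inner join of one large-side partition against the lookup table.
  // Rows without a match in the small table are dropped.
  def joinPartition(large: Iterator[Visit],
                    lookup: Map[String, String]): Iterator[(Long, String)] =
    large.flatMap(v => lookup.get(v.countryCode).map(name => v.userId -> name))

  def main(args: Array[String]): Unit = {
    val lookup = buildLookup(Seq(Country("ES", "Spain"), Country("US", "United States")))
    val partition = Iterator(Visit(1L, "ES"), Visit(2L, "FR"), Visit(3L, "US"))
    joinPartition(partition, lookup).foreach(println)
  }
}
```

The approach only pays off while the small table fits comfortably in the memory of each executor; otherwise a shuffle-based join is the safer choice.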
Serializers
● Java serialization (ObjectOutputStream framework). Default.
● Custom serialization: implement Serializable or Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Take special care with JodaTime.
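As a rough sketch of the Kryo bullet, the configuration below switches the serializer and registers a custom class. The `Event` class, the application name and the object name are placeholders invented for this example; `spark.serializer` and `SparkConf.registerKryoClasses` are part of Spark's public configuration API.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Hypothetical domain class that travels between driver and executors.
  final case class Event(id: Long, payload: String)

  // Switch from Java serialization to Kryo and register the custom class,
  // so Kryo can write a compact class identifier instead of the full class name.
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-sketch") // placeholder name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[Event]))
}
```

Classes without a Kryo-friendly layout (JodaTime types are a common case) may additionally need a custom Kryo serializer, which is not shown here.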
Tuning
Garbage Collector
blockInterval
Partitioning
Storage
Tuning: Garbage Collector
• Applications which rely heavily on memory consumption.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
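One possible way to follow the last two tuning steps is to pass the JVM flags through the executor options. This is a sketch under assumptions: the flag set targets a Java 8 JVM (newer JVMs replace the `Print*` flags with `-Xlog:gc*`), and the application and object names are placeholders.

```scala
import org.apache.spark.SparkConf

object GcConfSketch {
  // Java 8 style flags: select Garbage-First and write detailed GC logs.
  val gcFlags: String = Seq(
    "-XX:+UseG1GC",
    "-verbose:gc",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps"
  ).mkString(" ")

  // Executors pick the flags up from spark.executor.extraJavaOptions.
  val conf: SparkConf = new SparkConf()
    .setAppName("gc-tuning-sketch") // placeholder name
    .set("spark.executor.extraJavaOptions", gcFlags)
}
```

The resulting GC output ends up in each executor's log, which is where the "inspect the logs" step takes place.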
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores per partition.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
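The formula can be checked with a few lines of Scala. This is a small sketch whose object and function names are made up for illustration; the resulting value is what would be supplied as `spark.streaming.blockInterval`.

```scala
object BlockIntervalSketch {
  // blockInterval = (bi * consumers) / (pf * sc), where sc = CAT - consumers.
  def blockIntervalMillis(batchIntervalMillis: Long,
                          consumers: Int,
                          totalCores: Int,
                          partitionFactor: Int): Long = {
    val sparkCores = totalCores - consumers
    require(sparkCores > 0, "at least one core must be left for processing")
    (batchIntervalMillis * consumers) / (partitionFactor.toLong * sparkCores)
  }

  def main(args: Array[String]): Unit = {
    // Values from the example: bi = 600,000 ms, 20 consumers, CAT = 120, pf = 3.
    val bi = blockIntervalMillis(600000L, 20, 120, 3)
    println(s"spark.streaming.blockInterval=${bi}ms")
  }
}
```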
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: interval at which received data is split into blocks before
being stored in Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000 / 40,000 =
300
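The same arithmetic as a short Scala sketch (object and function names are illustrative). With the example values, 20 * 600,000 / 40,000 evaluates to 300, which matches partitionFactor * sparkCores = 3 * 100 from the blockInterval example.

```scala
object PartitionsSketch {
  // partitions = consumers * bi / blockInterval
  def partitionsPerBatch(consumers: Int,
                         batchIntervalMillis: Long,
                         blockIntervalMillis: Long): Long = {
    require(blockIntervalMillis > 0, "blockInterval must be positive")
    consumers * batchIntervalMillis / blockIntervalMillis
  }

  def main(args: Array[String]): Unit = {
    // Values from the example: 20 consumers, bi = 600,000 ms, blockInterval = 40,000 ms.
    println(partitionsPerBatch(20, 600000L, 40000L))
  }
}
```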
Tuning: Storage
• Default (MEMORY_ONLY)
• MEMORY_ONLY_SER with Serialization Library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated variants (suffix _2)
• OFF_HEAP (Tachyon/Alluxio)
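A storage level is chosen per RDD (or DStream) through `persist`. The snippet below is a minimal sketch that assumes a local master and a toy dataset; the object name, application name and data are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for a quick experiment.
    val conf = new SparkConf().setAppName("storage-level-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("c", "s", "j", "s", "c"))
      // Keep the data in memory in serialized form instead of the default MEMORY_ONLY.
      words.persist(StorageLevel.MEMORY_ONLY_SER)
      println(words.count())
      println(words.getStorageLevel.description)
    } finally {
      sc.stop()
    }
  }
}
```

Serialized levels trade CPU time for a smaller memory footprint, so they tend to combine well with a fast serializer such as Kryo.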
Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel
QUESTIONS
Fede Fernández
@fede_fdz
fede.f@47deg.com
Fran Pérez
@FPerezP
fran.p@47deg.com
Thanks!

A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers ...
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every level
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
 


Spark Streaming Tips for Devs and Ops, by Fran Pérez and Federico Fernández

  • 3. WHO ARE WE? Fede Fernández Scala Software Engineer at 47 Degrees Spark Certified Developer @fede_fdz Fran Pérez Scala Software Engineer at 47 Degrees Spark Certified Developer @FPerezP
  • 4. Overview: Spark Streaming, Spark + Kafka, groupByKey vs reduceByKey, Table Joins, Serializer, Tuning
  • 5. Spark Streaming Real-time data processing Continuous Data Flow RDD RDD RDD DStream Output Data
  • 6. Spark + Kafka ● Receiver-based Approach ○ At least once (with Write Ahead Logs) ● Direct API ○ Exactly once
  • 7. Spark + Kafka ● Receiver-based Approach
  • 8. Spark + Kafka ● Direct API
  • 9. groupByKey VS reduceByKey ● groupByKey ○ Groups pairs of data with the same key. ● reduceByKey ○ Groups and combines pairs of data based on a reduce operation.
  • 10. groupByKey VS reduceByKey sc.textFile("hdfs://….") .flatMap(_.split(" ")) .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum)) sc.textFile("hdfs://….") .flatMap(_.split(" ")) .map((_, 1)).reduceByKey(_ + _)
  • 11. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 13. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 14. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (c, 1) (c, 1) (c, 1) (c, 1) shuffle shuffle
  • 15. groupByKey (c, 1) (s, 1) (j, 1) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (s, 1) (s, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 1) (j, 5) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 1) (s, 6) (c, 1) (c, 1) (c, 1) (c, 1) (c, 4) shuffle shuffle
  • 16. reduceByKey (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 1) (s, 1) (s, 1) (s, 1) (c, 1) (c, 1) (j, 1) (c, 1) (s, 1) (j, 1)
  • 17. reduceByKey (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1)
  • 19. reduceByKey (j, 2) (j, 1) (j, 1) (j, 1) (s, 2) (s, 2) (s, 1) (s, 1) (c, 1) (c, 2) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 20. reduceByKey (j, 2) (j, 1) (j, 1) (j, 1) (j, 5) (s, 2) (s, 2) (s, 1) (s, 1) (s, 6) (c, 1) (c, 2) (c, 1) (c, 4) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) (j, 2) (c, 1) (s, 2) (j, 1) (s, 1) (s, 1) (j, 1) (s, 2) (s, 1) (c, 1) (c, 1) (j, 1) (j, 1) (c, 2) (s, 1) (c, 1) (s, 1) (j, 1) (j, 1) (c, 1) (s, 1) shuffle shuffle
  • 21. reduce VS group ● Improve performance ● Can’t always be used ● Out of Memory Exceptions ● aggregateByKey, foldByKey, combineByKey
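The by-key alternatives named on the slide above can be sketched as follows. This is a minimal illustration, not the speakers' code: the `local[2]` master, the app name, and the object and method names are assumptions made for the example, and the input keys are the fifteen pairs from the groupByKey slides. `reduceByKey`, `foldByKey` and `aggregateByKey` all combine values inside each partition before the shuffle, which is why they tend to move less data than `groupByKey`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ByKeyAlternatives {
  // Keys taken from the groupByKey slides: 4 x "c", 6 x "s", 5 x "j".
  val keys: Seq[String] =
    Seq("c", "s", "j", "s", "c", "c", "j", "j", "s", "s", "s", "j", "j", "c", "s")

  // reduceByKey: value type and result type are the same.
  def countWithReduce(sc: SparkContext): Map[String, Int] =
    sc.parallelize(keys).map((_, 1)).reduceByKey(_ + _).collect().toMap

  // foldByKey: like reduceByKey, with an explicit zero value.
  def countWithFold(sc: SparkContext): Map[String, Int] =
    sc.parallelize(keys).map((_, 1)).foldByKey(0)(_ + _).collect().toMap

  // aggregateByKey: the result type, here (sum, count), may differ from the value type.
  def sumAndCount(sc: SparkContext): Map[String, (Int, Int)] =
    sc.parallelize(keys).map((_, 1))
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))   // merge accumulators across partitions
      .collect().toMap

  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative assumptions.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("by-key-alternatives"))
    try {
      println(countWithReduce(sc))
      println(countWithFold(sc))
      println(sumAndCount(sc))
    } finally sc.stop()
  }
}
```

`combineByKey` is the most general of the group; the three methods above are convenience forms of the same per-partition combine step.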
  • 22. Table Joins ● Typical operations that can be improved ● Need a previous analysis ● There are no silver bullets
  • 24. Table Joins: Medium - Large FILTER No Shuffle
  • 25. Table Joins: Small - Large ... Shuffled Hash Join sqlContext.sql("explain <select>").collect.mkString("\n") [== Physical Plan ==] [Project] [+- SortMergeJoin] [ :- Sort] [ : +- TungstenExchange hashpartitioning] [ : +- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Sort] [ +- TungstenExchange hashpartitioning] [ +- ConvertToUnsafe] [ +- Scan ExistingRDD]
  • 26. Table Joins: Small - Large Broadcast Hash Join sqlContext.sql("explain <select>").collect.mkString("\n") [== Physical Plan ==] [Project] [+- BroadcastHashJoin] [ :- TungstenExchange RoundRobinPartitioning] [ : +- ConvertToUnsafe] [ : +- Scan ExistingRDD] [ +- Scan ParquetRelation] No shuffle! By default from Spark 1.4 when using the DataFrame API. Prior to Spark 1.4: ANALYZE TABLE small_table COMPUTE STATISTICS noscan Broadcast
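A broadcast hash join can also be requested explicitly. The sketch below is an assumption-laden illustration rather than the code behind the slide: it uses the `SparkSession` entry point (Spark 2.x or later, whereas the slides use `sqlContext`), made-up table contents and column names, and the `broadcast` hint from `org.apache.spark.sql.functions`. Automatic broadcasting of small tables is governed by the `spark.sql.autoBroadcastJoinThreshold` setting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a larger table against a small lookup table, hinting that the
  // small side should be shipped to every executor instead of shuffled.
  def joinWithBroadcast(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical data: 1000 events, each pointing at one of 3 countries.
    val large = (1 to 1000).map(i => (i % 3, s"event-$i")).toDF("country_id", "payload")
    val small = Seq((0, "ES"), (1, "UK"), (2, "US")).toDF("country_id", "code")
    large.join(broadcast(small), "country_id")
  }

  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    try {
      val joined = joinWithBroadcast(spark)
      joined.explain() // inspect the physical plan for the chosen join strategy
      println(joined.count())
    } finally spark.stop()
  }
}
```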
  • 28. Serializers ● Java’s ObjectOutputStream framework. (Default) ● Custom serializers: extends Serializable & Externalizable. ● KryoSerializer: register your custom classes. ● Where is our code being run? ● Take special care with JodaTime.
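Switching to Kryo and registering custom classes, as the serializer slide above suggests, might look like the following. This is a minimal sketch: `Reading` is a hypothetical domain class and the app name is arbitrary. Only `SparkConf.set` and `SparkConf.registerKryoClasses` are used.

```scala
import org.apache.spark.SparkConf

// Hypothetical domain class that would be shuffled or cached by the job.
final case class Reading(sensorId: String, value: Double)

object KryoConfSketch {
  def kryoConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-sketch")
      // Replace the default Java serialization with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written with a compact id instead of the full class name.
      .registerKryoClasses(Array(classOf[Reading]))

  def main(args: Array[String]): Unit =
    println(kryoConf().get("spark.serializer"))
}
```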
  • 30. Tuning: Garbage Collector • Applications which rely heavily on memory consumption. • GC Strategies • Concurrent Mark Sweep (CMS) GC • ParallelOld GC • Garbage-First GC • Tuning steps: • Review your logic and object management • Try Garbage-First • Activate and inspect the logs Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
  • 31. Tuning: blockInterval blockInterval = (bi * consumers) / (pf * sc) ● CAT: Total cores per partition. ● bi: Batch Interval time in milliseconds. ● consumers: number of streaming consumers. ● pf (partitionFactor): number of partitions per core. ● sc (sparkCores): CAT - consumers.
  • 32. blockInterval: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● CAT = 120 ● sparkCores = 120 - 20 = 100 ● partitionFactor = 3 blockInterval = (bi * consumers) / (pf * sc) = (600,000 * 20) / (3 * 100) = 40,000
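The blockInterval arithmetic from the two slides above can be checked with a few lines of plain Scala. The object, method and parameter names are illustrative; the formula and the input values are the ones on the slides. In a real job the result would be supplied through the `spark.streaming.blockInterval` setting.

```scala
object BlockIntervalSketch {
  // blockInterval = (bi * consumers) / (pf * sc), with sc = CAT - consumers.
  def blockIntervalMillis(batchIntervalMillis: Long,
                          consumers: Int,
                          totalCores: Int,        // "CAT" on the slide
                          partitionFactor: Int): Long = {
    val sparkCores = totalCores - consumers
    require(sparkCores > 0, "consumers must leave at least one core for processing")
    (batchIntervalMillis * consumers) / (partitionFactor.toLong * sparkCores)
  }

  def main(args: Array[String]): Unit =
    // Slide values: bi = 600,000 ms, consumers = 20, CAT = 120, pf = 3.
    println(blockIntervalMillis(600000L, 20, 120, 3)) // prints 40000
}
```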
  • 33. Tuning: Partitioning partitions = consumers * bi / blockInterval ● consumers: number of streaming consumers. ● bi: Batch Interval time in milliseconds. ● blockInterval: time size to split data before storing into Spark.
  • 34. Partitioning: example ● batchIntervalMillis = 600,000 ● consumers = 20 ● blockInterval = 40,000 partitions = consumers * bi / blockInterval = 20 * 600,000 / 40,000 = 300
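The partition formula can be evaluated the same way. With the inputs listed on the slide, 20 * 600,000 / 40,000 evaluates to 300, which equals partitionFactor * sparkCores (3 * 100) from the blockInterval example. The names below are illustrative.

```scala
object PartitionCountSketch {
  // partitions = consumers * bi / blockInterval
  def partitions(consumers: Int,
                 batchIntervalMillis: Long,
                 blockIntervalMillis: Long): Long = {
    require(blockIntervalMillis > 0, "blockInterval must be positive")
    consumers * batchIntervalMillis / blockIntervalMillis
  }

  def main(args: Array[String]): Unit =
    // Slide values: consumers = 20, bi = 600,000 ms, blockInterval = 40,000 ms.
    println(partitions(20, 600000L, 40000L)) // prints 300
}
```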
  • 35. Tuning: Storage • Default (MEMORY_ONLY) • MEMORY_ONLY_SER with Serialization Library • MEMORY_AND_DISK & DISK_ONLY • Replicated _2 • OFF_HEAP (Tachyon/Alluxio)
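Choosing one of the storage levels listed above is a single `persist` call. The sketch below uses `MEMORY_ONLY_SER` on a throwaway RDD; the data, master and app name are assumptions for illustration, and the same `StorageLevel` constants cover the disk-backed and replicated (`_2`) variants.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Caches the RDD as serialized bytes, trading CPU time for a smaller memory footprint.
  def persistSerialized(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)

  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative assumptions.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("storage-level-sketch"))
    try {
      val rdd = persistSerialized(sc)
      println(rdd.count())           // first action materializes the cache
      println(rdd.getStorageLevel)   // shows the level the RDD was persisted with
    } finally sc.stop()
  }
}
```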
  • 36. Where to find more information? Spark Official Documentation Databricks Blog Databricks Spark Knowledge Base Spark Notebook - By Andy Petrella Databricks YouTube Channel