SlideShare a Scribd company logo
Sessionization At Scale
Using Spark Streaming in production and staying sane
Marina Grechuhin & Yuval
Itzchakov
12/09/2017
YuvalItzchakov
2Confidential
• Developer @ Clicktale for the past 3 years
• Previously developer @ IDF (8200)
• @yuvalitzchakov
• https://stackoverflow.com/users/1870803/yuval-itzchakov
• http://asyncified.io
MarinaGrechuhin
3Confidential
• Team Leader @ Clicktale
• Previously co-founder and VP R&D @ SureVisit
• Previously – many more
Yes
No
Agenda
4Confidential
• Introduction to Spark
• Spark In Depth
• What Is Sessionization?
• Spark Brief Overview
• Sessionization With Spark Streaming
• Scale Challenges
• Structured Streaming with Stateful Aggregations
5Confidential
Architecture – Pipeline CEC
Elastic Load
Balancing
Auto Scaling group
Ingest
Servers
{
"version": 1,
"location":"http://adobe.com/shoe.html",
"projectId": 10,
"documentReferrer": "",
"visitId": 6403608503386111,
"domContentLoaded": 324,
"visitorId": 3246944914767871,
"pageviewId": 1199465738272767,
"engagementTime": 2336,
"messageId": 0
}
6Confidential
Pipeline CEC – Data Types
• Init Message
• Chunk Messages 0-N
• End Message
14
7Confidential
Sizing Pipeline CEC
Elastic Load
Balancing
10
500G/Day
100G/Day
Ingest
1415
Elastic Load
Balancing
8
Ingest
10
8Confidential
What is Sessionization?
Session:
“A sequence of requests made by a single end-user during a visit to a particular site”
(Wikipedia)
• To be able to aggregate user actions over time
• All data doesn’t arrive at once, but piece by piece
9Confidential
Pipeline CEC – Data Types
PageView
End
Chunk
Init
PageView
End
Chunk
Init
PageView
Chunk
Chunk
Init
Visit
PageView
PageView
PageView
PageView – User’s Journey on a single web page
Visit – User’s journey on site
10Confidential
Requirements overview
• Data size ranging between 200B – 1K (may grow over time)
• Process incoming user messages up to 100,000 messages per second
• Handle traffic peaks up to 1,000,000 messages/second (common with Fortune 500
companies)
• Scale out as needed without user intervention (hopefully linearly)
• Save user state until a session is complete, and only then send it down the pipeline
• Latency - up to 10 seconds from ingestion to processing (make data available as
soon as it’s ready)
11Confidential
12Confidential
Spark Ecosystem
Source: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
13Confidential
Spark Streaming
• Discretized Stream (DStream)
• Micro batching
• One RDD every batch
Where is the state
kept between
batches?
14Confidential
mapWithState
Source: https://databricks.com/wp-content/uploads/2016/01/blog-faster-stateful-streaming-figure-1-1024x562.png
Init Chunk End
Page
View
Visit
PageView
• Partial Updates
• Timeout
• Initial State
15Confidential
(“dardasaba”, “hello”),
(“dardasaba”,
“goodbye”),
(“hathatul”, “w00t”),
(“hathatul”, “nope”),
(“gargamel”, “muhaha”)
Executor 1
Executor 2
Executor 3
Key Value
“dardasaba” [“hello”,
“goodbye”]
Key Value
“hathatul” [“w00t”,
“nope”]
Key Value
“gargamel” [“muhaha”]
OpenHashMap[String, List[String]]
DStream[(String, String)]
Key Value
16Confidential
What Could Possibly Go Wrong?
17Confidential
Scale Challenges
• Stability
• Resiliency
• Scalability (scale up / down)
• Monitoring
18Confidential
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing
• S3 
• Task failure - Eventual consistency on read
• AWS EFS (?)
• Not suited for small file systems (limited IOPS)
• HDFS 
• Best overall write performance out of the three
• Can be installed on the same node as Spark Workers
• Relatively low maintenance (if used only for checkpoint)
19Confidential
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing (cont.)
• Problem:
• State not always recoverable
• No matter the DFS, limits your throughput:
• 1KB message size
• 100,000 messages/sec
• 1 minute checkpoint time
(occurs every 40 seconds)
• Workaround:
• None (in Spark Streaming )
Checkpointing –
This is the cost???
20Confidential
Spark Streaming Challenges
2. Resiliency -> Managing user state between application upgrades
• Problems:
• Can’t change the graph
• Can’t change your data structures
• Workaround:
• Roll your own using `stateSnapshot()`
• Provide on start up using `StateSpec.initialState()`
* Can potentially double overhead of the job time (critical with high throughput).
21Confidential
Spark Streaming Challenges
• Problem:
• Spark Streaming defaults to one job (batch) at a
time
• If a particular job is stuck, all others wait
indefinitely
• Workaround:
• Monitor job status using Sparks driver REST API
(http://<driver ip>:4040/api/v1/applications)
• Consider using Speculation (should be done
carefully)
• Enable Blacklisting if a particular node is faulty.
• If you like to live dangerously, consider modifying
“spark.streaming.concurrentJobs”
3. Stability -> Frozen Jobs
22Confidential
Spark Streaming Challenges
• Scale Up – Just works*
• Scale Down – Who takes over the worker’s state?
4. Scalability
No One!
23Confidential
Spark Streaming Challenges
• Logging mechanism – Log only small and random percent of
traffic
4. Monitoring
25Confidential
Is there a better alternative?
26Confidential
Structured Streaming
“The key idea in Structured Streaming is to treat a live data stream as a table
that is being continuously appended” (Structured Streaming Documentation)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
27Confidential
Structured Streaming (Cont.)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
28Confidential
mapGroupsWithState
A second iteration at stateful aggregations in Spark
Resiliency & Stability -> Checkpointing
• Checkpoints are incremental, only deltas!
• Allows state recovery between upgrades *
*According to a set of tests made by us, may not apply to all cases and isn’t documented
behavior
29Confidential
Spark Structured Streaming
• More new features and cool stuff
• Event based timeouts (previously only processing based)​
• Watermarking (New)​
• Deduplication (New)​
• Timeout per state item (Enhancement)​
30Confidential
Our experience so far
Running ~ 1 month in production with Spark 2.2 and mapGroupsWithState:
Pros:
• Queries seem to take less time on average than Spark Streaming *
• No need to save state manually
• Deduplication out of the box is awesome
• Event based timeouts + Watermarking for late data is also awesome
* In peak hours, from ~ 3 seconds per batch to 0.6 seconds per query (x5)
31Confidential
Our experience so far (Cont.)
Neutral:
• Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data
locality (less shuffling).
• This means that in order to scale up, you must have at least a 1:1 mapping
between number of Kafka partitions and Spark Executors.
Cons:
• Creates a significantly larger memory overhead (due to internal state implementation)
• Makes heavier use of HDFS (many small file writes)
• Doesn’t support multiple states (yet)
• UI not as good as Streaming
32Confidential
Wrapping up
• Overall, Spark Streaming is a great candidate for small-medium loads or none
Stateful aggregations streams.
• If you’re considering Spark as an option for your business, start with
Structured Streaming from the get go.
• Do consider Apache Flink and it’s similar state management module which
allows pluggable state stores as an alternative.
33Confidential
• Real-time Streaming ETL with Structured Streaming:
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-
streaming-apache-spark-2-1.html
• Making Structured Streaming Ready for Production:
https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
• Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark:
https://www.youtube.com/watch?v=JAb4FIheP28
• Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-
streaming-with-apache-spark
• Exploring Stateful Streaming with Spark Structured Streaming:
http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
Resources
Thank you for listening!
Questions?

More Related Content

What's hot

Using Redis at Facebook
Using Redis at FacebookUsing Redis at Facebook
Using Redis at Facebook
Redis Labs
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
Matt Fuller
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
Phil Peace
 
Building Scalable, Real Time Applications for Financial Services with DataStax
Building Scalable, Real Time Applications for Financial Services with DataStaxBuilding Scalable, Real Time Applications for Financial Services with DataStax
Building Scalable, Real Time Applications for Financial Services with DataStax
DataStax
 
What's new in MongoDB 2.6
What's new in MongoDB 2.6What's new in MongoDB 2.6
What's new in MongoDB 2.6
Matias Cascallares
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
Sadayuki Furuhashi
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
Michael Stack
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Hazelcast
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
mattlieber
 
Security Best Practices for your Postgres Deployment
Security Best Practices for your Postgres DeploymentSecurity Best Practices for your Postgres Deployment
Security Best Practices for your Postgres Deployment
PGConf APAC
 
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
StreamNative
 
TPC-H in MongoDB
TPC-H in MongoDBTPC-H in MongoDB
TPC-H in MongoDB
Aung Thu Rha Hein
 
NewSQL overview, Feb 2015
NewSQL overview, Feb 2015NewSQL overview, Feb 2015
NewSQL overview, Feb 2015
Ivan Glushkov
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
Cyanny LIANG
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
ScyllaDB
 
Devoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
Devoxx 2016 talk: Going Global with Nomad and Google Cloud PlatformDevoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
Devoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
Bastiaan Bakker
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL Server
Lynn Langit
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
Vadim Semenov
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
DataStax Academy
 

What's hot (20)

Using Redis at Facebook
Using Redis at FacebookUsing Redis at Facebook
Using Redis at Facebook
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Building Scalable, Real Time Applications for Financial Services with DataStax
Building Scalable, Real Time Applications for Financial Services with DataStaxBuilding Scalable, Real Time Applications for Financial Services with DataStax
Building Scalable, Real Time Applications for Financial Services with DataStax
 
What's new in MongoDB 2.6
What's new in MongoDB 2.6What's new in MongoDB 2.6
What's new in MongoDB 2.6
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
 
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and SpeedmentSpeed Up Your Existing Relational Databases with Hazelcast and Speedment
Speed Up Your Existing Relational Databases with Hazelcast and Speedment
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Security Best Practices for your Postgres Deployment
Security Best Practices for your Postgres DeploymentSecurity Best Practices for your Postgres Deployment
Security Best Practices for your Postgres Deployment
 
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
 
TPC-H in MongoDB
TPC-H in MongoDBTPC-H in MongoDB
TPC-H in MongoDB
 
NewSQL overview, Feb 2015
NewSQL overview, Feb 2015NewSQL overview, Feb 2015
NewSQL overview, Feb 2015
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Devoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
Devoxx 2016 talk: Going Global with Nomad and Google Cloud PlatformDevoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
Devoxx 2016 talk: Going Global with Nomad and Google Cloud Platform
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL Server
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 

Similar to Spark Streaming @ Scale (Clicktale)

Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
Databricks
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lightbend
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 
Spark cep
Spark cepSpark cep
Spark cep
Byungjin Kim
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
Databricks
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
Monal Daxini
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 

Similar to Spark Streaming @ Scale (Clicktale) (20)

Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Spark cep
Spark cepSpark cep
Spark cep
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 

Recently uploaded

Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
pvpriya2
 
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Dr.Costas Sachpazis
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
upoux
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
Kamal Acharya
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
Paris Salesforce Developer Group
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
Kamal Acharya
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Balvir Singh
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
PriyankaKilaniya
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
Atif Razi
 
Butterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdfButterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdf
Lubi Valves
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
wafawafa52
 
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
DharmaBanothu
 
Northrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdfNorthrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdf
takipo7507
 
comptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdfcomptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdf
foxlyon
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
DharmaBanothu
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
b0754201
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
upoux
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
DharmaBanothu
 
Lateral load-resisting systems in buildings.pptx
Lateral load-resisting systems in buildings.pptxLateral load-resisting systems in buildings.pptx
Lateral load-resisting systems in buildings.pptx
DebendraDevKhanal1
 

Recently uploaded (20)

Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...Determination of Equivalent Circuit parameters and performance characteristic...
Determination of Equivalent Circuit parameters and performance characteristic...
 
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 
Accident detection system project report.pdf
Accident detection system project report.pdfAccident detection system project report.pdf
Accident detection system project report.pdf
 
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
AI + Data Community Tour - Build the Next Generation of Apps with the Einstei...
 
Blood finder application project report (1).pdf
Blood finder application project report (1).pdfBlood finder application project report (1).pdf
Blood finder application project report (1).pdf
 
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfSri Guru Hargobind Ji - Bandi Chor Guru.pdf
Sri Guru Hargobind Ji - Bandi Chor Guru.pdf
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 
Butterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdfButterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdf
 
Ericsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.pptEricsson LTE Throughput Troubleshooting Techniques.ppt
Ericsson LTE Throughput Troubleshooting Techniques.ppt
 
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
An In-Depth Exploration of Natural Language Processing: Evolution, Applicatio...
 
Northrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdfNorthrop Grumman - Aerospace Structures Overvi.pdf
Northrop Grumman - Aerospace Structures Overvi.pdf
 
comptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdfcomptia-security-sy0-701-exam-objectives-(5-0).pdf
comptia-security-sy0-701-exam-objectives-(5-0).pdf
 
This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...This study Examines the Effectiveness of Talent Procurement through the Imple...
This study Examines the Effectiveness of Talent Procurement through the Imple...
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
 
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
一比一原版(uofo毕业证书)美国俄勒冈大学毕业证如何办理
 
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Lateral load-resisting systems in buildings.pptx
Lateral load-resisting systems in buildings.pptxLateral load-resisting systems in buildings.pptx
Lateral load-resisting systems in buildings.pptx
 

Spark Streaming @ Scale (Clicktale)

  • 1. Sessionization At Scale Using Spark Streaming in production and staying sane Marina Grechuhin & Yuval Itzchakov 12/09/2017
  • 2. YuvalItzchakov 2Confidential • Developer @ Clicktale for the past 3 years • Previously developer @ IDF (8200) • @yuvalitzchakov • https://stackoverflow.com/users/1870803/yuval-itzchakov • http://asyncified.io
  • 3. MarinaGrechuhin 3Confidential • Team Leader @ Clicktale • Previously co-founder and VP R&D @ SureVisit • Previously – many more
  • 4. Yes No Agenda 4Confidential • Introduction to Spark • Spark In Depth • What Is Sessionization? • Spark Brief Overview • Sessionization With Spark Streaming • Scale Challenges • Structured Streaming with Stateful Aggregations
  • 5. 5Confidential Architecture – Pipeline CEC Elastic Load Balancing Auto Scaling group Ingest Servers
  • 6. { "version": 1, "location":"http://adobe.com/shoe.html", "projectId": 10, "documentReferrer": "", "visitId": 6403608503386111, "domContentLoaded": 324, "visitorId": 3246944914767871, "pageviewId": 1199465738272767, "engagementTime": 2336, "messageId": 0 } 6Confidential Pipeline CEC – Data Types • Init Message • Chunk Messages 0-N • End Message
  • 7. 14 7Confidential Sizing Pipeline CEC Elastic Load Balancing 10 500G/Day 100G/Day Ingest 1415 Elastic Load Balancing 8 Ingest 10
  • 8. 8Confidential What is Sessionization? Session: “A sequence of requests made by a single end-user during a visit to a particular site” (Wikipedia) • To be able to aggregate user actions over time • All data doesn’t arrive at once, but piece by piece
  • 9. 9Confidential Pipeline CEC – Data Types PageView End Chunk Init PageView End Chunk Init PageView Chunk Chunk Init Visit PageView PageView PageView PageView – User’s Journey on a single web page Visit – User’s journey on site
  • 10. 10Confidential Requirements overview • Data size ranging between 200B – 1K (may grow over time) • Process incoming user messages up to 100,000 messages per second • Handle traffic peaks up to 1,000,000 messages/second (common with Fortune 500 companies) • Scale out as needed without user intervention (hopefully linearly) • Save user state until a session is complete, and only then send it down the pipeline • Latency - up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
  • 13. 13Confidential Spark Streaming • Discretized Stream (DStream) • Micro batching • One RDD every batch Where is the state kept between batches?
  • 15. 15Confidential (“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”) Executor 1 Executor 2 Executor 3 Key Value “dardasaba” [“hello”, “goodbye”] Key Value “hathatul” [“w00t”, “nope”] Key Value “gargamel” [“muhaha”] OpenHashMap[String, List[String]] DStream[(String, String)] Key Value
  • 17. 17Confidential Scale Challenges • Stability • Resiliency • Scalability (scale up / down) • Monitoring
  • 18. 18Confidential Spark Streaming Challenges 1. Stability & Resiliency -> Checkpointing • S3  • Task failure - Eventual consistency on read • AWS EFS (?) • Not suited for small file systems (limited IOPS) • HDFS  • Best overall write performance out of the three • Can be installed on the same node as Spark Workers • Relatively low maintenance (if used only for checkpoint)
  • 19. 19Confidential Spark Streaming Challenges 1. Stability & Resiliency -> Checkpointing (cont.) • Problem: • State not always recoverable • No matter the DFS, limits your throughput: • 1KB message size • 100,000 messages/sec • 1 minute checkpoint time (occurs every 40 seconds) • Workaround: • None (in Spark Streaming ) Checkpointing – This is the cost???
  • 20. 20Confidential Spark Streaming Challenges 2. Resiliency -> Managing user state between application upgrades • Problems: • Can’t change the graph • Can’t change your data structures • Workaround: • Roll your own using `stateSnapshot()` • Provide on start up using `StateSpec.initialState()` * Can potentially double overhead of the job time (critical with high throughput).
  • 21. 21Confidential Spark Streaming Challenges • Problem: • Spark Streaming defaults to one job (batch) at a time • If a particular job is stuck, all others wait indefinitely • Workaround: • Monitor job status using Sparks driver REST API (http://<driver ip>:4040/api/v1/applications) • Consider using Speculation (should be done carefully) • Enable Blacklisting if a particular node is faulty. • If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs” 3. Stability -> Frozen Jobs
  • 22. 22Confidential Spark Streaming Challenges • Scale Up – Just works* • Scale Down – Who takes over the worker’s state? 4. Scalability No One!
  • 23. 23Confidential Spark Streaming Challenges • Logging mechanism – Log only small and random percent of traffic 4. Monitoring
  • 24. 25Confidential Is there a better alternative?
  • 25. 26Confidential Structured Streaming “The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation) Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
  • 26. 27Confidential Structured Streaming (Cont.) Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
  • 27. 28Confidential mapGroupsWithState A second iteration at stateful aggregations in Spark Resiliency & Stability -> Checkpointing • Checkpoints are incremental, only deltas! • Allows state recovery between upgrades * *According to a set of tests made by us, may not apply to all cases and isn’t documented behavior
  • 28. 29Confidential Spark Structured Streaming • More new features and cool stuff • Event based timeouts (previously only processing based)​ • Watermarking (New)​ • Deduplication (New)​ • Timeout per state item (Enhancement)​
  • 29. 30Confidential Our experience so far Running ~ 1 month in production with Spark 2.2 and mapGroupsWithState: Pros: • Queries seem to take less time on average than Spark Streaming * • No need to save state manually • Deduplication out of the box is awesome • Event based timeouts + Watermarking for late data is also awesome * In peak hours, from ~ 3 seconds per batch to 0.6 seconds per query (x5)
  • 30. 31Confidential Our experience so far (Cont.) Neutral: • Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data locality (less shuffling). • This means that in order to scale up, you must have at least a 1:1 mapping between number of Kafka partitions and Spark Executors. Cons: • Creates a significantly larger memory overhead (due to internal state implementation) • Makes heavier use of HDFS (many small file writes) • Doesn’t support multiple states (yet) • UI not as good as Streaming
  • 31. 32Confidential Wrapping up • Overall, Spark Streaming is a great candidate for small-medium loads or none Stateful aggregations streams. • If you’re considering Spark as an option for your business, start with Structured Streaming from the get go. • Do consider Apache Flink and it’s similar state management module which allows pluggable state stores as an alternative.
  • 32. 33Confidential • Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured- streaming-apache-spark-2-1.html • Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be • Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28 • Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful- streaming-with-apache-spark • Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming Resources
  • 33. Thank you for listening! Questions?

Editor's Notes

  1. Monitor message sending mechanism 1 kafka topic
  2. Monitor message sending mechanism 1 kafka topic
  3. How do we aggregate user messages over time in a Streaming application??
  4. Do a brief overview of all points, 15-20 seconds per point. At the end of the slide do an intro to Spark and talk a little about why we chose it over alternatives
  5. Ask a question: How many people use Spark in production? How many people use Spark Streaming in production? How many do Sessionization?
  6. Spark is not real time streaming, but micro batching Where is the state held?
  7. Talk about each file system briefly