Sessionization
with Spark streaming
Ramūnas Urbonas
@ Platform Lunar
Disclosure
• This work was implemented in Adform
• Thanks to the Hadoop team for their permission and help
History
• Original idea by Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands-on implementation at Adform in 2016
The Problem
• Constant flow of page visits
110 GB per day on average, with volume variations and catch-up scenarios
• Wait for session interrupts
Timeout, a specific action, midnight, sanity checks
• Calculate session duration, length, and reaction times
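The interrupt rules above can be sketched as a predicate over consecutive records. A minimal sketch in plain Scala; `Record`, the 30-minute timeout and the "logout" action are illustrative assumptions, not the actual Adform rules:

```scala
// Illustrative record shape; timestamps in seconds (Unix epoch).
case class Record(userId: String, ts: Long, page: String)

val timeout = 30 * 60 // assumed 30-minute inactivity timeout

// UTC midnight check: do both timestamps fall on the same day?
def sameDay(a: Long, b: Long): Boolean = a / 86400 == b / 86400

def isInterrupt(prev: Record, next: Record): Boolean =
  next.ts - prev.ts > timeout ||   // timeout
  prev.page == "logout" ||         // a specific action ends the session
  !sameDay(prev.ts, next.ts) ||    // midnight boundary
  next.ts < prev.ts                // sanity check: out-of-order record
```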
The Problem
• Constant ingress / egress
One car enters, a car trailer exits
A join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume
Stream / Not
• Still not 100% sure it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means a less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity
Cons
• A more complex application
Just like cars: a ride to work vs. a trip to Portugal
• A steady pace is required
Throttling is mandatory, volume control is essential, GC must be well tuned
• Permanently reserved resources
Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit
Extra context
• User data is partitioned by nature
A user ID (range) is the obvious partition key
Helps us control ingress size and, most importantly, loop volume
• Loop volume is hard to control
Average flow was around 150 MB; the loop varied from 2 to 8 GB
Algorithm
(diagram: ingress and stored state meet in a join — the updateStateByKey pattern)
Algorithm
(diagram: joined sessions pass a decision step; complete sessions have results calculated, incomplete ones are stored for later)
Copy & Paste
• Ted’s solution relies on updateStateByKey
This method requires checkpointing
• Checkpoints
Good only on paper; they are meant for soft recovery
The Thought
val sc = new SparkContext(…)
val ssc = new StreamingContext(sc, Minutes(2))
val ingress = ssc.textFileStream("folder").groupBy(userId)
val checkpoint = sc.textFile("checkpoint").groupBy(userId)
val sessions = checkpoint.fullOuterJoin(ingress)(userId)
  .cache
sessions.filter(complete).map(enrich).saveAsTextFile("output")
sessions.filter(inComplete).saveAsTextFile("checkpoint")
fileStream
• Works based on file timestamps, with some memory
A bit fuzzy, awkward for testing
• We wanted more control and monitoring
Our file names carried meta information (source, oldest record time)
Custom implementation with external state (key-value store)
We could control ingress size
Tip: persist the actual job plan
Checkpoint
user-1 1477983123 page-26
user-1 1477983256 page-2
user-2 1477982342 home
user-2 1477982947 page-9
user-2 1477984343 home
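From checkpoint records like the sample above, the session metrics mentioned earlier (duration, length) can be derived once records are grouped per user. A minimal plain-Scala sketch, not the actual Adform code; `Record`, `sessionize` and the 30-minute timeout are illustrative:

```scala
case class Record(userId: String, ts: Long, page: String)

val timeout = 30 * 60 // assumed 30-minute inactivity timeout, in seconds

// Split one user's time-ordered records into sessions on the timeout rule.
// Sessions are built head-first (newest record at the head), then reversed.
def sessionize(records: List[Record]): List[List[Record]] =
  records.foldLeft(List.empty[List[Record]]) {
    case (Nil, r) => List(List(r))
    case (cur :: done, r) if r.ts - cur.head.ts <= timeout => (r :: cur) :: done
    case (sessions, r) => List(r) :: sessions
  }.map(_.reverse).reverse

// user-2's rows from the checkpoint sample above
val checkpoint = List(
  Record("user-2", 1477982342L, "home"),
  Record("user-2", 1477982947L, "page-9"),
  Record("user-2", 1477984343L, "home"))

val sessions = sessionize(checkpoint)
// duration = last ts - first ts, length = number of page views
val stats = sessions.map(s => (s.last.ts - s.head.ts, s.size))
```

With these timestamps all gaps are under 30 minutes, so the three rows form a single session of duration 2001 seconds and length 3.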
Checkpoint
• Custom implementation
We wanted to preserve the checkpoint grouping
• Nothing fancy
class SessionInputFormat
extends FileInputFormat[SessionKey, List[Record]]
fullOuterJoin
• Probably the most expensive operation
The average ratio is 1:35, with extremes of 1:100
We found the IndexedRDD contribution
IndexedRDD
• IndexedRDD
https://github.com/amplab/spark-indexedrdd
• Partition control is essential
Avoid extra stages and extra shuffles in your job
Use an explicit partitioner, even if it is just HashPartitioner
Get used to specifying a partitioner for every groupBy / combineByKey
Keep the partition count exact and controllable
IndexedRDD
cache & repetition
• Remember?
.cache
.filter(complete).doStuff
.filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan the entire dataset twice
Also… a two-phase commit
Multi Output Format
• Custom implementation
We wanted a different format for each output
Not that hard, but lots of copy-paste
Communication happens via the Hadoop configuration
• MultipleOutputFormat
Why did we not use it?
Gotcha
val conf = new JobConf(rdd.context.hadoopConfiguration)

conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)

conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)

sessions.saveAsNewAPIHadoopDataset(conf)
Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
where each server handles a range of user IDs
• Just foreachRDD
… or is it? :D
Resource utilisation
0
25
50
75
100
Resource utilisation
0
25
50
75
100
Parallelise
• Just rdds.par.foreach(processOne)
… or is it ? :D
• Limit thread pool
val par = rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
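The thread-pool cap above can be exercised with plain parallel collections. A sketch under Scala 2.12-era assumptions (where `scala.collection.parallel` ships with the standard library; in 2.13 it is a separate module); the vector of integers stands in for the per-server RDDs:

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Stand-ins for the per-server RDDs; processOne would launch one Spark job each.
val rdds = (1 to 100).toVector

val par = rdds.par
// Cap concurrency at 10, so at most 10 "jobs" run at once.
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))

val results = par.map(_ * 2)
```

Without setting `tasksupport`, the default pool sizes itself to the number of cores, which is exactly the uncontrolled parallelism the slide warns about.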
The Algorithm
val stream = new OurCustomDStream(..)
stream.foreachRDD(processUnion)
…
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne)
The Algorithm
val delta = one.map(addSessionKey)
  .combineByKey[List[Record]](..., new HashPartitioner(20))
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)
Result
Configuration
• Current configuration
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size is not that big
60 GB RAM and 30 cores
Previously it was 52 SQL instances… doing other things too
• Hasn’t changed in half a year
Metrics
My Pride
Other tips
• -XX:+UseG1GC
For both driver and executors
• Plan & store jobs, repeat if failed
When repeating, the environment may have changed
• Use named RDDs
They make your DAGs much easier to read
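The G1 tip above can be wired in through standard Spark configuration; a sketch (the class name and jar are placeholders, the two `extraJavaOptions` keys are standard Spark settings):

```shell
spark-submit \
  --class com.example.Sessionization \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  sessionization.jar
```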
Thanks
