3. Disclosure
• This work was implemented at Adform
• Thanks to the Hadoop team for permission and help
4. History
• Original idea from Ted Malaska, 2014
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
• Hands-on at Adform in 2016
5. The Problem
• Constant flow of page visits
110 GB per day on average, with volume variations and catch-up scenarios
• Wait for session interrupts
Timeout, a specific action, midnight, sanity checks (see the sketch below)
• Calculate session duration, length, reaction times
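A minimal sketch of those interrupt rules as a predicate; the record shape, the 30-minute timeout, and the "logout" action are illustrative assumptions, not the production rules:

case class Record(userId: Long, timestamp: Long, action: String)

val sessionTimeoutMs = 30 * 60 * 1000L                   // assumed timeout
def utcDay(ts: Long): Long = ts / (24 * 60 * 60 * 1000L)

// a session is interrupted by a timeout, a specific action, or midnight
def isComplete(session: Seq[Record], now: Long): Boolean = {
  val last = session.map(_.timestamp).max
  now - last > sessionTimeoutMs ||
    session.exists(_.action == "logout") ||
    utcDay(now) != utcDay(last)
}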
6. The Problem
• Constant ingress / egress
One car enters, a car with a trailer exits
A join for every incoming car
• Some cars loop for hours
• Uncontrollable loop volume
7. Stream / Not
• Still not 100% sure it’s worth streaming
People still frown when this topic is brought up
• More frequent ingress means a less effective join
Is a 2-minute ingress period still streaming? :)
• Another degree of complexity
8. Cons
• More complex application
Just like cars: a ride to work vs. a trip to Portugal
• A steady pace is required
Throttling is mandatory, volume control is essential, and good GC matters
• Permanently reserved resources
9. Pros
• Fun
If this one is on your list, you should probably not do it :)
• Speed
This is “result speed”. Do you actually need it?
• Stability
You have to work really hard to get this benefit
10. Extra context
• User data is partitioned by nature
User ID (range) is the obvious partition key
It helps us control ingress size and, most importantly, loop volume (see the sketch below)
• Loop volume is hard to control
The average flow was around 150 MB, while the loop varied from 2 to 8 GB
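As a sketch, keying and partitioning by user id is what makes both sizes controllable; visits and the partition count are illustrative here:

import org.apache.spark.HashPartitioner

// keep each user's visits, including the looping session state,
// in the same partition
val byUser = visits.map(v => (v.userId, v))
  .partitionBy(new HashPartitioner(20))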
13. Copy & Paste
• Ted’s solution relies on updateStateByKey
This method requires checkpointing (sketched below)
• Checkpoints
Are good only on paper
They are meant for soft recovery
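For reference, a minimal updateStateByKey sketch in the spirit of that post, assuming a StreamingContext ssc and reusing the Record and isComplete placeholders from the earlier sketch; the event stream and checkpoint path are made up:

import org.apache.spark.streaming.dstream.DStream

// mandatory for updateStateByKey: the state must be checkpointable
ssc.checkpoint("hdfs:///checkpoints/sessions")

val events: DStream[(Long, Record)] = ???     // page visits keyed by user id

val sessions = events.updateStateByKey[List[Record]] {
  (fresh: Seq[Record], state: Option[List[Record]]) =>
    val all = state.getOrElse(Nil) ++ fresh
    if (isComplete(all, System.currentTimeMillis)) None  // completed sessions leave the state
    else Some(all)                                       // ongoing sessions keep looping
}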
14. The Thought
val sc  = new SparkContext(...)
val ssc = new StreamingContext(sc, Minutes(2))

ssc.textFileStream("folder").foreachRDD { batch =>
  // key both sides by user id, then join new data with the carried-over state
  val ingress    = batch.groupBy(userId)
  val checkpoint = sc.textFile("checkpoint").groupBy(userId)
  val sessions   = checkpoint.fullOuterJoin(ingress).cache()

  // finished sessions go out; unfinished ones loop back as the next checkpoint
  sessions.filter(complete).map(enrich).saveAsTextFile("output")
  sessions.filter(incomplete).saveAsTextFile("checkpoint")
}
15. fileStream
• Works based on file timestamps, with some memory
A bit fuzzy, and ugly for testing
• We wanted more control and monitoring
Our file names carried meta information (source, oldest record time)
We wrote a custom implementation with external state (a key-value store); see the sketch below
That let us control ingress size
Tip: persist the actual job plan
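A minimal sketch of such a custom input DStream; FileStateStore and its claimPending method are a hypothetical interface to the external key-value store, not the actual implementation:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// hypothetical external state: returns pending file paths and marks them taken
trait FileStateStore {
  def claimPending(maxBytes: Long): Seq[String]
}

class MetaFileDStream(ssc: StreamingContext, store: FileStateStore, maxBatchBytes: Long)
    extends InputDStream[String](ssc) {

  override def start(): Unit = ()
  override def stop(): Unit = ()

  // ask the external store for a bounded batch of files instead of
  // trusting directory timestamps; this is what bounds ingress size
  override def compute(validTime: Time): Option[RDD[String]] = {
    val files = store.claimPending(maxBatchBytes)
    if (files.isEmpty) None
    else Some(context.sparkContext.textFile(files.mkString(",")))
  }
}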
21. cache & repetition
• Remember?
.cache .filter(complete).doStuff .filter(incomplete).doStuff
• You never want to repeat actions when streaming
We had to scan the entire dataset twice
Also… a two-phase commit
22. Multi Output Format
• Custom implementation
We wanted a different format for each output
Not that hard, but lots of copy-paste
Communication happens via the Hadoop configuration
• MultipleOutputFormat
So why didn’t we use it? (a sketch of the custom format follows)
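A self-contained sketch of what such a custom format can look like; the configuration key values and the completeness flag carried in the record key are assumptions, not the Adform code:

import org.apache.hadoop.fs.{FSDataOutputStream, Path}
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce._
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

object SessionMultiOutputFormat {
  // assumed names for the configuration keys used on the next slide
  val COMPLETE_SESSIONS_PATH = "session.output.complete.path"
  val ONGOING_SESSION_PATH   = "session.output.ongoing.path"
}

class SessionMultiOutputFormat extends OutputFormat[Text, Text] {
  import SessionMultiOutputFormat._

  // output paths come from our own configuration keys, so nothing to verify here
  override def checkOutputSpecs(ctx: JobContext): Unit = ()

  override def getOutputCommitter(ctx: TaskAttemptContext): OutputCommitter =
    new NullOutputFormat[Text, Text]().getOutputCommitter(ctx)

  override def getRecordWriter(ctx: TaskAttemptContext): RecordWriter[Text, Text] = {
    val conf = ctx.getConfiguration
    val task = ctx.getTaskAttemptID.getTaskID.getId

    def open(key: String): FSDataOutputStream = {
      val path = new Path(conf.get(key), f"part-$task%05d")
      path.getFileSystem(conf).create(path, true)
    }

    val complete = open(COMPLETE_SESSIONS_PATH)
    val ongoing  = open(ONGOING_SESSION_PATH)

    new RecordWriter[Text, Text] {
      // route each record by the completeness flag carried in its key
      override def write(key: Text, value: Text): Unit = {
        val out = if (key.toString.startsWith("complete")) complete else ongoing
        out.writeBytes(value.toString + "\n")
      }
      override def close(c: TaskAttemptContext): Unit = {
        complete.close(); ongoing.close()
      }
    }
  }
}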
23. Gotcha
// wire the custom output format in via the job configuration; the two
// destination paths travel the same way, so executors can read them back
val conf = new JobConf(rdd.context.hadoopConfiguration)
conf.set("mapreduce.job.outputformat.class",
  classOf[SessionMultiOutputFormat].getName)
conf.set(COMPLETE_SESSIONS_PATH, job.outputPath)
conf.set(ONGOING_SESSION_PATH, job.checkpointPath)
sessions.saveAsNewAPIHadoopDataset(conf)
24. Non-natural partitioning
• Our ingress comes pre-partitioned
File names like server_oldest-record-timestamp.txt.gz
Where each server handles a range of user ids
• So, just foreachRDD
… or is it? :D
28. The Algorithm
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val stream = new OurCustomDStream(...)
stream.foreachRDD(processUnion)
...
// process the per-server RDDs inside the UnionRDD concurrently, 10 at a time;
// reuse the same parallel collection so the custom task support actually applies
val par = unionRdd.rdds.par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
par.foreach(processOne)
29. The Algorithm
// key the incoming delta and pre-partition it the same way as the checkpoint
val delta = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
// read the ongoing sessions (the loop) back with the same partitioner
val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
// join the new records onto the ongoing sessions without a full shuffle
val sessions = IndexedRDD(withHash).fullOuterJoin(delta)(joinFunc)
val split = sessions.flatMap(splitSessionFunc)
val conf = new JobConf(...)
split.saveAsNewAPIHadoopDataset(conf)
31. Configuration
• Current configuration (expressed as a SparkConf sketch below)
Driver: 6 GB RAM
15 executors: 4 GB RAM and 2 cores each
• Total size is not that big
60 GB RAM and 30 cores
Previously it was 52 SQL instances… doing other things too
• Hasn’t changed for half a year already
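For illustration, that sizing could be written roughly as follows; note that spark.driver.memory only takes effect if set before the driver JVM starts (e.g. at submit time):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "6g")        // driver: 6 GB
  .set("spark.executor.instances", "15")   // 15 executors...
  .set("spark.executor.memory", "4g")      // ...4 GB each
  .set("spark.executor.cores", "2")        // ...2 cores each = 60 GB / 30 cores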
34. Other tips
• -XX:+UseG1GC
For both the driver and the executors (see the sketch below)
• Plan & store jobs, repeat if failed
When repeating, the environment will have changed
• Use named RDDs
It helps when reading your DAGs
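A sketch of the first and last tips; the configuration keys are standard Spark ones, while the RDD and its name are made up for illustration:

import org.apache.spark.SparkConf

// G1 garbage collector on both sides of the application
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")

// a named RDD shows up under this name in the Spark UI DAG visualization
someRdd.setName("checkpoint-after-join")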