
Sessionization with Spark streaming



Dry notes on our Sessionization implementation

Published in: Data & Analytics


  1. Sessionization with Spark streaming
  2. Ramūnas Urbonas @ Platform Lunar
  3. Disclosure • This work was implemented at Adform • Thanks to the Hadoop team for their permission and help
  4. History • Original idea from Ted Malaska, 2014: How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop • Hands-on in 2016 at Adform
  5. The Problem • Constant flow of page visits 110 GB per day on average, volume variations, catch-up scenarios • Wait for session interrupts Timeout, specific action, midnight, sanity checks • Calculate session duration, length, reaction times
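The interrupt rules above (timeout, specific action, midnight) can be sketched as a pure function that cuts a user's ordered events into sessions. A minimal sketch, not the Adform code: `Event`, the 30-minute timeout, and the UTC midnight rule are all assumptions for illustration.

```scala
// Hypothetical session-interrupt rules: a new session starts when the gap
// exceeds the timeout or the previous event fell on a different (UTC) day.
case class Event(userId: String, timestampMs: Long, page: String)

val sessionTimeoutMs = 30 * 60 * 1000L // assumed 30-minute timeout

def sameDay(a: Long, b: Long): Boolean =
  a / 86400000L == b / 86400000L // midnight boundary in whole UTC days

def splitSessions(events: Seq[Event]): Seq[Seq[Event]] =
  events.sortBy(_.timestampMs).foldLeft(List.empty[List[Event]]) {
    case (Nil, e) => List(List(e))
    case (current :: done, e) =>
      val last = current.head // sessions are built in reverse order
      val interrupted =
        e.timestampMs - last.timestampMs > sessionTimeoutMs ||
          !sameDay(last.timestampMs, e.timestampMs)
      if (interrupted) List(e) :: current :: done
      else (e :: current) :: done
  }.map(_.reverse).reverse

def duration(session: Seq[Event]): Long =
  session.last.timestampMs - session.head.timestampMs
```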
  6. The Problem • Constant ingress / egress One car enters, a car trailer exits Join for every incoming car • Some cars loop for hours • Uncontrollable loop volume
  7. Stream / Not • Still not 100% sure it’s worth streaming People still frown when this topic is brought up • More frequent ingress means a less effective join Is a 2-minute ingress period still streaming? :) • Another degree of complexity
  8. Cons • More complex application Just like cars - a ride to work vs a trip to Portugal • Steady pace is required Throttling is mandatory, volume control is essential, good GC • Permanently reserved resources
  9. Pros • Fun If this one is on your list, you should probably not do it :) • Speed This is “result speed”. Do you actually need it? • Stability You have to work really hard to get this benefit
  10. Extra context • User data is partitioned by nature User ID (range) is an obvious partition key Helps us control ingress size and, most importantly, loop volume • Loop volume is hard to control Average flow was around 150 MB, the loop varied from 2–8 GB
  11. Algorithm ingress state updateStateByKey join
  12. Algorithm complete incomplete decision calculate results store for later
  13. Copy & Paste • Ted’s solution relies on updateStateByKey This method requires checkpointing • Checkpoints Are good only on paper They are meant for soft recovery
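For context, updateStateByKey applies a per-key function from the batch's new values and the previous state to the next state (returning None drops the key). A Spark-free model of that contract, with all names assumed for illustration:

```scala
// Spark-free model of updateStateByKey's per-key contract:
// (values seen this batch, previous state) => next state; None evicts the key.
def updateState[K, V, S](
    batch: Map[K, Seq[V]],
    state: Map[K, S],
    update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
  val keys = batch.keySet ++ state.keySet // keys with new data or old state
  keys.flatMap { k =>
    update(batch.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
  }.toMap
}
```

For example, accumulating per-user record counts: keys absent from the batch keep their old state unchanged.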
  14. The Thought
     val sc = new SparkContext(…)
     val ssc = new StreamingContext(sc, Minutes(2))
     val ingress = ssc.textFileStream("folder").groupBy(userId)
     val checkpoint = sc.textFile("checkpoint").groupBy(userId)
     val sessions = checkpoint.fullOuterJoin(ingress)(userId).cache
     sessions.filter(complete).map(enrich).saveAsTextFile("output")
     sessions.filter(inComplete).saveAsTextFile("checkpoint")
  15. fileStream • Works based on file timestamp, with some memory A bit fuzzy, ugly for testing • We wanted more control and monitoring Our file names carried meta information (source, oldest record time) Custom implementation with external state (key-value store) We could control ingress size Tip: persist the actual job plan
  16. Checkpoint
     user-1 1477983123 page-26
     user-1 1477983256 page-2
     user-2 1477982342 home
     user-2 1477982947 page-9
     user-2 1477984343 home
  17. Checkpoint • Custom implementation We wanted to maintain checkpoint grouping • Nothing fancy class SessionInputFormat extends FileInputFormat[SessionKey, List[Record]]
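An in-memory sketch of what a SessionInputFormat like the one above could yield from the checkpoint lines on the previous slide: parse "user timestamp page" records and group them per user, sorted by time. The `Record` shape and string key are assumptions, not the real SessionKey.

```scala
// Hypothetical in-memory equivalent of reading the checkpoint:
// "user timestamp page" lines grouped into per-user record lists.
case class Record(timestamp: Long, page: String)

def readCheckpoint(lines: Seq[String]): Map[String, List[Record]] =
  lines
    .map(_.trim.split("\\s+"))
    .collect { case Array(user, ts, page) => (user, Record(ts.toLong, page)) }
    .groupBy(_._1)
    .map { case (user, pairs) =>
      user -> pairs.map(_._2).sortBy(_.timestamp).toList
    }
```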
  18. fullOuterJoin • Probably the most expensive operation The average ratio is 1:35, with extremes of 1:100 We found the IndexedRDD contribution
  19. IndexedRDD • IndexedRDD https://github.com/amplab/spark-indexedrdd • Partition control is essential Avoid extra stages in your job, extra shuffles Use an explicit partitioner, even if it is HashPartitioner Get used to specifying a partitioner for every groupBy / combineByKey Exact and controllable partition count
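Explicit partitioners matter because two RDDs sharing the same partitioner join without an extra shuffle stage. The assignment rule behind Spark's HashPartitioner is simple; re-implemented here outside Spark for illustration:

```scala
// Re-implementation of HashPartitioner's assignment rule:
// partition = key.hashCode mod numPartitions, forced non-negative
// (Java's % can return negative values for negative hash codes).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def partitionFor(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)
```

The key property: the same key always lands in the same partition, so a checkpoint written with 20 partitions lines up with an ingress grouped by the same partitioner.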
  20. IndexedRDD
  21. cache & repetition • Remember? .cache .filter(complete).doStuff .filter(incomplete).doStuff • You never want to repeat actions when streaming We had to scan the entire dataset twice Also… two-phase commit
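The two filters above each traverse the whole cached dataset. The one-pass idea the deck moves towards (via a multi-output format) can be sketched in plain Scala: tag each session with its destination instead of filtering twice. `Session` and the path names are illustrative, not the real code.

```scala
// Sketch of the one-pass alternative to two filters: route each session
// to its destination in a single traversal. Names are hypothetical.
case class Session(userId: String, complete: Boolean)

val OUTPUT = "output"         // completed sessions
val CHECKPOINT = "checkpoint" // ongoing sessions, kept for the next batch

def route(sessions: Seq[Session]): Map[String, Seq[Session]] =
  sessions.groupBy(s => if (s.complete) OUTPUT else CHECKPOINT)
```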
  22. Multi Output Format • Custom implementation We wanted a different format for each output Not that hard, but lots of copy-paste Communication via the Hadoop configuration • MultipleOutputFormat Why did we not use it?
  23. 23. Gotcha val conf = new JobConf(rdd.context.hadoopConfiguration) 
 conf.set("mapreduce.job.outputformat.class", classOf[SessionMultiOutputFormat].getName) 
 conf.set(COMPLETE_SESSIONS_PATH, job.outputPath) conf.set(ONGOING_SESSION_PATH, job.checkpointPath)
 sessions.saveAsNewAPIHadoopDataset(conf)
  24. Non-natural partitioning • Our ingress comes pre-partitioned File names like server_oldest-record-timestamp.txt.gz Where each server works on a range of user ids • Just foreachRDD … or is it? :D
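A file-name convention like the one above lets the driver read source and age without opening the file. A hypothetical parser, assuming the timestamp part is epoch seconds (the real format may differ):

```scala
// Hypothetical parser for ingress names of the assumed form
// "<server>_<oldest-record-timestamp>.txt.gz", e.g. "srv03_1477983123.txt.gz".
case class FileMeta(server: String, oldestRecordTs: Long)

val FileName = """([^_]+)_(\d+)\.txt\.gz""".r

def parseFileName(name: String): Option[FileMeta] = name match {
  case FileName(server, ts) => Some(FileMeta(server, ts.toLong))
  case _ => None // not an ingress file; skip it
}
```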
  25. Resource utilisation [chart]
  26. Resource utilisation [chart]
  27. Parallelise • Just rdds.par.foreach(processOne) … or is it? :D • Limit the thread pool val par = rdds.par par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
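Putting the slide's two lines together: a small helper that runs per-partition work through a parallel collection backed by a capped ForkJoinPool, so a large batch cannot grab every thread at once. The helper name and 4-thread cap are illustrative; the slide uses 10.

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Run `work` over items in parallel, but on at most `parallelism` threads:
// replace the default task support with a bounded ForkJoinPool.
def processAll[A](items: Seq[A], parallelism: Int)(work: A => Unit): Unit = {
  val par = items.par
  par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(parallelism))
  par.foreach(work)
}
```

Note the gotcha this avoids: calling `.par` again later creates a fresh parallel collection with the default pool, silently dropping the throttle.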
  28. The Algorithm
     val stream = new OurCustomDStream(..)
     stream.foreachRDD(processUnion)
     …
     val par = unionRdd.rdds.par
     par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
     par.foreach(processOne)
  29. The Algorithm
     val ingress = one.map(addSessionKey).combineByKey[List[Record]](..., new HashPartitioner(20))
     val checkpoint = sc.newAPIHadoopFile[SessionKey, List[Record], SessionInputFormat](...)
     val withHash = HashPartitionerRDD(sc, checkpoint, Some(new HashPartitioner(20)))
     val sessions = IndexedRDD(withHash).fullOuterJoin(ingress)(joinFunc)
     val split = sessions.flatMap(splitSessionFunc)
     val conf = new JobConf(...)
     split.saveAsNewAPIHadoopDataset(conf)
  30. Result
  31. Configuration • Current configuration Driver: 6 GB RAM 15 executors: 4 GB RAM and 2 cores each • Total size is not that big 60 GB RAM and 30 cores Previously it was 52 SQL instances… doing other things too • Hasn’t changed for half a year already
  32. Metrics
  33. My Pride
  34. Other tips • -XX:+UseG1GC For both driver and executors • Plan & store jobs, repeat if failed When repeating, the environment changes • Use named RDDs Helps to read your DAGs
  35. Thanks
