Spark Summit (2014-06-30)
http://spark-summit.org/2014/talk/building-a-data-processing-system-for-real-time-auctions
What do you do when you need to update your models sooner than your existing batch workflows allow? At Sharethrough we faced the same question. Although we use Hadoop extensively for batch processing, we needed a system to process clickstream data as soon as possible for our real-time ad auction platform. We found Spark Streaming to be a perfect fit for us because of its easy integration into the Hadoop ecosystem, powerful functional programming API, and low-friction interoperability with our existing batch workflows.
In this talk, we’ll present some of the use cases that led us to choose a stream processing system and why we chose Spark Streaming in particular. We’ll also discuss how we organized our jobs to promote reusability between batch and streaming workflows as well as to improve testability.
8. Spend Tracking
• Spend on visible impressions and clicks
• Actual spend happens asynchronously
• Want to correct the prediction for optimal serving (see the sketch after this slide)
[Chart: predicted spend vs. actual spend]
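A minimal Spark Streaming sketch of the spend-tracking idea above, assuming a hypothetical "campaignId,predicted|actual,amount" line format; the socket source, host/port, and checkpoint path are placeholders, not Sharethrough's actual pipeline.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SpendTracking {
      def main(args: Array[String]): Unit = {
        // local[2] for local testing; a receiver needs at least two threads
        val conf = new SparkConf().setAppName("spend-tracking").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/spend-checkpoint") // updateStateByKey requires a checkpoint dir

        // Hypothetical input: one "campaignId,predicted|actual,amount" line per event
        val events = ssc.socketTextStream("localhost", 9999)

        // Turn each event into (campaignId, (predictedDelta, actualDelta))
        val deltas = events.map { line =>
          val Array(id, kind, amount) = line.split(",")
          val a = amount.toDouble
          if (kind == "predicted") (id, (a, 0.0)) else (id, (0.0, a))
        }

        // Running totals per campaign: actual spend arrives asynchronously,
        // so the predicted total can be corrected whenever it shows up
        val totals = deltas.updateStateByKey[(Double, Double)] {
          (batch: Seq[(Double, Double)], state: Option[(Double, Double)]) =>
            val (p0, a0) = state.getOrElse((0.0, 0.0))
            val (dp, da) = batch.foldLeft((0.0, 0.0)) {
              case ((p, a), (x, y)) => (p + x, a + y)
            }
            Some((p0 + dp, a0 + da))
        }

        totals.print() // (campaignId, (predictedSoFar, actualSoFar))
        ssc.start()
        ssc.awaitTermination()
      }
    }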
9. Operational Monitoring
• Detect issues with content served on third-party sites
• Use the same logs as reporting (see the sketch after this slide)
[Charts: click rates for placements P1–P8]
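A sketch of the monitoring idea: compute per-placement click rates over a sliding window from the same log stream that feeds reporting. The log format ("placementId,click|impression") and the window sizes are illustrative assumptions; a sudden drop in a placement's rate could then flag a broken third-party page.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object PlacementMonitor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("placement-monitor").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Hypothetical log line: "placementId,click" or "placementId,impression"
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.map { line =>
          val Array(placement, kind) = line.split(",")
          (placement, if (kind == "click") (1L, 1L) else (0L, 1L))
        }

        // (clicks, impressions) per placement over the last 5 minutes, sliding every minute
        val windowed = counts.reduceByKeyAndWindow(
          (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2),
          Seconds(300), Seconds(60))

        val rates = windowed.mapValues { case (clicks, imps) =>
          if (imps == 0) 0.0 else clicks.toDouble / imps
        }
        rates.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }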
10. We can directly measure the business impact of using this data sooner
12. Why Spark?
• Scala API
• Supports batch and streaming (a reuse sketch follows this slide)
• Active community support
• Easily integrates into the existing Hadoop ecosystem
• But it doesn’t require Hadoop in order to run
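One pattern behind "supports batch and streaming", and the reusability mentioned in the abstract, is to write the core logic once against RDDs and lift it into streaming with DStream.transform. The function and names below are illustrative, not taken from the talk.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    object SharedJobs {
      // Core logic written once against RDDs, usable directly
      // from a batch job over HDFS files
      def countClicksByPlacement(lines: RDD[String]): RDD[(String, Long)] =
        lines.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)

      // The same function lifted into a streaming job
      def streaming(clicks: DStream[String]): DStream[(String, Long)] =
        clicks.transform(countClicksByPlacement _)
    }

A pure RDD-to-RDD function like this is also straightforward to unit test against a small in-memory RDD, which speaks to the testability goal in the abstract.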
30. Other Learnings
• Keeping your driver program healthy is crucial
  • 24/7 operation and monitoring
  • Spark on Mesos? Use Marathon.
• Pay attention to the setting for spark.cores.max
  • Monitor the data rate and increase it as needed
• Serialization of classes (see the config sketch below)
  • Java
  • Kryo
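A hedged sketch of the two configuration points above: capping spark.cores.max so an always-on streaming job doesn't monopolize a shared cluster, and switching from the default Java serialization to Kryo. The values and the MyEvent class are placeholders.

    import org.apache.spark.SparkConf

    // Hypothetical event class to register with Kryo
    case class MyEvent(placementId: String, kind: String, ts: Long)

    object StreamingConf {
      val conf = new SparkConf()
        .setAppName("streaming-job")
        // Cap the cores this long-running job may claim;
        // raise it if the input data rate outgrows the allocation
        .set("spark.cores.max", "8")
        // Kryo is faster and more compact than default Java serialization
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[MyEvent]))
    }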