Spark Summit 2014: Spark Streaming for Realtime Auctions

Spark Summit (2014-06-30)
http://spark-summit.org/2014/talk/building-a-data-processing-system-for-real-time-auctions

What do you do when you need to update your models sooner than your existing batch workflows allow? At Sharethrough we faced the same question. Although we use Hadoop extensively for batch processing, we needed a system that could process click stream data as soon as possible for our real-time ad auction platform. We found Spark Streaming to be a perfect fit because of its easy integration into the Hadoop ecosystem, powerful functional programming API, and low-friction interoperability with our existing batch workflows.
In this talk, we'll present some of the use cases that led us to choose a stream processing system and why we use Spark Streaming in particular. We'll also discuss how we organized our jobs to promote reusability between batch and streaming workflows as well as improve testability.

Presentation Transcript

  • Spark Streaming for Realtime Auctions @russellcardullo Sharethrough
  • Agenda • Sharethrough? • Streaming use cases • How we use Spark • Next steps
  • Sharethrough vs
  • The Sharethrough Native Exchange
  • How can we use streaming data?
  • Use Cases: Creative Optimization, Spend Tracking, Operational Monitoring — µ* = max_k { µ_k }
  • Creative Optimization • Choose best-performing variant • Short feedback cycle required — impression: { device: “iOS”, geocode: “C967”, pkey: “p193”, … } (diagram: three ad variants, one receiving the click)
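As a rough illustration of the µ* = max_k { µ_k } selection above, here is a minimal Scala sketch that picks the variant with the highest observed click-through rate. The VariantStats case class and the counts in the usage comment are hypothetical, not taken from the talk.

    // Hypothetical sketch: pick the variant with the highest observed CTR (µ* = max_k { µ_k }).
    case class VariantStats(variant: String, impressions: Long, clicks: Long) {
      def ctr: Double = if (impressions == 0) 0.0 else clicks.toDouble / impressions
    }

    def bestVariant(stats: Seq[VariantStats]): Option[String] =
      if (stats.isEmpty) None
      else Some(stats.maxBy(_.ctr).variant)

    // Example usage (made-up counts):
    //   bestVariant(Seq(
    //     VariantStats("v1", 1000, 12),
    //     VariantStats("v2", 1000, 30),
    //     VariantStats("v3", 1000, 7)
    //   ))  // => Some("v2")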
  • Spend Tracking • Spend on visible impressions and clicks • Actual spend happens asynchronously • Want to correct the prediction for optimal serving (chart: predicted spend vs. actual spend)
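One way to read “correct the prediction” above is to rescale future predicted spend by the ratio of actual to predicted spend observed so far. The sketch below is a hypothetical illustration of that idea, not the correction logic used at Sharethrough.

    // Hypothetical sketch: adjust predicted spend using the actuals observed so far.
    def correctionFactor(predictedSoFar: Double, actualSoFar: Double): Double =
      if (predictedSoFar <= 0.0) 1.0 else actualSoFar / predictedSoFar

    def correctedSpend(predictedNext: Double, predictedSoFar: Double, actualSoFar: Double): Double =
      predictedNext * correctionFactor(predictedSoFar, actualSoFar)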
  • Operational Monitoring • Detect issues with content served on third-party sites • Use same logs as reporting (chart: click rates by placement, P1–P8)
  • We can directly measure the business impact of using this data sooner
  • Why use Spark to build these features?
  • Why Spark? • Scala API • Supports batch and streaming • Active community support • Easily integrates into existing Hadoop ecosystem • But it doesn’t require Hadoop in order to run
  • How we’ve integrated Spark
  • Existing Data Pipeline (diagram): web servers → log files → Flume → HDFS → analytics web service → database
  • Pipeline with Streaming (diagram): web servers → log files → Flume → HDFS → analytics web service → database, with a streaming path added alongside the batch path
  • Batch vs. Streaming — Streaming: “real-time” reporting; low latency to use data; only as reliable as its source; low latency > correctness. Batch: daily reporting; billing / earnings; anything with a strict SLA; correctness > low latency.
  • Spark Job Abstractions
  • Job Organization (diagram): Source → Transform → Sink, composed into a Job
  • Sources

    // case class for pattern matching
    case class BeaconLogLine(
      timestamp: String,
      uri: String,
      beaconType: String,
      pkey: String,
      ckey: String
    )

    // encapsulate common operations
    object BeaconLogLine {

      // generate DStream
      def newDStream(ssc: StreamingContext, inputPath: String): DStream[BeaconLogLine] = {
        ssc.textFileStream(inputPath).map { parseRawBeacon(_) }
      }

      def parseRawBeacon(b: String): BeaconLogLine = {
        ...
      }
    }
  • Transformations

    // type safety from case class
    def visibleByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
      source.
        filter(data => {
          data.uri == "/strbeacon" && data.beaconType == "visible"
        }).
        map(data => (data.pkey, 1L)).
        reduceByKey(_ + _)
    }
  • Sinks

    // custom sinks for new stores
    class RedisSink @Inject()(store: RedisStore) {

      def sink(result: DStream[(String, Long)]) = {
        result.foreachRDD { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            store.merge(key, value)
          }
        }
      }
    }
  • Jobs

    object ImpressionsForPlacements {

      def run(config: Config, inputPath: String) {
        val conf = new SparkConf().
          setMaster(config.getString("master")).
          setAppName("Impressions for Placement")

        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(5))

        val source = BeaconLogLine.newDStream(ssc, inputPath)  // source
        val visible = visibleByPlacement(source)                // transform
        sink(visible)                                           // sink

        ssc.start()
        ssc.awaitTermination()
      }
    }
  • Advantages?
  • Code Reuse

    // composable jobs
    object PlacementVisibles {
      …
      val source = BeaconLogLine.newDStream(ssc, inputPath)
      val visible = visibleByPlacement(source)
      sink(visible)
      …
    }

    …

    object PlacementEngagements {
      …
      val source = BeaconLogLine.newDStream(ssc, inputPath)
      val engagements = engagementsByPlacement(source)
      sink(engagements)
      …
    }
  • Readability

    ssc.textFileStream(inputPath).
      map { parseRawBeacon(_) }.
      filter(data => {
        data._2 == "/strbeacon" && data._3 == "visible"
      }).
      map(data => (data._4, 1L)).
      reduceByKey(_ + _).
      foreachRDD { rdd =>
        rdd.foreach { element =>
          store.merge(element._1, element._2)
        }
      }

    ?
  • Readability

    val source = BeaconLogLine.newDStream(ssc, inputPath)
    val visible = visibleByPlacement(source)
    redis.sink(visible)
  • Testing

    // test helper: takes a function, input, and expectation
    def assertTransformation[T: Manifest, U: Manifest](
      transformation: DStream[T] => DStream[U],
      input: Seq[T],
      expectedOutput: Seq[U]
    ): Unit = {
      val ssc = new StreamingContext("local[1]", "Testing", Seconds(1))
      val rddQueue = new SynchronizedQueue[RDD[T]]()
      val source = ssc.queueStream(rddQueue)
      val results = transformation(source)

      var output = Array[U]()
      results.foreachRDD { rdd => output = output ++ rdd.collect() }
      ssc.start()
      rddQueue += ssc.sparkContext.makeRDD(input, 2)
      Thread.sleep(jobCompletionWaitTimeMillis)
      ssc.stop(true)

      assert(output.toSet === expectedOutput.toSet)
    }
  • Testing

    // use our test helper
    test("#visibleByPlacement") {

      val input = Seq(
        "pkey=abcd, …",
        "pkey=abcd, …",
        "pkey=wxyz, …"
      )

      val expectedOutput = Seq( ("abcd", 2), ("wxyz", 1) )

      assertTransformation(visibleByPlacement, input, expectedOutput)
    }
  • Other Learnings
  • Other Learnings • Keeping your driver program healthy is crucial • 24/7 operation and monitoring • Spark on Mesos? Use Marathon. • Pay attention to settings for spark.cores.max • Monitor data rate and increase as needed • Serialization on classes • Java • Kryo
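To make the last two bullets concrete, here is a minimal sketch of how those settings might be applied when building the SparkConf; the values are placeholders, not the talk's actual configuration.

    // Minimal sketch: cap cluster cores and switch to Kryo serialization (placeholder values).
    val conf = new SparkConf().
      setMaster(config.getString("master")).
      setAppName("Impressions for Placement").
      // Upper bound on cores taken from the cluster; raise it as the input data rate grows.
      set("spark.cores.max", "8").
      // Use Kryo instead of default Java serialization for classes shipped between nodes.
      set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")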
  • What’s next?
  • Twitter Summingbird • Write-once, run anywhere • Supports: • Hadoop MapReduce • Storm • Spark (maybe?)
  • Amazon Kinesis (diagram): web servers, mobile devices, app logs, and other applications feeding events into Kinesis
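For Kinesis ingestion, one option is the receiver from the spark-streaming-kinesis-asl module; the sketch below assumes that integration's KinesisUtils.createStream API rather than anything shown in the talk, and the stream name and endpoint are placeholders.

    // Hypothetical sketch: read a Kinesis stream into a DStream of raw byte arrays.
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val kinesisStream = KinesisUtils.createStream(
      ssc,
      "beacon-events",                            // Kinesis stream name (placeholder)
      "https://kinesis.us-east-1.amazonaws.com",  // regional endpoint (placeholder)
      Seconds(5),                                 // checkpoint interval
      InitialPositionInStream.LATEST,
      StorageLevel.MEMORY_AND_DISK_2
    )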
  • Thanks!