Productionalizing Spark Streaming
Spark Summit 2013
Ryan Weald (@rweald)

Spark Summit 2013 Talk:

At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features, we uncovered a consistent set of challenges across multiple streaming jobs. By addressing these challenges you can speed up development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them.

First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.

Transcript

  1. Productionalizing Spark Streaming
     Spark Summit 2013
     Ryan Weald (@rweald)
  2. What We’re Going to Cover
     • What we do and why we chose Spark
     • Fault tolerance for long-lived streaming jobs
     • Common patterns and functional abstractions
     • Testing before we “do it live”
  3. Special focus on common patterns and their solutions
  4. What is Sharethrough? Advertising for the Modern Internet (Form)
  5. What is Sharethrough? (Function)
  6. Why Spark Streaming?
  7. Why Spark Streaming
     • Liked theoretical foundation of mini-batch
     • Scala codebase + functional API
     • Young project with opportunities to contribute
     • Batch model for iterative ML algorithms
  8. Great... Now productionalize it
  9. Fault Tolerance
  10. Keys to Fault Tolerance
      1. Receiver fault tolerance
      2. Monitoring job progress
  11. Receiver Fault Tolerance
      • Use Actors with supervisors
      • Use self-healing connection pools
  12. Use Actors

      class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
        extends Actor with Receiver with Logging {

        implicit val system = ActorSystem()

        override def preStart() = {
          // Your code to set up connections and actors
          // Include inner class to process messages
        }

        def receive: Receive = {
          case _ => logInfo("unknown message")
        }
      }
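A minimal sketch of the “Actors with supervisors” idea from slide 11, assuming Akka’s OneForOneStrategy; the retry limits and the receiver wiring here are illustrative, not from the talk:

    import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
    import akka.actor.SupervisorStrategy.Restart
    import scala.concurrent.duration._

    // Hypothetical supervisor: when the receiver crashes (e.g. a dropped
    // RabbitMQ connection), restart it so preStart() can rebuild the connection.
    class ReceiverSupervisor extends Actor {
      override val supervisorStrategy: SupervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
          case _: Exception => Restart
        }

      // Child receiver from slide 12; constructor arguments are placeholders.
      private val receiver = context.actorOf(
        Props(new RabbitMQStreamReceiver("amqp://localhost", "events", "#")),
        "rabbitmq-receiver")

      def receive: Receive = { case msg => receiver.forward(msg) }
    }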
  13. Track All Outputs
      • Low watermarks (Google MillWheel)
      • Database updated_at (see the freshness-check sketch below)
      • Expected output file size alerting
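One way to act on the updated_at bullet above; the table name, lag threshold, and println-as-alert are assumptions for illustration:

    import java.sql.DriverManager
    import java.time.{Duration, Instant}

    // Hypothetical freshness check: if the newest row is older than maxLag,
    // the streaming job has probably stalled and someone should be alerted.
    def checkOutputFreshness(jdbcUrl: String, maxLag: Duration): Unit = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val rs = conn.createStatement()
          .executeQuery("SELECT MAX(updated_at) FROM aggregates") // table name assumed
        if (rs.next() && rs.getTimestamp(1) != null) {
          val lastWrite = rs.getTimestamp(1).toInstant
          if (Duration.between(lastWrite, Instant.now).compareTo(maxLag) > 0)
            println(s"ALERT: no output since $lastWrite") // wire to real monitoring
        }
      } finally conn.close()
    }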
  14. Common Patterns & Functional Programming
  15. Common Job Pattern: Map -> Aggregate -> Store
  16. Mapping Data

      inputData.map { rawRequest =>
        val params = QueryParams.parse(rawRequest)
        (params.getOrElse("beaconType", "unknown"), 1L)
      }

  17. Aggregation
  18. Basic Aggregation

      // beacons is a DStream[(String, Long)]
      // example: Seq(("click", 1L), ("click", 1L))
      val sum: (Long, Long) => Long = _ + _
      beacons.reduceByKey(sum)
  19. What happens when we want to sum multiple things?
  20. Long Basic Aggregation

      // inputData stands in for an RDD/DStream of (key, value) pairs;
      // a plain Seq has no reduceByKey.
      val inputData = Seq(
        ("user_1", (1L, 1L, 1L)),
        ("user_1", (2L, 2L, 2L))
      )

      def sum(l: (Long, Long, Long), r: (Long, Long, Long)) =
        (l._1 + r._1, l._2 + r._2, l._3 + r._3)

      inputData.reduceByKey(sum)
  21. Now sum 4 Ints instead (ノಥ益ಥ)ノ ┻━┻
  22. Monoids to the Rescue
  23. WTF is a Monoid?

      trait Monoid[T] {
        def zero: T
        def plus(r: T, l: T): T
      }

      * Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
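Associativity is the property that lets reduceByKey pre-combine partial sums within each partition and then merge the partials in any grouping. A tiny check against the trait above:

    // A Monoid instance for Long sums, using the trait from slide 23.
    object SumMonoid extends Monoid[Long] {
      def zero = 0L
      def plus(r: Long, l: Long) = r + l
    }

    val leftGrouped  = SumMonoid.plus(SumMonoid.plus(1L, 5L), 2L) // (1 + 5) + 2
    val rightGrouped = SumMonoid.plus(1L, SumMonoid.plus(5L, 2L)) // 1 + (5 + 2)
    assert(leftGrouped == rightGrouped) // both are 8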
  24. Monoid Based Aggregation

      object LongMonoid extends Monoid[(Long, Long, Long)] {
        def zero = (0L, 0L, 0L)
        def plus(r: (Long, Long, Long), l: (Long, Long, Long)) =
          (l._1 + r._1, l._2 + r._2, l._3 + r._3)
      }

      inputData.reduceByKey(LongMonoid.plus(_, _))
  25. Twitter Algebird: http://github.com/twitter/algebird
  26. Algebird Based Aggregation

      import com.twitter.algebird._
      val aggregator = implicitly[Monoid[(Long, Long, Long)]]
      inputData.reduceByKey(aggregator.plus(_, _))
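The implicitly call resolves because Algebird ships Monoid instances for tuples of types that themselves have monoids, which is what makes the hand-written LongMonoid on slide 24 unnecessary. A quick sanity check on slide 20’s data, assuming Algebird is on the classpath:

    import com.twitter.algebird._

    // Algebird derives Monoid[(Long, Long, Long)] from Monoid[Long].
    val m = implicitly[Monoid[(Long, Long, Long)]]
    assert(m.zero == (0L, 0L, 0L))
    assert(m.plus((1L, 1L, 1L), (2L, 2L, 2L)) == (3L, 3L, 3L))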
  27. How many unique users per publisher?
  28. Too big for a naive in-memory Map
  29. HyperLogLog FTW
  30. HLL Aggregation

      import com.twitter.algebird._
      val aggregator = new HyperLogLogMonoid(12)
      inputData.reduceByKey(aggregator.plus(_, _))
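For this to work, inputData must already hold HLL sketches rather than raw user ids. A minimal sketch of that missing map step, assuming Algebird’s HyperLogLogMonoid.create and HLL.estimatedSize (12 bits gives 2^12 registers, roughly 1.6% standard error):

    import com.twitter.algebird._

    val hll = new HyperLogLogMonoid(12)

    // Hypothetical (publisher, userId) pairs from the map stage.
    val views = Seq(("pub_1", "user_1"), ("pub_1", "user_2"), ("pub_1", "user_1"))

    // Turn each user id into a sketch, then merge sketches per publisher;
    // in the real job the map runs on the DStream before reduceByKey.
    val perPublisher = views
      .map { case (pub, user) => (pub, hll.create(user.getBytes("UTF-8"))) }
      .groupBy(_._1)
      .map { case (pub, xs) => (pub, xs.map(_._2).reduce(hll.plus(_, _))) }

    perPublisher.foreach { case (pub, sketch) =>
      println(s"$pub: ~${sketch.estimatedSize.toLong} unique users")
    }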
  31. Monoids == Reusable Aggregation
  32. Common Job Pattern: Map -> Aggregate -> Store
  33. Store
  34. How do we store the results?
  35. Storage API Requirements
      • Incremental updates (preferably associative)
      • Pluggable to support “big data” stores
      • Allow for testing jobs
  36. Storage API

      trait MergeableStore[K, V] {
        def get(key: K): V
        def put(kv: (K, V)): V

        /*
         * Should follow the same associative property
         * as our Monoid from earlier.
         */
        def merge(kv: (K, V)): V
      }
  37. Twitter Storehaus: http://github.com/twitter/storehaus
  38. Storing Spark Results

      def saveResults(result: DStream[(String, Long)],
                      store: RedisStore[String, Long]) = {
        result.foreach { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            store.merge((key, value))
          }
        }
      }
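A common refinement of this loop (my addition, not from the talk): do the merges per partition, so per-connection setup happens once per partition instead of once per element; in current Spark Streaming, DStream.foreach is spelled foreachRDD:

    // Sketch, using the result and store parameters from saveResults above.
    result.foreachRDD { rdd =>
      rdd.foreachPartition { elements =>
        elements.foreach { case (key, value) =>
          store.merge((key, value))
        }
      }
    }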
  39. Everyone can benefit
  40. Potential API additions?

      class PairDStreamFunctions[K, V] {
        def aggregateByKey(aggregator: Monoid[V])
        def store(store: MergeableStore[K, V])
      }

  41. Twitter Summingbird: http://github.com/twitter/summingbird (*https://github.com/twitter/summingbird/issues/387)
  42. Testing Your Jobs
  43. Testing Best Practices
      • Try to avoid full integration tests
      • Use in-memory stores for testing (see the sketch below)
      • Keep logic outside of Spark
      • Use Summingbird in-memory platform???
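A minimal in-memory MergeableStore for tests, built on the trait from slide 36 and merging through a Monoid; this is a sketch, not Storehaus’s implementation:

    import scala.collection.mutable

    // Test double with the same contract as slide 36: merge combines the
    // incoming value with whatever is stored, using the Monoid's plus.
    class InMemoryMergeableStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
      private val data = mutable.Map.empty[K, V]

      def get(key: K): V = data.getOrElse(key, monoid.zero)

      def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

      def merge(kv: (K, V)): V = {
        val merged = monoid.plus(get(kv._1), kv._2)
        data(kv._1) = merged
        merged
      }
    }

    // Usage, with the SumMonoid defined after slide 23:
    val store = new InMemoryMergeableStore[String, Long](SumMonoid)
    store.merge(("click", 5L))
    store.merge(("click", 3L))
    assert(store.get("click") == 8L)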
  44. Thank You. Ryan Weald (@rweald)
