Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs

Talk I gave at a Spark Meetup on 01/16/2014

Abstract:
One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way Monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally, we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

  1. Stores, Monoids and Dependency Injection. Spark Meetup 01/16/2014. Ryan Weald @rweald
  2. What We’re Going to Cover
      • What we do and why we chose Spark
      • Common patterns in Spark Streaming jobs
      • Monoids as an abstraction for aggregation
      • Abstraction for saving the results of jobs
      • Using dependency injection for improved testability and developer happiness
  3. What is Sharethrough? Advertising for the Modern Internet: Form, Function
  4. What is Sharethrough?
  5. Why Spark Streaming?
  6. Why Spark Streaming
      • Liked theoretical foundation of mini-batch
      • Scala codebase + functional API
      • Young project with opportunities to contribute
      • Batch model for iterative ML algorithms
  7. Great... Now maintain dozens of streaming jobs
  8. Common Patterns & Functional Programming
  9. Common Job Pattern: Map -> Aggregate -> Store
  10. Real World Example: Which publisher pages has an ad unit appeared on?
  11. Mapping Data
      // Turn each raw request into a (creative key, cleaned publisher page) pair
      inputData.map { rawRequest =>
        val params = QueryParams.parse(rawRequest)
        val pubPage = params.getOrElse("pub_page_location", "http://example.com")
        val creative = params.getOrElse("creative_key", "unknown")
        val uri = new java.net.URI(pubPage)
        // getPath already starts with "/", so no extra separator is needed
        val cleanPubPage = uri.getHost + uri.getPath
        (creative, cleanPubPage)
      }
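
The map step on slide 11 can be tried without a cluster. The following is a minimal plain-Scala sketch: `parseQuery` is a hypothetical stand-in for the talk's `QueryParams.parse` helper (its real behavior is not shown in the slides), and `MappingSketch` and `toCreativeAndPage` are names invented here. The host and path are joined directly because `URI.getPath` already carries its leading slash.

```scala
import java.net.URI
import java.net.URLDecoder

object MappingSketch {
  // Hypothetical stand-in for QueryParams.parse:
  // turns "a=1&b=2" into Map("a" -> "1", "b" -> "2").
  def parseQuery(raw: String): Map[String, String] =
    raw.split("&").toSeq.flatMap { pair =>
      pair.split("=", 2) match {
        case Array(k, v) => Some(k -> URLDecoder.decode(v, "UTF-8"))
        case _           => None // skip pairs without "="
      }
    }.toMap

  // Same shape as the slide's map step: raw request => (creative, cleaned page).
  def toCreativeAndPage(rawRequest: String): (String, String) = {
    val params   = parseQuery(rawRequest)
    val pubPage  = params.getOrElse("pub_page_location", "http://example.com")
    val creative = params.getOrElse("creative_key", "unknown")
    val uri      = new URI(pubPage)
    (creative, uri.getHost + uri.getPath)
  }
}
```

Dropping the query string and scheme this way means two requests for the same page with different tracking parameters land on the same key.
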
  12. Aggregation
  13. Basic Aggregation: Add each pub page to a creative’s set
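  14. Basic Aggregation
      // Union two sets of publisher pages
      val sum: (Set[String], Set[String]) => Set[String] = _ ++ _
      // Wrap each page in a one-element set, then union the sets per creative
      creativePubPages.map { case (ckey, pubPage) =>
        (ckey, Set(pubPage))
      }.reduceByKey(sum)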
  15. Way too much memory usage in production as data size grows
  16. We need a Bloom filter to keep memory usage fixed
  17. Total code re-write :(
  18. Monoids to the Rescue
  19. WTF is a Monoid?
      trait Monoid[T] {
        def zero: T              // identity element: plus(zero, x) == x
        def plus(l: T, r: T): T  // combines two values
      }
      * Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
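
To make the associativity requirement on slide 19 concrete, here is a small sketch that checks the two monoid laws on sample values. The names `MonoidLaws`, `intAddition`, `stringConcat`, `isAssociative`, and `hasIdentity` are illustrative and are not part of Algebird.

```scala
object MonoidLaws {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Integer addition: zero is 0, plus is +.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(l: Int, r: Int): Int = l + r
  }

  // String concatenation: associative, but not commutative.
  val stringConcat: Monoid[String] = new Monoid[String] {
    def zero: String = ""
    def plus(l: String, r: String): String = l + r
  }

  // Associativity: regrouping the operands must not change the result.
  def isAssociative[T](m: Monoid[T], a: T, b: T, c: T): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  // Identity: combining with zero on either side leaves the value unchanged.
  def hasIdentity[T](m: Monoid[T], a: T): Boolean =
    m.plus(m.zero, a) == a && m.plus(a, m.zero) == a
}
```

String concatenation is included to show that a monoid does not have to be commutative. An aggregation that runs through `reduceByKey` should be commutative as well, which holds for the set union and Bloom filter union used in this talk.
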
  20. Monoid Example
      object SetMonoid extends Monoid[Set[String]] {
        def zero = Set.empty[String]
        def plus(l: Set[String], r: Set[String]) = l ++ r
      }
      SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")
      SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")
  21. Twitter Algebird: http://github.com/twitter/algebird
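  22. Algebird Based Aggregation
      import com.twitter.algebird._
      // Bloom filter monoid sized for 500,000 entries at a 1% false-positive rate
      val bfMonoid = BloomFilter(500000, 0.01)
      // Build a one-element filter per page, then merge the filters per creative
      creativePubPages.map { case (ckey, pubPage) =>
        (ckey, bfMonoid.create(pubPage))
      }.reduceByKey(bfMonoid.plus(_, _))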
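
Why a Bloom filter keeps memory fixed, and why it still fits the monoid shape, can be seen in a toy version. The sketch below is hand-rolled for illustration and is not Algebird's implementation: `BloomSketch`, `Bloom`, `BloomMonoid`, and the `width` and `numHashes` parameters are all assumptions made here.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // Toy Bloom filter: `bits` holds the set positions of a `width`-bit array.
  final case class Bloom(bits: BitSet)

  // Hypothetical parameters, chosen for illustration only.
  final class BloomMonoid(width: Int, numHashes: Int) {
    val zero: Bloom = Bloom(BitSet.empty)

    // One position per seeded hash, mapped into [0, width).
    private def positions(item: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        val h = MurmurHash3.stringHash(item, seed)
        ((h % width) + width) % width
      }

    def create(item: String): Bloom = Bloom(BitSet(positions(item): _*))

    // Merging two filters is a bitwise OR, so it is associative and the
    // result never needs more than `width` bits, however many items go in.
    def plus(l: Bloom, r: Bloom): Bloom = Bloom(l.bits | r.bits)

    // Can return true for an item that was never added (a false positive),
    // but never returns false for an item that was added.
    def mightContain(b: Bloom, item: String): Boolean =
      positions(item).forall(b.bits.contains)
  }
}
```

The trade-off compared with the `Set[String]` version is that membership answers become probabilistic and the original strings can no longer be listed back out.
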
  23. Add set of users who have seen creative to same job
  24. Algebird Based Aggregation
      // Product monoid: combines the page filter and the user filter component-wise
      val aggregator = new Monoid[(BF, BF)] {
        def zero = (bfMonoid.zero, bfMonoid.zero)
        def plus(l: (BF, BF), r: (BF, BF)) = {
          (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
        }
      }
      // The value must be a single (BF, BF) pair so reduceByKey sees (key, value) tuples
      creativePubPages.map { case (ckey, pubPage, userId) =>
        (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
      }.reduceByKey(aggregator.plus(_, _))
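
The pair aggregator on slide 24 does not depend on Bloom filters in particular. As a sketch under that assumption, the same construction can be written once for any two monoids; `PairMonoidSketch`, `pairMonoid`, and `setMonoid` are names invented here, and plain sets stand in for the `BF` type so the example runs without Algebird.

```scala
object PairMonoidSketch {
  trait Monoid[T] { def zero: T; def plus(l: T, r: T): T }

  // Combine two monoids component-wise into a monoid over pairs.
  def pairMonoid[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      def zero: (A, B) = (ma.zero, mb.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
    }

  // Set union as a stand-in for the Bloom filter monoid.
  def setMonoid[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(l: Set[T], r: Set[T]): Set[T] = l ++ r
  }
}
```

Because each component is associative on its own, the pair is associative too, so adding a third metric to the job is another composition rather than a rewrite.
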
  25. Monoids == Reusable Aggregation
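
One way to read "reusable aggregation" is that the reduce step can be written once and handed different monoids. The sketch below uses an ordinary `Seq` in place of an RDD; `sumByKey` is a hypothetical local stand-in for `reduceByKey`, and `StringSetMonoid` and `LongSumMonoid` are illustrative instances.

```scala
object ReusableAggregation {
  trait Monoid[T] { def zero: T; def plus(l: T, r: T): T }

  object StringSetMonoid extends Monoid[Set[String]] {
    def zero: Set[String] = Set.empty[String]
    def plus(l: Set[String], r: Set[String]): Set[String] = l ++ r
  }

  object LongSumMonoid extends Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Local stand-in for reduceByKey: fold every value of a key with the monoid,
  // starting from the monoid's zero.
  def sumByKey[K, V](pairs: Seq[(K, V)], m: Monoid[V]): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
    }
}
```

Switching from exact sets to counts, or to a Bloom filter, changes only the monoid argument and the value built in the map step.
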
  26. Common Job Pattern: Map -> Aggregate -> Store
  27. Store
  28. How do we store the results?
  29. Storage API Requirements
      • Incremental updates (preferably associative)
      • Pluggable to support “big data” stores
      • Allow for testing jobs
  30. Storage API
      trait MergeableStore[K, V] {
        def get(key: K): V
        def put(kv: (K, V)): V
        /*
         * Should follow same associative property
         * as our Monoid from earlier
         */
        def merge(kv: (K, V)): V
      }
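
The "allow for testing jobs" requirement suggests an implementation that needs no external service. Below is a minimal sketch of one, written against the simplified trait from slide 30 rather than Storehaus's actual interface. `InMemoryStore` and `longSum` are hypothetical names, and treating a missing key as the monoid's zero is an assumption of this sketch.

```scala
import scala.collection.mutable

object InMemoryStoreSketch {
  trait Monoid[T] { def zero: T; def plus(l: T, r: T): T }

  // The slide's simplified interface, not Storehaus's actual trait.
  trait MergeableStore[K, V] {
    def get(key: K): V
    def put(kv: (K, V)): V
    def merge(kv: (K, V)): V
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Hypothetical test double: keeps everything in a mutable map and
  // treats a missing key as the monoid's zero. Not thread-safe.
  final class InMemoryStore[K, V](m: Monoid[V]) extends MergeableStore[K, V] {
    private val data = mutable.Map.empty[K, V]

    def get(key: K): V = data.getOrElse(key, m.zero)

    def put(kv: (K, V)): V = {
      data.update(kv._1, kv._2)
      kv._2
    }

    // Read, combine, write: the stored value becomes old plus new.
    def merge(kv: (K, V)): V = put((kv._1, m.plus(get(kv._1), kv._2)))
  }
}
```

Since `merge` goes through the monoid, merging batch results one at a time gives the same stored value as merging their sum, which is what makes incremental updates safe.
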
  31. Twitter Storehaus: http://github.com/twitter/storehaus
  32. Storing Spark Results
      def saveResults(result: DStream[(String, BF)],
                      store: HBaseStore[String, BF]) = {
        result.foreach { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            // merge combines the new value with whatever is already stored under key
            store.merge((key, value))
          }
        }
      }
  33. What if we don’t have HBase locally?
  34. Dependency Injection to the rescue
  35. Generic storage with environment specific binding
  36. Generic Storage Method
      def saveResults(result: DStream[(String, BF)],
                      factory: StorageFactory) = {
        // The injected factory decides which concrete store backs this job
        val store = factory.create
        result.foreach { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            store.merge((key, value))
          }
        }
      }
  37. Google Guice: https://github.com/sptz45/sse-guice
  38. DI the Store You Need!
      trait StorageFactory {
        def create: Store[String, BF]
      }
      // Development binding
      class DevModule extends ScalaModule {
        def configure(): Unit = {
          bind[StorageFactory].to[InMemoryStorageFactory]
        }
      }
      // Production binding
      class ProdModule extends ScalaModule {
        def configure(): Unit = {
          bind[StorageFactory].to[HBaseStorageFactory]
        }
      }
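
The idea behind the two modules can be shown without a container. The sketch below wires the factory by hand through a constructor and is not Guice code: `factoryFor` is a hypothetical stand-in for choosing a module, `SaveResultsJob` is an invented name, the `Store` trait is reduced to two methods for brevity, and `Long` counts replace the `BF` values so it runs on its own.

```scala
import scala.collection.mutable

object InjectionSketch {
  // Reduced store interface, assumed here for brevity.
  trait Store[K, V] {
    def merge(kv: (K, V)): Unit
    def get(key: K): Option[V]
  }

  trait StorageFactory { def create: Store[String, Long] }

  // Dev and test binding: counts are kept in memory.
  final class InMemoryStorageFactory extends StorageFactory {
    def create: Store[String, Long] = new Store[String, Long] {
      private val data = mutable.Map.empty[String, Long]
      def merge(kv: (String, Long)): Unit =
        data.update(kv._1, data.getOrElse(kv._1, 0L) + kv._2)
      def get(key: String): Option[Long] = data.get(key)
    }
  }

  // The job only sees the factory; it never names a concrete store.
  final class SaveResultsJob(factory: StorageFactory) {
    val store: Store[String, Long] = factory.create
    def run(results: Seq[(String, Long)]): Unit = results.foreach(store.merge)
  }

  // Hand-written stand-in for selecting a module by environment name.
  def factoryFor(env: String): StorageFactory = env match {
    case "dev" | "test" => new InMemoryStorageFactory
    case other =>
      throw new IllegalArgumentException(s"no binding for environment: $other")
  }
}
```

A production binding would add one more case returning an HBase-backed factory, and the job class would stay untouched.
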
  39. Moving Forward
  40. Potential API additions?
      class PairDStreamFunctions[K, V] {
        def aggregateByKey(aggregator: Monoid[V])
        def store(store: MergeableStore[K, V])
      }
  41. Twitter Summingbird
      http://github.com/twitter/summingbird
      * https://github.com/twitter/summingbird/issues/387
  42. Thank You. Ryan Weald @rweald
