Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs
 


Talk I gave at a Spark Meetup on 01/16/2014

Abstract:
One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the jobs associated with stream processing. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally, we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.


Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Stores, Monoids and Dependency Injection | Spark Meetup 01/16/2014 | Ryan Weald @rweald
    • What We’re Going to Cover
      • What we do and why we choose Spark
      • Common patterns in Spark streaming jobs
      • Monoids as an abstraction for aggregation
      • Abstraction for saving the results of jobs
      • Using dependency injection for improved testability and developer happiness
      @rweald
    • What is Sharethrough? Advertising for the Modern Internet: Form + Function @rweald
    • What is Sharethrough? @rweald
    • Why Spark Streaming? @rweald
    • Why Spark Streaming •Liked theoretical foundation of mini-batch •Scala codebase + functional API •Young project with opportunities to contribute •Batch model for iterative ML algorithms @rweald
    • Great... Now maintain dozens of streaming jobs @rweald
    • Common Patterns & Functional Programming @rweald
    • Common Job Pattern: Map -> Aggregate -> Store @rweald
    • Real World Example Which publisher pages has an ad unit appeared on? @rweald
    • Mapping Data

          inputData.map { rawRequest =>
            val params = QueryParams.parse(rawRequest)
            val pubPage = params.getOrElse(
              "pub_page_location", "http://example.com")
            val creative = params.getOrElse(
              "creative_key", "unknown")
            val uri = new java.net.URI(pubPage)
            val cleanPubPage = uri.getHost + "/" + uri.getPath
            (creative, cleanPubPage)
          }

      @rweald
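The mapping step on this slide is a pure function of the request parameters, so it can be exercised without a Spark cluster. Below is a standalone sketch of the same cleaning logic: since `QueryParams.parse` is Sharethrough-internal, the hypothetical `cleanPubPage` helper starts from an already-parsed parameter map, and it drops the extra `"/"` because `URI.getPath` already includes the leading slash.

```scala
// Standalone sketch of the pub-page cleaning step from the slide.
// The slide's QueryParams.parse is internal, so this version takes a
// pre-parsed Map of query parameters instead of a raw request.
def cleanPubPage(params: Map[String, String]): (String, String) = {
  val pubPage  = params.getOrElse("pub_page_location", "http://example.com")
  val creative = params.getOrElse("creative_key", "unknown")
  val uri      = new java.net.URI(pubPage)
  // getPath already starts with "/", so host + path avoids a doubled slash;
  // the scheme and query string are dropped in the process.
  (creative, uri.getHost + uri.getPath)
}
```

Because the function is pure, it can be unit-tested on a handful of URLs before being dropped into the `inputData.map` call.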
    • Aggregation @rweald
    • Basic Aggregation Add each pub page to a creative’s set @rweald
    • Basic Aggregation

          val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

          creativePubPages.map {
            case (ckey, pubPage) => (ckey, Set(pubPage))
          }.reduceByKey(sum)

      @rweald
    • Way too much memory usage in production as data size grows @rweald
    • We need bloom filter to keep memory usage fixed @rweald
    • Total code re-write :( @rweald
    • Monoids to the Rescue @rweald
    • WTF is a Monoid?

          trait Monoid[T] {
            def zero: T
            def plus(r: T, l: T): T
          }

      * Just need to make sure plus is associative: (1 + 5) + 2 == (2 + 1) + 5

      @rweald
    • Monoid Example

          object SetMonoid extends Monoid[Set[String]] {
            def zero = Set.empty[String]
            def plus(l: Set[String], r: Set[String]) = l ++ r
          }

          SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")
          SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")

      @rweald
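As a quick sanity check, the monoid laws the slides rely on (associativity plus an identity element) can be verified directly in plain Scala. This sketch re-declares `SetMonoid` as a standalone object so it runs on its own, outside the `Monoid[T]` trait:

```scala
// Standalone re-declaration of the slide's SetMonoid.
object SetMonoid {
  def zero: Set[String] = Set.empty[String]
  def plus(l: Set[String], r: Set[String]): Set[String] = l ++ r
}

// Associativity: grouping doesn't matter, so partial results can be
// combined in any order across partitions.
val assocLaw =
  SetMonoid.plus(SetMonoid.plus(Set("x"), Set("y")), Set("z")) ==
    SetMonoid.plus(Set("x"), SetMonoid.plus(Set("y"), Set("z")))

// Identity: zero is a no-op on either side of plus.
val identityLaw = SetMonoid.plus(SetMonoid.zero, Set("x")) == Set("x")
```

These two properties are exactly what lets `reduceByKey` combine intermediate results safely regardless of partitioning.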
    • Twitter Algebird: http://github.com/twitter/algebird @rweald
    • Algebird Based Aggregation

          import com.twitter.algebird._

          val bfMonoid = BloomFilter(500000, 0.01)

          creativePubPages.map {
            case (ckey, pubPage) => (ckey, bfMonoid.create(pubPage))
          }.reduceByKey(bfMonoid.plus(_, _))

      @rweald
    • Add set of users who have seen creative to same job @rweald
    • Algebird Based Aggregation

          val aggregator = new Monoid[(BF, BF)] {
            def zero = (bfMonoid.zero, bfMonoid.zero)
            def plus(l: (BF, BF), r: (BF, BF)) = {
              (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
            }
          }

          creativePubPages.map {
            case (ckey, pubPage, userId) =>
              (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
          }.reduceByKey(aggregator.plus(_, _))

      @rweald
    • Monoids == Reusable Aggregation @rweald
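The claim on this slide can be made concrete outside of Spark: a single reduceByKey-style helper, written once against the `Monoid` trait from earlier, works unchanged whether the value is a `Set`, a counter, or a pair of bloom filters. A minimal sketch (the trait is re-declared here so the snippet stands alone; in the talk it comes from Algebird):

```scala
// Minimal Monoid trait, matching the shape shown earlier in the deck.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

// A reduceByKey-style aggregation over plain in-memory pairs, generic in
// the Monoid. Swapping Set for a bloom filter means swapping the Monoid
// instance; this function never changes.
def aggregateByKey[K, V](pairs: Seq[(K, V)], m: Monoid[V]): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
  }

val setMonoid = new Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}
```

The same `aggregateByKey` could be handed a hypothetical bloom-filter monoid instead of `setMonoid`; that swap is the whole point of the abstraction, and is what avoided the total rewrite described earlier.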
    • Common Job Pattern: Map -> Aggregate -> Store @rweald
    • Store @rweald
    • How do we store the results? @rweald
    • Storage API Requirements •Incremental updates (preferably associative) •Pluggable to support “big data” stores •Allow for testing jobs @rweald
    • Storage API

          trait MergeableStore[K, V] {
            def get(key: K): V
            def put(kv: (K, V)): V
            /*
             * Should follow same associative property
             * as our Monoid from earlier
             */
            def merge(kv: (K, V)): V
          }

      @rweald
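The trait above is the slide's simplified, synchronous rendering of the idea; Storehaus' real stores are asynchronous (future-based). For local development and tests, a hypothetical in-memory implementation of this simplified interface might look like:

```scala
// The simplified store interface from the slide.
trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  def merge(kv: (K, V)): V
}

// Hypothetical in-memory implementation, parameterized by the same
// zero/plus pair as a Monoid so that merge stays associative.
class InMemoryStore[K, V](zero: V, plus: (V, V) => V)
    extends MergeableStore[K, V] {
  private val data = scala.collection.mutable.Map.empty[K, V]
  def get(key: K): V = data.getOrElse(key, zero)
  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }
  // merge combines the incoming value with whatever is already stored,
  // using the associative plus, so replayed or reordered batches converge.
  def merge(kv: (K, V)): V = put((kv._1, plus(get(kv._1), kv._2)))
}
```

Passing the Monoid's own `zero`/`plus` into the store is what makes incremental updates from many micro-batches land on the same result as a single big aggregation.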
    • Twitter Storehaus: http://github.com/twitter/storehaus @rweald
    • Storing Spark Results

          def saveResults(result: DStream[(String, BF)],
                          store: HBaseStore[String, BF]) = {
            result.foreach { rdd =>
              rdd.foreach { element =>
                val (key, value) = element
                store.merge((key, value))
              }
            }
          }

      @rweald
    • What if we don’t have HBase locally? @rweald
    • Dependency Injection to the rescue @rweald
    • Generic storage with environment specific binding @rweald
    • Generic Storage Method

          def saveResults(result: DStream[(String, BF)],
                          factory: StorageFactory) = {
            val store = factory.create
            result.foreach { rdd =>
              rdd.foreach { element =>
                val (key, value) = element
                store.merge((key, value))
              }
            }
          }

      @rweald
    • Google Guice: https://github.com/sptz45/sse-guice @rweald
    • DI the Store You Need!

          trait StorageFactory {
            def create: Store[String, BF]
          }

          class DevModule extends ScalaModule {
            def configure() {
              bind[StorageFactory].to[InMemoryStorageFactory]
            }
          }

          class ProdModule extends ScalaModule {
            def configure() {
              bind[StorageFactory].to[HBaseStorageFactory]
            }
          }

      @rweald
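The same wiring can be illustrated without Guice, using plain constructor injection. This hypothetical sketch substitutes `Set[String]` for the bloom-filter type so it is self-contained: the job depends only on `StorageFactory`, and each environment supplies a different implementation.

```scala
// Simplified store and factory interfaces, echoing the slide.
trait Store[K, V] { def merge(kv: (K, V)): V }

trait StorageFactory { def create: Store[String, Set[String]] }

// Dev/test binding: an in-memory store, no HBase required.
class InMemoryStorageFactory extends StorageFactory {
  def create = new Store[String, Set[String]] {
    private val data = scala.collection.mutable.Map.empty[String, Set[String]]
    def merge(kv: (String, Set[String])) = {
      val merged = data.getOrElse(kv._1, Set.empty[String]) ++ kv._2
      data(kv._1) = merged
      merged
    }
  }
}

// The job code never names a concrete store; tests pass
// InMemoryStorageFactory, production would pass an HBase-backed factory.
def runJob(factory: StorageFactory): Store[String, Set[String]] =
  factory.create
```

Guice automates exactly this substitution via the `DevModule`/`ProdModule` bindings on the slide; the design benefit (jobs testable without HBase) is the same either way.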
    • Moving Forward @rweald
    • Potential API additions?

          class PairDStreamFunctions[K, V] {
            def aggregateByKey(aggregator: Monoid[V])
            def store(store: MergeableStore[K, V])
          }

      @rweald
    • Twitter Summingbird: http://github.com/twitter/summingbird (*see https://github.com/twitter/summingbird/issues/387) @rweald
    • Thank You! Ryan Weald @rweald