This document summarizes a talk on common patterns in Spark Streaming jobs: mapping data, aggregating with monoids, and storing results. It describes using monoids to abstract aggregation, allowing drop-in implementations such as Bloom filters, and using dependency injection to make storage pluggable across environments. The talk also suggests additions to Spark's API to directly support these patterns.
2. What We’re Going to Cover
•What we do and why we chose Spark
•Common patterns in Spark Streaming jobs
•Monoids as an abstraction for aggregation
•Abstraction for saving the results of jobs
•Using dependency injection for improved testability and developer happiness
@rweald
6. Why Spark Streaming
•Liked theoretical foundation of mini-batch
•Scala codebase + functional API
•Young project with opportunities to contribute
•Batch model for iterative ML algorithms
11. Mapping Data
inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  val pubPage = params.getOrElse(
    "pub_page_location",
    "http://example.com")
  val creative = params.getOrElse(
    "creative_key",
    "unknown")
  val uri = new java.net.URI(pubPage)
  // getPath already includes the leading "/"
  val cleanPubPage = uri.getHost + uri.getPath
  (creative, cleanPubPage)
}
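Pulling the URL-cleaning step out into a pure function makes it unit-testable without a Spark context, which foreshadows the testability theme later in the talk. This is a minimal sketch; `cleanPubPage` is a hypothetical helper name, not from the talk's codebase.

```scala
// Hypothetical helper: extract host + path from a raw publisher
// page URL, so the logic can be tested in isolation.
def cleanPubPage(pubPage: String): String = {
  val uri = new java.net.URI(pubPage)
  // getPath already starts with "/", so no separator is needed
  uri.getHost + uri.getPath
}

cleanPubPage("http://example.com/articles/1")
// "example.com/articles/1"
```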
19. WTF is a Monoid?
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}
* Just need to make sure plus is associative.
(1 + 5) + 2 == 1 + (5 + 2)
20. Monoid Example
object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

SetMonoid.plus(Set("a"), Set("b"))
//returns Set("a", "b")

SetMonoid.plus(Set("a"), Set("a"))
//returns Set("a")
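Once `plus` and `zero` exist, the aggregation step no longer needs to know which monoid it is using. A minimal sketch of that idea, repeating the trait and `SetMonoid` from the slides so it is self-contained; `sumAll` is a hypothetical helper, not part of Spark's API:

```scala
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

// Fold any collection down to one value: start from the monoid's
// zero and combine with its associative plus. Swapping the monoid
// swaps the aggregation (sets, counters, Bloom filters, ...).
def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
  xs.foldLeft(m.zero)(m.plus)

implicit val setMonoid: Monoid[Set[String]] = SetMonoid

sumAll(Seq(Set("a"), Set("b"), Set("a")))
// Set("a", "b") -- duplicates collapse, so this doubles as a
// distinct-values aggregation
```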
30. Storage API
trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
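A sketch of how the store and the monoid fit together: an in-memory implementation backed by a mutable `Map`, the kind of thing you might inject in a test environment. The monoid supplies `zero` for missing keys and `plus` for `merge`. The traits are repeated from the slides so the sketch is self-contained; `InMemoryStore` is a hypothetical class name, not from the talk.

```scala
import scala.collection.mutable

trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  def merge(kv: (K, V)): V
}

class InMemoryStore[K, V](m: Monoid[V]) extends MergeableStore[K, V] {
  private val data = mutable.Map.empty[K, V]

  // Missing keys read as the monoid's zero, so merge never needs
  // a special first-write case.
  def get(key: K): V = data.getOrElse(key, m.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // Combine the incoming value with whatever is stored, using the
  // same associative plus as the aggregation step.
  def merge(kv: (K, V)): V = put((kv._1, m.plus(get(kv._1), kv._2)))
}

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

val store = new InMemoryStore[String, Set[String]](SetMonoid)
store.merge(("creative-1", Set("example.com/a")))
store.merge(("creative-1", Set("example.com/b")))
store.get("creative-1")
// Set("example.com/a", "example.com/b")
```

Because a production store (Cassandra, Redis, ...) only has to satisfy the same trait, the job code stays identical across environments; only the injected implementation changes.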