This document summarizes a talk on common patterns in Spark Streaming jobs: mapping data, aggregating with monoids, and storing results. It describes using monoids to abstract aggregation, allowing drop-in implementations such as Bloom filters, and using dependency injection to make storage pluggable across environments. The talk also suggests additions to Spark's API to directly support these patterns.
2. What We’re Going to Cover
•What we do and why we chose Spark
•Common patterns in Spark Streaming jobs
•Monoids as an abstraction for aggregation
•Abstraction for saving the results of jobs
•Using dependency injection for improved testability and developer happiness
@rweald
6. Why Spark Streaming
•Liked theoretical foundation of mini-batch
•Scala codebase + functional API
•Young project with opportunities to contribute
•Batch model for iterative ML algorithms
11. Mapping Data
inputData.map { rawRequest =>
  val params = QueryParams.parse(rawRequest)
  val pubPage = params.getOrElse(
    "pub_page_location",
    "http://example.com")
  val creative = params.getOrElse(
    "creative_key",
    "unknown")
  val uri = new java.net.URI(pubPage)
  // getPath already includes the leading "/"
  val cleanPubPage = uri.getHost + uri.getPath
  (creative, cleanPubPage)
}
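Pulling the URL-cleaning step out into a pure function makes it unit-testable without a Spark context, which foreshadows the testability theme later in the talk. This is a minimal sketch; `cleanPubPage` is a hypothetical helper name, not from the talk's codebase.

```scala
// Hypothetical helper: extract host + path from a raw publisher
// page URL, so the logic can be tested in isolation.
def cleanPubPage(pubPage: String): String = {
  val uri = new java.net.URI(pubPage)
  // getPath already starts with "/", so no separator is needed
  uri.getHost + uri.getPath
}

cleanPubPage("http://example.com/articles/1")
// "example.com/articles/1"
```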
19. WTF is a Monoid?
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}
* Just need to make sure plus is associative.
(1 + 5) + 2 == 1 + (5 + 2)
20. Monoid Example
object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

SetMonoid.plus(Set("a"), Set("b"))
//returns Set("a", "b")

SetMonoid.plus(Set("a"), Set("a"))
//returns Set("a")
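Once `plus` and `zero` exist, the aggregation step no longer needs to know which monoid it is using. A minimal sketch of that idea, repeating the trait and `SetMonoid` from the slides so it is self-contained; `sumAll` is a hypothetical helper, not part of Spark's API:

```scala
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

// Fold any collection down to one value: start from the monoid's
// zero and combine with its associative plus. Swapping the monoid
// swaps the aggregation (sets, counters, Bloom filters, ...).
def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
  xs.foldLeft(m.zero)(m.plus)

implicit val setMonoid: Monoid[Set[String]] = SetMonoid

sumAll(Seq(Set("a"), Set("b"), Set("a")))
// Set("a", "b") -- duplicates collapse, so this doubles as a
// distinct-values aggregation
```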
30. Storage API
trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
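A sketch of how the store and the monoid fit together: an in-memory implementation backed by a mutable `Map`, the kind of thing you might inject in a test environment. The monoid supplies `zero` for missing keys and `plus` for `merge`. The traits are repeated from the slides so the sketch is self-contained; `InMemoryStore` is a hypothetical class name, not from the talk.

```scala
import scala.collection.mutable

trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V
  def merge(kv: (K, V)): V
}

class InMemoryStore[K, V](m: Monoid[V]) extends MergeableStore[K, V] {
  private val data = mutable.Map.empty[K, V]

  // Missing keys read as the monoid's zero, so merge never needs
  // a special first-write case.
  def get(key: K): V = data.getOrElse(key, m.zero)

  def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

  // Combine the incoming value with whatever is stored, using the
  // same associative plus as the aggregation step.
  def merge(kv: (K, V)): V = put((kv._1, m.plus(get(kv._1), kv._2)))
}

object SetMonoid extends Monoid[Set[String]] {
  def zero = Set.empty[String]
  def plus(l: Set[String], r: Set[String]) = l ++ r
}

val store = new InMemoryStore[String, Set[String]](SetMonoid)
store.merge(("creative-1", Set("example.com/a")))
store.merge(("creative-1", Set("example.com/b")))
store.get("creative-1")
// Set("example.com/a", "example.com/b")
```

Because a production store (Cassandra, Redis, ...) only has to satisfy the same trait, the job code stays identical across environments; only the injected implementation changes.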