Spark Streaming Snippets

@atty303

Spark Streaming apps I have built so far
• rtg -- data generation for real-time retargeting
• pixelwriter -- writes mark data for Dynamic Creative
• feedsync -- product feed synchronization
• segment:elastic -- real-time segmentation
• (logblend -- joins different event logs)

From these apps, I pulled out a few pieces that seemed worth sharing
• SparkBoot
• Cron
• UpdatableBroadcast
• Connector
• SparkStreamingSpec

SparkBoot
https://gist.github.com/atty303/c83f3c8cb8a930951be0
• The main implementation for Spark apps
• Provides the SparkContext / StreamingContext
• Configuration management
• Customize by shipping application.conf with the --files option of spark-submit

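Batch case
// SparkBatchBoot and SparkApp come from the SparkBoot gist above.
import scala.util.Try
import com.typesafe.config.Config // assuming appConfig is a Typesafe Config
import org.apache.spark.SparkContext

object TrainingBatchApp extends SparkBatchBoot {
  val appName = "TrainingBatchApp"
  override def mkApp(sc: SparkContext, args: Array[String]): SparkApp =
    new TrainingBatchApp(sc, appConfig)
}

class TrainingBatchApp(sc: SparkContext, appConfig: Config) extends SparkApp {
  def run(): Try[Int] = Try {
    // Batch logic goes here.
    0
  }
}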

Streaming case
object PredictStreamingApp extends SparkStreamingBoot {
  val appName = "PredictStreamingApp"
  // Keys in application.conf that hold the checkpoint path and the batch duration
  override val checkpointPath: String = "app.training.streaming.checkpoint-path"
  override val batchDurationPath: String = "app.training.streaming.batch-duration"
  override def mkApp(sc: SparkContext, ssc: StreamingContext, args: Array[String]): SparkApp =
    new PredictStreamingApp(ssc, batchDuration, appConfig)
}

// The constructor takes batchDuration so that it matches the call in mkApp.
class PredictStreamingApp(
    ssc: StreamingContext, batchDuration: Duration, appConfig: Config)
  extends SparkApp {
  val sparkContext = ssc.sparkContext
  def run(): Try[Int] = Try {
    // Streaming logic goes here.
    0
  }
}

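Doing cron-like things
• Some processing has to run periodically at an interval longer than the batch duration
• For example, refreshing master data that is read from an external data store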

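import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}

def repeatedly(streamingContext: StreamingContext, interval: Duration)
              (f: (SparkContext, Time) => Unit): Unit = {
  // DStream that generates the trigger: the queue stays empty, so every
  // batch yields defaultRDD, which holds a single Unit element.
  val s = streamingContext.queueStream(
    mutable.Queue.empty[RDD[Unit]],
    oneAtATime = true,
    defaultRDD = streamingContext.sparkContext.makeRDD(Seq(())))
    .repartition(1)
  // A window one batch long that slides every `interval`.
  s.window(s.slideDuration, interval)
    .foreachRDD { (rdd, time) =>
      f(rdd.context, time)
      rdd.foreach(_ => ()) // run a trivial job on the trigger RDD
    }
}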

Usage
repeatedly(streamingContext, Durations.seconds(300)) { (sc, time) =>
  // @driver: processing that runs every 5 minutes
}
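
The timing of the trigger can be modeled without Spark. The sketch below assumes that a windowed DStream emits only at times that are a multiple of its slide duration after the stream starts, and that the slide duration has to be a multiple of the parent's batch duration. `TriggerTimes` and `fireTimes` are hypothetical names used only for illustration; they are not part of the gist.

```scala
object TriggerTimes {
  /** Batch times (ms since start) at which a window sliding every `intervalMs` emits. */
  def fireTimes(batchMs: Long, intervalMs: Long, untilMs: Long): Seq[Long] = {
    require(batchMs > 0 && intervalMs > 0, "durations must be positive")
    require(intervalMs % batchMs == 0,
      "interval must be a multiple of the batch duration")
    // Batches arrive at batchMs, 2 * batchMs, ...; keep those aligned to the interval.
    (batchMs to untilMs by batchMs).filter(_ % intervalMs == 0)
  }
}
```

With a 10-second batch and a 300-second interval, `TriggerTimes.fireTimes(10000L, 300000L, 900000L)` yields 300000, 600000 and 900000, that is, one run every 5 minutes. An interval that is not a multiple of the batch duration is rejected up front.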

Updatable Broadcast
• You want to update a Broadcast after streaming has started
• A simple wrapper that holds a Broadcast
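import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.reflect.ClassTag
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext

/**
 * A Broadcast whose value can be updated (re-broadcast).
 *
 * https://gist.github.com/Reinvigorate/040a362ca8100347e1a6
 * @author Reinvigorate
 */
case class UpdatableBroadcast[T: ClassTag](
    @transient private val ssc: StreamingContext,
    @transient private val _v: T) {
  @transient private var v = ssc.sparkContext.broadcast(_v)

  // Drop the old broadcast and broadcast the new value in its place.
  def update(newValue: T, blocking: Boolean = false): Unit = {
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }

  def value: T = v.value

  // Custom serialization: only the current Broadcast handle is written.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(v)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}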

Usage
def loadModel(): Model = ???
val ub: UpdatableBroadcast[Model] =
  UpdatableBroadcast(streamingContext, loadModel())
StreamingUtil.repeatedly(streamingContext, refreshInterval) { (_, _) =>
  ub.update(loadModel())
}
dstream.foreachRDD { rdd =>
  val model = ub.value
  // use model
}
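
Reading `ub.value` inside the `foreachRDD` body matters because that body runs once per batch, so each batch picks up whatever was last set by `update`. The following Spark-free sketch illustrates the difference from reading the value once up front; `Holder` is a hypothetical stand-in for `UpdatableBroadcast`, and the refresh schedule is simulated with a plain loop.

```scala
object RefreshSketch {
  /** Stand-in for UpdatableBroadcast: a reference that can be swapped. */
  final class Holder[T](initial: T) {
    @volatile private var current: T = initial
    def update(newValue: T): Unit = { current = newValue }
    def value: T = current
  }

  /** Simulates `batches` batches, refreshing the holder on every `refreshEvery`-th batch. */
  def run(batches: Int, refreshEvery: Int): (Seq[Int], Seq[Int]) = {
    var version = 0
    val holder = new Holder(version)
    val capturedOnce = holder.value // read once, outside the per-batch body
    val stale = Seq.newBuilder[Int]
    val fresh = Seq.newBuilder[Int]
    for (batch <- 1 to batches) {
      if (batch % refreshEvery == 0) { version += 1; holder.update(version) }
      stale += capturedOnce // never changes
      fresh += holder.value // read per batch, sees updates
    }
    (stale.result(), fresh.result())
  }
}
```

`RefreshSketch.run(4, 2)` returns `(Seq(0, 0, 0, 0), Seq(0, 1, 1, 2))`: the value read once stays at the initial version, while the per-batch read follows each refresh.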

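Abstracting external connections
• When Spark accesses an external resource, each Executor has to maintain its own connection
• The Driver cannot send "the connection itself" to an Executor
• So "how to connect" is abstracted as the Connector trait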

trait Connector[A] extends java.io.Closeable with Serializable {
  def get: A
  def close(): Unit
  def using[B](f: A => B): B = f(get)
}
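
To make the "send how to connect, not the connection" idea concrete, here is a minimal sketch that needs no Spark. It repeats the trait so that it is self-contained, and it simulates shipping to an Executor with a plain Java-serialization round trip. `FakeClient`, `FakeConnector` and `roundTrip` are hypothetical names invented for this illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ConnectorSketch {
  // Same shape as the Connector trait in the slides.
  trait Connector[A] extends java.io.Closeable with Serializable {
    def get: A
    def close(): Unit
    def using[B](f: A => B): B = f(get)
  }

  /** Hypothetical non-serializable resource standing in for a real client. */
  final class FakeClient(val endpoint: String) {
    def send(msg: String): String = s"$endpoint <- $msg"
  }

  /** Carries only the endpoint; the client is created lazily wherever `get` is called. */
  case class FakeConnector(endpoint: String) extends Connector[FakeClient] {
    @transient private lazy val client = new FakeClient(endpoint)
    def get: FakeClient = client
    def close(): Unit = ()
  }

  /** Serializes and deserializes a value, roughly what happens to objects captured in a shipped closure. */
  def roundTrip[T <: Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The connector survives the round trip even though `FakeClient` is not serializable, because the client field is transient and rebuilt on first use on the receiving side.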

case class PoolAerospikeConnector(name: Symbol, config: AerospikeConfig)
  extends Connector[AerospikeClient] {
  def get: AerospikeClient =
    PoolAerospikeConnector.defaultHolder.getOrCreate(name, mkClient)
  def close(): Unit =
    PoolAerospikeConnector.defaultHolder.remove(name)(
      AerospikeConnector.aerospikeClientClosable)
  private val mkClient: () => AerospikeClient =
    () => new AerospikeClient(config.clientPolicy.underlying,
      config.asHosts: _*)
}

object PoolAerospikeConnector {
  private val defaultHolder = new DefaultResourceHolder[AerospikeClient]
}
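
`DefaultResourceHolder` is not shown in the slides. Because it lives in a companion object, there is one holder per JVM, so each Executor ends up with its own client. A minimal sketch of what such a holder might look like follows; the `Closable` type class and the exact method signatures are assumptions inferred from the call sites above, not the author's implementation.

```scala
import scala.collection.mutable

object ResourceHolderSketch {
  /** Hypothetical type class describing how to close a resource of type A. */
  trait Closable[A] { def close(a: A): Unit }

  /** Per-JVM registry that keeps at most one live resource per name. */
  class DefaultResourceHolder[A] {
    private val pool = mutable.Map.empty[Symbol, A]

    /** Returns the resource registered under `name`, creating it on first use. */
    def getOrCreate(name: Symbol, mk: () => A): A = synchronized {
      pool.getOrElseUpdate(name, mk())
    }

    /** Unregisters the resource under `name` and closes it, if one exists. */
    def remove(name: Symbol)(closable: Closable[A]): Unit = synchronized {
      pool.remove(name).foreach(closable.close)
    }
  }
}
```

Repeated `getOrCreate` calls with the same name return the same instance, and a call after `remove` builds a new one.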

case class ScalikeJdbcConnector(name: Symbol, config: Config)
  extends Connector[Unit] {
  def get: Unit = {
    if (!ConnectionPool.isInitialized(name)) {
      // Load MySQL JDBC driver class
      Class.forName("com.mysql.jdbc.Driver")
      ConnectionPool.add(name, config.getString("url"),
        config.getString("user"), config.getString("password"))
    }
  }
  def close(): Unit = ConnectionPool.close(name)
}

case class KafkaProducerConnector[K: ClassTag, V: ClassTag](
    name: Symbol, config: java.util.Map[String, AnyRef])
  extends Connector[ScalaKafkaProducer[K, V]] {
  def get: ScalaKafkaProducer[K, V] =
    KafkaProducerConnector.defaultHolder.getOrCreate(name, mkResource)
      .asInstanceOf[ScalaKafkaProducer[K, V]]
  def close(): Unit = KafkaProducerConnector.defaultHolder.remove(name)(
    KafkaProducerConnector.kafkaProducerClosable)
  private val mkResource = () => {
    val keySer = mkDefaultSerializer[K](ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG)
    val valueSer = mkDefaultSerializer[V](ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG)
    new ScalaKafkaProducer[K, V](
      new KafkaProducer[K, V](config, keySer.orNull, valueSer.orNull))
  }
  private def mkDefaultSerializer[A: ClassTag](configKey: String): Option[Serializer[A]] = {
    if (!config.containsKey(configKey)) {
      implicitly[ClassTag[A]].runtimeClass match {
        case c if c == classOf[Array[Byte]] => Some(new ByteArraySerializer().asInstanceOf[Serializer[A]])
        case c if c == classOf[String] => Some(new StringSerializer().asInstanceOf[Serializer[A]])
        case _ => None
      }
    } else None
  }
}

Testing Spark Streaming
https://gist.github.com/atty303/18e64e718f0cf3261c0e

class CountProductSpec extends SpecWithJUnit with SparkStreamingSpec {
  val batchDuration: Duration = Duration(1000)
  "Count" >> {
    val (sourceQueue, resultQueue) = startQueueStream[Product, (Product, Long)] { inStream =>
      // The streaming logic under test
      CountProduct(inStream).run(sc)
    }
    // Put test data into the input queue
    sourceQueue += sc.parallelize(Seq(
      Product(1, "id"), Product(1, "id"), Product(2, "id")))
    // Advance the clock
    advance()
    // Check the data that comes out
    resultQueue.dequeue must eventually(contain(exactly(
      Product(1, "id") -> 2L, Product(2, "id") -> 1L
    )))
  }
}
