Real-Time User Attribute Estimation with Spark Streaming
@laclefyoshi
<ysaeki@r.recruit.co.jp>
• Spark Streaming
• Spark Streaming Tips
Speaker: SAEKI Yoshiyasu
R&D at an IT company: Hadoop, Kafka, Storm, Spark, Druid
RICOH Theta + Google Cardboard
Spark Streaming
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
http://www.recruit.jp/company/about/structure.html
1. Web front end (JavaScript)
2. fluentd → Kafka
fluentd → Kafka
• fluent-plugin-kafka (config sketch below)
• https://github.com/htgc/fluent-plugin-kafka
• output type = kafka_buffered (file-backed buffer)
• Kafka 0.8.2.2
• Kafka 0.9.0 adds ACL support
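A minimal config sketch for this setup, assuming fluent-plugin-kafka of that era; the tag, broker hosts, topic, and buffer path are placeholders:

<match weblog.**>
  type kafka_buffered            # buffered Kafka output from fluent-plugin-kafka
  brokers kafka01:9092,kafka02:9092
  default_topic web_logs
  output_data_type json
  buffer_type file               # file-backed buffer survives agent restarts
  buffer_path /var/log/td-agent/buffer/kafka
</match>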
Suro
• From Netflix
• https://github.com/Netflix/suro
• Input: Kafka Consumer API / Thrift API
• Output:
• HDFS
• AWS S3
• Kafka Producer
• Elasticsearch
Gobblin (LinkedIn)
• Data ingestion into Hadoop
• Output: HDFS
MLlib streaming algorithms (sketch below)
• Streaming linear regression (classification)
• Streaming k-means (clustering)
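A minimal sketch of the streaming linear regression API (Spark 1.5 era MLlib); the socket source, line format, and feature count are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("attr"), Seconds(1))

// "label,f1 f2 f3" lines; any DStream[LabeledPoint] source works
val training = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(label, features) = line.split(",")
  LabeledPoint(label.toDouble, Vectors.dense(features.split(" ").map(_.toDouble)))
}

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // 3 features assumed

model.trainOn(training)                  // model updates with every micro-batch
model.predictOn(training.map(_.features)).print()

ssc.start()
ssc.awaitTermination()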
Spark Streaming
Kafka
• Direct Approach (Spark >= 1.3)
• Exactly-once semantics
• Built on the Kafka Simple Consumer API
• We use the Direct Approach (sketch below)
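A minimal sketch of creating a direct stream with spark-streaming-kafka (Spark 1.3+ API), given a StreamingContext ssc; broker list and topic are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "kafka01:9092,kafka02:9092")
val topics = Set("web_logs")

// No receiver: each micro-batch reads its own offset range directly from
// the brokers (Simple Consumer API), which enables exactly-once input
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)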
Spark Streaming (1)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
[Figure: a DStream is a sequence of RDDs: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4]
Spark Streaming (2)
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
Micro-batch
• Events are aggregated per user (Cookie) in each micro-batch
• Window-based micro-batch: one window spans multiple micro-batches (sketch below)
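For instance, window() turns a plain DStream into a window-based one; given some dstream: DStream[String], the sizes below are illustrative and must be multiples of the batch interval:

import org.apache.spark.streaming.Seconds

// 60 s of events, recomputed on every 1 s micro-batch (values assumed)
val windowed = dstream.window(Seconds(60), Seconds(1))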
Micro-batch
• Writing each RDD to HBase from foreachRDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.spark.rdd.PairRDDFunctions

dstream.foreachRDD { rdd =>
  // createHbaseConfiguration() (defined elsewhere) is assumed to set the
  // ZooKeeper quorum and the output table for TableOutputFormat
  val hbaseConf = createHbaseConfiguration()
  val jobConf = new Configuration(hbaseConf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.outputformat.class",
    classOf[TableOutputFormat[Text]].getName)
  new PairRDDFunctions(rdd.map(hbaseConvert)).saveAsNewAPIHadoopDataset(jobConf)
}

// Convert RDD[(String, Map[K, V])] elements into RDD[(String, Put)] elements
def hbaseConvert(t: (String, Map[String, String])): (String, Put) = {
  val p = new Put(Bytes.toBytes(t._1))
  t._2.toSeq.foreach( m =>
    p.addColumn(Bytes.toBytes("seg"),
      Bytes.toBytes(m._1), Bytes.toBytes(m._2))
  )
  (t._1, p)
}
Spark Streaming: programming model
• A DStream is a sequence of RDDs
• Existing Spark (batch) code carries over to Spark Streaming, as sketched below the link
http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html
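As a sketch of that reuse (enrich is a made-up name, and dstream: DStream[String] is assumed): a function written against RDDs runs unchanged on a DStream via transform:

import org.apache.spark.rdd.RDD

// Any RDD => RDD function, e.g. taken from an existing batch job
def enrich(rdd: RDD[String]): RDD[String] =
  rdd.filter(_.nonEmpty).map(_.toUpperCase)

val enriched = dstream.transform(enrich _)   // applied to every micro-batch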
Spark Streaming: operations
• Fault tolerance: a failed micro-batch can be recomputed
• Runs on YARN
• YARN Dynamic Resource Allocation
Spark Streaming: testing
• A transformation is a function: input → output
• RDD → RDD, or DStream → DStream
• Test one micro-batch at a time
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// RDD → RDD: build a fixture RDD directly
val input: RDD[String] = sparkContext.makeRDD(Seq("a", "b", "c"))
// DStream → DStream: feed fixture RDDs through a queue
val queue = scala.collection.mutable.Queue(input)
val dstream: DStream[String] =
  sparkStreamingContext.queueStream(queue)
Spark Streaming: spark-testing-base
• spark-testing-base
• https://github.com/holdenk/spark-testing-base

class JsonElementCountTest extends StreamingSuiteBase {
  test("simple") {
    val input = List(List("aa"), List("bb"))
    val expected = List(List("AA"), List("BB"))
    testOperation[String, String](
      input, converterMethod _, expected, useSet = true)
  }

  // Assumed implementation of the operation under test,
  // implied by the expected output above
  def converterMethod(ds: DStream[String]): DStream[String] =
    ds.map(_.toUpperCase)
}
Spark Streaming: testing window operations
• Window-based micro-batches depend on time
• o.a.spark.streaming.util.ManualClock lets a test advance the clock manually
• It is a private class, but Scala offers a workaround (sketch below)
• http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
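A sketch of the workaround from the linked post: a thin wrapper declared inside Spark's own package can reach the scheduler's ManualClock. Names follow that post; the clock's package and access path vary between Spark versions.

package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

// Being in this package grants access to private[streaming] members; the test
// must also set conf "spark.streaming.clock" to
// "org.apache.spark.util.ManualClock" before creating the StreamingContext
class ClockWrapper(ssc: StreamingContext) {
  private def clock: ManualClock =
    ssc.scheduler.clock.asInstanceOf[ManualClock]

  def getTimeMillis: Long = clock.getTimeMillis()

  // Advancing the clock fires any micro-batches that fall due
  def advance(millis: Long): Unit = clock.advance(millis)
}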
Spark Streaming: Scala or Java
• Both Scala and Java APIs are available
• Our Spark Streaming + Kafka + HBase code is written in Scala
• Java interop is provided by implicit conversions:
// api/java/JavaRDD.scala
object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] =
    new JavaRDD[T](rdd)

  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}
• Spark Streaming
• MLlib
• GraphX