Development of software for scalable anomaly detection modeling of time-series data using Apache Spark
Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe,
IBM Research – Tokyo
2016/02/08, Spark Conference Japan

©2015 IBM Corporation, 10 February 2016
How we detect anomalies
System under monitoring (e.g., a factory plant) with sensors A, B, C, and D
• Sensor values are correlated (e.g., temperature, acceleration, pressure, density)
→ The correlation changes in an anomalous situation
• T. Idé, et al., SDM 2009.
• T. Idé, IBM ProVISION No. 78, 2013

Prediction model of correct behavior
• The value of sensor A is predicted from the other sensors B, C, and D
• Compare the predicted sensor value with the observed value
→ It is an anomaly if the two differ
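The comparison of predicted and observed values can be sketched in a few lines of plain Python. This is only an illustration: the squared gap as a score, the threshold, the coefficient values, and the names `predict_sensor` and `anomaly_score` are all assumptions that do not appear in the slides.

```python
def predict_sensor(row, coeffs, target):
    """Predict row[target] as a weighted sum of the other sensors."""
    return sum(coeffs[j] * row[j] for j in range(len(row)) if j != target)

def anomaly_score(row, coeffs, target):
    """Squared gap between the predicted and the observed value (assumed score)."""
    return (predict_sensor(row, coeffs, target) - row[target]) ** 2

# Hypothetical model for sensor A (index 0): A = 2*B + 0*C - 1*D
coeffs_a = [0.0, 2.0, 0.0, -1.0]
normal = [3.0, 2.0, 5.0, 1.0]   # A matches the prediction 2*2 - 1 = 3
broken = [9.0, 2.0, 5.0, 1.0]   # A deviates from the prediction
threshold = 1.0                 # hypothetical threshold
flags = [anomaly_score(r, coeffs_a, 0) > threshold for r in (normal, broken)]
```

Only the second row is flagged, because its observed value for sensor A no longer follows the relation learned from the other sensors.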
Motivation:
• The prediction model is computed in advance by machine learning.
• This takes a very long time and requires much memory.
→ Improve the scalability with Spark!

How we applied Spark (before)
• Training: a linear model using LASSO regression (least squares + L1 regularization)
$$\min_{\{a\}} g_i, \quad \text{where} \quad g_i = \frac{1}{T}\sum_{t=1}^{T}\Big(\sum_{j\neq i}^{D} x_{tj}\,a_{ji} - x_{ti}\Big)^{2} + \lambda\sum_{j\neq i}^{D}\lvert a_{ji}\rvert$$
– Hyper-parameter λ (tuned later to achieve the best prediction accuracy)

• Evaluation: cross validation of prediction accuracy
– Other data is used to test the model
• Search loop over the hyper-parameter λ: training → model → evaluation

• Time series x_tj: the original time-series data (big, T × D)
– T ~ 10^6 or more samples (time)
– D ~ 10^2 sensors (dimensions)
– (i.e., T >> D)

• Computed in advance from the original time-series data x_tj (big, T × D): the matrix S_jk (small, D × D)
$$S_{jk} = \frac{1}{T}\sum_{t=1}^{T} x_{tj}\,x_{tk}$$
– Training uses S_jk; evaluation of the prediction accuracy uses the original data x_tj
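A minimal sketch of how S_jk could be accumulated chunk by chunk, in plain Python rather than Spark. The function names and the tiny data set are hypothetical. The point it illustrates is that T appears only in the loop over rows, while the result stays D × D.

```python
from functools import reduce

def partial_products(chunk, d):
    """Map step: sum of x_tj * x_tk over the rows of one chunk (D x D)."""
    acc = [[0.0] * d for _ in range(d)]
    for row in chunk:
        for j in range(d):
            for k in range(d):
                acc[j][k] += row[j] * row[k]
    return acc

def add_matrices(a, b):
    """Reduce step: element-wise sum of two D x D matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def second_moment(chunks, d):
    """S_jk = (1/T) * sum_t x_tj * x_tk, accumulated chunk by chunk."""
    t = sum(len(c) for c in chunks)
    total = reduce(add_matrices, (partial_products(c, d) for c in chunks))
    return [[v / t for v in row] for row in total]

# Hypothetical data: T = 3 samples, D = 2 sensors, stored in two chunks
chunks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
S = second_moment(chunks, 2)
```

Because the reduce step only adds small matrices, each chunk can be processed where it is stored and only D × D numbers need to travel.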

How we applied Spark (after)
• Training is parallelized by sensors (sensor 1, sensor 2, …, sensor D-1, sensor D)
– The small data S_jk (D × D) is copied to all the nodes
• Evaluation is parallelized by time (map-reduce) over the original data x_tj (T × D)
– The model is copied to all the nodes
– The big data is not copied or moved
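The evaluation step can be pictured as follows, again in plain Python instead of Spark. The sketch assumes the mean squared error as the accuracy measure and a model stored as a D × D coefficient table; both are assumptions, as are all names. Each partition keeps its own rows and returns only per-sensor partial sums.

```python
from functools import reduce

def partition_error(rows, model):
    """Map step: per-sensor sum of squared prediction errors in one partition."""
    d = len(model)
    sq = [0.0] * d
    for row in rows:
        for i in range(d):
            pred = sum(model[i][j] * row[j] for j in range(d) if j != i)
            sq[i] += (pred - row[i]) ** 2
    return sq, len(rows)

def merge(a, b):
    """Reduce step: add the partial sums and the row counts."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]

def mean_squared_error(partitions, model):
    """Per-sensor mean squared error over all partitions."""
    sq, n = reduce(merge, (partition_error(p, model) for p in partitions))
    return [s / n for s in sq]

# Hypothetical 2-sensor model: sensor 0 = 2 * sensor 1, sensor 1 = 0.5 * sensor 0
model = [[0.0, 2.0], [0.5, 0.0]]
partitions = [[[2.0, 1.0], [4.0, 2.0]], [[6.0, 4.0]]]  # rows stay in their partition
mse = mean_squared_error(partitions, model)
```

The small `model` plays the role of the data copied to every node, while the rows in `partitions` are never moved between partitions.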

Why we did not use Spark MLlib
Item | Spark MLlib | Our method | Decision | Reason
LASSO regression algorithm | SGD | Shooting algorithm | Implement by ourselves using RDD | (maybe) better accuracy when T >> D
Cross validation framework | Random split | Block split | Implement by ourselves using RDD | To avoid overfitting (specific to time-series)
• Cross validation for usual data (random sampling): train and test samples are picked at random positions of x_tj
• Cross validation for time-series data (block sampling): train and test sets are contiguous blocks of x_tj
• Balance optimization of CV: the original RDD (x_tj) is divided into test 1 to test 4; a map step applies model 1 to model 4 and yields Pred1 to Pred4 in the prediction RDD; a reduce step computes the average prediction accuracy
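The difference between the two splitting schemes can be sketched with index lists only. The helper names and the fold count are hypothetical. The usual argument for block splitting is that neighboring samples of a time series tend to be strongly correlated, so a random split can place near-copies of the test samples in the training set.

```python
import random

def block_folds(t, k):
    """Split indices 0..t-1 into k contiguous blocks (time order kept)."""
    bounds = [t * f // k for f in range(k + 1)]
    return [list(range(bounds[f], bounds[f + 1])) for f in range(k)]

def random_folds(t, k, seed=0):
    """Usual CV: shuffle the indices first, then deal them into k folds."""
    idx = list(range(t))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[f::k]) for f in range(k)]

def train_test(folds, f):
    """Fold f is the test set; all other folds form the training set."""
    train = [i for g, fold in enumerate(folds) if g != f for i in fold]
    return train, folds[f]

blocks = block_folds(10, 4)
train, test = train_test(blocks, 1)
```

With 10 samples and 4 folds, the second block is the contiguous test range 2 to 4, and the training set is everything before and after it.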

Performance
Figure: Model computation time (50 sensors, 10k). Execution time (seconds) on 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores.
Figure: Model computation time with various data sizes. Execution time (s) on the same three configurations; number of samples: 10000, 20000, 40000, 80000, 160000.
• Speed up by 7.8 times
• 16 times larger data can be handled within the same time.
Item | Specification
Processor | Intel(R) Xeon(R) E5-2680 0, 2.70GHz
Cores / node | 32 (2 processors x 8 cores x 2 hyper-threads)
Memory / node | 32GB
NW | 1Gb Ethernet
OS | Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64
JVM | IBM® Java 1.8.0
Spark | Version 1.5.0, standalone scheduler
Hadoop (HDFS) | Version 2.6.0

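Lessons learned (in time-series handling)
• A sliding window is not part of the core RDD API; MLlib provides it through RDDFunctions
– e.g., the elements 1, 2, 3, … become the windows (1,2,3), (2,3,4), (3,4,5), …
import org.apache.spark.mllib.rdd.RDDFunctions._
// sc is an existing SparkContext; each window is an Array of 3 consecutive elements
val x = sc.parallelize(1 to 1000).sliding(3)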

• Pitfall: order preservation in RDD operations
– sliding window: OK
– map-reduce: OK
– join: not preserved (Bug!); in the slide's example, joining 1, 2, 3 with a, b, c gives pairs in the order (4,d), (3,c), (1,a)
– zip: preserved (OK); the pairs come out as (1,a), (2,b), (3,c)

• Alternative APIs
– DataFrame (Spark MLlib)
– DStream (Spark Streaming)
– TimeSeriesRDD (Cloudera Spark TS)
• Is it better to use a higher-level API instead of RDD for future extensions?

But in most cases, Spark programming is easy and fun.
Thank you!

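• Java and all Java-related trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries.
• Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.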

Data
• Data is a high-dimensional time series generated by sensors
• Typical sizes (long in the vertical direction)
– D: number of sensors < 1k
– T: number of samples ~ 1M or more
– File size: ~ 1GB or more
• Data is processed in batch
Time | Sensor 1 | … | Sensor D
01:10:23 456 | 0.10 | … | -0.91
01:10:23 556 | 0.15 | … | -0.99
01:10:23 656 | 0.12 | … | -0.87
01:10:23 756 | 0.17 | … | -0.54
… | … | … | …
23:59:59 956 | -0.49 | … | -0.29
(T rows, D sensor columns)

Architecture
Figure: logical architecture. Components: model creation tool GUI, model creation tool server, driver, executors; links: Java RMI, Spark, HDFS.
Figure: physical architecture. Client PC, master server, worker servers, storages.
Figure: frameworks / middleware. OS, JVM (JRE), HDFS, Spark (standalone scheduler), other libraries, model creation engine (ML), model creation tool server.

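Parallelization design
• Nature of the computation
– Training: depends only on the matrix S (D × D), not on the large original data x (T × D)
– Evaluation: a map-reduce whose elements are the samples (one row, D values) of the original data x (T × D)
– Both can be computed independently for each sensor (the variable to be predicted)
• When the hyper-parameter search loop is parallelized
– A copy of the original data is needed on every node
– → It may not fit in the memory of a single node
• When a whole iteration is parallelized by sensor
– A copy of the original data is needed on every node
– → It may not fit in the memory of a single node
• When training is parallelized by sensor and evaluation is parallelized by time
– The matrix S and the model are shared by all nodes → possible because they are small
– Evaluation is a typical map-reduce → the original data can be stored in a distributed manner
Figure: hyper-parameter search loop. Training uses S_jk (D × D) and produces the model; evaluation (eval.) uses x_tj (T × D).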

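Modeling method
Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy (hyper-parameter search loop: training on S_jk → model → evaluation on x_tj)
• Training: a linear regression model is built from the data using LASSO regression (least squares + L1 regularization)
– Variable i is the response variable (prediction target); the variables other than i are the explanatory variables
$$\min_{\{a\}} g_i, \quad \text{where} \quad g_i = \frac{1}{T}\sum_{t=1}^{T}\Big(\sum_{j\neq i}^{D} x_{tj}\,a_{ji} - x_{ti}\Big)^{2} + \lambda\sum_{j\neq i}^{D}\lvert a_{ji}\rvert$$
– The coefficients {a_ji} are determined by the shooting algorithm so as to minimize g_i
– The hyper-parameter λ is a suitably small number (decided later)
– In addition, the following optimization is applied (S_jk is computed in advance, outside the loop)
$$\min_{a} g_i, \quad \text{where} \quad g_i = \frac{1}{T}\sum_{k=1}^{D}\sum_{j=1}^{D} b_{ji}\,b_{ki}\,S_{jk} + \lambda\sum_{j\neq i}^{D}\lvert b_{ji}\rvert,$$
$$b_{ji} = \begin{cases} a_{ji}, & (j \neq i) \\ -1, & (j = i) \end{cases}, \qquad S_{jk} = \sum_{t=1}^{T} x_{tj}\,x_{tk}$$
– Computational cost: roughly O(D^3) per variable
• Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on separate data)
– Computational cost: O(TD) per variable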
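A minimal sketch of a shooting (coordinate descent) update that works on S alone, in plain Python. It is not the authors' implementation. It assumes S is the 1/T-normalized matrix S_jk = (1/T) Σ_t x_tj x_tk, so the 1/T prefactor is already folded in; the function names, the sweep limit, and the tolerance are hypothetical. Setting the derivative of the objective with respect to a single a_j to zero gives a soft-threshold step with threshold λ/2.

```python
def soft_threshold(x, gamma):
    """Shrink x toward zero by gamma (the L1 proximal step)."""
    if x > gamma:
        return x - gamma
    if x < -gamma:
        return x + gamma
    return 0.0

def shooting_lasso(S, i, lam, sweeps=100, tol=1e-10):
    """Coefficients a_j (j != i) that minimize
    sum_jk b_j b_k S_jk + lam * sum_j |a_j|, with b_i = -1 and b_j = a_j."""
    d = len(S)
    a = [0.0] * d  # a[i] stays 0 and is never used as a predictor
    for _ in range(sweeps):
        largest_change = 0.0
        for j in range(d):
            if j == i or S[j][j] == 0.0:
                continue
            # Correlation of sensor j with the residual that leaves out j itself
            rho = S[j][i] - sum(S[j][k] * a[k] for k in range(d) if k != i and k != j)
            new = soft_threshold(rho, lam / 2.0) / S[j][j]
            largest_change = max(largest_change, abs(new - a[j]))
            a[j] = new
        if largest_change < tol:
            break
    return a

# Hypothetical 2-sensor case where sensor 0 is exactly 2 * sensor 1
S = [[4.0, 2.0], [2.0, 1.0]]
coeffs = shooting_lasso(S, i=0, lam=1.0)
```

Each sweep touches only the D × D matrix S, so the cost of training does not grow with the number of samples T once S is available.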

Conclusion
• We have developed scalable modeling software for anomaly detection of time series using Spark
– Modeling is done in batch
– We implemented our own LASSO regression algorithm with RDD
– It is optimized for time series with T >> D
• Performance improvements (2 nodes x 32 cores)
– Speed up by 7.8 times
– A 16 times larger data set can be handled within the same time
