Development of Software for Scalable Anomaly Detection Modeling of Time-Series Data Using Apache Spark

Development of Software for scalable anomaly detection modeling of time-series data using Apache Spark.
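We have been researching and developing methods and software that analyze time-series data from sensors monitoring various kinds of equipment and detect anomalies.
The software introduced here learns a model in batch from high-dimensional time-series data obtained from multiple sensors, using linear LASSO regression, and uses that model to identify anomalous periods.
However, growing training time and memory usage became a problem, so we parallelized and distributed the computation with Spark.
Spark has a general-purpose machine learning library called MLlib, but given the particular nature of the algorithm we use, we developed a new implementation based on our existing one.
This talk reports on the design choices made during this development and on the performance measurement results.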
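
To make the detection step concrete, here is a minimal Scala sketch; it is not the authors' implementation. It assumes a linear model whose coefficients have already been learned, predicts one sensor from the others, and flags a sample when the squared difference between the observed and the predicted value exceeds a threshold. The names `AnomalyScore`, `predict`, `score`, and `isAnomaly`, the squared-error score, and the threshold are all assumptions made for illustration.

```scala
object AnomalyScore {
  /** Predicts sensor `target` as a linear combination of the other sensors.
    * `coeffs(j)` is the (hypothetical) learned weight of sensor j. */
  def predict(sample: Array[Double], coeffs: Array[Double], target: Int): Double =
    sample.indices.filter(_ != target).map(j => coeffs(j) * sample(j)).sum

  /** Squared difference between the observed and the predicted value. */
  def score(sample: Array[Double], coeffs: Array[Double], target: Int): Double = {
    val diff = sample(target) - predict(sample, coeffs, target)
    diff * diff
  }

  /** Flags the sample when the score exceeds a user-chosen threshold (an assumption). */
  def isAnomaly(sample: Array[Double], coeffs: Array[Double],
                target: Int, threshold: Double): Boolean =
    score(sample, coeffs, target) > threshold
}
```

With coefficients `Array(0.0, 0.5, 0.25)` and target sensor 0, the sample `Array(2.0, 2.0, 4.0)` is predicted exactly, while `Array(5.0, 2.0, 4.0)` deviates from the prediction by 3 and would be flagged for any threshold below 9.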

Published in: Data & Analytics

  1. Development of software for scalable anomaly detection modeling of time-series data using Apache Spark. Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe, IBM Research – Tokyo. 2016/02/08, Spark Conference Japan.
  2. ©2015 IBM Corporation, 10 February 2016. How we detect anomalies. System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D, which measure quantities such as temperature, acceleration, pressure, and density. Sensor values are correlated.
  3. How we detect anomalies. System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D, which measure quantities such as temperature, acceleration, pressure, and density. Sensor values are correlated, and the correlation changes in an anomalous situation. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  4. How we detect anomalies. System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D. Sensor values are correlated, and the correlation changes in an anomalous situation. A prediction model of correct behavior predicts the value of Sensor A from the other sensors B, C, and D. The predicted sensor value is compared with the observed value; if the two differ, it is an anomaly. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  5. How we detect anomalies. System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D. Sensor values are correlated, and the correlation changes in an anomalous situation. A prediction model of correct behavior predicts the value of Sensor A from the other sensors B, C, and D. The predicted sensor value is compared with the observed value; if the two differ, it is an anomaly. Motivation: the prediction model is computed in advance by machine learning, which takes a very long time and requires much memory. Goal: improve the scalability with Spark. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  6. How we applied Spark (before). Training: a linear model using LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\bigr)^2 + \lambda\sum_{j\neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy.
  7. How we applied Spark (before). Training: a linear model using LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\bigr)^2 + \lambda\sum_{j\neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: a search loop over the hyper-parameter $\lambda$ alternates training, which produces the model, and evaluation.
  8. How we applied Spark (before). Time-series $x_{tj}$: $T \sim 10^6$ or more samples (time), $D \sim 10^2$ sensors (dimensions), i.e., $T \gg D$. Training: a linear model using LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\bigr)^2 + \lambda\sum_{j\neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: a search loop over the hyper-parameter $\lambda$ alternates training, which produces the model, and evaluation; the original time-series data $x_{tj}$ ($T \times D$) is big.
  9. How we applied Spark (before). Time-series $x_{tj}$: $T \sim 10^6$ or more samples (time), $D \sim 10^2$ sensors (dimensions), i.e., $T \gg D$. Training: a linear model using LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\bigr)^2 + \lambda\sum_{j\neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: a search loop over the hyper-parameter $\lambda$ alternates training, which produces the model, and evaluation; the original time-series data $x_{tj}$ ($T \times D$) is big, whereas the matrix $S_{jk} = \frac{1}{T}\sum_{t=1}^{T} x_{tj} x_{tk}$ ($D \times D$) is small and computed in advance.
  10. How we applied Spark (after). Inside the search loop over the hyper-parameter $\lambda$, training is parallelized by sensors (training sensor 1, sensor 2, ..., sensor D-1, sensor D) and evaluation is parallelized by time (map-reduce). The small data ($S_{jk}$, $D \times D$) is copied to all the nodes.
  11. How we applied Spark (after). Inside the search loop over the hyper-parameter $\lambda$, training is parallelized by sensors (training sensor 1, sensor 2, ..., sensor D-1, sensor D) and evaluation is parallelized by time (map-reduce). The small data ($S_{jk}$, $D \times D$) and the model are copied to all the nodes. The big data ($x_{tj}$, $T \times D$) is not copied or moved.
  12. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm. Cross validation framework: Spark MLlib uses a random split; our method uses a block split.
  13. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement by ourselves using RDD; reason: (maybe) better accuracy when $T \gg D$. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement by ourselves using RDD; reason: to avoid overfitting (specific to time-series).
  14. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement by ourselves using RDD; reason: (maybe) better accuracy when $T \gg D$. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement by ourselves using RDD; reason: to avoid overfitting (specific to time-series). Figure: cross validation for usual data (random sampling of $x_{tj}$ into train and test).
  15. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement by ourselves using RDD; reason: (maybe) better accuracy when $T \gg D$. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement by ourselves using RDD; reason: to avoid overfitting (specific to time-series). Figure: cross validation for usual data (random sampling of $x_{tj}$ into train and test). Figure: cross validation for time-series data (block sampling of $x_{tj}$ into train and test).
  16. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement by ourselves using RDD; reason: (maybe) better accuracy when $T \gg D$. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement by ourselves using RDD; reason: to avoid overfitting (specific to time-series). Figure: cross validation for usual data (random sampling of $x_{tj}$ into train and test). Figure: cross validation for time-series data (block sampling of $x_{tj}$ into train and test). Figure: balance optimization of CV; a map over the original RDD of $x_{tj}$ (test blocks 1 to 4, models 1 to 4) yields an RDD of predictions (Pred1 to Pred4), and a reduce computes the average prediction accuracy.
  17. Performance. Chart: model computation time with various data sizes; execution time (s) on 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores, for 10000, 20000, 40000, 80000, and 160000 samples. Chart: model computation time (50 sensors, 10k); execution time (seconds) on 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores. Results: speed-up of 7.8 times; 16 times larger data can be handled within the same time. Test environment: Processor: Intel(R) Xeon(R) E5-2680 0, 2.70GHz; Memory / node: 32GB; Cores / node: 32 (2 processors x 8 cores x 2 hyper-threads); Network: 1Gb Ethernet; OS: Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64; JVM: IBM Java 1.8.0; Spark: version 1.5.0, standalone scheduler; Hadoop (HDFS): version 2.6.0.
  18. Lessons learned (in time-series handling). A sliding window is not part of the RDD API itself; it is provided by MLlib's RDDFunctions. Code: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Diagram: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5), ...
  19. Lessons learned (in time-series handling). A sliding window is not part of the RDD API itself; it is provided by MLlib's RDDFunctions. Code: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Pitfall: order preservation in RDD operations; join does not preserve the order, zip preserves it. Diagram (sliding window, map-reduce): the cases where the order is preserved are marked OK, and the case where it is not preserved is marked as a bug.
  20. Lessons learned (in time-series handling). A sliding window is not part of the RDD API itself; it is provided by MLlib's RDDFunctions. Code: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Pitfall: order preservation in RDD operations; join does not preserve the order, zip preserves it. Diagram (sliding window, map-reduce): the cases where the order is preserved are marked OK, and the case where it is not preserved is marked as a bug. Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API instead of RDD for future extensions?
  21. Lessons learned (in time-series handling). A sliding window is not part of the RDD API itself; it is provided by MLlib's RDDFunctions. Code: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Pitfall: order preservation in RDD operations; join does not preserve the order, zip preserves it. Diagram (sliding window, map-reduce): the cases where the order is preserved are marked OK, and the case where it is not preserved is marked as a bug. Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API instead of RDD for future extensions? But in most cases, Spark programming is easy and fun. Thank you!
  22. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries. Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  23. Data. The data is a high-dimensional time-series generated by sensors. Typical sizes (long in the vertical direction): D, the number of sensors, is below 1k; T, the number of samples, is about 1M or more; the file size is about 1GB or more. Data is processed in batch. Example table of T rows and D sensor columns (Time, Sensor 1, …, Sensor D): 01:10:23 456, 0.10, …, -0.91; 01:10:23 556, 0.15, …, -0.99; 01:10:23 656, 0.12, …, -0.87; 01:10:23 756, 0.17, …, -0.54; …; 23:59:59 956, -0.49, …, -0.29.
  24. Architecture (diagram). Physical architecture: client PC, master server, worker servers, and storages, hosting the model creation tool GUI, the model creation tool server, the driver, and the executors, connected through Java RMI, Spark, and HDFS. Logical architecture: the model creation tool server and the model creation engine (ML) run on frameworks / middleware (Spark with the standalone scheduler, HDFS, other libraries), the JVM (JRE), and the OS.
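  25. Design of parallelization. Nature of the computation: training depends only on the matrix S ($D \times D$), not on the large original data x ($T \times D$); evaluation is a map-reduce whose elements are the samples (one row of D values) of the original data x ($T \times D$); both can be computed independently for each sensor (the variable to be predicted). Parallelizing the hyper-parameter search loop: every node needs a copy of the original data, which may not fit in the memory of one node. Parallelizing a whole iteration by sensor: every node needs a copy of the original data, which may not fit in the memory of one node. Parallelizing training by sensor and evaluation by time: the matrix S and the model are shared by all nodes, which is possible because they are small; evaluation is a typical map-reduce, so the original data can be distributed. Diagram: hyper-parameter search loop over training and evaluation, with $S_{jk}$ ($D \times D$), $x_{tj}$ ($T \times D$), and the model.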
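  26. Modeling method. Overall structure: search for the hyper-parameter $\lambda$ that gives the best prediction accuracy. Training: a linear regression model is built from the data using LASSO regression (least squares + L1 regularization), with variable i as the response variable (prediction target) and the variables other than i as explanatory variables: $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\bigr)^2 + \lambda\sum_{j\neq i}^{D} |a_{ji}|$. The coefficients $\{a_{ji}\}$ are determined by the shooting algorithm so as to minimize $g_i$. The hyper-parameter $\lambda$ is a suitably small number (decided later). The following optimization is also applied, with $S_{jk}$ computed beforehand outside the loop: $\min_{a} g_i$, where $g_i = \frac{1}{T}\sum_{k=1}^{D}\sum_{j=1}^{D} b_{ji} b_{ki} S_{jk} + \lambda\sum_{j\neq i}^{D} |b_{ji}|$, $b_{ji} = a_{ji}$ for $j \neq i$, $b_{ji} = -1$ for $j = i$, and $S_{jk} = \sum_{t=1}^{T} x_{tj} x_{tk}$; the double sum runs over all j and k, including i, so that the $b_{ii} = -1$ terms reproduce the $x_{ti}$ terms of the squared error. Computational cost: roughly $O(D^3)$ per variable. Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on separate data). Computational cost: $O(TD)$ per variable.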
  27. Conclusion. We have developed scalable modeling software for anomaly detection on time-series data using Spark: modeling is done in batch; we implemented our own LASSO regression algorithm with RDDs; it is optimized for time-series with $T \gg D$. Performance improvements (2 nodes x 32 cores): speed-up of 7.8 times; a 16 times larger data set can be handled within the same time.
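
The training step described on slides 6 to 9 and 26 can be illustrated with a small Scala sketch that uses plain collections rather than Spark RDDs. It first computes $S_{jk} = \frac{1}{T}\sum_t x_{tj} x_{tk}$ and then runs cyclic coordinate descent with soft-thresholding, which is the usual form of the shooting algorithm; this is a sketch under that assumption, not the authors' code. The names `ShootingLasso`, `gram`, `softThreshold`, and `fit`, as well as the iteration limit and tolerance, are hypothetical.

```scala
object ShootingLasso {
  /** S_jk = (1/T) * sum_t x_tj * x_tk, computed once before the lambda search loop. */
  def gram(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val d = rows.head.length
    val t = rows.length.toDouble
    val s = Array.ofDim[Double](d, d)
    for (x <- rows; j <- 0 until d; k <- 0 until d) s(j)(k) += x(j) * x(k)
    s.map(_.map(_ / t))
  }

  /** Soft-thresholding operator: shrinks z toward zero by gamma. */
  def softThreshold(z: Double, gamma: Double): Double =
    if (z > gamma) z - gamma
    else if (z < -gamma) z + gamma
    else 0.0

  /** Coordinate descent for target sensor i, using only S (never the raw data).
    * Minimizes sum_{j,k} b_j b_k S_jk + lambda * sum_{j != i} |a_j| with b_i = -1.
    * Entry i of the result is unused and stays 0. */
  def fit(s: Array[Array[Double]], i: Int, lambda: Double,
          maxIter: Int = 1000, tol: Double = 1e-10): Array[Double] = {
    val d = s.length
    val a = Array.fill(d)(0.0)
    var iter = 0
    var maxChange = Double.MaxValue
    while (iter < maxIter && maxChange > tol) {
      maxChange = 0.0
      for (j <- 0 until d if j != i && s(j)(j) > 0.0) {
        // rho_j = S_ji - sum over k other than i and j of a_k * S_jk
        var rho = s(j)(i)
        for (k <- 0 until d if k != i && k != j) rho -= a(k) * s(j)(k)
        val updated = softThreshold(rho, lambda / 2.0) / s(j)(j)
        maxChange = math.max(maxChange, math.abs(updated - a(j)))
        a(j) = updated
      }
      iter += 1
    }
    a
  }
}
```

For rows in which sensor 0 always equals half of sensor 1, `ShootingLasso.fit(ShootingLasso.gram(rows), 0, 0.0)` should return a coefficient close to 0.5 for sensor 1, and a sufficiently large `lambda` drives it to exactly zero. Because `fit` reads only the D x D matrix, one call per target sensor can run independently, which matches the by-sensor parallelization on slide 10.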

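
The block split that slides 12 to 16 contrast with MLlib's random split can be sketched as follows. Each fold tests on one contiguous range of time indices and trains on the rest, presumably so that strongly correlated neighbouring samples do not end up on both sides of the split. This is an illustration in plain Scala, not the authors' RDD implementation; the names `BlockCrossValidation` and `blockFolds` are hypothetical.

```scala
object BlockCrossValidation {
  /** Splits the time indices 0 until n into `folds` contiguous test blocks.
    * Returns one (trainIndices, testIndices) pair per fold. */
  def blockFolds(n: Int, folds: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(folds >= 2 && folds <= n, "need 2 <= folds <= n")
    (0 until folds).map { f =>
      // Block boundaries; Long arithmetic avoids overflow for large n.
      val start = (f.toLong * n / folds).toInt
      val end = ((f + 1).toLong * n / folds).toInt
      val test: Seq[Int] = start until end
      val train: Seq[Int] = (0 until start) ++ (end until n)
      (train, test)
    }
  }
}
```

With `n = 8` and `folds = 4`, the second fold tests on indices 2 and 3 and trains on 0, 1, 4, 5, 6, and 7; every index appears in exactly one test block.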