Development of software for scalable anomaly detection modeling of
time-series data using Apache Spark
Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe,
IBM Research – Tokyo
2016/02/08, Spark Conference Japan
(Japanese title: Development of model-training software for scalable anomaly detection of time-series data using Apache Spark)
©2015 IBM Corporation, 10 February 2016
How we detect anomaly
System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D (temperature, acceleration, pressure, density)
• Sensor values are correlated
→ The correlation changes in an anomalous situation
• Prediction model of correct behavior
– The value of Sensor A is predicted from the other sensors B, C, and D
• Compare the predicted sensor value with the observed value
→ It is an anomaly if the two are different
Motivation:
The prediction model is computed in advance by machine learning.
This takes a very long time and requires much memory.
→ Improve the scalability with Spark!
• T. Idé, et al., SDM 2009.
• T. Idé, IBM ProVISION No. 78, 2013
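The detection step described above (predict one sensor from the others, then compare with the observed value) can be sketched in plain Scala as follows. This is only an illustration: the names `AnomalySketch`, `predict`, `isAnomaly`, and the fixed `threshold` are assumptions made for this example and do not come from the slides.

```scala
object AnomalySketch {
  // Predict the target sensor as a weighted sum of the other sensors.
  // `weights` stands for an already trained coefficient vector (hypothetical).
  def predict(others: Array[Double], weights: Array[Double]): Double = {
    require(others.length == weights.length, "one weight per input sensor")
    others.zip(weights).map { case (x, a) => x * a }.sum
  }

  // Flag a sample when the observed value deviates from the prediction
  // by more than `threshold` (an illustrative cutoff, not from the talk).
  def isAnomaly(observed: Double, others: Array[Double],
                weights: Array[Double], threshold: Double): Boolean =
    math.abs(observed - predict(others, weights)) > threshold
}
```

For example, `AnomalySketch.isAnomaly(5.0, Array(2.0, 4.0, 8.0), Array(0.5, 0.25, 0.0), 0.5)` compares the observed value 5.0 of Sensor A with the value predicted from Sensors B, C, and D.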
How we applied Spark (before)
• Time-series $x_{tj}$
– $T \sim 10^6$ or more samples (time)
– $D \sim 10^2$ sensors (dimensions)
– (i.e., $T \gg D$)
• Training: a linear model using LASSO regression (least squares + L1 regularization)
– $\min_{\{a\}} g_i$, where
– $g_i = \frac{1}{T}\sum_{t=1}^{T}\Bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\Bigr)^2 + \lambda \sum_{j\neq i}^{D} |a_{ji}|$
– Hyper-parameter λ (tuned later to achieve the best prediction accuracy)
• Evaluation: cross validation of prediction accuracy
– Other data is used to test the model
• Search loop over the hyper-parameter λ: training → model → evaluation
• Computed in advance (small, D × D): $S_{jk} = \frac{1}{T}\sum_{t=1}^{T} x_{tj} x_{tk}$
• Original time-series data (big, T × D): $x_{tj}$
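The matrix $S_{jk}$ that is computed in advance can be obtained in a single pass over the samples, which is what makes it cheap compared with the T × D data. The following plain-Scala sketch shows the idea with ordinary collections; `CovarianceSketch` and `secondMoment` are names invented for this example, and a Spark version would apply the same map and reduce steps to an RDD of rows.

```scala
object CovarianceSketch {
  // S(j)(k) = (1/T) * sum over t of x(t)(j) * x(t)(k)
  def secondMoment(x: Seq[Array[Double]]): Array[Array[Double]] = {
    require(x.nonEmpty, "at least one sample is needed")
    val d = x.head.length
    val t = x.length.toDouble
    val sum = x
      // map: each sample contributes a D x D outer product
      .map(row => Array.tabulate(d, d)((j, k) => row(j) * row(k)))
      // reduce: the outer products are summed element-wise
      .reduce((a, b) => Array.tabulate(d, d)((j, k) => a(j)(k) + b(j)(k)))
    sum.map(_.map(_ / t))
  }
}
```

The result has only D × D entries regardless of T, so it is small enough to be kept on every node.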
How we applied Spark (after)
• Search loop over the hyper-parameter λ: training → model → evaluation
• Training is parallelized by sensors (training for sensor 1, sensor 2, ..., sensor D-1, sensor D)
– The small data $S_{jk}$ (D × D) is copied to all the nodes
• Evaluation is parallelized by time (map-reduce)
– The model is copied to all the nodes
– The big data $x_{tj}$ (T × D) is not copied or moved
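The evaluation step, parallelized by time, has the shape of a map-reduce over samples: each row yields a squared prediction error, and the errors are summed. The sketch below expresses this with plain Scala collections; `EvaluationSketch`, `meanSquaredError`, and the use of mean squared error as the accuracy measure are assumptions for illustration, since the slides do not name the metric.

```scala
object EvaluationSketch {
  // Mean squared prediction error for target sensor i.
  // a(j) is the coefficient of sensor j; a(i) is ignored.
  def meanSquaredError(x: Seq[Array[Double]], a: Array[Double], i: Int): Double = {
    require(x.nonEmpty, "at least one sample is needed")
    val (sum, count) = x
      // map: one (squared error, 1) pair per sample
      .map { row =>
        val predicted = row.indices.filter(_ != i).map(j => row(j) * a(j)).sum
        val err = predicted - row(i)
        (err * err, 1L)
      }
      // reduce: add up errors and sample counts
      .reduce((l, r) => (l._1 + r._1, l._2 + r._2))
    sum / count
  }
}
```

Because each sample is processed independently and only the small coefficient vector is shared, the rows can stay where they are stored.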
Why we did not use Spark MLlib
Item | Spark MLlib | Our method | Decision | Reason
LASSO regression | SGD | Shooting algorithm | Implement by ourselves using RDD | (Maybe) better accuracy when T >> D
Cross-validation framework | Random split | Block split | Implement by ourselves using RDD | To avoid overfitting (specific to time-series)
Figure: cross validation for usual data (random sampling of $x_{tj}$ into train and test)
Figure: cross validation for time-series data (block sampling of $x_{tj}$ into train and test)
Figure: balance optimization of CV. The original RDD ($x_{tj}$) is mapped with models 1 to 4 into a prediction RDD (Pred1 to Pred4), which is reduced against tests 1 to 4 to give the average prediction accuracy.
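The block split used for time-series cross validation can be sketched as below: the sample indices are cut into contiguous blocks, and each fold tests on one block. The names `BlockSplitSketch` and `folds` are hypothetical, and training on all remaining samples is an assumption; the slides only state that the split is by block rather than random.

```scala
object BlockSplitSketch {
  // Split sample indices 0 until t into k contiguous blocks. Fold f uses
  // block f as the test range and all remaining samples for training.
  def folds(t: Int, k: Int): Seq[(Seq[Int], Range)] = {
    require(k > 0 && t >= k, "need at least one sample per block")
    // Block boundaries; Long arithmetic avoids overflow for large t.
    val bounds = (0 to k).map(f => (f.toLong * t / k).toInt)
    (0 until k).map { f =>
      val test = bounds(f) until bounds(f + 1)
      val train = (0 until bounds(f)) ++ (bounds(f + 1) until t)
      (train, test)
    }
  }
}
```

Keeping test samples contiguous avoids the situation where a randomly chosen test sample sits between two nearly identical training samples, which is the overfitting risk the table refers to.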
Performance
Figure: Model computation time (50 sensors, 10k). Execution time (seconds) for 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores.
Figure: Model computation time with various data sizes. Execution time (s) for 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores; number of samples: 10000, 20000, 40000, 80000, 160000.
• Speed up by 7.8 times
• 16 times larger data can be handled within the same time.
Item | Specification
Processor | Intel(R) Xeon(R) E5-2680 0, 2.70GHz
Cores / node | 32 (2 processors x 8 cores x 2 hyper-threads)
Memory / node | 32GB
NW | 1Gb Ethernet
OS | Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64
JVM | IBM® Java 1.8.0
Spark | Version 1.5.0, standalone scheduler
Hadoop (HDFS) | Version 2.6.0
Lessons learned (in time-series handling)
• Sliding window is not part of the core RDD API; it is provided by MLlib's RDDFunctions
– Example: 1, 2, 3, ... → (1,2,3), (2,3,4), (3,4,5)
import org.apache.spark.mllib.rdd.RDDFunctions._
// sc is an existing SparkContext
val x = sc.parallelize(1 to 1000).sliding(3)
• Pitfall: order preservation in RDD operations
– join (not preserved)
– zip (preserved)
– If the order is not preserved: sliding window → Bug!; map-reduce → OK
– If the order is preserved: sliding window → OK; map-reduce → OK
• Alternative APIs
– DataFrame (Spark MLlib)
– DStream (Spark Streaming)
– TimeSeriesRDD (Cloudera Spark TS)
→ Is it better to use a higher-level API instead of RDD for future extensions?
But in most cases, Spark programming is easy and fun.
Thank you!
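One way to guard against the order-preservation pitfall listed in the lessons learned is to carry an explicit time index with every sample and sort by it before any order-sensitive step such as a sliding window. The sketch below shows the idea with plain Scala collections rather than RDDs; `OrderSketch` and its methods are hypothetical names. On Spark, the analogous building blocks would be `zipWithIndex` on the RDD and `sortByKey` after the join, though the talk does not say which remedy the authors used.

```scala
object OrderSketch {
  // Attach an explicit time index to every sample.
  def indexed[A](xs: Seq[A]): Seq[(Long, A)] =
    xs.zipWithIndex.map { case (x, i) => (i.toLong, x) }

  // Join two indexed series on the index. The arrival order of either
  // input does not matter because the result is sorted by index.
  def joinByIndex[A, B](left: Seq[(Long, A)],
                        right: Seq[(Long, B)]): Seq[(Long, (A, B))] = {
    val lookup = right.toMap
    left
      .flatMap { case (i, a) => lookup.get(i).map(b => (i, (a, b))) }
      .sortBy(_._1)
  }

  // Sliding windows over values whose order has been restored.
  def windows[A](sorted: Seq[(Long, A)], size: Int): Seq[Seq[A]] =
    sorted.map(_._2).sliding(size).toSeq
}
```

The extra index costs one value per sample but makes the correctness of the window independent of how the join happens to order its output.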
• Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries.
• Intel, Intel logo, Intel Inside, Intel Inside logo, Centrino, Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Data
• Data is a high-dimensional time series generated by sensors
• Typical sizes (long in the vertical direction)
– D: number of sensors < 1k
– T: number of samples ~ 1M or more
– File size: ~ 1GB or more
• Data is processed in batch
Time | Sensor 1 | … | Sensor D
01:10:23 456 | 0.10 | … | -0.91
01:10:23 556 | 0.15 | … | -0.99
01:10:23 656 | 0.12 | … | -0.87
01:10:23 756 | 0.17 | … | -0.54
… | … | … | …
23:59:59 956 | -0.49 | … | -0.29
(T rows, D sensor columns)
Architecture
Physical architecture: Client PC; Master server; Worker servers; Storages
Logical architecture: Model creation tool GUI; Model creation tool server and Driver; Executors
Frameworks / Middleware: Java RMI; Spark; HDFS
Software stack: OS; JVM (JRE); HDFS; Spark (standalone scheduler); other libraries; Model creation engine (ML); Model creation tool server
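Design of parallelization
• Nature of the computation
– Training: depends only on the matrix S (D×D), not on the large original data x (T×D)
– Evaluation: a map-reduce whose elements are the samples (one row of D values) of the original data x (T×D)
– Both can be computed independently for each sensor (the variable to be predicted)
• When the hyper-parameter search loop is parallelized
– Every node needs a copy of the original data
– → It may not fit in the memory of one node
• When one whole iteration is parallelized by sensor
– Every node needs a copy of the original data
– → It may not fit in the memory of one node
• When training is parallelized by sensor and evaluation is parallelized by time
– The matrix S and the model are shared by all nodes → possible because they are small
– Evaluation is a typical map-reduce → the original data can be distributed
Figure: hyper-parameter search loop ($S_{jk}$ (D × D) → training → model → eval. on $x_{tj}$ (T × D))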
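Modeling method
Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy
Figure: hyper-parameter search loop ($S_{jk}$ (D × D) → training → model → eval. on $x_{tj}$ (T × D))
• Training: build a linear regression model from the data using LASSO regression (least squares + L1 regularization)
– Variable i is the response variable (prediction target); the variables other than i are the explanatory variables
  • $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\Bigl(\sum_{j\neq i}^{D} x_{tj} a_{ji} - x_{ti}\Bigr)^2 + \lambda \sum_{j\neq i}^{D} |a_{ji}|$
  • The coefficients $\{a_{ji}\}$ are determined by the shooting algorithm so as to minimize $g_i$
  • The hyper-parameter λ is an appropriately small number (decided later)
– The following optimization is also applied ($S_{jk}$ is computed in advance, outside the loop)
  • $\min_{a} g_i$, where $g_i = \frac{1}{T}\sum_{k=1}^{D}\sum_{j=1}^{D} b_{ji} b_{ki} S_{jk} + \lambda \sum_{j\neq i}^{D} |b_{ji}|$ (the double sum runs over all variables, including i, so that $b_{ii} = -1$ supplies the $-x_{ti}$ term)
  • $b_{ji} = \begin{cases} a_{ji}, & j \neq i \\ -1, & j = i \end{cases}$, $S_{jk} = \sum_{t=1}^{T} x_{tj} x_{tk}$
  • Computational cost: roughly $O(D^3)$ per variable
• Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on separate data)
– Computational cost: $O(TD)$ per variable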
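Because the objective above depends on the data only through $S_{jk}$, the shooting algorithm (coordinate descent) can run on the D × D matrix alone. The sketch below is one way to write it in plain Scala; it is not the authors' implementation. The update rule $a_{mi} \leftarrow \mathrm{soft}(\rho_m, \lambda/2) / c_{mm}$ with $\rho_m = -\sum_{k \neq m} b_{ki} c_{mk}$ is derived here from the objective rather than taken from the slides. The name `ShootingSketch`, the fixed number of sweeps, and the zero initial values are assumptions, and `c` denotes the matrix including the $1/T$ factor.

```scala
object ShootingSketch {
  // Soft-thresholding operator: shrinks z toward zero by gamma.
  def softThreshold(z: Double, gamma: Double): Double =
    math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

  // Coordinate descent for target variable i using only the D x D matrix
  // c(j)(k) = (1/T) * sum over t of x(t)(j) * x(t)(k).
  // Returns b with b(i) = -1 and b(j) = a_ji for j != i.
  def fit(c: Array[Array[Double]], i: Int, lambda: Double,
          sweeps: Int = 100): Array[Double] = {
    val d = c.length
    val b = Array.fill(d)(0.0)
    b(i) = -1.0
    for (_ <- 0 until sweeps; m <- 0 until d if m != i && c(m)(m) > 0.0) {
      // Contribution of all other coordinates; the k = i term carries b(i) = -1.
      val dot = (0 until d).filter(_ != m).map(k => b(k) * c(m)(k)).sum
      val rho = -dot
      b(m) = softThreshold(rho, lambda / 2.0) / c(m)(m)
    }
    b
  }
}
```

Each sweep touches every pair of variables once, so the cost per sweep grows with D² and is independent of the number of samples T.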
Conclusion
• We have developed scalable modeling software for anomaly detection of time-series data using Spark
– Modeling is done in batch
– Implemented our own LASSO regression algorithm with RDD
– Optimized for time series with T >> D
• Performance improvements (2 nodes x 32 cores)
– Speed up by 7.8 times
– A 16 times larger data set can be handled within the same time
More Related Content

What's hot

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsShankar M S
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningAsim Jalis
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Databricks
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
Accelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsAccelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsDatabricks
 
[212]big models without big data using domain specific deep networks in data-...
[212]big models without big data using domain specific deep networks in data-...[212]big models without big data using domain specific deep networks in data-...
[212]big models without big data using domain specific deep networks in data-...NAVER D2
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkJan Wiegelmann
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsSteven Gustafson
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowLi Jin
 

What's hot (20)

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environments
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Neural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep LearningNeural Networks, Spark MLlib, Deep Learning
Neural Networks, Spark MLlib, Deep Learning
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Accelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsAccelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFs
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
[212]big models without big data using domain specific deep networks in data-...
[212]big models without big data using domain specific deep networks in data-...[212]big models without big data using domain specific deep networks in data-...
[212]big models without big data using domain specific deep networks in data-...
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache Arrow
 

Viewers also liked

2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016
2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 20162016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016
2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016Yu Ishikawa
 
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Hadoop / Spark Conference Japan
 
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Hadoop / Spark Conference Japan
 
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始め
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始めHadoop Conference Japan 2016 LT資料 グラフデータベース事始め
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始めオラクルエンジニア通信
 
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Hadoop / Spark Conference Japan
 
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan
 
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016Nagato Kasaki
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16Yifeng Jiang
 

Viewers also liked (8)

2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016
2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 20162016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016
2016-02-08 Spark MLlib Now and Beyond@Spark Conference Japan 2016
 
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
 
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
 
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始め
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始めHadoop Conference Japan 2016 LT資料 グラフデータベース事始め
Hadoop Conference Japan 2016 LT資料 グラフデータベース事始め
 
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
 
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
 
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16
 

Similar to Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発

Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0vithakur
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABCodeOps Technologies LLP
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Jason Dai
 
Scaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale ArchitecturesScaling Deep Learning Algorithms on Extreme Scale Architectures
Scaling Deep Learning Algorithms on Extreme Scale Architecturesinside-BigData.com
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 

Development of software for scalable anomaly detection modeling of time-series data using Apache Spark

  • 1. Development of software for scalable anomaly detection modeling of time-series data using Apache Spark. Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe, IBM Research – Tokyo. 2016/02/08, Spark Conference Japan.
  • 2. ©2015 IBM Corporation, 10 February 2016. How we detect anomalies: the system under monitoring (e.g., a factory plant) has Sensors A, B, C, and D, which measure quantities such as temperature, acceleration, pressure, and density. The sensor values are correlated.
  • 3. How we detect anomalies: the sensor values are correlated, and the correlation changes when an anomaly occurs. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  • 4. How we detect anomalies: a prediction model of correct behavior predicts the value of Sensor A from the other sensors B, C, and D. The predicted sensor value is compared with the observed value; if the two differ, it is an anomaly.
  • 5. Motivation: the prediction model is computed in advance by machine learning, which takes a very long time and requires much memory. The goal is to improve the scalability with Spark.
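The comparison of predicted and observed values described above can be sketched in plain Scala. This is only an illustration, not the authors' code: the linear form of the prediction follows the model used later in the deck, while the names `predict`, `isAnomaly`, and the fixed `threshold` are assumptions made for this example.

```scala
object AnomalyScore {
  // Predicts sensor i from the other sensors with linear coefficients a
  // (the entry a(i) is ignored). Hypothetical helper for illustration.
  def predict(x: Array[Double], a: Array[Double], i: Int): Double =
    x.indices.filter(_ != i).map(j => x(j) * a(j)).sum

  // Flags a sample when the prediction error for sensor i exceeds a threshold.
  // The threshold rule is an assumption; the slides only say "if the two are different".
  def isAnomaly(x: Array[Double], a: Array[Double], i: Int, threshold: Double): Boolean =
    math.abs(predict(x, a, i) - x(i)) > threshold
}
```

A sample whose observed value stays close to the prediction is treated as normal; a large gap suggests that the usual correlation between the sensors has changed.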
  • 6. How we applied Spark (before). Training: a linear model is fitted using LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\Big(\sum_{j\neq i}^{D} x_{tj}a_{ji} - x_{ti}\Big)^2 + \lambda\sum_{j\neq i}^{D}|a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy.
  • 7. How we applied Spark (before). Evaluation: cross validation of prediction accuracy, where other data is used to test the model. Training produces a model and evaluation scores it, inside the search loop over the hyper-parameter $\lambda$.
  • 8. How we applied Spark (before). The time-series $x_{tj}$ has $T \sim 10^6$ or more samples (time) and $D \sim 10^2$ sensors (dimensions), i.e., $T \gg D$. The original time-series data ($T \times D$) is big.
  • 9. How we applied Spark (before). The matrix $S_{jk} = \frac{1}{T}\sum_{t=1}^{T} x_{tj}x_{tk}$ ($D \times D$, small) is computed in advance from the original time-series data $x_{tj}$ ($T \times D$, big).
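The precomputation of $S_{jk}$ is a single pass over the samples: each sample contributes an outer product, and the contributions are summed. A minimal sketch in plain Scala follows; the object and function names are assumptions for this example. With Spark, the same two steps could plausibly be expressed with `map` and `reduce` on an RDD of rows, but that is not shown in the slides.

```scala
object GramMatrix {
  type Row = Array[Double]            // one sample x_t with D sensor values
  type Matrix = Array[Array[Double]]  // D x D

  // Outer product x_t x_t^T of a single sample (the "map" step).
  def outer(x: Row): Matrix =
    x.map(xj => x.map(xk => xj * xk))

  // Element-wise sum of two D x D matrices (the "reduce" step).
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (u, v) => u + v } }

  // S_jk = (1/T) * sum over t of x_tj * x_tk
  def gram(samples: Seq[Row]): Matrix = {
    require(samples.nonEmpty, "need at least one sample")
    val t = samples.length.toDouble
    samples.map(outer).reduce(add).map(_.map(_ / t))
  }
}
```

Because the result is only $D \times D$, it stays small no matter how many samples $T$ are processed.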
  • 10. How we applied Spark (after). Training is parallelized by sensors (training sensor 1, sensor 2, ..., sensor D-1, sensor D), and evaluation is parallelized by time (map-reduce), inside the search loop over the hyper-parameter $\lambda$. The small data $S_{jk}$ ($D \times D$) is copied to all the nodes.
  • 11. How we applied Spark (after). The model is also copied to all the nodes. The big data $x_{tj}$ ($T \times D$) is not copied or moved.
  • 12. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm. Cross validation framework: Spark MLlib uses a random split; our method uses a block split.
  • 13. Why we did not use Spark MLlib. Decision for both LASSO regression and the cross validation framework: implement by ourselves using RDD. Reasons: the shooting algorithm gives (maybe) better accuracy when $T \gg D$; the block split avoids overfitting (specific to time-series).
  • 14. Why we did not use Spark MLlib. Cross validation for usual data (random sampling): train and test samples are picked at random from $x_{tj}$.
  • 15. Why we did not use Spark MLlib. Cross validation for time-series data (block sampling): $x_{tj}$ is split into train and test blocks along the time axis.
  • 16. Why we did not use Spark MLlib. Balance optimization of CV: the original RDD $x_{tj}$ is divided into test blocks 1 to 4; a map step applies models 1 to 4 to produce the prediction RDD (Pred1 to Pred4), and a reduce step averages the prediction accuracy.
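The block split for time-series cross validation can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the name `folds`, the choice of equally sized blocks, and the use of plain index ranges instead of RDDs are all made up for this example.

```scala
object BlockCrossValidation {
  // Splits time indices 0 until t into k blocks of consecutive samples.
  // Fold f uses block f as the test range and all remaining indices as training data.
  def folds(t: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k > 0 && k <= t, "need 1 <= k <= t")
    // Block boundaries; Long arithmetic avoids overflow for large t.
    val bounds = (0 to k).map(f => (f.toLong * t / k).toInt)
    (0 until k).map { f =>
      val test = bounds(f) until bounds(f + 1)
      val train = (0 until bounds(f)) ++ (bounds(f + 1) until t)
      (train, test)
    }
  }
}
```

Keeping each test block consecutive in time means neighboring, strongly correlated samples are not scattered across train and test, which is the overfitting risk the slides attribute to a random split.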
  • 17. Performance. Figure: model computation time (execution time in seconds; 50 sensors, 10k) on 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores. Figure: model computation time with various data sizes (number of samples: 10000, 20000, 40000, 80000, 160000) on the same three configurations. Results: speed-up by 7.8 times; 16 times larger data can be handled within the same time. Test environment: processor Intel(R) Xeon(R) E5-2680 0, 2.70GHz; memory per node 32GB; cores per node 32 (2 processors x 8 cores x 2 hyper-threads); network 1Gb Ethernet; OS Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64; JVM IBM Java 1.8.0; Spark version 1.5.0, standalone scheduler; Hadoop (HDFS) version 2.6.0.
  • 18. Lessons learned (in time-series handling). A sliding window is not in the core RDD API; it comes from MLlib's RDDFunctions: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Example: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5), ...
  • 19. Lessons learned (in time-series handling). Pitfall: order preservation in RDD operations. join does not preserve order; zip preserves it. If order is not preserved, a subsequent sliding window is a bug, while map-reduce is still OK; if order is preserved, both are OK.
  • 20. Lessons learned (in time-series handling). Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API instead of RDD for future extensions?
  • 21. Lessons learned (in time-series handling). But in most cases, Spark programming is easy and fun. Thank you!
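One defensive pattern against the ordering pitfall is to attach an explicit time index before any key-based pairing and to sort by that index afterwards. The sketch below uses plain Scala collections to illustrate the idea; the names `windows`, `indexed`, and `joinByTime` are hypothetical. On RDDs, `zipWithIndex`, `join`, and `sortByKey` would be the likely counterparts, though the slides do not say which approach the authors took.

```scala
object OrderSafeJoin {
  // Sliding windows of width w over an ordered series.
  def windows(xs: Seq[Int], w: Int): Seq[Seq[Int]] = xs.sliding(w).toList

  // Attach an explicit time index to every element so order can be restored later.
  def indexed[A](xs: Seq[A]): Seq[(Int, A)] =
    xs.zipWithIndex.map { case (x, t) => (t, x) }

  // Join two keyed series on the time index and restore time order explicitly.
  // The hash-based lookup stands in for a join whose output order is unspecified.
  def joinByTime[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, (A, B))] = {
    val lookup = right.toMap
    left.collect { case (t, a) if lookup.contains(t) => (t, (a, lookup(t))) }
      .sortBy(_._1)
  }
}
```

Sorting after the join costs an extra pass, but it makes the time order explicit, so a later sliding window does not depend on how the join happened to arrange its output.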
  • 22.
  • 23. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries. Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  • 24.
  • 25. Data. The data is a high-dimensional time-series generated by sensors. Typical sizes (long in the vertical direction): D, the number of sensors, is < 1k; T, the number of samples, is ~ 1M or more; the file size is ~ 1GB or more. The data is processed in batch. Layout: T rows by D sensor columns, with columns Time, Sensor 1, …, Sensor D; example rows: (01:10:23 456, 0.10, …, -0.91), (01:10:23 556, 0.15, …, -0.99), (01:10:23 656, 0.12, …, -0.87), (01:10:23 756, 0.17, …, -0.54), …, (23:59:59 956, -0.49, …, -0.29).
  • 26. Architecture. Physical architecture: client PC, master server, worker servers, storages. Logical architecture: model creation tool GUI, connected via Java RMI to the model creation tool server; Spark driver and executors; HDFS. Frameworks / middleware: OS, JVM (JRE), HDFS, other libraries, Spark with the standalone scheduler, model creation engine (ML), model creation tool server.
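  • 27. Design of parallelization. Properties of the computation: training depends only on the matrix $S$ ($D \times D$) and not on the large original data $x$ ($T \times D$); evaluation is a map-reduce whose elements are the samples (one row, $D$ values) of the original data $x$ ($T \times D$); both can be computed independently for each sensor (the variable to be predicted). Option 1, parallelizing the hyper-parameter search loop: a copy of the original data is needed on every node, and it may not fit in the memory of one node. Option 2, parallelizing a whole iteration by sensor: a copy of the original data is needed on every node, and it may not fit in the memory of one node. Option 3, parallelizing training by sensor and evaluation by time: the matrix $S$ and the model are shared by all nodes, which is possible because they are small; evaluation is a typical map-reduce, so the original data can be distributed.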
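  • 28. Modeling method. Overall structure: a search for the hyper-parameter $\lambda$ that gives the best prediction accuracy. Training: a linear regression model is built from the data using LASSO regression (least squares + L1 regularization). Variable $i$ is the response variable (the prediction target), and the variables other than $i$ are the explanatory variables: $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T}\sum_{t=1}^{T}\Big(\sum_{j\neq i}^{D} x_{tj}a_{ji} - x_{ti}\Big)^2 + \lambda\sum_{j\neq i}^{D}|a_{ji}|$. The coefficients $\{a_{ji}\}$ are determined by the shooting algorithm so as to minimize $g_i$. The hyper-parameter $\lambda$ is a suitably small number (decided later). In addition, the following optimization is applied, with $S_{jk}$ computed beforehand outside the loop: $\min_{a} g_i$, where $g_i = \frac{1}{T}\sum_{k=1}^{D}\sum_{j=1}^{D} b_{ji}b_{ki}S_{jk} + \lambda\sum_{j\neq i}^{D}|b_{ji}|$, with $b_{ji} = a_{ji}$ for $j \neq i$, $b_{ji} = -1$ for $j = i$, and $S_{jk} = \sum_{t=1}^{T} x_{tj}x_{tk}$. Computational cost: roughly $O(D^3)$ per variable. Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on separate data). Computational cost: $O(TD)$ per variable.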
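To make the role of $S_{jk}$ in training concrete, here is a minimal coordinate-descent ("shooting") sketch in plain Scala that touches only the $D \times D$ matrix and never the raw samples. It is a textbook-style update under stated assumptions, not the authors' code: `s(j)(k)` is assumed to already include the $1/T$ factor, the fixed number of `sweeps` replaces a real convergence test, and all names are hypothetical. Minimizing $g_i$ over one coefficient with the others fixed gives $a_{ji} \leftarrow \mathrm{soft}(\rho_j, \lambda/2) / S_{jj}$ with $\rho_j = S_{ji} - \sum_{k \neq i,j} a_{ki} S_{jk}$.

```scala
object ShootingLasso {
  // Soft-thresholding operator: shrinks z toward zero by gamma.
  def softThreshold(z: Double, gamma: Double): Double =
    if (z > gamma) z - gamma
    else if (z < -gamma) z + gamma
    else 0.0

  // Coordinate descent for target sensor i using only the D x D matrix s,
  // where s(j)(k) = (1/T) * sum over t of x_tj * x_tk.
  // Returns the coefficients a_ji; the entry a(i) stays 0.
  def fit(s: Array[Array[Double]], i: Int, lambda: Double, sweeps: Int = 100): Array[Double] = {
    val d = s.length
    val a = Array.fill(d)(0.0)
    for (_ <- 0 until sweeps) {
      for (j <- 0 until d if j != i && s(j)(j) > 0.0) {
        // rho = S_ji - sum over k (k != i, k != j) of a_k * S_jk
        val others = (0 until d).filter(k => k != i && k != j).map(k => a(k) * s(j)(k)).sum
        a(j) = softThreshold(s(j)(i) - others, lambda / 2.0) / s(j)(j)
      }
    }
    a
  }
}
```

Each sweep here costs $O(D^2)$ and is independent of $T$, which is why the training step can run per sensor on any node that holds a copy of the small matrix.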
  • 29. Conclusion. We have developed scalable modeling software for anomaly detection of time-series using Spark: modeling is done in batch; we implemented our own LASSO regression algorithm with RDD; it is optimized for time-series with $T \gg D$. Performance improvements (2 nodes x 32 cores): speed-up by 7.8 times; a 16 times larger data set can be handled within the same time.