Development of software for scalable anomaly detection modeling of
time-series data using Apache Spark
Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe,
IBM Research – Tokyo
2016/02/08, Spark Conference Japan
(Japanese title: Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発)
©2015 IBM Corporation, 10 February 2016
How we detect anomalies
• System under monitoring (e.g., a factory plant) with Sensors A, B, C, and D
  – Example quantities: temperature, acceleration, pressure, density
• Sensor values are correlated
  → The correlation changes in an anomalous situation
• Prediction model of correct behavior
  – The value of Sensor A is predicted from the other sensors B, C, and D
• Compare the predicted sensor value with the observed value
  → It is an anomaly if the two are different
Motivation:
• The prediction model is computed in advance by machine learning.
• This takes a very long time and requires much memory.
  → Improve the scalability with Spark!
References:
• T. Idé, et al., SDM 2009.
• T. Idé, IBM ProVISION No. 78, 2013.
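
The detection rule described on this slide (predict one sensor from the others, then compare with the observation) can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the linear form of the prediction follows the LASSO model used later in the deck, while the names `predict`, `score`, `isAnomaly`, the coefficient values, and the use of a fixed `threshold` are hypothetical and not taken from the slides.

```scala
object AnomalyScoreSketch {
  /** Predict sensor i from the other sensors with a linear model.
    * coef(j) is the weight of sensor j; coef(i) is ignored. */
  def predict(row: Array[Double], coef: Array[Double], i: Int): Double =
    row.indices.filter(_ != i).map(j => row(j) * coef(j)).sum

  /** Anomaly score: absolute gap between the observed and the predicted value. */
  def score(row: Array[Double], coef: Array[Double], i: Int): Double =
    math.abs(row(i) - predict(row, coef, i))

  /** Hypothetical decision rule: flag the sample when the gap exceeds a threshold. */
  def isAnomaly(row: Array[Double], coef: Array[Double], i: Int, threshold: Double): Boolean =
    score(row, coef, i) > threshold

  def main(args: Array[String]): Unit = {
    // Made-up model: sensor 0 is about 0.5 * sensor 1 + 2.0 * sensor 2
    val coef = Array(0.0, 0.5, 2.0)
    val normal = Array(4.0, 2.0, 1.5) // 0.5 * 2.0 + 2.0 * 1.5 = 4.0, matches the observation
    val broken = Array(9.0, 2.0, 1.5) // observation is far from the prediction
    println(isAnomaly(normal, coef, 0, threshold = 1.0))
    println(isAnomaly(broken, coef, 0, threshold = 1.0))
  }
}
```

How the threshold is chosen is left open here; the slides only state that a difference between prediction and observation indicates an anomaly.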
How we applied Spark (before)
• Time-series $x_{tj}$
  – $T \sim 10^6$ or more samples (time)
  – $D \sim 10^2$ sensors (dimensions)
  – (i.e., $T \gg D$)
• Training: a linear model using LASSO regression (least squares + L1 regularization)
  – $\min_{\{a\}} g_i$, where
    $$ g_i = \frac{1}{T} \sum_{t=1}^{T} \Bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \Bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}| $$
  – Hyper-parameter $\lambda$ (tuned later to achieve the best prediction accuracy)
  – Computed in advance (small, $D \times D$) from the original time-series data (big, $T \times D$):
    $$ S_{jk} = \frac{1}{T} \sum_{t=1}^{T} x_{tj} x_{tk} $$
• Evaluation: cross validation of prediction accuracy
  – Other data is used to test the model
• Search loop of hyper-parameter $\lambda$: training ($S_{jk}$) → model → evaluation ($x_{tj}$)
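
The matrix $S_{jk}$ can be accumulated in a single pass over the samples, because it is a sum of per-sample outer products. The sketch below shows that shape with plain Scala collections, where each inner `Seq` stands in for one data partition. The names `SecondMomentSketch`, `addOuter`, `merge`, and `secondMoment` are hypothetical; on Spark, the same pair of functions would be the natural arguments to `RDD.aggregate`, but the slides do not say how the authors computed it.

```scala
object SecondMomentSketch {
  type Matrix = Array[Array[Double]]

  /** Add the outer product of one sample (a row of D sensor values) to acc. */
  def addOuter(acc: Matrix, row: Array[Double]): Matrix = {
    for (j <- row.indices; k <- row.indices) acc(j)(k) += row(j) * row(k)
    acc
  }

  /** Merge two partial sums (the "reduce" side). */
  def merge(a: Matrix, b: Matrix): Matrix = {
    for (j <- a.indices; k <- a.indices) a(j)(k) += b(j)(k)
    a
  }

  /** S_jk = (1/T) * sum_t x_tj * x_tk, accumulated partition by partition. */
  def secondMoment(partitions: Seq[Seq[Array[Double]]], d: Int): Matrix = {
    val t = partitions.map(_.size).sum
    require(t > 0, "at least one sample is needed")
    val partials = partitions.map(_.foldLeft(Array.fill(d, d)(0.0))(addOuter))
    val total = partials.foldLeft(Array.fill(d, d)(0.0))(merge)
    total.map(_.map(_ / t))
  }

  def main(args: Array[String]): Unit = {
    // Two "partitions" holding one sample each, D = 2
    val data = Seq(Seq(Array(1.0, 2.0)), Seq(Array(3.0, 4.0)))
    secondMoment(data, 2).foreach(r => println(r.mkString(" ")))
  }
}
```

Only the D × D partial sums travel between partitions, which is why the result is small even when T is large.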
How we applied Spark (after)
Search loop of hyper-parameter λ:
• Training is parallelized by sensor (training for sensor 1, sensor 2, …, sensor D−1, sensor D)
  – Input: $S_{jk}$ (D × D). The small data is copied to all the nodes.
• Evaluation is parallelized by time (map-reduce)
  – Input: $x_{tj}$ (T × D). Big data is not copied or moved.
  – The model is copied to all the nodes.
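
The "by time (map-reduce)" half of this design can be sketched as follows: every sample is mapped to a vector of squared prediction errors (one per sensor), and the vectors are summed. Plain Scala collections are used so the block runs on its own; on Spark, `rows` would be an RDD, `map` and `reduce` would be the RDD methods of the same names, and the small model would be shipped to the executors (for example with `SparkContext.broadcast`). The layout `model(i)(j)` for the coefficient $a_{ji}$ and all function names are assumptions made for this illustration.

```scala
object EvaluationSketch {
  /** Squared prediction error of every sensor for one sample (the "map" step).
    * model(i)(j) holds a_ji, the weight of sensor j when predicting sensor i. */
  def squaredErrors(row: Array[Double], model: Array[Array[Double]]): Array[Double] =
    row.indices.map { i =>
      val predicted = row.indices.filter(_ != i).map(j => row(j) * model(i)(j)).sum
      val diff = predicted - row(i)
      diff * diff
    }.toArray

  /** Element-wise sum of two error vectors (the "reduce" step). */
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  /** Mean squared error per sensor over all samples. */
  def meanSquaredError(rows: Seq[Array[Double]], model: Array[Array[Double]]): Array[Double] =
    rows.map(squaredErrors(_, model)).reduce(add).map(_ / rows.size)

  def main(args: Array[String]): Unit = {
    // Made-up model for D = 2: sensor 0 = 2.0 * sensor 1, sensor 1 = 0.5 * sensor 0
    val model = Array(Array(0.0, 2.0), Array(0.5, 0.0))
    val rows = Seq(Array(2.0, 1.0), Array(4.0, 1.0))
    println(meanSquaredError(rows, model).mkString(" "))
  }
}
```

Because each sample is processed independently and only D-length vectors are combined, the big T × D data can stay where it is stored.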
Why we did not use Spark MLlib

| | Spark MLlib | Our method | Decision | Reason |
|---|---|---|---|---|
| LASSO regression | SGD | Shooting algorithm | Implement by ourselves using RDD | (maybe) better accuracy when T >> D |
| Cross validation framework | Random split | Block split | Implement by ourselves using RDD | To avoid overfitting (specific to time-series) |

[Figure: cross validation for usual data (random sampling). Train and test samples of $x_{tj}$ are interleaved along the time axis.]
[Figure: cross validation for time-series data (block sampling). Train and test samples of $x_{tj}$ form contiguous blocks along the time axis.]
[Figure: balance optimization of CV. The original RDD ($x_{tj}$) is mapped with models 1 to 4 into a prediction RDD (Pred1 to Pred4); test blocks 1 to 4 are then reduced to the average prediction accuracy.]
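
The block split in the table can be sketched as an index computation: the time axis is cut into k contiguous blocks, and each block serves once as the test set. Neighbouring samples of a time series tend to be strongly correlated, so a random split would place near-copies of the test samples in the training set; contiguous blocks reduce that leakage. The function name `blockFolds` and the choice of returning index sequences are assumptions for this illustration, not the authors' implementation.

```scala
object BlockSplitSketch {
  /** Split sample indices 0 until t into k contiguous blocks.
    * Returns one (trainIndices, testIndices) pair per fold. */
  def blockFolds(t: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k > 0 && t >= k, "need at least one sample per block")
    // Block boundaries; Long arithmetic avoids overflow for large t
    val bounds = (0 to k).map(f => (f.toLong * t / k).toInt)
    (0 until k).map { f =>
      val test = bounds(f) until bounds(f + 1)
      val train = (0 until bounds(f)) ++ (bounds(f + 1) until t)
      (train, test)
    }
  }

  def main(args: Array[String]): Unit = {
    blockFolds(10, 4).foreach { case (train, test) =>
      println(s"test=${test.mkString(",")} train=${train.mkString(",")}")
    }
  }
}
```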
Performance
[Figure: Model computation time (50 sensors, 10k). Execution time (seconds) on 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores.]
[Figure: Model computation time with various data sizes. Execution time (s) on the same three configurations for 10000, 20000, 40000, 80000, and 160000 samples.]
• Speed up by 7.8 times
• 16 times larger data can be handled within the same time

| Item | Specification |
|---|---|
| Processor | Intel(R) Xeon(R) E5-2680 0, 2.70GHz |
| Cores / node | 32 (2 processors x 8 cores x 2 hyper-threads) |
| Memory / node | 32GB |
| Network | 1Gb Ethernet |
| OS | Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64 |
| JVM | IBM® Java 1.8.0 |
| Spark | Version 1.5.0, standalone scheduler |
| Hadoop (HDFS) | Version 2.6.0 |
Lessons learned (in time-series handling)
• Sliding window is not in the RDD API itself; MLlib's RDDFunctions provides it:
    // sc is an existing SparkContext; the import adds sliding() to RDDs
    import org.apache.spark.mllib.rdd.RDDFunctions._
    val x = sc.parallelize(1 to 1000).sliding(3)
  [Figure: elements 1, 2, 3, … become the windows (1,2,3), (2,3,4), (3,4,5)]
• Pitfall: order preservation in RDD operations
  – join (not preserved)
  – zip (preserved)
  [Figure: when the order is preserved, both sliding window and map-reduce are OK; when it is not preserved, map-reduce is still OK but sliding window is a bug.]
• Alternative APIs
  – DataFrame (Spark MLlib)
  – DStream (Spark Streaming)
  – TimeSeriesRDD (Cloudera Spark TS)
  – Is it better to use a higher-level API instead of RDD for future extensions?
But in most cases, Spark programming is easy and fun.
Thank you!
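
The order-preservation pitfall from the lessons-learned slide can be reproduced without a cluster. In the sketch below, a sliding window is only meaningful on time-ordered input, so samples that come back from an order-scrambling step are re-sorted by an explicit time index before windowing. The names `windows` and `orderedWindows` and the `(index, value)` pair layout are hypothetical; on an RDD, such an index could come from a timestamp column or from `zipWithIndex`.

```scala
object SlidingOrderSketch {
  /** Windows of `size` consecutive elements; only meaningful on time-ordered input. */
  def windows[A](xs: Seq[A], size: Int): Seq[Seq[A]] =
    xs.sliding(size).filter(_.size == size).toSeq

  /** Restore the time order from (index, value) pairs before windowing. */
  def orderedWindows[A](indexed: Seq[(Long, A)], size: Int): Seq[Seq[A]] =
    windows(indexed.sortBy(_._1).map(_._2), size)

  def main(args: Array[String]): Unit = {
    println(windows(Seq(1, 2, 3, 4, 5), 3))
    // Pairs as they might arrive after an operation that does not keep the order
    val scrambled = Seq((2L, "c"), (0L, "a"), (3L, "d"), (1L, "b"))
    println(windows(scrambled.map(_._2), 2)) // windows over the wrong neighbours
    println(orderedWindows(scrambled, 2))    // windows over true neighbours
  }
}
```

Map-reduce style aggregations such as sums are unaffected by the scrambling, which matches the slide's observation that only the sliding-window case breaks.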
• Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries.
• Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Data
• Data is a high-dimensional time series generated by sensors
• Typical sizes (long in the vertical direction)
  – D: number of sensors < 1k
  – T: number of samples ~ 1M or more
  – File size: ~ 1GB or more
• Data is processed in batch

| Time | Sensor 1 | … | Sensor D |
|---|---|---|---|
| 01:10:23 456 | 0.10 | … | -0.91 |
| 01:10:23 556 | 0.15 | … | -0.99 |
| 01:10:23 656 | 0.12 | … | -0.87 |
| 01:10:23 756 | 0.17 | … | -0.54 |
| … | … | … | … |
| 23:59:59 956 | -0.49 | … | -0.29 |

(T rows, D sensor columns)
Architecture
[Figure: physical architecture. Machines: client PC, master server, worker servers, storages. Components: model creation tool GUI, model creation tool server, Driver, Executors. Connections: Java RMI, Spark, HDFS.]
[Figure: logical architecture (frameworks / middleware). Layers shown: OS; JVM (JRE); HDFS, Spark (standalone scheduler), other libraries; model creation engine (ML); model creation tool server.]
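Parallelization design
• Nature of the computation
  – Training: depends only on the matrix S (D × D), not on the large original data x (T × D)
  – Evaluation: a map-reduce whose elements are the samples (one row of D values) of the original data x (T × D)
  – Both can be computed independently for each sensor (the variable to be predicted)
• When parallelizing the hyper-parameter search loop
  – A copy of the original data is needed on every node
  – → It may not fit in the memory of a single node
• When parallelizing a whole iteration by sensor
  – A copy of the original data is needed on every node
  – → It may not fit in the memory of a single node
• Training is parallelized by sensor, Evaluation by time
  – The matrix S and the model are shared by all nodes → feasible because they are small
  – Evaluation is a typical map-reduce → the original data can be stored in a distributed manner
[Figure: hyper-parameter search loop. $S_{jk}$ (D × D) feeds training, which produces the model; the model and $x_{tj}$ (T × D) feed evaluation.]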
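
A rough size estimate illustrates why sharing S and the model across all nodes is cheap while copying x is not. It assumes 8-byte double-precision values (an assumption; the slides do not state the storage format) and uses the typical sizes $T \sim 10^6$ and $D \sim 10^2$ quoted in the deck.

```latex
\begin{align*}
\text{size}(x) &\approx T \cdot D \cdot 8\ \text{bytes}
  = 10^{6} \cdot 10^{2} \cdot 8\ \text{bytes}
  = 8 \times 10^{8}\ \text{bytes} \approx 0.8\ \text{GB} \\
\text{size}(S) &\approx D^{2} \cdot 8\ \text{bytes}
  = 10^{4} \cdot 8\ \text{bytes}
  = 8 \times 10^{4}\ \text{bytes} = 80\ \text{kB}
\end{align*}
```

The model coefficients $a_{ji}$ also form a D × D array, so they are of the same order as S. The size of x grows linearly with T, whereas S and the model do not depend on T at all.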
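Modeling method
Overall structure: search for the hyper-parameter λ that gives the best prediction accuracy
• Training: build a linear regression model from the data using LASSO regression (least squares + L1 regularization)
  – Variable i is the response variable (the prediction target); the variables other than i are the explanatory variables
    $$ \min_{\{a\}} g_i, \quad g_i = \frac{1}{T} \sum_{t=1}^{T} \Bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \Bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}| $$
    · The coefficients $\{a_{ji}\}$ are determined by the shooting algorithm so as to minimize $g_i$
    · The hyper-parameter λ is a suitably small number (decided later)
  – In addition, the following optimization is applied ($S_{jk}$ is computed beforehand, outside the loop)
    $$ \min_{a} g_i, \quad g_i = \frac{1}{T} \sum_{k=1}^{D} \sum_{j=1}^{D} b_{ji} b_{ki} S_{jk} + \lambda \sum_{j \neq i}^{D} |b_{ji}|, $$
    $$ b_{ji} = \begin{cases} a_{ji}, & (j \neq i) \\ -1, & (j = i) \end{cases}, \qquad S_{jk} = \sum_{t=1}^{T} x_{tj} x_{tk} $$
    · The double sum runs over all j and k, including i, so that $b_{ii} = -1$ reproduces the $-x_{ti}$ term; on this slide $S_{jk}$ is written without the factor $1/T$, which appears explicitly in $g_i$
    · Computational cost: roughly $O(D^3)$ per variable
• Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on other data)
  – Computational cost: $O(TD)$ per variable
[Figure: hyper-parameter search loop. $S_{jk}$ (D × D) feeds training, which produces the model; the model and $x_{tj}$ (T × D) feed evaluation.]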
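
The training step can be sketched as coordinate-wise (shooting-style) updates that touch only the D × D matrix S, never the T × D data. The block below is an illustration under assumptions, not the authors' code: it takes S already normalized by $1/T$, so that the objective for sensor i is $\sum_{j,k} b_{ji} b_{ki} S_{jk} + \lambda \sum_{j \neq i} |a_{ji}|$, and each coordinate update is $a_{ji} \leftarrow \mathrm{soft}(\rho_j, \lambda/2) / S_{jj}$ with $\rho_j = S_{ji} - \sum_{k \neq i, j} S_{jk} a_{ki}$. The names `softThreshold` and `lasso`, the stopping rule, and the default `maxSweeps` and `tol` values are hypothetical.

```scala
object ShootingSketch {
  /** Soft-thresholding operator: shrinks rho toward zero by gamma. */
  def softThreshold(rho: Double, gamma: Double): Double =
    if (rho > gamma) rho - gamma
    else if (rho < -gamma) rho + gamma
    else 0.0

  /** Coordinate descent for sensor i using only S (D x D),
    * where S_jk = (1/T) * sum_t x_tj * x_tk.
    * Returns the coefficients a_ji indexed by j, with a(i) fixed to 0. */
  def lasso(s: Array[Array[Double]], i: Int, lambda: Double,
            maxSweeps: Int = 1000, tol: Double = 1e-10): Array[Double] = {
    val d = s.length
    val a = Array.fill(d)(0.0)
    var sweep = 0
    var maxChange = Double.MaxValue
    while (sweep < maxSweeps && maxChange > tol) {
      maxChange = 0.0
      for (j <- 0 until d if j != i && s(j)(j) > 0.0) {
        // rho_j = S_ji - sum over k (k != i, k != j) of S_jk * a_k
        var rho = s(j)(i)
        for (k <- 0 until d if k != i && k != j) rho -= s(j)(k) * a(k)
        val updated = softThreshold(rho, lambda / 2.0) / s(j)(j)
        maxChange = math.max(maxChange, math.abs(updated - a(j)))
        a(j) = updated
      }
      sweep += 1
    }
    a
  }

  def main(args: Array[String]): Unit = {
    // S for data where sensor 0 is always 2.0 and sensor 1 is always 1.0
    val s = Array(Array(4.0, 2.0), Array(2.0, 1.0))
    println(lasso(s, 0, lambda = 0.0).mkString(" "))
    println(lasso(s, 0, lambda = 1.0).mkString(" "))
  }
}
```

Each sweep costs on the order of $D^2$ operations per target sensor, so the work depends only on D and not on the number of samples T.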
Conclusion
• We have developed scalable modeling software for anomaly detection of time-series data using Spark
  – Modeling is done in batch
  – We implemented our own LASSO regression algorithm with RDD
  – It is optimized for time-series data with T >> D
• Performance improvements (2 nodes x 32 cores)
  – Speed up by 7.8 times
  – A 16 times larger data set can be handled within the same time

A Hands-on Intro to Data Science and R Presentation.ppt
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Development of software for scalable anomaly detection modeling of time-series data using Apache Spark

  • 1. Development of software for scalable anomaly detection modeling of time-series data using Apache Spark. Ryo Kawahara, Toshihiro Takahashi, Hideo Watanabe, IBM Research – Tokyo. 2016/02/08, Spark Conference Japan.
  • 2. ©2015 IBM Corporation, 10 February 2016. How we detect anomalies. System under monitoring (e.g., a factory plant) with sensors A, B, C, and D (temperature, acceleration, pressure, density). Sensor values are correlated.
  • 3. How we detect anomalies. System under monitoring (e.g., a factory plant) with sensors A, B, C, and D (temperature, acceleration, pressure, density). Sensor values are correlated → the correlation changes when an anomaly occurs. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  • 4. How we detect anomalies. System under monitoring (e.g., a factory plant) with sensors A, B, C, and D (temperature, acceleration, pressure, density). Sensor values are correlated → the correlation changes when an anomaly occurs. A prediction model of correct behavior predicts the value of Sensor A from the other sensors B, C, and D. The predicted sensor value is compared with the observed value → it is an anomaly if the two differ. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  • 5. How we detect anomalies. System under monitoring (e.g., a factory plant) with sensors A, B, C, and D (temperature, acceleration, pressure, density). Sensor values are correlated → the correlation changes when an anomaly occurs. A prediction model of correct behavior predicts the value of Sensor A from the other sensors B, C, and D. The predicted sensor value is compared with the observed value → it is an anomaly if the two differ. Motivation: the prediction model is computed in advance by machine learning, which takes a very long time and requires a lot of memory → improve the scalability with Spark. References: T. Idé, et al., SDM 2009; T. Idé, IBM ProVISION No. 78, 2013.
  • 6. How we applied Spark (before). Training: a linear model fitted by LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{t=1}^{T} \bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy.
  • 7. How we applied Spark (before). Training: a linear model fitted by LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{t=1}^{T} \bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: training and evaluation run inside the search loop over the hyper-parameter $\lambda$, which outputs the model.
  • 8. How we applied Spark (before). Time series $x_{tj}$: $T \sim 10^6$ or more samples (time), $D \sim 10^2$ sensors (dimensions), i.e., $T \gg D$. Training: a linear model fitted by LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{t=1}^{T} \bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: training and evaluation run inside the search loop over the hyper-parameter $\lambda$, which outputs the model; the original time-series data $x_{tj}$ ($T \times D$) is big.
  • 9. How we applied Spark (before). Time series $x_{tj}$: $T \sim 10^6$ or more samples (time), $D \sim 10^2$ sensors (dimensions), i.e., $T \gg D$. Training: a linear model fitted by LASSO regression (least squares + L1 regularization): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{t=1}^{T} \bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}|$. The hyper-parameter $\lambda$ is tuned later to achieve the best prediction accuracy. Evaluation: cross validation of prediction accuracy; other data is used to test the model. Diagram: training and evaluation run inside the search loop over the hyper-parameter $\lambda$, which outputs the model. The original time-series data $x_{tj}$ ($T \times D$) is big; the matrix $S_{jk} = \frac{1}{T} \sum_{t=1}^{T} x_{tj} x_{tk}$ ($D \times D$) is small and is computed in advance.
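Why the small matrix can stand in for the big data during training becomes visible when the squared term of $g_i$ is expanded. The following is a worked expansion under the definitions on this slide (with $S_{jk}$ including the $1/T$ factor); it is a derivation sketch added for illustration, not a formula taken from the slides.

```latex
\begin{aligned}
g_i &= \frac{1}{T}\sum_{t=1}^{T}\Bigl(\sum_{j\neq i}^{D} x_{tj}a_{ji} - x_{ti}\Bigr)^{2}
       + \lambda\sum_{j\neq i}^{D}|a_{ji}| \\
    &= \sum_{j\neq i}^{D}\sum_{k\neq i}^{D} a_{ji}a_{ki}S_{jk}
       - 2\sum_{j\neq i}^{D} a_{ji}S_{ji} + S_{ii}
       + \lambda\sum_{j\neq i}^{D}|a_{ji}|
\end{aligned}
```

Every term depends on the data only through $S$, so the sum over the $T$ samples needs to be evaluated once, outside the search loop over $\lambda$.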
  • 10. How we applied Spark (after). Training is parallelized by sensors: the training tasks for sensor 1, sensor 2, ..., sensor D-1, and sensor D run in parallel. Evaluation is parallelized by time (map-reduce). Both run inside the search loop over the hyper-parameter $\lambda$, which outputs the model. The small data $S_{jk}$ ($D \times D$) is copied to all the nodes.
  • 11. How we applied Spark (after). Training is parallelized by sensors: the training tasks for sensor 1, sensor 2, ..., sensor D-1, and sensor D run in parallel. Evaluation is parallelized by time (map-reduce). Both run inside the search loop over the hyper-parameter $\lambda$, which outputs the model. The small data $S_{jk}$ ($D \times D$) is copied to all the nodes, and so is the model. The big data $x_{tj}$ ($T \times D$) is not copied or moved.
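The "by time (map-reduce)" evaluation can be sketched with plain Scala collections standing in for a distributed data set. This is a minimal illustration, not the authors' implementation: the object and function names are hypothetical, and mean squared error is an assumed stand-in for "prediction accuracy", which the slides do not define.

```scala
object MapReduceEvaluation {
  // Squared prediction error for one sample (one row of D sensor values).
  // Sensor i is predicted as the weighted sum of all other sensors.
  def squaredError(row: Array[Double], a: Array[Double], i: Int): Double = {
    val predicted = row.indices.filter(_ != i).map(j => row(j) * a(j)).sum
    val diff = predicted - row(i)
    diff * diff
  }

  // Map: each row becomes (error, 1). Reduce: sum both components.
  // Only the small coefficient vector `a` has to be available to every worker.
  def meanSquaredError(rows: Seq[Array[Double]], a: Array[Double], i: Int): Double = {
    val (sum, count) = rows
      .map(r => (squaredError(r, a, i), 1L))
      .reduce((p, q) => (p._1 + q._1, p._2 + q._2))
    sum / count
  }
}
```

Because each row is processed independently and the reduction is a plain sum, the rows can stay where they are stored, which matches the slide's point that the big data is not copied or moved.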
  • 12. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm. Cross validation framework: Spark MLlib uses a random split; our method uses a block split.
  • 13. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement it ourselves using RDD; reason: (maybe) better accuracy when T >> D. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement it ourselves using RDD; reason: to avoid overfitting (specific to time series).
  • 14. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement it ourselves using RDD; reason: (maybe) better accuracy when T >> D. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement it ourselves using RDD; reason: to avoid overfitting (specific to time series). Figure: cross validation for usual data (random sampling), with train and test samples of $x_{tj}$ interleaved.
  • 15. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement it ourselves using RDD; reason: (maybe) better accuracy when T >> D. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement it ourselves using RDD; reason: to avoid overfitting (specific to time series). Figures: cross validation for usual data (random sampling) and cross validation for time-series data (block sampling), each showing the train and test parts of $x_{tj}$.
  • 16. Why we did not use Spark MLlib. LASSO regression: Spark MLlib uses SGD; our method uses the shooting algorithm; decision: implement it ourselves using RDD; reason: (maybe) better accuracy when T >> D. Cross validation framework: Spark MLlib uses a random split; our method uses a block split; decision: implement it ourselves using RDD; reason: to avoid overfitting (specific to time series). Figures: cross validation for usual data (random sampling) and cross validation for time-series data (block sampling), each showing the train and test parts of $x_{tj}$. Balance optimization of CV: a map step applies models 1 to 4 to the original RDD and produces an RDD of predictions (Pred1 to Pred4); a reduce step over the test blocks (test 1 to test 4) gives the average prediction accuracy.
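The block split can be sketched as follows. This is an illustrative sketch with hypothetical names, not the authors' code; the only thing taken from the slides is that the test data of each fold is a contiguous block of time rather than a random sample.

```scala
object BlockSplit {
  // Split the time indices 0 until t into `folds` contiguous blocks.
  // Fold f uses block f as the test range and all remaining indices for training.
  def blockFolds(t: Int, folds: Int): Seq[(Seq[Int], Seq[Int])] = {
    // Block boundaries; Long arithmetic avoids overflow for large t.
    val bounds = (0 to folds).map(f => (f.toLong * t / folds).toInt)
    (0 until folds).map { f =>
      val test = bounds(f) until bounds(f + 1)
      val train = (0 until bounds(f)) ++ (bounds(f + 1) until t)
      (train, test)
    }
  }
}
```

A likely motivation, stated here as an assumption, is that neighboring samples in a time series are strongly correlated, so a random split would place near-copies of the test samples in the training set and make the accuracy estimate too optimistic.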
  • 17. Performance. Chart 1: model computation time with various data sizes; execution time (s) for 1 node x 1 core, 1 node x 32 cores, and 2 nodes x 32 cores; one series per number of samples (10000, 20000, 40000, 80000, 160000). Chart 2: model computation time (50 sensors, 10k); execution time (seconds) for the same three configurations. Specification: Processor: Intel(R) Xeon(R) E5-2680 0, 2.70GHz; Memory / node: 32GB; Cores / node: 32 (2 processors x 8 cores x 2 hyper-threads); Network: 1Gb Ethernet; OS: Red Hat Enterprise Linux Server release 6.3 (Santiago) x86_64; JVM: IBM Java 1.8.0; Spark: version 1.5.0, standalone scheduler; Hadoop (HDFS): version 2.6.0. Results: speed-up of 7.8 times; 16 times larger data can be handled within the same time.
  • 18. Lessons learned (in time-series handling). A sliding window is not part of RDD itself; it is available through MLlib's RDDFunctions: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Figure: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5).
  • 19. Lessons learned (in time-series handling). A sliding window is not part of RDD itself; it is available through MLlib's RDDFunctions: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Figure: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5). Pitfall: order preservation in RDD operations: join (order not preserved), zip (order preserved). Figure: pairing the elements 1, 2, 3 with a, b, c; a grid of sliding window and map-reduce against not-preserved and preserved order, with one cell marked "Bug!" and three marked "OK".
  • 20. Lessons learned (in time-series handling). A sliding window is not part of RDD itself; it is available through MLlib's RDDFunctions: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Figure: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5). Pitfall: order preservation in RDD operations: join (order not preserved), zip (order preserved). Figure: pairing the elements 1, 2, 3 with a, b, c; a grid of sliding window and map-reduce against not-preserved and preserved order, with one cell marked "Bug!" and three marked "OK". Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API instead of RDD for future extensions?
  • 21. Lessons learned (in time-series handling). A sliding window is not part of RDD itself; it is available through MLlib's RDDFunctions: import org.apache.spark.mllib.rdd.RDDFunctions._; val x = sc.parallelize(1 to 1000).sliding(3). Figure: the elements 1, 2, 3, ... become the windows (1,2,3), (2,3,4), (3,4,5). Pitfall: order preservation in RDD operations: join (order not preserved), zip (order preserved). Figure: pairing the elements 1, 2, 3 with a, b, c; a grid of sliding window and map-reduce against not-preserved and preserved order, with one cell marked "Bug!" and three marked "OK". Alternative APIs: DataFrame (Spark MLlib), DStream (Spark Streaming), TimeSeriesRDD (Cloudera Spark TS). Is it better to use a higher-level API instead of RDD for future extensions? But in most cases, Spark programming is easy and fun. Thank you!
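The two lessons can be illustrated without a cluster by using plain Scala collections as a stand-in for RDDs. This is a sketch under assumptions: the names are hypothetical, and carrying an explicit time index and sorting by it after a keyed pairing is one possible remedy for the ordering pitfall, not something the slides prescribe.

```scala
object OrderSafePairing {
  // Plain-collection analogue of sliding(3): consecutive windows of the given size.
  def windows(xs: Seq[Int], size: Int): List[Seq[Int]] = xs.sliding(size).toList

  // Pair two series by an explicit time index, then sort by that index,
  // so the result does not depend on the order in which the pairs arrive.
  def pairByIndex[A, B](left: Seq[(Long, A)], right: Seq[(Long, B)]): Seq[(A, B)] = {
    val rightByKey = right.toMap
    left
      .collect { case (k, a) if rightByKey.contains(k) => (k, (a, rightByKey(k))) }
      .sortBy(_._1)
      .map(_._2)
  }
}
```

The point of the second function is that an order-sensitive step such as a sliding window should run only after the time order has been restored explicitly.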
  • 23. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates in the United States and other countries. Intel, the Intel logo, Intel Inside, the Intel Inside logo, Centrino, the Intel Centrino logo, Celeron, Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  • 25. Data. The data is a high-dimensional time series generated by sensors. Typical sizes (long in the vertical direction): D, the number of sensors, < 1k; T, the number of samples, ~ 1M or more; file size ~ 1GB or more. Data is processed in batch. Example layout (T rows, D sensor columns): Time | Sensor 1 | … | Sensor D; 01:10:23 456 | 0.10 | … | -0.91; 01:10:23 556 | 0.15 | … | -0.99; 01:10:23 656 | 0.12 | … | -0.87; 01:10:23 756 | 0.17 | … | -0.54; …; 23:59:59 956 | -0.49 | … | -0.29.
  • 26. Architecture. Physical architecture: client PC, master server, worker servers, and storage; components shown: model creation tool GUI, Java RMI, model creation tool server, Spark driver, executors, HDFS. Logical architecture (frameworks / middleware): OS, JVM (JRE), HDFS, Spark with the standalone scheduler, other libraries, model creation engine (ML), model creation tool server.
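  • 27. Parallelization design. Nature of the computation: training depends only on the matrix S (D×D), not on the large original data x (T×D); evaluation is a map-reduce whose elements are the samples (one row, D values) of the original data x (T×D); both can be computed independently for each sensor (the variable to be predicted). Option: parallelize the hyper-parameter search loop: every node needs a copy of the original data → it may not fit in the memory of a single node. Option: parallelize one whole iteration by sensor: every node needs a copy of the original data → it may not fit in the memory of a single node. Option: parallelize training by sensor and evaluation by time: the matrix S and the model are shared by all nodes → possible because they are small; evaluation is a typical map-reduce → the original data can be stored in a distributed way. Diagram: training and evaluation inside the hyper-parameter search loop, with $S_{jk}$ ($D \times D$), $x_{tj}$ ($T \times D$), and the model.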
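  • 28. Modeling method. Overall structure: search for the hyper-parameter $\lambda$ that gives the best prediction accuracy, with training and evaluation inside the hyper-parameter search loop. Training: a linear regression model is built from the data using LASSO regression (least squares + L1 regularization). Variable i is the response variable (the prediction target) and the variables other than i are the explanatory variables: $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{t=1}^{T} \bigl( \sum_{j \neq i}^{D} x_{tj} a_{ji} - x_{ti} \bigr)^2 + \lambda \sum_{j \neq i}^{D} |a_{ji}|$. The coefficients $\{a_{ji}\}$ are determined by the shooting algorithm so as to minimize $g_i$. The hyper-parameter $\lambda$ is a suitably small number (decided later). In addition, the following optimization is applied ($S_{jk}$ is computed in advance, outside the loop): $\min_{\{a\}} g_i$, where $g_i = \frac{1}{T} \sum_{k=1}^{D} \sum_{j=1}^{D} b_{ji} b_{ki} S_{jk} + \lambda \sum_{j \neq i}^{D} |b_{ji}|$, with $b_{ji} = a_{ji}$ for $j \neq i$, $b_{ji} = -1$ for $j = i$, and $S_{jk} = \sum_{t=1}^{T} x_{tj} x_{tk}$. Computational cost: roughly $O(D^3)$ per variable. Evaluation: cross validation (the average per-sample prediction accuracy is evaluated on other data). Computational cost: $O(TD)$ per variable.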
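The training step can be sketched as cyclic coordinate descent with soft-thresholding, which is how the shooting algorithm is commonly described. This is a minimal sketch, not the authors' implementation: the names, the fixed number of sweeps, and the absence of a convergence test are assumptions, and S is taken here to include the 1/T factor, i.e. S(j)(k) = (1/T) Σ_t x(t)(j) x(t)(k).

```scala
object ShootingLasso {
  // Soft-thresholding operator: sign(z) * max(|z| - gamma, 0).
  def softThreshold(z: Double, gamma: Double): Double =
    math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

  // The small D x D matrix, computed once from the T x D data.
  def secondMoments(x: Array[Array[Double]]): Array[Array[Double]] = {
    val t = x.length
    val d = x.head.length
    Array.tabulate(d, d)((j, k) => x.map(row => row(j) * row(k)).sum / t)
  }

  // Coordinate descent for target sensor i. It reads only S, never the raw data.
  // Each update minimizes g_i exactly along one coordinate a(j).
  def fit(s: Array[Array[Double]], i: Int, lambda: Double, sweeps: Int = 100): Array[Double] = {
    val d = s.length
    val a = Array.fill(d)(0.0) // a(i) stays 0 and is not part of the model
    for (_ <- 0 until sweeps; j <- 0 until d if j != i && s(j)(j) > 0.0) {
      val others = (0 until d).filter(k => k != i && k != j).map(k => a(k) * s(j)(k)).sum
      val rho = s(j)(i) - others
      a(j) = softThreshold(rho, lambda / 2.0) / s(j)(j)
    }
    a
  }
}
```

Since `fit` takes only S, i, and λ, one call per sensor is an independent task, which fits the per-sensor parallelization of training described in the slides.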
  • 29. Conclusion. We have developed scalable modeling software for anomaly detection in time series using Spark: modeling is done in batch; we implemented our own LASSO regression algorithm with RDD; it is optimized for time series with T >> D. Performance improvements (2 nodes x 32 cores): a speed-up of 7.8 times; a 16 times larger data set can be handled within the same time.