Real-time health score app using Spark on Kubernetes

Daeyoung Kim - BISTel Research
Seungchul Lee - BISTel Research

Agenda
Introduction to BISTel and GrandView APM
Real-time health score application
▪ What is a health score?
▪ Real-time streaming service
▪ Spark on Kubernetes
Conclusions

Introduction to BISTel
https://www.bistel.com/

BISTel’s business areas
Providing customers with analytic solutions for the Smart Factory, based on Artificial
Intelligence (AI) and Big Data

BISTel’s application areas
Adaptive Intelligence for Smart Manufacturing

GrandView® APM
https://www.grandview-apm.com/

Detailed Insights are One Click Away

Asset Health Scores in Predictive Maintenance

What is a health score in smart manufacturing?
▪ A health score represents a machine’s status, derived by analyzing data
records from multiple sensors.
▪ It can serve as a core metric of a prognostics and health
management (PHM) system for predicting a machine’s lifespan.
▪ Various machine learning algorithms can be used to compute a health
score in the manufacturing industry.
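
As an illustration of the first bullet, here is a minimal sketch of one possible health score: the mean absolute z-score of the current sensor readings against a per-sensor baseline. This is not the algorithm used in GrandView APM; the class name, the baseline arrays and the convention that a higher score means a larger deviation from normal are all assumptions made for this example.

```java
class HealthScoreSketch {
    /**
     * Hypothetical health score: mean absolute z-score across sensors.
     * 0.0 means every sensor sits exactly at its baseline mean;
     * larger values mean the machine has drifted further from normal.
     */
    static double score(double[] readings, double[] baselineMean, double[] baselineStd) {
        if (readings.length == 0
                || readings.length != baselineMean.length
                || readings.length != baselineStd.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < readings.length; i++) {
            if (baselineStd[i] <= 0.0) {
                throw new IllegalArgumentException("baseline std must be positive");
            }
            // Distance from the baseline, measured in standard deviations.
            sum += Math.abs(readings[i] - baselineMean[i]) / baselineStd[i];
        }
        return sum / readings.length;
    }
}
```

For example, with a baseline of mean {10, 50} and standard deviation {1, 2}, readings of {12, 54} are two standard deviations off on both sensors, so the score is 2.0.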

Defect Point based on Asset Health Score
[Figure: asset health score over time. The algorithm, combined with deep asset knowledge, identifies the defect point in a monotonically increasing section of the score.]
Source: XenonStack
https://www.xenonstack.com/blog/log-analytics-deep-machine-learning/
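
The idea of a "monotonically increasing section" can be sketched as a search for the first run of strictly increasing health scores of a given minimum length. This is only an illustration: the class name, the method name and the `minLength` threshold are assumptions, and a production detector would likely also smooth the signal first.

```java
class MonotonicSectionSketch {
    /**
     * Returns the start index of the first run of at least minLength
     * strictly increasing scores, or -1 if there is no such run.
     */
    static int firstIncreasingRun(double[] scores, int minLength) {
        if (minLength < 2) {
            throw new IllegalArgumentException("minLength must be at least 2");
        }
        int runStart = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] <= scores[i - 1]) {
                runStart = i; // the run is broken, so restart it at this point
            }
            if (i - runStart + 1 >= minLength) {
                return runStart;
            }
        }
        return -1;
    }
}
```

With scores {0.2, 0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.9} and `minLength` 4, the first qualifying run starts at index 3 (0.2, 0.4, 0.5, 0.7).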

Real-time health score applications

Data flow: Real-time health score application
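[Diagram: unbounded sensor data from Kafka forms the main data stream; a separate event stream carries model changes; the results feed interactive status monitoring and are ETL'd into time series storage.]
Event stream
- Train models offline
- Model change
ETL into time series storage
- Prevent data loss
- Can be queried when needed
Interactive status monitoring
- Summarizing statistics
- Anomaly detection
- Aggregating data records on demand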

Stateful Operation - updateStateByKey
▪ The model context should stay cached for as long as
the application is running
▪ Plain DStreams of key-value pairs know nothing about
records from previous batches
▪ updateStateByKey can maintain state
across mini-batches, even if no
new data arrives afterwards.
// Optional is org.apache.spark.api.java.Optional (Spark 2.x and later)
Function2<List<V>, Optional<S>, Optional<S>>
modelStateFunc = (v, s) -> {
// v: new values for this key in the current batch; s: previous state, if any
// update or remove logic goes here
return s; // placeholder: keep the previous state unchanged
};
modelPairStream
.updateStateByKey(modelStateFunc)
.join(tracePairStream);
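
To make the semantics of the update function concrete, the following sketch simulates one micro-batch step in plain Java, without Spark. It uses `java.util.Optional` rather than Spark's own `Optional` class, and the `step` helper, the `modelStateFunc` body and the `"REMOVE"` marker are assumptions made for this example, not part of the original application.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

class UpdateStateByKeySketch {
    /** Simulates one micro-batch: the function runs for every key with state or new values. */
    static <K, V, S> Map<K, S> step(Map<K, S> state, Map<K, List<V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> updateFunc) {
        Set<K> keys = new LinkedHashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        Map<K, S> next = new LinkedHashMap<>();
        for (K key : keys) {
            List<V> newValues = batch.getOrDefault(key, Collections.emptyList());
            Optional<S> updated = updateFunc.apply(newValues, Optional.ofNullable(state.get(key)));
            // An empty result removes the key from the state.
            updated.ifPresent(s -> next.put(key, s));
        }
        return next;
    }

    /** Hypothetical model state function: keep the latest model, or drop it on "REMOVE". */
    static Optional<String> modelStateFunc(List<String> newModels, Optional<String> current) {
        if (newModels.isEmpty()) {
            return current; // no model change in this batch: keep the cached model
        }
        String latest = newModels.get(newModels.size() - 1);
        return latest.equals("REMOVE") ? Optional.empty() : Optional.of(latest);
    }
}
```

Feeding a batch that contains `asset-1 -> [model-v1]` stores the model; a following empty batch leaves it in place, which is the "state survives without new input" behaviour described above.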

Stateful Streaming for Operating Models
[Diagram: updateStateByKey carries the state forward from one batch to the next (Batch 1, RDD @ t; Batch 2, RDD @ t+1; Batch 3, RDD @ t+2; Batch 4, RDD @ t+3), combining the event DStream with the main DStream.]

Problem with updateStateByKey
Big data for predictive maintenance
▪ The number of assets is growing rapidly as predictive maintenance is
powered by the Internet of Things (IoT).
Performance
▪ updateStateByKey is invoked on every key in every batch in Spark Streaming.
▪ This can degrade performance when dealing with a large amount
of state.

Almost empty batches in model stream
▪ In contrast to the main stream, the model stream is idle unless
a model change occurs.
▪ fullOuterJoin + mapWithState
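[Diagram: at t, t+1 and t+2 the main stream carries (assetId, values) in every batch, while the model stream carries (assetId, models) only when a model changes. In the joined stream the model side is therefore absent in most batches, and the state holds (assetId, models).]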
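
The fullOuterJoin + mapWithState pattern can be sketched in plain Java as follows. This is a simulation, not Spark code: the class, the field names and the string-based "scoring" are assumptions for this example. The point it shows is that only keys present in the current batch are visited, and a cached model is reused whenever the model side of the join is absent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class JoinWithStateSketch {
    private final Map<String, String> modelState = new HashMap<>(); // assetId -> cached model
    int invocations = 0; // how many keys were visited across all batches

    /** Processes one micro-batch and returns assetId -> "model:value" for scored records. */
    Map<String, String> processBatch(Map<String, Double> values, Map<String, String> modelUpdates) {
        // Full outer join: every asset that appears on either side of this batch.
        Set<String> keys = new TreeSet<>(values.keySet());
        keys.addAll(modelUpdates.keySet());
        Map<String, String> out = new TreeMap<>();
        for (String assetId : keys) {
            invocations++; // only keys present in this batch are touched
            Optional<Double> value = Optional.ofNullable(values.get(assetId));
            Optional<String> newModel = Optional.ofNullable(modelUpdates.get(assetId));
            newModel.ifPresent(m -> modelState.put(assetId, m)); // model change: update state
            String model = modelState.get(assetId);
            if (value.isPresent() && model != null) {
                out.put(assetId, model + ":" + value.get());
            }
        }
        return out;
    }

    /** Registers a model for an asset without counting it as a batch invocation. */
    void preload(String assetId, String model) {
        modelState.put(assetId, model);
    }
}
```

If a model for `a2` is preloaded and the following batches only contain values for `a1`, the state for `a2` is never visited, in contrast to updateStateByKey, which would call the update function for `a2` in every batch.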

Challenges in Spark Standalone

Is standalone mode sufficient?
Case 1 (very common case)
# of assets: up to 10
# of parameters: up to 10
Case 2 (big data analytics)
# of assets: 1,000,000
# of parameters: 10,000

Execution Model - Standalone
Multi-node configuration (12 * 16 cores)
• 48 = 4 * 12 executors in total
• 4 GB memory / executor
• 192 = 12 * 16 cores
• 4 cores / executor
Single-node configuration (1 * 16 cores)
• 1 executor in total
• 16 GB memory / executor
• 16 = 1 * 16 cores
• 16 cores / executor
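
The arithmetic behind these two layouts can be written out as a small helper. It assumes that the factors 12 and 1 in the figures above are node counts with 16 cores each, which the slide implies but does not state; the class and method names are illustrative.

```java
class ExecutorLayoutSketch {
    /** Number of executors that fit on one node for a given executor size. */
    static int executorsPerNode(int coresPerNode, int coresPerExecutor) {
        return coresPerNode / coresPerExecutor;
    }

    /** Total executors across the cluster. */
    static int totalExecutors(int nodes, int coresPerNode, int coresPerExecutor) {
        return nodes * executorsPerNode(coresPerNode, coresPerExecutor);
    }

    /** Total cores across the cluster. */
    static int totalCores(int nodes, int coresPerNode) {
        return nodes * coresPerNode;
    }
}
```

Under that assumption, 12 nodes with 4-core executors give 4 executors per node, 48 executors and 192 cores in total, while a single node with one 16-core executor gives 1 executor and 16 cores.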

Points to consider in multi-node clusters
Communication between workers
▪ Data must be shuffled over the network
▪ DStreams have no broadcast operation for small data
▪ join or groupByKey(): think carefully before using them
Can the sensor data records be split easily across the worker nodes?
▪ Time sequence is important for predicting machine failures
▪ Watermarks are used to discard late sensor data records
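
The watermark idea in the last bullet can be sketched in plain Java. This is a simplified per-record version written for illustration: the class name and the `allowedLatenessMs` parameter are assumptions, and a streaming engine would typically advance the watermark per batch rather than per record.

```java
class WatermarkSketch {
    private final long allowedLatenessMs;
    private long maxEventTimeMs = Long.MIN_VALUE; // latest event time seen so far

    WatermarkSketch(long allowedLatenessMs) {
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /**
     * Returns true if the record is accepted, false if its event time is
     * older than the watermark (max event time seen minus allowed lateness).
     */
    boolean accept(long eventTimeMs) {
        boolean seenAny = maxEventTimeMs != Long.MIN_VALUE;
        if (seenAny && eventTimeMs < maxEventTimeMs - allowedLatenessMs) {
            return false; // too late: discard
        }
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return true;
    }
}
```

With an allowed lateness of 1000 ms, a record stamped 4500 that arrives after one stamped 5000 is still accepted, while a record stamped 3000 is discarded because the watermark has already advanced to 4000.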

It is not easy to manage multiple nodes
xpanes --log=~/log --ssh bistel@host1 bistel@host2 bistel@host3 ……

Container-based application - Docker
- https://www.docker.com/resources/what-container -
▪ Real-time applications require
many components working
together
▪ Algorithm modules
▪ Streaming engines (Spark, Kafka..)
▪ Database
▪ Dockerfile
…..
RUN mkdir -p /app
COPY /target/realtime-app.jar /app/spark-examples.jar
ENV SPARK_MASTER_NAME spark-master
ENV SPARK_MASTER_PORT 7077

Kubernetes
▪ Open-source system for automating the deployment, scaling, and management of
containerized services
▪ https://kubernetes.io/

Spark on Kubernetes Operations

Running Spark Job with Kubernetes

Acknowledgements
▪ This work was supported by the ICT
R&D program of MSIP/IITP
[2020(2020-0-00358),
Development of Knowledge & AI
based decision support system for
Manufacturing full automation]

