MACHINE LEARNING BASED RESOURCE UTILIZATION
PREDICTION IN THE COMPUTING CONTINUUM
Christian Bauer, Narges Mehran, Dr. Radu Prodan and Dr. Dragi Kimovski
1
[1] - https://camad2023.ieee-camad.org/
TABLE OF CONTENTS
Introduction
UtilML
Evaluation
References
2
INTRODUCTION
3
MOTIVATION
• Hardware estimations provided by users usually lead to over-provisioning
• Schedulers often require proper task specifications
• Increasing computational demand requires improved resource utilization
4
GOAL
Improve hardware estimations before scheduling/deployment
Better utilize existing hardware
5
CONTRIBUTIONS
Analysis of publicly available monitoring traces
Development of a proof-of-concept (POC) machine learning approach called UtilML that improves utilization prediction (CPU and memory)
Evaluation of different models based on regression metrics
6
THE SCENARIO
Distributed computing resources are (often) managed by a resource manager.
This resource manager accepts requests from users and allocates resources based on the user estimations.
7
COMPUTING CONTINUUM
CONSISTS OF A COMBINATION OF CLOUD, FOG AND EDGE LAYERS
8
[2] - Hossein Ashtari. Edge Computing vs. Fog Computing: 10 Key Comparisons. https://www.spiceworks.com/tech/cloud/articles/edge-vs-fog-computing, 2022. [Online; accessed 01-Nov-2023]
INPUT-OUTPUT
Input: user estimations, cluster capacity, task metadata, …
Output: the estimated CPU or memory utilization of a task
9
TASK METADATA
10
THE TASK TYPES
Task                  | Number of requests | Description
tensorflow            | 621415 | Machine learning
worker                | 275785 | Machine learning
parameter server (PS) | 183283 | Machine learning
PyTorchWorker         | 110784 | Machine learning
xComputeWorker        | 27402  | Machine learning
TensorboardTask       | 10681  | Machine learning
ReduceTask            | 4136   | Hadoop or Spark
JupyterTask           | 2066   | A task of Jupyter notebooks
TVMTuneMain           | 1158   | Auto-scheduling a neural network by Apache TVM
OpenmpiTracker        | 745    | Programming paradigm in high-performance computing
OssToVolumeWorker     | 672    | Object Storage Service (OSS) volume to persist data
11
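The task type is part of the task knowledge UtilML exploits. As a minimal sketch of how the task types above could be encoded as a categorical model feature (index/one-hot encoding is an illustrative assumption, not necessarily the paper's exact feature engineering):

```python
# Map each task type from the trace to a categorical feature index.
TASK_TYPES = [
    "tensorflow", "worker", "parameter server (PS)", "PyTorchWorker",
    "xComputeWorker", "TensorboardTask", "ReduceTask", "JupyterTask",
    "TVMTuneMain", "OpenmpiTracker", "OssToVolumeWorker",
]
TASK_INDEX = {name: i for i, name in enumerate(TASK_TYPES)}

def one_hot(task_type):
    """Return a one-hot vector marking the given task type."""
    vec = [0] * len(TASK_TYPES)
    vec[TASK_INDEX[task_type]] = 1
    return vec
```

Such a vector can then be concatenated with the numeric features (user estimations, cluster capacity) to form the model input.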
UTILML
12
UTILML ARCHITECTURE
[Architecture diagram: the input passes through an init_layer into an LSTM (Long Short-Term Memory) layer, then through LSTM transform layers (Transpose, expand, Gather, Unsqueeze, Concat), and finally through three fully connected layers (fc1, fc2, fc3) with LeakyReLU activations; an absolute-value (abs) operation produces the output.]
13
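At the core of the architecture is a standard Long Short-Term Memory cell, which retains long-term dependencies across a utilization time series. As a minimal illustration only (a toy scalar version with made-up weights, not UtilML's actual layers or parameters), one LSTM recurrence step can be sketched as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    # Standard LSTM gates: forget (f), input (i), candidate (g), output (o).
    # Toy scalar version: each gate has weights (w_x, w_h, bias).
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])
    c = f * c_prev + i * g   # cell state carries the long-term memory
    h = o * math.tanh(c)     # hidden state is the short-term output
    return h, c

# Run a toy normalized CPU-utilization sequence through the cell.
weights = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for util in [0.2, 0.4, 0.8, 0.6]:
    h, c = lstm_cell_step(util, h, c, weights)
```

The hidden state after the last step would then feed the fully connected layers of the network.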
EMBEDDING UTILML IN DATA PROCESSING PIPELINE
Resource
Resource
Resource
Resource
Resource
Resource
Resource
Resource
App
App
App
App
App
App
Orchestrator
Scheduler
LSTM-Based Utilization Prediction
Model
Historical Resource
Utilization Database
Data
Preprocessing
Prometheus
Monitoring
14
UtilML
Task
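A common way to prepare a historical utilization database for LSTM training is sliding-window sampling of the monitored time series. A minimal sketch (the window and horizon choices here are illustrative assumptions, not the paper's exact preprocessing):

```python
def make_windows(series, window, horizon=1):
    """Turn a utilization time series into (input window, target) pairs
    for supervised training of a sequence model."""
    samples = []
    for t in range(len(series) - window - horizon + 1):
        samples.append((series[t:t + window], series[t + window + horizon - 1]))
    return samples

# Example: predict the next value from the previous three observations.
pairs = make_windows([1, 2, 3, 4, 5], window=3)
```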
EVALUATION
15
EVALUATION METRICS
Root-Mean-Squared Error (RMSE)
Symmetric Mean Absolute Percentage Error (sMAPE)
Over- and under-estimation
16
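The three metrics above can be computed as follows. This is a minimal sketch using the standard RMSE and sMAPE definitions; counting predictions above vs. below the actual values for the over-/under-estimation split is an assumption about how the slides define it:

```python
import math

def rmse(actual, pred):
    """Root-Mean-Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def smape(actual, pred):
    """Symmetric MAPE in percent, bounded in [0, 200].
    Assumes actual and predicted values are not both zero."""
    return 100.0 / len(actual) * sum(
        2.0 * abs(p - a) / (abs(a) + abs(p)) for a, p in zip(actual, pred))

def over_under_share(actual, pred):
    """Percentage of predictions above vs. not above the actual values
    (ties are counted as under-estimation here)."""
    over = sum(1 for a, p in zip(actual, pred) if p > a)
    n = len(actual)
    return 100.0 * over / n, 100.0 * (n - over) / n
```

Unlike plain MAPE, sMAPE penalizes over- and under-estimation symmetrically and stays bounded when actual utilization is near zero, which matters for mostly idle tasks.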
EVALUATION RESULTS
• Evaluation results contain prediction performance analysis for CPU and memory utilization of tasks
• User predictions
• Baseline-LSTM predictions – a simple LSTM variant with:
  • capacity of CPU and memory
  • user prediction of CPU and memory
• UtilML predictions:
  • more complex than Baseline-LSTM
  • additionally uses task knowledge
17
EVALUATION RESULTS – CPU [%]
     | Actual CPU | UtilML-LSTM | Baseline-LSTM | User
mean | 516.073    | 454.205     | 392.630       | 632.809
std  | 881.832    | 579.213     | 705.771       | 496.245
min  | 1.023      | 2.395       | 3.030         | 5
25%  | 103.632    | 129.884     | 97.586        | 400
50%  | 208.749    | 249.392     | 118.014       | 600
75%  | 528.076    | 662.490     | 281.472       | 600
max  | 7790.371   | 5634.635    | 5793.996      | 6400
18
EVALUATION RESULTS – CPU METRICS
               | RMSE    | sMAPE  | OE/UE
UtilML-LSTM    | 688.089 | 83.017 | 48.14/51.86
Baseline-LSTM  | 797.289 | 85.626 | 41.94/58.06
User-Predicted | 812.497 | 89.466 | 72.80/27.20
19
EVALUATION RESULTS – MEMORY [GB]
     | Actual Memory | UtilML-LSTM | Baseline-LSTM | User
mean | 17.203        | 29.134      | 29.904        | 26.895
std  | 74.761        | 63.342      | 39.634        | 15.259
min  | 0.003         | 0.156       | 1.951         | 2
25%  | 2.160         | 4.537       | 22.178        | 14.648
50%  | 7.699         | 14.679      | 24.620        | 29.297
75%  | 15.976        | 27.924      | 24.620        | 29.297
max  | 1992.484      | 698.983     | 550.056       | 146.484
20
EVALUATION RESULTS – MEMORY METRICS
               | RMSE   | sMAPE   | OE/UE
UtilML-LSTM    | 61.897 | 119.715 | 56.66/43.43
Baseline-LSTM  | 77.902 | 109.870 | 77.8/22.2
User-Predicted | 73.853 | 97.613  | 80.80/19.20
21
FUTURE WORK
Make the model more suitable for predicting resource utilization spikes
Incorporate other ML techniques or models to improve predictions
23
ACKNOWLEDGEMENTS
24
ANY QUESTIONS?
THANK YOU FOR YOUR ATTENTION AND LET US CONNECT!
25
