MACHINE LEARNING BASED RESOURCE UTILIZATION
PREDICTION IN THE COMPUTING CONTINUUM
Christian Bauer, Narges Mehran, Dr. Radu Prodan and Dr. Dragi Kimovski
1
[1] - https://camad2023.ieee-camad.org/
TABLE OF CONTENTS
Introduction
UtilML
Evaluation
References
2
INTRODUCTION
3
MOTIVATION
• Hardware estimations provided by users usually lead to over-provisioning
• Schedulers often require proper task specifications
• Increasing computational demand requires improved resource utilization
4
GOAL
Improve hardware estimations before scheduling/deployment
Better utilize existing hardware
5
CONTRIBUTIONS
Analysis of publicly available monitoring traces
Development of a proof-of-concept (POC) machine learning approach called UtilML that improves utilization prediction (CPU and memory)
Evaluation of different models based on regression metrics
6
THE SCENARIO
Distributed computing resources are (often) managed by a resource manager.
This resource manager accepts requests from users and allocates resources based on the user estimations.
7
COMPUTING CONTINUUM
CONSISTS OF A COMBINATION OF CLOUD, FOG AND EDGE LAYERS
8
[2] - Hossein Ashtari. Edge Computing vs. Fog Computing: 10 Key Comparisons. https://www.spiceworks.com/tech/cloud/articles/edge-vs-fog-computing, 2022. [Online; accessed 01-Nov-2023]
INPUT-OUTPUT
Input: user estimations, cluster capacity, task metadata, …
Output: the estimated CPU or memory utilization of a task
9
TASK METADATA
10
THE TASK TYPES
Task                  | Number of requests | Description
tensorflow            | 621415 | Machine learning
worker                | 275785 | Machine learning
parameter server (PS) | 183283 | Machine learning
PyTorchWorker         | 110784 | Machine learning
xComputeWorker        | 27402  | Machine learning
TensorboardTask       | 10681  | Machine learning
ReduceTask            | 4136   | Hadoop or Spark
JupyterTask           | 2066   | A task of Jupyter notebooks
TVMTuneMain           | 1158   | Auto-scheduling a neural network by Apache TVM
OpenmpiTracker        | 745    | Programming paradigm in high-performance computing
OssToVolumeWorker     | 672    | Object Storage Service (OSS) volume to persist data
11
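The task type is part of the task knowledge UtilML exploits. As a minimal sketch of how the task types above could be encoded as a categorical model feature (index/one-hot encoding is an illustrative assumption, not necessarily the paper's exact feature engineering):

```python
# Map each task type from the trace to a categorical feature index.
TASK_TYPES = [
    "tensorflow", "worker", "parameter server (PS)", "PyTorchWorker",
    "xComputeWorker", "TensorboardTask", "ReduceTask", "JupyterTask",
    "TVMTuneMain", "OpenmpiTracker", "OssToVolumeWorker",
]
TASK_INDEX = {name: i for i, name in enumerate(TASK_TYPES)}

def one_hot(task_type):
    """Return a one-hot vector marking the given task type."""
    vec = [0] * len(TASK_TYPES)
    vec[TASK_INDEX[task_type]] = 1
    return vec
```

Such a vector can then be concatenated with the numeric features (user estimations, cluster capacity) to form the model input.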
UTILML
12
UTILML ARCHITECTURE
[Architecture diagram: the input passes through an init_layer into an LSTM (Long Short-Term Memory) layer, then through LSTM transform layers (Transpose, expand, Gather, Unsqueeze, Concat), and finally through three fully connected layers (fc1, fc2, fc3) with LeakyReLU activations; an absolute-value (abs) operation produces the output.]
13
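At the core of the architecture is a standard Long Short-Term Memory cell, which retains long-term dependencies across a utilization time series. As a minimal illustration only (a toy scalar version with made-up weights, not UtilML's actual layers or parameters), one LSTM recurrence step can be sketched as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    # Standard LSTM gates: forget (f), input (i), candidate (g), output (o).
    # Toy scalar version: each gate has weights (w_x, w_h, bias).
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])
    c = f * c_prev + i * g   # cell state carries the long-term memory
    h = o * math.tanh(c)     # hidden state is the short-term output
    return h, c

# Run a toy normalized CPU-utilization sequence through the cell.
weights = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for util in [0.2, 0.4, 0.8, 0.6]:
    h, c = lstm_cell_step(util, h, c, weights)
```

The hidden state after the last step would then feed the fully connected layers of the network.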
EMBEDDING UTILML IN DATA PROCESSING PIPELINE
Resource
Resource
Resource
Resource
Resource
Resource
Resource
Resource
App
App
App
App
App
App
Orchestrator
Scheduler
LSTM-Based Utilization Prediction
Model
Historical Resource
Utilization Database
Data
Preprocessing
Prometheus
Monitoring
14
UtilML
Task
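A common way to prepare a historical utilization database for LSTM training is sliding-window sampling of the monitored time series. A minimal sketch (the window and horizon choices here are illustrative assumptions, not the paper's exact preprocessing):

```python
def make_windows(series, window, horizon=1):
    """Turn a utilization time series into (input window, target) pairs
    for supervised training of a sequence model."""
    samples = []
    for t in range(len(series) - window - horizon + 1):
        samples.append((series[t:t + window], series[t + window + horizon - 1]))
    return samples

# Example: predict the next value from the previous three observations.
pairs = make_windows([1, 2, 3, 4, 5], window=3)
```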
EVALUATION
15
EVALUATION METRICS
Root-Mean-Squared Error (RMSE)
Symmetric Mean Absolute Percentage Error (sMAPE)
Over- and under-estimation
16
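The three metrics above can be computed as follows. This is a minimal sketch using the standard RMSE and sMAPE definitions; counting predictions above vs. below the actual values for the over-/under-estimation split is an assumption about how the slides define it:

```python
import math

def rmse(actual, pred):
    """Root-Mean-Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def smape(actual, pred):
    """Symmetric MAPE in percent, bounded in [0, 200].
    Assumes actual and predicted values are not both zero."""
    return 100.0 / len(actual) * sum(
        2.0 * abs(p - a) / (abs(a) + abs(p)) for a, p in zip(actual, pred))

def over_under_share(actual, pred):
    """Percentage of predictions above vs. not above the actual values
    (ties are counted as under-estimation here)."""
    over = sum(1 for a, p in zip(actual, pred) if p > a)
    n = len(actual)
    return 100.0 * over / n, 100.0 * (n - over) / n
```

Unlike plain MAPE, sMAPE penalizes over- and under-estimation symmetrically and stays bounded when actual utilization is near zero, which matters for mostly idle tasks.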
EVALUATION RESULTS
• Evaluation results contain prediction performance analysis for CPU and memory utilization of tasks
• User predictions
• Baseline-LSTM predictions – a simple LSTM variant with:
  • capacity of CPU and memory
  • user prediction of CPU and memory
• UtilML predictions:
  • more complex than Baseline-LSTM
  • additionally uses task knowledge
17
EVALUATION RESULTS – CPU [%]
     | Actual CPU | UtilML-LSTM | Baseline-LSTM | User
mean | 516.073    | 454.205     | 392.630       | 632.809
std  | 881.832    | 579.213     | 705.771       | 496.245
min  | 1.023      | 2.395       | 3.030         | 5
25%  | 103.632    | 129.884     | 97.586        | 400
50%  | 208.749    | 249.392     | 118.014       | 600
75%  | 528.076    | 662.490     | 281.472       | 600
max  | 7790.371   | 5634.635    | 5793.996      | 6400
18
EVALUATION RESULTS – CPU METRICS
               | RMSE    | sMAPE  | OE/UE
UtilML-LSTM    | 688.089 | 83.017 | 48.14/51.86
Baseline-LSTM  | 797.289 | 85.626 | 41.94/58.06
User-Predicted | 812.497 | 89.466 | 72.80/27.20
19
EVALUATION RESULTS – MEMORY [GB]
     | Actual Memory | UtilML-LSTM | Baseline-LSTM | User
mean | 17.203        | 29.134      | 29.904        | 26.895
std  | 74.761        | 63.342      | 39.634        | 15.259
min  | 0.003         | 0.156       | 1.951         | 2
25%  | 2.160         | 4.537       | 22.178        | 14.648
50%  | 7.699         | 14.679      | 24.620        | 29.297
75%  | 15.976        | 27.924      | 24.620        | 29.297
max  | 1992.484      | 698.983     | 550.056       | 146.484
20
EVALUATION RESULTS – MEMORY METRICS
               | RMSE   | sMAPE   | OE/UE
UtilML-LSTM    | 61.897 | 119.715 | 56.66/43.43
Baseline-LSTM  | 77.902 | 109.870 | 77.8/22.2
User-Predicted | 73.853 | 97.613  | 80.80/19.20
21
FUTURE WORK
Make the model more suitable for predicting resource utilization spikes
Incorporate other ML techniques or models to improve predictions
23
ACKNOWLEDGEMENTS
24
ANY QUESTIONS?
THANK YOU FOR YOUR ATTENTION AND LET US CONNECT!
25
