Data centers are centralized facilities where computing and storage hardware are aggregated to handle large amounts of data and computation.
In a data center, computing and storage resources are usually managed by a resource manager.
A commonly known problem in resource management is that users often request more resources than their applications actually use.
This leads to the degradation of overall resource availability in a data center.
In this thesis, I propose a method to improve the resource availability in data centers using deep learning.
The proposed method comprises two approaches, one for each type of resource.
For improving the availability of computing resources, I propose an LSTM-based deep neural network model that predicts
a suitable resource allocation for each job. Google’s cluster scheduler simulator was used to evaluate the proposed method.
The results indicated that the proposed method improved the CPU and memory utilization by 10.71% and 47.36%, respectively.
For improving the availability of storage resources, I propose a method to retrain compressed neural network models while keeping the model accuracy.
I compared the proposed retraining method with conventional retraining. The proposed method reduced the size of VGG-16 and ResNet-50 by 81.10% and 52.45%, respectively, without significant loss of accuracy.
1. Improving Resource Availability in
Data Centers using Deep Learning
(Improving Resource Utilization Efficiency in Data Centers Using Deep Learning)
Kundjanasith Thonglek
Software Design & Analysis Laboratory
4. Data Centers
Data centers are centralized facilities where computing and storage
hardware are aggregated to handle large amounts of data and computation.
Technical challenges
➢ System monitoring
➢ Energy management
➢ Continuous migration
➢ Availability improvement
5. Objective
I aim to improve the availability of computing and storage resources in
data centers by applying deep learning.
Resource utilization is paramount to many cloud providers as they need
to utilize their hardware resources efficiently to maximize profit.
Storage Resources
❖ Hard Disk
Computing Resources
❖ CPU, Memory
7. Users excessively request computing resources
➢ Users tend to request more computing resources than their applications actually need
○ Computing resources left unused by applications are wasted
○ Overall computing resource utilization in the data centers degrades
8. Overview of Proposed Method
➢ Analyzing Cluster Usage: analyze Google’s cluster usage trace obtained from a production data center
➢ Designing Neural Network: design an LSTM-based model to predict better resource allocation from historical data of resource usage and allocation
➢ Training LSTM Model: train the model using Google’s cluster usage trace
➢ Evaluation: evaluate the improvement of resource utilization using Google’s cluster scheduler simulator
9. Analyzing Cluster Usage
Google’s cluster usage trace is real workload data from a Google production data center.
Computing Resource   Requested Resource   Used Resource
CPU                  Requested CPU        Used CPU
Memory               Requested memory     Used memory
10. Long Short-Term Memory
Recurrent Neural Network (RNN)
➢ Deep learning model for time-series forecasting
➢ Model size not increasing with size of input
➢ Weights are shared across time
Long Short-Term Memory (LSTM) introduces long-term memory into RNN
➢ LSTM mitigates the vanishing gradient problem, where the neural
network stops learning because the updates to the various weights
within the network become smaller and smaller
➢ The memory cell replaces hidden neurons used in traditional RNNs to
build a hidden layer
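As a concrete illustration of the memory cell, here is a minimal NumPy sketch of a single LSTM step; the layer sizes and gate ordering are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The memory cell c carries long-term state;
    its additive update (f*c_prev + i*g) is what mitigates the vanishing
    gradient problem of plain RNNs.
    W: (4H, D), U: (4H, H), b: (4H,), stacked as [input, forget, cell, output]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate: how much new information enters the cell
    f = sigmoid(z[H:2*H])      # forget gate: how much of the old cell state is kept
    g = np.tanh(z[2*H:3*H])    # candidate cell content
    o = sigmoid(z[3*H:4*H])    # output gate: how much of the cell is exposed
    c = f * c_prev + i * g     # memory cell update (replaces the RNN hidden neuron)
    h = o * np.tanh(c)         # hidden state
    return h, c
```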
11. Proposed Neural Network
Input: the requested and used CPU and memory resources
1st LSTM layer: finds the correlation between CPU and memory
2nd LSTM layer: finds the correlation between allocated and used resources
Fully connected layer: connects each neuron to every neuron in the next layer
Output: the efficient CPU and memory allocation
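A minimal Keras sketch of the architecture described above; the layer widths, sequence length, and activations are assumptions, since the slides do not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 6     # time steps per input sequence (assumed)
N_FEATURES = 4  # requested CPU, requested memory, used CPU, used memory

model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    # 1st LSTM: correlation between CPU and memory
    layers.LSTM(64, return_sequences=True),
    # 2nd LSTM: correlation between allocated and used resources
    layers.LSTM(64),
    # Fully connected layer
    layers.Dense(32, activation="relu"),
    # Output: efficient CPU and memory allocation (as utilization fractions)
    layers.Dense(2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
```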
12. Training LSTM Model
Improve resource utilization by implementing a Long Short-Term Memory
model using the requested CPU, requested memory, used CPU, and used memory.
[Diagram: allocated and used CPU and memory (%) are fed into the model]
Memory cell sizes evaluated: 20, 40, and 60 minutes.
The memory cell size determines how many input-output pairs of each
sequence the Long Short-Term Memory model memorizes.
13. Usage Simulation
Simulate resource utilization in the data center from the allocated
resources predicted by our time-series model, applied to the actual
computing workload.
[Diagram: Google’s cluster usage data (513,000 jobs) is split into a
training dataset (80%) and a testing dataset (20%); the LSTM/RNN model
predicts the allocated CPU and memory (%), which drive the resource
allocation in Google’s simulation]
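The 80/20 split of the trace can be sketched as follows; the shuffling and seed are assumptions, while the job count comes from the slide:

```python
import numpy as np

N_JOBS = 513_000  # jobs in Google's cluster usage trace

def split_jobs(n_jobs, train_frac=0.8, seed=42):
    """Shuffle job indices and split them into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_jobs)
    n_train = int(n_jobs * train_frac)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_jobs(N_JOBS)
# 80% of 513,000 = 410,400 training jobs; the remaining 102,600 jobs match
# the count used for inference-time measurement later in the deck.
```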
16. Training time & Inference time
[Charts: training time for 100 epochs (408.93, 35.67, 130.82) and
inference time for 102,600 jobs (49.77, 35.13, 28.78)]
18. ML models are becoming larger
ML model compression improves the storage usage efficiency by reducing
the size of ML models, and increases the availability of storage resources.
Model Name   Model Size   Application
GPT-3        700 GB       Language Processing
VGG-16       528 MB       Image Classification
Mask RCNN    256 MB       Object Detection
Normally, ML model compression reduces the model size, but it also
decreases the accuracy.
19. Compressing models while maintaining accuracy
Original Model → [Quantization] → Quantized Model → [Retraining] → Compressed Model
➢ Quantization decreases the model size with some loss of accuracy
➢ Retraining increases the model accuracy while keeping the model size
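A minimal sketch of codebook (centroid) quantization for one weight tensor, using a simple 1-D k-means; the iteration count and initialization are illustrative assumptions:

```python
import numpy as np

def quantize_weights(w, n_centroids, n_iter=20, seed=0):
    """Cluster a layer's weights into n_centroids values (k-means) and
    return the codebook plus a per-weight index array. Storing log2(k)-bit
    indices instead of 32-bit floats is what shrinks the model."""
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    codebook = rng.choice(flat, size=n_centroids, replace=False).astype(float)
    for _ in range(n_iter):
        # assign each weight to its nearest centroid, then recenter
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_centroids):
            members = flat[idx == k]
            if members.size:
                codebook[k] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.reshape(w.shape)

def dequantize(codebook, idx):
    """Reconstruct the (lossy) weight tensor from codebook and indices."""
    return codebook[idx]
```

With k = 32 centroids, for example, each weight needs only a 5-bit index, so a layer of 32-bit floats shrinks to roughly 5/32 of its size plus a small shared codebook.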
23. Retraining using unlabeled data
Most existing retraining methods require a labeled dataset.
Retraining with an unlabeled dataset is highly useful when the labeled
dataset is unavailable, e.g. due to privacy policies or license limitations.
24. Proposed Retraining Method
[Diagram: an unlabeled dataset is fed to both the original model and the
quantized model (non-trainable layers plus a trainable layer); the loss is
computed between the two output vectors]
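A toy NumPy sketch of this objective on a two-layer linear network: the quantized layer is frozen, and only the trainable layer is updated so the student's output vector matches the original model's output vector on unlabeled inputs. The dimensions, learning rate, and linear architecture are illustrative assumptions:

```python
import numpy as np

def retrain_trainable_layer(x, teacher_out, frozen_W1, W2, lr=0.01, steps=500):
    """Distillation-style retraining without labels: minimize the MSE
    between the quantized (student) model's output and the original
    (teacher) model's output. Only W2 (the trainable layer) is updated;
    frozen_W1 (the quantized, non-trainable layer) stays fixed."""
    h = x @ frozen_W1                  # activations of the frozen quantized layer
    n = len(x)
    for _ in range(steps):
        out = h @ W2                   # student output vector
        grad = 2.0 * h.T @ (out - teacher_out) / n  # gradient of MSE w.r.t. W2
        W2 = W2 - lr * grad
    return W2
```

Because the target is the teacher's output vector rather than a ground-truth label, the data itself never needs to be labeled.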
27. Case Study of VGG-16
[Figures: VGG-16 model architecture, bias value distribution, and weight value distribution]
28. Model Quantization
[Charts: size and accuracy of quantized VGG-16 models vs. number of quantized layers]
29. Model Retraining
Retraining quantized VGG-16 models: quantizing the 14th and 15th layers
using 32-256 centroids achieved nearly the accuracy of the original model.
The best configuration for quantizing the VGG-16 model:
- Quantize the biases in all layers using 1 centroid
- Quantize the weights in the 14th and 15th layers using 32 centroids
This compresses the model to the smallest possible size without
significant accuracy loss.
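To see why this configuration shrinks the model, the storage of a codebook-quantized layer can be computed as follows; the layer size used here is hypothetical, not VGG-16's actual dimensions:

```python
import math

def quantized_layer_bytes(n_weights, n_centroids, float_bytes=4):
    """Storage for one codebook-quantized layer: a ceil(log2(k))-bit
    index per weight, plus the float codebook itself."""
    index_bits = max(1, math.ceil(math.log2(n_centroids)))
    return n_weights * index_bits / 8 + n_centroids * float_bytes

# A hypothetical fully connected layer with 100M 32-bit weights:
full_bytes = 100_000_000 * 4
quant_bytes = quantized_layer_bytes(100_000_000, 32)  # 5-bit indices + codebook
ratio = full_bytes / quant_bytes  # roughly 6.4x smaller
```

Quantizing biases with a single centroid is even cheaper: each bias then costs one index bit plus a single shared float.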
30. Case Study of ResNet-50
[Figures: ResNet-50 model architecture, bias value distribution, and weight value distribution]
31. Model Quantization
[Charts: size and accuracy of quantized ResNet-50 models vs. number of quantized layers]
32. Model Retraining
Retraining quantized ResNet-50 models: quantizing the 13th-49th layers
using 128 or fewer centroids clearly degrades the accuracy of the model.
The best configuration for quantizing the ResNet-50 model:
- Quantize the biases in all layers using 1 centroid
- Quantize the weights in the 13th-49th layers using 256 centroids
This compresses the model to the smallest possible size without
significant accuracy loss.
33. Conventional & Proposed Retraining
[Charts: accuracy of the quantized model through retraining (85% vs. 82%)
and retraining time of the quantized model]
*The conventional retraining method retrains all layers in the model
35. Conclusion
➢ Improving availability of computing resources
○ We proposed an LSTM-based prediction model that predicts efficient computing resource
allocations to improve computing resource availability
○ The proposed method improves CPU and memory utilization by 11% and 48%, respectively
➢ Improving availability of storage resources
○ We proposed a method that reduces the size of neural network models without significant
accuracy loss to improve storage resource availability
○ The proposed method reduces the sizes of VGG-16 and ResNet-50 by 81% and 52%,
respectively
36. Future Work
➢ Improving availability of computing resources
○ The features that significantly impact computing resource availability should be
investigated to make the method more efficient
○ We would like to apply other time-series forecasting techniques to improve the
availability of computing resources
➢ Improving availability of storage resources
○ The structure of other neural network models should be investigated to design an
efficient retraining method
○ We would like to apply compression techniques other than quantization to reduce
the size of neural network models
37. Publications
➢ Improving availability of computing resources
○ Kundjanasith Thonglek, Kohei Ichikawa, Keichi Takahashi, Chawanat Nakasan, and Hajimu
Iida, “Improving Resource Utilization in Data Centers using an LSTM-based Prediction
Model”, Proceedings of the Workshop on Monitoring and Analysis for High Performance
Computing Systems Plus Applications (HPCMASPA 2019), September 2019.
➢ Improving availability of storage resources
○ Kundjanasith Thonglek, Keichi Takahashi, Kohei Ichikawa, Chawanat Nakasan, Hidemoto
Nakada, Ryousei Takano, and Hajimu Iida, “Retraining Quantized Neural Network Models
with Unlabeled Data”, Proceedings of the International Joint Conference on Neural
Networks (IJCNN 2020), July 2020.