GROUP – B3
CLOUD WORKLOAD ANALYSIS AND SIMULATION
Investigating the behavior of a cloud, focusing on its workload patterns
Jan – May 2014
TABLE OF CONTENTS
Highlights
The approach
Dataset preprocessing and analysis
Clustering analysis
Time series analysis
Workload prediction
Looking ahead
1. Objective
2. The Approach
3. Dataset preprocessing and analysis
3.1 Preprocessing
3.2 Analysis
4. Calculating resource usage statistics
5. Classification of users and identifying target users
6. Time series analysis
7. Workload Prediction
8. Tools and Algorithms Used
9. Issues faced and possible solutions
10. Looking ahead
GROUP MEMBERS
References
Highlights:
The approach
Studied the Google trace data schema
Studied related technical papers and summarized useful observations
Devised an approach to analyze the cloud workload, drawing on observations from the technical papers and the Google trace data schema
Dataset preprocessing and analysis
Preprocessed the data to prepare it for analysis
Visualized key statistics for the feasibility decision and computed relevant attributes
Analyzed and visualized the main attributes and recorded observations
Clustering analysis
Applied various clustering algorithms, compared the results, and chose the best clustering for user and task analysis
Classified users primarily by estimation ratios and tasks by CPU and memory usage
Time series analysis
Identified target users and their tasks from the clustering results
Ran the dynamic time warping (DTW) algorithm on tasks to identify patterns in their resource usage
Workload prediction
Identified users with specific resource usage patterns
Allocated resources to these users based on the identified usage pattern plus a threshold value
Looking ahead
Improvements to our approach
1. Objective:
Analyze and report the cloud workload data based on Google cloud trace
Use graphical tools to visualize the data (you may need to write programs to process the data
in order to feed them into visual tools)
Study and summarize the papers regarding other people’s experience in Google cloud trace
analysis
Determine the workload characteristics from your analysis of the Google cloud trace
Try to reallocate unused resources of a user to other users who require them
2. The Approach:
Based on our study of the Google cloud trace data and the observations gathered from the technical
papers, we devised the following approach:
Analyze and visualize the data to identify the important attributes that determine user workload
patterns and ignore the rest of the attributes
Calculate resource usage statistics of users to identify the feasibility of resource re-allocation
Classify users based on their resource usage quality[1] (amount of unused resource/resource
requested) using clustering analysis
Identify target users based on the clustering analysis for resource re-allocation
Study the workload pattern of tasks of the target users and classify tasks based on their lengths
Perform time series analysis on long tasks
Identify, where present, a recurring pattern for a user and associate that pattern with the user, or form
clusters of tasks across all users that have similar workloads, based on time series analysis
Predict the usage pattern of a user if the current task's pattern matches the pattern associated
with that user, or matches that of one of the clusters formed in the previous step
3. Dataset preprocessing and analysis
3.1 Preprocessing
Inconsistent and vague data was processed before analysis. The task-usage table contains many
records for the same Job ID–task index pair, because a task may be re-submitted or re-scheduled
after failure. To avoid reading multiple values for the same Job ID–task index pair, the data was
pre-processed.
Pre-processing: All records were grouped by Job ID–task index, and only the last occurring record
of each group of repeating task records was kept as a single record.
Timestamps were converted into days and hours for per-day analysis.
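The grouping step can be sketched in Java, the language we used for data extraction; the record layout and field positions below are simplified assumptions for illustration, not the actual trace schema:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Preprocess {
    // Keep only the last record seen for each Job ID-task index pair.
    // Records are assumed to arrive in timestamp order, as in the trace files.
    static Map<String, String[]> lastRecordPerTask(List<String[]> records) {
        Map<String, String[]> latest = new LinkedHashMap<>();
        for (String[] rec : records) {
            // Assumed layout: rec[0] = job ID, rec[1] = task index, rest = usage fields.
            String key = rec[0] + "-" + rec[1];
            latest.put(key, rec); // later records overwrite earlier re-submissions
        }
        return latest;
    }

    // Convert a trace timestamp in microseconds to (day, hour) for per-day analysis.
    static int[] toDayHour(long micros) {
        long seconds = micros / 1_000_000L;
        int day = (int) (seconds / 86_400L);
        int hour = (int) ((seconds % 86_400L) / 3_600L);
        return new int[] { day, hour };
    }
}
```

Because later records simply overwrite earlier ones under the same key, a single pass over the file is enough.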
3.2 Analysis:
The data in the cloud trace tables was visualized. Attributes that were constant, or confined to a
small range of values for most records, were excluded from the analysis. Attributes that play a
major part in shaping the user and task profiles were treated as important. The main attributes of
each table were analyzed and visualized, and the following observations were made.
Figure 1: CPU requested per user (blue) vs. CPU used per user (red)
Observation: Most users overestimate the resources they need and use less than 5% of the requested resources.
A few users underestimate the resources and use more than three times the requested amount.
Figure 2: Memory requested per user (blue) vs. Memory used per user (red)
4. Calculating resource usage statistics:
As we are concerned with re-allocating unused resources, we focus on users who over-estimate
their resource needs, as observed in the previous section.
To identify those users, a new attribute is calculated:
Estimation ratio [1] = (requested resource – used resource) / requested resource
The estimation ratio is at most 1 and typically lies between 0 and 1:
0 (or below) – the user has used all of, or more than, the requested resource (negative values indicate over-use)
1 – the user has not used any of the requested resource
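As a minimal sketch, the ratio can be computed per user as follows (the method name is ours, for illustration):

```java
public class EstimationRatio {
    // Estimation ratio = (requested - used) / requested.
    // 1.0  -> none of the requested resource was used
    // 0.0  -> exactly the requested amount was used
    // < 0  -> the user consumed more than requested (under-estimation)
    static double estimationRatio(double requested, double used) {
        if (requested <= 0) {
            throw new IllegalArgumentException("requested must be positive");
        }
        return (requested - used) / requested;
    }
}
```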
Also from the visualizations and observations made, the following are identified as important
attributes:
User: Submission rate, CPU estimation ratio, Memory estimation ratio
Task: Task length, CPU usage, Memory usage
Figure 3: CPU estimation ratio per user
Users with a negative (red) CPU estimation ratio have used more resources than requested.
Users with a CPU estimation ratio between 0.9 and 1 have left more than 90% of the requested resource unused.
Figure 4: Memory estimation ratio per user
Users with a negative (orange) memory estimation ratio have used more resources than requested.
Users with a memory estimation ratio between 0.9 and 1 have left more than 90% of the requested resource unused.
5. Classification of users and identifying target users:
The dimensions for classification are
User: Submission rate, CPU estimation ratio, Memory estimation ratio
We use the following clustering algorithms to identify the optimal number of clusters for users and tasks:
K-means
Expectation–Maximization (EM)
Cascade Simple K-means
X-means [2]
We categorize the users and tasks using these algorithms with the above dimensions, then
compare the results and choose the best clustering for users and tasks.
Figures: K-means (4 clusters) and EM clustering results
Figures: X-means and Cascade Simple K-means clustering results
K-means clustering with 4 clusters was selected, as it offers good clustering of users based on the
CPU and memory estimation ratios.
From the clustering results we observed:
97% of the users have estimation ratios in the range 0.7–1.0; that is, 97% of the users use no more
than 30% of the resources they request. We targeted User Cluster 0 and Cluster 3 (more than 90%
unused).
We targeted tasks that were long enough for efficient resource re-allocation, and performed
clustering on the task lengths of these users to filter out short tasks.
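We ran the clustering itself in WEKA; the following is only a minimal plain-Java sketch of the k-means idea on the two estimation-ratio dimensions. The fixed iteration count and caller-supplied initial centroids are simplifying assumptions, not WEKA's implementation:

```java
public class KMeansSketch {
    // Basic k-means over 2-D points (CPU ratio, memory ratio).
    // Returns the cluster index assigned to each point; centroids are
    // updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int[] assign = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assign[i] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sum = new double[centroids.length][2];
            int[] count = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                sum[assign[i]][0] += points[i][0];
                sum[assign[i]][1] += points[i][1];
                count[assign[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c][0] = sum[c][0] / count[c];
                    centroids[c][1] = sum[c][1] / count[c];
                }
            }
        }
        return assign;
    }
}
```

With ratios near 1.0 marking heavy over-estimators, a cluster whose centroid sits above 0.9 on both dimensions corresponds to our target users.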
6. Time series analysis
To identify tasks with similar workloads, we ran the DTW [3] algorithm on each task of Cluster 0 and
Cluster 3 users, computing the DTW distance between every target user's task and a reference sine
curve (see the Issues faced section).
Tasks with the same DTW value were clustered together; these tasks were identified as having
similar workload curves.
Figure: two tasks with the same DTW distance, showing similar workloads
For each cluster thus formed, a reference workload curve was selected at random from the
workloads of the tasks in that cluster (due to time constraints).
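We computed DTW with MATLAB's built-in function; as a sketch, the standard dynamic-programming recurrence behind it looks like this in Java (textbook algorithm, not MATLAB's implementation):

```java
public class Dtw {
    // Dynamic time warping distance between two 1-D workload series,
    // using absolute difference as the local cost.
    static double distance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                d[i][j] = Double.POSITIVE_INFINITY;
        d[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                // Extend the cheapest of: match, insertion, deletion.
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                                Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }
}
```

Unlike point-wise distance, DTW lets a time-shifted copy of the same shape score as identical, which is why tasks with equal DTW values against the reference curve have similar workload curves.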
7. Workload Prediction
When a user from the targeted list issues a task, the task's workload is observed for a pre-
determined amount of time. This period was determined by trial and error, as the minimum time
at which all reference curves are distinguishable.
During this period, the task's workload is compared with the reference curve of every task
cluster formed in the previous step.
If the current task's workload curve has zero distance to one of the reference curves, i.e., is
similar to it, the current task is expected to behave like that reference curve and its workload is
predicted accordingly.
Resource allocation and de-allocation cannot be done continuously because of:
the huge overhead involved
the delay in allocating resources
so allocation happens once every pre-determined interval of time.
Hence, for stealing the resource, the allocation and re-allocation is a step function. Based on the
predicted curve's slope, a step-up or step-down is performed. A threshold value is also set to
accommodate unexpected spikes in the workload.
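The step-function allocation can be sketched as follows; the peak-plus-threshold rule, interval length, and margin below are illustrative assumptions, not the exact values from our experiments:

```java
public class StepAllocation {
    // Build a step-shaped allocation from a predicted usage curve:
    // within each interval, allocate the predicted peak plus a safety
    // threshold, so steps move up or down following the curve's slope.
    static double[] allocate(double[] predicted, int interval, double threshold) {
        double[] alloc = new double[predicted.length];
        for (int start = 0; start < predicted.length; start += interval) {
            int end = Math.min(start + interval, predicted.length);
            double peak = 0;
            for (int t = start; t < end; t++) {
                peak = Math.max(peak, predicted[t]);
            }
            double level = peak + threshold; // headroom for unexpected spikes
            for (int t = start; t < end; t++) {
                alloc[t] = level;
            }
        }
        return alloc;
    }
}
```

Everything between the user's requested amount and this step curve is the resource that can be re-allocated to other users.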
Successful prediction:
Average unused resource: 94%; average resource stolen: 65% (Requested – Allocated)
Figure: Efficient resource allocation curve (CPU usage vs. time in seconds; series: Used, Allocated, Requested)
Failed prediction:
Reason: The accompanying chart shows a case where our algorithm failed to predict correctly,
because of the random selection of the reference curve for task clusters. Though the randomly
selected reference curve generated a decent resource allocation curve, there are points at which the
current task spikes and exceeds the allocated resource. The solution to this issue is discussed in the
Issues faced section.
Figure: Failed prediction (CPU usage vs. time in seconds; series: Allocated, Usage)
8. Tools and Algorithms Used:
JAVA: For extracting the required data from the datasets (CSV reader/writer, hashmaps).
DTW on MATLAB: Implemented DTW using MATLAB's built-in function.
WEKA 3.7: To run the clustering algorithms – K-means, EM, Cascade Simple K-means, X-means.
TABLEAU 8.1: To visualize the datasets and results.
Naïve Bayes on MATLAB for choosing the right cluster: could not be used, because our data was
continuous and the algorithm needed discrete data.
Correlation on MATLAB: dropped, since DTW was a better option for comparing two curves.
9. Issues faced and possible solutions:
MATLAB crashing: While running DTW on each task of each user (nearly 9,000 tasks), MATLAB
crashed; this was rectified by processing the data in batches. The DTW implementation in MATLAB
takes two vectors as input and converts them to a matrix for multiplication, which greatly increases
the time complexity.
MATLAB numeric types: We had problems getting MATLAB to accept the user as a String data type
alongside the other parameters, so we ran Java programs to map the users of tasks to the
corresponding DTW values.
Naïve Bayes algorithm: Learning from the existing data and classifying a given test curve with
Naïve Bayes would have given better predictions, but since the data was continuous, we could not
apply the algorithm.
We initially ran the DTW algorithm per user's tasks, to identify whether a user has a recurring
workload pattern. As very few users had such a pattern, we ended up ignoring a lot of data, so we
instead ran DTW over the tasks of all users.
10. Looking ahead
10.1 Improvements and optimizations
Choosing a good reference curve for running DTW was difficult. A straight line as the reference
curve gave mediocre results, as curves with peaks at different instants of time were grouped as
similar. We therefore compared results using the line x = y and a sine curve as references, and the
sine curve gave good results.
Choosing a representative curve for a task cluster was done at random due to time constraints.
This can be improved by using a curve-fitting algorithm to derive an overall reference curve for
each cluster.
Cases where a task's workload drifts during prediction to resemble another cluster are not handled
yet. This can be addressed by continuously comparing the current task's workload with every
cluster's reference curve and, when the task appears to shift to another cluster, dynamically
remapping the step curve to the new cluster's reference curve. This constant monitoring and
dynamic remapping would improve prediction accuracy.
GROUP MEMBERS
PRABHAKAR GANESAMURHTY – prabhakarg@utdallas.edu
PRIYANKA MEHTA – priyanka.nmehta@gmail.com
ARUNRAJA SRINIVASAN – arunraja.srinivasan@yahoo.com
ABINAYA SHANMUGARAJ – abias1702@gmail.com
References:
1. Solis Moreno, I., Garraghan, P., Townend, P. M., and Xu, J. "An Approach for Characterizing Workloads in Google
Cloud to Derive Realistic Resource Utilization Models." In: Proc. IEEE 7th International Symposium on Service
Oriented System Engineering (SOSE 2013). IEEE, pp. 49–60. ISBN 978-1-4673-5659-6.
2. Pelleg, D. and Moore, A. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters." 2000.
3. Wang, Xiaoyue, et al. "Experimental Comparison of Representation Methods and Distance Measures for Time
Series Data." Data Mining and Knowledge Discovery (2010): 1–35.