GROUP – B3
CLOUD WORKLOAD ANALYSIS AND SIMULATION
Investigating the behavior of a cloud, focusing on its workload
patterns
Jan – May 2014
TABLE OF CONTENTS
Highlights
    The approach
    Dataset preprocessing and analysis
    Clustering analysis
    Time series analysis
    Workload prediction
    Looking ahead
1. Objective
2. The Approach
3. Dataset preprocessing and analysis
    3.1 Preprocessing
    3.2 Analysis
4. Calculating resource usage statistics
5. Classification of users and identifying target users
6. Time series analysis
7. Workload Prediction
8. Tools and Algorithms Used
9. Issues faced and possible solutions
10. Looking ahead
GROUP MEMBERS
References
Highlights:
The approach
 Studied the Google trace data schema
 Studied related technical papers and summarized useful observations
 Devised an approach to analyze the cloud workload, drawing on observations from the technical papers and
the Google trace data’s schema
Dataset preprocessing and analysis
 Preprocessed the data to prepare it for analysis
 Visualized important statistics for the feasibility decision and computed relevant attributes
 Analyzed and visualized the main attributes and recorded observations
Clustering analysis
 Applied various clustering algorithms, compared the results, and chose the best clustering for user and task
analysis
 Users were classified primarily based on estimation ratios, and tasks based on CPU and memory usage
Time series analysis
 Target users and their tasks were identified from the clustering results
 The dynamic time warping algorithm was run on tasks to identify patterns in their resource usage
Workload prediction
 Users with specific resource usage patterns were identified
 Resources for these users were allocated based on the identified usage pattern, with a threshold value
Looking ahead
 Improvements to our approach
1. Objective:
 Analyze and report the cloud workload data based on Google cloud trace
 Use graphical tools to visualize the data (you may need to write programs to process the data in order to feed them into visual tools)
 Study and summarize the papers regarding other people’s experience in Google cloud trace
analysis
 Determine the workload characteristics from your analysis of the Google cloud trace
 Try to reallocate unused resources of a user to other users who require them
2. The Approach:
Based on our study of the Google cloud trace data and the observations gathered from the technical
papers, we devised the following approach for the problem:
 Analyze and visualize the data to identify the important attributes that determine a user’s workload
pattern, and ignore the rest of the attributes
 Calculate resource usage statistics of users to identify the feasibility of resource re-allocation
 Classify users based on their resource usage quality[1] (amount of unused resource/resource
requested) using clustering analysis
 Identify target users based on the clustering analysis for resource re-allocation
 Study the workload pattern of tasks of the target users and classify tasks based on their lengths
 Perform time series analysis on long tasks
 Identify a pattern for a user (if one exists) and associate that pattern with that user, or form clusters
of tasks across all users that have similar workloads based on the time series analysis
 Predict the usage pattern of a user if the current task’s pattern matches the pattern associated
with that user, or matches that of one of the clusters formed in the previous step.
3. Dataset preprocessing and analysis
3.1 Preprocessing
Inconsistent and ambiguous data was cleaned before analysis. The task-usage table contains many
records for the same Job ID-task index pair because a task may be re-submitted or re-scheduled after
a failure. To avoid reading multiple values for the same Job ID-task index pair, the data was
pre-processed as follows: all records were grouped by Job ID-task index, and only the last occurring
record of each group of repeated task records was kept as a single record.
Timestamps were also converted into days and hours for per-day analysis.
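A minimal Java sketch of this pre-processing step is given below. It is a sketch only: the file names and column positions are assumptions and would need to be adjusted to the actual trace schema (the report's extraction code used Java CSV readers/writers and HashMaps, as noted in the Tools section).

import java.io.*;
import java.util.*;

// Sketch: keep only the last record for each Job ID-task index pair and
// append day/hour columns derived from the (microsecond) start timestamp.
// Column positions and file names are illustrative assumptions.
public class TaskUsageDeduplicator {
    private static final int START_TIME_COL = 0;
    private static final int JOB_ID_COL = 2;
    private static final int TASK_INDEX_COL = 3;
    private static final long MICROS_PER_HOUR = 3_600_000_000L;
    private static final long MICROS_PER_DAY = 24 * MICROS_PER_HOUR;

    public static void main(String[] args) throws IOException {
        // Later records overwrite earlier ones, so the map ends up holding
        // the last occurrence of every Job ID-task index pair.
        Map<String, String> lastRecord = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("task_usage.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",", -1);
                lastRecord.put(f[JOB_ID_COL] + "-" + f[TASK_INDEX_COL], line);
            }
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("task_usage_dedup.csv"))) {
            for (String record : lastRecord.values()) {
                String[] f = record.split(",", -1);
                long startMicros = Long.parseLong(f[START_TIME_COL]);
                long day = startMicros / MICROS_PER_DAY;                      // day index
                long hour = (startMicros % MICROS_PER_DAY) / MICROS_PER_HOUR; // hour of the day
                out.println(record + "," + day + "," + hour);
            }
        }
    }
}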
3.2 Analysis:
The data in the cloud trace tables was visualized. Attributes that were constant or fell within a small
range of values for most records were not considered for analysis. Attributes that play a major part in
shaping the user and task profiles were treated as important. These main attributes were analyzed and
visualized, and the following observations were made.
Figure 1: CPU requested per user (blue) vs. CPU used per user (red)
Observation: Most users overestimate the resources they need and use less than 5% of the requested resources.
A few users underestimate their needs and use more than three times the requested amount.
Figure 2: Memory requested per user (blue) vs. memory used per user (red)
4. Calculating resource usage statistics:
As we are concerned with re-allocating unused resources, we focus on the users who over-estimate
their resources, as observed in the previous section.
To identify these users, a new attribute is calculated:
Estimation ratio [1] = (requested resource – used resource)/requested resource.
The estimation ratio is at most 1 and becomes negative when usage exceeds the request.
0 – the user has used exactly the requested resource
1 – the user has not used any of the requested resource
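The calculation itself is straightforward; the small Java sketch below only illustrates the formula and the cases described above (the example values are made up).

// Sketch of the estimation-ratio calculation; the example values are made up.
public final class EstimationRatio {

    // estimationRatio = (requested - used) / requested
    //  1.0  -> none of the requested resource was used
    //  0.0  -> exactly the requested amount was used
    //  < 0  -> the user consumed more than it requested
    static double estimationRatio(double requested, double used) {
        if (requested <= 0) {
            throw new IllegalArgumentException("requested must be positive");
        }
        return (requested - used) / requested;
    }

    public static void main(String[] args) {
        System.out.println(estimationRatio(10.0, 0.4));   //  0.96 -> heavy over-estimation
        System.out.println(estimationRatio(10.0, 10.0));  //  0.00 -> request fully used
        System.out.println(estimationRatio(10.0, 32.0));  // -2.20 -> under-estimation
    }
}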
Also from the visualizations and observations made, the following are identified as important
attributes:
User: Submission rate, CPU estimation ratio, Memory estimation ratio
Task: Task length, CPU usage, Memory usage
Figure 3: CPU estimation ratio per user
Users with a negative (red) CPU estimation ratio have used more resources than requested.
Users with a CPU estimation ratio between 0.9 and 1 have used at most 10% of the requested resource (more than 90% unused).
Figure 4: Memory estimation ratio per user
Users with a negative (orange) memory estimation ratio have used more resources than requested.
Users with a memory estimation ratio between 0.9 and 1 have used at most 10% of the requested resource (more than 90% unused).
5. Classification of users and identifying target users:
The dimensions for classification are
User: Submission rate, CPU estimation ratio, Memory estimation ratio
We use the following clustering algorithms to identify the optimal number of clusters for users and tasks:
 K-means
 Expectation-Maximization (EM)
 Cascade Simple K-means
 X-means[2]
We categorize the users and tasks using these clustering algorithms with the above dimensions for
users.
We compare and choose the best clustering for users and tasks.
Figures: user clustering results from K-means (4 clusters), EM clustering, X-means, and Cascade Simple K-means
K-means clustering with 4 clusters was selected, as it offers a good clustering of users based on the
CPU and memory estimation ratios.
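As a rough illustration of this step, the sketch below runs WEKA's SimpleKMeans (WEKA 3.7 is listed in the Tools section) on a hypothetical users.arff file containing the three user dimensions; the file name and attribute layout are assumptions.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: cluster users into 4 groups with WEKA's SimpleKMeans.
// "users.arff" is an assumed input file holding one row per user with
// submission rate, CPU estimation ratio and memory estimation ratio.
public class UserClustering {
    public static void main(String[] args) throws Exception {
        Instances users = new DataSource("users.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(4);                 // 4 clusters, as selected in this report
        kmeans.setPreserveInstancesOrder(true);   // needed to read back per-user assignments
        kmeans.buildClusterer(users);

        int[] assignments = kmeans.getAssignments();
        for (int i = 0; i < assignments.length; i++) {
            System.out.println("user " + i + " -> cluster " + assignments[i]);
        }
        System.out.println(kmeans.getClusterCentroids());
    }
}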
From the clustering results we observed that 97% of the users have estimation ratios between 0.7 and
1.0; that is, 97% of the users leave at least 70% of the resources they request unused. We targeted user
Cluster 0 and Cluster 3 (more than 90% unused).
We also targeted tasks that were long enough to make resource re-allocation worthwhile, and performed
clustering on the task lengths of these users to filter out short tasks.
6. Time series analysis
To identify tasks with similar workloads, we ran the DTW[3] algorithm on the tasks of Cluster 0 and
Cluster 3 users, computing the DTW distance between each target user’s task and a reference sine curve
(see the Issues faced section).
Tasks with the same DTW distance were clustered together; these tasks were found to have similar
workload curves.
Figure: two tasks with the same DTW distance showing similar workloads
For each cluster thus formed, a reference workload curve was selected at random from the workloads
of the tasks in that cluster (due to time constraints).
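For reference, a self-contained sketch of the dynamic time warping distance is shown below. The report used MATLAB's DTW function; this Java version, with a reference sine curve and a shifted copy of it as made-up inputs, is only meant to illustrate the distance being computed.

// Sketch of the classic dynamic time warping distance between two usage curves.
public final class Dtw {

    static double distance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                // extend the cheapest of the three allowed warping moves
                d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        double[] reference = new double[100];
        double[] taskUsage = new double[100];
        for (int t = 0; t < reference.length; t++) {
            reference[t] = Math.sin(2 * Math.PI * t / reference.length);       // reference sine curve
            taskUsage[t] = Math.sin(2 * Math.PI * (t + 5) / taskUsage.length); // shifted copy of it
        }
        System.out.println("DTW distance to reference: " + distance(reference, taskUsage));
    }
}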
7. Workload Prediction
 When a user from the targeted list issues a task, the task’s workload is observed for a pre-determined
amount of time. This time period was determined by trial and error as the minimum time at which all
reference curves differ from one another.
 During this time period, the task’s workload is compared with the reference curves of all task
clusters formed in the previous step.
 If the current task’s workload curve has zero distance to one of the reference curves, i.e. it is similar
to that reference curve, the current task is expected to keep behaving like the reference curve and its
workload is predicted accordingly (see the matching sketch below).
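As a rough illustration of this matching step, the sketch below compares the observed window of a new task against each cluster's reference curve and accepts the closest one if it is within a tolerance. For brevity it uses a mean absolute difference over the window rather than the DTW distance used in the report, and the cluster names, curves and tolerance are made up.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: match a task's observed initial window to one of the cluster
// reference curves; names, curves and tolerance are illustrative.
public final class ReferenceMatcher {

    static String bestMatch(double[] observedWindow, Map<String, double[]> referenceCurves,
                            double tolerance) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : referenceCurves.entrySet()) {
            double[] ref = e.getValue();
            int n = Math.min(observedWindow.length, ref.length);
            double sum = 0.0;
            for (int t = 0; t < n; t++) {
                sum += Math.abs(observedWindow[t] - ref[t]);  // pointwise deviation over the window
            }
            double score = sum / n;
            if (score < bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        // only accept the match if the curves are (near) identical over the window
        return bestScore <= tolerance ? best : null;
    }

    public static void main(String[] args) {
        Map<String, double[]> refs = new LinkedHashMap<>();
        refs.put("cluster-A", new double[]{0.01, 0.02, 0.03, 0.02});
        refs.put("cluster-B", new double[]{0.05, 0.05, 0.05, 0.05});
        double[] observed = {0.011, 0.021, 0.029, 0.02};
        System.out.println("matched: " + bestMatch(observed, refs, 0.005)); // -> cluster-A
    }
}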
Resource allocation and de-allocation cannot be done fully dynamically because of:
 the large overhead involved
 the delay in allocating resources
Resource allocation must therefore happen once every pre-determined interval of time rather than
continuously.
Hence, for stealing the resource, allocation and re-allocation follow a step function: based on the
predicted curve’s slope, a step up or a step down is performed. A threshold value is also set to
accommodate unexpected spikes in the workload.
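A minimal sketch of such a step-function allocation is given below; the interval length, threshold and predicted-usage values are illustrative, and the exact step rule (here: peak predicted usage over the coming interval plus the threshold, capped at the request) is an assumption rather than the report's exact implementation.

// Sketch of the step-function allocation: the allocation level only changes at
// interval boundaries, following the predicted usage curve plus a safety
// threshold (a rising prediction yields a step up, a falling one a step down).
public final class StepAllocator {

    static double[] allocate(double[] predictedUsage, int intervalLength,
                             double threshold, double requested) {
        double[] allocated = new double[predictedUsage.length];
        double current = requested;                    // fall-back level, overwritten at t = 0
        for (int t = 0; t < predictedUsage.length; t++) {
            if (t % intervalLength == 0) {             // re-allocation only at interval boundaries
                double peak = 0.0;                     // peak predicted usage over the coming interval
                for (int k = t; k < Math.min(t + intervalLength, predictedUsage.length); k++) {
                    peak = Math.max(peak, predictedUsage[k]);
                }
                current = Math.min(requested, peak + threshold);
            }
            allocated[t] = current;
        }
        return allocated;
    }

    public static void main(String[] args) {
        double[] predicted = {0.01, 0.012, 0.02, 0.04, 0.035, 0.02, 0.015, 0.01};
        double[] allocation = allocate(predicted, 4, 0.005, 0.12);
        for (int t = 0; t < allocation.length; t++) {
            System.out.printf("t=%d predicted=%.3f allocated=%.3f%n", t, predicted[t], allocation[t]);
        }
    }
}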
Successful prediction:
Average unused resource: 94%; average resource stolen: 65% (requested – allocated)
Figure: efficient resource allocation curve (CPU usage vs. time in seconds), showing the Used, Allocated, and Req (requested) curves
Failed prediction:
Reason: The chart below shows a case where our algorithm failed to predict correctly. This is because
of the random selection of the reference curve for the task clusters. Although the randomly selected
reference curve generated a decent resource allocation curve, there are points at which the current
task spikes and exceeds the allocated resource. The solution to this issue is discussed in the Issues faced
section.
Figure: failed prediction (CPU usage vs. time in seconds), showing the Allocated and Usage curves
8. Tools and Algorithms Used:
 JAVA: For extracting required data out of the datasets, we used Java programming (csv
reader/writer, hashmaps).
 DTW on MATLAB: Implemented DTW using Matlab’s in-built function.
 WEKA 3.7: To run the clustering algorithms – K-means, EM, Cascade Simple K-means, X-means.
 TABLEAU 8.1: To visualize the datasets and results.
 Naïve Bayes on MATLAB for choosing the right cluster: We could not use this because the data was
continuous and the algorithm needed discrete data.
 Correlation on MATLAB: Since DTW was a better option for comparing two curves, we dropped this
approach.
9. Issues faced and possible solutions:
 MATLAB crashing: While executing DTW on each task of each user (nearly 9,000 in total), MATLAB
crashed; this was rectified by running the data in batches. The DTW algorithm in MATLAB takes two
vectors as input and expands them into a full matrix, which increases the time and memory cost to a
great extent.
 MATLAB numeric data: We had problems getting MATLAB to accept the user as a string data type along
with the other parameters. We had to run Java programs to map the users of tasks to the corresponding
DTW values.
 Naïve Bayes algorithm: Learning from the existing data and predicting the given test curve using
Naïve Bayes might have given better predictions, but since the data was continuous, we could not
implement this algorithm.
 We initially considered each user’s tasks separately and ran the DTW algorithm on them to identify
whether a user had a recurring workload pattern. As very few users had such a pattern, we ended up
ignoring a lot of data, so we ran DTW on the tasks of all users together instead of per user.
10. Looking ahead
10.1 Improvements and optimizations
 Choosing a good reference curve for running DTW was difficult. Using a straight line as the reference
curve gave mediocre results, as curves with peaks at different instants in time were grouped as similar.
We therefore compared results using the line x = y and a sine curve as reference curves, and obtained
better results with the sine curve.
 The representative curve for each task cluster was chosen at random due to time constraints. This
could be improved by using a curve-fitting algorithm to derive an overall reference curve for the
cluster.
 Cases where, during prediction, a task’s workload changes to resemble another cluster are not
currently handled. This could be addressed by continuously comparing the current task’s workload
with every cluster’s reference curve and, when the task appears to shift to another cluster, dynamically
re-mapping the step curve to the new cluster’s reference curve. Such constant monitoring and dynamic
re-mapping would improve the prediction accuracy.
GROUP MEMBERS
PRABHAKAR GANESAMURHTY (prabhakarg@utdallas.edu)
PRIYANKA MEHTA (priyanka.nmehta@gmail.com)
ARUNRAJA SRINIVASAN (arunraja.srinivasan@yahoo.com)
ABINAYA SHANMUGARAJ (abias1702@gmail.com)
References:
1. Solis Moreno, I., Garraghan, P., Townend, P. M. and Xu, J. (2013). An Approach for Characterizing Workloads in Google
Cloud to Derive Realistic Resource Utilization Models. In: Service Oriented System Engineering (SOSE), 2013 IEEE 7th
International Symposium on. IEEE, pp. 49-60. ISBN 978-1-4673-5659-6.
2. Pelleg, D. and Moore, A. (2000). X-means: Extending K-means with Efficient Estimation of the Number of Clusters.
3. Wang, X., et al. (2010). Experimental comparison of representation methods and distance measures for time
series data. Data Mining and Knowledge Discovery, pp. 1-35.
