GROUP – B3
CLOUD WORKLOAD ANALYSIS AND SIMULATION
Investigating the behavior of a cloud, focusing on its workload
patterns
Jan – May 2014
TABLE OF CONTENTS
Highlights
   The approach
   Dataset preprocessing and analysis
   Clustering analysis
   Time series analysis
   Workload prediction
   Looking ahead
1. Objective
2. The Approach
3. Dataset preprocessing and analysis
   3.1 Preprocessing
   3.2 Analysis
4. Calculating resource usage statistics
5. Classification of users and identifying target users
6. Time series analysis
7. Workload Prediction
8. Tools and Algorithms Used
9. Issues faced and possible solutions
10. Looking ahead
GROUP MEMBERS
References
Highlights:
The approach
 Studied the Google trace data schema
 Studied related technical papers and summarized useful observations
 Devised an approach to analyze cloud workload using the observations from the technical papers and the
Google trace data schema
Dataset preprocessing and analysis
 Preprocessed the data to prepare it for analysis
 Visualized important statistics for feasibility decision and computed relevant attributes
 The main attributes were analyzed and visualized and observations were made.
Clustering analysis
 Applied various clustering algorithms, compared the results, and chose the best clustering for user and task
analysis
 Classified users primarily based on estimation ratios, and tasks based on CPU and memory usage
Time series analysis
 Identified target users and their tasks from the clustering results
 Ran the dynamic time warping (DTW) algorithm on tasks to identify patterns in their resource usage
Workload prediction
 Identified users with specific resource usage patterns
 Allocated resources to these users based on the identified usage pattern, with a threshold value as a safety margin
Looking ahead
 Improvements to our approach
1. Objective:
 Analyze and report the cloud workload data based on Google cloud trace
 Use graphical tools to visualize the data (you may need to write programs to process the data in order to feed them into visual tools)
 Study and summarize the papers regarding other people’s experience in Google cloud trace
analysis
 Determine the workload characteristics from your analysis of the Google cloud trace
 Try to reallocate unused resources of a user to other users who require them
2. The Approach:
Based on our study of the Google cloud trace data and the observations gathered from the technical
papers, we devised the following approach for the problem:
 Analyze and visualize the data to identify the important attributes that determine a user's workload
pattern, and ignore the rest of the attributes
 Calculate resource usage statistics of users to identify the feasibility of resource re-allocation
 Classify users based on their resource usage quality[1] (amount of unused resource/resource
requested) using clustering analysis
 Identify target users based on the clustering analysis for resource re-allocation
 Study the workload pattern of tasks of the target users and classify tasks based on their lengths
 Perform time series analysis on long tasks
 Identify a recurring pattern (if any) for a user and associate that pattern with that user, or form clusters
of tasks across all users that have similar workloads based on the time series analysis
 Predict the usage pattern of a user if the current task's pattern matches the pattern associated
with that user, or matches one of the clusters formed in the previous step.
3. Dataset preprocessing and analysis
3.1 Preprocessing
Inconsistent and ambiguous data was cleaned up before analysis. The task-usage table contains many
records for the same Job ID-Task index pair because a task may be re-submitted or re-scheduled after a
failure. To avoid reading multiple values for the same Job ID-Task index pair, the data was pre-processed.
Pre-processing: all records were grouped by Job ID-Task index, and only the last occurring record of each
group of repeated task records was kept as a single record.
Timestamps were also converted into days and hours for per-day analysis.
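This last-record-wins grouping can be sketched in a few lines of Java (the language we used for data extraction). The sketch below is illustrative only: the TaskUsageRecord fields and the assumption that records are read in timestamp order are simplifications, not the exact trace schema or the code we actually ran.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of last-record-wins grouping; field names are assumptions,
// not the exact Google trace schema.
class TaskUsageRecord {
    long jobId;
    int taskIndex;
    long timestampMicros;   // microseconds from trace start
    double cpuUsage;
    double memoryUsage;
}

class TaskUsagePreprocessor {
    // Assumes records are iterated in timestamp order, so the last put() per key wins.
    static Map<String, TaskUsageRecord> deduplicate(Iterable<TaskUsageRecord> records) {
        Map<String, TaskUsageRecord> latest = new LinkedHashMap<>();
        for (TaskUsageRecord r : records) {
            String key = r.jobId + "-" + r.taskIndex;    // group by Job ID-Task index
            latest.put(key, r);                          // overwrite earlier occurrences
        }
        return latest;
    }

    // Convert a trace timestamp into day and hour buckets for per-day analysis.
    static int[] toDayHour(long timestampMicros) {
        long seconds = timestampMicros / 1_000_000L;
        int day = (int) (seconds / 86_400L);
        int hour = (int) ((seconds % 86_400L) / 3_600L);
        return new int[] { day, hour };
    }
}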
3.2 Analysis:
The data in the cloud trace tables was visualized. Attributes that were constant or varied within a
small range of values for most records were not considered for analysis. Attributes that play a major
part in shaping the user profile and the task profile were treated as important. These main attributes
were analyzed and visualized, and the following observations were made.
Figure 1: CPU requested per user (blue) vs. CPU used per user (red)
Observation: most users overestimate the resources they need and use less than 5% of the requested resources.
A few users underestimate the resources and use more than three times the requested amount.
Figure 2: Memory requested per user (blue) vs. memory used per user (red)
Users with a negative (orange) memory estimation ratio used more memory than requested.
Users with a memory estimation ratio between 0.9 and 1 used less than 10% of the requested memory.
4. Calculating resource usage statistics:
As we are concerned with re-allocation of unused resources, we should look at the users who over-
estimate their resources, as observed in the previous section.
To identify the users who over-estimate their resources, a new attribute is calculated:
Estimation ratio [1] = (requested resource – used resource) / requested resource
The estimation ratio normally varies from 0 to 1 (it becomes negative when a user consumes more than requested):
0 – the user has used all of the requested resource
1 – the user has not used any of the requested resource
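As a small illustration of the formula (not the exact code we used), the per-user ratio can be computed from the aggregated requested and used amounts; the helper name and its handling of zero requests are assumptions.

// Estimation ratio from [1]: (requested - used) / requested.
// Negative values indicate that the user consumed more than requested.
static double estimationRatio(double requested, double used) {
    if (requested <= 0.0) {
        return 0.0; // assumption: treat zero-request users as "fully used"
    }
    return (requested - used) / requested;
}

// Example: a user who requested 2.0 CPU units but used only 0.1 gets
// (2.0 - 0.1) / 2.0 = 0.95, i.e. 95% of the requested CPU is left unused.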
Also from the visualizations and observations made, the following are identified as important
attributes:
User: Submission rate, CPU estimation ratio, Memory estimation ratio
Task: Task length, CPU usage, Memory usage
Figure 3: CPU estimation ratio per user
Users with a negative (red) CPU estimation ratio used more CPU than requested.
Users with a CPU estimation ratio between 0.9 and 1 used less than 10% of the requested CPU.
Figure 4: Memory estimation ratio per user
Users with a negative (orange) memory estimation ratio used more memory than requested.
Users with a memory estimation ratio between 0.9 and 1 used less than 10% of the requested memory.
5. Classification of users and identifying target users:
The dimensions used for classifying users are:
User: submission rate, CPU estimation ratio, memory estimation ratio
We use the following clustering algorithms to identify the optimal number of clusters for users and tasks:
 K- means
 Expectation – Maximization (EM)
 Cascade Simple K-means
 X-means[2]
We categorize the users and tasks using these clustering algorithms with the above dimensions, then
compare the results and choose the best clustering for users and tasks.
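For reference, a 4-cluster K-means run over the user attributes might look like the sketch below, written against the WEKA 3.7 Java API; the file name users.arff and its attribute layout are assumptions made for illustration, not our actual experiment scripts.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UserClustering {
    public static void main(String[] args) throws Exception {
        // users.arff is assumed to contain one row per user with the three dimensions:
        // submission rate, CPU estimation ratio and memory estimation ratio.
        Instances users = new DataSource("users.arff").getDataSet();

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);                 // the 4-cluster configuration we selected
        kMeans.setPreserveInstancesOrder(true);   // keep the row-to-cluster mapping
        kMeans.buildClusterer(users);

        int[] assignments = kMeans.getAssignments();
        for (int i = 0; i < assignments.length; i++) {
            System.out.println("user row " + i + " -> cluster " + assignments[i]);
        }
        System.out.println(kMeans);               // centroids and cluster sizes
    }
}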
[Figure: user clustering results – K-means (4 clusters) and EM clustering]
[Figure: user clustering results – X-means and Cascade Simple K-means]
K-means clustering with 4 clusters was selected, as it offers a good clustering of users based on the
CPU and memory estimation ratios.
From the clustering results we observed that 97% of the users have estimation ratios between 0.7 and 1.0;
that is, 97% of the users leave at least 70% of the resources they request unused. We targeted User
Cluster 0 and Cluster 3 (more than 90% unused).
We then targeted tasks that were long enough to allow efficient resource re-allocation, and performed
clustering on the task lengths of these users to filter out short tasks.
6. Time series analysis
To identify user’s tasks with similar workload, we ran the DTW[3] algorithm on each tasks of Cluster0 and
Cluster3 users, computed the DTW between all target user’s tasks and a reference sine curve (refer
Issues faced section)
Clustered tasks that have same DTW value
These tasks were identified to have similar workload curve.
Two tasks with same DTW distances having similar workloads
The clusters hence formed a reference workload curve was randomly selected from one of task’s
workload in the group of tasks in that cluster. (due to time constraint)
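Although we ran DTW through MATLAB, the underlying dynamic-programming recurrence is compact enough to show directly. The Java sketch below is an illustrative re-implementation (not the MATLAB routine we used) that returns the DTW distance between two usage series, e.g. a task's CPU usage samples and the reference sine curve.

import java.util.Arrays;

public class DtwDistance {
    // Classic dynamic time warping distance between two 1-D usage series.
    static double dtw(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = Math.abs(a[i - 1] - b[j - 1]);           // local distance
                cost[i][j] = d + Math.min(cost[i - 1][j],           // insertion
                                 Math.min(cost[i][j - 1],           // deletion
                                          cost[i - 1][j - 1]));     // match
            }
        }
        return cost[n][m];
    }
    // Tasks whose distance to the same reference curve is (nearly) equal are then
    // grouped into one workload cluster, as described above.
}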
7. Workload Prediction
 When a user from the targeted list of users issues a task, the task's workload is observed for a pre-
determined amount of time. This time period was determined by trial and error, as the minimum time at
which all reference curves differ from one another.
 During this time period, the task's workload is compared with the reference curve of every task
cluster formed in the previous step.
 If the current task's workload curve has zero DTW distance to one of the reference curves, i.e., is similar
to that reference curve, the current task is expected to behave like the reference curve and its
workload is predicted accordingly.
Resource allocation and de-allocation cannot be done continuously because of:
 Huge overhead
 Delay in allocating resources
Resource allocation must therefore happen once in every pre-determined interval of time rather than
continuously. Hence, for stealing the resource, the allocation and re-allocation curve is a step function:
based on the slope of the predicted curve, a step-up or step-down is performed, and a threshold value is
set to accommodate unexpected spikes in the workload (see the sketch below).
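A minimal sketch of this step-function allocation follows; the interval length, step size, and threshold are illustrative parameters rather than the values tuned in our experiments.

public class StepAllocator {
    // Step-function allocation: re-allocate only at interval boundaries, stepping up or
    // down with the slope of the predicted usage curve, plus a fixed safety threshold.
    static double[] allocate(double[] predicted, int intervalLen,
                             double step, double threshold, double requested) {
        double[] allocated = new double[predicted.length];
        double current = predicted[0] + threshold;                // initial allocation
        for (int t = 0; t < predicted.length; t++) {
            if (t > 0 && t % intervalLen == 0) {                  // only at interval boundaries
                double slope = predicted[t] - predicted[t - intervalLen];
                if (slope > 0) {
                    current += step;                              // step-up for a rising prediction
                } else if (slope < 0) {
                    current -= step;                              // step-down for a falling prediction
                }
            }
            // keep at least the safety threshold, never exceed the original request
            allocated[t] = Math.min(requested, Math.max(current, threshold));
        }
        return allocated;
    }
    // The gap (requested - allocated[t]) is the amount that can be "stolen" and
    // offered to other users at time t.
}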
Successful prediction:
Average unused resource: 94%; average resource stolen: 65% (requested – allocated)
[Figure: efficient resource allocation curve – CPU usage vs. time (s), showing the Used, Allocated, and Requested curves]
Failed prediction:
Reason: the chart below shows a case where our algorithm failed to predict correctly. This is because of
the random selection of the reference curve for task clusters. Although the randomly selected reference
curve generated a decent resource allocation curve, there are points at which the current task spikes and
exceeds the allocated resource. A possible solution to this issue is discussed in the Looking ahead section.
[Figure: failed prediction – CPU usage vs. time (s), showing the Allocated and Usage curves]
8. Tools and Algorithms Used:
 Java: for extracting the required data out of the datasets, we used Java programs (CSV
reader/writer, HashMaps).
 DTW on MATLAB: implemented DTW using MATLAB's built-in function.
 WEKA 3.7: to run the clustering algorithms – K-means, EM, Cascade Simple K-means, X-means.
 Tableau 8.1: to visualize the datasets and results.
 Naïve Bayes on MATLAB for choosing the right cluster: we could not use this because our data was
continuous and the algorithm needed discrete data.
 Correlation on MATLAB: since DTW was a better option for comparing two curves, we dropped this
approach.
9. Issues faced and possible solutions:
 MATLAB crashing: while executing DTW for each task of each user (nearly 9,000 tasks in total),
MATLAB crashed; this was rectified by running the data in batches. The DTW algorithm in MATLAB
takes two vectors as input and expands them into a matrix it operates on, which increases the time
complexity to a great extent.
 MATLAB numeric data: we had problems getting MATLAB to accept the user as a String data type along
with the other, numeric parameters. We had to run Java programs to map the users of tasks to the
corresponding DTW values.
 Naïve Bayes algorithm: learning from the existing data and predicting the given test curve using
Naïve Bayes might have given better predictions, but since the data was continuous, we could not
use this algorithm.
 We initially considered each user's tasks separately and ran the DTW algorithm on them to identify
whether the user has a recurring workload pattern. As very few users had such a pattern, we ended up
ignoring a lot of data, so we ran DTW on the tasks of all users instead of per user.
10. Looking ahead
10.1 Improvements and optimizations
 Choosing a good reference curve for running DTW was difficult. Using a straight line as the
reference curve gave mediocre results, as curves with peaks at different instants of time were
grouped as similar. We therefore compared results using the line x = y and a sine curve as reference
curves, and the sine curve gave good results.
 Choosing a representative curve for a task cluster was done at random due to time
constraints. This can be improved by using a curve-fitting algorithm to obtain an overall reference
curve for each cluster.
 The case where a task's workload drifts during prediction to resemble some other cluster is not
handled yet. This can be handled by continuously comparing the current task's workload with every
cluster's reference curve and, when the current task appears to shift to another cluster, dynamically
remapping the step curve to the new cluster's reference curve. This constant monitoring and dynamic
remapping would improve the prediction accuracy.
GROUP MEMBERS
PRABHAKAR GANESAMURTHY - prabhakarg@utdallas.edu
PRIYANKA MEHTA - priyanka.nmehta@gmail.com
ARUNRAJA SRINIVASAN - arunraja.srinivasan@yahoo.com
ABINAYA SHANMUGARAJ - abias1702@gmail.com
References:
1. Solis Moreno, I., Garraghan, P., Townend, P. M., and Xu, J. (2013). "An Approach for Characterizing Workloads in Google
Cloud to Derive Realistic Resource Utilization Models." In: Service Oriented System Engineering (SOSE), 2013 IEEE 7th
International Symposium on. IEEE, pp. 49-60. ISBN 978-1-4673-5659-6.
2. Pelleg, D. and Moore, A. (2000). "X-means: Extending K-means with Efficient Estimation of the Number of Clusters."
3. Wang, Xiaoyue, et al. (2010). "Experimental comparison of representation methods and distance measures for time
series data." Data Mining and Knowledge Discovery, pp. 1-35.