Analysis of Grid log data with Affinity Propagation
G. Modena, M. van Someren
Universiteit van Amsterdam
June 20, 2013

In this paper we present an unsupervised learning approach to detect meaningful job traffic patterns in Grid log data. Manual anomaly detection on modern Grid environments is troublesome given their increasing complexity, the distributed, dynamic topology of the network and heterogeneity of the jobs being executed. The ability to automatically detect meaningful events with little or no human intervention is therefore desirable. We evaluate our method on a set of log data collected on the Grid. Since we lack a priori knowledge of patterns that can be detected and no labelled data is available, an unsupervised learning method is followed. We cluster jobs executed on the Grid using Affinity Propagation. We try to explain discovered clusters using representative features and we label them with the help of domain experts. Finally, as a further validation step, we construct a classifier for five of the detected clusters and we use it to predict the termination status of unseen jobs.
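
As an illustration of the clustering step, here is a minimal sketch of Affinity Propagation in plain Python. It is not the authors' implementation. The six points are made-up stand-ins for job feature vectors, and the similarity measure (negative squared Euclidean distance), the preference (median similarity), the damping factor and the fixed iteration count are all assumptions chosen because they are common defaults. The algorithm exchanges two kinds of messages: a responsibility, which says how well suited point k is to be the exemplar of point i, and an availability, which says how appropriate it is for point i to pick k as its exemplar.

```python
from statistics import median


def similarity_matrix(points):
    """Negative squared Euclidean distance between every pair of points."""
    n = len(points)
    S = [[-float(sum((a - b) ** 2 for a, b in zip(points[i], points[k])))
          for k in range(n)] for i in range(n)]
    # Assumption: the median similarity is used as every point's "preference"
    # (its prior suitability as an exemplar) and stored on the diagonal.
    preference = median(value for row in S for value in row)
    for i in range(n):
        S[i][i] = preference
    return S


def affinity_propagation(S, damping=0.5, iterations=200):
    """Return, for each point, the index of the exemplar it ends up choosing."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how much better k is for i than i's best alternative.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: support that k has gathered from the other points.
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]  # excludes k's own entry
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Hypothetical toy data: two feature values per job, not taken from the EGEE logs.
jobs = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
exemplars = affinity_propagation(similarity_matrix(jobs))
print(exemplars)
```

Each entry of the returned list is the index of the exemplar chosen by that point, and points that share an exemplar form one cluster. The number of clusters is never passed in; it follows from the preference value, which is why no cluster-count parameter appears.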


  1. Analysis of Grid log data with Affinity Propagation. G. Modena, M. van Someren, Universiteit van Amsterdam, June 20, 2013.
  2. Grid Computing: paradigm vs. implementation.
  3. Paradigm: on-demand computational power and storage; distributed, heterogeneous workload; dynamic network topology; resources: users, computing nodes, brokers, jobs.
  4. Implementation.
  5. Problem domain. Grid: elevated (job) failure rates. Possible causes: user-related failures (configuration errors, buggy code); network failures (firewalls blocking traffic, nodes disappearing, errors at site level); resource failures (errors on local nodes).
  6. Failure patterns: job abortion rate; resubmitted jobs rate; cross-domain traffic patterns; resource starvation; unreachable or black hole nodes.
  7. Troubleshooting: manual analysis and correlation of log data; data is generated by multiple sources; time-consuming sysop involvement; job traffic is domain dependent; hard to manage and assess problems.
  8. Goal: detect meaningful job traffic patterns in Grid log data. Approach: (1) clustering; (2) get experts to label and accept clusters; (3) train a classifier for interesting clusters.
  9. Data. Grid Observatory provided a dataset of log files (EGEE). In one week: 2356 jobs, 16 users, 18 brokers, 25 computing elements. No categories: unlabelled data. Average job path length of 4 hops, duration 7.5 hours, job abortion rate close to 50%. Jobs are followed until time t (some jobs run very long, or disappear); several values for t were tried (1 hour to 1 week). Extracted 40 features for describing jobs: features of user, input node, broker, computing element, path.
  10. Example of features: duration; number of transitions between grid nodes; latest status code; last grid node visited; number of resubmissions.
  11. Step 1: clustering. Affinity Propagation: represent data points as a cluster graph; belief propagation. Chosen because: state of the art, proven in multiple domains; no parameter for number of clusters; found to be fast; found to find good clusters.
  12. Method: Affinity Propagation. Given: pairwise similarity between data points. Find: clusters, each consisting of an exemplar (the most characteristic data point in the cluster) and members. At each iteration, estimate how well suited a point is to be an exemplar of a cluster, and how well suited a point is to be a member of a cluster around another exemplar. On convergence, clusters emerge.
  13. Result of clustering step: 12 clusters (longer time periods = more clusters); extracted characteristic features for each cluster.
  14. Step 2: cluster validation. Presented a list of clusters and their characteristic features to human experts (NIKHEF). Experts selected and labeled 5 clusters that were meaningful and large enough. The "best" clusters were discovered for values of t between 8 and 24 hours.
  15. 5 useful clusters, with features. (a) Successfully terminated jobs (557): job length, user, resource broker num jobs, computing units num jobs. (b) Failure involving computing elements and brokers (189): job length, user, resource broker num jobs, computing units num jobs. (c) High job resubmission count and low resubmission time intervals, possible black hole node (133): resubmission count, resubmission time interval, resource broker average job permanence. (d) User-related problem, maybe a user submitting jobs to a broker unable to match her requirements (11): user, resource broker aborted, resource broker terminated, resource broker average job permanence, resource broker num jobs. (e) Anomaly possibly caused by resource exhaustion on computing elements (101): computing element average job permanence, average terminated, computing element num jobs.
  16. Step 3: classification. Cross-validation with same data. SVM as classifier (boolean and multi-class). 84% accuracy on cross-validation for the boolean class case; 60% accuracy on cross-validation for the multi-class case.
  17. Conclusion. Our method: discovered useful classes of jobs on a Grid network; can be used as filter in monitoring (trend discovery). Future work: scale up the dataset; incorporate a temporal dimension in clustering; online anytime clustering and classification.
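
The classification step (slide 16) can be sketched for a two-class case as follows. This is an illustration under stated assumptions, not the authors' setup: the slides do not say which SVM kernel or library was used, so the sketch assumes a linear SVM trained by stochastic subgradient descent on the hinge loss, written in plain Python. The feature values, the +1/-1 labels and the helper names (`train_linear_svm`, `predict`, `cross_val_accuracy`) are hypothetical, and the toy data does not reproduce the accuracies reported on the slide.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Stochastic subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. The bias is the last weight and, for
    simplicity, is regularised together with the other weights.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xa = list(xi) + [1.0]                   # constant feature for the bias
            violated = yi * dot(w, xa) < 1          # inside the margin or misclassified
            w = [wj * (1 - lr * lam) for wj in w]   # regularisation shrinkage
            if violated:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xa)]
    return w


def predict(w, x):
    return 1 if dot(w, list(x) + [1.0]) >= 0 else -1


def cross_val_accuracy(X, y, k=3):
    """k-fold cross-validation: train on k-1 folds, score on the held-out fold."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        w = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(w, X[i]) == y[i] for i in held_out)
    return correct / len(X)


# Hypothetical, already-standardised feature pairs for jobs inside (+1)
# and outside (-1) one expert-labelled cluster.
inside = [(2, 1), (1, 2), (2, 2), (3, 2), (2, 3), (3, 3)]
outside = [(-a, -b) for a, b in inside]
X = inside + outside
y = [1] * len(inside) + [-1] * len(outside)

accuracy = cross_val_accuracy(X, y, k=3)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

The toy classes are linearly separable, so the held-out points are all classified correctly here; real job features would not separate this cleanly. A multi-class variant could train one such classifier per cluster and pick the highest-scoring one, which is again an assumption rather than something stated on the slides.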
