Analysis of grid log data with Affinity Propagation

G. Modena, M. van Someren
Universiteit van Amsterdam
June 20, 2013

Grid Computing
Paradigm vs. implementation

Paradigm
On-demand computational power and storage
Distributed, heterogeneous workload
Dynamic network topology
Resources: users, computing nodes, brokers, jobs

Implementation

Problem domain
Grid
Elevated (job) failure rates
Possible causes
User-related failures (configuration errors, buggy code)
Network failures (firewalls blocking traffic, nodes disappearing, errors at
site level)
Resource failure (error on local nodes)

Failure patterns
job abortion rate
resubmitted jobs rate
cross-domain traffic patterns
resource starvation
unreachable or black hole nodes

Troubleshooting
Manual analysis and correlation of log data
Data is generated by multiple sources
Time consuming sysop involvement
Job traffic is domain dependent
Hard to manage and assess problems

Goal
Detect meaningful job traffic patterns in Grid log data
Approach
1 Clustering
2 Get experts to label and accept clusters
3 Train a classifier for interesting clusters

Data
Grid Observatory provided a dataset of log files (EGEE)
In one week: 2356 jobs, 16 users, 18 brokers, 25 computing elements
No categories: unlabelled data
Average job path length of 4 hops, duration 7.5 hours, job abortion
rate close to 50%
Each job is observed until time t (some jobs run very long, or disappear);
tried several values for t (1 hour to 1 week)
Extracted 40 features for describing jobs: features of user, input node,
broker, computing element, path

Example of features
Duration
Number of transitions between grid nodes
Latest status code
Last grid node visited
Number of resubmissions
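
The five example features above can be derived from the event records of a single job. The following is a minimal sketch, assuming a hypothetical record schema (time, node, status) and a hypothetical "RESUBMITTED" status code; neither is taken from the EGEE logs. Only events seen up to the observation time t are used.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One log record of a job (hypothetical schema, not the EGEE format)."""
    time: float   # seconds since the job was submitted
    node: str     # grid node that emitted the record
    status: str   # status code reported by that node


def job_features(events, t):
    """Summarise one job using only the events observed up to time t."""
    seen = sorted((e for e in events if e.time <= t), key=lambda e: e.time)
    if not seen:
        return None
    # A transition is counted whenever two consecutive records come from different nodes.
    transitions = sum(1 for a, b in zip(seen, seen[1:]) if a.node != b.node)
    # "RESUBMITTED" is an assumed status code used for illustration only.
    resubmissions = sum(1 for e in seen if e.status == "RESUBMITTED")
    return {
        "duration": seen[-1].time - seen[0].time,
        "transitions": transitions,
        "latest_status": seen[-1].status,
        "last_node": seen[-1].node,
        "resubmissions": resubmissions,
    }


# Made-up records for one job; node names and codes are illustrative.
events = [
    Event(0, "ui01", "SUBMITTED"),
    Event(60, "broker03", "WAITING"),
    Event(300, "ce07", "RUNNING"),
    Event(900, "broker03", "RESUBMITTED"),
    Event(1500, "ce11", "RUNNING"),
    Event(90000, "ce11", "DONE"),
]
features = job_features(events, t=3600)  # observe the job for one hour
```

With t set to one hour the final "DONE" record is not yet visible, so the job is described as still running on its last node.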

Step 1: clustering
Affinity Propagation
Represent data points as a cluster graph
Belief propagation
Chosen because
State of the art - proven in multiple domains
No parameter for number of clusters
Found to be fast
Found to produce good clusters

Method
Affinity Propagation
Given: pairwise similarity between data points
Find: clusters, consisting of exemplar (most characteristic datapoint in
cluster) and members
At each iteration estimate
How well suited a point is to be an exemplar of a cluster
How well suited a point is to be a member of a cluster around
another exemplar
On convergence
Clusters emerge
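
The two quantities estimated at each iteration are usually called responsibility (how well suited a point is as exemplar for another point) and availability (how appropriate it is for a point to pick that exemplar). The following is a minimal pure-Python sketch of the standard message-passing updates. The similarity measure (negative squared Euclidean distance), the median preference, the damping factor and the toy points are assumptions made for illustration; the slides do not state which settings were used on the Grid data.

```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=200):
    """Toy affinity propagation; expects at least two points."""
    n = len(points)
    # Assumed similarity: negative squared Euclidean distance.
    s = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    # Assumed preference: median of the off-diagonal similarities.
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                rival = second if k == best else scores[best]
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            pos[k] = r[k][k]  # the self-responsibility enters unclipped
            total = sum(pos)
            for i in range(n):
                new = total - r[k][k] if i == k else min(0.0, total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # A point is an exemplar when its combined self-evidence is positive.
    exemplars = [k for k in range(n) if a[k][k] + r[k][k] > 0]
    if not exemplars:
        raise RuntimeError("no exemplar emerged; try more iterations")
    # Every other point joins the exemplar it is most similar to.
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels


# Two well separated toy groups (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars, labels = affinity_propagation(points)
```

The number of clusters is never passed in: it follows from the preference value on the diagonal of the similarity matrix, which is why the method needs no parameter for the number of clusters.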

Result of clustering step
12 clusters (longer time periods = more clusters)
Extracted characteristic features for each cluster
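
The slides do not say how the characteristic features were extracted. One possible approach, shown here purely as an assumption, is to rank the features of each cluster by how far the cluster mean lies from the overall mean, measured in overall standard deviations. The feature names and values below are made up.

```python
from statistics import mean, pstdev


def characteristic_features(rows, labels, names, top=2):
    """Rank features per cluster by |cluster mean - overall mean| / overall std."""
    columns = list(zip(*rows))
    overall = [(mean(col), pstdev(col)) for col in columns]
    result = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        scores = []
        for j, name in enumerate(names):
            mu, sigma = overall[j]
            if sigma == 0:
                continue  # a constant feature cannot characterise any cluster
            shift = abs(mean(row[j] for row in members) - mu) / sigma
            scores.append((shift, name))
        scores.sort(reverse=True)
        result[cluster] = [name for _, name in scores[:top]]
    return result


# Hypothetical feature table: one row per job, one label per job.
names = ["duration_h", "resubmissions", "transitions"]
rows = [
    (1.0, 0, 4), (1.2, 0, 4), (0.9, 0, 4),  # jobs without resubmissions
    (1.1, 6, 4), (1.0, 7, 4), (0.8, 5, 4),  # jobs with many resubmissions
]
labels = [0, 0, 0, 1, 1, 1]
ranking = characteristic_features(rows, labels, names, top=1)
```

In this toy table the two clusters differ mainly in their resubmission counts, so that feature is ranked first for both.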

Step 2: Cluster validation
Presented a list of clusters and their characteristic features to human
experts (NIKHEF)
Experts selected and labeled 5 clusters that were meaningful and large
enough
"Best" clusters have been discovered for values of t between 8 and 24
hours

5 useful clusters - with features
Successfully terminated jobs (557): job length, user, resource broker num jobs, computing units num jobs
Failure involving computing elements and brokers (189): job length, user, resource broker num jobs, computing units num jobs
High job resubmission count and low resubmission time intervals; possible black hole node (133): resubmission count, resubmission time interval, resource broker average job permanence
User related problem; maybe a user submitting jobs to a broker unable to match her requirements (11): user, resource broker aborted, resource broker terminated, resource broker average job permanence, resource broker num jobs
Anomaly possibly caused by resource exhaustion on computing elements (101): computing element average job permanence, average terminated, computing element num jobs

Step 3: Classification
Cross-validation with same data
SVM as classifier (boolean and multi-class)
84% accuracy on cross-validation for the boolean class case
60% accuracy on cross-validation for the multi-class case
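
Cross-validated accuracy, as reported above, is the mean share of correctly classified held-out jobs over several train/test splits. The sketch below illustrates only that evaluation loop. The SVM is not reimplemented: a simple nearest-centroid rule stands in as a placeholder classifier, and the fold scheme, data and labels are assumptions for illustration.

```python
def nearest_centroid_fit(X, y):
    """Placeholder classifier (not an SVM): one mean vector per label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(col) for col in zip(*pts)]
            for label, pts in groups.items()}


def nearest_centroid_predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(x, model[label])))


def cross_val_accuracy(X, y, k=3):
    """Mean accuracy over k folds; fold i holds out every k-th sample from i."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        train = [(x, lab) for i, (x, lab) in enumerate(zip(X, y))
                 if i not in test_idx]
        model = nearest_centroid_fit(*zip(*train))
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Made-up feature vectors with expert-style labels.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
y = ["ok", "ok", "ok", "failed", "failed", "failed"]
accuracy = cross_val_accuracy(X, y, k=3)
```

The same loop works for the boolean and the multi-class case, since only the set of labels in y changes.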

Conclusion
Our method
Discovered useful classes of jobs on a Grid network
Can be used as a filter in monitoring (trend discovery)
Future work
Scale up the dataset
Incorporate a temporal dimension in clustering
Online anytime clustering and classification
