In this paper we present an unsupervised learning approach to detect meaningful job traffic patterns in Grid log data. Manual anomaly detection on modern Grid environments is troublesome given their increasing complexity, the distributed, dynamic topology of the network and heterogeneity of the jobs being executed. The ability to automatically detect meaningful events with little or no human intervention is therefore desirable. We evaluate our method on a set of log data collected on the Grid. Since we lack a priori knowledge of patterns that can be detected and no labelled data is available, an unsupervised learning method is followed. We cluster jobs executed on the Grid using Affinity Propagation. We try to explain discovered clusters using representative features and we label them with the help of domain experts. Finally, as a further validation step, we construct a classifier for five of the detected clusters and we use it to predict the termination status of unseen jobs.
Analysis of grid log data with Affinity Propagation
1. Analysis of Grid log data with Affinity Propagation
G. Modena, M. van Someren
Universiteit van Amsterdam
June 20, 2013
G. Modena, M. van Someren (UvA) Analysis of Grid log data with AP June 20, 2013 1 / 17
2. Grid Computing
Paradigm vs. implementation
3. Paradigm
On-demand computational power and storage
Distributed, heterogeneous workload
Dynamic network topology
Resources: users, computing nodes, brokers, jobs
5. Problem domain
Grid
Elevated (job) failure rates
Possible causes
User related failures (configuration errors, buggy code)
Network failure (firewalls blocking traffic, nodes disappearing, errors at
site level)
Resource failure (error on local nodes)
6. Failure patterns
job abortion rate
resubmitted jobs rate
cross-domain traffic patterns
resource starvation
unreachable or black hole nodes
7. Troubleshooting
Manual analysis and correlation of log data
Data is generated by multiple sources
Time consuming sysop involvement
Job traffic is domain dependent
Hard to manage and assess problems
8. Goal
Detect meaningful job traffic patterns in Grid log data
Approach
1 Clustering
2 Get experts to label and accept clusters
3 Train a classifier for interesting clusters
9. Data
Grid Observatory provided a dataset of log files (EGEE)
In one week: 2356 jobs, 16 users, 18 brokers, 25 computing elements
No categories: unlabelled data
Average job path length of 4 hops, duration 7.5 hours, job abortion
rate close to 50%
Each job observed until time t (some jobs run very long, or disappear);
tried several values for t (1 hour to 1 week)
Extracted 40 features for describing jobs: features of user, input node,
broker, computing element, path
10. Example of features
Duration
Number of transitions between grid nodes
Latest status code
Last grid node visited
Number of resubmissions
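
As an illustration of how the example features could be derived, the sketch below computes them from a list of job events, truncated at an observation window t. The event layout (job id, timestamp in seconds, node, status), the node names and the status strings are hypothetical and are not the real log schema.

```python
from collections import defaultdict

# Hypothetical event format: (job_id, timestamp_s, node, status).
# Node names and status strings are assumptions made for this sketch.
EVENTS = [
    ("job-1", 0, "ui-01", "SUBMITTED"),
    ("job-1", 60, "rb-01", "WAITING"),
    ("job-1", 180, "ce-07", "RUNNING"),
    ("job-1", 240, "rb-01", "RESUBMITTED"),
    ("job-1", 400, "ce-03", "RUNNING"),
    ("job-1", 900, "ce-03", "DONE"),
    ("job-2", 10, "ui-02", "SUBMITTED"),
    ("job-2", 50, "rb-02", "ABORTED"),
]


def extract_features(events, t_max):
    """Build one feature dict per job, using only events seen within t_max
    seconds of the job's first event."""
    by_job = defaultdict(list)
    for job_id, ts, node, status in events:
        by_job[job_id].append((ts, node, status))

    features = {}
    for job_id, recs in by_job.items():
        recs.sort()  # chronological order
        start = recs[0][0]
        recs = [r for r in recs if r[0] - start <= t_max]
        nodes = [r[1] for r in recs]
        features[job_id] = {
            "duration": recs[-1][0] - start,
            # A transition is a change of node between consecutive events.
            "transitions": sum(a != b for a, b in zip(nodes, nodes[1:])),
            "latest_status": recs[-1][2],
            "last_node": nodes[-1],
            "resubmissions": sum(r[2] == "RESUBMITTED" for r in recs),
        }
    return features


full = extract_features(EVENTS, t_max=3600)
short = extract_features(EVENTS, t_max=200)
```

With a short window, a job that is still running is described by its latest status so far, which is why the choice of t changes the resulting feature vectors.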
11. Step 1: clustering
Affinity Propagation
Represent data points as a cluster graph
Belief propagation
Chosen because
State of the art - proven in multiple domains
No parameter for number of clusters
Found to be fast
Found to find good clusters
12. Method
Affinity Propagation
Given: pairwise similarity between data points
Find: clusters, consisting of exemplar (most characteristic datapoint in
cluster) and members
At each iteration estimate
How well suited a point is to be an exemplar of a cluster
How well suited a point is to be a member of a cluster around
another exemplar
On convergence
Clusters emerge
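
The two quantities estimated at each iteration are usually called responsibility and availability. The following is a minimal pure-Python sketch of the standard update rules, not the implementation used in this work. The toy points, the negative squared Euclidean similarity, the median preference, and the damping and iteration values are all assumptions chosen for illustration.

```python
from statistics import median


def affinity_propagation(S, damping=0.9, iterations=300):
    """Cluster from a square similarity matrix S (list of lists).

    S[k][k] is the preference of point k for being an exemplar.
    Returns (exemplars, labels); labels[i] is the exemplar index of point i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # Responsibility: how well suited k is as exemplar for i,
        # relative to the best competing candidate.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how appropriate it is for i to choose k,
        # given the support k receives from other points.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]  # positive responsibilities, i != k
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


def neg_sq_dist(p, q):
    return -sum((a - b) ** 2 for a, b in zip(p, q))


# Toy data: two well-separated groups of three points (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
n = len(points)
S = [[neg_sq_dist(p, q) for q in points] for p in points]
# Assumed preference: median of the off-diagonal similarities.
preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    S[k][k] = preference
exemplars, labels = affinity_propagation(S)
```

The number of clusters is never passed in; it follows from the similarities and the preference values on the diagonal, with lower preferences tending to give fewer clusters.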
13. Result of clustering step
12 clusters (longer time periods = more clusters)
Extracted characteristic features for each cluster
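
The slides do not say how characteristic features were extracted. One plausible approach, shown here purely as an assumption, is to rank numeric features by how far the cluster mean lies from the overall mean, measured in overall standard deviations. The feature names and values are made up for the example.

```python
from statistics import mean, pstdev


def characteristic_features(rows, labels, cluster, top=2):
    """Rank features by |cluster mean - overall mean| / overall std.

    rows: list of dicts with numeric features; labels: cluster id per row.
    This scoring rule is an illustrative assumption, not the authors' method.
    """
    names = list(rows[0])
    scores = {}
    for name in names:
        all_vals = [r[name] for r in rows]
        in_vals = [r[name] for r, lab in zip(rows, labels) if lab == cluster]
        spread = pstdev(all_vals)
        if spread == 0 or not in_vals:
            scores[name] = 0.0  # constant feature or empty cluster
        else:
            scores[name] = abs(mean(in_vals) - mean(all_vals)) / spread
    return sorted(names, key=lambda name: scores[name], reverse=True)[:top]


# Hypothetical job features and cluster labels.
rows = [
    {"duration": 100, "resubmissions": 0, "transitions": 4},
    {"duration": 110, "resubmissions": 0, "transitions": 4},
    {"duration": 105, "resubmissions": 5, "transitions": 4},
    {"duration": 95, "resubmissions": 6, "transitions": 4},
]
labels = [0, 0, 1, 1]
top_features = characteristic_features(rows, labels, cluster=1, top=2)
```

In this toy case cluster 1 stands out mainly by its resubmission count, which is the kind of summary that could then be shown to a domain expert.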
14. Step 2: Cluster validation
Presented a list of clusters and their characteristic features to human
experts (NIKHEF)
Experts selected and labeled 5 clusters that were meaningful and large
enough
"Best" clusters have been discovered for values of t between 8 and 24
hours
15. 5 useful clusters - with features
Successfully terminated jobs (557): job length, user, resource broker num
jobs, computing units num jobs
Failure involving computing elements and brokers (189): job length, user,
resource broker num jobs, computing units num jobs
High job resubmission count and low resubmission time intervals; possible
black hole node (133): resubmission count, resubmission time interval,
resource broker average job permanence
User related problem; maybe a user submitting jobs to a broker unable to
match her requirements (11): user, resource broker aborted, resource
broker terminated, resource broker average job permanence, resource
broker num jobs
Anomaly possibly caused by resource exhaustion on computing elements
(101): computing element average job permanence, average terminated,
Computing element num jobs
16. Step 3: Classification
Cross-validation with same data
SVM as classifier (boolean and multi-class)
84% accuracy on cross validation for the boolean class case
60% accuracy on cross validation for the multi-class case
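
To make the evaluation procedure concrete, the sketch below runs k-fold cross-validation and reports accuracy. To keep it dependency-free it uses a nearest-centroid classifier as a stand-in for the SVM, so it illustrates the protocol only and will not reproduce the accuracies above. The data, the label names and the fold assignment are assumptions.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def predict(centroids, x):
    """Return the label of the closest class centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def cross_val_accuracy(X, y, k=5):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    n = len(X)
    correct = 0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        model = fit_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / n


# Hypothetical two-class (boolean) toy data: feature vectors and status.
X = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0),
     (11, 10), (1, 1), (11, 11), (0, 2), (10, 12)]
y = ["done", "aborted"] * 5
accuracy = cross_val_accuracy(X, y, k=5)
```

The multi-class case uses the same loop with one label per cluster instead of two.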
17. Conclusion
Our method
Discovered useful classes of jobs on a Grid network
Can be used as filter in monitoring (trend discovery)
Future work
Scale up the dataset
Incorporate a temporal dimension in clustering
Online anytime clustering and classification