This document analyzes node uptime data from the PlanetLab testbed over a year to characterize typical uptime behaviors. It identifies six distinct clusters of nodes based on their uptime patterns, which provide a more accurate model than standard distributions. Understanding the actual uptime behaviors is important for designing distributed applications and services to properly handle failures.
Improving K-NN Internet Traffic Classification Using Clustering and Principle... (journalBEEI)
K-Nearest Neighbour (K-NN) is a popular classification algorithm; in this research it is used to classify internet traffic. K-NN handles large datasets and classifies accurately, but it is computationally expensive because it computes the distance to every record in the dataset. Clustering is one way to overcome this weakness: performed before the K-NN classification step, it groups records with similar characteristics at little computational cost, so K-NN only needs to search within a cluster. Fuzzy C-Means is the clustering algorithm used in this research; it does not require the number of clusters to be fixed in advance, since clusters form naturally from the dataset that is entered. A weakness of Fuzzy C-Means is that its results often differ between runs on the same input because the initial dataset is sub-optimal, so a feature selection algorithm is needed to optimize it. Feature selection produces an optimal initial dataset for Fuzzy C-Means; the feature selection algorithm used in this research is Principal Component Analysis (PCA). PCA removes insignificant attributes or features to create an optimal dataset and can improve the performance of both the clustering and classification algorithms. The result of this research is that the combination of classification, clustering and feature selection successfully modeled an internet traffic classification method with higher accuracy and faster performance.
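The paper's own code is not given; as a rough sketch of the cluster-then-classify idea (the centroids, feature vectors, and class labels below are invented for illustration), a 1-NN search can be restricted to the nearest cluster so distances are computed against only a fraction of the dataset:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_then_1nn(query, centroids, clusters):
    """Assign the query to its nearest centroid, then run 1-NN
    only inside that cluster instead of over the whole dataset."""
    ci = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    members = clusters[ci]  # list of (point, label) pairs
    point, label = min(members, key=lambda pl: dist(query, pl[0]))
    return label

# Two hypothetical pre-grouped traffic clusters (feature vector, class).
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = [
    [((0.1, 0.2), "web"), ((0.3, 0.1), "web")],
    [((9.8, 10.1), "p2p"), ((10.2, 9.9), "p2p")],
]
print(cluster_then_1nn((9.0, 9.0), centroids, clusters))  # → p2p
```

With k clusters of roughly equal size, each query compares against about n/k points rather than all n, which is the computational saving the abstract describes.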
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat... (Editor IJCATR)
Scheduling jobs to resources in grid computing is complicated due to the distributed and heterogeneous nature of the resources. The purpose of job scheduling in a grid environment is to achieve high system throughput and minimize the execution time of applications. The complexity of the scheduling problem increases with the size of the grid and becomes highly difficult to solve effectively, so finding good and efficient scheduling methods is an active area of research. In this paper, a job scheduling algorithm is proposed to assign jobs to available resources in a grid environment. The proposed algorithm is based on the Ant Colony Optimization (ACO) algorithm, combined with one of the best-known scheduling algorithms, Suffrage; the result of Suffrage is used as input to the proposed ACO algorithm. The main contribution of this work is to minimize the makespan of a given set of jobs. The experimental results show that the proposed algorithm can yield significant performance gains in a grid environment.
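The paper's exact ACO/Suffrage hybrid is not reproduced in this summary; the sketch below (job lengths, machine count, and parameter values are invented) shows the core ACO loop for makespan scheduling: each ant assigns jobs to machines probabilistically by pheromone, and pheromone evaporates and is reinforced on the best schedule found so far:

```python
import random

def makespan(assign, jobs, machines):
    """Finish time of the most loaded machine under an assignment."""
    load = [0.0] * machines
    for j, m in enumerate(assign):
        load[m] += jobs[j]
    return max(load)

def aco_schedule(jobs, machines, ants=10, iters=30, rho=0.1, seed=1):
    rng = random.Random(seed)
    tau = [[1.0] * machines for _ in jobs]   # pheromone per (job, machine)
    best, best_ms = None, float("inf")
    for _ in range(iters):
        for _ in range(ants):
            assign = []
            for j in range(len(jobs)):
                # roulette-wheel choice of a machine, weighted by pheromone
                r = rng.uniform(0, sum(tau[j]))
                acc = 0.0
                for m, t in enumerate(tau[j]):
                    acc += t
                    if r <= acc:
                        break
                assign.append(m)
            ms = makespan(assign, jobs, machines)
            if ms < best_ms:
                best, best_ms = assign, ms
        for j, row in enumerate(tau):        # evaporate, then reinforce best
            for m in range(machines):
                row[m] *= (1 - rho)
            row[best[j]] += 1.0 / best_ms
    return best, best_ms

jobs = [4, 3, 3, 2, 2, 2]   # hypothetical job lengths
best, ms = aco_schedule(jobs, machines=2)
print(ms)   # the optimal makespan for these lengths is 8
```

In the paper's hybrid, a Suffrage schedule would seed this search instead of starting from uniform pheromone.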
A survey on ant colony clustering papers (Zahra Sadeghi)
This document summarizes several papers on ant-based clustering algorithms. Key points include:
- Ant clustering algorithms are inspired by how ant colonies self-organize through decentralized control and stigmergy (indirect communication via pheromones).
- Early work applied this approach to problems like the traveling salesman problem. Later work explored using ants for data clustering.
- Typical ant clustering algorithms involve ants randomly placing objects in a workspace and probabilistically picking up and dropping objects based on similarity to neighbors.
- Researchers have explored ways to improve ant clustering, such as using pheromones to guide ant movement, cooling schedules, and progressive vision ranges for ants.
- Other work has applied genetic algorithms and agent
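The pick/drop rule described above can be sketched with Lumer–Faieta-style probabilities (the constants k1, k2 and the neighbourhood-similarity values f below are illustrative, not taken from any one surveyed paper):

```python
def p_pick(f, k1=0.1):
    """Probability an unladen ant picks up an object whose average
    similarity to its neighbours is f (low f -> likely pick-up)."""
    return (k1 / (k1 + f)) ** 2

def p_drop(f, k2=0.15):
    """Probability a laden ant drops its object near similar
    neighbours (high f -> likely drop)."""
    return (f / (k2 + f)) ** 2

# An object that fits poorly where it lies is picked up often...
print(round(p_pick(0.01), 3))
# ...and dropped where it matches its neighbours well.
print(round(p_drop(0.9), 3))
```

These two opposing probabilities are what let clusters of similar objects emerge without any central control.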
Hyperoptimized Machine Learning and Deep Learning Methods For Geospatial and ... (Neelabha Pant)
Neelabh Pant successfully defended his PhD thesis titled "Hyper-optimized Machine Learning and Deep Learning Methods for Geo-Spatial and Temporal Function Estimation" at the University of Texas at Arlington. His research focused on developing recurrent neural networks, long short-term memory models, and genetic optimization techniques to predict locations, stock prices, and currency exchanges based on spatio-temporal data. Pant's dissertation committee included Dr. Ramez Elmasri as his advisor along with four other professors who evaluated his work.
A Multi-Agent System Approach to Load-Balancing and Resource Allocation for D... (Soumya Banerjee)
In this research we use a decentralized computing approach to allocate and schedule tasks on a massively distributed grid. Using emergent properties of multi-agent systems, the algorithm dynamically creates and dissociates clusters
to serve the changing resource demands of a global task queue. The algorithm is compared to a standard First-in First-out (FIFO) scheduling algorithm. Experiments
done on a simulator show that the distributed resource allocation protocol (dRAP) algorithm outperforms the FIFO scheduling algorithm on time to empty
queue, average waiting time and CPU utilization. Such a decentralized computing approach holds promise for massively distributed processing scenarios like SETI@home and Google MapReduce.
This document summarizes a research project that develops a tool to cluster and visualize Internet outage events from log data using time-series analysis. The tool uses a MapReduce algorithm that runs in O(n log n) time to cluster outage blocks based on their start and end times. The data is partitioned into bins and processed in parallel on Hadoop. An evaluation shows that smaller bin sizes help reduce the maximum number of blocks in each bin, improving the tool's ability to handle bursts of outage events.
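The tool's MapReduce code is not shown in the summary; a single-machine sketch of the underlying O(n log n) idea (sort blocks by start time, then sweep once and group overlapping intervals; the sample blocks are invented) might look like:

```python
def cluster_outages(blocks):
    """Group outage blocks (start, end) that overlap in time.
    Sorting dominates, so the whole sweep runs in O(n log n)."""
    events = sorted(blocks)                 # by start time
    clusters, cur, cur_end = [], [], None
    for start, end in events:
        if cur and start <= cur_end:        # overlaps the open cluster
            cur.append((start, end))
            cur_end = max(cur_end, end)
        else:
            if cur:
                clusters.append(cur)
            cur, cur_end = [(start, end)], end
    if cur:
        clusters.append(cur)
    return clusters

# Hypothetical outage blocks (start, end) in minutes.
blocks = [(0, 5), (3, 9), (20, 25), (24, 30), (50, 51)]
print(len(cluster_outages(blocks)))  # → 3
```

The binning described in the summary simply partitions the time axis so each bin's sweep can run as an independent parallel task on Hadoop.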
Clustering-based Analysis for Heavy-Hitter Flow Detection (APNIC)
This document summarizes a research paper that proposes using unsupervised machine learning clustering techniques rather than thresholds to detect heavy hitter (HH) flows in a network. It describes collecting network flow data and analyzing it using algorithms like K-means and Gaussian mixtures to group flows. This identified multiple clusters rather than just two groups (elephants and mice). Further clustering an ambiguous zone revealed patterns that could better classify HH flows without relying on thresholds. The clustering results were then passed to an SDN controller to mark flows and take appropriate actions like re-routing.
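As a toy illustration of the clustering-versus-threshold idea (the flow sizes below are invented, and the paper's real pipeline used K-means and Gaussian mixtures over several features), a 1-D 2-means split of flow byte counts separates "mice" from "elephants" without any hand-picked cutoff:

```python
def two_means_1d(sizes, iters=20):
    """Tiny 1-D K-means (k=2): centroids start at min and max,
    points are re-assigned to the nearest centroid each round."""
    c = [min(sizes), max(sizes)]
    for _ in range(iters):
        groups = [[], []]
        for s in sizes:
            groups[0 if abs(s - c[0]) <= abs(s - c[1]) else 1].append(s)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return groups  # groups[0] = mice, groups[1] = elephants

# Hypothetical flow sizes in KB: many small flows, a few heavy hitters.
mice, elephants = two_means_1d([1, 2, 3, 2, 900, 1200, 2, 1, 1000])
print(sorted(elephants))  # → [900, 1000, 1200]
```

The boundary falls out of the data itself; the paper's point is that with more clusters and features, the ambiguous middle zone can also be resolved this way.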
Scalable Rough C-Means clustering using Firefly algorithm (p. 1)
Abhilash Namdev and B.K. Tripathy
Significance of Embedded Systems to IoT (p. 15)
P. R. S. M. Lakshmi, P. Lakshmi Narayanamma and K. Santhi Sri
Cognitive Abilities, Information Literacy Knowledge and Retrieval Skills of Undergraduates: A Comparison of Public and Private Universities in Nigeria (p. 24)
Janet O. Adekannbi and Testimony Morenike Oluwayinka
Risk Assessment in Constructing Horseshoe Vault Tunnels using Fuzzy Technique (p. 48)
Erfan Shafaghat and Mostafa Yousefi Rad
Evaluating the Adoption of Deductive Database Technology in Augmenting Criminal Intelligence in Zimbabwe: Case of Zimbabwe Republic Police (p. 68)
Mahlangu Gilbert, Furusa Samuel Simbarashe, Chikonye Musafare and Mugoniwa Beauty
Analysis of Petrol Pumps Reachability in Anand District of Gujarat (p. 77)
Nidhi Arora
This document describes the EXOSIMS framework for simulating exoplanet direct imaging missions. EXOSIMS integrates independent modules that perform specific tasks into a unified mission simulation. Key modules include: Planet Population (generates planetary system parameters), Optical System (models telescope performance), Survey Simulation (runs observations), and Post-Processing (determines detections and yields). EXOSIMS has been used to model the science yield of the WFIRST-AFTA mission and provides an open-source tool for studying various mission designs.
Location and Mobility Aware Resource Management for 5G Cloud Radio Access Net... (Md Nazrul Islam Roxy)
This document proposes location and mobility aware resource management algorithms for 5G cloud radio access networks (C-RAN). It proposes a two-step approach using 1) a location-aware virtual base station (VBS) clustering algorithm and 2) a VBS cluster packing algorithm that considers location, mobility patterns, and handovers. It evaluates these proposed algorithms along with existing algorithms through simulation. The results show the proposed hierarchical clustering algorithm reduces inter-host handovers by up to 26.5% while the proposed location-aware packing reduces them by up to 30%. The best combination reduces inter-host handovers by up to 34.8%.
1. The document summarizes ongoing data mining and machine learning research at the University of Houston from 2006-2009.
2. Key areas of research included developing shape-aware clustering algorithms, discovering regional knowledge in geo-referenced datasets, emergent pattern discovery, and various machine learning applications.
3. The researchers were developing techniques for clustering with plug-in fitness functions, discovering spatial risk patterns like arsenic levels, and an open source data mining framework called Cougar2.
Energy-aware Task Scheduling using Ant-colony Optimization in cloud (Linda J)
The document proposes an energy-aware task scheduling algorithm using ant colony optimization for cloud computing. It aims to minimize energy consumption in datacenters by scheduling tasks efficiently across virtual machines and physical hosts. The algorithm uses concepts from ant colony optimization to probabilistically determine good task-to-resource allocations. The results show that the proposed approach reduces energy consumption by 22% compared to a first-come, first-served scheduling approach.
This document summarizes the Particle Swarm Optimization (PSO) algorithm. PSO is a population-based stochastic optimization technique inspired by bird flocking. It works by having a population of candidate solutions, called particles, that fly through the problem space, with the movements of each particle influenced by its local best known position as well as the global best known position. The document provides an overview of PSO and its applications, describes the basic PSO algorithm and several variants, and discusses parallel and structural optimization implementations of PSO.
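The summary gives no equations; the canonical velocity/position update it refers to, in a minimal sketch (the inertia weight w and acceleration coefficients c1, c2 are typical textbook values, not taken from this document), is:

```python
import random

def pso_minimize(f, dim=2, particles=15, iters=60,
                 w=0.7, c1=1.5, c2=1.5, seed=3):
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    v = [[0.0] * dim for _ in range(particles)]
    pbest = [xi[:] for xi in x]                  # each particle's best position
    gbest = min(pbest, key=f)[:]                 # swarm's best position
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # v <- inertia + pull toward personal best + pull toward global best
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            if f(x[i]) < f(pbest[i]):
                pbest[i] = x[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

sphere = lambda p: sum(t * t for t in p)
best = pso_minimize(sphere)
print(sphere(best))   # close to the optimum 0
```

The variants the document surveys mostly change how w, c1, c2, or the neighbourhood that defines gbest evolve over time.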
JOB SCHEDULING USING ANT COLONY OPTIMIZATION ALGORITHM (mailjkb)
This document discusses using machine learning algorithms for job scheduling in a grid computing environment. It aims to minimize makespan, the total time to complete all tasks, by learning from past scheduling experiences. It proposes using ant colony optimization, where artificial ants probabilistically choose task-machine pairs to incrementally find optimal schedules. The algorithm is compared to other scheduling methods and extended to online scheduling by classifying jobs with attributes to appropriate machines. A feasibility study demonstrates classification and scheduling of test jobs using machine learning tools.
A Framework for Performance Analysis of Computing Clouds (ijsrd.com)
This document presents a modified weighted active monitoring load balancing algorithm for cloud computing. It aims to improve performance over existing algorithms like Round Robin by taking into account both the load on each server and the current performance. The proposed algorithm calculates factors x (available resources), y (change in response time as a performance metric), and z (combining x and y) to generate a dynamic load balancing queue. It was implemented on a private cloud and showed improved performance over the Round Robin algorithm. Future work involves implementing the algorithm on a real-world cloud environment.
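The summary does not give the paper's formulas for x, y, and z, so the sketch below simply assumes z is the ratio of available resources x to response-time change y (an invented combination) to show how servers could be ordered into a dynamic load-balancing queue:

```python
def balance_queue(servers):
    """Order servers so the most capable one is dequeued first.
    x = available resources, y = change in response time (lower is
    better); z = x / y is an assumed way of combining the two."""
    scored = [(name, x / y) for name, x, y in servers]
    return [name for name, z in sorted(scored, key=lambda nz: -nz[1])]

# Hypothetical servers: (name, free capacity, response-time delta).
servers = [("s1", 4.0, 2.0), ("s2", 8.0, 1.0), ("s3", 6.0, 4.0)]
print(balance_queue(servers))  # → ['s2', 's1', 's3']
```

Recomputing the queue as x and y change is what makes the scheme "active monitoring" rather than static Round Robin.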
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Researchers around the world attempt to predict interactions between atoms to further understand molecular structure and dynamics across various disciplines such as environmental, pharmaceutical, and material science. Imaging technologies like Nuclear Magnetic Resonance (NMR) allow us to gain insight into a molecule’s structure and dynamics, but the technology depends on our ability to accurately predict scalar coupling constants, a part of the magnetic interactions between pairs of atoms. The strength of this magnetic interaction is a function of the intervening electrons and the molecular topology. The current predictive method uses quantum mechanical principles and can accurately calculate scalar coupling constants given a 3-D molecular structure as input. Despite its accuracy, this deterministic model is computationally expensive, as calculations can take days to weeks per molecule, rendering this approach infeasible for day-to-day applications. This project aims to create a framework of predictive algorithms for molecular interactions, allowing experts to optimize determination of molecular properties and behavior.
San Juan de Sahagun was born in 1414 in Spain to a pious and well-to-do family. He was educated by Benedictines and ordained as a priest in 1445. He had several positions in the church but gave most of the proceeds from his positions to the poor. He later joined the Augustinian order and became known for his devotion to the Eucharist, sometimes experiencing visions during Mass. He was also a renowned spiritual director and preacher who helped improve social conditions through his sermons.
Wind and water provide renewable sources of energy that can help address issues with fossil fuels. While fossil fuels currently supply most global energy needs, they are finite resources that also contribute to environmental problems. Renewable sources like hydroelectric, ocean wave, and tidal power offer clean alternatives but also have some disadvantages to consider regarding their implementation and impacts. Overall, diversifying energy supplies with renewable options can improve energy security while reducing environmental issues.
This document is Rese Patton's visual resume summarizing her journey in the music business. It describes her early work in production and security at Paisley Park Studios in 1994. It details her decision to enroll at FullSail University to gain a degree and further experience to work in the music business full-time. Over the course of her time at FullSail, she took courses in media arts, computer science, entertainment business, marketing, business law, statistics, international business, entrepreneurship, and music business marketing to prepare for a career in the music industry. She has one year left at FullSail, is currently working as a producer, writer and composer, and feels her music business career is on track.
Hakon Verespej's presentation at the Sourcing 7 roundtable, October 2012. It covers his path into recruiting from an engineering background and why great sourcing is so awesome.
12 differences between the Indonesian and Finnish education systems (Farion Dusun)
The document compares the Indonesian and Finnish education systems. The Finnish system grants greater independence to students and teachers: teachers are respected and given the freedom to develop their own teaching methods, and students face no pressure from excessive testing or evaluation. This is credited with making Finnish education successful and its students comfortable learning.
An Automatic Clustering Technique for Optimal Clusters (IJCSEA Journal)
This document presents a new automatic clustering algorithm called Automatic Merging for Optimal Clusters (AMOC). AMOC is a two-phase iterative extension of k-means clustering that aims to automatically determine the optimal number of clusters for a given dataset. In the first phase, AMOC initializes a large number of clusters k using k-means. In the second phase, it iteratively merges the lowest probability cluster with its closest neighbor, recomputing metrics each time to evaluate if the merge improved clustering quality. The algorithm stops merging once no improvements are found. Experimental results on synthetic and real datasets show AMOC finds nearly optimal cluster structures in terms of number, compactness and separation of clusters.
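AMOC's exact probability and validity metrics are not spelled out in this summary; a simplified sketch of the second phase (taking the "lowest probability" cluster as the smallest one, and judging a merge by whether the two clusters are closer than their combined mean radii — both of which are assumptions, not the paper's actual criteria) could look like:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def spread(pts, c):
    return sum(dist(p, c) for p in pts) / len(pts)   # mean radius

def amoc_like(clusters):
    """Repeatedly merge the smallest cluster into its nearest neighbour
    while the two are closer than their combined mean radii (a simplified
    stand-in for AMOC's merge-quality test); stop when no merge helps."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        clusters.sort(key=len)
        small, rest = clusters[0], clusters[1:]
        cs = centroid(small)
        near = min(rest, key=lambda c: dist(cs, centroid(c)))
        if dist(cs, centroid(near)) <= spread(small, cs) + spread(near, centroid(near)):
            near.extend(small)
            clusters = rest
        else:
            break   # no merge improves the clustering; stop
    return clusters

# Over-clustered start: two overlapping fragments of one group, plus a far group.
start = [[(0.0, 0.0), (1.0, 0.0)], [(0.5, 0.0), (1.5, 0.0)], [(10.0, 0.0), (10.5, 0.0)]]
print(len(amoc_like(start)))  # → 2
```

The shape of the loop — start with too many clusters and merge until a quality test fails — is the part that mirrors AMOC.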
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensively researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on several performance measures, such as the number of iterations, number of points misclassified, accuracy, Silhouette validity index and execution time.
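Of the measures listed, the Silhouette validity index has a compact definition: for each point, s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A small self-contained version (the sample points are invented) is:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(clusters):
    """Mean silhouette over all points: a = mean distance to own
    cluster, b = mean distance to the nearest other cluster."""
    scores = []
    for ci, cl in enumerate(clusters):
        for p in cl:
            a = (sum(dist(p, q) for q in cl if q is not p) / (len(cl) - 1)
                 if len(cl) > 1 else 0.0)
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

tight = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]   # well separated
loose = [[(0.0, 0.0), (2.0, 2.0)], [(1.0, 1.0), (3.0, 3.0)]]   # overlapping
print(silhouette(tight) > silhouette(loose))  # → True
```

Values near +1 indicate compact, well-separated clusters; values near 0 or below indicate overlap, which is why the index works as a validity measure for comparing k-Means variants.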
Clustering is an unsupervised machine learning technique used to group unlabeled data points. There are two main approaches: hierarchical clustering and partitioning clustering. Partitioning clustering algorithms like k-means and k-medoids attempt to partition data into k clusters by optimizing a criterion function. Hierarchical clustering creates nested clusters by merging or splitting clusters. Examples of hierarchical algorithms include agglomerative clustering, which builds clusters from bottom-up, and divisive clustering, which separates clusters from top-down. Clustering can group both numerical and categorical data.
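A minimal sketch of the bottom-up (agglomerative) approach described above, using single linkage (the points are invented): start with every point as its own cluster and repeatedly merge the closest pair until k clusters remain.

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(a, b):
    """Single-linkage distance: closest pair across the two clusters."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, k):
    clusters = [[p] for p in points]          # bottom-up: start with singletons
    while len(clusters) > k:
        # find and merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
out = agglomerative(pts, k=3)
print(sorted(len(c) for c in out))  # → [1, 2, 2]
```

A partitioning method like k-means would instead fix k up front and iteratively move centroids; the hierarchical version needs no centroids, only a cluster-to-cluster distance.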
Coupling-Based Internal Clock Synchronization for Large Scale Dynamic Distrib... (Angelo Corsaro)
This paper studies the problem of realizing a common software clock among a large set of nodes without an external time reference (i.e., internal clock synchronization), any centralized control and where nodes can join and leave the distributed system at their will. The paper proposes an internal clock synchronization algorithm which combines the gossip-based paradigm with a nature-inspired approach, coming from the coupled oscillators phenomenon, to cope with scale and churn. The algorithm works on the top of an overlay network and uses a uniform peer sampling service to fullfill each node’s local view. Therefore, differently from clock synchronization protocols for small scale and static distributed systems, here each node synchronizes regularly with only the neighbors in its local view and not with the whole system. Theoretical and empirical evaluations of the convergence speed and of the synchronization error of the coupled-based internal clock synchronization algorithm have been carried out, showing how convergence time and the synchronization error depends on the coupling factor and on the local view size. Moreover the variation of the synchronization error with respect to churn and the impact of a sudden variation of the number of nodes have been analyzed to show the stability of the algorithm. In all these contexts, the algorithm shows nice performance and very good self-organizing properties. Finally, we showed how the assumption on the existence of a uniform peer-sampling service is instrumental for the good behavior of the algorithm.
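The algorithm itself is in the paper; a toy sketch of its core step (each node pulls its software clock toward the average of a few randomly sampled peers by a coupling factor — the clock values, view size, and coupling factor below are invented) behaves like this:

```python
import random

def gossip_sync(clocks, coupling=0.5, view=3, rounds=25, seed=7):
    """Each round, every node samples `view` random peers (a stand-in
    for the uniform peer sampling service) and moves its clock toward
    their mean by `coupling`, coupled-oscillator style."""
    rng = random.Random(seed)
    clocks = list(clocks)
    n = len(clocks)
    for _ in range(rounds):
        nxt = []
        for i in range(n):
            peers = [clocks[rng.randrange(n)] for _ in range(view)]
            mean = sum(peers) / view
            nxt.append(clocks[i] + coupling * (mean - clocks[i]))
        clocks = nxt
    return clocks

start = [0.0, 3.0, 7.0, 12.0, 30.0]          # widely skewed local clocks
end = gossip_sync(start)
print(max(end) - min(end) < max(start) - min(start))  # spread shrinks
```

As the paper's analysis indicates, how fast the spread shrinks depends on the coupling factor and the local view size; the sketch shows only the self-organizing contraction, not churn handling.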
Network Flow Pattern Extraction by Clustering (Eugine Kang)
This document discusses a study that uses clustering techniques to analyze network flow data and extract patterns of torrent usage at Korea University. The study transforms network flow data by time blocks. It then uses k-means clustering and principal component analysis to identify optimal cluster numbers and visualize cluster formations. The study finds that 7 clusters best captures distinct torrent usage patterns. It analyzes the stability of the clusters and identifies two clusters that show regular torrent usage during working hours and heavy overall usage. The goal is to help network administrators identify times of heavy bandwidth usage.
INFLUENCE OF PRIORS OVER MULTITYPED OBJECT IN EVOLUTIONARY CLUSTERING (cscpconf)
In recent years, evolutionary clustering has been an evolving research area in data mining. Evolution diagnosis of any homogeneous or heterogeneous network provides an overall view of the network. Applications of evolutionary clustering include analyzing social and information networks with respect to their structure, properties and behaviors. In this paper, the authors study the problem of the influence of priors over multi-typed objects in evolutionary clustering. Priors are defined for each type of object in a heterogeneous information network, and experimental results show how the consistency and quality of clusters change according to the priors.
A Framework for Performance Analysis of Computing Cloudsijsrd.com
This document presents a modified weighted active monitoring load balancing algorithm for cloud computing. It aims to improve performance over existing algorithms like Round Robin by taking into account both the load on each server and the current performance. The proposed algorithm calculates factors x (available resources), y (change in response time as a performance metric), and z (combining x and y) to generate a dynamic load balancing queue. It was implemented on a private cloud and showed improved performance over the Round Robin algorithm. Future work involves implementing the algorithm on a real-world cloud environment.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
Researchers around the world attempt to predict interactions between atoms to further understand molecular structure and dynamics across various disciplines such as environmental, pharmaceutical, and material science. Imaging technologies like Nuclear Magnetic Resonance (NMR) allow us to gain insight into a molecule’s structure and dynamics, but the technology depends on our ability to accurately predict scalar coupling constants, a part of the magnetic interactions between pairs of atoms. The strength of this magnetic interaction is a function of electron intervention and molecular topology. The current predictive method uses quantum mechanical principles and can accurately calculate scalar coupling constants given a 3-D molecular structure as input. Despite its accuracy, this deterministic model is computationally expensive, as calculations can take days to weeks per molecule, rendering this approach infeasible for day-to-day applications. This project aims to create a framework of predictive algorithms for molecular interactions, allowing experts to optimize determination of molecular properties and behavior.
San Juan de Sahagun was born in 1414 in Spain to a pious and well-to-do family. He was educated by Benedictines and ordained as a priest in 1445. He had several positions in the church but gave most of the proceeds from his positions to the poor. He later joined the Augustinian order and became known for his devotion to the Eucharist, sometimes experiencing visions during Mass. He was also a renowned spiritual director and preacher who helped improve social conditions through his sermons.
Wind and water provide renewable sources of energy that can help address issues with fossil fuels. While fossil fuels currently supply most global energy needs, they are finite resources that also contribute to environmental problems. Renewable sources like hydroelectric, ocean wave, and tidal power offer clean alternatives but also have some disadvantages to consider regarding their implementation and impacts. Overall, diversifying energy supplies with renewable options can improve energy security while reducing environmental issues.
This document is Rese Patton's visual resume summarizing her journey in the music business. It describes her early work in production and security at Paisley Park Studios in 1994. It details her decision to enroll at FullSail University to gain a degree and further experience to work in the music business full-time. Over the course of her time at FullSail, she took courses in media arts, computer science, entertainment business, marketing, business law, statistics, international business, entrepreneurship, and music business marketing to prepare for a career in the music industry. She has one year left at FullSail and is currently working as a producer, writer and composer, and feels her music business career is on track
Hakon Verespej's presentation at the Sourcing 7 rountable, October 2012. Focuses on his interest in recruiting coming from an engineering background and why great sourcing is so awesome.
12 perbedaan sistem pendidikan indonesiaFarion Dusun
Dokumen membandingkan sistem pendidikan Indonesia dan Finlandia. Sistem pendidikan Finlandia memberikan kemandirian yang lebih besar kepada siswa dan guru. Guru dihargai dan mendapat kebebasan untuk mengembangkan metode mengajar. Tidak ada tekanan akan tes atau evaluasi berlebihan bagi siswa. Hal ini membuat pendidikan Finlandia berhasil dan siswanya merasa nyaman belajar.
An Automatic Clustering Technique for Optimal ClustersIJCSEA Journal
This document presents a new automatic clustering algorithm called Automatic Merging for Optimal Clusters (AMOC). AMOC is a two-phase iterative extension of k-means clustering that aims to automatically determine the optimal number of clusters for a given dataset. In the first phase, AMOC initializes a large number of clusters k using k-means. In the second phase, it iteratively merges the lowest probability cluster with its closest neighbor, recomputing metrics each time to evaluate if the merge improved clustering quality. The algorithm stops merging once no improvements are found. Experimental results on synthetic and real datasets show AMOC finds nearly optimal cluster structures in terms of number, compactness and separation of clusters.
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
The k- Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease
and simplicity of implementation. Clustering algorithm has a broad attraction and usefulness in
exploratory data analysis. This paper presents results of the experimental study of different approaches to
k- Means clustering, thereby comparing results on different datasets using Original k-Means and other
modified algorithms implemented using MATLAB R2009b. The results are calculated on some performance
measures such as no. of iterations, no. of points misclassified, accuracy, Silhouette validity index and
execution time
Clustering is an unsupervised machine learning technique used to group unlabeled data points. There are two main approaches: hierarchical clustering and partitioning clustering. Partitioning clustering algorithms like k-means and k-medoids attempt to partition data into k clusters by optimizing a criterion function. Hierarchical clustering creates nested clusters by merging or splitting clusters. Examples of hierarchical algorithms include agglomerative clustering, which builds clusters from bottom-up, and divisive clustering, which separates clusters from top-down. Clustering can group both numerical and categorical data.
Coupling-Based Internal Clock Synchronization for Large Scale Dynamic Distrib...Angelo Corsaro
This paper studies the problem of realizing a common software clock among a large set of nodes without an external time reference (i.e., internal clock synchronization), any centralized control and where nodes can join and leave the distributed system at their will. The paper proposes an internal clock synchronization algorithm which combines the gossip-based paradigm with a nature-inspired approach, coming from the coupled oscillators phenomenon, to cope with scale and churn. The algorithm works on the top of an overlay network and uses a uniform peer sampling service to fullfill each node’s local view. Therefore, differently from clock synchronization protocols for small scale and static distributed systems, here each node synchronizes regularly with only the neighbors in its local view and not with the whole system. Theoretical and empirical evaluations of the convergence speed and of the synchronization error of the coupled-based internal clock synchronization algorithm have been carried out, showing how convergence time and the synchronization error depends on the coupling factor and on the local view size. Moreover the variation of the synchronization error with respect to churn and the impact of a sudden variation of the number of nodes have been analyzed to show the stability of the algorithm. In all these contexts, the algorithm shows nice performance and very good self-organizing properties. Finally, we showed how the assumption on the existence of a uniform peer-sampling service is instrumental for the good behavior of the algorithm.
Network Flow Pattern Extraction by Clustering Eugine KangEugine Kang
This document discusses a study that uses clustering techniques to analyze network flow data and extract patterns of torrent usage at Korea University. The study transforms network flow data by time blocks. It then uses k-means clustering and principal component analysis to identify optimal cluster numbers and visualize cluster formations. The study finds that 7 clusters best captures distinct torrent usage patterns. It analyzes the stability of the clusters and identifies two clusters that show regular torrent usage during working hours and heavy overall usage. The goal is to help network administrators identify times of heavy bandwidth usage.
INFLUENCE OF PRIORS OVER MULTITYPED OBJECT IN EVOLUTIONARY CLUSTERINGcscpconf
In recent years, Evolutionary clustering is an evolving research area in data mining. The evolution diagnosis of any homogeneous as well as heterogeneous network will provide an overall view about the network. Applications of evolutionary clustering includes, analyzing, the social networks, information networks, about their structure, properties and behaviors. In this paper, the authors study the problem of influence of priors over multi-typed object in
evolutionary clustering. Priors are defined for each type of object in a heterogeneous information network and experimental results were produced to show how consistency and quality of cluster changes according to the priors
Influence of priors over multityped object in evolutionary clusteringcsandit
In recent years, Evolutionary clustering is an evolving research area in data mining. The
evolution diagnosis of any homogeneous as well as heterogeneous network will provide an
overall view about the network. Applications of evolutionary clustering includes, analyzing, the
social networks, information networks, about their structure, properties and behaviors. In this
paper, the authors study the problem of influence of priors over multi-typed object in
evolutionary clustering. Priors are defined for each type of object in a heterogeneous
information network and experimental results were produced to show how consistency and
quality of cluster changes according to the priors.
Comparison Between Clustering Algorithms for Microarray Data AnalysisIOSR Journals
Currently, there are two techniques used for large-scale gene-expression profiling; microarray and
RNA-Sequence (RNA-Seq).This paper is intended to study and compare different clustering algorithms that used
in microarray data analysis. Microarray is a DNA molecules array which allows multiple hybridization
experiments to be carried out simultaneously and trace expression levels of thousands of genes. It is a highthroughput
technology for gene expression analysis and becomes an effective tool for biomedical research.
Microarray analysis aims to interpret the data produced from experiments on DNA, RNA, and protein
microarrays, which enable researchers to investigate the expression state of a large number of genes. Data
clustering represents the first and main process in microarray data analysis. The k-means, fuzzy c-mean, selforganizing
map, and hierarchical clustering algorithms are under investigation in this paper. These algorithms
are compared based on their clustering model.
The document surveys previous work on clustering of time series data. It discusses three main approaches to time series clustering: (1) working directly with raw time series data, (2) extracting features from the raw data and clustering the features, and (3) building models from the raw data and clustering the model parameters. It also summarizes commonly used clustering algorithms, measures for determining time series similarity, and criteria for evaluating clustering results. The document categorizes and discusses past time series clustering research and identifies topics for future work. It aims to provide an overview of the field of time series clustering.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
(2016)application of parallel glowworm swarm optimization algorithm for data ...Akram Pasha
The document describes a research article that proposes applying a parallel glowworm swarm optimization algorithm for clustering large unstructured data sets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids. It employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed the algorithm scales well with increasing data set sizes and achieves near-linear speed while maintaining high-quality clustering, demonstrating it is more efficient than traditional algorithms for clustering large unstructured data.
Implementation of query optimization for reducing run timeAlexander Decker
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data IJMER
Clustering mixed type data is one of the major research topics in the area of data mining. In
this paper, a new algorithm for clustering mixed type data is proposed where the concept of distribution
centroid is used to represent the prototype of categorical variables in a cluster which is then combined
with the mean to represent the prototype of clusters with mixed type variables. In the method, data is
observed from different views and the variables are grouped into different views. Those instances that
can be viewed differently from different viewpoints can be defined as multiview data. During clustering
process the differences among views are ignored in usual cases. Here, both views and variables weights
are computed simultaneously. The view weight is used to determine the closeness or density of view and
variable weight is used to identify the significance of each variable. With the intention of determining
the cluster of objects both these weights are used in the distance function. In the proposed method,
enhancement to the k-prototypes is done so that it automatically computes both view and variable
weights. The proposed algorithm MK-Prototypes algorithm is compared with two other clustering
algorithms.
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...Zac Darcy
This paper presents a new method for evaluating the symmetric information gap between two dynamical systems using particle filters. It first describes a symmetric version of the information gap metric based on symmetric Kullback-Leibler divergence. A numerical method is then developed to approximate this symmetric K-L rate using particle filters. This represents the posterior densities of the dynamical systems as mixtures of Gaussians. The method is demonstrated on a nonlinear target tracking example, computing the symmetric information gap between two systems at each time step.
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
k-Means is a rather simple but well known algorithms for grouping objects, clustering. Again all objects need to be represented as a set of numerical features. In addition the user has to specify the number of groups (referred to as k) he wishes to identify. Each object can be thought of as being represented by some feature vector in an n dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space, these point serve as the initial centers of the clusters. Afterwards all objects are each assigned to center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains the same. In this project we will discuss about K-means clustering algorithm, implementation and its application to the problem of unsupervised learning
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature converge to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Particle Swarm Optimization, Subtractive + (PSO) clustering algorithm that performs fast clustering. For comparison purpose, we applied the Subtractive + (PSO) clustering algorithm, PSO, and the Subtractive clustering algorithms on three different datasets. The results illustrate that the Subtractive + (PSO) clustering algorithm can generate the most compact clustering results as compared to other algorithms.
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACHIJCI JOURNAL
In this paper we investigate colocation mining problem in the context of uncertain data. Uncertain data is a
partially complete data. Many of the real world data is Uncertain, for example, Demographic data, Sensor
networks data, GIS data etc.,. Handling such data is a challenge for knowledge discovery particularly in
colocation mining. One straightforward method is to find the Probabilistic Prevalent colocations (PPCs).
This method tries to find all colocations that are to be generated from a random world. For this we first
apply an approximation error to find all the PPCs which reduce the computations. Next find all the
possible worlds and split them into two different worlds and compute the prevalence probability. These
worlds are used to compare with a minimum probability threshold to decide whether it is Probabilistic
Prevalent colocation (PPCs) or not. The experimental results on the selected data set show the significant
improvement in computational time in comparison to some of the existing methods used in colocation
mining.
Finding Relationships between the Our-NIR Cluster ResultsCSCJournals
The problem of evaluating node importance in clustering has been active research in present days and many methods have been developed. Most of the clustering algorithms deal with general similarity measures. However In real situation most of the cases data changes over time. But clustering this type of data not only decreases the quality of clusters but also disregards the expectation of users, when usually require recent clustering results. In this regard we proposed Our-NIR method that is better than Ming-Syan Chen proposed a method and it has proven with the help of results of node importance, which is related to calculate the node importance that is very useful in clustering of categorical data, still it has deficiency that is importance of data labeling and outlier detection. In this paper we modified Our-NIR method for evaluating of node importance by introducing the probability distribution which will be better than by comparing the results.
C. Characterizing PlanetLab Nodes
Using the described tools, we were able to cluster the
nodes captured in the trace data by their uptime behaviors.
From the resulting clusters, we determined the key
behavioral characteristics that define the classifications. In
the next section, we discuss the application of this
classification process to the PlanetLab trace data.
III. RESULTS
A. Aggregate Distribution
Fig. 1 shows the observed distribution of uptimes up to
32 days over the entire set of data. This includes 27447
(95.6%) of the total 28705 uptimes. Each bucket covers 1
hour. Looking at the chart, the distribution does appear to resemble standard distributions. To test this, we performed Chi-Square goodness-of-fit tests [17] to check whether the data could be modeled as an Exponential or Weibull distribution.
For the Exponential distribution, we calculated a Chi-Square statistic of 247142, whereas a good fit at a 5% significance level requires a statistic below 42.6. Evaluating the Weibull distribution resulted in a Chi-Square statistic of 21265, closer but still far above the threshold. The parameters for the Weibull distribution were estimated using maximum likelihood estimation, with initial values calculated using least-squares linear regression.
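As a rough illustration of this goodness-of-fit procedure, the sketch below fits a Weibull distribution by maximum likelihood to synthetic uptime data and computes a Chi-Square statistic over 1-hour buckets. The data, bucket handling, and threshold rule are illustrative assumptions, not the paper's trace or exact methodology (the paper additionally seeds the MLE with least-squares estimates).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the observed uptimes (in hours); the paper's
# actual PlanetLab trace is not reproduced here.
rng = np.random.default_rng(0)
uptimes = rng.weibull(0.6, size=5000) * 50.0

# Bucket the uptimes into 1-hour bins up to 32 days, as in Fig. 1.
edges = np.arange(0, 769)
observed, _ = np.histogram(uptimes, bins=edges)

# Fit a Weibull distribution by maximum likelihood (location fixed at 0).
shape, loc, scale = stats.weibull_min.fit(uptimes, floc=0)

# Expected count per bucket under the fitted model.
cdf = stats.weibull_min.cdf(edges, shape, loc=loc, scale=scale)
expected = len(uptimes) * np.diff(cdf)

# Keep buckets with expected counts >= 5 (a common rule of thumb; a
# stricter treatment would pool sparse tail buckets instead), then
# compute the Chi-Square statistic; two parameters were estimated.
mask = expected >= 5
chi2 = float(np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask]))
dof = int(mask.sum()) - 1 - 2
threshold = stats.chi2.ppf(0.95, dof)
print(chi2, dof, threshold)
```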
The biggest differences between the observed and expected data correspond to the spikes visible in Fig. 1. For example, there are 6815 entries in the first bucket (i.e., the bucket covering the first hour), where the expected count is about 333 under the Exponential distribution and about 5615 under the Weibull distribution. At 24 hours, we observed 868 entries, where the Exponential distribution expected about 251 and the Weibull expected about 156. The number of such spikes, and the degree to which they express themselves, indicate that a characterization of PlanetLab should capture this behavior.
Given this behavior, it may seem reasonable to
characterize the system with an aggregate like the one
presented here. However, there is much additional
information to be gained by investigating the unique
behaviors shared by subsets of the nodes being modeled, as
we will see in Section III.B.
B. Forming Clusters
A single aggregate is a special case of clustering where
the number of clusters produced is 1. It is possible that this
adequately characterizes the target system. However, by
looking a little deeper, we can find additional information
Figure 1. Histogram of uptimes in 1-hour buckets (y-axis: percentage of uptimes in each bucket, 0-3%; x-axis: hours).
useful in characterizing the system. For PlanetLab, we did
this by using Cluto to identify shared group behaviors
within the system, treating observed uptimes as traits of the
nodes.
To perform this cluster analysis, we configured Cluto to
use (1) a direct clustering algorithm, (2) the correlation
coefficient as a measure of similarity, and (3) maximization
of the ratio of intra-cluster similarity to inter-cluster
similarity as a measure of the clustering quality (for more
information on this cluster quality measurement, see “H2”
in the Cluto documentation [11]).
When we performed our initial analysis, we found that most of the buckets had little effect on the outcome. To focus the analysis, we therefore trimmed the set of buckets to those identified as important in producing the clustering, plus a few buckets that form spikes in the aggregate chart. For the same reason, we also merged a few of these important buckets to form a single trait (in each case, the merged buckets were adjacent in time). For example, we treat the hour 3 and hour 4 buckets as one 2-hour bucket.
This resulted in 15 buckets including hour 1, hour 2, hours
3-4, hours 7-9, hour 16, hours 22-23, hour 24, hour 48, hour
72, hour 117, hour 191, hour 218, hour 312, hour 492, and
hour 679.
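The bucket trimming and merging step can be sketched as a simple column aggregation. The counts matrix below is synthetic and the per-node histogram layout is an assumption; only the 15 bucket ranges come from the text.

```python
import numpy as np

# Synthetic per-node 1-hour uptime histograms (rows: nodes, columns:
# hours, where hour h is column h - 1); the real trace is not shown.
rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(5, 700))

# The 15 traits kept for clustering, as (start_hour, end_hour) ranges;
# adjacent hours such as 3-4 are merged into a single trait.
buckets = [(1, 1), (2, 2), (3, 4), (7, 9), (16, 16), (22, 23), (24, 24),
           (48, 48), (72, 72), (117, 117), (191, 191), (218, 218),
           (312, 312), (492, 492), (679, 679)]

# Sum each bucket's hour columns to form one trait per bucket.
traits = np.column_stack([counts[:, lo - 1:hi].sum(axis=1)
                          for lo, hi in buckets])
print(traits.shape)  # (5, 15)
```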
Using this small set of significant buckets, we performed a number of clustering runs, configuring Cluto to form an incrementally increasing number of clusters in each successive run.
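Cluto is a standalone clustering toolkit, so the sketch below only approximates the described setup (correlation-coefficient similarity, one run per target cluster count) using SciPy's hierarchical clustering over a synthetic node-by-trait matrix; Cluto's direct clustering algorithm and H2 criterion are not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical node-by-trait matrix standing in for the PlanetLab data:
# rows are nodes, columns are the 15 selected uptime buckets.
rng = np.random.default_rng(1)
traits = rng.dirichlet(np.ones(15), size=200)

# Similarity is the correlation coefficient, as in the paper; convert it
# to a distance in [0, 2] for the hierarchical clustering routines.
corr = np.corrcoef(traits)
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

# One clustering run per target cluster count, mirroring the paper's
# incrementally increasing number of clusters.
Z = linkage(squareform(dist, checks=False), method="average")
labels_per_k = {k: fcluster(Z, t=k, criterion="maxclust")
                for k in range(2, 13)}
print({k: len(np.unique(v)) for k, v in labels_per_k.items()})
```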
C. Isolation of Behaviors
Fig. 2 shows the internal similarity and external
similarity for each of the runs we performed. Internal
similarity is the similarity of uptime distributions among
nodes grouped into a common cluster and external similarity
is the similarity between different clusters. As described in
Section III.B, similarity is measured using the correlation
coefficient. Overall, this figure provides evidence that for
this dataset, dividing nodes into clusters isolates unique
shared behaviors.
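Under the definitions above, internal and external similarity can be computed directly from a correlation matrix and a set of cluster labels. A minimal sketch, using toy trait vectors rather than the paper's data:

```python
import numpy as np

def internal_external_similarity(traits, labels):
    """Average pairwise correlation within clusters (internal) and
    between different clusters (external)."""
    corr = np.corrcoef(traits)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    internal = corr[same & off_diag].mean()
    external = corr[~same].mean()
    return internal, external

# Toy example: two well-separated groups of trait vectors (rising vs.
# falling ramps plus a little noise).
rng = np.random.default_rng(2)
a = rng.normal(0, 0.05, size=(10, 15)) + np.linspace(0, 1, 15)
b = rng.normal(0, 0.05, size=(10, 15)) + np.linspace(1, 0, 15)
traits = np.vstack([a, b])
labels = np.array([0] * 10 + [1] * 10)
internal, external = internal_external_similarity(traits, labels)
print(internal, external)  # internal near 1, external near -1
```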
Looking at Fig. 2, we find that internal similarity starts out increasing quickly, begins leveling out, dips, and finally resumes leveling out. For small numbers of clusters, adding a cluster has a strong impact, while the value of each addition decreases as the total number of clusters grows. This is partially explained by the fact that internal similarity is already high by the time we reach larger numbers of clusters. It is also a result of each new cluster no longer having a significant impact on increasing intra-cluster similarity or reducing inter-cluster similarity.
Unlike the internal similarity measurement, external
similarity is flat up to 4 clusters. This reinforces the
observations around the high impact of adding additional
clusters when the total number of clusters is small. Low
external similarity means that the behaviors exhibited by the
different clusters are unique to those clusters. This
combined with high internal similarity means that adding
clusters is isolating unique behaviors that are strongly
shared by cluster members.
Fig. 3 charts Cluto's internal measurement of overall quality. As described in Section III.B, quality is assessed using the ratio of internal to external similarity. While Fig. 2 shows a drop in internal similarity at 12 clusters, the overall quality measurement increases there. This is due to the corresponding drop in external similarity, which improves the ratio enough to more than offset the drop in internal similarity.
Looking at these charts, the six-cluster run roughly forms a knee in both and appears to be a good choice for a strong representation of the significant behaviors within the system.
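Knee selection of this kind can be automated with a simple curvature heuristic. The sketch below uses the discrete second difference over a toy quality curve; this is a generic heuristic, not the method used in the paper, where the knee was chosen by inspection.

```python
import numpy as np

def knee_point(ks, values):
    """Heuristic knee detector: return the k at which the discrete
    second difference of the curve is most negative, i.e. where the
    marginal gain from adding a cluster falls off fastest."""
    v = np.asarray(values, dtype=float)
    second = np.diff(v, n=2)  # aligned with ks[1:-1]
    return ks[1 + int(np.argmin(second))]

# Toy quality curve that rises steadily, then levels out after k = 6.
ks = list(range(1, 11))
values = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.63, 0.65, 0.66, 0.67]
print(knee_point(ks, values))  # prints 6 for this toy curve
```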
D. Cluster Characteristics
Figures 2 and 3. These graphs show how the quality of clustering changes with the number of clusters formed, where quality is positively affected by high similarity of nodes within a single cluster and negatively impacted by high similarity between different clusters. (Fig. 2 plots the average internal and external similarity per cluster against the number of clusters; Fig. 3 plots Cluto's cluster quality value against the number of clusters.)
Figs. 4 and 5 respectively show the degree to which each trait (i.e., the buckets we selected) influences the different clusters for the one- and six-cluster runs. Observe that in the
six-cluster run, there are six distinct patterns of behavior. In
both figures, Cluster 1 has a broadly similar shape, but the
removal of dissimilar nodes in the six-cluster run has
increased behavioral cohesion, so some traits show greater
influence in the cluster and others show less.
Although we omit additional charts for brevity, visual
observation of the charts for the different runs concurs with
the observations made in Section III.C. That is, increasing
the number of clusters beyond a point yields diminishing
returns in terms of the uniqueness of the behavioral patterns
of uptime.
Table I provides additional information that is not
captured by the quality indicators or behavioral patterns. It
shows both the absolute and relative sizes of clusters formed
in runs 1 through 9, where each run is done with an
increasing number of clusters (i.e., the target number of
clusters for run n is n). This table shows that beyond six
clusters, adding additional clusters generally results in
several clusters that are small in size relative to the mean.
This is an indication that the behaviors in the additional
clusters are quite specific and applicable to only a small set
of nodes. Observe in Table I that the smallest cluster in the
7-cluster run (38 nodes) is about 38% smaller than the
smallest cluster in the 6-cluster run (61 nodes). The 7-cluster
run’s smallest cluster is also around 40% of the mean cluster
size for that run, whereas the 6-cluster run’s smallest cluster
is around 55% of the respective mean.
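These size ratios can be recomputed directly from Table I. A short sketch, with the cluster sizes transcribed from the 6- and 7-cluster rows of the table:

```python
# Cluster sizes transcribed from Table I for the 6- and 7-cluster runs.
runs = {
    6: [218, 98, 77, 61, 114, 101],
    7: [210, 89, 55, 74, 111, 38, 92],
}

for k, sizes in sorted(runs.items()):
    total = sum(sizes)              # every run partitions the same 669 nodes
    mean = total / k
    smallest = min(sizes)
    print(f"{k}-cluster run: smallest={smallest} nodes, "
          f"mean={mean:.1f}, smallest/mean={smallest / mean:.0%}")
```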
IV. DISCUSSION
The observations made in Section III show unique
behavioral patterns that are useful for the characterization of
PlanetLab. Since these patterns represent real behaviors,
they provide a more accurate model of how the system
behaves.
Having a good understanding of the behavioral
characteristics of uptime for a system like PlanetLab is
important for performing analysis, supporting proper app-
level design decisions, and helping to explain certain events.
The lack of such a model forces one to rely on assumptions
that may not be correct and can cause the application to fail
to meet expectations.
TABLE I
NODE COUNTS OF THE CLUSTERS GENERATED DURING ANALYSIS OF PLANETLAB DATA
Number of Clusters | Sizes of Clusters | Mean Size (%) | Std. Dev. (%)
1 | 669 (100%) | 100 | -
2 | 390 (58%), 279 (42%) | 50 | 11.31
3 | 339 (51%), 189 (28%), 141 (21%) | 33.3 | 15.70
4 | 319 (48%), 56 (8%), 184 (28%), 110 (16%) | 25 | 17.40
5 | 280 (42%), 90 (14%), 56 (8%), 141 (21%), 102 (15%) | 20 | 13.13
6 | 218 (33%), 98 (15%), 77 (12%), 61 (9%), 114 (17%), 101 (15%) | 16.7 | 8.40
7 | 210 (31%), 89 (13%), 55 (8%), 74 (11%), 111 (17%), 38 (6%), 92 (14%) | 14.3 | 8.24
8 | 77 (12%), 184 (28%), 63 (9%), 54 (8%), 41 (6%), 57 (9%), 93 (14%), 100 (15%) | 12.5 | 6.93
9 | 77 (12%), 178 (27%), 61 (9%), 25 (4%), 54 (8%), 29 (4%), 52 (8%), 93 (14%), 100 (15%) | 11.1 | 7.09
Figures 4 and 5. These show the behavioral characteristics of clusters formed by the single-cluster run and the six-cluster run,
respectively; each chart plots the average number of uptimes per bucket against the bucket sizes in hours. Observe that Cluster 1 has a
similar shape in each chart, though since nodes with dissimilar behavior have been removed from Cluster 1 in the six-cluster run, the
average number of uptimes in the 1-hour bucket increases. Likewise, the other clusters in the six-cluster run show behaviors
indiscernible in the single-cluster run.
Since many aspects of distributed applications are
designed around handling failure, having this level of detail
in the model is often quite important. For example, consider
a distributed file system that uses redundancy to ensure data
availability. If node uptime is generally expected to follow a
standard distribution, designers would miss the effects of
the spikes we see in the aggregate in Fig. 1. Further, they
would miss the substantial differentiation of subsets of
nodes shown in Fig. 5. The result of this may be the over- or
under-maintenance of data copies. Over-maintenance is
costly, while under-maintenance puts one’s service level
agreements at risk.
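The replication trade-off can be made concrete with the standard independence model: if each replica is up with probability p, the chance that at least one of k replicas is available is 1 − (1 − p)^k. The sketch below uses purely hypothetical availabilities for a "flaky" and a "stable" subset of nodes (not measured PlanetLab values) to show how strongly the required replication factor depends on which behavioral subset a node belongs to:

```python
import math

def replicas_needed(p, target):
    """Smallest k with 1 - (1 - p)**k >= target, assuming replicas fail
    independently with per-node availability p (a simplifying assumption)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Hypothetical per-node availabilities for a "flaky" and a "stable" subset;
# these probabilities are illustrative, not drawn from the PlanetLab data.
for p in (0.5, 0.95):
    k = replicas_needed(p, target=0.999)
    print(f"p={p}: {k} replicas needed for 99.9% data availability")
```

A designer who assumes one system-wide availability figure would pick a single k; knowing that distinct node subsets exist allows a different k per subset, avoiding both over- and under-maintenance.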
To build a more concrete example, we consider
CoDeeN, a decentralized CDN built on PlanetLab [20]. The
system is designed such that nodes avoid “problematic”
peers rather than incurring the performance hit those nodes
would impose. The question of determining peer group size,
which is based on the stability of peers, is labeled as an open
question. Based on our observation of behaviors within
PlanetLab, this future work would benefit from accounting
for the differences in subsets of nodes within the system.
That is, the differences in subsets of nodes can allow for
more refined tuning of peer selection and perhaps proactive
changes in the topology based on the expectation of
impending failure.
Aside from designing and tuning applications around the
behavior of PlanetLab itself, the behaviors we have
identified can also help to identify groups of nodes that
reflect the behavior of the target environment for which one
is creating the application. For example, if the application is
to be deployed to an environment in which nodes are highly
stable, it makes more sense to test the application on the
nodes that most closely reflect this behavior than on a
random selection of nodes.
V. RELATED WORK
In [9], the authors investigated methods of classifying
failures and studied node behaviors in the PlanetLab
testbed. Although they did not use clustering techniques, they
devised a systematic method of classifying failures by
duration, group size, and the nature of the failure. They
further applied additional information such as geography,
error messages, and resource usage to explain possible
causes of the errors.
In [10], the authors studied classification of the
availability of CPUs in a massive distributed system through
the use of clustering techniques. In their analysis, they
excluded all nodes with nonrandom behavior to focus on
emergent patterns across nodes whose behaviors are
independent and identically distributed. Interestingly, they
too found six clusters to be the right number for behavioral
classification. In their analysis, they focused on subdividing
the nodes into groups whose behavior could be modeled
using known statistical distributions.
Several works have presented other unique perspectives
on representation of node uptimes. In one such study, the
authors propose a method to model system-wide churn
behavior that arises from individual node behavior, but does
not require modeling the behavior of individual nodes [12].
In another, the authors find three noticeable, distinct
behaviors in a dataset containing node downtimes measured
on a large corporate network [2], where each of these
behaviors can be represented by a well-known distribution.
Using this insight, the authors formed a unique downtime
model composed of a combination of the three
distributions.
Various works have also shown the different models
suitable for representing component uptimes in different
systems and environments. Examples of the variety of
systems that have been studied include a P2P file-sharing
system [3], student labs at universities [8][14], worker nodes
in a small resource pool [14], Internet hosts [14], and high-
performance computing clusters [18][19]. These studies
repeatedly show that the most suitable model of system
behavior varies greatly with the system in question.
A number of systems use knowledge of expected
uptimes to make functional decisions. One such system is
the Total Recall storage system [1]. Total Recall uses the
availability of participating nodes to determine the amount
of data redundancy needed to maintain a specific level of
availability for the data. To make replication decisions, live
measurements of short-term node availability are fed into
the redundancy policy, which estimates the likelihood of a
set of hosts being available at a given time. In [5], the
authors suggest the alternative of quantifying short-term
node availability from the viewpoint of the system’s
participants rather than from the viewpoint of an outsider.
The authors show that the number of file transfers required
under this alternative scheme is as little as 25% of what was
originally thought to be needed to meet the data-availability
goal. This seems to be strong evidence of how having a
different, perhaps more accurate, model of availability can
provide significant benefit.
Instead of modeling availability, it is also possible to
simulate it. In [7], the authors present DieCast, a
way to simulate huge systems in their
entirety with only a fraction of the resources needed for
actual deployment. The trade-off, however, is that a large
amount of time is spent in simulation since the scale-down
in resource consumption is achieved through time-dilation.
The authors of [13] explore the state-machine approach
to characterizing node uptimes. Instead of using probabilistic
models, they attempt to predict uptimes and downtimes using
state machines based on branch-prediction techniques.
VI. CONCLUSION
By applying a Chi-Square goodness of fit test to an
aggregation of PlanetLab node uptime data, we found the
distribution of this aggregation to be poorly fitted to the
Exponential and Weibull distributions. We observed that
visible aberrations in the aggregate distribution of node
uptimes were a primary cause of this.
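The goodness-of-fit methodology can be sketched as follows. This script bins synthetic uptimes, fits an Exponential distribution by maximum likelihood, and computes the Pearson chi-square statistic; the mixture used to generate the data (an exponential bulk plus a reboot "spike") is purely illustrative of the kind of aberration described above, not the actual PlanetLab trace, and all parameters are hypothetical:

```python
import math
import random

random.seed(42)

# Synthetic uptimes in hours: an exponential bulk plus a spike of nodes
# rebooting on a roughly daily schedule (illustrative parameters only).
uptimes = [random.expovariate(1 / 72.0) for _ in range(900)]
uptimes += [24.0 + random.gauss(0, 0.5) for _ in range(100)]

# MLE fit of an exponential distribution: the fitted mean is the sample mean.
mean_up = sum(uptimes) / len(uptimes)

# Bin the data and compare observed with expected counts.
edges = [0, 12, 24, 36, 48, 72, 96, 144, float("inf")]
observed = [0] * (len(edges) - 1)
for u in uptimes:
    for i in range(len(edges) - 1):
        if edges[i] <= u < edges[i + 1]:
            observed[i] += 1
            break

def expon_cdf(x):
    return 1.0 if x == float("inf") else 1 - math.exp(-x / mean_up)

expected = [len(uptimes) * (expon_cdf(edges[i + 1]) - expon_cdf(edges[i]))
            for i in range(len(edges) - 1)]

# Pearson chi-square statistic; large values indicate a poor fit.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1 - 1  # bins - 1 - number of fitted parameters
print(f"chi-square = {chi2:.1f} on {dof} degrees of freedom")
```

Even this small spike inflates the statistic well past the rejection threshold, mirroring how the aberrations in the aggregate PlanetLab data defeat the Exponential and Weibull fits.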
Using clustering techniques, we iterated through the
formation of different numbers of clusters to identify a
clustering of high quality that does not form the arbitrary
groupings that occur when too many clusters are allowed to
form. This led us to a set of six disjoint groups of nodes, in
which each group is characterized by a distinct pattern of
uptime behavior.
We believe this set of clusters to be valuable to those
building on PlanetLab and perhaps to the designers of
PlanetLab itself. Knowledge of the behaviors exhibited by
the clusters can contribute to making better design decisions
and ensuring that testing is being performed under valid
assumptions.
Future work around gathering additional information on
uptime behaviors in distributed systems includes:
• Analysis using different interpretations of responsiveness
(different ping requirements, CPU availability, etc.)
• Analysis of downtime
• Analysis of different systems
It would also be interesting to see how the behavioral
patterns of clusters would change if the clusters were not
disjoint. Finally, it goes without saying that we would like
to apply our findings to the design and tuning of real
systems and applications.
REFERENCES
[1] R. Bhagwan, K. Tati, Y. Cheng, S. Savage, and G. M. Voelker. Total
Recall: System support for automated availability management. In
Proceedings of the 1st conference on Symposium on Networked
Systems Design and Implementation - Volume 1, 2004.
[2] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of
a serverless distributed file system deployed on an existing set of
desktop PCs. In ACM SIGMETRICS Performance Evaluation Review,
Volume 28, Issue 1, 2000.
[3] F. E. Bustamante and Y. Qiao. Friendships that last: Peer lifespan and
its role in P2P protocols. In Web Content Caching and Distribution:
Proceedings of the 8th International Workshop, 2004.
[4] A. Datta and K. Aberer. Internet-scale storage systems under churn - A
study of the steady-state using Markov models. In Proceedings of the
Sixth IEEE International Conference on Peer-to-Peer Computing,
2006.
[5] R. J. Dunn, J. Zahorjan, S. D. Gribble, and H. M. Levy. Presence-
based availability and P2P systems. In Proceedings of the Fifth IEEE
International Conference on Peer-to-Peer Computing, 2005.
[6] B. Godfrey. Repository of availability traces. Available online at
http://www.cs.illinois.edu/~pbg/availability/.
[7] D. Gupta, K. Vishwanath, and A. Vahdat. DieCast: Testing
distributed systems with an accurate scale model. In Proceedings of
the 5th ACM/USENIX Symposium on Networked Systems Design and
Implementation (NSDI), 2008.
[8] T. Heath, R. P. Martin, and T. D. Nguyen. The shape of failure. In
Proceedings of the First Workshop on Evaluating and Architecting
System dependabilitY (EASY), 2001.
[9] S. Jain, R. Prinja, A. Chandra, and Z. Zhang. Failure classification
and inference in large-scale systems: A systematic study of failures in
PlanetLab. Technical Report 08-014, University of Minnesota -
Computer Science and Engineering, 2008.
[10] B. Javadi, D. Kondo, J. Vincent, and D. P. Anderson. Mining for
statistical models of availability in large-scale distributed systems: An
empirical study of SETI@home. In Proceedings of the IEEE/ACM
International Symposium on Modelling, Analysis and Simulation of
Computer and Telecommunication Systems (MASCOTS), 2009.
[11] G. Karypis. CLUTO – A clustering toolkit. Technical Report #02-
017, Department of Computer Science, University of Minnesota,
2003.
[12] S. Y. Ko, I. Hoque, and I. Gupta. Using tractable and realistic churn
models to analyze quiescence behavior of distributed protocols. In
Proceedings of the IEEE Symposium on Reliable Distributed Systems
(SRDS), 2008.
[13] J. W. Mickens and B. D. Noble. Exploiting availability prediction in
distributed systems. In Proceedings of the 3rd Symposium on
Networked Systems Design & Implementation (NSDI), Volume 3,
2006.
[14] D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in
enterprise and wide-area distributed computing environments. In
Proceedings of European Conference on Parallel Computing
(EUROPAR), 2005.
[15] PlanetLab – An open platform for developing, deploying, and
accessing planetary-scale services. Available online at http://planet-
lab.org/.
[16] S. Ramabhadran and J. Pasquale. Analysis of long-running replicated
systems. In Proceedings of the 25th IEEE Annual Conference on Computer
Communications (INFOCOM), 2006.
[17] J. L. Romeu. The Chi-Square: A large-sample goodness of fit test.
RAC START, Volume 10, Number 4, 2004.
[18] B. Schroeder and G. A. Gibson. A large-scale study of failures in
high-performance computing systems. In Proceedings of the
International Conference on Dependable Systems and Networks
(DSN), 2006.
[19] B. Schroeder and G. A. Gibson. Disk failures in the real world. In
Proceedings of the 5th USENIX conference on File and Storage
Technologies, 2007.
[20] L. Wang, K. Park, R. Pang, V. Pai, and L. Peterson. Reliability and
Security in the CoDeeN Content Distribution Network. In
Proceedings of the USENIX 2004 Annual Technical Conference,
2004.