This document discusses a study that uses clustering techniques to analyze network flow data and extract patterns of torrent usage at Korea University. The study transforms the network flow data into time blocks, then uses k-means clustering and principal component analysis to identify the optimal number of clusters and visualize the cluster formations. It finds that seven clusters best capture the distinct torrent usage patterns, analyzes the stability of those clusters, and identifies two that show regular torrent usage during working hours and heavy overall usage. The goal is to help network administrators identify times of heavy bandwidth usage.
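To make the clustering step concrete, here is a minimal, self-contained sketch of k-means in Python. The feature vectors and the choice of two clusters are hypothetical stand-ins for the study's time-block usage data, not values taken from it.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then recompute centroids."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old one if empty)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Hypothetical per-host feature vectors (e.g. daytime vs nighttime torrent volume)
data = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
        (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids, clusters = kmeans(data, 2)
```

In practice one would run this for several values of k and pick the elbow of the inertia curve, which is how a study of this kind typically settles on a cluster count.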
Data mining projects topics for java and dot net - redpel dot com
This document discusses several papers related to data mining and machine learning techniques. It begins with a brief summary of each paper, discussing the key contributions and findings. The summaries cover topics such as differential privacy-preserving data anonymization, fault detection in power systems using decision trees, temporal pattern searching in event data, high dimensional indexing for similarity search, landmark-based approximate shortest path computation, feature selection for high dimensional data, temporal pattern mining in data streams, data leakage detection, keyword search in spatial databases, analyzing relationships on Wikipedia, improving recommender systems using user-item subgroups, decision trees for uncertain data, and building confidential query services in the cloud using data perturbation.
Distributed processing of probabilistic top k queries in wireless sensor netw... - JPINFOTECH JAYAPRAKASH
Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks introduces the concepts of sufficient and necessary sets to facilitate localized data pruning in sensor network clusters. It develops three algorithms - sufficient set-based, necessary set-based, and boundary-based - for intercluster query processing with bounded communication rounds. An adaptive algorithm is also introduced to minimize transmission costs under dynamic data distributions. Experimental results show the proposed algorithms significantly reduce data transmission with only small constant communication rounds.
Efficient Query Evaluation of Probabilistic Top-k Queries in Wireless Sensor ... - ijceronline
The document summarizes research on efficient query evaluation methods for probabilistic top-k queries in wireless sensor networks. It proposes three algorithms (SSB, NSB, BB) that use the concepts of sufficient and necessary sets to prune data and reduce transmissions between clusters and the base station. It also develops an adaptive algorithm that dynamically switches between the three based on estimated costs. Experimental results show the algorithms outperform baselines and the adaptive approach achieves near-optimal performance under different conditions.
Research Inventy : International Journal of Engineering and Science - inventy
Research Inventy : International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open-access journal, available both online and in print, that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic, and computer engineering, as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. All articles are peer-reviewed; the review process takes only 7 days, and accepted papers are published within 20 days of acceptance.
The Impact of Data Replication on Job Scheduling Performance in Hierarchical ... - graphhoc
In data-intensive applications, data transfer is a primary cause of job execution delay, and data access time depends on available bandwidth. The major bottleneck to fast data access in Grids is the high latency of Wide Area Networks and the Internet. Effective scheduling can reduce the amount of data transferred across the Internet by dispatching a job to where the needed data reside; another solution is a data replication mechanism. The objective of dynamic replica strategies is to reduce file access time, which in turn reduces job runtime. In this paper we develop a job scheduling policy and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve data access efficiency. We evaluate our approach through simulation; the results show that our algorithm improves on current strategies by 12%.
Delivering Application-Layer Traffic Optimization (ALTO) Services based on ... - Danny Alex Lachos Perez
Application-Layer Traffic Optimization (ALTO) is an IETF standardized protocol that provides abstract network topology and cost maps in addition to endpoint information services that can be consumed by applications in order to become network-aware and take optimized decisions regarding traffic flows. In this work, we propose a public service based on the ALTO specification using public routing information available at the Brazilian Internet eXchange Points (IXPs). Our ALTO server prototype takes the acronym of AaaS (ALTO-as-a-Service) and is based on over 2.5GB of real BGP data from the 25 Brazilian IX.br public IXPs. We evaluate our proposal in terms of functional behaviour and performance via proof of concept experiments which point to the potential benefits of applications being able to take smart endpoint selection decisions when consuming the developer-friendly ALTO APIs.
ENERGY-EFFICIENT DATA COLLECTION IN CLUSTERED WIRELESS SENSOR NETWORKS EMPLOY... - ijwmn
In this paper, an energy-efficient data collection method is proposed that exploits an integration between a Discrete Cosine Transform (DCT) matrix and clustering in wireless sensor networks (WSNs). Based on the fact that sensory data in WSNs is often highly correlated and therefore well suited to the DCT domain, we propose that each cluster in the network send only a small number of large DCT-transformed coefficients to the base station (BS) for data collection, in one of two common ways: either directly or via multi-hop routing. All sensory data from the network can then be recovered from the large coefficients received at the BS. We further analyze and formulate the communication cost, i.e., the power consumed in transmitting data in such networks, as a stochastic problem. Several common clustering algorithms are applied and compared to verify the analysis and simulation results, and both noisy and noiseless environments are considered.
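The compression idea behind this abstract can be sketched in a few lines: transform correlated readings with the DCT, transmit only the largest coefficients, and reconstruct at the base station. The transform below is a textbook unnormalized DCT-II/III pair, and the signal is a made-up example chosen to be sparse in the DCT domain; neither is taken from the paper.

```python
import math

def dct(x):
    """Unnormalized DCT-II of a real sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse (DCT-III), scaled so that idct(dct(x)) recovers x."""
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                            for k in range(1, N))) * 2 / N
            for n in range(N)]

# Hypothetical correlated readings: a DC level plus one smooth oscillation
N = 16
signal = [10 + 3 * math.cos(math.pi * 2 * (2 * n + 1) / (2 * N)) for n in range(N)]
coeffs = dct(signal)

# Keep only the 2 largest-magnitude coefficients, as a cluster head would transmit
keep = sorted(range(N), key=lambda k: abs(coeffs[k]), reverse=True)[:2]
sparse = [c if k in keep else 0.0 for k, c in enumerate(coeffs)]
recovered = idct(sparse)
```

Because the example signal lives entirely in two DCT basis functions, two coefficients recover it exactly; real sensory data would incur a small, energy-bounded reconstruction error instead.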
This document discusses dynamic adaptation techniques for optimizing data transfer performance over networks. It describes how the number of concurrent data transfer streams can be adjusted dynamically according to changing network conditions, without relying on historical measurements or external profiling. The proposed approach gradually increases the level of parallelism during a transfer to find a near-optimal number of streams based on instant throughput measurements, allowing it to adapt to varying environments and network utilization over time.
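The gradual ramp-up this summary describes can be sketched as a simple hill climb over instantaneous throughput: keep adding streams while the measured rate improves, and step back once it flattens. The throughput curve below is a made-up stand-in for live measurements, not data from the paper.

```python
def tune_streams(measure_throughput, max_streams=16):
    """Increase parallelism while instantaneous throughput keeps improving."""
    best, n = 0.0, 1
    while n <= max_streams:
        t = measure_throughput(n)
        if t <= best:          # gain flattened or reversed: settle on the previous count
            return n - 1
        best, n = t, n + 1
    return max_streams

# Hypothetical throughput (MB/s) by stream count: rises, saturates, then degrades
curve = {1: 10, 2: 19, 3: 26, 4: 30, 5: 29, 6: 25}
opt = tune_streams(lambda n: curve.get(n, 0))
```

A production version would re-run this probe periodically, since the optimum drifts as competing traffic changes.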
An Efficient top-k Query Processing in Distributed Wireless Sensor Networks - IJMER
Wireless Sensor Networks (WSNs) are usually defined as large-scale, ad-hoc, multi-hop, unpartitioned wireless networks of homogeneous, small, static nodes deployed in an area of interest. Applications of sensor networks include monitoring volcano activity, building structures, or natural habitats. In this paper, we present the problem of processing probabilistic top-k queries in distributed wireless sensor networks. The basic difficulty is that no single method solves top-k query processing in general, because there are many variants; the method must be chosen according to the situation, the classification and type of the database, and the query model. Here we develop three algorithms, namely sufficient set-based (SSB), necessary set-based (NSB), and boundary-based (BB), for inter-cluster query processing with bounded rounds of communication. Moreover, to respond to dynamic changes of data distribution across the network, we develop an adaptive algorithm that dynamically switches among the three proposed algorithms to minimize the transmission cost.
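The pruning idea behind sufficient sets can be illustrated on deterministic scores (the paper's setting is probabilistic, so this is a simplification): a cluster's local top-k is "sufficient" because any reading outside it can never reach the global top-k, so cluster heads need forward only those values. The cluster readings below are hypothetical.

```python
import heapq

def sufficient_set(readings, k):
    """A cluster's local top-k: readings outside it can never enter the global top-k."""
    return heapq.nlargest(k, readings)

def global_top_k(cluster_readings, k):
    """Cluster heads forward only their sufficient sets; the base station merges them."""
    forwarded = [r for readings in cluster_readings for r in sufficient_set(readings, k)]
    return heapq.nlargest(k, forwarded)

clusters = [[9, 4, 7, 1], [8, 8, 2], [3, 6, 10]]
top3 = global_top_k(clusters, 3)
```

Note that only 9 of the 10 readings cross the network here; with larger clusters the savings grow, which is the transmission-cost reduction the experiments measure.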
The document examines patterns in SNMP data from two busy network links over a period of six months. Graphs of bandwidth usage were generated on a monthly and weekly basis to look for any time-related patterns. The data for the two links over one year was then collected and analyzed using machine learning software. Initial time series prediction using the data had an error rate of 40-50% but this was reduced to 30-40% using other techniques. Further study is needed to make more useful predictions.
Clustering-based Analysis for Heavy-Hitter Flow Detection - APNIC
This document summarizes a research paper that proposes using unsupervised machine learning clustering techniques rather than thresholds to detect heavy hitter (HH) flows in a network. It describes collecting network flow data and analyzing it using algorithms like K-means and Gaussian mixtures to group flows. This identified multiple clusters rather than just two groups (elephants and mice). Further clustering an ambiguous zone revealed patterns that could better classify HH flows without relying on thresholds. The clustering results were then passed to an SDN controller to mark flows and take appropriate actions like re-routing.
REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D... - csandit
Reducing the amount of data processed when the information flow is high is essential in processes that require short response times, such as the detection of anomalies in data networks. This work applies the wavelet transform to reduce the size of the monitoring register of a software-defined network. Its main contribution lies in obtaining a record that, although reduced, retains the detailed information required by anomaly detectors.
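As a minimal illustration of the idea (the paper does not specify its wavelet family, so this sketch assumes the simplest one, Haar): one decomposition level halves the register while the trend coefficients still expose a traffic spike. The per-interval counts are invented.

```python
def haar_step(x):
    """One level of the Haar transform: pairwise averages (trend) and details."""
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    det = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return avg, det

# Hypothetical per-interval packet counts; the spike is the anomaly to preserve
register = [10, 10, 12, 12, 50, 52, 11, 11]
trend, detail = haar_step(register)
```

The trend half is what a reduced monitoring register would keep; the detail half, nearly zero except at sharp transitions, can be thresholded away.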
Solve Production Allocation and Reconciliation Problems using the same Network - Alkis Vazacopoulos
Production allocation is a business accounting practice used throughout the processing world to proportionately and quantitatively assign measurement error and production expenditures or overheads to internal and external business owners. Reconciliation is a scientific function that vets production data for gross errors or non-random variation and finds more precise estimates of the measured values. Our proposed technique allows these two functions to use the same production network or flow-path: only one model needs to be maintained, eliminating the possibility of potentially costly mis-allocation due to mismatch between the business and engineering models. Mis-allocation due to measurement errors can still be problematic, as we illustrate in an example, but should be reduced over time thanks to the reconciliation measurement diagnostics.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Distributive energy efficient adaptive clustering protocol for wireless sen... - Chính Cao
The document proposes a new clustering protocol called DEEAC for wireless sensor networks. DEEAC is adaptive based on the data reporting rates and residual energy levels of nodes. It aims to distribute energy consumption more evenly across the network by selecting cluster heads that have high residual energy and are located in "hot regions" with high data generation rates. This is intended to prolong the lifetime of sensor networks compared to the original LEACH protocol.
The world has become closer and faster with the enormous growth of distributed networks such as P2P, social networks, overlay networks, and cloud computing. These distributed networks are represented as graphs, and the fundamental component of a distributed network is the relationship defined by linkages among units or nodes. A major concern for computer scientists is how to store such enormous amounts of data, especially in the form of graphs; an efficient data structure for this kind of data should provide a format for fast retrieval as and when required. Although the adjacency matrix is an effective technique for representing a graph with few or many nodes and edges, it cannot cope with the analysis of the huge amounts of data from sites like Facebook or Twitter. In this paper, we study the existing applications of a special kind of data structure, the skip graph, and its various versions, which can be used efficiently for storing such data with optimal storage, space utilization, retrieval, and concurrency.
Internet Traffic Forecasting using Time Series Methods - Ajay Ohri
This article presents and compares three methods for multi-scale Internet traffic forecasting: 1) a novel neural network ensemble approach, 2) the ARIMA time series method, and 3) adapted Holt-Winters time series methods. Experiments were conducted on real-world traffic data from two large Internet service providers at different time scales (5 min, 1 hr, 1 day) and forecast lookaheads. The neural ensemble achieved the best results for 5 min and hourly forecasts, while Holt-Winters performed best for daily forecasts. Accurate multi-scale traffic forecasting could enable more efficient network resource management and anomaly detection.
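To make the Holt-Winters method concrete, here is a small additive Holt-Winters forecaster (level, trend, and seasonal components). The smoothing constants and the flat traffic series are illustrative defaults, not values from the article's experiments.

```python
def holt_winters_forecast(series, period, alpha=0.5, beta=0.3, gamma=0.4):
    """One-step-ahead forecast with additive Holt-Winters (level + trend + seasonality)."""
    level = sum(series[:period]) / period          # initial level: mean of one season
    trend = 0.0
    season = [y - level for y in series[:period]]  # initial seasonal offsets
    for t, y in enumerate(series):
        s = season[t % period]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (y - level) + (1 - gamma) * s
    return level + trend + season[len(series) % period]

flat = [100.0] * 24                 # constant hourly traffic, period of 12 hours
forecast = holt_winters_forecast(flat, period=12)
```

On constant traffic the forecast reproduces the input exactly, a quick sanity check; on real ISP traces the seasonal term is what lets the method track daily cycles.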
A location based least-cost scheduling for data-intensive applications - IAEME Publication
This document summarizes a research paper that proposes a location-based least-cost scheduling algorithm for transferring multiple data-intensive files simultaneously to multiple compute nodes in a grid environment. The proposed model includes an optimized meta-scheduler that receives multiple files, predicts the optimal number of parallel TCP streams to use for each file transfer based on sampling, and schedules the files to compute nodes using a greedy algorithm that considers location and cost. Experimental results showed the optimized model achieved better transfer times and throughput compared to non-optimized transfers.
This document discusses various algorithms used for clustering data streams. It begins by introducing the problem of clustering streaming data and the common approach of using micro-clusters to summarize streaming data. It then reviews several prominent clustering algorithms like DBSCAN, DENCLUE, SNN, and CHAMELEON. The document focuses on the DBSTREAM algorithm, which explicitly captures density between micro-clusters using a shared density graph to improve reclustering. Experimental results show DBSTREAM's reclustering using shared density outperforms other reclustering strategies while using fewer micro-clusters.
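The micro-cluster summaries these stream algorithms share can be sketched with the classic constant-space statistics triple (count, linear sum, squared sum), from which centroid and spread are recoverable at any time; the 1-D version below is a simplification for illustration.

```python
class MicroCluster:
    """Constant-space summary (count, linear sum, squared sum) of absorbed points."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        """Absorb one streamed value in O(1) time and space."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / self.n

    def variance(self):
        return self.ss / self.n - self.centroid() ** 2

mc = MicroCluster()
for x in [1.0, 2.0, 3.0]:
    mc.add(x)
```

Because the triples are additive, two micro-clusters merge by summing fields, which is what makes offline reclustering of the summaries cheap.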
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place data items on channels is a crucial issue, with the objective of minimizing average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, so that all requested items can be downloaded quickly, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; in data broadcast, however, the objectives of reducing access latency and reducing energy cost can contradict each other. Consequently, we define two new problems, the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem, and develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
The document discusses query optimization techniques for sensor networks. It describes the basic architecture of querying in TinyDB, where queries are sent to and processed by the sensor network, and notes disadvantages such as hotspots and the lack of in-network aggregation. The goal is to design a scheme that supports multiple queries while minimizing communication cost through query correlation and transformations. An example flood-warning query is provided. Queries are classified, and optimization techniques such as sync-joins and predicate push-down are discussed.
Abstract: In this paper we develop a new energy-efficient communication scheme for wireless sensor networks (WSNs) based on the Gray code technique. Gray coding saves energy at both the transmitter and the receiver because the time required for transmission is minimized. Wireless sensor networks typically require low-cost devices and low-power operation. We also propose an energy-efficient communication scheme based on a ternary number system encoding of data, in which the 0 and 1 bit values form an energy-based transmission scheme. In a wireless sensor network it is very difficult to recharge or replace batteries, so maximizing node and network lifespan is critical; energy-efficient communication is thus the main objective of a WSN. This technique can be used in many sectors, such as remote healthcare, wireless sensor networks for agriculture, industrial process monitoring, and environmental monitoring.
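The energy argument for Gray coding can be shown in a few lines: consecutive Gray-coded values differ in exactly one bit, whereas plain binary counting can flip several bits per step (e.g. 7 to 8 flips four), and fewer bit transitions mean less switching energy at the radio. This sketch shows the standard binary-reflected Gray code, not the paper's specific scheme.

```python
def to_gray(n):
    """Binary-reflected Gray code: consecutive values differ in exactly one bit."""
    return n ^ (n >> 1)

def bit_flips(a, b):
    """Number of bits that change between two transmitted words."""
    return bin(a ^ b).count("1")

gray = [to_gray(n) for n in range(8)]
binary_flips = sum(bit_flips(n, n + 1) for n in range(7))          # plain binary count
gray_flips = sum(bit_flips(to_gray(n), to_gray(n + 1)) for n in range(7))
```

Over the eight 3-bit values, plain binary counting causes 11 bit transitions while the Gray sequence causes the minimum possible 7, one per step.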
Enchancing the Data Collection in Tree based Wireless Sensor Networks - ijsrd.com
A number of techniques are used in wireless sensor networks to improve data collection from sensor nodes. This is achieved by minimizing the schedule length and by dynamic channel assignment. The schedule length is minimized by a BFS algorithm without interfering links; interfering links can be eliminated by transmission power control and multiple frequencies. Power can be saved by using a beacon signal. Data collection can also be limited by the network topology, so the nodes are arranged accordingly: capacitated minimal spanning trees and degree-constrained spanning trees give significant improvements in scheduling. Finally, data collection is enhanced in terms of security by using a T-Hash Chain algorithm.
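The BFS step can be illustrated by computing breadth-first levels of a collection tree: nodes at the same depth are the natural candidates for shared time slots once interfering links are removed. The tree below is a hypothetical example, not a topology from the paper.

```python
from collections import deque

def bfs_levels(tree, root):
    """Breadth-first depth of every node in a collection tree (adjacency dict)."""
    level = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in tree.get(u, []):
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return level

# Hypothetical collection tree rooted at the sink (node 0)
tree = {0: [1, 2], 1: [3, 4], 2: [5]}
levels = bfs_levels(tree, 0)
```

A scheduler would then walk the levels deepest-first, grouping same-level, non-interfering links into one slot each, which is what bounds the schedule length.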
A Survey on Balancing the Network Load Using Geographic Hash Tables - IOSR Journals
This document summarizes a survey on balancing network load using geographic hash tables. It discusses how geographic hash tables are used to store and retrieve data from nodes in a wireless network. Two approaches to balancing the network load are proposed: 1) An analytical approach that adds new nodes to servers when load exceeds thresholds. 2) A heuristic approach that moves data between nodes to balance load without changing underlying routing protocols. The approaches aim to prevent many requests from going to single nodes. Load balancing improves network lifespan by distributing transmission and reception operations across nodes.
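A minimal sketch of the geographic hash table idea the survey builds on: a data key is hashed to a point in the deployment area and stored at the node closest to that point, so producers and consumers of the same key independently agree on its home node. The node coordinates, area size, and key names are illustrative assumptions.

```python
import hashlib

def geo_hash(key, width=100.0, height=100.0):
    """Map a data key to a deterministic point in the deployment area."""
    h = hashlib.sha256(key.encode()).digest()
    x = int.from_bytes(h[:4], "big") / 2**32 * width
    y = int.from_bytes(h[4:8], "big") / 2**32 * height
    return x, y

def home_node(key, nodes):
    """The node nearest the key's hashed location stores (and serves) the data."""
    x, y = geo_hash(key)
    return min(nodes, key=lambda n: (n[0] - x) ** 2 + (n[1] - y) ** 2)

# Hypothetical node positions in a 100x100 field
nodes = [(10, 10), (90, 10), (10, 90), (90, 90), (50, 50)]
owner = home_node("temperature:zone-3", nodes)
```

The load-balancing problem the survey addresses arises exactly here: popular keys concentrate traffic on their home nodes, motivating the threshold-based and data-migration approaches described above.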
A report on the habits of Internet users in Mexico in 2012, covering the activities users carry out online: the number of users, the average time they spend connected, email, social networks, and e-commerce.
This document summarizes the main audiovisual media: theater, cinema, television, multimedia, and multiple images. It explains their origins and evolution through history, highlighting milestones such as the first films of the Lumière brothers and the transition from silent to sound cinema. Finally, it considers the future of these media with new technologies such as virtual reality and high definition.
A presentation given by Manuel Herrera-Usagre, a technician at the Agencia de Calidad Sanitaria de Andalucía, during the XVIII Congress of the Sociedad Andaluza de Calidad Asistencial (Sadeca), held in Granada from 20 to 22 November 2013.
This short document encourages people to be authentic rather than fake who they are. It suggests focusing on developing real followers, likes, and subscribers by being genuine rather than inauthentic. The message is to focus on quality over quantity by staying true to oneself online.
An Efficient top- k Query Processing in Distributed Wireless Sensor NetworksIJMER
Wireless Sensor Networks (WSNs) are usually defined as large-scale, ad-hoc, multi-hop and
wireless unpartitioned networks of homogeneous, small, static nodes deployed in an area of interest.
Applications of sensor networks include monitoring volcano activity, building structures or natural
habitat monitoring. In this paper, we present the problem of processing probabilistic top-k queries in a
distributed wireless sensor networks. The basic problem in top-k query processing is that, a single method
cannot be used as a solution to the problem of top-k query processing because there are many types of
top-k query processing. The method has to be based on the situation, the classification and the type of
database and the query model. Here we develop three algorithms, namely, sufficient set-based (SSB),
necessary set-based (NSB), and boundary-based (BB), for inter- cluster query processing with bounded
rounds of communications. Moreover, in responding to dynamic changes of data distribution in the
overall network, we develop an adaptive algorithm that dynamically switches among the three proposed
algorithms to minimize the transmission cost.
The document examines patterns in SNMP data from two busy network links over a period of six months. Graphs of bandwidth usage were generated on a monthly and weekly basis to look for any time-related patterns. The data for the two links over one year was then collected and analyzed using machine learning software. Initial time series prediction using the data had an error rate of 40-50% but this was reduced to 30-40% using other techniques. Further study is needed to make more useful predictions.
Clustering-based Analysis for Heavy-Hitter Flow DetectionAPNIC
This document summarizes a research paper that proposes using unsupervised machine learning clustering techniques rather than thresholds to detect heavy hitter (HH) flows in a network. It describes collecting network flow data and analyzing it using algorithms like K-means and Gaussian mixtures to group flows. This identified multiple clusters rather than just two groups (elephants and mice). Further clustering an ambiguous zone revealed patterns that could better classify HH flows without relying on thresholds. The clustering results were then passed to an SDN controller to mark flows and take appropriate actions like re-routing.
REDUCING THE MONITORING REGISTER FOR THE DETECTION OF ANOMALIES IN SOFTWARE D...csandit
Reducing the number of processed data, when the information flow is high, is essential in processes that require short response times, such as the detection of anomalies in data networks. This work applied the wavelet transform in the reduction of the size of the monitoring register of a software defined network. Its main contribution lies in obtaining a record that, although reduced, retains detailed information required by the detectors of anomalies.
Solve Production Allocation and Reconciliation Problems using the same NetworkAlkis Vazacopoulos
Production allocation is a business accounting practice used throughout the processing world to proportionately and quantitatively assign measurement error and production expenditures or overheads to internal and external business owners. Reconciliation is a scientific function to vet production data of gross errors or non-random variation if it occurs and to find more precise estimates of the measured values. The consequence of our proposed technique is to allow these two functions the capability to use the same production network or flow-path. Only one model is required to be maintained eliminating the possibility that potentially costly mis-allocation will occur due to business and engineering model-mismatch. Mis-allocation due to measurement errors can still be problematic as we illustrate in an example, but should be reduced over time because of the reconciliation measurement diagnostics.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
9.distributive energy efficient adaptive clustering protocol for wireless sen...Chính Cao
The document proposes a new clustering protocol called DEEAC for wireless sensor networks. DEEAC is adaptive based on the data reporting rates and residual energy levels of nodes. It aims to distribute energy consumption more evenly across the network by selecting cluster heads that have high residual energy and are located in "hot regions" with high data generation rates. This is intended to prolong the lifetime of sensor networks compared to the original LEACH protocol.
As the world becomes closer and faster, distributed networks such as P2P, social networks, overlay networks, and cloud computing have grown enormously. These distributed networks are represented as graphs, and the fundamental component of a distributed network is the relationship defined by linkages among units or nodes. A major concern for computer scientists is how to store such enormous amounts of data, especially in the form of graphs; an efficient data structure for this purpose should provide a compact format and fast retrieval. Although the adjacency matrix is an effective technique for representing a graph with few or many nodes and edges, it cannot cope with the analysis of the huge volumes of data from sites like Facebook or Twitter. In this paper, we study the existing applications of a special kind of data structure, the skip graph, along with its various versions, which can be used efficiently to store such data with optimal storage, space utilization, retrieval, and concurrency.
Internet Traffic Forecasting using Time Series MethodsAjay Ohri
This article presents and compares three methods for multi-scale Internet traffic forecasting: 1) a novel neural network ensemble approach, 2) the ARIMA time series method, and 3) adapted Holt-Winters time series methods. Experiments were conducted on real-world traffic data from two large Internet service providers at different time scales (5 min, 1 hr, 1 day) and forecast lookaheads. The neural ensemble achieved the best results for 5 min and hourly forecasts, while Holt-Winters performed best for daily forecasts. Accurate multi-scale traffic forecasting could enable more efficient network resource management and anomaly detection.
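For flavor, a from-scratch additive Holt-Winters sketch of the kind compared above (not the authors' implementation or their neural ensemble); the smoothing constants and the synthetic traffic series are arbitrary choices:

```python
import numpy as np

def holt_winters_additive(y, season_len, alpha=0.3, beta=0.05, gamma=0.1, horizon=12):
    """Additive Holt-Winters: smoothed level, trend and seasonal components."""
    y = np.asarray(y, dtype=float)
    level = y[:season_len].mean()
    trend = (y[season_len:2 * season_len].mean() - y[:season_len].mean()) / season_len
    seasonal = list(y[:season_len] - level)      # one seasonal term per phase
    for t, obs in enumerate(y):
        s = seasonal[t % season_len]
        prev_level = level
        level = alpha * (obs - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (obs - level) + (1 - gamma) * s
    # h-step-ahead forecasts: extrapolated trend plus the matching seasonal term.
    return np.array([level + (h + 1) * trend + seasonal[(len(y) + h) % season_len]
                     for h in range(horizon)])

# Synthetic hourly-style traffic: upward trend plus a 12-step seasonal cycle.
t = np.arange(120)
y = 10 + 0.05 * t + 5 * np.sin(2 * np.pi * t / 12)
forecast = holt_winters_additive(y, season_len=12, horizon=12)
print(len(forecast))   # → 12
```

On clean seasonal data like this, the forecast tracks the continuation of the series closely, which is why Holt-Winters does well at the daily scale the article reports.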
A location based least-cost scheduling for data-intensive applicationsIAEME Publication
This document summarizes a research paper that proposes a location-based least-cost scheduling algorithm for transferring multiple data-intensive files simultaneously to multiple compute nodes in a grid environment. The proposed model includes an optimized meta-scheduler that receives multiple files, predicts the optimal number of parallel TCP streams to use for each file transfer based on sampling, and schedules the files to compute nodes using a greedy algorithm that considers location and cost. Experimental results showed the optimized model achieved better transfer times and throughput compared to non-optimized transfers.
This document discusses various algorithms used for clustering data streams. It begins by introducing the problem of clustering streaming data and the common approach of using micro-clusters to summarize streaming data. It then reviews several prominent clustering algorithms like DBSCAN, DENCLUE, SNN, and CHAMELEON. The document focuses on the DBSTREAM algorithm, which explicitly captures density between micro-clusters using a shared density graph to improve reclustering. Experimental results show DBSTREAM's reclustering using shared density outperforms other reclustering strategies while using fewer micro-clusters.
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process at the client side so that all requested items can be downloaded in a short time is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can contradict each other. Consequently, we define two new problems: the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem. We develop a heuristic algorithm to download a large number of items efficiently, and we show that when no item is replicated within a broadcast cycle, an optimal retrieval schedule can be obtained in polynomial time.
The document discusses query optimization techniques for sensor networks. It describes the basic architecture of querying in TinyDB where queries are sent to and processed by the sensor network. It notes disadvantages like hotspots and lack of in-network aggregation. The goal is to design a scheme to support multiple queries minimizing communication cost through query co-relation and transformations. An example flood warning query is provided. Queries are classified and optimization techniques like sync-joins and predicate push-down are discussed.
Abstract— In this paper we develop a new energy-efficient communication scheme for wireless sensor networks (WSNs) based on the Gray code technique. The Gray code technique saves energy at both the transmitter and the receiver because the time required for transmission is minimized. Wireless sensor networks typically require low-cost devices and low-power operation. We propose a new energy-efficient communication scheme for wireless sensor networks based on a ternary number system encoding of data, in which the 0 and 1 bit values form an energy-based transmission scheme. In a wireless sensor network it is very difficult to charge or replace batteries, so maximizing node and network lifespan is very important; thus energy-efficient communication is the main objective of a WSN. This technique can be used in many sectors such as remote healthcare, agriculture, industrial process monitoring, and environmental monitoring.
Enchancing the Data Collection in Tree based Wireless Sensor Networksijsrd.com
A number of techniques are used in wireless sensor networks to improve data collection from sensor nodes. This is achieved by minimizing the schedule length and by dynamic channel assignment. The schedule length is minimized by a BFS algorithm without interfering links; interfering links can be eliminated through transmission power control and multi-frequency operation. Power can be saved by using beacon signals. Data collection can also be limited by the network topology, so the nodes are arranged in tree form. Capacitated minimal spanning trees and degree-constrained spanning trees give significant improvements in scheduling. Finally, data collection is enhanced in terms of security by using the T-Hash Chain algorithm.
A Survey on Balancing the Network Load Using Geographic Hash TablesIOSR Journals
This document summarizes a survey on balancing network load using geographic hash tables. It discusses how geographic hash tables are used to store and retrieve data from nodes in a wireless network. Two approaches to balancing the network load are proposed: 1) An analytical approach that adds new nodes to servers when load exceeds thresholds. 2) A heuristic approach that moves data between nodes to balance load without changing underlying routing protocols. The approaches aim to prevent many requests from going to single nodes. Load balancing improves network lifespan by distributing transmission and reception operations across nodes.
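A toy sketch of the geographic-hashing idea behind this survey (the node grid, key names, and overload threshold are invented for illustration): a data name is hashed to a point, and the nearest node becomes its home, which spreads load across the network.

```python
import hashlib
import numpy as np

def ght_location(key):
    """Geographic hash: map a data name to a point in the unit square."""
    digest = hashlib.sha256(key.encode()).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return np.array([x, y])

# Hypothetical 4x4 grid of storage nodes covering the unit square.
nodes = np.array([[(i + 0.5) / 4, (j + 0.5) / 4]
                  for i in range(4) for j in range(4)])
load = np.zeros(len(nodes), dtype=int)
for k in range(2000):
    home = np.argmin(np.linalg.norm(nodes - ght_location(f"event-{k}"), axis=1))
    load[home] += 1                      # nearest node becomes the item's home

threshold = 2 * load.mean()              # flag nodes holding twice the average
print(load.sum())                        # → 2000: every item has exactly one home
print(int((load > threshold).sum()))     # uniform hashing leaves no hot node here
```

The survey's two balancing approaches kick in precisely when skewed keys or node failures push some node past such a threshold.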
A report on the habits of Internet users in Mexico in 2012, with information on the activities users carry out online: number of users, average time spent connected, email, social networks, and e-commerce.
This document summarizes the main audiovisual media: theater, cinema, television, multimedia, and multiple images. It explains their origins and evolution throughout history, highlighting milestones such as the first films of the Lumière brothers and the transition from silent to sound cinema. Finally, it considers the future of these media with new technologies such as virtual reality and high definition.
Presentation given by Manuel Herrera-Usagre, technician at the Agencia de Calidad Sanitaria de Andalucía, during the XVIII Congress of the Sociedad Andaluza de Calidad Asistencial (Sadeca), held in Granada from November 20 to 22, 2013.
This short document encourages people to be authentic rather than fake who they are. It suggests focusing on developing real followers, likes, and subscribers by being genuine rather than inauthentic. The message is to focus on quality over quantity by staying true to oneself online.
The Darks are a cultural movement that emerged in England in the 1970s and spread to other parts of Europe and America, arriving in Mexico in the late 1980s and early 1990s. Darks usually dress in black and wear clothing and accessories with symbols such as bats, skulls, and spiders. They wear pale makeup and paint their lips and nails black. They have a gloomy, disillusioned view of life and prefer to live by night.
The document lists various automobile manufacturers and some of their popular models. It includes brands such as Alfa Romeo, Aston Martin, Audi, BMW, Bentley, Citroen, Dodge, Ferrari, Ford, Honda, Hyundai, Jaguar, Jeep, Kia, Lamborghini, Land Rover, Lexus, Lotus, Maserati, Mazda, Mercedes-Benz, Mini, Mitsubishi, Nissan, Porsche, Subaru, Suzuki, Toyota, Volkswagen, and Volvo. For each manufacturer, 2-5 popular past or current vehicle models are provided.
Alex McCaslin was the second employee hired at Bison Building Materials' new Casa Grande, Arizona location. She single-handedly set up the office and business operations for the startup facility. As Office Manager, McCaslin handled accounts payable, accounts receivable, invoicing, collections, and prepared financial reports. She created job descriptions and manuals as the company did not have any. McCaslin also acted as the HR liaison to the corporate office in Houston, Texas and was responsible for all personnel files and payroll submittals.
The document discusses the growing interest in Spain in learning Chinese. More and more people are learning Chinese in schools, universities, or language centers. Learning Chinese requires work and discipline, and many experts recommend linguistic immersion in China to master the language. China is actively promoting the spread of Chinese around the world through initiatives such as the Confucius Institute.
The document is the annual report for The Fortress Resorts PLC for the 2014/15 fiscal year. It provides an overview of the company's financial highlights for the year, noting increased revenue, earnings before interest and taxes, and profit after tax compared to the previous year. The chairman's review discusses the global economic environment, noting moderate global growth of around 3% expected over the next few years, with soft commodity prices and declining oil prices supporting global activity. Economic performance varied between countries, with the US and UK seeing labor market improvements while the Eurozone continued to struggle.
This document presents an investigation into the school history of different members of the author's family in order to analyze the evolution of educational systems over the years. The author interviews his grandmother, his mother, and himself about their school experiences, including details about the schools, subjects, teaching methods, and historical context of each era. The objective is to reconstruct the school life of each generation and appreciate changes in aspects such as gender equality...
Donación, mecenazgo y patrocinio como técnicas de relaciones públicas al serv...EfiaulaOpenSchool
This document discusses three corporate social responsibility techniques that companies can use: donation, patronage, and sponsorship. It explains that donation and patronage allow companies to become more visible in their community without much opposition, while sponsorship makes the economic collaboration between the company and the receiving organization even more visible. It also argues that the positive results of being socially responsible should benefit both the company and the receiving organization.
A new device for cardiovascular prevention in professional sportspeople, young athletes, and diabetics: a feasibility and efficacy study carried out in Montbard (21) on the occasion of the 2014 Telethon by Dr Jean-Pierre RIFLER.
Presentation by Bob Humer of the Asphalt Institute on "Recommendations for Mix Design Using RAP/RAS" for the CalAPA Spring Asphalt Pavement Conference & Equipment Expo, April 20-21, 2016, in Ontario, CA.
The document describes the steps for creating tour packages, including market and product research, product structure, the elements of service delivery, package design, package types, and the post-sale stage. It explains that a tour package is a set of services marketed under one brand and sold at a single price, and may include accommodation, transport, meals, excursions, and other elements.
This document presents information about the CentOS operating system. It briefly describes CentOS as a free Linux distribution based on Red Hat Enterprise Linux, with 7 years of support. It covers its history from version 3.1 in 2004 to the then-current version 5, as well as topics such as its desktop, hardware requirements, and supported architectures.
Improving K-NN Internet Traffic Classification Using Clustering and Principle...journalBEEI
K-Nearest Neighbour (K-NN) is a popular classification algorithm; in this research K-NN is used to classify internet traffic. K-NN is appropriate for large amounts of data and gives accurate classification, but it is computationally expensive because it calculates the distance to every item in the dataset. Clustering is one solution to this weakness: a clustering step performed before K-NN classification groups data with the same characteristics without high computation time. Fuzzy C-Means is the clustering algorithm used in this research; it does not require the number of clusters to be fixed in advance, as clusters form naturally from the input dataset. Fuzzy C-Means has a weakness in that its results are often not reproducible even for the same input, because the initial dataset it works from is less than optimal. To optimize the initial dataset, a feature selection algorithm is needed; the one used in this research is Principal Component Analysis (PCA). PCA reduces insignificant attributes or features to create an optimal dataset and can improve performance for both clustering and classification. The result of this research is a combined classification, clustering, and feature selection method that models internet traffic classification with higher accuracy and faster performance.
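A minimal sketch of the PCA-then-K-NN part of the pipeline this summary describes, using plain NumPy and synthetic "traffic" data rather than the paper's dataset or its Fuzzy C-Means step:

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA via SVD on mean-centred data; returns (mean, principal axes)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-NN: majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        nearest = y_train[np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

rng = np.random.default_rng(0)
# Two synthetic traffic classes: 2 informative features, 18 noise features.
X = rng.normal(size=(200, 20))
y = np.repeat([0, 1], 100)
X[:100, :2] += 4.0
mean, axes = pca_fit(X, 2)              # keep only the dominant directions
Z = (X - mean) @ axes.T
acc = (knn_predict(Z[::2], y[::2], Z[1::2]) == y[1::2]).mean()
print(acc > 0.9)                        # → True: classes stay separable after PCA
```

Working in the 2-dimensional projection also makes every distance computation 10x cheaper, which is the speed-up motivation in the summary.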
Distributed processing of probabilistic top k queries in wireless sensor netw...IEEEFINALYEARPROJECTS
JAVA 2013 IEEE DATAMINING PROJECT Distributed processing of probabilistic top...IEEEGLOBALSOFTTECHNOLOGIES
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are calculated on performance measures such as number of iterations, number of points misclassified, accuracy, Silhouette validity index, and execution time.
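For reference, a minimal k-means (Lloyd's algorithm with k-means++ seeding) of the kind such studies compare; this is a generic NumPy sketch on synthetic blobs, not the paper's MATLAB code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):               # k-means++: spread out initial centres
        d2 = np.min(((X[:, None] - np.array(centers)) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):    # converged: assignments are stable
            break
        centers = new
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    wss = ((X - centers[labels]) ** 2).sum()   # within-cluster sum of squares
    return labels, centers, wss

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2))
               for m in ([0, 0], [5, 5], [0, 5])])  # three separated blobs
labels, centers, wss = kmeans(X, 3)
print(sorted(np.bincount(labels).tolist()))   # each blob recovered intact
```

The number of iterations until the convergence check fires and the final WSS are exactly the kinds of performance measures the study reports.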
An Improved Differential Evolution Algorithm for Data Stream ClusteringIJECEIAES
A few algorithms have been implemented by researchers for clustering data streams. Most of these algorithms require the number of clusters (K) to be fixed by the user based on the input data and kept fixed throughout the clustering process, and stream clustering has faced difficulties in picking K. In this paper, we propose an efficient approach to data stream clustering by adopting an Improved Differential Evolution (IDE) algorithm. The IDE algorithm is a fast, robust, and productive global optimization approach for automatic clustering. In our proposed approach, we additionally apply an entropy-based method for detecting concept drift in the data stream and thereby updating the clustering procedure online. We show that, compared with a Genetic Algorithm, our proposed method is the more proficient optimization algorithm. The performance of our proposed technique is assessed: it achieves an accuracy of 92.29%, a precision of 86.96%, a recall of 90.30%, and an F-measure estimate of 88.60%.
A fuzzy clustering algorithm for high dimensional streaming dataAlexander Decker
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Density Based Clustering Approach for Solving the Software Component Restruct...IRJET Journal
This document presents research on using the DBSCAN clustering algorithm to solve the problem of software component restructuring. It begins with an abstract that introduces DBSCAN and describes how it can group related software components. It then provides background on software component clustering and describes DBSCAN in more detail. The methodology section outlines the 4 phases of the proposed approach: data collection and processing, clustering with DBSCAN, visualization and analysis, and final restructuring. Experimental results show that DBSCAN produces more evenly distributed clusters compared to fuzzy clustering. The conclusion is that DBSCAN is a better technique for software restructuring as it can identify clusters of varying shapes and sizes without specifying the number of clusters in advance.
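A compact DBSCAN sketch illustrating the property the paper relies on: clusters of arbitrary shape and size plus explicit noise points, with no cluster count specified in advance. The `eps` and `min_pts` values and the component data here are arbitrary, not the paper's:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    """Minimal DBSCAN: grow clusters from core points; label -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                     # already claimed, or not a core point
        stack, labels[i] = [i], cluster
        while stack:                     # expand through density-connected cores
            j = stack.pop()
            if len(neighbors[j]) < min_pts:
                continue                 # border point: keep label, don't expand
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.1, size=(30, 2)),   # dense component A
    rng.normal([3, 3], 0.1, size=(30, 2)),   # dense component B
    [[10.0, 10.0], [-10.0, 5.0]],            # two isolated outliers
])
labels = dbscan(X, eps=0.5, min_pts=5)
print(int((labels == -1).sum()))             # → 2: the outliers become noise
print(len(set(labels[labels >= 0].tolist())))  # → 2: two components recovered
```

Mapping this back to the paper, each dense component would be a group of related software entities, and noise points are candidates left out of any restructured module.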
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO...cscpconf
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied there; second, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location, so dimensionality reduction can be done at the local sites. In dimensionality reduction, an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done on how PCA can be useful in reducing data flow across a distributed network.
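A sketch of the PCA-based reduction described above: the local site ships only the projected scores (plus the shared components and mean), and the central site reconstructs. The dataset is synthetic and the component count is a made-up choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Local-site dataset: 1000 records x 50 features, but only ~5 latent factors.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 50))

mean = X.mean(axis=0)
_, s, vt = np.linalg.svd(X - mean, full_matrices=False)
k = 5
scores = (X - mean) @ vt[:k].T       # what the local site actually transmits
X_hat = scores @ vt[:k] + mean       # reconstruction at the central site

sent = scores.size + vt[:k].size + mean.size
ratio = sent / X.size
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(round(ratio, 2))               # → 0.11: roughly a 9x reduction in values sent
print(err < 0.05)                    # → True: relative reconstruction error is small
```

The bandwidth saving grows with the gap between the ambient dimension and the data's effective rank, which is the case PCA exploits.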
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...IOSR Journals
This document discusses using k-means clustering to partition datasets that have been generated through horizontal aggregation of data from multiple database tables. It provides background on horizontal aggregation techniques like pivot tables and describes the k-means clustering algorithm. The algorithm is applied as an example to cluster a sample dataset into two groups. The document concludes that k-means clustering can effectively partition large datasets produced by horizontal aggregations to facilitate further data mining analysis.
Parallel KNN for Big Data using Adaptive IndexingIRJET Journal
This document presents an evaluation of different algorithms for performing parallel k-nearest neighbor (kNN) queries on big data using the MapReduce framework. It first discusses how kNN algorithms do not scale well for large datasets. It then reviews existing MapReduce-based kNN algorithms like H-BNLJ, H-zkNNJ, and RankReduce that improve performance by partitioning data and distributing computation. The document also proposes using an adaptive indexing technique with the RankReduce algorithm. An implementation of this approach on an airline on-time statistics dataset shows it achieves better precision and speed than other algorithms.
Optimizing the Data Collection in Wireless Sensor NetworkIRJET Journal
This document discusses optimizing data collection in wireless sensor networks. It begins by introducing the concepts of wireless sensor networks and data collection trees. It then discusses using Breadth-First Search (BFS) for data collection and proposes a Parallel Data Collection in BFS (PDCBFS) approach. PDCBFS allows nodes to aggregate data from themselves and child nodes into a single packet to send to the parent node, reducing transfer time compared to individual packets in BFS. The document analyzes and compares the performance of BFS and PDCBFS in terms of data collected and delay required for collection.
Performance Analysis and Parallelization of CosineSimilarity of DocumentsIRJET Journal
This document discusses performance analysis and parallelization of the cosine similarity algorithm for calculating document similarity. It proposes an optimized algorithm that utilizes parallel computing to calculate cosine similarity for large sets of retrieved documents more efficiently. The conventional cosine similarity algorithm becomes inefficient for large document sets. The parallelized approach aims to enhance efficiency and reduce latency by processing more documents in less time. The document reviews related work applying techniques like parallelization, cosine similarity, and dimensionality reduction to problems involving document clustering, text summarization, and information retrieval.
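The vectorized formulation that such parallelization starts from can be written as a single matrix product over row-normalized vectors (a generic illustration, not the paper's parallel implementation; the document matrix is random):

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """All-pairs cosine similarity as one matrix product (no double loop)."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

rng = np.random.default_rng(0)
docs = rng.random((500, 300))            # e.g. 500 documents, 300 TF-IDF terms
sims = cosine_similarity_matrix(docs, docs)
print(sims.shape)                        # → (500, 500)
print(bool(np.allclose(np.diag(sims), 1.0)))  # → True: each doc matches itself
```

Because the product decomposes over row blocks of `A`, the same computation parallelizes naturally by giving each worker a slice of the documents.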
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
1) The document proposes a mathematical model and optimization service to predict the optimal number of parallel TCP streams needed to maximize data throughput in a distributed computing environment.
2) It develops a novel model that can predict the optimal number using only three data points, and implements this service in the Stork Data Scheduler.
3) Experimental results show the optimized transfer time using this prediction and optimization service is much less than without optimization in most cases.
Resource Allocation Optimization for Heterogeneous NetworkIOSRjournaljce
The improvement of communication throughput using a cross-layer mode of communication is proposed. Due to the critical applications of sensor networks, data must be delivered at a faster rate for frequent refreshing. As the communicated data are sensed or measured values, higher throughput and less interference are required to achieve accuracy and optimal resource utilization. In this paper we propose an integrated cross-layer mechanism to achieve higher performance in sensor networks: a new approach to cross-layer control based on the integrated factors of power allocation and memory blockage. This mode of communication results in faster data transfer with lower energy consumption over a wireless sensor network architecture.
This document compares two solutions for filtering hierarchical data sets: Solution A uses MySQL and Python, while Solution B uses MongoDB and C++. Both solutions were tested on a 2011 MeSH data set using various filtering methods and thresholds. Solution A generally had faster execution times at lower thresholds, while Solution B scaled better to higher thresholds. However, the document concludes that neither solution is clearly superior, and further study is needed to evaluate their performance for real-world human users.
This document summarizes a study on human activity recognition using mobile sensors. It analyzes the performance of a k-nearest neighbors classifier on a publicly available dataset containing accelerometer data from sensors on both arms during various activities. The study finds that some activities are recognized more accurately than others by the basic classifier. It also shows that combining data from both arms and selecting optimal features improves recognition performance compared to using each arm individually with all features.
Efficient Filtering Algorithms for Location- Aware Publish/subscribeIJSRD
Location-based services are widely used in many systems. Previous systems use a pull model, or user-initiated model, where a user submits a query to a server which responds with location-aware answers. To provide results to users with fast response times, a push model, or server-initiated model, is becoming an important computing model in next-generation location-based services. In the push model, subscribers register spatio-textual subscriptions capturing their interests, and publishers send spatio-textual messages; a high-performance location-aware publish/subscribe system is used to deliver publishers' messages to the relevant subscribers. In this paper, we identify the research challenges that arise in designing a location-aware publish/subscribe system. We propose an R-tree based index that merges textual descriptions into R-tree nodes, and we design efficient filtering algorithms and effective pruning techniques to achieve high performance. The method supports both conjunctive queries and ranking queries.
Implementation of query optimization for reducing run timeAlexander Decker
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
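A minimal sketch of the compile-time selectivity estimation this summary describes, using an equi-width histogram and assuming values are uniform within each bucket (the column, bucket count, and predicate are invented; the split-and-merge maintenance is not shown):

```python
import numpy as np

def build_histogram(values, n_buckets=10):
    """Equi-width histogram kept in the catalog at compile time."""
    counts, edges = np.histogram(values, bins=n_buckets)
    return counts, edges

def estimate_selectivity(counts, edges, lo, hi):
    """Estimate the fraction of rows with lo <= value < hi,
    assuming values are uniform inside each bucket."""
    total = counts.sum()
    sel = 0.0
    for c, left, right in zip(counts, edges[:-1], edges[1:]):
        overlap = max(0.0, min(hi, right) - max(lo, left))
        if right > left:
            sel += c * overlap / (right - left)
    return sel / total

rng = np.random.default_rng(0)
ages = rng.integers(18, 66, size=10_000)    # a column of a hypothetical table
counts, edges = build_histogram(ages, 12)
est = estimate_selectivity(counts, edges, 30, 40)
true = ((ages >= 30) & (ages < 40)).mean()
print(abs(est - true) < 0.03)               # → True: estimate tracks reality
```

An optimizer with such estimates for every predicate can order joins and filters at compile time, which is exactly what lets the paper skip run-time optimization.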
Handover Algorithm based VLP using Mobility Prediction Database for Vehicular...IJECEIAES
This paper proposes an improved handover algorithm for vehicle location prediction (VLP-HA) using a mobility prediction database. The main advantage of this method is that the mobility prediction database is based on real traffic data traces. Furthermore, the proposed method can reduce handover decision time and solve the resource allocation problem. The algorithm is simple and can be computed very rapidly; thus, its implementation for a high-speed vehicle is possible. To evaluate the effectiveness of the proposed method, QualNet simulation is carried out under different velocity scenarios, and its performance is compared with the conventional handover method. Simulation highlights the superiority of the proposed method in deciding the best handover location and choosing candidate access points: VLP-HA reduces handover delay by 45% compared to handover without VLP and gives high accuracy with a low-complexity algorithm.
Network Flow Pattern Extraction by Clustering
Eugine Kang, Seojoon Jin
Korea University
Abstract
A network is a system which allows users to connect to the internet. Although the internet seems ever increasing and boundless, the infrastructure that provides it has physical limitations, and network flow analysis tools have been developed for efficient network management. This paper discusses a statistical learning method for efficient network management. Network flow data is first transformed by time block, and the Within Sum of Squares (WSS) and Calinski-Harabasz index (CH) are then calculated to determine the optimal number of clusters. The clustering algorithm used in this paper is k-means clustering. Principal Component Analysis (PCA) allows dimension reduction of the data set so that clusters can be visually assessed from the relative positions of observations. The stability of the clusters is measured by the mean Jaccard coefficient over 10,000 bootstrap resampled data sets.
Keywords: Calinski-Harabasz index, principal component analysis, Jaccard coefficient, bootstrap, k-means clustering
Introduction
Data is collected in all fields, and the amount collected is growing exponentially. Richer datasets allow researchers and organizations to extract meaningful knowledge with the use of machine learning algorithms. In many business and engineering applications, raw data collected from systems needs to be processed into various forms of structured and semantically rich records before data analysis tasks can produce accurate, useful, and understandable results. The difficulty comes, however, when the raw data is hard to understand and manipulate. This paper walks through the network flow data collected at Korea University Sejong Campus and transforms the data to extract meaningful knowledge. The project first started with the goal of increasing network flow efficiency within the campus. Each flow is labeled with its application type, and minimizing or detecting flows with non-academic purposes was essential. Torrent is a p2p file sharing service commonly used for downloading audio and video files. The campus labels torrent as a non-academic purpose flow, and deeper analysis of the torrent flow within the campus is the main purpose of this paper.
Recent network flow analyses have been conducted on a time basis
but fail to produce any meaningful patterns from the users. References to
previous and similar studies, with explanation, will be added to this
section at a later date.
The rest of the paper is organized as follows. We first give an
overview of the data collected from the network. We then discuss the
data transformation steps, which include time base patterns, and parsing
each such pattern to construct statistical features. We report the
experiments of using clustering techniques on the network flow data. We
then describe how to utilize the knowledge extracted from experiments.
Collected Data from Campus Network
Figure 1: Diagram of network flow data collection
The Korea University Sejong Campus data center collects each IP's
network flow information on a minute-by-minute basis. There are about
5,000 IPs designated for desktop connections, and a separate data center
collects mobile device connections through Wi-Fi. The types of
information collected are listed in Table 1. The service and protocol
level features tell us the type of service being used on the local IP
address. The type of service varies from web browsing, video streaming,
and e-mail to torrent, which is the service we are interested in for this
paper. The other features in Table 1 describe the connection to the local
IP. The network flow information is stored in a table-like structure,
e.g. a relational database. However, the entire network flow exceeds 3TB
and required a different type of database to store and extract data. The
Hadoop Distributed File System (HDFS) was used to store this enormous
amount of data, with applications used accordingly to extract it.
Table 1: Network connection features

Feature   Meaning                           Feature   Meaning
flow_I    flow index                        o_octet   outbound byte count
con_I     continue index (flow's parent)    i_pkt     inbound pkt count
sip       local ip                          i_octet   inbound byte count
sport     local port                        flag      flag information
prot      transmission protocol             sr_lvl1   service level 1
dip       remote ip                         sr_lvl2   service level 2
dport     remote port                       sr_lvl3   service level 3
time      time of flow start                pt_lvl1   protocol level 1
o_pkt     outbound pkt count                pt_lvl2   protocol level 2
Data Transformation
The bandwidth of the campus network is limited at any given time, and we
are interested in torrent usage according to time blocks. Detecting torrent
usage patterns by time will allow the network administrator to prepare for
heavy loads of network use at peak times. The data transformation step parses
the data by weekday and hour, recording the incoming and outgoing data amount
per flow.
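As a concrete illustration, this transformation can be sketched in a few lines of pandas. This is a minimal sketch, not the paper's actual pipeline: it assumes the torrent-labeled flow records have already been extracted from HDFS into a DataFrame whose column names follow Table 1.

```python
# Sketch of the time-block transformation: aggregate per-IP torrent
# traffic into weekday x hour blocks, yielding one row per local IP
# with 7 * 24 = 168 features. Column names follow Table 1.
import pandas as pd

def to_time_blocks(flows: pd.DataFrame) -> pd.DataFrame:
    flows = flows.copy()
    flows["time"] = pd.to_datetime(flows["time"])
    flows["weekday"] = flows["time"].dt.dayofweek    # 0 = Monday
    flows["hour"] = flows["time"].dt.hour
    flows["octets"] = flows["i_octet"] + flows["o_octet"]
    blocks = flows.pivot_table(index="sip",
                               columns=["weekday", "hour"],
                               values="octets",
                               aggfunc="sum",
                               fill_value=0)
    # Ensure all 168 blocks exist even if some never saw traffic.
    full = pd.MultiIndex.from_product([range(7), range(24)])
    return blocks.reindex(columns=full, fill_value=0)
```

Each row of the result is one local IP's weekday-hour usage profile, which is exactly the shape rendered as a heatmap in Figure 2.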
Figure 2: Heatmap visualization of data transformation
Figure 2 shows one local IP's torrent flow usage over time. The
horizontal axis represents the weekday and the vertical axis the hour of day.
Red regions indicate relatively heavy usage compared to green regions. Through
this transformation we are able to detect patterns in local IP torrent usage
by time block. This information helps the network administrator prevent heavy
bandwidth loading caused by torrent usage.
Experiment Technique
k-means clustering
k-means clustering is a method of vector quantization that is popular
for cluster analysis in data mining. The goal of the algorithm is to partition
n observations into k clusters in which each observation belongs to the
cluster with the nearest mean. In this paper we set each observation to be a
local IP and observe how the IPs partition.
The algorithm uses an iterative refinement technique. Given a randomly
initialized set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm
proceeds by iterating two steps, an assignment step and an update step. The
assignment step assigns each observation to the cluster with the nearest
centroid: equation 1 assigns each observation $x_p$ to exactly one set
$S_i^{(t)}$. Once every observation is assigned to a cluster, the update step
recalculates each cluster's centroid from the newly assigned observations:
equation 2 averages the features of the member observations to generate the
new centroid.

$S_i^{(t)} = \{ x_p : \| x_p - m_i^{(t)} \|^2 \le \| x_p - m_j^{(t)} \|^2 \ \forall j,\ 1 \le j \le k \}$   (1)

$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$   (2)
Given the number of clusters to be partitioned, the objective of the
algorithm is to solve equation 3: find the cluster assignment that minimizes
the within sum of squares (WSS). WSS is also used to determine the appropriate
number of clusters, but it is inappropriate on its own because WSS naturally
decreases as the number of clusters increases.

$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - m_i \|^2$   (3)
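The two iterated steps can be written out directly. The following is a minimal NumPy sketch of equations 1-3, not the paper's implementation; a library routine such as scikit-learn's KMeans would be the practical choice.

```python
# Minimal k-means sketch: assignment step (eq. 1), update step (eq. 2),
# and the WSS objective (eq. 3). Initial centroids are k random rows.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation goes to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    wss = ((X - centroids[labels]) ** 2).sum()   # within sum of squares
    return labels, centroids, wss
```

Note this sketch does not guard against a cluster becoming empty, which a production implementation must handle.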
Calinski-Harabasz Index
k-means clustering is a fast and efficient clustering technique. However,
one drawback is that k must be determined before the algorithm proceeds. To
optimize clustering results, WSS is a common score for determining k. As
mentioned above, the fact that WSS decreases as k increases makes it hard to
optimize directly. The Calinski-Harabasz index is better suited to finding the
optimal number of clusters for k-means solutions.

The Calinski-Harabasz index is also called the variance ratio criterion.
Equation 4 calculates the variance between clusters, and equation 5 the
variance between observations within a cluster. Well-defined clusters have a
large between-cluster variance and a small within-cluster variance, so the
larger the Calinski-Harabasz index, the better the data partition. To
determine the optimal number of clusters, the best practice is to maximize
equation 6 with respect to k.
$SS_B = \sum_{i=1}^{k} n_i \| m_i - m \|^2$   (overall between-cluster variance)   (4)

$SS_W = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - m_i \|^2$   (overall within-cluster variance)   (5)

$CH = \frac{SS_B}{SS_W} \cdot \frac{N - k}{k - 1}$   (6)

Here $n_i$ is the number of observations in cluster $i$, $m$ is the overall
mean of the data, and $N$ is the total number of observations.
In theory, choosing the k that maximizes the Calinski-Harabasz index
yields the best-partitioned clusters. In practice, however, that k tends to be
lower than the number of clusters needed for distinct pattern analysis.
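A scan over candidate values of k can be sketched with scikit-learn, whose `inertia_` attribute is the WSS of equation 5 and whose `calinski_harabasz_score` implements equation 6. This is an illustrative sketch; the paper's own computation may differ in implementation details.

```python
# Scan candidate k values, recording WSS (eq. 5, via KMeans.inertia_)
# and the Calinski-Harabasz index (eq. 6) for each clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def scan_k(X, k_range=range(2, 11), seed=0):
    scores = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores[k] = {"wss": km.inertia_,
                     "ch": calinski_harabasz_score(X, km.labels_)}
    return scores
```

Plotting both curves against k then reveals the WSS drops and CH bumps discussed in the experiment section.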
Principal Component Analysis
Determining k for clustering data may not always lead to the desired
results. WSS and CH scores are one way to determine the optimal number of
clusters for k-means clustering; in practice, a visual check of the candidate
cluster counts helps with the decision. Principal component analysis (PCA) is
a statistical procedure that uses an orthogonal transformation to convert a
set of observations of possibly correlated variables into a set of values of
linearly uncorrelated variables called principal components. The number of
principal components available is bounded by the number of features in the
data set. By selecting the first two principal components, the data is reduced
to a representation in a lower dimensional space.
Figure 3: Data reduction on data set by PCA
Figure 3 represents 4,111 local IPs with 168 features each on a two
dimensional space. Because of the orthogonal transformation applied by PCA,
the values on the first and second principal components carry no direct
statistical meaning. However, the dimension-reduced visualization shows the
relative nearness of observations to one another. We infer that observations
close to one another share similar feature values and hence belong in the same
partition for cluster analysis. PCA is used as a visual tool to judge the
fitness of our cluster analysis.
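The two-component projection used for the figures can be sketched as follows; the scaling choices here are ours, not stated in the paper.

```python
# Project the 168 weekday-hour features onto the first two principal
# components for plotting the relative positions of the local IPs.
import numpy as np
from sklearn.decomposition import PCA

def project_2d(X):
    pca = PCA(n_components=2)
    Z = pca.fit_transform(X)   # rows: observations; cols: PC1, PC2
    return Z, pca.explained_variance_ratio_
```

The returned explained-variance ratios indicate how much of the original variance the two-dimensional picture preserves.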
Cluster Stability Assessment
Even the most sophisticated clustering algorithms will generate
partitions for fairly homogeneous data sets, so cluster validation is very
important in cluster analysis. An important aspect of cluster validity is
stability: a meaningful, valid cluster should not disappear if the data set is
altered in a non-typical way. There are several conceptions of what a
"non-typical alteration" of the data set is. For this paper, it means that a
data set extracted from the underlying database should give rise to more or
less the same clustering.
The stability of the formed clusters is tested with the methods of
Hennig (2007). Equation 7 calculates the Jaccard coefficient between two
clusters as a measure of similarity. The Jaccard coefficient does not require
the two clusters to have the same number of observations for the measurement
to be valid, which is an advantage for cluster analysis because clusters with
varying observation counts may form.
$\gamma(S_a, S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}$   (7)
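With clusters represented as sets of observation indices, equation 7 is a one-liner:

```python
# Jaccard coefficient (eq. 7) on clusters given as index sets:
# 1.0 for identical clusters, 0.0 for disjoint ones.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```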
Resampling methods are a natural way to assess cluster stability.
Bootstrap resampling is performed to draw new data sets from the original data
set. The chosen clustering algorithm is applied to each resample, and for
every original cluster the similarity to the most similar cluster of that
iteration is recorded. The stability of each cluster is the mean of these
similarities over the resampled data sets. Equation 8 calculates this mean
similarity, where $B^*$ is the number of bootstrap replications.

$\bar{\gamma}(S) = \frac{1}{B^*} \sum_{i=1}^{B^*} \gamma(S, S_i)$   (8)
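The bootstrap stability procedure can be sketched as below. This is an illustrative variant of the Hennig (2007) scheme, not the paper's exact code: as in Hennig's clusterboot, each original cluster is restricted to the points present in the resample before being scored against its best-matching new cluster, and `cluster_fn` is any callable returning one label per row (e.g. a k-means wrapper).

```python
# Bootstrap cluster stability (eq. 8): resample rows with replacement,
# re-cluster, and average each original cluster's best-match Jaccard
# similarity over the resamples.
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_stability(X, labels, cluster_fn, k, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    orig = [set(np.flatnonzero(labels == c).tolist()) for c in range(k)]
    sims = np.zeros(k)
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)  # resample rows
        uniq = set(idx.tolist())
        boot_labels = cluster_fn(X[idx], k)
        # Bootstrap clusters expressed as sets of original row indices.
        boot = [set(idx[boot_labels == c].tolist()) for c in range(k)]
        for i, s in enumerate(orig):
            # Restrict the original cluster to resampled points, then
            # score it against the most similar bootstrap cluster.
            sims[i] += max(jaccard(s & uniq, b) for b in boot)
    return sims / n_boot
```

A stability value near 1 means the cluster reappears in essentially every resample; values below about 0.75 are treated as unstable in the experiments below.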
Experiment Results
The experiment to extract torrent usage patterns by local IP begins by
transforming the data to represent weekday and hour usage. The reason for this
transformation is the natural bandwidth limitation of the campus network
system. Identifying heavy loads of non-academic network usage by time is
necessary for efficient management of the network system.
One drawback of k-means clustering is the pre-determination of k.
This paper first finds candidates for a suitable k using WSS and CH. Figure 4
shows the WSS and CH scores with respect to increasing k. WSS naturally
decreases as the number of clusters increases, but the pattern we are looking
for is a significant drop in WSS from one value of k to the next. Looking at
WSS alone, significant drops can be observed at k = 3, 4, and 7. Although the
drop from 6 clusters to 7 is not as steep as the others, we are only searching
for candidates at this point. CH measures the ratio of the variance between
clusters to the variance between observations within clusters; in the
literature, the k that maximizes CH leads to the best partition. k = 3 stands
out as the best candidate, but restricting the data to 3 clusters may lead to
a lack of distinct torrent usage patterns. Reducing torrent usage to 3 types
may not be as helpful to network administrators and may lead to unnecessary
restrictions on local IPs. The next candidate we look for is a bump in the CH
value relative to the overall trend: k = 7 shows a small bump compared to the
neighboring CH values. This narrows our candidates for k to 3 and 7.
Figure 4: WSS vs CH to determine optimal k

PCA is used as a dimension reduction technique to visualize the relative
positions of observations. When the number of candidates for k cannot be
reduced to one, PCA can help settle the tie through visualization. Figure 5
shows the clusters when k = 3. The relative positioning shows that most
clusters are closely formed, while the observations of cluster 2 spread over
the entire reduced dimensional space. It is difficult to conclude that a
distinct pattern exists within such a cluster. Compared to cluster 2, the
observations of the other clusters lie relatively close to one another, which
suggests a distinct pattern within each of those clusters.
Figure 5: PCA visualization for 3 clusters
Figure 6 shows a different formation of clusters compared to figure 5.
The 7 clusters in figure 6 show relative closeness compared to the 3-cluster
case. Cluster 3 remains the same, while each of the other clusters of figure
5 is partitioned into 3. The clusters formed in figure 6 consist of closely
partitioned observations. Although k = 3 has a higher CH score than k = 7,
the PCA visualization tells a different story. Given the objective of this
paper, extracting more patterns for the network administrator leads to more
useful knowledge for determining heavy torrent usage by time. The spread of
the observations in figure 6 also shows the relative closeness between
observations within each cluster. When applying k-means clustering, we
narrowed the possible candidates for k by comparing WSS and CH scores with
respect to k. When there are multiple candidates, or the purpose of the
experiment requires more clusters to be formed, PCA can perform dimension
reduction to visualize the cluster formation and compare the closeness of
observations. For this data set and experiment, the number of clusters for
further analysis is k = 7.
Figure 6: PCA visualization for 7 clusters
Table 2: Cluster stability

Cluster   Stability   Observations
1         0.842         76
2         0.957         86
3         0.808         91
4         0.773        273
5         0.652        101
6         0.985       3314
7         0.692        170
Non-typical alteration of the data may lead to different clusters
forming, so testing stability is very important in cluster analysis. When
detecting network patterns from the cluster formation, the stability and
validity of these patterns under a different data set must be tested. We use
the Jaccard coefficient as the measure of similarity and generate multiple
data sets by bootstrap resampling. We generated 7 clusters with the k-means
clustering algorithm and computed each cluster's stability value as the mean
of its Jaccard coefficients over 10,000 bootstrap resampled data sets. The
stability value varies between 0 and 1; values over 0.75 are considered
stable, meaning the same cluster forms again under non-typical alteration.

5 out of the 7 clusters are stable enough to derive meaningful
partitioning. The highest stability comes from cluster 6, with a value close
to 1; the number of local IPs belonging to cluster 6 is also the highest,
containing about 80% of all local IPs. Clusters 5 and 7 fail to reach a
stability of 0.75 or higher. The patterns of the stable clusters will be
visually examined.
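The per-cluster pattern heatmaps that follow are simply each cluster's mean usage reshaped back into the weekday-hour grid. A minimal sketch, assuming X holds the 168 time-block features produced by the data transformation and labels holds the k-means assignments (the plotting library choice is ours and shown only as a comment):

```python
# Mean weekday x hour usage for one cluster: rows are weekdays
# (Mon..Sun), columns are hours 0..23, ready for a heatmap.
import numpy as np

def cluster_pattern(X, labels, cluster_id):
    return X[labels == cluster_id].mean(axis=0).reshape(7, 24)

# e.g. plt.imshow(cluster_pattern(X, labels, 0), aspect="auto",
#                 cmap="RdYlGn_r")   # red = heavy usage, as in Figure 2
```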
Figure 7: Torrent usage pattern for cluster 1
Cluster 1 shows regular torrent usage during working hours. The
campus network is used by students, researchers, and professors, but also by
office staff. The torrent usage seems to follow the users' computer usage
time, with the start of the week the most concentrated. This could be
explained by users downloading shows over the weekend.
Figure 8: Torrent usage pattern for cluster 2
Cluster 2 accounts for most of the torrent usage over all observations.
The 86 local IPs belonging to cluster 2 are responsible for loading the
network with heavy torrent usage, and figure 8 makes the reason quite obvious:
unlike cluster 1 with its specific usage times, cluster 2 never seems to stop
using torrent. Managing and curbing the non-academic network usage of these 86
local IPs is the key to freeing up network bandwidth.
Cluster 3 shows a pattern similar to cluster 1 but with heavier
intensity. The difference between clusters 1 and 3 is the moment of peak
usage: cluster 1 shows a gradual decrease in usage as the week goes on, while
cluster 3 peaks on Monday, Thursday, and Friday. The maximum flow visible in
the legend is two to three times the maximum of cluster 1. The usage patterns
of clusters 1 and 3 share a common relationship between working hours and
torrent usage, but differ in the timing of the peaks and in the maximum amount.
Figure 9: Torrent usage pattern for cluster 3
Clusters 4 and 6 share a common pattern showing a lack of torrent
activity. This is an odd discovery for this experiment: the network system is
overloaded with torrent activity, yet these cluster patterns show the contrary
when looking at the number of local IPs actively using the service. Clusters
4 and 6 contain more than 80% of the local IPs, but their torrent usage is
nearly none. This discovery gives heavier emphasis to the pattern shown by
cluster 2: a small number of users are responsible for the majority of the
network load on this campus. Cluster 4 shows a peak on Saturday morning, but
this is not an important feature for a network administrator to worry about,
because overall network usage during that time is low anyway. The peak exists
within cluster 4, but compared to the other clusters its intensity is
significantly lower. The patterns of clusters 4 and 6 allow us to conclude
that most local IPs do not require special administration to curb non-academic
network usage such as torrent.
Figure 10: Torrent usage pattern for cluster 4
Figure 11: Torrent usage pattern for cluster 6
Clusters 5 and 7 fail to achieve stability under the mean Jaccard
coefficient over the bootstrap resampled data sets. The patterns they show
cannot be assumed stable and hence may change with non-typical alteration of
the extracted data set.
Conclusion
The network system within an entity requires management for efficiency.
This paper experimented with a method to detect patterns in a massive data set
collected from network flows. Our first steps were to transform the data by
time blocks and to extract the network flow data labeled as torrent, a
non-academic network service that places a heavy burden on the network system.
WSS and CH were calculated on the transformed data set to narrow down the
possible number of clusters, and the clusters were visually reexamined through
PCA; the dimension reduction technique allows us to observe the relative
positioning of observations. The stability of the clusters formed by k-means
clustering was validated by bootstrap resampling, taking the mean of the
Jaccard coefficients of the most similar clusters formed over 10,000
iterations.

The patterns detected for each cluster are visualized as heatmaps to
easily spot peaks in usage over time. The main discovery of this cluster
analysis is that a small number of users are responsible for the majority of
the network load on the entire system: 80% or more of the users barely load
the network with non-academic usage. A network administrator can use the
detected patterns to apply special administration to users with a history of
heavy usage; because few users fall into this group, the administrator's job
is lessened. Further analysis of the clusters shows their relative peaks, and
this information can also prepare a network administrator for the specific
time slots where network usage will increase. The proposed method can be
applied to other network services as well.
References
Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering
algorithm." Applied Statistics (1979): 100-108.
Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. "The global k-means
clustering algorithm." Pattern recognition 36.2 (2003): 451-461.
Ding, Chris, and Xiaofeng He. "K-means clustering via principal component
analysis." Proceedings of the twenty-first international conference on Machine
learning. ACM, 2004.
Caliński, Tadeusz, and Jerzy Harabasz. "A dendrite method for cluster
analysis." Communications in Statistics-theory and Methods 3.1 (1974): 1-27.
Wold, Svante, Kim Esbensen, and Paul Geladi. "Principal component
analysis." Chemometrics and intelligent laboratory systems 2.1 (1987): 37-52.
Abdi, Hervé, and Lynne J. Williams. "Principal component analysis." Wiley
Interdisciplinary Reviews: Computational Statistics 2.4 (2010): 433-459.
Becker, R. A., J. M. Chambers, and A. R. Wilks. The New S Language. Pacific
Grove, CA: Wadsworth & Brooks/Cole, 1988.
Mardia, Kantilal Varichand, John T. Kent, and John M. Bibby. Multivariate analysis.
Academic press, 1979.
Venables, William N., and Brian D. Ripley. Modern applied statistics with S. Springer
Science & Business Media, 2002.
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational
Statistics and Data Analysis, 52, 258-271.
Hennig, C. (2008) Dissolution point and isolation robustness: robustness criteria for
general cluster analysis methods. Journal of Multivariate Analysis 99, 1154-1176.