A grid computing environment, due to its scale and heterogeneous nature, is more vulnerable to faults. To store and analyze fault and workload information, the Fault Trace Archive (FTA) and the Grid Workload Archive (GWA) are used. Previously, researchers have analyzed the FTA and the GWA as separate research problems, but in this research paper we propose a method for conducting a combined analysis of the FTA and the GWA based on session-based mapping of trace file variables. This is the first attempt to conduct a combined analysis of these two trace files. Along with the step-by-step process of combining the trace files, we also include do's and don'ts for conducting this analysis. Through this combined analysis we establish a correlation-based relationship among the number of node failures, the number of failed jobs, the failure duration and the number of nodes. We find that these variables are positively correlated, with different correlation coefficients.
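As a rough illustration of the correlation analysis described above, the following Python sketch computes pairwise Pearson correlation coefficients over session-mapped counts; the file name and column names (node_failures, failed_jobs, failure_duration, num_nodes) are hypothetical placeholders rather than the actual FTA/GWA schema.

# Hedged sketch: pairwise correlations between session-mapped trace variables.
# The file name and column names are illustrative, not the real FTA/GWA fields.
import pandas as pd

# One row per session after mapping FTA and GWA records onto common sessions.
sessions = pd.read_csv("combined_fta_gwa_sessions.csv")

variables = ["node_failures", "failed_jobs", "failure_duration", "num_nodes"]
corr = sessions[variables].corr(method="pearson")

print(corr.round(3))  # positive coefficients would support the reported finding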
This document summarizes the "KDD Cup - 1999" data sets used for computer network intrusion detection. The data sets contain over 5 million connection records with 41 features each, representing different types of attacks including denial of service, user to root, remote to local, and probing attacks. The document discusses the data redundancy, imbalance, and partitioning of the data sets. It also provides sample code for analyzing the data sets in Python, Java, and R and concludes that while the data sets were useful for early intrusion detection research, they have limitations such as redundancy and imbalance that prompted the introduction of new data sets like NSL-KDD.
Optimization of workload prediction based on map reduce frame work in a cloud...eSAT Publishing House
This document summarizes a research paper that proposes optimizing workload prediction in Hadoop clusters using MapReduce and genetic algorithms. It describes collecting job history data from Hadoop, analyzing workload patterns, and using genetic algorithms to predict future workloads and optimize performance. The implementation analyzes a sample Hadoop trace log to calculate error rates for workload predictions. The goal is to integrate workload prediction into multi-node Hadoop clusters for real-time optimization.
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataAnubhav Jain
The DuraMat Data Hub and Analytics Capability provides a centralized resource for sharing solar PV data. It collects performance, materials properties, meteorological, and other data through a central Data Hub. A data analytics thrust works with partners to provide software, visualization, and data mining capabilities. The goal is to enhance efficiency, reproducibility, and new analyses by combining multiple data sources in one location. Examples of ongoing projects using the hub include clear sky detection modeling to automatically classify sky conditions from irradiance data.
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Gilles Fedak
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
The Big Data challenge consists of managing, storing, analyzing and visualizing these huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes proportionally more complex.
A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing.
''Active Data'' is a new approach to automate and improve the expressiveness of data management applications. It consists of:
* a 'formal model' for the Data Life Cycle, based on Petri Nets, that describes and exposes the data life cycle across heterogeneous systems and infrastructures;
* a 'programming model' that allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data (see the sketch after this list).
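To make the programming model concrete, here is a minimal, hypothetical Python sketch of life-cycle event handlers. It is not the actual Active Data API; it only illustrates the idea of programmer-supplied routines fired on creation, replication, transfer and deletion events.

# Hedged sketch of a life-cycle event bus; not the real Active Data API.
from collections import defaultdict

_handlers = defaultdict(list)   # event name -> list of user routines

def on(event, routine):
    """Register a routine to run when `event` happens to any data item."""
    _handlers[event].append(routine)

def fire(event, data_id, **info):
    """Called by storage/transfer systems when a life-cycle event occurs."""
    for routine in _handlers[event]:
        routine(data_id, **info)

# A programmer-provided routine: react to every newly created data item.
on("creation", lambda data_id, **info: print(f"replicating {data_id}"))
fire("creation", "dataset-42", site="lab-A")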
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...AshishDPatel1
Sequential pattern mining generates sequential patterns that can be used as input to other programs for retrieving information from large collections of data. It typically requires a large amount of memory as well as numerous I/O operations, and multistage operations reduce the efficiency of the algorithm. The proposed GACP is based on a graph representation and avoids recursively reconstructing intermediate trees during the mining process. The algorithm also eliminates the need to repeatedly scan the database. The graph used in GACP is a data structure accessed starting at its first node, called the root; each node of the graph is either a leaf or an interior node. An interior node has one or more child nodes, so the path from the root to any node in the graph defines a sequence. After construction of the graph, a pruning technique called clustering is used to retrieve records from the graph. The algorithm can thus mine the database using compact memory-based data structures and clever pruning methods.
LDV: Light-weight Database VirtualizationTanu Malik
The document summarizes the Light-weight Database Virtualization (LDV) framework. LDV aims to enable easy and efficient sharing of database applications by capturing an application's execution provenance and dependencies. It uses application virtualization techniques to package the application binaries, libraries, and data. For applications that interact with a database, it also records the interactions between the application and database using system call monitoring and SQL logging. This combined provenance allows recreating the application's execution environment and replaying the database interactions to validate or reproduce results. Key components of LDV include provenance modeling, package creation with necessary files and traces, and runtime redirection to reconstruct the environment.
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...Editor IJCATR
Scheduling jobs to resources in grid computing is complicated due to the distributed and heterogeneous nature of the resources. The purpose of job scheduling in a grid environment is to achieve high system throughput and minimize the execution time of applications. The complexity of the scheduling problem increases with the size of the grid and becomes highly difficult to solve effectively, which has opened a new area of research into good and efficient methods for grid scheduling. In this paper, a job scheduling algorithm is proposed to assign jobs to available resources in a grid environment. The proposed algorithm is based on the Ant Colony Optimization (ACO) algorithm, combined with one of the best-known scheduling heuristics, Suffrage; the result of Suffrage is used inside the proposed ACO algorithm. The main contribution of this work is to minimize the makespan of a given set of jobs. The experimental results show that the proposed algorithm can lead to significant performance improvements in a grid environment.
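As a hedged illustration of the Suffrage heuristic that the proposed ACO algorithm builds on (not of the ACO component itself), the Python sketch below repeatedly assigns the job with the largest "suffrage" (second-best minus best completion time) to its best resource; the expected-time matrix is invented for the example.

# Hedged sketch of the Suffrage scheduling heuristic (not the paper's ACO part).
# etc[j][r]: expected time to compute job j on resource r (illustrative numbers).
etc = [[4.0, 9.0, 6.0],
       [7.0, 3.0, 8.0],
       [5.0, 5.5, 2.0],
       [6.0, 4.0, 4.5]]

ready = [0.0] * len(etc[0])          # time at which each resource becomes free
unscheduled = set(range(len(etc)))
schedule = []

while unscheduled:
    best = None
    for j in unscheduled:
        completion = sorted((ready[r] + etc[j][r], r) for r in range(len(ready)))
        suffrage = completion[1][0] - completion[0][0]   # 2nd best minus best
        if best is None or suffrage > best[0]:
            best = (suffrage, j, completion[0][1], completion[0][0])
    _, job, resource, finish = best
    ready[resource] = finish
    unscheduled.remove(job)
    schedule.append((job, resource, finish))

print(schedule, "makespan:", max(ready))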
This document summarizes the "KDD Cup - 1999" data sets used for computer network intrusion detection. The data sets contain over 5 million connection records with 41 features each, representing different types of attacks including denial of service, user to root, remote to local, and probing attacks. The document discusses the data redundancy, imbalance, and partitioning of the data sets. It also provides sample code for analyzing the data sets in Python, Java, and R and concludes that while the data sets were useful for early intrusion detection research, they have limitations such as redundancy and imbalance that prompted the introduction of new data sets like NSL-KDD.
Optimization of workload prediction based on map reduce frame work in a cloud...eSAT Publishing House
This document summarizes a research paper that proposes optimizing workload prediction in Hadoop clusters using MapReduce and genetic algorithms. It describes collecting job history data from Hadoop, analyzing workload patterns, and using genetic algorithms to predict future workloads and optimize performance. The implementation analyzes a sample Hadoop trace log to calculate error rates for workload predictions. The goal is to integrate workload prediction into multi-node Hadoop clusters for real-time optimization.
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataAnubhav Jain
The DuraMat Data Hub and Analytics Capability provides a centralized resource for sharing solar PV data. It collects performance, materials properties, meteorological, and other data through a central Data Hub. A data analytics thrust works with partners to provide software, visualization, and data mining capabilities. The goal is to enhance efficiency, reproducibility, and new analyses by combining multiple data sources in one location. Examples of ongoing projects using the hub include clear sky detection modeling to automatically classify sky conditions from irradiance data.
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Gilles Fedak
Active Data : Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
The Big Data challenge consists in managing, storing, analyzing and visualizing these huge and ever growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion.
A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing.
''Active Data'' is new approach to automate and improve the expressiveness of data management applications. It consists of
* a 'formal model' for Data Life Cycle, based on Petri Net, that allows to describe and expose data life cycle across heterogeneous systems and infrastructures.
* a 'programming model' allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data.
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...AshishDPatel1
The sequential pattern mining generates the sequential patterns. It can be used as the input of another program for retrieving the information from the large collection of data. It requires a large amount of memory as well as numerous I/O operations. Multistage operations reduce the efficiency of the
algorithm. The given GACP is based on graph representation and avoids recursively reconstructing intermediate trees during the mining process. The algorithm also eliminates the need of repeatedly scanning the database. A graph used in GACP is a data structure accessed starting at its first node called root and each node of a graph is either a leaf or an interior node. An interior node has one or more child nodes, thus from the root to any node in the graph defines a sequence. After construction of the graph the pruning technique called clustering is used to retrieve the records from the graph. The algorithm can be used to mine the database using compact memory based data structures and cleaver pruning methods.
LDV: Light-weight Database VirtualizationTanu Malik
The document summarizes the Light-weight Database Virtualization (LDV) framework. LDV aims to enable easy and efficient sharing of database applications by capturing an application's execution provenance and dependencies. It uses application virtualization techniques to package the application binaries, libraries, and data. For applications that interact with a database, it also records the interactions between the application and database using system call monitoring and SQL logging. This combined provenance allows recreating the application's execution environment and replaying the database interactions to validate or reproduce results. Key components of LDV include provenance modeling, package creation with necessary files and traces, and runtime redirection to reconstruct the environment.
Proposing a New Job Scheduling Algorithm in Grid Environment Using a Combinat...Editor IJCATR
Scheduling jobs to resources in grid computing is complicated due to the distributed and heterogeneous nature of the resources.
The purpose of job scheduling in grid environment is to achieve high system throughput and minimize the execution time of applications.
The complexity of scheduling problem increases with the size of the grid and becomes highly difficult to solve effectively.
To obtain a good and efficient method to solve scheduling problems in grid, a new area of research is implemented. In this paper, a job
scheduling algorithm is proposed to assign jobs to available resources in grid environment. The proposed algorithm is based on Ant
Colony Optimization (ACO) algorithm. This algorithm is combined with one of the best scheduling algorithm, Suffrage. This paper uses
the result of Suffrage in proposed ACO algorithm. The main contribution of this work is to minimize the makespan of a given set of
jobs. The experimental results show that the proposed algorithm can lead to significant performance in grid environment.
This document proposes an adaptive algorithm called DyBBS that dynamically adjusts the batch size and execution parallelism in Spark Streaming to minimize end-to-end latency. The algorithm is based on two observations: 1) processing time increases monotonically with batch size, and 2) there is an optimal execution parallelism for a given batch size. DyBBS uses isotonic regression to learn and adapt batch size and parallelism as workload and conditions change. Experimental results show it significantly reduces latency compared to static configurations and other state-of-the-art approaches.
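The following hedged Python sketch (scikit-learn, with invented observations) shows the isotonic-regression idea attributed to DyBBS above: fit a monotone model of processing time versus batch size and pick the smallest batch interval whose predicted processing time still fits inside the interval. It is not the DyBBS implementation.

# Hedged sketch of the isotonic-regression idea behind adaptive batch sizing.
# The observations are invented; this is not the DyBBS implementation.
import numpy as np
from sklearn.isotonic import IsotonicRegression

batch_sizes = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])   # seconds of data per batch
proc_times  = np.array([0.7, 0.9, 1.3, 1.6, 2.4, 3.5])   # observed processing times

model = IsotonicRegression(increasing=True, out_of_bounds="clip")
model.fit(batch_sizes, proc_times)

candidates = np.linspace(0.5, 4.0, 36)
predicted = model.predict(candidates)
stable = candidates[predicted < candidates]   # batches the system can keep up with
print("smallest stable batch interval:", stable.min() if stable.size else None)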
1) Researchers are building a metadata database using DSpace to allow cross-searching of various observational data distributed across research institutes studying the upper atmosphere.
2) The metadata database stores complete metadata descriptions as content rather than as metadata, because DSpace's default Dublin Core format is less flexible than the metadata format that is needed.
3) The metadata database will provide location information for observational data to allow an analysis software to download and plot the data for studying the upper atmosphere.
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
- The document describes a computational materials design pipeline that uses theory, optimization, and natural language processing (NLP) to accelerate materials discovery.
- Key components of the pipeline include optimization algorithms like Rocketsled to find best materials solutions with fewer calculations, and NLP tools to extract and analyze knowledge from literature to predict promising new materials and benchmarks.
- The pipeline has shown speedups of 15-30x over random searches and has successfully predicted new thermoelectric materials discoveries 1-2 years before their reporting in literature.
The document introduces a new data management system called Metadata Event Log (MEL) to store inconsistent metadata entries from a large-scale landslide monitoring project. MEL uses a tabular format to record sensor node metadata and events over time without a rigid data structure. Functions are written to query MEL and infer missing data, returning relevant entries within the specified time period. The system provides a flexible way to track dynamic sensor node updates compared to traditional rigid data management systems.
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Privacy Preserving Multi-keyword Top-K Search based on Cosine Similarity Clus...IRJET Journal
This document proposes a privacy preserving multi-keyword top-k search technique based on cosine similarity clustering. It aims to improve search efficiency for encrypted documents stored on cloud servers. The technique uses cosine similarity clustering to form clusters of similar documents before encrypting and uploading them to the cloud. An encrypted search index is also generated and uploaded. When a user submits a search query, a trapdoor is generated and the most similar document cluster is identified. The top-k most similar encrypted documents within that cluster are then returned as search results. Experimental results show the technique requires less time for document searching and cluster formation compared to other methods.
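As a hedged plaintext illustration of the cosine-similarity ranking that underlies the clustering step described above (the paper's encryption, trapdoor and index machinery are omitted), the Python sketch below ranks toy term-frequency vectors against a keyword query.

# Hedged sketch: rank documents by cosine similarity to a keyword query.
# Plaintext only; the paper's encryption, trapdoors and index are omitted.
import numpy as np

vocab = ["cloud", "search", "privacy", "keyword", "cluster"]
docs = np.array([[3, 1, 0, 2, 0],     # term-frequency vectors (illustrative)
                 [0, 2, 4, 1, 1],
                 [1, 0, 1, 0, 5]], dtype=float)
query = np.array([0, 1, 2, 1, 0], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(d, query) for d in docs]
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
print("top-2 documents:", top_k, [round(scores[i], 3) for i in top_k])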
GEN: A Database Interface Generator for HPC ProgramsTanu Malik
GEN is a database interface generator that takes user-supplied C declarations and provides an interface to load scientific array data into databases without requiring changes to source code. It works by wrapping POSIX I/O calls at runtime to generate database schema definitions and load data. Experiments show it can reduce the time needed to reorganize data in the database compared to loading data from files and reorganizing outside the database. Current work aims to relax GEN's assumptions and improve data loading performance.
Survey on Load Rebalancing for Distributed File System in CloudAM Publications
1. The document discusses load rebalancing algorithms for distributed file systems in cloud computing. It aims to balance the load across storage nodes to improve performance and resource utilization.
2. A large file is divided into chunks which are distributed across multiple storage nodes. If some nodes become overloaded (heavy nodes) while others are underloaded (light nodes), chunks can be migrated from heavy to light nodes using load rebalancing algorithms.
3. The algorithms structure storage nodes in a distributed hash table to allow efficient lookup and migration of chunks between nodes. Nodes independently calculate their load and migrate chunks to balance load without global knowledge of all nodes' loads.
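A minimal Python sketch of the heavy-to-light chunk migration idea is given below; the node names and chunk counts are invented, and the real algorithms operate over a distributed hash table without this global view.

# Hedged sketch of chunk migration from heavy to light nodes; node names invented.
chunks_per_node = {"n1": 12, "n2": 3, "n3": 9, "n4": 4}
target = sum(chunks_per_node.values()) / len(chunks_per_node)   # ideal load

moves = []
heavy = sorted(n for n, c in chunks_per_node.items() if c > target)
light = sorted(n for n, c in chunks_per_node.items() if c < target)

for h in heavy:
    for l in light:
        # Move one chunk at a time until both sides reach the ideal load.
        while chunks_per_node[h] > target and chunks_per_node[l] < target:
            chunks_per_node[h] -= 1
            chunks_per_node[l] += 1
            moves.append((h, l))

print(moves)            # sequence of (source, destination) chunk migrations
print(chunks_per_node)  # loads after rebalancing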
Improving the Performance of Mapping based on Availability- Alert Algorithm U...AM Publications
The performance of mapping can be improved, and the need for this arises in several fields of science and engineering. Such applications can be parallelized in a master-worker fashion, and relevant programming approaches have been proposed for them. In existing systems, application performance is considered only for homogeneous systems, for simplicity. Here we use an Availability-Alert algorithm with Poisson arrivals to extend the approach to heterogeneous, multi-core architecture systems. The proposed algorithm also considers the requirements an application needs for its execution on heterogeneous multi-core architecture systems while maintaining good performance. Performance prediction errors at the end of execution are minimized by using this approach. We present simulation results to quantify the benefits of our approach.
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...Editor IJMTER
The basic idea is that the search tree can be divided into sub-processes of equivalence classes. Since generating item sets within each equivalence class is independent of the others, frequent item set mining can be performed on the sub-trees of the equivalence classes in parallel. The straightforward approach to parallelizing Éclat is therefore to treat each equivalence class as a unit of data (here, agriculture data). Data can be distributed to different nodes, and the nodes can work on their data without any synchronization. Even though sorting helps to produce candidate sets of smaller size, sorting has a cost. Our analysis is that the size of an equivalence class is relatively small (always less than the size of the item base) and that this size also shrinks quickly as the search goes deeper in the recursion. Using agriculture data, large amounts of data can be handled: we first develop the Éclat algorithm, then a parallel Éclat algorithm, and compare them on the same data with respect to time, with the help of support and confidence measures.
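The Python sketch below is a hedged, sequential toy version of the Éclat idea described above: a vertical tidset layout with intersections grouped by shared one-item prefix, where each prefix (equivalence class) could be handed to a different node. The transactions are invented and only pairs are generated.

# Hedged sketch of Eclat: tidset intersections within per-item equivalence classes.
# Toy data; each prefix class could be handed to a different node in parallel.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "b", "d"}, {"b", "c"}, {"a", "c", "d"}]
min_support = 2

# Vertical layout: item -> set of transaction ids containing it.
tidsets = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

frequent = {frozenset([i]): t for i, t in tidsets.items() if len(t) >= min_support}

# Each equivalence class (pairs sharing the same first item) is independent.
for a, b in combinations(sorted(tidsets), 2):
    common = tidsets[a] & tidsets[b]
    if len(common) >= min_support:
        frequent[frozenset([a, b])] = common

for itemset, tids in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), "support:", len(tids))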
Applications of Natural Language Processing to Materials DesignAnubhav Jain
This document discusses using natural language processing (NLP) techniques to extract useful information from unstructured text sources in materials science literature. It describes how NLP models can be trained on large datasets of materials science publications to perform tasks like chemistry-aware search, summarizing material properties, and suggesting synthesis methods. The models are developed using techniques like word embeddings, LSTM networks, and named entity recognition. The goal is to organize materials science knowledge from text into a database called Matscholar to enable new applications of the information.
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
This document discusses open-source software tools for generating and analyzing large materials data sets developed by Anubhav Jain and collaborators. It summarizes several software packages including pymatgen for materials analysis, FireWorks for scientific workflows, custodian for error recovery in calculations, and matminer for data mining. Applications of the tools include generating the Materials Project database containing properties of over 65,000 materials compounds calculated using high-performance computing resources. The document emphasizes the importance of open-source collaborative software development and automation to accelerate materials discovery.
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
These slides were presented at AGU 2018 by Tanu Malik from DePaul University, in a session convened by Dr. Ian Foster, director of the Data Science and Learning division at Argonne National Laboratory.
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Anubhav Jain
The document discusses software tools for high-throughput materials design and machine learning developed by Anubhav Jain and collaborators. The tools include pymatgen for structure analysis, FireWorks for workflow management, and atomate for running calculations and collecting output into databases. The matminer package allows analyzing data from atomate with machine learning methods. These open-source tools have been used to run millions of calculations and power databases like the Materials Project.
This document summarizes work on developing clear sky detection methods and photovoltaic data analytics tools. It describes collaborating with NREL and kWh Analytics to build a robust clear sky detection method for the RdTools software. The goal is to automatically learn the best parameters for the PVLib clear sky model by comparing its labels to known clear sky labels from satellite data. It also discusses developing open-source software to analyze string-level I-V curves collected by Sandia National Labs to detect mismatching and extract IV parameters. The work aims to help researchers by providing data management, analytics and predictive modeling through a DuraMat Data Hub.
The document summarizes a system for integrating crop data and meteorological data using a standardized data exchange framework. The system uses a metadata database and broker service called MetBroker to provide consistent access to heterogeneous weather databases. Crop data from different sources can be uploaded and integrated into a central database. The system then allows users to query the integrated crop and weather data and analyze relationships to support applications like crop modeling.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...ijdpsjournal
This document summarizes a research paper that presents a task-decomposition based anomaly detection system for analyzing massive and highly volatile session data from the Science Information Network (SINET), Japan's academic backbone network. The system uses a master-worker design with dynamic task scheduling to process over 1 billion sessions per day. It discriminates incoming and outgoing traffic using GPU parallelization and generates histograms of traffic volumes over time. Long short-term memory (LSTM) neural networks detect anomalies like spikes in incoming traffic volumes. The experiment analyzed SINET data from February 27 to March 8, 2021, detecting some anomalies while processing 500-650 gigabytes of daily session data.
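As a hedged sketch of the LSTM-based spike detection described above (synthetic traffic volumes and arbitrary thresholds, not the SINET system), the Python example below trains a small Keras LSTM to predict the next interval's volume and flags intervals whose prediction error is unusually large.

# Hedged sketch: LSTM forecaster flags traffic-volume spikes (synthetic data).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
traffic = 100 + 10 * np.sin(np.arange(400) / 10) + rng.normal(0, 2, 400)
traffic[350] += 60   # injected anomaly

window = 20
X = np.array([traffic[i:i + window] for i in range(len(traffic) - window)])
y = traffic[window:]
X = X[..., None]     # shape (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:300], y[:300], epochs=5, verbose=0)

pred = model.predict(X[300:], verbose=0).ravel()
resid = np.abs(pred - y[300:])
threshold = resid.mean() + 3 * resid.std()
print("anomalous intervals:", np.where(resid > threshold)[0] + 300 + window)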
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGcsandit
This study discusses the utilization of three types of relationships in performing data replication. As grid computing offers the ability to share a huge amount of resources, resource availability is an important issue to be addressed. The undertaken approach combines the viewpoints of the user, the system and the grid itself in ensuring resource availability. The realization of the proposed strategy is demonstrated via OptorSim, and evaluation is made based on execution time, storage usage, network bandwidth and computing element usage. Results suggest that the proposed strategy produces a better outcome than an existing method even when various job workloads are introduced.
Data repository for sensor network a data mining approachijdms
The development of sensor data repositories will aid researchers in creating benchmark datasets. These benchmark datasets will provide a platform for all researchers to access the data and to test and compare the accuracy of their algorithms. However, the storage and management of sensor data is itself a challenging task for various reasons, such as noisy, redundant, missing and faulty data. It is therefore very important to create a data repository that contains precise and accurate data and in which the storage and management of data are effective. Hence, in this paper we propose to use a combination of quantitative association rules and decision trees for the classification of faulty and normal data, multiple linear regression models for the estimation of missing data, a symbolic table approach for the storage and management of sensor data, and a graphical user interface for the visualization of sensor data.
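A minimal Python sketch of the "multiple linear regression for missing data" step mentioned above is shown below; the sensor columns and readings are invented, and scikit-learn's LinearRegression stands in for whatever regression implementation the paper uses.

# Hedged sketch: estimate a sensor's missing readings from co-located sensors
# with multiple linear regression. Column names and values are invented.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "temp_a": [20.1, 20.4, 21.0, 21.5, 22.0, 22.3],
    "temp_b": [19.8, 20.2, 20.9, 21.2, 21.8, 22.1],
    "temp_c": [20.0, 20.3, np.nan, 21.4, np.nan, 22.2],   # sensor with gaps
})

known = df["temp_c"].notna()
model = LinearRegression().fit(df.loc[known, ["temp_a", "temp_b"]], df.loc[known, "temp_c"])
df.loc[~known, "temp_c"] = model.predict(df.loc[~known, ["temp_a", "temp_b"]])
print(df)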
STORAGE GROWING FORECAST WITH BACULA BACKUP SOFTWARE CATALOG DATA MININGcsandit
Backup software information is a potential source for data mining: not only the unstructured stored data from all of the backed-up servers, but also the backup job metadata, which is stored in a catalog database. Mining this database, in particular, could be used to improve backup quality, automation and reliability, to predict bottlenecks, to identify risks and failure trends, and to provide specific report information that cannot be fetched from closed-format, proprietary backup software databases. Ignoring such a data mining project might be costly, with lots of unnecessary human intervention, uncoordinated work and pitfalls, such as backup service disruption caused by insufficient planning. The specific goal of this practical paper is to use Knowledge Discovery in Databases, time series, stochastic models and R scripts in order to predict backup storage data growth. This project could not be done with traditional closed-format proprietary solutions, since it is generally impossible to read their database data from third-party software because of deliberate vendor lock-in. Nevertheless, it is very feasible with Bacula: currently the third most popular backup software worldwide, and open source. This paper focuses on the backup storage demand prediction problem, using the most popular prediction algorithms. Among them, the Holt-Winters model had the highest success rate for the tested data sets.
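As a hedged illustration of the Holt-Winters forecasting mentioned above (the paper uses R scripts on Bacula catalog data; here the daily backup volumes are invented and statsmodels is used instead), the Python sketch below fits an additive-trend, weekly-seasonal model and forecasts two weeks of storage demand.

# Hedged sketch: Holt-Winters forecast of backup storage growth (invented data).
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

days = pd.date_range("2024-01-01", periods=8 * 7, freq="D")
rng = np.random.default_rng(1)
# Trend + weekly seasonality (weekend full backups are larger) + noise, in GB.
volume = 50 + 0.4 * np.arange(len(days)) + 15 * (days.dayofweek >= 5) + rng.normal(0, 2, len(days))
series = pd.Series(volume, index=days)

fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=7).fit()
print(fit.forecast(14).round(1))   # two weeks of predicted storage demand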
Efficient Record De-Duplication Identifying Using Febrl FrameworkIOSR Journals
This document describes using the Febrl (Freely Extensible Biomedical Record Linkage) framework to perform efficient record de-duplication. It discusses how Febrl allows for data cleaning, standardization, indexing, field comparison, and weight vector classification. Indexing techniques like blocking indexes, q-grams, and canopy clustering are used to reduce the number of record pair comparisons. Field comparison functions calculate matching weights, and classifiers like Fellegi-Sunter and support vector machines are used to determine matches. The method is evaluated on real-world health data, showing accuracy, precision, recall, and false positive rates for different partitioning methods.
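The Python sketch below is a hedged illustration of the blocking-index idea that reduces the number of record-pair comparisons; it is not Febrl's API, and the records and blocking key are invented.

# Hedged sketch of a blocking index for de-duplication (illustrative, not Febrl's API).
from itertools import combinations

records = [
    {"id": 1, "surname": "smith", "postcode": "2000"},
    {"id": 2, "surname": "smyth", "postcode": "2000"},
    {"id": 3, "surname": "jones", "postcode": "3150"},
    {"id": 4, "surname": "smith", "postcode": "2000"},
]

# Blocking key: surname + postcode; only records sharing a key are compared,
# instead of all n*(n-1)/2 pairs.
blocks = {}
for rec in records:
    key = rec["surname"] + rec["postcode"]
    blocks.setdefault(key, []).append(rec)

candidate_pairs = [(a["id"], b["id"])
                   for block in blocks.values()
                   for a, b in combinations(block, 2)]
print(candidate_pairs)   # [(1, 4)]; a looser key such as surname[:2] would also pair record 2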
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...eSAT Journals
This document proposes mechanisms to improve the efficiency of the Hadoop distributed file system and MapReduce framework. It suggests using locality-sensitive hashing to colocate related files on the same data nodes, which would improve data locality. It also proposes implementing a cache to store the results of MapReduce tasks, so that duplicate computations can be avoided when the same task is run again on the same data. Implementing these mechanisms could help speed up execution times in Hadoop by reducing unnecessary data transmission and repetitive task executions.
Data collection in multi application sharing wireless sensor networksPvrtechnologies Nellore
- This document discusses algorithms for minimizing data collection in wireless sensor networks that are shared by multiple applications. It introduces the interval data sharing problem, where each application requires continuous interval data sampling rather than single data points.
- The problem is formulated as a non-linear, non-convex optimization problem. A 2-factor approximation algorithm is proposed with time complexity O(n^2) and memory complexity O(n) to address the high complexity of solving the optimization problem on resource-constrained sensor nodes.
- A special case where sampling intervals are the same length is analyzed, and a dynamic programming algorithm is provided that runs in optimal O(n^2) time and O(n) memory. Three online algorithms
The Impact of Data Replication on Job Scheduling Performance in Hierarchical ...graphhoc
In data-intensive applications, data transfer is a primary cause of job execution delay. Data access time depends on bandwidth, and the major bottleneck to supporting fast data access in Grids is the high latency of Wide Area Networks and the Internet. Effective scheduling can reduce the amount of data transferred across the Internet by dispatching a job to where the needed data are present. Another solution is to use a data replication mechanism; the objective of dynamic replica strategies is to reduce file access time, which in turn reduces job runtime. In this paper we develop a job scheduling policy and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve data access efficiency. We study our approach and evaluate it through simulation. The results show that our algorithm improves by 12% over the current strategies.
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes an algorithm called Replica Placement in Graph Topology Grid (RPGTG) to optimally place data replicas in a graph-based data grid while ensuring quality of service (QoS). The algorithm aims to minimize data access time, balance load among replica servers, and avoid unnecessary replications, while restricting QoS in terms of number of hops and deadline to complete requests. The article describes how the algorithm converts the graph structure of the data grid to a hierarchical structure to better manage replica servers and proposes services to facilitate dynamic replication, including a replica catalog to track replica locations and a replica manager to perform replication
Sharing of cluster resources among multiple Workflow Applicationsijcsit
Many computational solutions can be expressed as workflows. A Cluster of processors is a shared
resource among several users and hence the need for a scheduler which deals with multi-user jobs
presented as workflows. The scheduler must find the number of processors to be allotted for each workflow
and schedule tasks on allotted processors. In this work, a new method to find optimal and maximum
number of processors that can be allotted for a workflow is proposed. Regression analysis is used to find
the best possible way to share available processors, among suitable number of submitted workflows. An
instance of a scheduler is created for each workflow, which schedules tasks on the allotted processors.
Towards this end, a new framework to receive online submission of workflows, to allot processors to each
workflow and schedule tasks, is proposed and experimented using a discrete-event based simulator. This
space-sharing of processors among multiple workflows shows better performance than the other methods
found in the literature. Because of space-sharing, an instance of a scheduler must be used for each workflow within the allotted processors. Since the number of processors for each workflow is known only at runtime, a static schedule cannot be used. Hence a hybrid scheduler, which tries to combine the advantages of static and dynamic schedulers, is proposed. Thus the proposed framework is a promising solution to scheduling multiple workflows on a cluster.
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGcscpconf
This study discusses the utilization of three types of relationships in performing data replication. As grid computing offers the ability to share a huge amount of resources, resource availability is an important issue to be addressed. The undertaken approach combines the viewpoints of the user, the system and the grid itself in ensuring resource availability. The realization of the proposed strategy is demonstrated via OptorSim, and evaluation is made based on execution time, storage usage, network bandwidth and computing element usage. Results suggest that the proposed strategy produces a better outcome than an existing method even when various job workloads are introduced.
This document summarizes a paper that presents a novel method for passive resource discovery in cluster grid environments. The method monitors network packet frequency from nodes' network interface cards to identify nodes with available CPU cycles (<70% utilization) by detecting latency signatures from frequent context switching. Experiments on a 50-node testbed showed the method can consistently and accurately discover available resources by analyzing existing network traffic, including traffic passed through a switch. The paper also proposes algorithms for distributed two-level resource discovery, replication and utilization to optimize resource allocation and access costs in distributed computing environments.
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET) that proposes a new dynamic data replication and job scheduling strategy for data grids. The strategy aims to improve data access time and reduce bandwidth consumption by replicating data based on file popularity, storage limitations at nodes, and data category. It replicates more popular files that are in the same category as frequently accessed data to nodes close to where jobs are run. This is intended to optimize performance by locating data and jobs close together. The document provides context on related work and outlines the proposed system architecture and replication/scheduling approach.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. This stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most of the structured data in the scientific domain are voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of the cluster frameworks in distributed environments that helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
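As a hedged illustration of the map/reduce structure of one Apriori candidate-counting pass (plain Python rather than Hadoop, with toy transactions), the sketch below emits (itemset, 1) pairs in a map phase and sums and filters them in a reduce phase.

# Hedged sketch of one Apriori map/reduce pass in plain Python (no Hadoop).
from collections import Counter
from itertools import combinations

transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"], ["a", "b", "c"]]
min_support = 2
k = 2   # size of candidate itemsets counted in this pass

def map_phase(transaction):
    """Emit (candidate itemset, 1) for every k-subset of the transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reduce_phase(pairs):
    """Sum counts per itemset and keep those meeting minimum support."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {i: c for i, c in counts.items() if c >= min_support}

mapped = (pair for t in transactions for pair in map_phase(t))
print(reduce_phase(mapped))   # e.g. {('a', 'c'): 3, ('b', 'c'): 3, ('a', 'b'): 2}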
In the era of big data, even with large infrastructure, stored data varies in size, format, variety and volume across several platforms such as Hadoop and the cloud, so an application faces the problem of how to process data that varies in size and format. A workflow whose data and available resources vary during run time is called a dynamic workflow. Using large infrastructure and a huge amount of resources for the analysis of data is time consuming and wastes resources; it is better to use a scheduling algorithm to analyse a given data set so that it is executed efficiently without wasting time, and to evaluate which scheduling algorithm is best suited to that data set. We evaluate different data sets to understand which algorithm is most suitable for efficient execution and analysis of the data, and store the data after analysis.
EFFICIENT MIXED MODE SUMMARY FOR MOBILE NETWORKSijwmn
This document proposes a new lossless compression scheme called Mixed Mode Summary-based Lossless Compression for Mobile Networks log files (MMSLC). MMSLC uses the Apriori algorithm to mine frequent patterns from log files. It then assigns unique codes to frequent patterns based on their compression gain. MMSLC exploits correlations between consecutive log files by using a mixed online and offline compression approach. It applies frequent patterns mined from previous files during online compression of current files, while also mining patterns from current files for future compression. The method achieves high compression ratios and provides summaries of frequent patterns to aid in network monitoring.
TAXONOMY OF OPTIMIZATION APPROACHES OF RESOURCE BROKERS IN DATA GRIDSijcsit
A novel taxonomy of replica selection techniques is proposed. We studied several data grid approaches in which the data management selection strategies differ. The aim of the study is to determine the common concepts, observe their performance, and compare their performance with our strategy.
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCEijujournal
In today’s Internet world, log file analysis is becoming a necessary task for analyzing customer
behavior in order to improve advertising and sales; for datasets from domains such as the environment,
medicine and banking it is likewise important to analyze log data to extract the required knowledge. Web mining is the
process of discovering the knowledge from the web data. Log files are getting generated very fast at the
rate of 1-10 Mb/s per machine, a single data center can generate tens of terabytes of log data in a day.
These datasets are huge. In order to analyze such large datasets we need parallel processing system and
reliable data storage mechanism. Virtual database system is an effective solution for integrating the data
but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage by
Hadoop Distributed File System and MapReduce programming model which is a parallel processing
system for large datasets. Hadoop distributed file system breaks up input data and sends fractions of the
original data to several machines in hadoop cluster to hold blocks of data. This mechanism helps to
process log data in parallel using all the machines in the hadoop cluster and computes result efficiently.
The dominant approach provided by Hadoop, “store first, query later”, loads the data into the Hadoop
Distributed File System and then executes queries written in Pig Latin. This approach reduces the response
time as well as the load on the end system. This paper proposes a log analysis system using Hadoop
MapReduce which will provide accurate results in minimum response time.
HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce ijujournal
This document proposes a log analysis system called HMR Log Analyzer that uses Hadoop MapReduce to analyze large volumes of web application log files in parallel. It discusses how Hadoop Distributed File System stores and distributes log files across nodes for fault tolerance. The system first pre-processes logs to clean and organize the data before applying the MapReduce algorithm. MapReduce jobs break the analysis into map and reduce phases to efficiently process logs in parallel and generate summarized results like page view counts. The system provides an interface for users to query and visualize results.
Enhancement techniques for data warehouse staging areaIJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
Similar to METHOD FOR CONDUCTING A COMBINED ANALYSIS OF GRID ENVIRONMENT’S FTA AND GWA THROUGH SESSION BASED MAPPING OF TRACE VARIABLES (20)
11th International Conference on Computer Science, Engineering and Informati...ijgca
11th International Conference on Computer Science, Engineering and Information
Technology (CSEIT 2024) will provide an excellent international forum for sharing knowledge
and results in theory, methodology and applications of Computer Science, Engineering and
Information Technology. The Conference looks for significant contributions to all major fields of
the Computer Science and Information Technology in theoretical and practical aspects. The aim
of the conference is to provide a platform to the researchers and practitioners from both academia
as well as industry to meet and share cutting-edge development in the field.
SERVICE LEVEL AGREEMENT BASED FAULT TOLERANT WORKLOAD SCHEDULING IN CLOUD COM...ijgca
Cloud computing is a concept of providing user and application oriented services in a virtual environment.
Users can use the various cloud services as per their requirements dynamically. Different users have
different requirements in terms of application reliability, performance and fault tolerance. Static and rigid
fault tolerance policies provide a consistent degree of fault tolerance as well as overhead. In this research
work we have proposed a method to implement dynamic fault tolerance considering customer
requirements. The cloud users have been classified in to sub classes as per the fault tolerance requirements.
Their jobs have also been classified into compute intensive and data intensive categories. The varying
degree of fault tolerance has been applied consisting of replication and input buffer. From the simulation
based experiments we have found that the proposed dynamic method performs better than the existing
methods.
11th International Conference on Computer Science, Engineering and Informatio...ijgca
11th International Conference on Computer Science, Engineering and Information Technology (CSEIT 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Computer Science, Engineering and Information Technology. The Conference looks for significant contributions to all major fields of the Computer Science and Information Technology in theoretical and practical aspects. The aim of the conference is to provide a platform to the researchers and practitioners from both academia as well as industry to meet and share cutting-edge development in the field.
Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the following areas, but are not limited to.
Load balancing functionalities are crucial for best Grid performance and utilization. Accordingly, this
paper presents a new meta-scheduling method called TunSys, inspired by the natural phenomenon of
heat propagation and thermal equilibrium. TunSys is based on a Grid polyhedron model with a spherical-like
structure used to ensure load balancing through a local neighborhood propagation strategy.
Experimental results compared to FCFS, DGA and HGA are encouraging in terms of system performance,
scalability and load balancing efficiency.
11th International Conference on Computer Science and Information Technology ...ijgca
11th International Conference on Computer Science and Information Technology (CSIT 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Computer Science and Information Technology. The Conference looks for significant contributions to all major fields of the Computer Science and Information Technology in theoretical and practical aspects. The aim of the conference is to provide a platform to the researchers and practitioners from both academia as well as industry to meet and share cutting-edge development in the field.
AN INTELLIGENT SYSTEM FOR THE ENHANCEMENT OF VISUALLY IMPAIRED NAVIGATION AND...ijgca
Technological advancement has brought the masses unprecedented convenience, but, unnoticed by many, one
population neglected through the age of technology has been the visually impaired. The visually
impaired population has grown through the ages with as much desire as everyone else for adventure, but lacks
the confidence and support to do so. Time has transported society into a new phase condensed in big data,
but for the visually impaired, this quick-paced lifestyle, along with the unpredictable nature
of natural disasters and the COVID-19 pandemic, has dropped them deeper into a feeling of disconnection from
society. Our application uses the global positioning system to support the visually impaired in
independent navigation, alerts them in the face of natural disasters, and reminds them to sanitize their devices
during the COVID-19 pandemic.
13th International Conference on Data Mining & Knowledge Management Process (...ijgca
13th International Conference on Data Mining & Knowledge Management Process (CDKP 2024) provides a forum for researchers who address this issue and to present their work in a peer-reviewed forum.
Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the following areas, but are not limited to these topics only.
Call for Papers - 15th International Conference on Wireless & Mobile Networks...ijgca
15th International Conference on Wireless & Mobile Networks (WiMoNe 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Wireless & Mobile computing Environment. Current information age is witnessing a dramatic use of digital and electronic devices in the workplace and beyond. Wireless, Mobile Networks & its applications had received a significant and sustained research interest in terms of designing and deploying large scale and high performance computational applications in real life. The aim of the conference is to provide a platform to the researchers and practitioners from both academia as well as industry to meet and share cutting-edge development in the field.
Call for Papers - 4th International Conference on Big Data (CBDA 2023)ijgca
4th International Conference on Big Data (CBDA 2023) will act as a major forum for the presentation of innovative ideas, approaches, developments, and research projects in the areas of Big Data. It will also serve to facilitate the exchange of information between researchers and industry professionals to discuss the latest issues and advancement in the area of Big Data.
Call for Papers - 15th International Conference on Computer Networks & Commun...ijgca
15th International Conference on Computer Networks & Communications (CoNeCo 2023) looks for significant contributions to the Computer Networks & Communications for Wired and Wireless Networks in theoretical and practical aspects. Original papers are invited on Computer Networks, Network Protocols and Wireless Networks, Data Communication Technologies, and Network Security. The goal of this Conference is to bring together researchers and practitioners from academia and industry to focus on advanced networking concepts and establishing new collaborations in these areas.
Call for Papers - 15th International Conference on Computer Networks & Commun...ijgca
15th International Conference on Computer Networks & Communications (CoNeCo 2023) looks for significant contributions to the Computer Networks & Communications for Wired and Wireless Networks in theoretical and practical aspects. Original papers are invited on Computer Networks, Network Protocols and Wireless Networks, Data Communication Technologies, and Network Security. The goal of this Conference is to bring together researchers and practitioners from academia and industry to focus on advanced networking concepts and establishing new collaborations in these areas.
Call for Papers - 9th International Conference on Cryptography and Informatio...ijgca
9th International Conference on Cryptography and Information Security (CRIS 2023) provides a forum for researchers who address this issue and to present their work in a peer-reviewed forum. It aims to bring together scientists, researchers and students to exchange novel ideas and results in all aspects of cryptography, coding and Information security.
Call for Papers - 4th International Conference on Machine learning and Cloud ...ijgca
4th International Conference on Machine learning and Cloud Computing (MLCL 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Cloud computing. The aim of the conference is to provide a platform to the researchers and practitioners from both academia as well as industry to meet and share cutting-edge development in the field.
Call for Papers - 11th International Conference on Data Mining & Knowledge Ma...ijgca
11th International Conference on Data Mining & Knowledge Management Process (DKMP 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Data Mining and knowledge management process. The goal of this conference is to bring together researchers and practitioners from academia and industry to focus on understanding Modern data mining concepts and establishing new collaborations in these areas.
Call for Papers - 4th International Conference on Blockchain and Internet of ...ijgca
4th International Conference on Blockchain and Internet of Things (BIoT 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Blockchain and Internet of Things. The Conference looks for significant contributions to all major fields of the Blockchain and Internet of Things in theoretical and practical aspects.
Call for Papers - International Conference IOT, Blockchain and Cryptography (...ijgca
The 4th International Conference on Cloud, Big Data and Web Services (CBW 2023) will take place from March 25-26, 2023 in Sydney, Australia. The conference aims to facilitate the exchange of innovative ideas and research related to cloud computing, big data, and web services. Authors are invited to submit papers by February 18, 2023 on topics including cloud platforms, big data analytics, and web service models and architectures. Selected papers will be published in related journals.
Call for Paper - 4th International Conference on Cloud, Big Data and Web Serv...ijgca
4th International Conference on Cloud, Big Data and Web Services (CBW 2023) will act as a major forum for the presentation of innovative ideas, approaches, developments, and research projects in the areas of Cloud, Big Data and Web services. It will also serve to facilitate the exchange of information between researchers and industry professionals to discuss the latest issues and advancement in the area of Cloud, Big Data and web services.
Call for Papers - International Journal of Database Management Systems (IJDMS)ijgca
The International Journal of Database Management Systems (IJDMS) is a bimonthly open access peer-reviewed journal that publishes articles which contribute new results in all areas of database management systems and their applications. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on understanding modern developments in this field and establishing new collaborations in these areas.
Impartiality as per ISO /IEC 17025:2017 StandardMuhammadJazib15
This document provides basic guidelines for the impartiality requirement of ISO/IEC 17025:2017 and explains in detail how it is met.
Blood finder application project report (1).pdfKamal Acharya
Blood Finder is an emergency-time app where a user can search for blood banks as
well as registered blood donors around Mumbai. The application also provides an
opportunity for its users to become registered donors; for this, the user has to enroll
through a donor request from the application itself. If the admin wishes to make a user
a registered donor, it can be done after completing some formalities with the organization.
A special feature of this application is that the user does not have to register or sign in to
search for blood banks and blood donors; this can be done simply by installing the
application on a mobile device.
The purpose of this application is to save the user’s time when searching for blood of the
needed blood group during an emergency.
It is an Android application developed in Java and XML with connectivity to an
SQLite database, and it provides most of the basic functionality required for an
emergency-time application. All the details of blood banks and blood donors are stored
in the SQLite database.
The application allows the user to get all the information regarding blood banks and
blood donors, such as name, number, address and blood group, rather than searching
different websites and wasting precious time. The application is effective and
user friendly.
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...DharmaBanothu
The Network on Chip (NoC) has emerged as an effective
solution for intercommunication infrastructure within System on
Chip (SoC) designs, overcoming the limitations of traditional
methods that face significant bottlenecks. However, the complexity
of NoC design presents numerous challenges related to
performance metrics such as scalability, latency, power
consumption, and signal integrity. This project addresses the
issues within the router's memory unit and proposes an enhanced
memory structure. To achieve efficient data transfer, FIFO buffers
are implemented in distributed RAM and virtual channels for
FPGA-based NoC. The project introduces advanced FIFO-based
memory units within the NoC router, assessing their performance
in a Bi-directional NoC (Bi-NoC) configuration. The primary
objective is to reduce the router's workload while enhancing the
FIFO internal structure. To further improve data transfer speed,
a Bi-NoC with a self-configurable intercommunication channel is
suggested. Simulation and synthesis results demonstrate
guaranteed throughput, predictable latency, and equitable
network access, showing significant improvement over previous
designs
Supermarket Management System Project Report.pdfKamal Acharya
Supermarket Management is a stand-alone J2EE program developed using Eclipse Juno.
This project contains all the necessary information required for maintaining
a supermarket billing system.
The core idea of this project is to minimize paper work and centralize the
data. All communication is handled in a secure manner; that is, in this
application the information is stored on the client itself. For further security the
database is stored in the Oracle back-end so that no intruders can access it.
Accident detection system project report.pdfKamal Acharya
The Rapid growth of technology and infrastructure has made our lives easier. The
advent of technology has also increased the traffic hazards and the road accidents take place
frequently which causes huge loss of life and property because of the poor emergency facilities.
Many lives could have been saved if emergency service could get accident information and
reach in time. Our project will provide an optimum solution to this draw back. A piezo electric
sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With
signals from a piezo electric sensor, a severe accident can be recognized. In this
project, when a vehicle meets with an accident or rolls over, the piezo electric sensor
immediately detects the signal. Then, with the help of a GSM module and a GPS module, the location
is sent to the emergency contact. After confirming the location, the necessary action will
be taken. If the person meets with a small accident or if there is no serious threat to anyone’s
life, then the alert message can be terminated by the driver by a switch provided in order to
avoid wasting the valuable time of the medical rescue team.
Generative AI Use cases applications solutions and implementation.pdfmahaffeycheryld
Generative AI solutions encompass a range of capabilities from content creation to complex problem-solving across industries. Implementing generative AI involves identifying specific business needs, developing tailored AI models using techniques like GANs and VAEs, and integrating these models into existing workflows. Data quality and continuous model refinement are crucial for effective implementation. Businesses must also consider ethical implications and ensure transparency in AI decision-making. Generative AI's implementation aims to enhance efficiency, creativity, and innovation by leveraging autonomous generation and sophisticated learning algorithms to meet diverse business challenges.
https://www.leewayhertz.com/generative-ai-use-cases-and-applications/
This study Examines the Effectiveness of Talent Procurement through the Imple...DharmaBanothu
In a world of high technology and a fast-forward mindset, recruiters are showing interest
in E-Recruitment. At present, the HR departments of most companies are choosing
E-Recruitment as the preferred approach to recruitment. E-Recruitment is carried out
through many online platforms such as LinkedIn, Naukri, Instagram and Facebook.
With advancing technology, E-Recruitment has now gone to the next level through the use of
Artificial Intelligence as well.
Key Words: Talent Management, Talent Acquisition, E-Recruitment, Artificial Intelligence
Introduction
Effectiveness of Talent Acquisition through E-Recruitment: in this topic we will discuss four important
and interlinked topics, which are
Build the Next Generation of Apps with the Einstein 1 Platform.
Join Philippe Ozil for a workshop session that will guide you through the details of the Einstein 1 platform, the importance of data for building artificial intelligence applications, and the different tools and technologies that Salesforce offers to bring you all the benefits of AI.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
METHOD FOR CONDUCTING A COMBINED ANALYSIS OF GRID ENVIRONMENT’S FTA AND GWA THROUGH SESSION BASED MAPPING OF TRACE VARIABLES
International Journal of Grid Computing & Applications (IJGCA) Vol.5, No.1, March 2014
DOI: 10.5121/ijgca.2014.5101
METHOD FOR CONDUCTING A COMBINED ANALYSIS OF GRID ENVIRONMENT’S FTA AND GWA THROUGH SESSION BASED MAPPING OF TRACE VARIABLES
Ramandeep Singh1 and R.K.Bawa2
1 Lovely Professional University, Jalandhar, India
2 Department of Computer Science, Punjabi University Patiala, India
ABSTRACT
The Grid computing environment, due to its scale and heterogeneous nature, is more vulnerable to faults. To
store and analyze fault and workload information, the FTA (Fault Trace Archive) and the GWA (Grid Workload
Archive) are used. Previously, researchers have analyzed FTA and GWA as separate research problems, but
in this research paper we propose a method for conducting a combined analysis of FTA and GWA
based on session based mapping of trace file variables. This is the first attempt to conduct a combined
analysis of these two trace files. Along with the step by step process of combining trace files we have also
included do’s and don’ts to observe while conducting this analysis. Through this combined analysis we have
established a correlation-based relationship among the number of node failures, the number of failed jobs, the failure
duration and the number of nodes. We have found that these variables are positively correlated, with different
correlation coefficients.
Keywords: Grid Computing, FTA (Fault Trace Archive), GWA (Grid Workload Archive), Correlation
1. Introduction
Grid computing is a concept which involves combining computing and data storage resources
from different sources into virtual organizations (VO). These resources are of heterogeneous
nature and are connected through dedicated network links or via the internet. Due to its large size and
lack of central control, Grid resources are vulnerable to different types of faults. These faults
affect the quality of service of the Grid, either by failing the submitted jobs or by delaying their
execution. Researchers and developers therefore need to analyze workload and fault information to
study the Grid environment and to improve overall Grid performance along with reliability.
Trace files are used to collect data about the events taking place inside the Grid. These events
include faults which are occurring on the Grid resources and events related to the jobs which are
submitted for execution on Grid. Basically two types of trace files are used in Grid for collecting
resource availability and job data which are FTA and GWA.
FTA stands for Fault Trace Archive [1] and GWA stands for Grid Workload Archive [2]. FTA
collects information about the faults which are taking place on the Grid nodes and GWA collects
information about the workload which is being submitted on the Grid. FTA contains information
about nodes, platforms, sites, hardware configuration, availability and unavailability events, the time
of occurrence of these events, and the corresponding step which was taken to deal with the failure
situation. In the same way, GWA contains information about job id, submission time, waiting time,
processor requirement, estimated execution time, status and other information fields related to
queues, users and groups. The status of a job can have different values, e.g. completed, failed,
cancelled or any other trace file specific value. Trace files are available in different formats, e.g.
raw, tabbed or mysql. FTA is a collection of different files which can be joined with each other
using common fields, just as we perform join operations among SQL tables. Due to
space limitations we cannot include the whole format of the trace files, but details about the format and
design of FTA can be found on the FTA web page [3] and the detailed format information of GWA
can be found on the GWA web page [4][9].
These trace files can be analyzed using different tools e.g. Matlab, GridSim or mysql. GridSim is
a good tool in case you want to run simulations on these trace files according to the constraints of
a Grid environment. These trace files, i.e. FTA and GWA, have been individually analyzed by
many researchers for studying the Grid environment for different purposes, but no one has
considered combining these trace files to study the relationship between the two. In this research paper
we propose and implement a technique for combining FTA and GWA.
Both these trace files are collected from the same Grid platform simultaneously i.e. in parallel.
Using this technique we can study the influence of the events of one trace file on the events of
another trace file.
2. Related Work
Trace files are a good way of collecting data about the events taking place inside a system either
distributed or centralized. Analysis of these trace files can reveal many interesting facts related to
the system. Artur Andrzejak et al. [5] have worked on SETI@home host trace files, which
consisted of 48,000 hosts, to analyze host availability patterns. They proposed a model which can
ensure that a certain number of hosts will be available for a certain amount of time, either with
replication or with over-provisioning of resources. Similarly, Bahman Javadi et al. [6] have
collected and analyzed SETI@home trace files for discovering subsets of hosts which share
similar kinds of statistical availability patterns. Out of 230,000 hosts they have discovered that the
availability of around 34% of hosts is a truly random process, but the rest of these hosts can often be
modeled in the form of different groups with few distinctions from one another. User-oriented
analysis reveals facts about the workload patterns submitted by different users [11][12].
Bianca Schroeder [7] has used trace files to analyze and predict the average life span of hard drives being
used at a high performance cluster. Data sheets of these hard drives show an MTTF (Mean Time
to Failure) of around 1,000,000 to 1,500,000 hours, suggesting an annual failure rate of around
0.88%. But analysis of the actual field data shows that the minimum disk replacement rate is 1%,
usually it is up to 3-4%, and in some cases the replacement rate is up to 13%. This shows that
actual data gathered in the field may differ from the conceptual or ideal data. Nezih
Yigitbasi et al. have analyzed the availability and unavailability of Grid nodes and have identified
that there exists a predictable pattern in these events and that this behavior can be modeled. They have
also established a correlation-based model for node failures [8].
GWA trace files are used by researchers to study what kinds of jobs are submitted on the Grid and
how the success or failure of a job depends on different job parameters. Grid
environments are either application specific or job specific. An application-specific Grid can execute
or process only tasks related to a specific application, while a job-specific Grid can
execute different types of jobs which may belong to different applications. Although individual
analysis of trace files is very useful, in this research paper we propose a method
of combining these two trace files for a combined analysis.
3. Combining FTA and GWA
Following are the operations which need to be performed for conducting combined analysis.
3.1 Acquiring FTA and GWA Trace Files
For conducting a combined analysis of FTA and GWA, the very first condition is that both trace files
should belong to the same Grid environment and should be collected at the same time, i.e. events
should be logged in parallel. The reason for this is that if the trace files belong to different
platforms or are collected at different time intervals, there is no benefit in conducting
the combined analysis, because the events cannot be related to each other and the analysis
will make no sense. In the absence of a mysql format of the trace files, we can use SPSS for importing
data and converting it into the required formats. From the excel format we can then insert the data into a SQL or
mysql table.
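As a minimal sketch of this import step (assuming a tab-separated trace file and hypothetical file, table and column names; the real column layout must be taken from the FTA and GWA format pages [3][4]), the data can be loaded into a SQLite table as follows:

# Sketch: load a tab-separated trace file into a SQLite table for later querying.
# File name, table name and columns are hypothetical placeholders.
import csv
import sqlite3

conn = sqlite3.connect("traces.db")
conn.execute("""CREATE TABLE IF NOT EXISTS gwa_jobs (
                    job_id INTEGER, submit_time INTEGER, wait_time INTEGER,
                    run_time INTEGER, num_procs INTEGER, status INTEGER)""")

with open("gwa_trace.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    # skip comment lines and rows that do not carry the expected six fields
    rows = [row[:6] for row in reader if len(row) >= 6 and not row[0].startswith("#")]
    conn.executemany("INSERT INTO gwa_jobs VALUES (?, ?, ?, ?, ?, ?)", rows)

conn.commit()
conn.close()

The same pattern applies to the FTA files; once both archives sit in SQL tables they can be queried, joined and exported like any other tables.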
3.2 Trimming Trace Files
After acquiring both trace files we need to check the event start time and event end time of each.
The reason for this is that the collection processes of these trace files may start and end at different
times, so we need to trim one or both trace files at a common point of event start time and
event end time. Let’s consider an example where FTA trace file was started at 10:00:00 am, 10
Jan 2013 and it ends on 12:00:00 pm, 15 Jan 2014. Similarly GWA was started at 09:00:00 am,
31 Dec 2012 and it ended on 11:00:00 am, 1 March 2014. Although both these trace files are
collected from the same environment and individually there is nothing wrong with these trace
files, for a combined analysis we need to trim either one or both from the beginning or the end to
synchronize the event start and event end times. So in the case of the above example we will
trim GWA so that the first event start time is 10:00:00 am, 10 Jan 2013 and the last event end time
is 12:00:00 pm, 15 Jan 2014. In order to remove noise we will also have to remove events
which start before the first trimming point but end after it, because such events may cause deviations
in the calculations. The same applies at the other end of the window, i.e. to events which start before
the final trimming point but end after it. Although a few such events may not make much difference
when we are dealing with millions of events, to minimize errors this is one
precaution which should be taken from the beginning. So at the end of this step we will have
two trace files which start and end at the same time.
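The trimming rule can be sketched as follows, assuming each trace has already been loaded into a pandas DataFrame (for example via pandas.read_sql or read_csv); the frames fta and gwa and their start_time and end_time columns are hypothetical names holding epoch seconds:

# Sketch: keep only events that lie entirely inside the common observation window.
import pandas as pd

def common_window(fta: pd.DataFrame, gwa: pd.DataFrame):
    # the shared window starts at the later of the two first events
    # and ends at the earlier of the two last events
    start = max(fta["start_time"].min(), gwa["start_time"].min())
    end = min(fta["end_time"].max(), gwa["end_time"].max())
    return start, end

def trim(trace: pd.DataFrame, start, end) -> pd.DataFrame:
    # drop events that begin before the window or end after it,
    # including events that straddle either trimming point
    inside = (trace["start_time"] >= start) & (trace["end_time"] <= end)
    return trace[inside].copy()

window_start, window_end = common_window(fta, gwa)
fta_trimmed = trim(fta, window_start, window_end)
gwa_trimmed = trim(gwa, window_start, window_end)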
3.3 Slicing and Dicing
Now that we have two trace files of equal time duration we can start with our analysis. Slicing
means that we divide the trace files into sub-parts. One thing to keep in mind is that these
slices should be of equal size, otherwise the analysis will lead to inconsistent results. The slicing duration can
vary from an hour to a day, a week or even a month, depending on the type of analysis. Because
the event time is epoch time, we have to convert seconds into days, weeks or months and
then add this value repeatedly for the slicing and dicing operations.
Figure 1: Mapping and Slicing of FTA and GWA
Figure 1 represents the slicing operation in which FTA and GWA trace files of 12 months duration
have been sliced into 6-month and 3-month durations. Vertical lines represent slicing points. As
we can see, synchronization is maintained here, as both trace files have been sliced
at the same locations, i.e. at the same times.
3.4 Data Extraction
After slicing the trace files we can now retrieve data from them for our combined
analysis. Data selection and extraction are based on what kind of analysis we are performing and
which fields of the trace file are required for it. Since these two trace files
do not share any common field except the event time variable, we cannot combine FTA and
GWA directly. We have therefore used a different approach which is based on aggregation functions.
a) FTA Data Extraction
The Fault Trace Archive contains information about Grid nodes. NF represents the number of failures
and NF(i)(s) represents the number of failures for node i, where i ∈ J = {1, 2, 3, 4, …, n}, J represents the
collection of n nodes which are part of the Grid environment, and S represents the time slot, i.e. session. D
represents the total failure duration of all the nodes on the Grid, whereas D(i) represents the total failure
duration for node i. F(i)T represents the failure frequency of node i for duration T, and R(i) represents the
average resume time after failure for node i.
In the absence of a common variable, we have used the method of aggregation functions (basically
summation and average) to collect and retrieve data for analysis. NF(i), D(i) and F(i) can be
directly calculated if we consider the whole trace file as our data set. But we can
also calculate these values for a specific duration S, which is our sampling duration; the
values of NF(i), D(i) and F(i) are then calculated from this duration S. S can be generated randomly
so that data is retrieved from the whole trace file rather than from one particular section of it.
This helps in getting more accurate results.
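A sketch of this FTA aggregation, again using the hypothetical columns from the earlier sketches (node_id identifying the node on which the unavailability event occurred), is:

# Sketch: per-session FTA aggregates.
fta_trimmed["duration"] = fta_trimmed["end_time"] - fta_trimmed["start_time"]

fta_per_session = fta_trimmed.groupby("session").agg(
    NumNodeFailures=("node_id", "size"),       # NF(s): failure events in slot s
    FailureDuration=("duration", "sum"),       # D(s): total failure duration in slot s
    NumNodes=("node_id", "nunique"))           # distinct nodes that failed in slot s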
b) GWA Data Extraction
Similar to FTA, we now need to extract data from GWA and map it to the FTA variables. Once
again we use aggregation functions and the sampling duration for retrieving data. Jid represents one
job from the GWA with unique identifier id, where id ∈ W = {1, 2, 3, …, k} and W represents the set of
jobs which were submitted on the Grid for execution. Cid represents the number of processors
requested by job Jid for execution. WTid represents the waiting time of job Jid and Rid represents the
runtime requested by Jid. Uid represents the user id of the user who submitted job Jid. Sid represents the
status of job Jid; the status, as already discussed, can be completed, failed, cancelled or any other
trace file specific value. Ncompleted(s), Nfailed(s) and Ncancelled(s) represent the number of completed, failed
and cancelled jobs respectively in duration S.
4. Combining Data
After the data extraction, the data of the different variables should be mapped according to the time
slots. If time slot shuffling takes place in both data sets simultaneously it
will not lead to any inconsistencies, but if the time slots of one data set have been shifted and not those of the
other, it will lead to incorrect results. A graphical representation of this concept is shown
in figure 2. In this figure S1, S2 and S3 represent the time slots for which aggregation functions
have been used to retrieve data, and Fvar and Gvar represent variables of FTA and
GWA. A variable can be any variable which the analyst wants to include in the analysis and which
has some relationship, direct or indirect, with the other variables of the trace files.
For example, we may consider that there exists a relationship between the number of node
failures and the number of failed jobs. So on one side, i.e. FTA, we can have data about the number of
node failures and other relevant variables such as failure duration, resume time etc., and on the
other side, i.e. GWA, we can have data related to the number of job failures and the other relevant
variables required for this analysis, such as waiting time, number of processors
requested by a job, run time etc. It is not mandatory that the numbers of variables from FTA and
GWA used in the analysis be the same; there can be a case where
we have only one or two variables from FTA but more than
two from the GWA side. This depends entirely on the analysis model.
Figure 2: Combining FTA and GWA with Mapping Time Slots
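In code, the mapping sketched in figure 2 reduces to aggregating the chosen GWA variables per session and joining the two per-session tables on the session index, so that every row holds the Fvar and Gvar values of one and the same time slot; FAILED_STATUS below is a hypothetical status code, since the actual encoding is trace specific:

# Sketch: per-session GWA aggregates and session-based join with the FTA side.
FAILED_STATUS = 0   # hypothetical code for "failed"; check the trace documentation

gwa_per_session = gwa_trimmed.groupby("session").agg(
    NJobFailed=("status", lambda s: int((s == FAILED_STATUS).sum())),
    NJobSubmitted=("status", "size"))

# inner join keeps only sessions observed in both traces
combined = fta_per_session.join(gwa_per_session, how="inner")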
5. Conducting Analysis
Based on the above proposed technique we have conducted a correlation based analysis by
combining FTA and GWA. Through this analysis we have identified relationships among the
variables of FTA and GWA. The variables considered in our analysis are NF(s)
(Number of node Failures in duration S), D(s) (Failure Duration in time S), NN (Number of
Nodes), Nfailed(s) (Number of Failed Jobs in duration S).
Serial No. | Variable   | Variable Description                  | Variable Name for Plotting
1          | NF(s)      | Number of node failures in duration S | NumNodeFailures
2          | D(s)       | Failure duration in time S            | FailureDuration
3          | NN         | Number of nodes                       | NumNodes
4          | Nfailed(s) | Number of failed jobs in duration S   | NJobFailed
Correlation is a statistical measure that indicates the extent to which two or more variable values
fluctuate together. A positive correlation indicates the extent to which two or more
variables increase together, while a negative correlation indicates the extent to which one variable
increases as the other decreases. The strength of the association between two variables is
quantified by the correlation coefficient, whose value can vary from -1 to +1. We have used the
Spearman correlation for this analysis because of the absence of linear variation in the data from both
trace files: the number of jobs submitted or failed and the number of node failures
vary on a large scale from time to time. This technique uses a ranking-based approach for
calculating correlation coefficients. The following equation represents Spearman’s correlation.
$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}$
Here r_s represents Spearman’s correlation coefficient, d represents the difference in the ranks of
the two variables whose correlation we are calculating, and n represents the number of values
used for computing the correlation. From the randomly collected data set we
conducted a correlation-based analysis of the different variables. The results of this analysis are shown in
table 1 and the corresponding plots are shown in figure 3.
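Given the combined per-session table from the previous sketches, coefficients of this kind can be computed with a ready-made Spearman implementation, for example scipy's spearmanr (the column names follow the plotting names listed in the variable table above):

# Sketch: Spearman rank correlation between FTA and GWA session variables.
from scipy.stats import spearmanr

pairs = [("NumNodeFailures", "NJobFailed"),
         ("NumNodeFailures", "FailureDuration"),
         ("FailureDuration", "NJobFailed")]

for x, y in pairs:
    rho, p = spearmanr(combined[x], combined[y])
    print(f"{x} vs {y}: r_s = {rho:.3f} (p = {p:.3g})")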
We can make the following conclusions from these results.
I. There exists a positive correlation between the number of node failures and the number of failed
jobs. So we can say that as the number of node failures increases, the number of failed
jobs also increases, and this explains the very basic behavior of the Grid environment:
the number of node failures has a direct effect on the quality of service of the
environment. A scheduling policy can use this information to make better
scheduling decisions in a failure-critical situation.
II. If we look at the correlation coefficient between the number of node failures and the failure duration,
we can see that there exists a considerable correlation between the two, equal to 0.660.
So, based on this correlation, the failure duration can be predicted from an increase in the
number of node failures.
III. The correlation coefficient between the failure duration and the number of failed jobs is also positive,
equal to 0.498. Although it is not a strong correlation value, it supports our hypothesis
that the longer the nodes stay unavailable, the more jobs will fail or be cancelled.
Table 1 : Correlation Analysis Results
Figure 3: Correlation Analysis Plots
6. Conclusion and Future Work
In this research paper we have proposed the first technique for combining the Fault Trace Archive and the
Grid Workload Archive as a single research problem. We have discussed the step-by-step
approach of combining FTA and GWA. We have also identified and discussed what types of
mistakes can be made while conducting the analysis and what kind of impact these mistakes can
have on the results. Finally, with the help of the correlation-based combined analysis, we have found
that there are positive correlations among FTA and GWA variables and have also identified
the coefficients of these correlations. In the future we can establish a regression-based model of the
different variables and predict system behavior in response to the variation of events.
References
[1] Fault Trace Archive. [Online]. http://fta.scem.uws.edu.au
[2] Hui Li, Mathieu Jan, Shanny Anoep, Alexandru Iosup, "The Grid Workload Archive".
[3] Alexandru Iosup, Matthieu Gallet, Emmanuel Jeannot, Derrick Kondo, Bahman Javadi, Artur Andrzejak,
Dick Epema. (2009) Fault Trace Archive: For improving the reliability of distributed systems.
[Online]. http://fta.scem.uws.edu.au/index.php?n=Main.FTAFormat
[4] Catalin Dumitrescu, Dick Epema, Alexandru Iosup, Mathieu Jan, Hui Li, Lex Wolters, Shanny Anoep.
(2007, January) The Grid Workloads Archive. [Online]. http://gwa.ewi.tudelft.nl
[5] Derrick Kondo, David P. Anderson, Artur Andrzejak, "Ensuring Collective Availability in Volatile
Resource Pools via Forecasting".
[6] Derrick Kondo, Jean-Marc Vincent, David P. Anderson, Bahman Javadi, "Mining for Statistical
Models of Availability in Large-Scale Distributed Systems: An Empirical Study of SETI@home".
[7] Garth A. Gibson, Bianca Schroeder, "Disk Failures in the Real World: What does an MTTF of 1,000,000
hours mean to you?," in 5th USENIX Conference on File and Storage Technologies, San Jose, CA,
Feb 14-16, 2007.
[8] Matthieu Gallet, Derrick Kondo, Alexandru Iosup, Dick Epema, Nezih Yigitbasi, "Analysis and
Modeling of Time Correlated Failures in Large Scale Distributed Systems," 8th IEEE/ACM
International Conference on Grid Computing, PDS-2010-004, 2010.
[9] Bahman Javadi, Alexandru Iosup, Dick Epema, Derrick Kondo, "The Failure Trace Archive:
Enabling Comparative Analysis of Failures".
[10] Dick Epema, Alexandru Iosup, "Grid Computing Workloads," IEEE Transactions on Internet
Computing, vol. 15, no. 2, pp. 19-26, 2011.
[11] Dick Epema, Alexandru Iosup, "Grid Computing Workloads: Bags of Tasks, Workflows, Pilots, and
Others," Netherlands, 2011.
[12] Carsten Ernemann, Ramin Yahyapour, Baiyi Song, "User Group-based Workload Analysis and
Modelling," in 2005 IEEE International Symposium on Cluster Computing and the Grid, 2005, pp.
953-961.