Data Analytics on Solar Energy Using Hadoop (IJMER Journal)
ABSTRACT: Missing data is one of the major issues in data mining and pattern recognition. The knowledge contained in attributes with missing data values is important in improving the regression-correlation process of an organization. Learning from each instance is necessary, as it may contain some exceptional knowledge, and there are various methods to handle missing data in regression correlation. This work analyzes photovoltaic cells and the sunlight striking different geographical locations to identify defective or disconnected photovoltaic plates. We mainly aim to showcase the energy produced at different geographical locations and to find the defective plates, and also to analyze the datasets of energy produced, along with the current weather in a particular area, to determine the status of the photovoltaic plates. In this project, we used the Hadoop MapReduce framework to analyze the solar energy datasets.
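The map/reduce split behind such a job can be sketched in plain Python as a toy stand-in for a real Hadoop run; the record layout, plant IDs and the zero-output heuristic for flagging a defective plate are illustrative assumptions, not the paper's schema:

```python
from collections import defaultdict

# Hypothetical readings: (plant_id, location, energy_wh); values are invented.
readings = [
    ("P1", "Bangalore", 4200),
    ("P2", "Bangalore", 0),      # a zero total may indicate a defective plate
    ("P3", "Chennai", 5100),
    ("P1", "Bangalore", 3900),
]

def mapper(record):
    plant, location, wh = record
    yield (location, plant), wh

def reducer(key, values):
    return key, sum(values)

# Shuffle phase: group mapper output by key, as Hadoop does between map and reduce.
groups = defaultdict(list)
for record in readings:
    for key, value in mapper(record):
        groups[key].append(value)

totals = dict(reducer(k, v) for k, v in groups.items())
defective = sorted(plant for (loc, plant), total in totals.items() if total == 0)
print(totals)
print(defective)
```

In a real cluster the mapper and reducer run on separate nodes and the shuffle is done by the framework; only the two functions would be written by hand.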
With the development of database technology, the volume of data stored in databases increases rapidly, and much important information is hidden in these large amounts of data. If that information can be extracted from the database, it can create a lot of profit for the organization. The question organizations are asking is how to extract this value; the answer is data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Genetic Algorithms, Fuzzy Logic and Decision Trees. Many practitioners are wary of Neural Networks because of their black-box nature, even though they have proven themselves in many situations. This paper gives an overview of artificial neural networks and examines their position as a preferred tool of data mining practitioners.
Correlation of artificial neural network classification and nfrs attribute fi... (eSAT Journals)
Abstract
About 5 to 15% of women of reproductive age face Polycystic Ovarian Syndrome (PCOS), a multifaceted, heterogeneous and complex disease. Polycystic ovaries, chronic anovulation and hyperandrogenism cause long-term consequences such as endometrial hyperplasia, type 2 diabetes mellitus and coronary disease; insulin resistance together with hypertension, abdominal obesity, dyslipidemia and hyperinsulinemia constitutes the metabolic syndrome (frequent metabolic traits). Together these cause the common condition of anovulatory infertility. Computer-based information along with advanced data mining techniques is used to obtain appropriate results. Classification is a classic data mining task with roots in machine learning; Naïve Bayes, Artificial Neural Networks, Decision Trees and Support Vector Machines are classification techniques in data mining. Feature selection methods involve generation of subsets, evaluation of each subset, criteria for stopping the search, and validation procedures. The characteristics of the search method used are important for the time efficiency of feature selection. PCA (Principal Component Analysis), information gain subset evaluation, fuzzy rough set evaluation and Correlation-based Feature Selection (CFS) are some of the feature selection techniques; greedy first search, ranker, etc. are search algorithms used in feature selection. In this paper, a new algorithm based on fuzzy neural subset evaluation and an artificial neural network is proposed, which avoids performing classification and feature selection as separate tasks. The algorithm combines neural fuzzy rough subset evaluation and an artificial neural network for better performance than doing the tasks separately.
Keywords: ANN, SVM, PCA, CFS
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents information about the data stored in a Data Warehouse and is a mandatory element for building an efficient Data Warehouse. Metadata helps in data integration, lineage, data quality and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including the spatial information, is fetched from the Data Warehouse.
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, several
attempts have been made by various researchers to deal with the problem of scheduling the Extract-
Transform-Load (ETL) process. In this paper we therefore present several approaches in the context of
enhancing the data warehousing Extract, Transform and loading stages. We focus on enhancing the
performance of extract and transform phases by proposing two algorithms that reduce the time needed in
each phase through employing the hidden semantic information in the data. Using the semantic information, a large volume of useless data can be pruned at an early design stage. We also focus on the problem of scheduling the execution of the ETL activities, with the goal of minimizing ETL execution time. We explore this area by evaluating three scheduling techniques for ETL. Finally, we experimentally show their behavior in terms of execution time in the sales domain, to understand the impact of implementing each of them and to choose the one leading to the maximum performance enhancement.
Introduction to feature subset selection method (IJSRD)
Data mining is a computational process to discover patterns in large data sets. Among its important techniques is classification, which has recently been receiving great attention in the database community. Classification can solve problems in different fields such as medicine, industry, business and science. Particle Swarm Optimization (PSO) is an optimization method based on social behaviour. Feature Selection (FS) involves finding a subset of prominent features to improve predictive accuracy and to remove redundant features. Rough Set Theory (RST) is a mathematical tool that deals with the uncertainty and vagueness of decision systems.
Software Bug Detection Algorithm using Data Mining Techniques (AM Publications)
The main aim of software development is to develop high-quality software, and high-quality software is developed using enormous amounts of software engineering data. This data can be used to gain an empirically based understanding of software development, and meaningful information can be extracted from it using various data mining techniques. As data mining for secure software engineering improves software productivity and quality, software engineers are increasingly applying data mining algorithms to various software engineering tasks. However, mining software engineering data poses several challenges, requiring various algorithms to effectively mine sequences, graphs and text from such data. Software engineering data includes code bases, execution traces, historical code changes, mailing lists and bug databases, which contain a wealth of information about a project's status, progress and evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of this valuable data to better manage their projects and produce higher-quality software systems that are delivered on time and within budget.
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real world object. In this paper, a
statistical technique for data fusion is introduced based on some probabilistic scores from both data
sources and clustered duplicates.
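As a rough illustration of score-based fusion, the sketch below picks each attribute value for a duplicate cluster by a weighted vote combining source reliability and value frequency. The sources, weights and records are invented stand-ins for the paper's probabilistic scores, not its actual technique:

```python
from collections import Counter

def fuse(cluster, source_weight):
    """For one cluster of duplicate records, keep the highest-scoring value
    per attribute. Score = sum of the reliability weights of the sources
    that reported that value (an assumed scoring rule)."""
    fused = {}
    attributes = {k for rec in cluster for k in rec["values"]}
    for attr in attributes:
        scores = Counter()
        for rec in cluster:
            v = rec["values"].get(attr)
            if v is not None:
                scores[v] += source_weight[rec["source"]]
        fused[attr] = scores.most_common(1)[0][0]
    return fused

# Three duplicate records for one real-world person, from two sources.
cluster = [
    {"source": "crm", "values": {"city": "Cairo", "age": 30}},
    {"source": "web", "values": {"city": "Cairo", "age": 31}},
    {"source": "crm", "values": {"city": "Giza", "age": 30}},
]
weights = {"crm": 0.9, "web": 0.4}
fused = fuse(cluster, weights)
print(fused)
```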
Recommendation system using bloom filter in MapReduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews provided by other clients and specialists. Recommender systems are an important response to the information overload problem, as they present users with more practical and personalized information facilities. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of a community of similar users. The collaborative filtering method assumes that people having the same tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse data problem and lack of scalability, so a new recommender system is required that deals with the sparse data problem and produces high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of the conventional CF system. One of the essential operations for data analysis is the join operation, but MapReduce is not very efficient at executing joins, as it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the bloom-join algorithm: bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using bloom filters reduces the number of intermediate results and improves join performance.
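The bloom-join idea can be sketched as follows: build a Bloom filter over the smaller relation's join keys, prune the larger relation with it before the join, and verify the survivors exactly, since false positives are possible. The relations, key names and filter parameters here are illustrative, not the paper's configuration:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per key over a fixed bit array."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False => definitely absent; True => possibly present (false positives).
        return all(self.bits[p] for p in self._positions(key))

users = {"u1": "Alice", "u2": "Bob"}                    # smaller relation
ratings = [("u1", "m1", 5), ("u3", "m2", 4), ("u2", "m1", 3)]  # larger relation

bf = BloomFilter()
for uid in users:
    bf.add(uid)

# Prune non-matching records early, then do the exact join on the survivors.
pruned = [r for r in ratings if bf.might_contain(r[0])]
joined = [(users[uid], movie, score) for uid, movie, score in pruned if uid in users]
print(joined)
```

In the MapReduce setting the filter would be built in one job (or on the map side) and distributed to the mappers of the join job so that intermediate records are dropped before the shuffle.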
Ontology Based PMSE with Manifold Preference (IJCERT)
International journal from http://www.ijcert.org
IJCERT Standard on-line Journal
ISSN(Online):2349-7084,(An ISO 9001:2008 Certified Journal)
IJCERT (ISSN 2349–7084 (Online)) is approved by National Science Library (NSL), National Institute of Science Communication And Information Resources (NISCAIR), Council of Scientific and Industrial Research, New Delhi, India.
An efficient algorithm for sequence generation in data mining (ijcisjournal)
Data mining is the activity of analyzing data from different perspectives and summarizing it
into useful information. There are several major data mining techniques that have been developed and are
used in the data mining projects which include association, classification, clustering, sequential patterns,
prediction and decision tree. Among different tasks in data mining, sequential pattern mining is one of the
most important tasks. Sequential pattern mining involves the mining of the subsequences that appear
frequently in a set of sequences. It has a variety of applications in several domains such as the analysis of
customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access
patterns, seismologic data and weather observations. Various models and algorithms have been developed
for the efficient mining of sequential patterns in large amounts of data. This research paper analyzes the efficiency of three sequence generation algorithms, namely GSP, SPADE and PrefixSpan, on a retail dataset using various performance factors. From the experimental results, it is observed that the PrefixSpan algorithm is more efficient than the other two algorithms.
Feature Selection: A Novel Approach for the Prediction of Learning Disabilit... (csandit)
Feature selection is a problem closely related to dimensionality reduction. A commonly used approach in feature selection is to rank the individual features according to some criteria and then search for an optimal feature subset based on an evaluation criterion that tests optimality. The objective of this work is to predict more accurately the presence of Learning Disability (LD) in school-aged children with a reduced number of symptoms. For this purpose, a novel hybrid feature selection approach is proposed by integrating a popular Rough Set based feature ranking process with a modified backward feature elimination algorithm. The approach starts with a ranking of the symptoms of LD according to their importance in the data domain; each symptom's significance or priority value reflects its relative importance in predicting LD among the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage of the process, an optimal feature subset is generated. The experimental results show the success of the proposed method in removing redundant attributes efficiently from the LD dataset without sacrificing classification performance.
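The rank-then-eliminate loop described above can be sketched as follows. The scoring function, feature names and rank values are illustrative stand-ins for the paper's Rough Set ranking and classifier-based subset evaluation:

```python
def backward_eliminate(features, ranks, score, min_size=1):
    """Drop the lowest-ranked feature whenever doing so does not hurt the
    subset score; stop when only min_size features remain."""
    current = list(features)
    best = score(current)
    # Try features least-important first, per the precomputed ranking.
    for f in sorted(features, key=lambda f: ranks[f]):
        if len(current) <= min_size:
            break
        trial = [x for x in current if x != f]
        s = score(trial)
        if s >= best:          # keep the elimination only if quality is preserved
            current, best = trial, s
    return current

# Toy setup: two genuinely useful symptoms and one irrelevant attribute.
ranks = {"attention": 0.9, "reading": 0.8, "shoe_size": 0.1}
useful = {"attention", "reading"}
score = lambda fs: sum(1 for f in fs if f in useful) / len(fs)  # stand-in criterion
selected = backward_eliminate(["attention", "reading", "shoe_size"], ranks, score)
print(selected)
```

Note that with this toy score the loop keeps shrinking the subset as long as quality is preserved, ending at a single feature; a real evaluation on the LD dataset would use classification accuracy instead.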
Indexing based Genetic Programming Approach to Record Deduplication (idescitation)
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records caused by misspellings, field swaps or other mistakes and data inconsistencies. This process requires identifying objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate as many duplications as possible. GP with indexing is an optimization technique that helps find the maximum number of duplicates in the database. We used a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Many industries and systems depend on the accuracy and reliability of databases to carry out operations; therefore, the quality of the information stored in the databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in computational time and resources to process this data.
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION (cscpconf)
While designing a new type of engineering material, one has to search for existing materials that suit the design requirement before trying to produce the new kind of engineering material. This selection process is tedious, as a few materials have to be selected out of a set of lakhs of materials. Therefore, in this paper a model is proposed to select a particular material that suits the user requirement, using several similarity/distance measuring functions. Thirteen different types of similarity/distance measuring functions are examined. A Performance Index Measure (PIM) is calculated to verify the relative performance of the selected material against the target material. All the results are then normalised for the purpose of analysing them. The proposed model thus reduces the time wasted in selection and also avoids haphazard selection of materials in materials design and manufacturing industries.
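A minimal sketch of ranking candidate materials against a target profile with a few such distance measures. The property vectors and material names are invented, and the thirteen measures and the PIM the paper examines are not reproduced here:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Normalised property profiles (e.g. strength, ductility, hardness); invented.
target = [0.8, 0.6, 0.9]
materials = {"steel_A": [0.7, 0.5, 0.95], "alloy_B": [0.2, 0.9, 0.3]}

# Rank candidates by distance to the target (smaller distance = better match).
ranking = sorted(materials, key=lambda m: euclidean(target, materials[m]))
print(ranking)
```

Different measures can disagree on the ranking, which is precisely why the paper normalises and compares thirteen of them.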
The Internet has become the most popular surfing environment, which increases the size of service-oriented data. As the data size grows, finding and retrieving the most similar data from a large volume of data becomes a more difficult task. This problem is addressed by various research methods that attempt to cluster the large volume of data. In the existing research, the Clustering-based Collaborative Filtering approach (ClubCF) is introduced, whose main goal is to cluster similar kinds of data together so that the retrieval time cost can be reduced considerably. However, existing methods cannot find similar reviews accurately, which needs more attention for an efficient and accurate recommendation system. This is ensured in the proposed research method by introducing a novel technique, namely Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together, so that the recommendation process can be done more easily. To handle the large volume of data, this work adopts the MapReduce framework, which divides the entire data into subsets assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, accurate movie recommendations are made using the logistic regression method. The overall evaluation of the proposed method is done in Hadoop, from which it can be shown that the proposed technique provides better outcomes than the existing research techniques.
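The clustering step at the heart of MoCFCR can be sketched with a plain k-means over per-reviewer rating summaries. The deterministic initialisation (first k points as centroids) and the toy data are assumptions for illustration, not the paper's setup:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, recompute
    centroids as cluster means, repeat. Init: first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return clusters

# Each reviewer summarised as (avg action rating, avg drama rating); invented data.
reviewers = [(1, 5), (2, 5), (5, 1), (5, 2)]
clusters = kmeans(reviewers, k=2)
print(clusters)
```

Recommendations would then be generated only within a target reviewer's cluster, which is what makes the MapReduce partitioning by cluster practical.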
New proximity estimate for incremental update of non uniformly distributed cl... (IJDKP)
The conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters. Many real life databases keep growing incrementally. For such dynamic databases, the patterns
extracted from the original database become obsolete. Thus the conventional clustering algorithms are not
suitable for incremental databases due to lack of capability to modify the clustering results in accordance
with recent updates. In this paper, the author proposes a new incremental clustering algorithm called
CFICA (Cluster Feature-Based Incremental Clustering Approach for numerical data) to handle numerical
data and suggests a new proximity metric called Inverse Proximity Estimate (IPE) which considers the
proximity of a data point to a cluster representative as well as its proximity to a farthest point in its vicinity.
CFICA makes use of the proposed proximity metric to determine the membership of a data point into a
cluster.
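A sketch of the idea behind IPE, combining a point's distance to the cluster representative with its distance to the farthest member of the cluster. The additive combination below is an assumption for illustration; the paper defines its own formula:

```python
import math

def inverse_proximity(point, centroid, cluster_points):
    """IPE-style estimate (sketch): proximity to the cluster representative
    plus proximity to the farthest cluster member in the point's vicinity.
    The additive combination is assumed, not the paper's exact definition."""
    d_rep = math.dist(point, centroid)
    d_far = max(math.dist(point, p) for p in cluster_points)
    return d_rep + d_far

# Toy cluster with its mean as representative; candidate point outside it.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
centroid = (1 / 3, 1 / 3)
ipe = inverse_proximity((2.0, 2.0), centroid, cluster)
print(ipe)
```

The point of including the farthest-member term is that in non-uniformly distributed clusters the centroid alone can misjudge membership near a cluster's sparse fringe.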
Abstract
In this paper, the concept of data mining is summarized and its significance for data mining methodologies is illustrated. Data mining based on Neural Networks and Genetic Algorithms is researched in detail, and the key technologies and ways to achieve data mining with Neural Networks and Genetic Algorithms are also surveyed. This paper also conducts a formal review of the area of rule extraction from ANNs and GAs.
Keywords: Data Mining, Neural Network, Genetic Algorithm, Rule Extraction
A study and survey on various progressive duplicate detection mechanisms (eSAT Journals)
Abstract
One of the serious problems faced in several applications involving personal details management, customer affiliation management, data mining, etc. is duplicate detection. This survey deals with the various duplicate record detection techniques in both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods like Progressive Blocking and the Progressive Sorted Neighborhood Method are used. The Progressive Sorted Neighborhood Method, also called PSNM, is used in this model for detecting duplicates in a parallel approach. The Progressive Blocking algorithm works on large datasets where finding duplicates requires immense time. These algorithms are used to enhance the duplicate detection system, and their efficiency can be double that of the conventional duplicate detection method.
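The sorted-neighborhood core of PSNM can be sketched as follows: sort records by a key so that near-duplicates land close together, then compare only records inside a sliding window instead of all pairs. The sort key, window size and toy similarity rule are illustrative, and PSNM's progressive scheduling of window sizes is omitted:

```python
def sorted_neighborhood(records, key, window=3):
    """Return candidate pairs: each record compared only with the next
    window-1 records in sort order, avoiding the full O(n^2) comparison."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((rec, ordered[j]))
    return pairs

def is_duplicate(a, b):
    # Toy rule: same surname and same first initial.
    return a[0].split()[-1] == b[0].split()[-1] and a[0][0] == b[0][0]

people = [("Jon Smith",), ("John Smith",), ("Mary Jones",), ("J. Smith",)]
candidates = sorted_neighborhood(people, key=lambda r: r[0].split()[-1])
dups = [(a, b) for a, b in candidates if is_duplicate(a, b)]
print(dups)
```

The "progressive" variants rank comparisons so the most promising ones run first, letting duplicates surface early even if the process is stopped before completion.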
An efficient feature selection algorithm for health care data analysis (journalBEEI)
Diabetes is a silent killer that will slowly harm a person if it goes undetected. The existing systems, which use the F-score method and K-means clustering to check whether a person has diabetes or not, are not 100% accurate, and anything less than 100% is not acceptable in the medical field, as it could cost the lives of many people. Our proposed system aims at using some of the best features of the existing algorithms to predict diabetes; this research work combines these features into a novel algorithm intended to be 100% accurate in its prediction. With the surge in technological advancements, we can use data mining to predict when a person will be diagnosed with diabetes. Specifically, we analyze the best features of the chi-square algorithm and the advanced clustering algorithm (ACA). This research work uses the Pima Indian Diabetes dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Using classification theorems and methods, we consider different factors such as age, BMI and blood pressure together with the importance given to these attributes overall, single these attributes out, and use them for the prediction of diabetes.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone and shopping records, and from individuals, are regularly generated. Sharing these data has proved to be beneficial for data mining applications. On one hand, such data is an important asset for business decision making when analyzed; on the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goal of privacy preservation as well as accuracy of the data mining tasks of clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining data clustering with minimum information loss.
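One simple form of tuple-value multiplicative perturbation is to scale every attribute of a tuple by that tuple's own random factor, which roughly preserves relative distances (and hence clustering structure) while masking the raw values. The noise range, seed and data below are illustrative assumptions, not the paper's parameters:

```python
import random

def perturb(rows, seed=42, low=0.95, high=1.05):
    """Scale each tuple by one random factor drawn per tuple (a sketch of
    tuple-value multiplicative perturbation; range and seed are assumed)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        factor = rng.uniform(low, high)       # one noise factor per tuple
        out.append([v * factor for v in row])
    return out

# Toy sensitive records: (income, age); values are invented.
original = [[52000.0, 34.0], [61000.0, 29.0]]
masked = perturb(original)
print(masked)
```

Because every attribute of a tuple shares the same factor, within-tuple ratios are preserved exactly, which is one reason distance-based mining tasks such as clustering degrade little under this kind of perturbation.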
A statistical data fusion technique in virtual data integration environmentIJDKP
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real world object. In this paper, a
statistical technique for data fusion is introduced based on some probabilistic scores from both data
sources and clustered duplicates
Recommendation system using bloom filter in mapreduceIJDKP
Many clients use the Web to discover product details in the form of online reviews provided by other clients and specialists. Recommender systems offer an important response to the information-overload problem, as they present users with more practical and personalized information. Collaborative filtering (CF) methods are a vital component of recommender systems because they generate high-quality recommendations by leveraging the preferences of communities of similar users; collaborative filtering assumes that people with similar tastes choose the same items. The conventional collaborative filtering system suffers from the sparse-data problem and a lack of scalability, so a new recommender system is needed that handles sparse data and produces high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering implemented in MapReduce, which reduces the scalability problem of the conventional CF system.
Join is one of the essential operations for this data analysis, but MapReduce is not very efficient at executing joins because it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the bloomjoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed Bloom-filter-based algorithm reduces the number of intermediate results and improves join performance.
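The filtering idea the abstract relies on can be sketched with a minimal Bloom filter. This is an illustrative implementation, not the paper's code; the sizes and hash scheme are arbitrary choices.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k bit positions derived per item over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["user42", "user77"]:
    bf.add(key)
```

In a bloomjoin, a filter built over the join keys of one dataset is shipped to the mappers of the other, so records whose keys the filter rejects are dropped before the shuffle phase.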
Ontology Based PMSE with Manifold PreferenceIJCERT
An efficient algorithm for sequence generation in data miningijcisjournal
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Several major data mining techniques have been developed and are
used in the data mining projects which include association, classification, clustering, sequential patterns,
prediction and decision tree. Among different tasks in data mining, sequential pattern mining is one of the
most important tasks. Sequential pattern mining involves the mining of the subsequences that appear
frequently in a set of sequences. It has a variety of applications in several domains such as the analysis of
customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access
patterns, seismologic data and weather observations. Various models and algorithms have been developed
for the efficient mining of sequential patterns in large amount of data. This research paper analyzes the
efficiency of three sequence generation algorithms namely GSP, SPADE and PrefixSpan on a retail dataset
by applying various performance factors. From the experimental results, it is observed that the PrefixSpan
algorithm is more efficient than other two algorithms.
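The pattern-growth idea behind PrefixSpan can be sketched in a few lines. This is a simplified single-item-per-element version for illustration only, not the implementation benchmarked in the paper.

```python
def prefixspan(sequences, min_support):
    """Mine frequent sequential patterns by recursive database projection."""
    results = []

    def mine(prefix, projected):
        # Count each item once per projected sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, count))
            # Project each sequence past the first occurrence of `item`.
            new_proj = [seq[seq.index(item) + 1:]
                        for seq in projected if item in seq]
            mine(pattern, new_proj)

    mine([], sequences)
    return results

db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
patterns = prefixspan(db, min_support=3)
```

With `min_support=3` this toy database yields the patterns `a`, `a c`, `b`, `b c`, and `c`; unlike GSP, no candidate sequences are generated and tested against the whole database.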
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...csandit
Feature selection is a problem closely related to dimensionality reduction. A commonly used
approach in feature selection is ranking the individual features according to some criteria and
then search for an optimal feature subset based on an evaluation criterion to test the optimality.
The objective of this work is to predict more accurately the presence of Learning Disability
(LD) in school-aged children with reduced number of symptoms. For this purpose, a novel
hybrid feature selection approach is proposed by integrating a popular Rough Set based feature
ranking process with a modified backward feature elimination algorithm. The approach follows
a ranking of the symptoms of LD according to their importance in the data domain. Each symptom's significance (priority) value reflects its relative importance in predicting LD among the various cases. Then, by eliminating the least significant features one by one and evaluating the feature subset at each stage of the process, an optimal feature subset is generated. The experimental results show that the proposed method removes redundant attributes efficiently from the LD dataset without sacrificing classification performance.
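The ranking-guided backward elimination described above can be sketched generically. The `rank` and `evaluate` callables here are hypothetical stand-ins for the paper's Rough Set ranking and classifier evaluation.

```python
def backward_eliminate(features, rank, evaluate, tolerance=0.0):
    """Drop the lowest-ranked feature while the evaluation score does not
    fall by more than `tolerance`; return the surviving subset."""
    # Order features from least to most significant.
    current = sorted(features, key=rank)
    best_score = evaluate(current)
    while len(current) > 1:
        candidate = current[1:]          # drop the least significant feature
        score = evaluate(candidate)
        if score + tolerance < best_score:
            break                        # removal hurt accuracy; stop
        current, best_score = candidate, max(best_score, score)
    return current

# Toy example: only features "f1" and "f3" carry signal.
ranks = {"f1": 3, "f2": 1, "f3": 2}
signal = {"f1", "f3"}
subset = backward_eliminate(
    ["f1", "f2", "f3"],
    rank=lambda f: ranks[f],
    evaluate=lambda fs: len(signal & set(fs)) / len(signal),
)
```

The elimination stops as soon as removing the next-lowest-ranked feature degrades the evaluation, which is what lets the hybrid approach prune redundant symptoms without hurting classification.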
Indexing based Genetic Programming Approach to Record Deduplicationidescitation
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data de-duplication is a process in which data are cleaned of duplicate records arising from misspellings, field swaps, or other mistakes and inconsistencies. This process requires identifying objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in the data warehouse, so we need an algorithm that can detect and eliminate the maximum number of duplications. GP with indexing is an optimization technique that helps find the maximum number of duplicates in the database. We used a deduplication function that is able to identify whether two or more entries in a repository are replicas. Many industries and systems depend on the accuracy and reliability of databases to carry out operations; therefore, the quality of the information stored in databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data.
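A deduplication function of the kind the abstract mentions can be sketched as a weighted combination of per-field similarities. In the paper the combination is evolved by GP; here a fixed weighted sum and a Jaccard token similarity stand in for it, and the field names and threshold are illustrative.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_replica(rec1, rec2, weights, threshold=0.75):
    """Weighted combination of per-field similarities, standing in
    for the evolved GP expression."""
    score = sum(w * jaccard(rec1[f], rec2[f]) for f, w in weights.items())
    return score >= threshold

weights = {"name": 0.6, "city": 0.4}
r1 = {"name": "John A Smith", "city": "New York"}
r2 = {"name": "John Smith", "city": "New York"}
dup = is_replica(r1, r2, weights)
```

Indexing (e.g. blocking on a key such as the first name token) limits which record pairs this function is ever applied to, which is what makes deduplication tractable on large repositories.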
A STUDY ON SIMILARITY MEASURE FUNCTIONS ON ENGINEERING MATERIALS SELECTION cscpconf
While designing a new type of engineering material, one has to search for existing materials that suit the design requirements before attempting to produce the new material. This selection process is tedious, as a few materials must be chosen from a set of hundreds of thousands. Therefore, this paper proposes a model to select a particular material that suits the user requirements by using similarity/distance measuring functions. Thirteen different similarity/distance measures are examined. A Performance Index Measure (PIM) is calculated to verify the relative performance of the selected material against the target material, and all results are normalised for the purpose of analysis. The proposed model thus reduces the time wasted in selection and avoids haphazard selection of materials in materials design and manufacturing industries.
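Three of the most common similarity/distance measures of the kind the paper compares can be written directly; the material property vector below is a made-up example, not from the paper's dataset.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two property vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical material property vectors: density, strength, hardness.
target = [7.8, 400.0, 120.0]
candidate = [7.9, 420.0, 118.0]
closeness = cosine_similarity(target, candidate)
```

In a selection model, each candidate material's vector is scored against the target requirement vector and the best-scoring candidates are shortlisted; normalising the properties first keeps large-magnitude attributes (like strength) from dominating the distance.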
The Internet has become the most popular surfing environment, which increases the size of service-oriented data. As the data size grows, finding and retrieving the most similar data from a large volume becomes a more difficult task. Various research methods address this problem by attempting to cluster the large volume of data. An existing method, the Clustering-based Collaborative Filtering approach (ClubCF), clusters similar kinds of data together so that retrieval time cost can be reduced considerably. However, existing methods cannot find similar reviews accurately, which must be addressed for an efficient and accurate recommendation system. The proposed research method ensures this by introducing a novel technique, Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together so that the recommendation process can be carried out more easily. To handle the large volume of data, this work adopts the MapReduce framework, which divides the entire dataset into subsets assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted-index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, movie recommendations are made accurately using logistic regression. The overall evaluation of the proposed method is performed in Hadoop, showing that the proposed technique provides better outcomes than existing techniques.
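The first step of the pipeline, clustering reviewers with k-means, can be sketched in plain Python. This is a toy one-dimensional version (clustering reviewers by average rating) for illustration; the data and initialisation are made up.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on 1-D values; returns final centroids and labels."""
    centroids = points[:k]                    # naive initialisation
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Average ratings given by eight reviewers: two natural groups.
ratings = [1.0, 1.2, 0.8, 1.1, 4.9, 5.0, 4.8, 5.1]
centroids, labels = kmeans(ratings, k=2)
```

In the MapReduce setting, the assignment step is the map phase (each node labels its subset of reviewers) and the centroid recomputation is the reduce phase.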
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
The conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters. Many real life databases keep growing incrementally. For such dynamic databases, the patterns
extracted from the original database become obsolete. Thus the conventional clustering algorithms are not
suitable for incremental databases due to lack of capability to modify the clustering results in accordance
with recent updates. In this paper, the author proposes a new incremental clustering algorithm called
CFICA(Cluster Feature-Based Incremental Clustering Approach for numerical data) to handle numerical
data and suggests a new proximity metric called Inverse Proximity Estimate (IPE) which considers the
proximity of a data point to a cluster representative as well as its proximity to a farthest point in its vicinity.
CFICA makes use of the proposed proximity metric to determine the membership of a data point into a
cluster.
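The abstract does not give the IPE formula, only that it combines a point's distance to the cluster representative with its distance to the farthest point in its vicinity. The sketch below is one plausible reading of that description, clearly labelled as an assumption; the actual CFICA metric may combine the two terms differently.

```python
import math

def dist(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inverse_proximity(point, representative, cluster):
    """Illustrative reading of IPE: distance to the cluster representative
    plus distance to the farthest cluster member. The additive combination
    is an assumption, not the paper's formula."""
    to_rep = dist(point, representative)
    to_far = max(dist(point, m) for m in cluster)
    return to_rep + to_far   # smaller means stronger membership (assumed)

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rep = (0.33, 0.33)
score = inverse_proximity((0.5, 0.5), rep, cluster)
```

Under this reading, a point near the representative but far from every cluster member (e.g. in an elongated cluster's tail) gets a worse score than centroid distance alone would suggest, which matches the abstract's motivation for looking beyond the representative.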
Abstract In this paper, the concept of data mining was summarized and its significance towards its methodologies was illustrated. The data mining based on Neural Network and Genetic Algorithm is researched in detail and the key technology and ways to achieve the data mining on Neural Network and Genetic Algorithm are also surveyed. This paper also conducts a formal review of the area of rule extraction from ANN and GA. Keywords: Data Mining, Neural Network, Genetic Algorithm, Rule Extraction.
A study and survey on various progressive duplicate detection mechanismseSAT Journals
Abstract One of the serious problems faced in several applications involving personal-details management, customer affiliation management, data mining, etc. is duplicate detection. This survey deals with the various duplicate-record detection techniques on both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods like Progressive Blocking and Progressive Sorted Neighborhood are used. The Progressive Sorted Neighborhood Method (PSNM) is used in this model for detecting duplicates in a parallel approach, while the Progressive Blocking algorithm works on large datasets where finding duplicates requires immense time. These algorithms are used to enhance the duplicate detection system; its efficiency can be doubled over the conventional duplicate detection method. Severa
An efficient feature selection algorithm for health care data analysisjournalBEEI
Diabetes is a silent killer that will slowly kill a person if it goes undetected. The existing systems that use the F-score method and K-means clustering to check whether a person has diabetes are not 100% accurate, and anything less than 100% is unacceptable in the medical field, as it could cost the lives of many people. Our proposed system aims to use some of the best features of the existing algorithms to predict diabetes and, based on these features, turns them into a novel algorithm which will be 100% accurate in its prediction. With the surge in technological advancements, we can use data mining to predict when a person would be diagnosed with diabetes. Specifically, we analyze the best features of the chi-square algorithm and the advanced clustering algorithm (ACA). This research work uses the Pima Indian Diabetes dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Using classification theorems and methods, we consider different factors like age, BMI and blood pressure along with the overall importance given to these attributes, single these attributes out, and use them for the prediction of diabetes.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
Huge volume of data from domain specific applications such as medical, financial, library, telephone,
shopping records and individual are regularly generated. Sharing of these data is proved to be beneficial
for data mining application. On one hand such data is an important asset to business decision making by
analyzing it. On the other hand data privacy concerns may prevent data owners from sharing information
for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goal of privacy preservation and accuracy of the data mining tasks of clustering and classification. An efficient and effective approach has been proposed that aims to protect the privacy of sensitive information while obtaining data clustering with minimum information loss.
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...ijcsa
Manual front-end data collection and loading into a database may introduce errors into datasets and is a very time-consuming process. Scanning a data document as an image and recognising the corresponding information in that image can be considered a possible solution to this challenge. This paper presents an automated solution to the problem of data cleansing and recognition of handwritten data, transforming it into standard printed format with the help of artificial neural networks. Three different neural models, namely direct, correlation-based and hierarchical, have been developed to handle this issue. The solution is developed to validate the proposed logic in a very hostile input environment.
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTIONIJDKP
Developing predictive modelling solutions for risk estimation is extremely challenging in health-care
informatics. Risk estimation involves the integration of heterogeneous clinical sources with different representations from different health-care providers, making the task increasingly complex. Such sources are typically voluminous, diverse, and change significantly over time. Therefore, distributed and parallel computing tools, collectively termed big data tools, are needed to synthesize the data and assist physicians in making the right clinical decisions. In this work we propose a multi-model predictive architecture, a novel
approach for combining the predictive ability of multiple models for better prediction accuracy. We
demonstrate the effectiveness and efficiency of the proposed work on data from Framingham Heart study.
Results show that the proposed multi-model predictive architecture is able to provide better accuracy than
the best-model approach. By modelling the error of the predictive models, we are able to choose a subset of models that yields accurate results. More information was modelled into the system by multi-level mining, which resulted in enhanced predictive accuracy.
Machine Learning Approaches and its Challengesijcnes
Real-world data sets are often not in proper shape: they may contain incomplete or missing values, and identifying missing attributes is a challenging task. To impute the missing data, data preprocessing has to be done. Data preprocessing is a data mining step that cleanses the data, and handling missing data is a crucial part of any data mining technique. Major industries and many real-time applications care deeply about their data, because loss of data slows a company's growth. For example, the health-care industry holds a great deal of data about patient details, and diagnosing a particular patient requires exact data; if attribute values are missing, it is very difficult to recover the information. Considering the drawbacks of missing values in the data mining process, many techniques and algorithms have been implemented, and many of them are not very efficient. This paper elaborates the various techniques and machine learning approaches for handling missing attribute values and makes a comparative analysis to identify the most efficient method.
Identification of important features and data mining classification technique...IJECEIAES
Employee absenteeism at work costs organizations billions a year. Prediction of employees' absenteeism and the reasons behind their absence help organizations in reducing expenses and increasing productivity. Data mining turns the vast volume of human resources data into information that can help in decision-making and prediction. Although the selection of features is a critical step in data mining to enhance the efficiency of the final prediction, it is not yet known which method of feature selection is better. Therefore, this paper aims to compare the performance of three well-known feature selection methods in absenteeism prediction, which are relief-based feature selection, correlation-based feature selection and information-gain feature selection. In addition, this paper aims to find the best combination of feature selection method and data mining technique for enhancing the absenteeism prediction accuracy. Seven classification techniques were used as the prediction model. Additionally, a cross-validation approach was utilized to assess the applied prediction models to obtain more realistic and reliable results. The dataset used was built at a courier company in Brazil with records of absenteeism at work. Regarding experimental results, correlation-based feature selection surpasses the other methods across the performance measurements. Furthermore, the bagging classifier was the best-performing data mining technique when features were selected using correlation-based feature selection, with an accuracy rate of 92%.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. This stands as an elementary foundation for supervised learning, which encompasses classifier and feature-extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most of the
structured data in scientific domain are voluminous. Processing such kind of data requires state of the art
computing machines. Setting up such an infrastructure is expensive. Hence a distributed environment
such as a clustered setup is employed for tackling such scenarios. Apache Hadoop distribution is one of
the cluster frameworks in distributed environment that helps by distributing voluminous data across a
number of nodes in the framework. This paper focuses on map/reduce design and implementation of
Apriori algorithm for structured data analysis.
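The map/reduce decomposition of an Apriori pass can be simulated in plain Python: mappers emit `(candidate, 1)` for every candidate itemset contained in a transaction of their partition, and the reducer sums counts and applies the support threshold. This is a single-machine sketch of the design, not the paper's Hadoop implementation; the transactions are made up.

```python
from itertools import combinations

def map_phase(partition, candidates):
    """Mapper: emit (candidate, 1) for each candidate contained in a transaction."""
    for txn in partition:
        items = set(txn)
        for cand in candidates:
            if set(cand) <= items:
                yield cand, 1

def reduce_phase(pairs):
    """Reducer: sum the emitted counts per candidate itemset."""
    totals = {}
    for key, val in pairs:
        totals[key] = totals.get(key, 0) + val
    return totals

def apriori_pass(partitions, candidates, min_support):
    emitted = [kv for part in partitions for kv in map_phase(part, candidates)]
    counts = reduce_phase(emitted)
    return {c: n for c, n in counts.items() if n >= min_support}

txns = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"], ["bread"]]
partitions = [txns[:2], txns[2:]]        # two "nodes"
items = sorted({i for t in txns for i in t})
f1 = apriori_pass(partitions, [(i,) for i in items], 2)      # frequent 1-itemsets
c2 = [tuple(sorted(p)) for p in combinations(sorted({i for (i,) in f1}), 2)]
f2 = apriori_pass(partitions, c2, 2)                         # frequent 2-itemsets
```

Each Apriori level is one MapReduce job: the frequent itemsets from level k are joined into level-(k+1) candidates on the driver, then counted again across the partitions.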
Running Head Data Mining in The Cloud .docxhealdkathaleen
Big data mining on the cloud
Big data mining techniques
Abstract.
Management and analysis of data is becoming a nightmare in every organization day by day because of the flood of data. Such data can only be analyzed using information governance and big data mining techniques. This paper looks at some of the big data mining techniques that can be used to analyze data in organizations flooded with data, and shows how information governance supports big data. The paper begins with an overview of data mining, narrows down to big data mining techniques, and finally covers the ways in which information governance supports big data.
Introduction
Data mining is the process of examining tremendous amounts of information in order to make statistically likely predictions. Data mining can be used, for example, to recognize when high-spending clients interact with your business, to figure out which promotions succeed, or to investigate the effect of the weather on your business. Data mining principles have been around for many years in connection with data warehouses, and have now taken on greater prevalence with the advent of Big Data. Data analysis and the growth in both structured and unstructured data have also prompted data mining methods to change, since organizations are now managing bigger data collections with increasingly varied content (Khan, Anjum, Soomro and Tahir, 2015). Also, artificial intelligence and machine learning are automating the process of data mining.
Regardless of the methods applied, data mining involves three steps: exploration, modelling and deployment. The data must first be prepared and sorted to separate what is needed from what is not; this removes useless data and duplicates and ensures that the final sampled data is only what is crucial and needed most. Next, statistical models are created with the aim of determining which one will give the best and most accurate forecast; this can consume a lot of time, as various models are applied to the same data sets and the results analysed. Lastly, the model has to be tested against old and current data (Milani & Navimipour, 2017). This helps one determine the results to expect in the future.
Big data mining techniques
Data mining is a very significant and effective method when proper techniques are ap ...
Detection of Outliers in Large Dataset using Distributed ApproachEditor IJMTER
In this paper, a distributed method is introduced for detecting distance-based outliers in very large
data sets. The approach is based on the concept of outlier detection solving set, which is a small subset of the data
set that can be also employed for predicting novel outliers. The method exploits parallel computation in order to
obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed scheme exhibits excellent performance. From the theoretical point of view, for common settings, the temporal cost of our algorithm is expected to be at least three orders of magnitude lower than the classical nested-loop approach to detecting outliers. Experimental results show that the algorithm is efficient and that its running time scales quite well for an increasing number of nodes. We also discuss a variant of the basic strategy which reduces the amount of
data to be transferred in order to improve both the communication cost and the overall runtime. Importantly, the
solving set computed in a distributed environment has the same quality as that produced by the corresponding
centralized method.
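The distance-based outlier notion used here can be illustrated with the classical nested-loop baseline the paper compares against: score each point by the distance to its k-th nearest neighbour and flag the highest-scoring points. This is the naive centralized sketch, not the distributed solving-set algorithm; the data is made up.

```python
import math

def knn_outliers(points, k, top_n):
    """Nested-loop distance-based outlier detection: score each point by
    the distance to its k-th nearest neighbour and return the top_n."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append((dists[k - 1], p))   # k-th nearest-neighbour distance
    scores.sort(reverse=True)
    return [p for _, p in scores[:top_n]]

data = [(0, 0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.0)]
outliers = knn_outliers(data, k=2, top_n=1)
```

The quadratic cost of this double loop is exactly what the solving-set approach avoids: it iteratively grows a small subset of points whose neighbour distances suffice to bound every other point's score, so most pairwise comparisons are never performed.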
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA...IJDKP
Incomplete data is present in many study contexts. This incomplete or uncollected information is referred to as missing data (values) and is considered a vital problem by various researchers. The missing data problem is faced even more in air pollution monitoring, where data is collected from multiple monitoring stations spread across various locations. In the literature, various imputation methods for missing data have been proposed; in this research, however, we considered only existing imputation methods for missing data and recorded their performance in ensemble creation. The five existing imputation methods deployed in this research are the series mean method, mean of nearby points, median of nearby points, linear trend at a point, and linear interpolation. The series mean (SM) method performed comparatively better than the other imputation methods, with the least mean absolute error and better performance accuracy for SVM ensemble creation on the CO data set using bagging and boosting algorithms.
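Two of the five imputation methods named above, series mean and linear interpolation, can be sketched directly; the CO readings below are invented for illustration.

```python
def series_mean_impute(series):
    """Replace None with the mean of all observed values (series mean)."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def linear_interpolate(series):
    """Fill each internal gap along a straight line between the
    nearest observed neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for lo, hi in zip(known, known[1:]):
        step = (out[hi] - out[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = out[lo] + step * (i - lo)
    return out

co = [0.4, None, 0.8, None, None, 1.4]   # hourly CO readings with gaps
filled = linear_interpolate(co)
```

Series mean fills every gap with the same global value, so it is robust but flattens temporal structure; linear interpolation preserves local trends between stations' readings, which is why the two can rank differently depending on the downstream ensemble.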
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA in new pavement can minimize the carbon footprint, conserve natural resources, reduce harmful emissions, and lower life-cycle costs. Compared to natural aggregate (NA) pavement, RCA pavement has been the subject of fewer comprehensive studies and sustainability assessments.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique built on the binary heap data structure. It is similar to selection sort: we repeatedly select the extreme element (the maximum, when using a max-heap), move it to its final position at the end of the array, and repeat the process for the remaining elements.
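The heapify and build-heap steps mentioned in the title can be sketched as follows (a standard max-heap formulation, shown here in Python):

```python
def heapify(arr, n, i):
    """Sift arr[i] down so the subtree rooted at i satisfies the max-heap property."""
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and arr[left] > arr[largest]:
        largest = left
    if right < n and arr[right] > arr[largest]:
        largest = right
    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)

def heap_sort(arr):
    n = len(arr)
    # Build a max-heap bottom-up, starting from the last internal node.
    for i in range(n // 2 - 1, -1, -1):
        heapify(arr, n, i)
    # Repeatedly move the current maximum to the end and shrink the heap.
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]
        heapify(arr, end, 0)
    return arr

data = [12, 11, 13, 5, 6, 7]
heap_sort(data)   # sorts in place
```

Build-heap runs in O(n) and each of the n extractions costs O(log n), giving the O(n log n) total; for a dynamic array, appending an element followed by a sift-up keeps the heap property without rebuilding.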
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSveerababupersonal22
This material covers the CW radar, the FMCW radar, range measurement, the IF amplifier and the FMCW altimeter. The CW radar operates using continuous-wave transmission, while the FMCW radar employs frequency-modulated continuous-wave technology. Range measurement is a crucial aspect of radar systems, providing information about the distance to a target. The IF amplifier plays a key role in signal processing, amplifying intermediate-frequency signals for further analysis. The FMCW altimeter utilizes frequency-modulated continuous-wave technology to accurately measure altitude above a reference point.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...ssuser7dcef0
Power plants release a large amount of water vapor into the
atmosphere through the stack. The flue gas can be a potential
source for obtaining much needed cooling water for a power
plant. If a power plant could recover and reuse a portion of this
moisture, it could reduce its total cooling water intake
requirement. One of the most practical ways to recover water
from flue gas is to use a condensing heat exchanger. The power
plant could also recover latent heat due to condensation as well
as sensible heat by lowering the flue gas exit temperature.
Additionally, harmful acids released from the stack can be
reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated
phenomenon since heat and mass transfer of water vapor and
various acids simultaneously occur in the presence of noncondensable
gases such as nitrogen and oxygen. Design of a
condenser depends on the knowledge and understanding of the
heat and mass transfer processes. A computer program for
numerical simulations of water (H2O) and sulfuric acid (H2SO4)
condensation in a flue gas condensing heat exchanger was
developed using MATLAB. Governing equations based on
mass and energy balances for the system were derived to
predict variables such as flue gas exit temperature, cooling
water outlet temperature, mole fraction and condensation rates
of water and sulfuric acid vapors. The equations were solved
using an iterative solution technique with calculations of heat
and mass transfer coefficients and physical properties.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
2. 30796 M.Nalini and S.Anbu
cannot satisfy the customer’s expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants, depending on the application domain. Of these, "anomalies" and "outliers" are the two terms used most commonly in the context of anomaly detection, sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities.
The importance of anomaly detection stems from the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending sensitive data to an unauthorized destination; an anomalous MRI image may indicate the presence of malignant tumors; anomalies in credit card transaction data could indicate credit card or identity theft; and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft. Detecting outliers or anomalies in data has been studied in the statistics community as early as the 19th century. Over time, a variety of anomaly detection techniques have been developed in several research communities. Many of these techniques have been developed for specific application domains, while others are more generic.
The data considered in this paper is uncertain data, and its size is also very large. A PSO-based approach is applied here to find and count duplicates: it combines various pieces of evidence extracted from the data content to produce a de-duplication method. This approach is able to identify whether any entries in a repository are the same. In the same manner, more pieces of the data are taken as evidence and compared against the whole data, which serves as training data. This function is applied repeatedly over the whole data or over the repositories. Newly inserted data can also be compared in the same manner, against the same evidence, to avoid replicas. A method applied to record de-duplication should accomplish individual but contradictory objectives: the process should effectively increase the identification of replicated records. Genetic programming (GP) [15] is chosen as the baseline approach, as it is suitable for finding accurate answers to a given problem without searching the data as a whole; for the record de-duplication problem, the existing approaches [14, 16] apply genetic programming to provide good solutions.
In this paper, the results of the existing system in [16], which does not use the PSO-based approach, are taken for comparison; our approach is able to automatically find more effective de-duplication methods. Moreover, the PSO-based approach can interoperate with existing best de-duplication methods by adapting the replica-identification limits used to classify a pair of records as a match or not. Our experiments use real datasets containing scientific article citations and hotel index records, together with synthetically generated datasets that allow a controlled experimental environment. In all of these scenarios, our approach can be applied.
3. Anomaly Detection Via Eliminating Data Redundancy 30797
On the whole, the contributions of this paper, a PSO-based approach to find and count duplicates, are as follows:
• A solution with low computational time for duplicate detection.
• Reduced individual comparisons, using the PSO approach to find the similarity values.
• Identification of replicas by computing TPR and FPR among the data.
• Rectification of errors in the data entries.
RELATED WORKS
In [3] the authors proposed an approach to data reduction; such data reduction functions are essential to machine learning and data mining, and an agent-based population algorithm is used to solve the reduction problem. Data reduction alone, however, is not sufficient to improve the quality of databases. Databases of various sizes are used to provide high-quality classification of the data in order to find anomalies. In [4], two classes of algorithms, evolutionary and non-evolutionary, are applied and their results compared to find the algorithm best suited for anomaly detection. N-ary relations are computed to define the patterns in the dataset in [5], which provides relations over one-dimensional data. DBLEARN and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The main objective of the data mining technique in [7] is to detect and classify data in a huge database without compromising the speed of the process; PCA is used for data reduction and SVM for data classification. In [8] a data redundancy method is explored using a mathematical representation. Software with safe, correct and reliable operations was developed for avionics- and automobile-based database systems [9]. A statistical QA (Question Answer) model is applied to develop a prototype that avoids web-based data redundancy [10]. For GDW (Geographic Data Warehouses) [11], SOLAP (Spatial On-Line Analytical Processing) is applied to the GiST database and other spatial database analysis, indexing, and the generation of various reports without error. In [12], an effective method was proposed for P2P data sharing, in which duplicates are removed during sharing. Web entity data extraction associated with the attributes of the data [13] can be obtained using a novel approach that exploits duplicated attribute-value pairs.
De Carvalho et al. [1] used genetic programming to mark duplicates and perform de-duplication, concentrating mainly on identifying whether entries in a repository are replicas. Their approach outperformed earlier approaches, providing 6.2% higher accuracy on the two data sets described in [2]. Our proposed approach can be extended to various benchmark and real-time data, such as time series data, clinical data, the 20 Newsgroups collection, etc.
PARTICLE SWARM OPTIMIZATION GENERAL CONCEPTS
Virtually all living things are influenced by natural selection, and evolutionary programming approaches are inspired by this process. Particle swarm optimization (PSO) is one of the best-known evolutionary programming techniques. It is considered a heuristic approach and was initially applied to optimize data properties and availability. PSO can also be applied to multi-objective problems with environmental constraints. PSO and the other evolutionary approaches are widely known and applied in a variety of applications due to their good performance when searching over a large set of data. Instead of processing a single point in the problem's search space, PSO creates populations of individuals. This behavior is the essential aspect of the PSO approach: it creates additional new solutions with new combined features and advances them by comparison with the existing solutions in the search space.
PARTICLE OPERATIONS
PSO generates random particles representing individuals. In this paper, the particles model trees representing arithmetic functions, as illustrated in Figure-1. When using this tree representation in the PSO-based approach, the set of all inputs, variables, constants and methods must be defined [8]. The nodes terminating the trees are called leaves. A collection of operators, statements and methods is used in the PSO evolutionary process to manipulate the terminal values; these methods are placed in the internal nodes of the tree, as shown in Figure-1. In general, PSO models the social behavior of birds: in order to search for food, every bird in a flock is guided by a velocity based on its personal experience and on information collected by interacting with the other birds in the flock. This is the basic idea of PSO. Each particle denotes a bird, and its flight denotes a search through the subspace of the optimization problem for the optimum solution. In PSO, the solutions within an iteration are called a swarm and are equal in number to the population.
FIGURE-1: Tree used for mapping a function [an expression tree whose internal nodes hold operators (e.g. '+', '/') and whose leaves hold terminals (e.g. x, z)]
PROPOSED APPROACH
The proposed approach utilizes the PSO optimization method to find the difference between the entities of each record in a database. This difference is a similarity index between two data entities that decides duplication: if the distance between two data entities x and y is less than a threshold value ε, then x and y are declared duplicates. The PSO algorithm applied in this paper is given here:
1. Generate a random population P, representing the individual data entries.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P:
4.   Evaluate all particles based on the objective function.
5.   The objective function is Pr{dist(x[i], y[j]) ≤ ε} ≥ α.
6.   Gbest = the particle with the best solution.
7.   Compute the velocity of the Gbest particles.
8.   Update the current position of the best solution.
9. Next i.
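The numbered steps above follow the standard global-best PSO scheme. A minimal runnable sketch is given below (an illustration only, not the authors' implementation; the sphere objective, the parameter values and all function names are assumptions):

```python
import random

def pso_minimize(objective, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=42):
    """Minimal global-best PSO; returns the best position found and its value."""
    rng = random.Random(seed)
    # Steps 1-2: random initial population and zero velocities
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # Step 6: global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Step 7: velocity update toward personal and global bests
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]               # Step 8: position update
            val = objective(pos[i])                  # Step 4: evaluate
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Usage: minimize the sphere function, whose optimum is at the origin.
best, best_val = pso_minimize(lambda x: sum(v * v for v in x), dim=3)
```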
A database is a rectangular table consisting of a number of records:

DB = {R1, R2, … , Rn} --- (1)

and each record has a number of entities:

DB = [ e11 e12 … e1m
       e21 e22 … e2m
       :   :       :
       en1 en2 … enm ] --- (2)

where e_ij is the entity at row i and column j of the data; i represents the row and j represents the column. In this paper the threshold value is a user-defined, very small value between 0 and 1.
Fig.1: Proposed Approach
The overall functionality of the proposed approach is depicted in Fig.1. The database may be in any form, such as Oracle, SQL Server, MySQL, MS-Access or Excel.
PREPROCESSING
Consider, for example, employee data for a multinational company whose branches are located all over the world. The entire data set is read from the database and inspected for '~', empty spaces, '#', '*' and other irrelevant characters placed as entities in the database. [For example, if an entity is numerical data, it should contain only the digits 0 to 9; if it is a name, it should consist only of alphabets combined with '.', '_' or '-'.] When irrelevant characters are present in a data set, the affected entities are treated as erroneous data and are corrected, removed or replaced by relevant characters.

If the data type of the field is a string, the preprocessing function assigns "NULL" to the corresponding entity; if the data type is numeric, the preprocessing function assigns 0s [according to the length of the numeric data type]. Similarly, the preprocessing function replaces the entity with today's date if the data type is 'date', with '*' if the data type is 'character', and so on. Once the data is preprocessed, SQL queries return good results; otherwise errors are generated.
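The preprocessing rules described above can be sketched as follows (a hypothetical illustration, not the authors' code; the field specification, function names and regular expressions are assumptions):

```python
import re

# Hypothetical field specification: field name -> declared type
FIELDS = {"No": "numeric", "Name": "string", "Age": "numeric", "City": "string"}

def clean_entity(value, ftype):
    """Replace an entity containing irrelevant characters, per its field type."""
    if ftype == "numeric":
        # numeric entities should contain only the digits 0-9; otherwise assign "0"
        return value if re.fullmatch(r"[0-9]+", value) else "0"
    # names/strings may combine alphabets with '.', '_' and '-'; otherwise "NULL"
    return value if re.fullmatch(r"[A-Za-z._-]+", value) else "NULL"

def preprocess(record):
    """Apply the cleaning rule to every declared field of one record."""
    return {f: clean_entity(record.get(f, ""), t) for f, t in FIELDS.items()}

row = {"No": "0002", "Name": "Ramu", "Age": "##", "City": "Chennai"}
cleaned = preprocess(row)   # the erroneous Age "##" becomes "0"
```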
[Fig.1 flowchart: Load Data → Pre-process the data → Divide data into windows → Find Similarity Ratio → Normalize the Data → Check data redundancy & error → if yes, Mark redundant data; if no, Persist the data → Anomaly Detection]

For example, in Table-1 below, the first row gives the field names and the remaining rows contain records. In the first record, the fourth field contains the irrelevant character "~"; in the same way, the third field of the second record contains "##" instead of numbers. An error therefore occurs when a query such as

Select City from EMP;

is passed against the table EMP [Table-1]. To avoid errors during query processing, the City and Age fields are corrected by verifying the original data sources. If that is not possible, "NULL" is substituted for alphanumeric fields and "0" for numeric fields to replace and correct the error. Records that cannot be corrected are marked ['*'] and moved to a separate pool area.
Table-1: Sample Error Records Pre-Processed and Marked ['*'] [EMP]

No     Name   Age  City     State  Comment
0001*  Kabir  45   ~ty             Employee
0002*  Ramu   ##   Chennai  TN     Employee
The entire data can be divided into sub-windows for easy and fast processing. Let the dataset be DB; it can be divided into the sub-windows DB1 and DB2 shown in Fig.2, and each of DB1 and DB2 contains a number of windows 1, 2, … .
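Dividing a dataset into fixed-size sub-windows can be sketched as follows (an assumed helper, not from the paper; the function name and window size are illustrative):

```python
def split_into_windows(records, window_size):
    """Divide a dataset into consecutive sub-windows of at most window_size records."""
    return [records[i:i + window_size] for i in range(0, len(records), window_size)]

# Usage: 1000 records split into 4 sub-windows of 250 records each.
windows = split_into_windows(list(range(1000)), 250)
```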
DATA NORMALIZING
In general, an uncertain data stream is considered for anomaly detection. The main problem addressed in this paper is anomaly detection for any kind of data stream. Since the size of the data stream is huge, our approach divides the complete data into subsets of data streams. A data stream DS is divided into two uncertain data streams, DS1 and DS2, where both streams consist of a sequence of continuously occurring uncertain objects at various time intervals, denoted as

DS1 = {x[1], x[2], … , x[t], …} --- (3)
DS2 = {y[1], y[2], … , y[t], …} --- (4)

where x[t] (or y[t]) is a k-dimensional uncertain object at the time interval t, and t is the current time interval. For grouping by nearest neighbor, the operator should retrieve close pairs of objects within a period; thus a compartment window concept is adopted for the uncertain stream group (USG) operator. As shown in Fig.2, a USG operator always considers the most recent cw uncertain objects in the stream, that is

CW(DS1) = {x[t − cw + 1], x[t − cw + 2], … , x[t]} --- (5)
CW(DS2) = {y[t − cw + 1], y[t − cw + 2], … , y[t]} --- (6)
at the current time interval t. In other words, when a new uncertain object x[t+1] (y[t+1]) arrives at the next time interval (t+1), the new object x[t+1] (y[t+1]) is appended to DS1 (DS2). At that moment the old object x[t−cw+1] (y[t−cw+1]) expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on a new compartment window {x[t−cw+2], … , x[t+1]} ({y[t−cw+2], … , y[t+1]}) of size cw.
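The compartment-window behavior, where appending the new object x[t+1] evicts the expired object x[t−cw+1], can be sketched with a bounded deque (a minimal illustration; the class and method names are assumptions):

```python
from collections import deque

class CompartmentWindow:
    """Keeps only the most recent cw objects of an uncertain stream."""
    def __init__(self, cw):
        # a deque with maxlen automatically ejects the expired object
        self.buf = deque(maxlen=cw)

    def append(self, obj):
        # x[t+1] arrives; if the window is full, x[t-cw+1] expires
        self.buf.append(obj)

    def contents(self):
        return list(self.buf)

# Usage: with cw = 3, after five arrivals only the last three objects remain.
w = CompartmentWindow(cw=3)
for t in range(5):
    w.append(f"x[{t}]")
```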
Fig.2: Data Set Divided as Sub-Windows
To group the uncertain data streams, we take the two data streams DS1 and DS2, a distance threshold ε, and a probabilistic threshold α ∈ [0, 1]. A group operator on uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[j] within the compartment windows CW(DS1) and CW(DS2), respectively, of size cw at the current time stamp t. Here the similarity distance between DS1 and DS2 is obtained using PSO, such that

PSO(Pr{dist(x[i], y[j]) ≤ ε} ≥ α) --- (7)

holds, where t − cw + 1 ≤ i, j ≤ t and dist(·, ·) is the Euclidean distance function between two objects. To perform a USG query, Equation (7), users need to register two parameters in PSO: the distance threshold ε and the probabilistic threshold α. Since each uncertain object at a timestamp consists of R samples, the grouping probability Pr{dist(x[i], y[j]) ≤ ε} in Inequality (7) can be rewritten via samples as
Pr{dist(x[i], y[j]) ≤ ε} = Σ { x1[i].p · x2[j].p, if dist(x1[i], x2[j]) ≤ ε; 0 otherwise } --- (8)
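Equation (8) estimates the grouping probability from the samples of the two uncertain objects. A sketch under the simplifying assumption of uniformly weighted samples (so every sample pair contributes equally; function names and the example data are illustrative):

```python
import math

def grouping_probability(x_samples, y_samples, eps):
    """Fraction of sample pairs (s1, s2) whose Euclidean distance is <= eps."""
    hits = total = 0
    for s1 in x_samples:
        for s2 in y_samples:
            total += 1
            if math.dist(s1, s2) <= eps:   # Euclidean distance between samples
                hits += 1
    return hits / total

# Two uncertain objects, each given by two 2-D samples.
x = [(0.0, 0.0), (0.1, 0.0)]
y = [(0.0, 0.05), (2.0, 2.0)]
p = grouping_probability(x, y, eps=0.5)   # 2 of the 4 sample pairs are within 0.5
is_group = p >= 0.5                       # compare against probabilistic threshold alpha
```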
[Fig.2 sketch: uncertain streams DS1 and DS2 with their compartment windows CW(DS1) and CW(DS2) at time interval t; at (t+1) the expired uncertain object is ejected as the new uncertain object arrives, and USG answers are produced from the two windows]
Note that one straightforward way to perform USG over compartment windows is to follow the USG definition directly: for every object pair <x[i], y[j]> from the compartment windows CW(DS1) and CW(DS2), respectively, we compute the grouping probability that x[i] is within distance ε of y[j] (via samples) based on (8). If the resulting probability is greater than or equal to the probabilistic threshold α, the pair <x[i], y[j]> is reported as a USG answer; otherwise it is a false alarm and can be safely discarded. The number of false alarms is counted using PSO by repeating the process a number of times and generating particles in the search space for each individual datum. For any comparison, verification or other related task, window-based data makes the task easy and quick for any DBMS; for example, a database of 1000 records can be divided into 4 sub-datasets of 250 records each.

Data in the database can be normalized using any normalization form for fast and accurate query processing. In this paper a user-defined normalization is also applied to improve efficiency, such as arranging the data in ascending or descending order according to the SQL query keywords.
PSO BASED SIMILARITY COMPUTATION
This paper focuses on applying a PSO-based comparison to decide whether data are similar or dissimilar. PSO uses a measurement between two data items in a database, defined by appropriate features. Since it accounts for unequal variances as well as the correlation between features, it adequately evaluates the distance by assigning different weights or importance factors to the features of the data entities. In this paper, data inconsistency can thereby be removed in real-time digital libraries.
Assume two groups, G1 and G2, holding data about girls and boys in a school, and let a number of girls be categorized into the same sub-group of G1, since their attributes or characteristics are the same. PSO computes this as

d(G1, G2) = (G1 − G2) ≤ 1 --- (9)
The correlation among datasets is computed using the Similarity-Distance. Data entities are the main objects of data mining, and they are arranged in an order according to their attributes. A data set with K attributes is considered as a K-dimensional vector, represented as:

x = (x1, x2, … , xK). --- (10)

N such data entities form a set

X = (x1, x2, … , xN) ⊂ ℝ^K --- (11)

known as the data set. X can be represented by an N × K matrix

X = [x_ij] --- (12)
where x_ij is the jth component of the data entity x_i. There are various methods used for data mining; numerous methods, for example NN-classification techniques, cluster analysis, and multi-dimensional scaling, are based on measures of similarity between data. As a replacement for measuring similarity, measuring dissimilarity among the entities gives the same results, and one parameter that can be used for measuring dissimilarity is distance. This category of measures is also known as separability, divergence or discrimination measures.

A distance metric is a real-valued function d such that for any data points x, y and z:

d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y --- (13)
d(x, y) = d(y, x) --- (14)
d(x, z) ≤ d(x, y) + d(y, z) --- (15)
The first property (13), positive definiteness, ensures that the distance is non-negative and is zero only when the two points are the same. The second property indicates the symmetric nature of distance, and the third is the triangle inequality. Various distance formulas are available, such as Euclidean, Manhattan, Lp-norm and the Similarity-Distance. In this paper the Similarity-Distance is taken as the main method to find the similarity distance between two data sets. The distance between a set of observed groups in m-dimensional space determined by m variables is known as the Similarity-Distance method; a smaller distance value means the data in the groups are very close, while a larger one means they are not. The mathematical formula of the Similarity-Distance for two data samples X and Y is written as:

d(X, Y) = sqrt( (X − Y)^T Σ⁻¹ (X − Y) ) --- (16)

where Σ⁻¹ is the inverse covariance matrix.
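Equation (16) has the form of the Mahalanobis distance. A dependency-free sketch with the inverse covariance matrix supplied explicitly (an illustration, not the authors' code; with the identity matrix it reduces to the Euclidean distance):

```python
def similarity_distance(x, y, inv_cov):
    """Mahalanobis-style distance: sqrt((x - y)^T * inv_cov * (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    # quadratic form d^T * inv_cov * d
    q = sum(d[i] * inv_cov[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return q ** 0.5

# Usage: with the identity inverse covariance this is the Euclidean distance.
ident = [[1.0, 0.0], [0.0, 1.0]]
dist = similarity_distance((3.0, 0.0), (0.0, 4.0), ident)
```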
The similarity value between the sub-windows of dataset DB1 and dataset DB2 is computed and the result is stored in a variable named score:

score[i] = Σ ( w_i(DB1) − w_i(DB2) ) --- (17)

score[i] ≤ threshold → 1
score[i] = 0         → 0
score[i] > threshold → −1 --- (18)

The first line of (18) says that the data available in the two windows w_i(DB1) and w_i(DB2) are more or less similar; the second line says that they are exactly the same; and the third line says that the data are different. Whenever the distance between the datasets satisfies score[i] = 0 or score[i] ≤ threshold, both data are marked in the DB. The value of score[i] yields two outcomes:
TPR: if the similarity value lies above this boundary [−1 to 1], the records are considered replicas;
TNR: if the similarity value lies below this boundary, the records are considered not to be replicas.

When the similarity value lies between the two boundaries, the records are classified as "possible matches"; in this case human judgment is also needed to assess the matching score. Most existing approaches to replica identification depend on several parameter choices, and these choices may not always be optimal. Setting these parameters requires accomplishing the following tasks:
Selecting the best evidence to use: with more evidence, it takes more time to find duplicates, because more processes must be applied to compute the similarity among the data. Deciding how to merge the best evidence: some evidence may be more effective for duplicate identification than others. Finding the best boundary values to use: bad boundaries may increase the number of identification errors (e.g., false positives and false negatives), nullifying the whole process.

Window1 from DB1 is compared with Window1, Window2, Window3 and so on from DB2, which can be written as:

score[i] = w_1(DB1) − Σ w_i(DB2) --- (19)

If score[i] = 0, then w_1(DB1) and w_i(DB2) are the same, and the pair is marked as duplicate; otherwise w_1(DB1) is compared with the next window of DB2.
The objective of this paper is to improve the quality of the data in a DBMS so that it is error-free and can provide fast outputs for any SQL query. It also concentrates on de-duplication where possible in the data model. The removal of duplicates is inefficient and difficult in government-based organizations; avoiding duplicate data enables high-quality retrieval from huge data sets such as those in banking.
DATA:
Two real-world data sets commonly employed for evaluating record de-duplication were used to test the proposed approach; they are based on current data gathered from web indexes. Additionally, further data sets were created using a synthetic data set generator. The first, the Cora data set, is a collection of 1,295 distinct citations to 122 computer science papers taken from the Cora research paper search engine. These citations were separated into multiple attributes (author names, year, title, venue, pages and other info) by an information extraction system. The second real-world data set, the Restaurants data set, comprises 864 records of restaurant names and supplementary data, including 112 replicas, obtained by integrating records from Fodor's and Zagat's guidebooks. The following attributes of this data set were used: (restaurant) name, address, city, and specialty. The synthetic data sets were created using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26] package.
Since real-world data sets, such as time series data, the 20 Newsgroups data set and customer data from OLX.in, are not always sufficient or easily accessible for experiments, synthetic data were also used. These contain fields such as name, age, city, address, phone number, etc. (like a social security number). Using SDG, errors and duplicates can also be introduced into the data manually, and some modifications can be applied at the record-attribute level. The data taken for the experiments are:

DATA-1: This data set contains four files of 1000 records (600 originals and 400 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and in the full record.

DATA-2: This data set contains four files of 1000 records (750 originals and 250 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and four in the full record.

DATA-3: This data set contains four files of 1000 records (800 originals and 200 duplicates), with a maximum of seven duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of four modifications in a single attribute and five in the full record. The duplication can be applied to each attribute of the data in the form [i.e. the evidence]

⟨attribute, similarity-function⟩

The experiment on the time series data was done in MATLAB, and the time complexity was compared with the existing system. The elapsed time taken by the proposed approach is 5.482168 seconds. The results obtained for all the functionalities defined in Fig.1 are depicted in Fig.3 to Fig.6.
Fig.3: Original Data Not Preprocessed
Fig.3 shows the original data as taken from the web; it has errors, redundancy and noise. The three lines show that the data DB is divided into DB1, DB2 and DB3. It is clear from the figure that DB1, DB2 and DB3 coincide and overlap in many places, which indicates data redundancy; the zigzag form shows that the data are not preprocessed. In the time series data, 14 numerical entries were preprocessed [replaced with 0s], as verified from the database.
Fig.4: Preprocessed Data
Fig.4 shows the data after preprocessing and normalization. The user-defined normalization arranges the data in order for easy processing. Even within themselves, DB1, DB2 and DB3 have overlapping data, which indicates that much of the data is similar; that DB1 and DB2 have more similar, overlapping data is clearly shown in Fig.4. After finding the similarity index, the matching data can be marked as duplicates and removed for easier processing.
Fig.5: Single Window Data in DB1
After normalization the data is divided into windows, as shown in Fig.5, where the window size of 50 is defined by the developer; each window holds 50 data items for fast comparison. In order to confirm the behavior observed with real data, we conducted additional experiments using our synthetic data sets. The user-selected evidence setup used in this experiment was built from the following list of evidence:

<firstname, PSO>, <lastname, PSO>, <street number, string distance>,
<address1, PSO>, <address2, PSO>, <suburb, PSO>, <postcode, string distance>,
<state, PSO>, <date of birth, string distance>, <age, string distance>,
<phone number, string distance>, <social security number, string distance>.

This list, using the PSO similarity function for free-text attributes and a string distance function for numeric attributes, was chosen because it required less processing time in our initial tuning tests.
Table-2: Original Data
Data set          Original Data  Good Data  Similar Data  Error Data
Time Series       1000           600        400           24%
Restaurant        1000           750        250           15%
Student Database  1000           800        200           12.4%
Cora              1000           700        300           19.2%
Table-3: Data Duplication Detection and De-Duplication

Data set     Original Data  Marked Duplication  De-Duplicated  Not De-Duplicated
Time Series  1000           400                 395            5
Restaurant   1000           250                 206            44
Student DB   1000           200                 146            54
Cora         1000           300                 244            46
Fig.6: Performance Evaluation of Proposed Approach
The performance of the proposed approach is evaluated by comparing the detection of duplicates and errors, the marking of duplicates, the number of de-duplications achieved and the error corrections across the various datasets. Fig.6 shows the performance evaluation of the proposed approach using the Similarity-Distance. According to the distance score, the duplicate and erroneous records are detected and marked. The Similarity-Distance rectifies errors of 24%, 15%, 12.4% and 19.2% for the Time series, Restaurant, Student and Cora data respectively. The number of duplicate records detected by PSO is 400, 250, 200 and 300 for the Time series, Restaurant, Student and Cora data, and the de-duplicated data are
395, 206, 146 and 244 respectively. Owing to greater complexity of, or errors in, the data, 100% de-duplication is not achieved.
Some performance metrics can be calculated to evaluate the accuracy of the proposed approach:

TPR = (number of duplicates found correctly) / (total number of data)
TNR = TN / N
FPR = (number of duplicates wrongly obtained) / (total number of data to be identified)
FNR = FN / P

Sensitivity = TP / P = 99%
Specificity = TN / N = 88.5%
Accuracy = (TP + TN) / (P + N) = 96.3%

where P = TP + FN and N = FP + TN. The proposed approach demonstrates better efficiency in terms of duplication detection, error detection and de-duplication, with an accuracy of 96.3%. Hence Similarity-Distance-based duplication detection is more efficient.
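The sensitivity, specificity and accuracy figures above follow the standard confusion-matrix formulas, which can be sketched as below (the counts used here are illustrative assumptions, not the paper's actual confusion matrix):

```python
def metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics used to evaluate de-duplication."""
    p, n = tp + fn, fp + tn          # P = TP + FN, N = FP + TN
    return {
        "TPR (sensitivity)": tp / p,
        "FPR": fp / n,
        "TNR (specificity)": tn / n,
        "accuracy": (tp + tn) / (p + n),
    }

# Illustrative counts only; chosen so sensitivity = 0.99 and specificity = 0.885.
m = metrics(tp=990, fn=10, fp=115, tn=885)
```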
CONCLUSION
In this paper the PSO-based distance method is taken as the main method for finding similarity [redundancy] in any database; the similarity score is computed for various databases and the performance is compared. The accuracy obtained with the proposed approach is 96.3% across four different databases. The time series data is stored in Excel, the Cora data as a table, the student data in MS-Access and the restaurant data as an SQL table. The experimental results show that with the proposed approach it is easy to perform anomaly detection and removal in terms of data redundancy and error. In future work, reliability and scalability will be investigated with respect to data size and data variation.