Detecting and eliminating duplicate data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. The same logical real-world entity often has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity but are not syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. Its main objective is to detect exact and inexact duplicates by using duplicate detection and elimination rules. The approach aims to improve the efficiency of working with the data.
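To make the distinction between exact and inexact duplicates concrete, here is a minimal sketch in Python. It is not the paper's rule set: the normalization step, the per-field `SequenceMatcher` similarity, the `0.85` threshold, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse whitespace so trivial variants compare equal."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Average per-field similarity ratio in [0, 1] (assumes equal, non-zero field counts)."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of exact or inexact duplicates."""
    kept, kept_normalized = [], []
    for record in records:
        norm = normalize(record)
        is_duplicate = any(
            norm == seen or similarity(norm, seen) >= threshold  # exact, then inexact rule
            for seen in kept_normalized
        )
        if not is_duplicate:
            kept.append(record)
            kept_normalized.append(norm)
    return kept


# Hypothetical sample data: (name, address)
records = [
    ("John Smith", "12 Main Street"),
    ("john  smith", "12 Main Street"),  # exact duplicate once normalized
    ("Jon Smith", "12 Main Street"),    # inexact duplicate (typographical error)
    ("Mary Jones", "4 Oak Avenue"),
]
unique = deduplicate(records)
```

This pairwise comparison is quadratic in the number of records, so a real data warehouse would usually restrict comparisons to candidate groups first; that step is omitted here.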
International Journal of Engineering Research and Applications (IJERA) is run by a team of researchers, not by a publication service or private publisher operating journals for monetary benefit. We are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal primarily aims to bring out the research talent and work of scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. The journal aims to cover scientific research in a broad sense rather than a niche area, so that researchers from various verticals can publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue further. All published articles are freely available to scientific researchers in government agencies, educators, and the general public. We are making serious efforts to promote our journal across the globe in various ways, and we are sure that it will act as a scientific platform for all researchers to publish their work online.
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text information is generated every day. Effectively searching, managing, and exploring this text data has become a major task. In this paper, we first present an introduction to text mining and the LDA topic model. We then explain in depth how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all the necessary steps: data retrieval, pre-processing, model fitting, and an application in the form of a document exploring system. The experimental results show that the LDA topic model works effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
Document Classification Using Expectation Maximization with Semi Supervised L... (ijsc)
As the number of online documents increases, so does the demand for document classification to aid the analysis and management of documents. Text is cheap, but information, in the form of knowing which classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation maximization technique of data mining for classifying documents and to learn how to improve accuracy using a semi-supervised approach. The expectation maximization algorithm is applied with both supervised and semi-supervised approaches. The semi-supervised approach is found to be more accurate and effective. Its main advantage is the “dynamic generation of new classes”. The algorithm first trains a classifier using the labeled documents and probabilistically classifies the unlabeled documents. The car dataset used for evaluation is collected from the UCI repository, with some changes made on our side.
Clustering technology has been applied in numerous applications. It can enhance the performance of information retrieval systems, and it can group Internet users to help improve the click-through rate of online advertising. Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering, and Spectral clustering. In recent years, two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest data clustering algorithms with state-of-the-art algorithms, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, AP, DP, and DBSCAN should all be considered. Moreover, we find that the comparison of different clustering algorithms is closely related to the clustering evaluation metrics adopted. For instance, when using the Silhouette clustering validation metric, the overall performance of K-Means is as good as that of AP and DP. This work has important reference value for researchers and engineers who need to select appropriate clustering algorithms for their specific applications.
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents information about the data to be stored in data warehouses. It is a mandatory element for building an efficient data warehouse. Metadata helps with data integration, lineage, data quality, and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data mostly collected from Geographical Information Systems (GIS) and from transactional systems that are specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring the spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for spatial data warehouses (SDWs). Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time, and accurate information, including spatial information, is fetched from the data warehouse.
Analysis of Classification Algorithm in Data Mining (ijdmtaiir)
Data mining is the extraction of hidden predictive information from large databases. Classification is the process of finding a model that describes and distinguishes data classes or concepts. This paper studies the prediction of class labels using the C4.5 and Naïve Bayesian algorithms. C4.5 generates classifiers expressed as decision trees from a fixed set of examples. The resulting tree is used to classify future samples. The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node. A decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. C4.5 uses information gain to help it decide which attribute goes into a decision node. A Naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. The Naïve Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. The results indicate that predicting class labels using the Naïve Bayesian classifier is very effective and simple compared to the C4.5 classifier.
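The attribute-selection step mentioned in the abstract above can be illustrated with a short Python sketch of entropy-based information gain. The toy rows and labels are invented for the example, and the sketch covers plain information gain only; C4.5 as usually described normalizes this quantity into a gain ratio, which is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute_index):
    """Reduction in label entropy obtained by splitting on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

gains = [information_gain(rows, labels, i) for i in range(2)]
best_attribute = max(range(2), key=lambda i: gains[i])  # attribute chosen for the decision node
```

In this toy data the first attribute separates the classes perfectly, so it receives the full one bit of gain, while the second attribute gives no gain at all.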
The document describes a new technique called DNR (Improving search engines by demoting non-relevant documents) that aims to improve search engine precision. DNR generates additional queries by combining the terms from the original query in different ways. Documents retrieved from the new queries are evaluated using a heuristic to identify non-relevant documents. These non-relevant documents are then moved lower down in the search results, improving precision by promoting more relevant documents higher in the results. The technique is tested on a test collection using various retrieval models and weighting measures, with precision and recall used to evaluate performance against the original queries.
With the development of databases, the volume of data stored in them is increasing rapidly, and much important information is hidden in these large amounts of data. If this information can be extracted from the database, it can create a lot of profit for the organization. The question organizations are asking is how to extract this value. The answer is data mining. There are many technologies available to data mining practitioners, including Artificial Neural Networks, Genetics, Fuzzy Logic, and Decision Trees. Many practitioners are wary of neural networks because of their black-box nature, even though they have proven themselves in many situations. This paper gives an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
A Text Mining Research Based on LDA Topic Modelling (csandit)
A large amount of digital text information is generated every day. Effectively searching, managing, and exploring this text data has become a major task. In this paper, we first present an introduction to text mining and to a probabilistic topic model, Latent Dirichlet allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring, and recommending articles. The latter sets up a user topic model, providing a full study and analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing, and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
Privacy Preserving Clustering on Distorted data (IOSR Journals)
- The document discusses privacy-preserving clustering on distorted data using singular value decomposition (SVD) and sparsified singular value decomposition (SSVD).
- It applies SVD and SSVD to distort a real-world dataset of 100 terrorists with 42 attributes, generating distorted datasets.
- K-means clustering is then performed on the original and distorted datasets for different numbers of clusters (k). The results show that SSVD more effectively groups the data objects into clusters compared to the original and SVD-distorted datasets, while preserving data privacy as measured by various metrics.
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we view the clustering task as cataloguing the search results. By catalogue we mean a structured label list that can help the user make sense of the labels and search results. Labelling clusters is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced. More emphasis is given to producing comprehensible and accurate cluster labels in addition to discovering document clusters. We also present a new metric to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available datasets Ambient and ODP-239.
Data preprocessing involves cleaning data by filling in missing values, smoothing noisy data, and resolving inconsistencies. It also includes integrating and transforming data from multiple sources, reducing data volume through aggregation, dimensionality reduction, and discretization while maintaining analytical results. The key goals of preprocessing are to improve data quality and prepare the data for mining tasks through techniques like data cleaning, integration, transformation, reduction, and discretization of attributes into intervals or concept hierarchies.
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews. The reviews are provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of communities of similar users. Collaborative filtering assumes that people with the same tastes choose the same items. Conventional collaborative filtering systems have drawbacks such as the sparse data problem and a lack of scalability. A new recommender system is required to deal with the sparse data problem and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model that is widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations in data analysis is the join operation. However, MapReduce is not very efficient at executing joins, as it always processes all records in the datasets even when only a small fraction of them are relevant to the join. This problem can be reduced by applying the bloomjoin algorithm. Bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using Bloom filters will reduce the number of intermediate results and improve join performance.
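The filtering idea behind a Bloom-filter join can be sketched in a few lines of single-process Python. This is only an illustration of the principle, not the paper's MapReduce algorithm: the `BloomFilter` class, its size and hash count, the `bloom_join` helper, and the sample tables are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


def bloom_join(small, large):
    """Join two lists of (key, value) pairs, pre-filtering `large` with a Bloom filter."""
    bloom = BloomFilter()
    for key, _ in small:
        bloom.add(key)
    # Records whose key cannot match are dropped before the join itself.
    candidates = [(key, value) for key, value in large if bloom.might_contain(key)]
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    joined = [(key, left, right) for key, right in candidates for left in lookup.get(key, [])]
    return joined, len(candidates)


# Hypothetical sample tables keyed by user id
users = [(1, "alice"), (2, "bob")]
ratings = [(1, "item-a"), (3, "item-b"), (2, "item-c"), (4, "item-d")]
joined, candidate_count = bloom_join(users, ratings)
```

Because a Bloom filter can report false positives, the exact lookup after filtering is still needed; the filter only shrinks the set of records that reach it.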
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document summarizes a research paper that proposes a novel approach to improve the detection rate and search efficiency of signature-based network intrusion detection systems (NIDS). The approach uses data mining and classification algorithms like C4.5 and ensemble algorithms like MadaBoost to improve detection rates. It also uses a modified signature apriori algorithm to more efficiently search for signatures of related attacks based on known signatures, in order to improve search efficiency. The full paper describes these approaches in more technical detail and evaluates their effectiveness at improving NIDS performance.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. The semi-supervised classification technique uses labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms that have been studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results from applying our algorithm to three different datasets. Our experiments show an increase in accuracy of 4% on average and up to 20% in comparison with the traditional semi-supervised model.
This document discusses scoring and ranking documents in information retrieval systems. It introduces the vector space model and term weighting schemes like TF-IDF that are used to assign relevance scores to documents for a given query. TF-IDF weighting increases scores for terms that appear frequently in a document but rarely in the whole collection. This allows more relevant documents containing rare, informative query terms to be ranked higher. IDF on its own does not affect ranking for single-term queries but boosts rarer terms' influence for multi-term queries.
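As a rough illustration of the TF-IDF scoring described above, the following Python sketch ranks a few toy documents against a query. The log-scaled term frequency, the base-10 IDF, the whitespace tokenizer, and the sample documents are assumptions for the example; many other weighting variants exist.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document as the sum of tf-idf weights of the query terms it contains."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in tokenized:
        term_frequency = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term_frequency[term]:
                tf = 1 + math.log10(term_frequency[term])
                idf = math.log10(n_docs / document_frequency[term])
                score += tf * idf
        scores.append(score)
    return scores


# Hypothetical toy collection
docs = ["the cat sat on the mat", "the dog chased the cat", "the aardvark slept"]
scores = tf_idf_scores("the aardvark", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the term "the" occurs in every document, so its IDF is zero and it contributes nothing, while the rare term "aardvark" decides the ranking.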
The document discusses data mining and knowledge discovery in databases. It defines data mining as the nontrivial extraction of implicit and potentially useful information from large amounts of data. With huge increases in data collection and storage, data mining aims to analyze data and discover patterns that can provide insights and knowledge about businesses and the real world. The data mining process involves selecting, preprocessing, transforming, and analyzing data to extract hidden patterns and relationships, which are then interpreted and evaluated.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US... (IJDKP)
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
Enhancement techniques for data warehouse staging area (IJDKP)
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
This paper proposes a classification-based approach for suppressing data to prevent sensitive information from being inferred. It uses decision tree algorithms to classify data elements based on attributes and considers suppressing data elements to secure the data. The paper aims to enhance data classification and generalization. It shows how data can be secured using "generalization" while maintaining usefulness for data mining tasks. The proposed system focuses on data generalization concepts to hide detailed information for privacy while allowing standard data mining techniques to still discover patterns. It evaluates suppressing multiple confidential values and developing a technique independent of individual classification methods based on information theory.
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
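The mapping from multi-dimensional cells to a linear ordering can be illustrated with the well-known iterative conversion from 2-D grid coordinates to a Hilbert curve index. This sketch covers only the two-dimensional case on a square grid whose side is a power of two; it says nothing about how GCUBE itself builds its index, and the function name `hilbert_index` is chosen here for illustration.

```python
def hilbert_index(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) in an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Linear ordering of all cells in a 4x4 grid
cells = [(x, y) for x in range(4) for y in range(4)]
order = sorted(cells, key=lambda cell: hilbert_index(4, cell[0], cell[1]))
```

Consecutive cells in `order` are always grid neighbours, which is the locality property that makes such an ordering attractive for range queries over a one-dimensional index.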
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
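The generative process summarized above can be sketched with the Python standard library alone. The two toy topics, the symmetric prior `alpha`, the document length, and the helper names are assumptions for illustration; the sketch only generates a document from fixed topics and does not perform inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Sample a topic mixture for the document, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic distribution
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        vocabulary, probabilities = zip(*topic_word[topic].items())
        words.append(rng.choices(vocabulary, weights=probabilities)[0])
    return theta, words


# Hypothetical topics: each maps words to probabilities
topic_word = [
    {"gene": 0.5, "dna": 0.4, "data": 0.1},
    {"data": 0.5, "model": 0.4, "gene": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic-word distributions and the per-document mixtures.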
This document summarizes an empirical study comparing several supervised machine learning approaches for word sense disambiguation: Naive Bayes, decision tree, decision list, and support vector machine (SVM). The study used a dataset of 15 words annotated with senses from WordNet and Senseval-3. Each approach was implemented and evaluated based on its accuracy in identifying the correct sense of each word. The results showed that the decision list approach achieved the highest overall accuracy of 69.12%, followed by naive Bayes at 58.32%, SVM at 56.11%, and decision tree at 45.14%. Thus, the study concluded that decision list performed best on this dataset for the task of word sense disambiguation.
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY (IJDKP)
This document summarizes an approach to improve source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs. A similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show the retrieval model using this measure improves retrieval performance over other models by up to 90.9% relative to the number of retrieved methods.
LOG MESSAGE ANOMALY DETECTION WITH OVERSAMPLING (ijaia)
Imbalanced data is a significant challenge in classification with machine learning algorithms. This is particularly important with log message data as negative logs are sparse so this data is typically imbalanced. In this paper, a model to generate text log messages is proposed which employs a SeqGAN network. An Autoencoder is used for feature extraction and anomaly detection is done using a GRU network. The proposed model is evaluated with three imbalanced log data sets, namely BGL, OpenStack, and Thunderbird. Results are presented which show that appropriate oversampling and data balancing improves anomaly detection accuracy.
Indexing based Genetic Programming Approach to Record Deduplication (idescitation)
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records caused by misspellings, field swaps, or other mistakes and inconsistencies. This process requires that we identify objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate as many duplicates as possible. GP with indexing is an optimization technique that helps to find the maximum number of duplicates in the database. We used a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Many industries and systems depend on the accuracy and reliability of databases to carry out operations. Therefore, the quality of the information stored in the databases can have significant cost implications for a system that relies on information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data.
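As a rough illustration of how an index can cut down the number of record comparisons in deduplication, the sketch below groups records by a blocking key and only compares records inside the same block. It is not the genetic programming method of the paper: the blocking key (first three letters of the name), the field names, the similarity function built on `difflib.SequenceMatcher`, and the 0.85 threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical index key: first three letters of the lower-cased name.
    return record["name"].strip().lower()[:3]


def similarity(a, b):
    # Average character-level similarity over the compared fields (assumed fields).
    fields = ("name", "city")
    scores = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(scores) / len(scores)


def find_duplicates(records, threshold=0.85):
    # Build the index: blocking key -> positions of the records that share it.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    # Compare only pairs that fall into the same block.
    pairs = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


records = [
    {"name": "Jonathan Smith", "city": "Boston"},
    {"name": "Jonathon Smith", "city": "Boston"},
    {"name": "Maria Garcia", "city": "Madrid"},
    {"name": "Jon Smyth", "city": "Austin"},
]
duplicate_pairs = find_duplicates(records)
print(duplicate_pairs)
```

The trade-off of blocking is that duplicates whose keys differ (for example, a typo in the first three letters) are never compared, so the choice of key matters as much as the similarity function.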
The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the size and number of databases are increasing even faster. The increase in use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data. Figure 1 from the Red Brick company illustrates the data explosion.
Data science involves extracting knowledge from data to solve business problems. The data science life cycle includes defining the problem, collecting and preparing data, exploring the data, building models, and communicating results. Data preparation is an essential step that can consume 60% of a project's time. It involves cleaning, transforming, handling outliers, integrating, and reducing data. Models are built using machine learning algorithms like regression for continuous variables and classification for discrete variables. Results are visualized and communicated effectively to clients.
This document presents a novel approach to anomaly detection in link mining based on applying mutual information. It adapts the CRISP-DM methodology for link mining and applies it to a case study using co-citation data. The methodology includes data description, preprocessing, transformation, exploration, modeling through graph mapping and hierarchical clustering, and evaluation. Mutual information is used to interpret the semantics of anomalies identified in clusters. The case study identifies collective and community anomalies and confirms mutual information can validate clustering results by showing strong links within clusters but independence between objects in one cluster.
This document provides lecture notes on information retrieval systems. It covers key concepts like precision and recall, different retrieval strategies including vector space model and probabilistic models, and retrieval utilities. The vector space model represents documents and queries as vectors in a shared space and calculates similarity using cosine similarity. Probabilistic models assign probabilities to terms and documents and estimate relevance probabilities. The notes discuss term weighting schemes, inverted indexes to improve efficiency, and integrating structured data with text retrieval. The overall objective is for students to learn fundamental models and techniques for information storage and retrieval.
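To make the vector space model concrete, here is a minimal sketch that represents a query and documents as raw term-count vectors and ranks the documents by cosine similarity. The whitespace tokenizer, the sample documents, and the function names are assumptions for illustration; real systems would add term weighting and an inverted index.

```python
import math
from collections import Counter


def term_vector(text):
    # Bag-of-words vector: term -> raw count (whitespace tokenization is an assumption).
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    # Return document positions ordered by decreasing similarity to the query.
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), i) for i, d in enumerate(documents)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


documents = [
    "duplicate record detection in data warehouses",
    "topic models for text mining",
    "record linkage and duplicate detection",
]
ranking = rank("duplicate detection", documents)
print(ranking)
```

Both matching documents contain the two query terms once each; the shorter one ranks first because cosine similarity normalizes by document length.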
A Text Mining Research Based on LDA Topic Modelling (csandit)
A large amount of digital text information is generated every day. Effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and a probabilistic topic model, Latent Dirichlet allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds up a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full analysis of Twitter users' interests. The experiment process, including data collecting, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could be a useful computation tool for social and business research.
Privacy Preserving Clustering on Distorted data (IOSR Journals)
- The document discusses privacy-preserving clustering on distorted data using singular value decomposition (SVD) and sparsified singular value decomposition (SSVD).
- It applies SVD and SSVD to distort a real-world dataset of 100 terrorists with 42 attributes, generating distorted datasets.
- K-means clustering is then performed on the original and distorted datasets for different numbers of clusters (k). The results show that SSVD more effectively groups the data objects into clusters compared to the original and SVD-distorted datasets, while preserving data privacy as measured by various metrics.
Clustering the results of a search helps the user to get an overview of the information returned. In this paper, we look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label list that can help the user to understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should reflect the contents of documents within the cluster accurately. To be able to label clusters effectively, a new cluster labelling method is introduced. More emphasis was given to producing comprehensible and accurate cluster labels in addition to the discovery of document clusters. We also present a new metric that is employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. We perform the experiments using the publicly available datasets Ambient and ODP-239.
Data preprocessing involves cleaning data by filling in missing values, smoothing noisy data, and resolving inconsistencies. It also includes integrating and transforming data from multiple sources, reducing data volume through aggregation, dimensionality reduction, and discretization while maintaining analytical results. The key goals of preprocessing are to improve data quality and prepare the data for mining tasks through techniques like data cleaning, integration, transformation, reduction, and discretization of attributes into intervals or concept hierarchies.
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews. The reviews are provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of a community of similar users. The collaborative filtering method assumes that people with the same tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse data problem and a lack of scalability. A new recommender system is required to deal with the sparse data problem and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model which is widely used for large-scale data analysis. The described recommendation algorithm for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations for data analysis is the join operation. But MapReduce is not very efficient at executing the join operation, as it always processes all records in the datasets even when only a small fraction of them are relevant to the join. This problem can be reduced by applying the bloomjoin algorithm. The bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using a bloom filter will reduce the number of intermediate results and will improve the join performance.
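The idea of filtering join inputs with a bloom filter can be sketched in a single process, without any MapReduce machinery. In the example below, the `BloomFilter` class, its size and hash count, and the `bloom_join` helper are all hypothetical names and parameters picked for illustration; they are not taken from the paper. The filter is built from the join keys of the smaller dataset, and records of the larger dataset whose keys cannot match are dropped before the join itself.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, kept simple for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(small, large):
    # Build the filter from the join keys of the smaller dataset.
    bf = BloomFilter()
    for key, _ in small:
        bf.add(key)

    # Drop records of the larger dataset that cannot possibly join.
    survivors = [(k, v) for k, v in large if bf.might_contain(k)]

    # Ordinary hash join on the surviving records.
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in lookup.get(k, [])]
    return joined, len(survivors)


users = [(1, "ann"), (2, "bob")]
ratings = [(1, "book"), (3, "pen"), (2, "lamp"), (4, "cup"), (1, "desk")]
joined, kept = bloom_join(users, ratings)
print(joined, kept)
```

A bloom filter never produces false negatives, so the join result is exact; false positives only mean that a few non-matching records survive the filter and are discarded later by the hash lookup.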
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document summarizes a research paper that proposes a novel approach to improve the detection rate and search efficiency of signature-based network intrusion detection systems (NIDS). The approach uses data mining and classification algorithms like C4.5 and ensemble algorithms like MadaBoost to improve detection rates. It also uses a modified signature apriori algorithm to more efficiently search for signatures of related attacks based on known signatures, in order to improve search efficiency. The full paper describes these approaches in more technical detail and evaluates their effectiveness at improving NIDS performance.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. The semi-supervised classification technique uses labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies in order to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, which are among the most effective algorithms that have been studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an increase in accuracy of 4% on average and up to 20%, in comparison with the traditional semi-supervised model.
This document discusses scoring and ranking documents in information retrieval systems. It introduces the vector space model and term weighting schemes like TF-IDF that are used to assign relevance scores to documents for a given query. TF-IDF weighting increases scores for terms that appear frequently in a document but rarely in the whole collection. This allows more relevant documents containing rare, informative query terms to be ranked higher. IDF on its own does not affect ranking for single-term queries but boosts rarer terms' influence for multi-term queries.
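A small worked example may help show why IDF matters for multi-term queries. The sketch below assumes one common TF-IDF variant, raw term frequency times `log(N / df)`, and a tiny hand-made corpus; other weighting variants exist and the function names are illustrative only.

```python
import math
from collections import Counter


def build_idf(docs):
    # idf(t) = log(N / df(t)), where df(t) is the number of documents containing t.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {t: math.log(n / df[t]) for t in df}


def tfidf_score(query, doc, idf):
    # Sum of tf * idf over the query terms; terms unseen in the corpus contribute 0.
    tf = Counter(doc)
    return sum(tf[t] * idf.get(t, 0.0) for t in query)


docs = [
    "data data data mining".split(),
    "data deduplication".split(),
    "data warehouse".split(),
]
idf = build_idf(docs)
scores = [tfidf_score(["data", "deduplication"], d, idf) for d in docs]
print(scores)
```

Here "data" occurs in every document, so its IDF is zero and the first document gains nothing from repeating it three times; the rare term "deduplication" decides the ranking.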
The document discusses data mining and knowledge discovery in databases. It defines data mining as the nontrivial extraction of implicit and potentially useful information from large amounts of data. With huge increases in data collection and storage, data mining aims to analyze data and discover patterns that can provide insights and knowledge about businesses and the real world. The data mining process involves selecting, preprocessing, transforming, and analyzing data to extract hidden patterns and relationships, which are then interpreted and evaluated.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US... (IJDKP)
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
Enhancement techniques for data warehouse staging area (IJDKP)
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
This paper proposes a classification-based approach for suppressing data to prevent sensitive information from being inferred. It uses decision tree algorithms to classify data elements based on attributes and considers suppressing data elements to secure the data. The paper aims to enhance data classification and generalization. It shows how data can be secured using "generalization" while maintaining usefulness for data mining tasks. The proposed system focuses on data generalization concepts to hide detailed information for privacy while allowing standard data mining techniques to still discover patterns. It evaluates suppressing multiple confidential values and developing a technique independent of individual classification methods based on information theory.
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
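The mapping from multi-dimensional cells to a linear order can be illustrated with the widely used iterative conversion from grid coordinates to a Hilbert curve position. The sketch below covers only the two-dimensional case on an n-by-n grid with n a power of two; the function name `hilbert_index` is an assumption, and the summary does not say which variant or dimensionality GCUBE itself uses.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its position on the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each quadrant contributes s*s cells; (3*rx) ^ ry gives the quadrant's visiting order.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the cells of a 4x4 grid along the curve.
cells = sorted((hilbert_index(4, x, y), (x, y)) for x in range(4) for y in range(4))
print([cell for _, cell in cells])
```

Cells that are consecutive in this ordering are always neighbours in the grid, which is why an index built on the linear order tends to keep spatially close data together.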
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
Data mining techniques are used to analyze large datasets and discover hidden patterns. There are three main types of data mining techniques: supervised, unsupervised, and semi-supervised learning. Supervised learning uses labeled training data to learn relationships between inputs and outputs. Unsupervised learning looks for patterns in unlabeled data. Semi-supervised learning uses some labeled and mostly unlabeled data. The knowledge discovery in databases (KDD) process is a nine step method for applying data mining techniques which includes data selection, preprocessing, transformation, mining, and interpretation.
This document proposes a new approach for preserving sensitive data privacy when clustering data. It involves adding noise to numeric attributes in the data using a fuzzy membership function, which distorts the data while maintaining the original clusters. This method is compared to other privacy preservation techniques like data swapping and noise addition. It is found to reduce processing time compared to other methods. The document outlines literature on privacy preservation techniques including data modification, cryptography, and data reconstruction methods. It then describes the proposed method of using a fuzzy membership function to add noise to sensitive attributes before clustering the data.
This document proposes a new approach for preserving sensitive data privacy when clustering data. It involves adding noise to numeric attributes in the data using a fuzzy membership function, which distorts the data while maintaining the original clusters. The fuzzy membership function uses an S-shaped curve to map original attribute values to modified values. Clustering is then performed on the distorted data. This approach aims to preserve privacy while reducing processing time compared to other privacy-preserving methods like cryptographic techniques, data swapping, and noise addition.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. There are three main methods used to discover patterns in data: KDD, SEMMA and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which can be relevant to the study of link mining. In this study CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. This approach is implemented through a case study of real-world data (co-citation data). This case study aims to use mutual information to interpret the semantics of anomalies identified in a co-citation dataset, which can provide valuable insights in determining the nature of a given link and potentially identifying important future link relationships.
The document discusses a link mining methodology adapted from the CRISP-DM process to incorporate anomaly detection using mutual information. It applies this methodology in a case study of co-citation data. The methodology involves data description, preprocessing, transformation, exploration, modeling, and evaluation. Hierarchical clustering identified 5 clusters, with cluster 1 showing strong links and cluster 5 weak links. Mutual information validated the results, showing cluster 5 had the lowest mutual information, indicating independent variables. The case study demonstrated the approach can interpret anomalies semantically and be used with real-world data volumes and inconsistencies.
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da... (IRJET Journal)
This document proposes a new one-to-many data linkage technique using a One-Class Clustering Tree (OCCT) to link records from different datasets. The technique constructs a decision tree where internal nodes represent attributes from the first dataset and leaves represent attributes from the second dataset that match. It uses maximum likelihood estimation for splitting criteria and pre-pruning to reduce complexity. The method is applied to the database misuse domain to identify common and malicious users by analyzing access request contexts and accessible data. Evaluation shows the technique achieves better precision and recall than existing methods.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data are regularly generated by domain-specific applications, such as medical, financial, library, telephone, shopping, and individual records. Sharing these data has proved to be beneficial for data mining applications. On the one hand, such data is an important asset for business decision making through analysis. On the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution which achieves the dual goal of privacy preservation and accuracy of the data mining task (clustering and classification). An efficient and effective approach has been proposed that aims to protect the privacy of sensitive information and to obtain data clustering with minimum information loss.
This document outlines a course on data warehousing and data mining. It introduces key concepts like relational databases, data warehouses, dimensional modeling, and data mining techniques. It also details the course objectives, schedule, assignments, and policies. The goal is for students to gain experience applying data mining methods and understanding the relationship between data mining and other fields.
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION (cscpconf)
Huge volumes of detailed personal data are regularly collected and analyzed by applications using data mining, and sharing these data is beneficial to the application users. On the one hand, it is an important asset to business organizations and governments for decision making; at the same time, analysing such data opens threats to privacy if not done properly. This paper aims to reveal the information while protecting sensitive data. We use a vector quantization technique for preserving privacy. Quantization is performed on training data samples and produces a transformed data set. This transformed data set does not reveal the original data. Hence privacy is preserved.
Document Classification Using Expectation Maximization with Semi Supervised L... (ijsc)
As the number of online documents increases, the demand for document classification to aid the analysis and management of documents is increasing. Text is cheap, but information, in the form of knowing what classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation maximization technique of data mining for classifying documents and to learn how to improve the accuracy while using a semi-supervised approach. The expectation maximization algorithm is applied with both supervised and semi-supervised approaches. It is found that the semi-supervised approach is more accurate and effective. The main advantage of the semi-supervised approach is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and probabilistically classifies the unlabeled documents. The car dataset used for evaluation is taken from the UCI repository, with some changes made by the authors.
This document summarizes frequent itemset mining algorithms. It introduces data mining and the Apriori algorithm. Apriori generates candidate itemsets and prunes those that are not frequent by scanning the database multiple times. The document proposes two new algorithms to improve efficiency: Impression reduces scans by pruning candidates using an impression table, while Transaction Database Spin reduces the database size between iterations by removing transactions not containing large itemsets. Both aim to reduce database access compared to Apriori.
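For reference, the baseline Apriori loop that the two proposed algorithms try to improve can be sketched as follows. This is a plain textbook-style version with an absolute support count, not the Impression or Transaction Database Spin variants; the sample transactions and the function name are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # One pass over the database to count every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(transactions, min_support=3)
print(sorted((sorted(s), n) for s, n in frequent.items()))
```

Each iteration of the `while` loop rescans the whole transaction list, which is the repeated database access that pruning tables or shrinking the database between iterations aim to reduce.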
This document discusses using data mining techniques like machine learning to analyze air quality data and generate models for predicting pollution levels. It summarizes applying decision trees and neural networks to data on pollutants and weather factors in Cambridge, UK. The models showed air temperature as the dominant predictor of ozone levels. While data mining provided insights, the author notes it is most useful complementing existing scientific domain knowledge and physical models of air quality.
This document discusses using data mining techniques like machine learning to analyze air quality data and generate models for predicting pollution levels. It summarizes applying decision trees and neural networks to data on pollutants and weather in Cambridge, UK. The models showed air temperature is a dominant predictor of ozone levels. While data mining provides an empirical approach and short-term predictions, physical models are still needed to fully understand urban air quality given small-scale variations.
Enhance The Technique For Searching Dimension Incomplete Databases (paperpublications3)
Abstract: Data ambiguity is a major problem in information retrieval. The ambiguity is due to the loss of data dimensions, and it causes many problems in various real-life applications. A database may be incomplete due to missing dimensions and values. Previous work focuses entirely on missing values; in our work, we focus on the problem of missing dimensions. Missing dimensions cause problems for the traditional query approach. Missing dimension information creates a computational problem, because a large number of possible combinations of missing dimensions need to be examined to check the similarity between the query object and the data objects. Our aim is to reduce the number of recovery versions in order to increase system performance: as the number of possible recovered data objects is reduced, the time to estimate the true result is also reduced. Keywords: Missing Dimensions, Similarity search, Whole sequence query, Probability triangle inequality, Temporal data.
Title: Enhance The Technique For Searching Dimension Incomplete Databases
Author: Mr. Amol Patil, Prof. Saba Siraj, Miss. Ashwini Sagade
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
2. 30 M.Anitha, A.Srinivas, T.P.Shekhar, D.Sagar
B. Element Identification

Supervised learning methods use only some of the fields in a record for identification. This is why query results obtained using supervised learning contain duplicate records. Unsupervised Duplicate Elimination (UDE) does not suffer from these types of user reference problems. A preprocessing step called exact matching is used for matching relevant records. It requires the data format of the records to be the same, so the exact matching method is applicable only to records from the same data source. Element identification thus merges the records that are exactly the same in the relevant matching fields.

C. Ontology matching

The term Ontology is derived from the Greek words 'onto', which means being, and 'logia', which means written or spoken discourse. In short, it refers to a specification of a conceptualization.

Ontology basically refers to the set of concepts such as things, events and relations that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. Ontologies can be represented in textual or graphical formats. Usually, graphical formats are preferred because they are easier to understand. Ontologies with a large knowledge base [5] can be represented in different forms such as hierarchical trees, expandable hierarchical trees, hyperbolic trees, etc. In the expandable hierarchical tree format, the user has the freedom to expand only the node of interest and leave the rest in a collapsed state [2]. If necessary, the entire tree can be expanded to get the complete knowledge base. This type of format can be used only when there are a large number of hierarchical relationships. Ontology matching is used for finding the matching status of the record pairs by matching the record attributes.

III. SYSTEM METHODOLOGY

A. Unsupervised Duplicate Elimination

UDE employs a similarity function to find field similarity. We use a similarity vector to represent a pair of records.

Input: Potential duplicate vector set P
       Non-duplicate vector set N
Output: Duplicate vector set D
C1: a classification algorithm with adjustable parameters W that identifies duplicate vector pairs from P
C2: a supervised classifier, SVM

Algorithm:
1. D = ∅
2. Set the parameters W of C1 according to N
3. Use C1 to get a set of duplicate vector pairs d1 and f from P and N
4. P = P - d1
5. while |d1| ≠ 0
6.    N' = N - f
7.    D = D + d1 + f
8.    Train C2 using D and N'
9.    Classify P using C2 and get a set of newly identified duplicate vector pairs d2
10.   P = P - d2
11.   D = D + d2
12.   Adjust the parameters W of C1 according to N' and D
13.   Use C1 to get a new set of duplicate vector pairs d1 and f from P and N
14.   N = N'
15. Return D
Figure 1: UDE Algorithm

B. Certainty factor

In the existing method of duplicate data elimination [10], the certainty factor (CF) is calculated by classifying attributes according to their distinct values, missing values, type and size. These attributes are identified manually based on the type of the data and on which data is most important in that data warehouse. For example, if the name, telephone and fax fields are used for matching, then a high value is assigned to the certainty factor. In this research work, the best attributes are identified in the early stages of data cleaning. The attributes are selected based on specific criteria and on the quality of the data. The attribute threshold value is calculated based on the measurement type and the size of the data. These selected attributes are well suited for the data cleaning process. The certainty factor is assigned based on the attribute types, as shown in the following table.

Table 1: Classification of attribute types
S. No   Key Attribute   Distinct values   Missing values   Types of data   Size of data
1       √               -                 -                √               √
2       √               -                 -                -               √
3       √               -                 -                √               -
4       -               √                 √                √               √
5       -               √                 √                -               √
6       -               √                 √                √               -
7       -               √                 -                √               √
8       -               √                 √                -               -
9       -               √                 -                √               -
10      -               √                 -                -               √
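The UDE loop of Figure 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `ude`, the toy `ThresholdC1` classifier and the nearest-centroid `centroid_c2` (standing in for the SVM) are all hypothetical. The sketch also removes d1 from P after step 13, mirroring step 4, which the figure does not state explicitly but which guarantees that the loop terminates.

```python
from typing import Callable, Set, Tuple

Vector = Tuple[float, ...]


class ThresholdC1:
    """Toy stand-in for C1 (assumption): flags a similarity vector as a
    duplicate when its mean field similarity reaches a fixed threshold."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold

    def set_params(self, non_dups: Set[Vector], dups: Set[Vector]) -> None:
        # A real C1 would adjust its parameters W from N and D here;
        # the toy version keeps its threshold fixed.
        pass

    def detect(self, P: Set[Vector], N: Set[Vector]):
        def hit(v: Vector) -> bool:
            return sum(v) / len(v) >= self.threshold
        # d1: duplicates found in P, f: vectors in N that look like duplicates
        return {v for v in P if hit(v)}, {v for v in N if hit(v)}


def centroid_c2(D: Set[Vector], N_prime: Set[Vector]) -> Callable[[Vector], bool]:
    """Toy stand-in for the SVM C2 (assumption): nearest-centroid classifier."""
    def centroid(vectors):
        vectors = list(vectors)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def dist2(v, c):
        return sum((a - b) ** 2 for a, b in zip(v, c))

    if not N_prime:                      # no negative examples: flag nothing
        return lambda v: False
    c_dup, c_non = centroid(D), centroid(N_prime)
    return lambda v: dist2(v, c_dup) < dist2(v, c_non)


def ude(potential, non_duplicates, c1, train_c2) -> Set[Vector]:
    P, N = set(potential), set(non_duplicates)
    D: Set[Vector] = set()               # step 1
    c1.set_params(N, D)                  # step 2
    d1, f = c1.detect(P, N)              # step 3
    P -= d1                              # step 4
    while d1:                            # step 5
        N_prime = N - f                  # step 6
        D |= d1 | f                      # step 7
        is_dup = train_c2(D, N_prime)    # step 8
        d2 = {v for v in P if is_dup(v)} # step 9
        P -= d2                          # step 10
        D |= d2                          # step 11
        c1.set_params(N_prime, D)        # step 12
        d1, f = c1.detect(P, N)          # step 13
        P -= d1                          # assumption: mirrors step 4
        N = N_prime                      # step 14
    return D                             # step 15


P = {(1.0, 1.0), (0.8, 0.9), (0.2, 0.1)}
N = {(0.1, 0.0), (0.0, 0.2)}
duplicates = ude(P, N, ThresholdC1(0.9), centroid_c2)
```

Passing C1 and the C2 trainer as arguments keeps the loop independent of any particular classifier, so an SVM from a machine-learning library could be substituted for the toy nearest-centroid rule.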
www.ijorcs.org
3. Duplicate Detection of Records in Queries using Clustering 31
Rule 1: certainty factor 0.95 (No. 1 and No. 4)
• Matching key field with high type and high size
• And matching field with high distinct value, low missing value, high value data type and high range value

Rule 2: certainty factor 0.9 (No. 2 and No. 4)
• Matching key field with high range value
• And matching field with high distinct value, low missing value, high value data type and high range value

Rule 3: certainty factor 0.9 (No. 3 and No. 4)
• Matching key field with high type
• And matching field with high distinct value, low missing value, high value data type and high range value

Rule 4: certainty factor 0.85 (No. 1 and No. 5)
• Matching key field with high type and high size
• And matching field with high distinct value, low missing value and high range value

Rule 5: certainty factor 0.85 (No. 2 and No. 5)
• Matching key field with high size
• And matching field with high distinct value, low missing value and high range value

Rule 6: certainty factor 0.85 (No. 3 and No. 5)
• Matching key field with high type
• And matching field with high distinct value, low missing value and high range value

Rule 7: certainty factor 0.85 (No. 1 and No. 6)
• Matching key field with high size and high type
• And matching field with high distinct value, low missing value and high value data type

Rule 8: certainty factor 0.8 (No. 3 and No. 7)
• Matching key field with high type
• And matching field with high distinct value, high value data type and high range value

Rule 9: certainty factor 0.75 (No. 2 and No. 8)
• Matching key field with high size
• And matching field with high distinct value and low missing value

Rule 10: certainty factor 0.75 (No. 3 and No. 8)
• Matching key field with high type
• And matching field with high distinct value and low missing value

Rule 11: certainty factor 0.7 (No. 1 and No. 9)
• Matching key field with high type and high size
• And matching field with high distinct value and high value data type

Rule 12: certainty factor 0.7 (No. 2 and No. 9)
• Matching key field with high size
• And matching field with high distinct value and high value data type

Rule 13: certainty factor 0.7 (No. 3 and No. 9)
• Matching key field with high type
• And matching field with high distinct value and high value data type

Rule 14: certainty factor 0.7 (No. 1 and No. 10)
• Matching key field with high type and high size
• And matching field with high distinct value and high range value

Rule 15: certainty factor 0.7 (No. 2 and No. 10)
• Matching key field with high size
• And matching field with high distinct value and high range value

Rule 16: certainty factor 0.7 (No. 3 and No. 10)
• Matching key field with high type
• And matching field with high distinct value and high range value

S. No   Rules                                          Certainty Factor (CF)   Threshold value (TH)
1       {TS}, {D, M, DT, DS}                           0.95                    0.75
2       {T, S}, {D, M, DT, DS}                         0.9                     0.80
3       {TS, T, S}, {D, M, DT}, {D, M, DS}             0.85                    0.85
4       {TS, T, S}, {D, DT, DS}                        0.8                     0.9
5       {TS, T, S}, {D, M}                             0.75                    0.95
6       {TS, T, S}, {D, DT}, {D, DS}                   0.7                     0.95

TS – Type and Size of key attribute
T – Type of key attribute
S – Size of key attributes
D – Distinct value of attributes
M – Missing value of attributes
DT – Data type of attributes
DS – Data size of attributes

Duplicate records are identified within each cluster, covering both exact and inexact duplicates. The duplicate records can be categorized as match, may-be-match and no-match. Match and may-be-match records are passed to the duplicate data elimination rule, which assesses the quality of each duplicate record in order to eliminate poor-quality duplicate records.
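The certainty factors and thresholds tabulated above can be expressed as a small lookup. The following Python sketch is illustrative only: the names `lookup_rule`, `KEY_CLASSES` and `BY_FIELDS` are hypothetical, and the mapping follows the six-row summary table (using the legend codes) rather than the sixteen individual rules. How CF and TH are then combined with a similarity score is not specified here, so the sketch stops at the lookup.

```python
from typing import Iterable, Optional, Tuple

# Key-attribute classes from the legend: type and size, type only, size only.
KEY_CLASSES = {"TS", "T", "S"}

# Matching field that satisfies all four criteria (rows 1 and 2 of the table).
FULL = frozenset({"D", "M", "DT", "DS"})

# Rows 3 to 6 of the table apply to any key class.
# Values are (certainty factor, threshold value).
BY_FIELDS = {
    frozenset({"D", "M", "DT"}): (0.85, 0.85),
    frozenset({"D", "M", "DS"}): (0.85, 0.85),
    frozenset({"D", "DT", "DS"}): (0.80, 0.90),
    frozenset({"D", "M"}): (0.75, 0.95),
    frozenset({"D", "DT"}): (0.70, 0.95),
    frozenset({"D", "DS"}): (0.70, 0.95),
}


def lookup_rule(key_class: str,
                field_props: Iterable[str]) -> Optional[Tuple[float, float]]:
    """Return (CF, TH) for a key class and a set of field properties,
    or None when no row of the table covers the combination."""
    if key_class not in KEY_CLASSES:
        return None
    fields = frozenset(field_props)
    if fields == FULL:
        # Row 1 needs both type and size on the key; row 2 needs only one.
        return (0.95, 0.75) if key_class == "TS" else (0.90, 0.80)
    return BY_FIELDS.get(fields)


cf, th = lookup_rule("TS", ["D", "M", "DT", "DS"])
```

Keying the table on frozensets makes the lookup independent of the order in which field properties are listed.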
• Calculate the similarity of the documents matched in main concepts (Xmc) and the similarity of the documents matched in detailed descriptions (Xdd).
• Evaluate Xmc and Xdd using the rules to derive the corresponding memberships.
• Compare the memberships and select the minimum membership from these two sets to represent the membership of the corresponding concept (high similarity, medium similarity, and low similarity) for each rule.
• Collect memberships which represent the same concept in one set.
• Derive the maximum membership for each set, and compute the final inference result.

C. Evaluation Metric

The overall performance can be found using precision and recall, where

$$\text{Precision} = \frac{\text{Number of correctly identified duplicate pairs}}{\text{Number of all identified duplicate pairs}}$$

$$\text{Recall} = \frac{\text{Number of correctly identified duplicate pairs}}{\text{Number of true duplicate pairs}}$$

The classification quality is evaluated using the F-measure, which is the harmonic mean of precision and recall:

$$F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

IV. CONCLUSION

Deduplication and data linkage are important tasks in the pre-processing step for many data mining projects. It is important to improve data quality before data is loaded into the data warehouse. Locating approximate duplicates in a large data warehouse is an important part of data management and plays a critical role in the data cleaning process. In this research work, a framework is designed to clean duplicate data in order to improve data quality and to support any subject oriented data.

In this research work, an efficient duplicate detection and duplicate elimination approach is developed to obtain good duplicate detection and elimination results by reducing false positives. The performance of this research work shows that time is saved significantly and that duplicate results are improved compared with the existing approach.

The framework is mainly developed to increase the speed of the duplicate data detection and elimination process and to increase the quality of the data by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule.

Data cleansing is a complex and challenging problem. This rule-based strategy helps to manage the complexity, but does not remove that complexity. This approach can be applied to any subject oriented data warehouse in any domain.

V. REFERENCES

[1] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD, pp. 313-324, 2003.
[2] Kuhanandha Mahalingam and Michael N. Huhns, "Representing and Using Ontologies," USC-CIT Technical Report 98-01.
[3] Weifeng Su, Jiying Wang, and Frederick H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, 2010.
[4] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses," Proc. VLDB, pp. 586-597, 2002.
[5] P. Tetlow, J. Pan, D. Oberle, E. Wallace, M. Uschold, and E. Kendall, "Ontology Driven Architectures and Potential Uses of the Semantic Web in Software Engineering," W3C, Semantic Web Best Practices and Deployment Working Group, Draft (2006).
[6] Ji-Rong Wen, Fred Lochovsky, and Wei-Ying Ma, "Instance-based Schema Matching for Web Databases by Domain-specific Query Probing," Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.
[7] Amy J.C. Trappey, Charles V. Trappey, Fu-Chiang Hsu, and David W. Hsiao, "A Fuzzy Ontological Knowledge Document Clustering Methodology," IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 39, no. 3, June 2009.