Data fusion in a virtual data integration environment starts after detecting and clustering duplicate records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record that represents the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores derived from both the data sources and the clustered duplicates.
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients use the Web to discover product details in the form of online reviews, which are provided by other clients and by specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information. Collaborative filtering (CF) methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of communities of similar users. The collaborative filtering method assumes that people with the same tastes choose the same items. Conventional collaborative filtering systems have drawbacks: the sparse-data problem and a lack of scalability. A new recommender system is therefore required to deal with the sparse-data problem and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering implemented with MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations for data analysis is the join, but MapReduce is not very efficient at executing joins, because it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the BloomJoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed Bloom-filter-based algorithm reduces the number of intermediate results and improves join performance.
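As a concrete illustration of the BloomJoin idea sketched above, the following minimal Python sketch builds a Bloom filter over the join keys of the smaller dataset and uses it to discard non-matching records before the join; the `BloomFilter` class, datasets, and parameters are toy assumptions, not the paper's MapReduce implementation.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash functions over a fixed-size bit array.
    False positives are possible, false negatives are not, so it can only
    discard records that are guaranteed not to join."""
    def __init__(self, size=10_000, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# Build the filter from the smaller dataset's join keys (the "build" side),
# then use it in the map phase to drop records that cannot match.
users = {"u1": "alice", "u2": "bob"}
bf = BloomFilter()
for key in users:
    bf.add(key)

ratings = [("u1", 5), ("u3", 2), ("u2", 4), ("u9", 1)]
candidates = [r for r in ratings if bf.might_contain(r[0])]
print(candidates)  # u3/u9 are filtered out (barring rare false positives)
```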
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden within these large amounts of data. If that information can be extracted from the database, it can create a lot of profit for the organization. The question organizations are asking is how to extract this value; the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper gives an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, several attempts have been made by various researchers to deal with the problem of scheduling the Extract-Transform-Load (ETL) process. In this paper we therefore present several approaches to enhancing the data warehousing Extract, Transform, and Load stages. We focus on enhancing the performance of the extract and transform phases by proposing two algorithms that reduce the time needed in each phase by employing the hidden semantic information in the data. Using this semantic information, a large volume of useless data can be pruned at an early design stage. We also focus on the problem of scheduling the execution of ETL activities, with the goal of minimizing ETL execution time. We investigate this area by choosing three scheduling techniques for ETL. Finally, we experimentally show their behavior in terms of execution time in the sales domain, to understand the impact of implementing each of them and to choose the one leading to the greatest performance enhancement.
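To make the scheduling step concrete, here is a minimal sketch of a dependency-aware ETL scheduler that dispatches ready activities shortest-first; the activities, durations, and the shortest-first rule are illustrative assumptions and do not reproduce the paper's three techniques.

```python
# A minimal sketch of dependency-aware ETL scheduling (hypothetical activities
# and durations). Ready activities are dispatched shortest-first.
import heapq

activities = {          # name: (duration_minutes, dependencies)
    "extract_sales":  (10, []),
    "extract_stores": (4,  []),
    "clean_sales":    (6,  ["extract_sales"]),
    "join":           (8,  ["clean_sales", "extract_stores"]),
    "load":           (5,  ["join"]),
}

pending = {name: set(deps) for name, (_, deps) in activities.items()}
ready = [(dur, name) for name, (dur, deps) in activities.items() if not deps]
heapq.heapify(ready)
clock, order = 0, []
while ready:
    dur, name = heapq.heappop(ready)
    clock += dur
    order.append((name, clock))
    for other, deps in pending.items():
        if name in deps:
            deps.remove(name)
            if not deps:
                heapq.heappush(ready, (activities[other][0], other))
print(order)  # execution order with cumulative finish times
```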
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents the information about the data to be stored in data warehouses, and it is a mandatory element for building an efficient data warehouse. Metadata helps in data integration, lineage, data quality, and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs. In other words, the proposed framework decreases query response time and ensures that accurate information, including the spatial information, is fetched from the data warehouse.
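For illustration, a spatial-layer metadata record in an SDW might carry fields like the following; the field names and values are assumptions for the sketch, not the paper's actual framework.

```python
# A minimal sketch of a metadata record for one spatial layer in an SDW.
# Field names are illustrative assumptions, not the paper's framework.
layer_metadata = {
    "layer": "store_locations",
    "source_system": "GIS",                  # e.g., GIS vs. transactional feed
    "geometry_type": "Point",
    "crs": "EPSG:4326",                      # coordinate reference system
    "bounding_box": [-74.3, 40.5, -73.7, 40.9],
    "lineage": ["extracted 2015-03-01", "reprojected from EPSG:32618"],
    "quality": {"completeness": 0.98, "positional_accuracy_m": 5.0},
    "refresh_frequency": "daily",
}
print(layer_metadata["crs"])
```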
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data so as to give individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so at present many investors look for criteria by which to compare stocks and select the best, and they choose strategies that maximize the earning value of the investment process. The enormous amount of valuable data generated by the stock market has therefore attracted researchers to explore this problem domain using different methodologies, and research in data mining has gained high attention due to the importance of its applications and the ever-increasing generation of information. Data mining tools such as association rules, rule induction methods, and the Apriori algorithm are used to find associations between different scrips of the stock market, and much research and development has addressed the reasons for fluctuations of the Indian stock exchange. Nowadays, two important factors, gold prices and US dollar prices, dominate the Indian stock market; statistical correlation is used to find the correlation between gold prices, dollar prices, and the BSE index, which helps the activities of stock operators, brokers, investors, and jobbers. These activities are based on forecasting the fluctuation of index share prices, gold prices, dollar prices, and customer transactions. Hence the researchers have taken up these problems as a topic for research.
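A minimal sketch of the statistical correlation step described above, using synthetic gold, dollar, and BSE-index series; the data and the assumed inverse relationship are made up for illustration.

```python
# Pearson correlation between (synthetic) daily gold, USD, and BSE index
# series -- a stand-in for the statistical correlation the text describes.
import numpy as np

rng = np.random.default_rng(42)
n = 250                                   # roughly one trading year
gold = np.cumsum(rng.normal(0, 1, n)) + 100
usd  = np.cumsum(rng.normal(0, 1, n)) + 60
bse  = 2000 - 5 * gold - 8 * usd + rng.normal(0, 20, n)  # assumed inverse link

corr = np.corrcoef([gold, usd, bse])
print(np.round(corr, 2))  # 3x3 matrix: off-diagonals are pairwise correlations
```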
New proximity estimate for incremental update of non-uniformly distributed cl... (IJDKP)
Conventional clustering algorithms mine static databases and generate a set of patterns in the form of clusters. Many real-life databases keep growing incrementally, and for such dynamic databases the patterns extracted from the original database become obsolete. Conventional clustering algorithms are thus not suitable for incremental databases, since they lack the capability to modify the clustering results in accordance with recent updates. In this paper, the author proposes a new incremental clustering algorithm called CFICA (Cluster Feature-based Incremental Clustering Approach for numerical data) to handle numerical data, and suggests a new proximity metric called the Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data point in a cluster.
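Since the exact IPE formula is not reproduced here, the following sketch simply combines the two stated ingredients, distance to the cluster representative and distance to the farthest point in the vicinity, into one score; the combination rule is an assumption for illustration only.

```python
# A sketch of an Inverse Proximity Estimate-style score. The paper's exact
# formula is not given here, so this combination is an assumption: it
# penalizes points whose distance to the representative is large relative
# to the cluster's spread.
import numpy as np

def ipe_score(point, centroid, cluster_points):
    d_rep = np.linalg.norm(point - centroid)  # proximity to representative
    d_far = max(np.linalg.norm(point - p) for p in cluster_points)  # farthest point
    return d_rep * (d_rep / d_far)

cluster = np.array([[0.0, 0.0], [1.0, 0.2], [0.4, 0.9]])
centroid = cluster.mean(axis=0)
print(ipe_score(np.array([0.5, 0.4]), centroid, cluster))   # small score: member
print(ipe_score(np.array([3.0, 3.0]), centroid, cluster))   # much larger score
```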
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer (IJERA)
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his own definition of toughness and easiness, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. A live experiment was conducted on students of a computer science major by administering an exam through MOODLE (an LMS), analysing the generated data using RapidMiner (data mining software), and then performing clustering on the data. This method helps to identify the students who need special advising or counselling from the teacher in order to deliver high-quality education.
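A minimal sketch of the clustering step on made-up exam scores, with scikit-learn standing in for the RapidMiner workflow described above.

```python
# Illustrative clustering of exam scores into performance groups; the scores
# are invented and this is not the study's actual RapidMiner process.
import numpy as np
from sklearn.cluster import KMeans

scores = np.array([[35], [42], [48], [55], [61], [68], [74], [81], [88], [93]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
for label in range(3):
    group = scores[km.labels_ == label].ravel()
    print(f"cluster {label}: {sorted(group)}")  # e.g., low / medium / high scorers
```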
Abstract: Learning analytics by nature relies on computational information processing intended to extract from raw data interesting aspects that can be used to gain insight into the behaviors of learners, the design of learning experiences, and so on. There is a large variety of computational techniques that can be employed, all with interesting properties, but it is the interpretation of their results that really forms the core of the analytics process. As rising subjects, data mining and business intelligence are playing an increasingly important role in decision support in every walk of life. The Variance Rover System (VRS) focuses on the large data sets obtained from online web visits, categorizing them into clusters according to similarity, and on the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company, so as to make optimized and beneficial business expansion decisions. Keywords: Analytics, Business Intelligence, Clustering, Data Mining, Standard K-means, Optimized K-means
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da... (theijes)
Data mining works to extract previously unknown information from enormous quantities of data, which can lead to knowledge; it provides information that helps in making good decisions. The effectiveness of data mining lies in its access to knowledge, its goal being the discovery of the hidden facts contained in databases through the use of multiple technologies. Clustering is organizing data into clusters or groups such that there is high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data based on its characteristics and attributes and performs the clustering by reducing the distances between the data and the cluster centers. The algorithm is applied using the open-source tool WEKA, with an insurance dataset as its input.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, and shopping records are regularly generated, and sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making through analysis. On the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining tasks, clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining data clusterings with minimum information loss.
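A minimal sketch of the general multiplicative-perturbation idea, with an assumed noise scale; it does not reproduce the paper's tuple-value-based scheme.

```python
# Multiplicative data perturbation sketch: each numeric value is multiplied
# by random noise centered on 1, hiding exact values while roughly preserving
# the structure clustering/classification relies on. Noise scale and scheme
# are illustrative assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(7)
salaries = np.array([42_000.0, 55_000.0, 61_000.0, 78_000.0])
noise = rng.normal(loc=1.0, scale=0.05, size=salaries.shape)
perturbed = salaries * noise
print(perturbed)             # values safe to publish
print(salaries - perturbed)  # distortion known only to the data owner
```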
Introduction to feature subset selection method (IJSRD)
Data mining is a computational process for discovering patterns in large data sets. It has various important techniques, one of which is classification, which has recently been receiving great attention in the database community. Classification techniques can solve problems in different fields such as medicine, industry, business, and science. Particle Swarm Optimization (PSO) is an optimization method based on social behaviour. Feature Selection (FS) is a solution that involves finding a subset of prominent features to improve predictive accuracy and to remove redundant features. Rough Set Theory (RST) is a mathematical tool that deals with the uncertainty and vagueness of decision systems.
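As a simplified stand-in for feature subset selection (the paper's PSO and rough-set machinery is not reproduced), the following filter-style sketch ranks features by absolute correlation with the target.

```python
# Filter-style feature selection sketch: rank features by |corr(feature, y)|
# and keep the top k. A deliberate simplification of FS, not PSO/RST.
import numpy as np

def rank_features(X, y, k):
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 1] - X[:, 4] + rng.normal(scale=0.1, size=100)
print(rank_features(X, y, 2))   # expected: features 1 and 4
```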
Distributed Digital Artifacts on the Semantic Web (IJCATR)
Distributed digital artifacts incorporate cryptographic hash values into URIs, called trusty URIs, in a distributed environment, producing verifiable, immutable, high-quality web resources that help prevent the rising man-in-the-middle attack. The greatest challenge of a centralized system is that it gives users no possibility to check whether data has been modified, and communication is limited to a single server. The solution is a distributed digital artifact system, in which resources are distributed among different domains to enable inter-domain communication. Due to emerging developments on the web, attacks have increased rapidly, among which the man-in-the-middle attack (MIMA) is a serious issue that threatens user security. This work tries to prevent MIMA to an extent by providing self-referencing trusty URIs even in a distributed environment. Any manipulation of the data is efficiently identified, and any further access to that data is blocked by informing the user that the uniform location has changed. The system uses self-reference so that each resource contains its trusty URI, a lineage algorithm for generating the seed, and the SHA-512 hash algorithm to ensure security. It is implemented on the semantic web, an extension of the World Wide Web, using RDF (Resource Description Framework) to identify resources. The framework was thus developed to overcome existing challenges by distributing digital artifacts on the semantic web, enabling secure communication between different domains across the network and thereby preventing MIMA.
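A simplified sketch of the trusty-URI idea: hash the artifact's content with SHA-512 and embed the digest in the URI so that later tampering is detectable. The real trusty-URI specification adds module codes and canonicalization rules that are omitted here.

```python
# Simplified trusty-URI sketch: content-addressed URIs via SHA-512.
import base64
import hashlib

def make_trusty_uri(base_uri, content: bytes):
    digest = hashlib.sha512(content).digest()
    token = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{base_uri}.{token}"

def verify(trusty_uri, content: bytes):
    # base64url tokens contain no '.', so the last '.' splits off the digest
    base, _, _token = trusty_uri.rpartition(".")
    return make_trusty_uri(base, content) == trusty_uri

uri = make_trusty_uri("http://example.org/artifact/r1", b"<rdf>...</rdf>")
print(verify(uri, b"<rdf>...</rdf>"))       # True: content unchanged
print(verify(uri, b"<rdf>tampered</rdf>"))  # False: MIMA-style change detected
```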
A SURVEY ON DATA MINING IN STEEL INDUSTRIES (IJCSES Journal)
In industrial environments, a huge amount of data is generated and collected in databases and data warehouses from all involved areas, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, customer relationship management, and so on. Data mining has become a useful tool for knowledge acquisition in the industrial process of iron and steel making. Due to the rapid growth of data mining, various industries have started using data mining technology to search for hidden patterns, which can feed new knowledge back into the system and support new models that enhance production quality, productivity, cost optimization, maintenance, and so on. The continuous improvement of all steel production processes with regard to the avoidance of quality deficiencies, and the related improvement of production yield, is an essential task of the steel producer. A zero-defect strategy is therefore popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes their application in the industrial environment and especially in the steel industry.
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn... (cscpconf)
In today's world, a gigantic amount of data is available in science, industry, business, and many other areas. This data can provide valuable information that can be used by management for making important decisions, but the problem is how to find that valuable information. The answer is data mining, a popular topic among researchers, with a lot of work still to be explored. This paper focuses on a fundamental concept of data mining: classification techniques. The BayesNet, NaiveBayes, NaiveBayesUpdateable, Multilayer Perceptron, Voted Perceptron, and J48 classifiers are used to classify a data set. The performance of these classifiers is analyzed with the help of mean absolute error, root mean-squared error, and the time taken to build the model, and the results are presented statistically as well as graphically. The WEKA data mining tool is used for this purpose.
Certain Investigation on Dynamic Clustering in Dynamic Datamining (ijdmtaiir)
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects: it requires updates of the clusters whenever new data records are added to the dataset, which may result in a change of clustering over time. When there are continuous updates and huge amounts of dynamic data, rescanning the database is not possible in static data mining, but it is possible in the dynamic data mining process. Dynamic data mining occurs when the derived information is maintained for the purpose of analysis while the environment is dynamic, i.e., many updates occur. Since this has now been established by most researchers, attention is moving to solving some of the open problems, and this research concentrates on the problem of mining dynamic databases. This paper surveys existing work related to dynamic clustering and incremental data clustering.
Assessment of Cluster Tree Analysis based on Data Linkages (IJRTEM)
Abstract: Data linkage is a procedure that joins two or more sets of data (surveyed or proprietary) from different organizations to generate a valuable store of information that can be used for further analysis. This enables the real application of the data. One-to-many data linkage associates an entity from the first data set with a group of related entities from the other data sets, whereas prior work concentrates on accomplishing one-to-one data linkage. Previously, a two-level clustering tree known as the One-Class Clustering Tree (OCCT) with a built-in Jaccard similarity measure was suggested, in which each leaf contains a group rather than a single classified sequence. OCCT's use of the Jaccard similarity coefficient increases time complexity significantly. We therefore propose replacing the Jaccard similarity coefficient with the Jaro-Winkler similarity measure to obtain the group similarity, because, unlike Jaccard, it takes order into consideration by using positional indices to calculate relevance. An assessment of the proposed idea serves as validation of an enhanced one-to-many data linkage system.
Index Terms: Maximum-Weighted Bipartite Matching, Ant Colony Optimization, Graph Partitioning Technique
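To make the comparison concrete, here is a small sketch of both measures: token-set Jaccard ignores character order, while Jaro-Winkler is position-sensitive and rewards a shared prefix. The implementations follow the standard textbook formulas, not the paper's OCCT integration.

```python
def jaccard(a, b):
    """Jaccard similarity of two character sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def jaro(s1, s2):
    """Jaro similarity of two strings (position-sensitive)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                 # find matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boosts the Jaro score for a shared prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaccard("martha", "marhta"))                 # 1.0 -- same character set
print(round(jaro_winkler("martha", "marhta"), 4))  # ~0.9611 -- order matters
```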
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET (IJMTER)
A data mining environment produces a large amount of data that needs to be analyzed. Using traditional databases and architectures, it has become difficult to process, manage, and analyze such data, so to gain knowledge about Big Data a proper architecture must be understood. Classification is an important data mining technique with broad applications, used to classify the various kinds of data found in nearly every field of our lives; it assigns an item to one of a predefined set of classes according to the item's features. This paper sheds light on various classification algorithms, including J48, C4.5, and Naive Bayes, using a large dataset.
Applying Classification Technique using DID3 Algorithm to improve Decision Su... (IJMER)
Characterizing and Processing of Big Data Using Data Mining Techniques (IJTET Journal)
Abstract: Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. It concerns large-volume, complex, and growing data sets with multiple, autonomous sources. Big data is now rapidly expanding not only in science and engineering but in all domains, such as the physical and biological sciences. The main objective of this paper is to characterize the features of big data. The HACE theorem, which characterizes the features of the big data revolution, is used, and a big data processing model is proposed from the data mining perspective. The model involves the aggregation of mining and analysis, information sources, user interest modeling, and privacy and security. The most fundamental challenge in big data is to explore the large volumes of data and extract useful information or knowledge from them, so these problems and the data revolution should be analyzed.
A study and survey on various progressive duplicate detection mechanisms (eSAT Journals)
Abstract: Duplicate detection is a serious problem faced in several applications involving personal data management, customer relationship management, data mining, and so on. This survey deals with the various duplicate record detection techniques for both small and large datasets. To detect duplicates with less execution time and without disturbing dataset quality, methods like Progressive Blocking and the Progressive Sorted Neighborhood Method are used. The Progressive Sorted Neighborhood Method, also called PSNM, is used in this model for detecting duplicates in a parallel approach, while the Progressive Blocking algorithm works on large datasets where finding duplicates requires immense time. These algorithms are used to enhance a duplicate detection system; using them, efficiency can be doubled over the conventional duplicate detection method.
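A compact sketch of the sorted-neighborhood idea behind PSNM: records are sorted by a key and compared within a progressively growing window, so the closest (most duplicate-prone) pairs are emitted first. The records and sort key are toy assumptions.

```python
# Progressive sorted-neighborhood sketch: emit candidate pairs at window
# distance 1 first, then 2, and so on, so likely duplicates surface early.
def psnm(records, key, max_window):
    rs = sorted(records, key=key)
    for w in range(1, max_window + 1):        # progressive window growth
        for i in range(len(rs) - w):
            yield rs[i], rs[i + w]            # candidate pair at distance w

people = ["jon smith", "john smith", "mary may", "marie may", "zed zed"]
for a, b in psnm(people, key=lambda s: s, max_window=2):
    print(a, "<->", b)
```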
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE (aciijournal)
Today, enormous amounts of data are collected in medical databases. These databases may contain valuable information encapsulated in nontrivial relationships among symptoms and diagnoses. Extracting such dependencies from historical data is much easier when done using medical systems, and the resulting knowledge can be used in future medical decision making. In this paper, a new algorithm based on C4.5 for mining data in medical applications is proposed, and it is evaluated against two datasets and compared with the C4.5 algorithm in terms of accuracy.
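For a concrete view of the C4.5-family split criterion (not the paper's modified algorithm), the following sketch computes the gain ratio of one hypothetical symptom attribute against a binary diagnosis.

```python
# Gain ratio, the C4.5 split criterion: information gain normalized by the
# split information. The tiny "medical" dataset below is invented.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, label):
    base = entropy([r[label] for r in rows])
    groups = {}
    for r in rows:                              # partition rows by attr value
        groups.setdefault(r[attr], []).append(r[label])
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([r[attr] for r in rows])
    return (base - remainder) / split_info if split_info else 0.0

data = [
    {"fever": "high", "sick": "yes"}, {"fever": "high", "sick": "yes"},
    {"fever": "low", "sick": "no"},   {"fever": "low", "sick": "no"},
    {"fever": "low", "sick": "yes"},
]
print(round(gain_ratio(data, "fever", "sick"), 3))  # ~0.433
```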
Enhanced Privacy Preserving Access Control in Incremental Data using Microaggre... (rahulmonikasharma)
In microdata releases, the main task is to protect the privacy of data subjects. Microaggregation is a statistical disclosure limitation technique used to protect the privacy of microdata; it is an alternative to generalization and suppression for generating k-anonymous data sets, in which the identity of each subject is hidden within a group of k subjects. Microaggregation perturbs the data, and additional masking allows refining data utility in many ways: increasing data granularity, avoiding discretization of numerical data, and reducing the impact of outliers. If the variability of the private data values in a group of k subjects is too small, k-anonymity does not provide protection against attribute disclosure. In this work, role-based access control is assumed: the access control policies assign selection predicates to roles, and the concept of an imprecision bound defines, for each permission, a threshold on the amount of imprecision that can be tolerated, so the proposed approach reduces the imprecision for each selection predicate. Whereas existing papers carry out anonymization only for static relational tables, here the privacy-preserving access control mechanism is applied to incremental data.
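A minimal sketch of univariate microaggregation with group size k, assuming a simple sort-and-group strategy; real methods such as MDAV work multivariately.

```python
# Univariate microaggregation sketch: sort by the attribute, form groups of k,
# and publish each group's mean, so every released value is shared by at
# least k subjects (the k-anonymity-style guarantee discussed above).
import numpy as np

def microaggregate(values, k):
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = values[idx].mean()
    return out

ages = np.array([23, 25, 24, 41, 39, 40, 67, 70, 66])
print(microaggregate(ages, k=3))   # each age replaced by its group's mean
```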
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY (IJMTER)
A data mining environment produces a large amount of data that needs to be analyzed, and patterns have to be extracted from it to gain knowledge. In this new era, with the boom of data both structured and unstructured, it has become difficult to process, manage, and analyze patterns using traditional databases and architectures, so to gain knowledge about Big Data a proper architecture must be understood. Classification is an important data mining technique with broad applications, used to classify the various kinds of data found in nearly every field of our lives; it assigns an item to one of a predefined set of classes according to the item's features. This paper provides an inclusive survey of different classification algorithms and sheds light on several of them, including J48, C4.5, the k-nearest neighbor classifier, Naive Bayes, and SVM, using the random concept.
GRID COMPUTING: STRATEGIC DECISION MAKING IN RESOURCE SELECTION (IJCSEA Journal)
The rapid development of computer networks around the world has generated new areas, especially in computer instruction processing. In grid computing, instruction processing is performed by external processors available to the system. An important topic in this area is the scheduling of tasks onto available external resources; however, we do not deal with that topic here. In this paper we work on strategic decision making for selecting the best alternative resources for processing instructions with respect to criteria under special conditions, where the criteria might be security, politics, technology, cost, and so on; the selection should be determined with respect to the processing objectives of the program's instructions. This paper seeks a way, through combining the Analytic Hierarchy Process (AHP) and the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), to rank and select available resources according to the relevant criteria when allocating instructions to resources. Our findings will therefore help technical managers of organizations in choosing, as well as ranking, candidate alternatives for processing program instructions.
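A compact TOPSIS sketch for ranking candidate grid resources against weighted criteria; the decision matrix, weights, and criteria are made-up examples, not the study's data.

```python
# TOPSIS: rank alternatives by closeness to the ideal solution.
import numpy as np

def topsis(matrix, weights, benefit):
    M = matrix / np.linalg.norm(matrix, axis=0)   # vector-normalize columns
    V = M * weights                               # apply criterion weights
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti  = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)                # closeness to the ideal

# rows: resources; columns: throughput (benefit), cost (cost), security (benefit)
matrix  = np.array([[90.0, 5.0, 7.0],
                    [75.0, 3.0, 9.0],
                    [60.0, 2.0, 6.0]])
weights = np.array([0.5, 0.3, 0.2])
benefit = np.array([True, False, True])
print(topsis(matrix, weights, benefit))  # higher closeness = better resource
```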
Harmonized scheme for data mining technique to progress decision support syst... (eSAT Journals)
Abstract: A Decision Support System (DSS) is often treated as a synonym for a management information system (MIS). Decision support systems also incorporate decisions made upon individual data from external sources, management intuition, and various other data sources not included in business intelligence. They serve as an integrated repository for the internal and external data intelligence critical to understanding and evaluating the business within its environmental context. Data mining has emerged to meet this need: with the addition of models, analytic tools, and user interfaces, it has the potential to provide actionable information that supports effective problem and opportunity identification, critical decision making, and strategy formulation, implementation, and evaluation. The proposed system will support top-level management in making good decisions at any time, under any uncertain environment. Keywords: DSS, DM, MIS, Clustering, Classification, Association Rule, K-Means, OLAP, MATLAB
FUZZY ANALYTIC HIERARCHY BASED DBMS SELECTION IN TURKISH NATIONAL IDENTITY CA... (ijistjournal)
Database Management Systems (DBMS) play an important role in supporting enterprise application development, and selection of the right DBMS is a crucial decision in the software engineering process. This selection requires optimizing a number of criteria, and evaluating and selecting a DBMS among several candidates tends to be very complex, involving both quantitative and qualitative issues. A wrong selection of DBMS will have a negative effect on the development of the enterprise application; it can turn out to be costly and adversely affect the business process. The following study focuses on the evaluation of a multi-criteria decision problem through the use of fuzzy logic. We demonstrate the methodological considerations regarding group decision making and fuzziness on the DBMS selection problem, and we develop a new Fuzzy AHP-based decision model, formulated and proposed to make DBMS selection easy. In this decision model, the main criteria and their sub-criteria are first determined for the evaluation; these criteria are then weighted by pairwise comparison, and the DBMS alternatives are evaluated by assigning a rating scale.
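For illustration, criterion weights in classical (crisp) AHP can be derived from a pairwise comparison matrix with the geometric-mean approximation; the comparison values below are assumptions, and the fuzzy extension the paper develops is not reproduced.

```python
# AHP weight derivation sketch: geometric-mean (approximate eigenvector)
# method on a pairwise comparison matrix with invented judgments.
import numpy as np

# pairwise matrix: how much more important the row criterion is than the column
# assumed criteria: performance, cost, vendor support
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

gm = A.prod(axis=1) ** (1 / A.shape[1])   # geometric mean of each row
weights = gm / gm.sum()
print(np.round(weights, 3))               # ~[0.648, 0.230, 0.122]
```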
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE (ijcsa)
Nowadays, trillions of bytes of data are generated every second by enterprises, especially on the Internet. Access to that data in a convenient and interactive way, in order to reach the best decisions for business profit, has always been a dream of business executives and managers, and the data warehouse is the only viable solution that can bring that dream into reality. The quality of future decision making depends on the availability of correct information, which in turn rests on the quality of the underlying data. Quality data can only be produced by cleaning the data prior to loading it into the data warehouse, since data collected from different sources will be dirty; once the data has been pre-processed and cleansed, data mining queries produce accurate results. The accuracy of data is therefore vital for well-formed and reliable decision making. In this paper, we propose a framework that enforces robust data quality to ensure consistent and correct loading of data into data warehouses, which in turn ensures accurate and reliable data analysis, data mining, and knowledge discovery.
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
A lot of classification algorithms are available in the area of data mining for solving the same kind of problem, with little guidance for recommending the most appropriate one, the algorithm that gives the best results for the dataset at hand. As a way of improving the chances of recommending the most appropriate classification algorithm for a dataset, this paper focuses on the different factors considered by data miners and researchers in different studies when selecting the classification algorithms that will yield the desired knowledge for the dataset at hand. The paper divides the factors affecting classification algorithm recommendation into business and technical factors. The technical factors proposed are measurable and can be exploited by recommendation software tools.
A statistical data fusion technique in virtual data integration environment
International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 5, September 2013. DOI: 10.5121/ijdkp.2013.3503
A STATISTICAL DATA FUSION TECHNIQUE IN VIRTUAL DATA INTEGRATION ENVIRONMENT
Mohamed M. Hafez, Ali H. El-Bastawissy and Osman H. Mohamed
Information Systems Dept., Faculty of Computers and Information, Cairo Univ., Egypt
ABSTRACT
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real world object. In this paper, a
statistical technique for data fusion is introduced based on some probabilistic scores from both data
sources and clustered duplicates
KEYWORDS
Data integration, duplicates detectors, data fusion, conflict handling & resolution, probabilistic databases
1. INTRODUCTION
Recently, many applications require data to be integrated from different data sources in order to satisfy user queries. This need gave rise to virtual data integration, in which the user submits queries to a Global Schema (GS) while the data remain stored in local data sources, as shown in Figure 1. Three techniques (GaV, LaV, and GLaV) are used to define the mapping between the GS and the local schemas (LS) of the data sources [1], [2]. In our technique, we focus on the GaV technique, defining views over the LS. The defined mapping metadata determines the data sources contributing to the answer of the user query.
Figure 1: Data Integration Components[2].
In such an integration environment, the user submits the query and waits for clean and consistent answers. To give the user such results, three steps should be processed in sequence: Schema Matching, Duplicate Detection and Data Fusion, as shown in Figure 2.
Figure 2: The Three Steps of Data Fusion[3]
Schema matching inconsistencies are handled in many research articles [4], [5]. In step 2, after integrating the answers from the contributing data sources, duplicate detection techniques are used to discover duplicates and group them into clusters. Data conflicts arise after step 2 and should be handled before sending the final result to the user.
Many possible obstacles must be addressed in both step 2 and step 3. One problem that might arise in step 2 is that a set of records may be considered duplicates incorrectly. Do we ignore these duplicates? Can we use some predefined metadata to help solve this problem? Thus, we face the challenge of finding a smart way to improve the detection of such duplicates. Another important problem is that, within each cluster of duplicates, some values might conflict between the different records representing the same object. In order to solve such conflicts, should we ignore them and rely on the user to choose among them? Should we avoid conflicts from the beginning by defining preferences over the data sources and choosing values accordingly, so that no conflicts can occur? Or should we resolve such conflicts at run time using fusion techniques? The challenge to be addressed here is therefore to find the most appropriate way to overcome conflicts and complete the data fusion process.
This paper is organized as follows. In section 2, we describe and classify the commonly used conflict handling strategies. In section 3, the proposed data fusion technique, which is the core of our work, is explained. In section 4, our five steps data fusion framework is presented. The conclusion and future work are given in section 5.
2. RELATED WORK
Many strategies have been developed to handle data conflicts; several of them are repeatedly mentioned in the literature. These conflict handling strategies, shown in Figure 3, can be classified into three main classes based on the way they handle conflicting data: ignorance, avoidance, and resolution [6], [7], [8].
2.1. CONFLICT IGNORANCE. No decisions are made to deal with conflicts at all; they are left to the user to handle. Two well-known techniques ignore conflicts in this way: PASS IT ON, which takes
all conflicting values and passes them to the user to decide, and CONSIDER ALL POSSIBILITIES, which generates all possible combinations of the values, some of which may not even be present in the data sources, and shows them to the user to choose from.
Figure 3: Classification of the Conflict Handling Strategies[6].
This strategy works in any integration environment and is easy to implement. From the computational point of view, it is not very effective to generate all combinations and hand them to the user for choosing. From the automation and integration point of view, it is ineffective to involve the user in deciding on the good values, because the user knows neither the structure of the integration system nor the organization and quality of the data sources.
2.2. CONFLICT AVOIDANCE. Decisions about the data values are made in advance, so that no choices remain to be made in the data fusion step before showing the final result to the user. Conflict avoidance can rely on either the instances/data sources or the metadata stored in the global schema to deal with such conflicts. Two well-known instance-based techniques are TAKE THE INFORMATION, in which only the non-NULL values are taken, leaving aside the NULL ones, and NO GOSSIPING, which takes into consideration only the answers from data sources that fulfill the constraints or conditions of the user query, ignoring all inconsistent answers. Another well-known strategy, based on the metadata stored in the GS, is TRUST YOUR FRIENDS, in which data from one data source are preferred over another, either based on the user's preference in the query or automatically based on quality criteria such as timestamp, accuracy and cost [2], [9], [10], [11].
This strategy does not require extensive computation because it depends on preferences and decisions taken in advance, based on either the data sources or the metadata. On the other hand, it might not give good results in terms of accuracy and precision, because it does not take into account all factors relevant to solving the conflicts, such as the nature and frequency of the values and the dependency between attributes.
2.3. CONFLICT RESOLUTION. This strategy examines all the data values and metadata before making any decisions about the way conflicts are handled. Two sub-strategies are considered under conflict resolution: the deciding strategy and the mediating strategy.
a) Deciding Strategy. This strategy resolves conflicts using the existing actual values, depending on the data values or the metadata [12]. Two techniques are commonly used under this strategy based on instance values: CRY WITH THE WOLVES, in which we select the value that appears most often among the conflicting values, and ROLL THE DICE, which picks a value at random from the conflicting ones. A well-known technique based on metadata is KEEP UP TO DATE, which uses timestamp metadata about the data sources, attributes and data to select among the conflicting values based on recency.
b) Mediating Strategy. This strategy chooses a value that is not one of the existing actual values, taking into account the data values and/or the stored metadata. An example of a technique under this strategy is MEET IN THE MIDDLE, which invents a new value to represent the conflicting values, for example by taking their mean. Other techniques use even higher-level criteria, such as provenance, to resolve inconsistencies [13].
This strategy is more computationally expensive than the previous two because all of its techniques require run-time computations that access the actual data and the metadata. Despite its computational cost, this strategy can give more accurate answers without any user intervention.
Some systems have been developed that use one or more of the above strategies and techniques for handling data conflicts in relational integrated data sources, such as HumMer [14] and ConQuer [15], or that extend SQL to work with data fusion [14], [16]. Other techniques were developed to work with probabilistic databases to handle inconsistent data [17], [18], [19].
3. ANSWERS FUSION SCORING TECHNIQUE
In this section, we illustrate our new answers fusion technique. The proposed technique resolves conflicts automatically, based on decisions relying on both the data values in the instances and the metadata stored about each of the contributing data sources and attributes. So, it is a mixed deciding strategy for conflict resolution using both instance data and metadata.
Our fusion technique is based on two basic scores: the dependency between all attributes in each table of a data source, which measures how well the attribute values are correlated within that data source, and the relative frequency of each of the conflicting data values, which measures how popular a data value is within its cluster of duplicated records.
For simplicity, we will use the simple user query and the two data sources shown in Figure 4 (a) and (b) respectively. As a preprocessing step for all data sources in the data integration environment, the dependency between all attributes in each table of each data source is calculated and stored in the Dependency Statistics Module using the Attributes Dependency Score.
Figure 4: a) a sample user query submitted to the global schema;
b) data sources contributing to answering the query
Equation 3.1 (Attributes Dependency Score)
Let C1 and C2 represent two attributes in the same data source DS, let D1 and D2 be the lists of values in C1 and C2 respectively, and let DS(Card) denote the number of records in the data source. The attributes dependency score D(C1, C2) between C1 and C2 is defined as:
$$Score\,D(C_1, C_2) = \frac{\sum_{k=1}^{DS(Card)} Count\big(D_1(k), D_2(k)\big)}{DS(Card)}, \quad \text{where } Count\big(D_1(k), D_2(k)\big) > 1.$$
A sample of how the attributes dependency score is calculated between ATT_A and ATT_X in DS1.TAB_T is shown in Table 1. We take the attributes dependency score between semantically correlated attributes, as given by the attributes semantic rules, to equal 1 without any further calculation. Given all of the attributes dependency scores and the query fragments, we calculate the local detectors. For each of the query's requested attributes, the local detectors within each data source consist of the lowest dependent attribute, in addition to the intersection of the attributes, across all of the contributing dependency lists, that have a dependency score greater than zero. Of course, we exclude the query attributes themselves from the selection to ensure better duplicate detection. In our example, the sets of local detectors are {ATT_A, ATT_B} and {ATT_A, ATT_C} for DS1.TAB_T and DS2.TAB_T respectively.
Definition 3.1 (Attributes Dependency Score)
Attributes Dependency Score is the relative frequency of the summation of all value pairs having support greater than one.
It determines the level of dependency between two attributes based on their values.
The score value ranges from zero to one. The closer the score is to one, the higher the dependency between the attribute values.
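To make the computation concrete, the following minimal Python sketch implements Equation 3.1; the column values and the helper name `dependency_score` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def dependency_score(col1, col2):
    """Equation 3.1: the relative frequency of all value pairs whose
    co-occurrence count (support) is greater than one."""
    pair_counts = Counter(zip(col1, col2))
    supported = sum(c for c in pair_counts.values() if c > 1)
    return supported / len(col1)  # divide by DS(Card)

# Hypothetical columns standing in for DS1.TAB_T; only the (a1, x1) and
# (a2, x2) pairs repeat, so only they have support greater than one.
att_a = ["a1", "a1", "a2", "a2", "a3"]
att_x = ["x1", "x1", "x2", "x2", "x9"]

print(dependency_score(att_a, att_x))  # 0.8: four of five records fall in repeated pairs
```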
Table 1: Attributes Dependency Score between attributes ATT_A and ATT_X in data source DS1.TAB_T
Then we define the unified detectors set, over all contributing data sources, to be the intersection of all the sets of local detectors. In our case, the unified detectors set is {ATT_A}, which will be used to calculate the Local Attributes Scoring for the query's requested attributes, ATT_X and ATT_Y, in each data source.
Definition 3.2 (Local Attributes Scoring “LAS”)
Local Attributes Scoring (LAS) is the average of the dependency scores between the attribute and all unified detectors, multiplied by the complement of the summation of the dependency scores between all unified detectors.
It indicates the level of the attribute's dependency on the unified detectors.
The score value ranges from zero to one. The closer the score is to one, the lower the dependency between the attribute and the unified detectors and the higher the dependency between the unified detectors.
Equation 3.2 (Local Attributes Scoring “LAS”)
Let UN_DET be the list of unified detectors and UN_DET(Card) denote the number of unified detectors over all contributing data sources. We define the LAS for a given attribute ATT to be:
$$LAS(ATT) = \frac{\sum_{k=1}^{UN\_DET(Card)} Score\,D\big(ATT, UN\_DET(k)\big)}{UN\_DET(Card)} \times \left(1 - \sum_{\substack{k, j = 1 \\ k \neq j}}^{UN\_DET(Card)} Score\,D\big(UN\_DET(k), UN\_DET(j)\big)\right)$$
Thus, we take the complement of the summation of the attributes dependency scores between detectors to minimize the transitive dependency between the detectors. We calculate the LAS for all contributing attributes in all data sources and integrate the results into one set of answers, as in Table 2 (a) and (b) respectively.
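The following sketch computes Equation 3.2, assuming the Equation 3.1 scores have already been precomputed; the `dep` lookup and its values are hypothetical.

```python
def las(att, unified, dep):
    """Equation 3.2 sketch: the average dependency score between ATT and
    the unified detectors, multiplied by the complement of the summed
    pairwise dependency scores among the detectors themselves."""
    average = sum(dep(att, d) for d in unified) / len(unified)
    between = sum(dep(d1, d2) for d1 in unified
                  for d2 in unified if d1 != d2)
    return average * (1 - between)

# dep stands in for precomputed Equation 3.1 scores; the values are made up.
scores = {frozenset(("ATT_X", "ATT_A")): 0.8,
          frozenset(("ATT_Y", "ATT_A")): 0.6}
dep = lambda a, b: scores.get(frozenset((a, b)), 0.0)
print(las("ATT_X", ["ATT_A"], dep))  # 0.8: a single detector, so no damping term
```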
Table 2: a) LAS for all requested attributes for all data sources; b) Query answers after integration
with LAS and data source for each answer.
Many token-based techniques have been used in the past to detect duplicates from different data sources and cluster them into groups. We use the enhanced token-based technique for this job and provide it with the set of unified detectors for better detection of duplicates. As shown in Table 3 (a), we get three clusters, one of them with only one record (shaded in grey). So, the two clusters named C1 and C3 are the only clusters passed to the data fusion process, which calculates the GAS for each of their conflicting attribute values.
Furthermore, we can remove the unified detectors and the data source from the answers, because they were only needed for duplicate detection and clustering. The unified detectors enhance the detection and clustering of duplicates, and the data source attribute ensures that we do not get duplicates from the same data source, as we assume the data sources themselves are clean and consistent, with no duplicates in them.
Equation 3.3 (Global Attributes Scoring “GAS”)
Let ATT be the attribute for which we want to calculate GAS, let D represent the list of values of ATT within the cluster, and let CLUS(Card) denote the number of records in the cluster. We define the GAS for a given attribute ATT in a given record to be:
$$GAS(ATT) = \frac{Count\big(D(k)\big)}{CLUS(Card)}$$

where D(k) is the value of ATT in the given record of the cluster.
In other words, we calculate the relative frequency of each value in the cluster and assign it to its attribute as its GAS.
Within each cluster, we work with each attribute separately to calculate its Total Attributes Scoring (TAS) in order to apply the fusion rules.
Definition 3.3 (Global Attributes Scoring “GAS”)
Global Attributes Scoring (GAS) is the relative frequency of the attribute value within its cluster.
It determines the popularity level of the attribute value among its cluster's values.
The score value ranges from zero to one. The closer the score is to one, the higher the popularity and frequency of the value within its cluster.
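A one-function Python sketch of Equation 3.3; the cluster values here are hypothetical.

```python
from collections import Counter

def gas(cluster_values):
    """Equation 3.3: the relative frequency of each value of one
    attribute within a cluster of duplicate records."""
    n = len(cluster_values)
    return {value: count / n for value, count in Counter(cluster_values).items()}

# Hypothetical cluster of three duplicates with conflicting ATT_Y values.
print(gas(["y1", "y1", "y2"]))  # {'y1': 0.666..., 'y2': 0.333...}
```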
Definition 3.4 (Total Attributes Scoring “TAS”)
Total Attributes Scoring (TAS) is the average of the LAS and GAS values for the attribute.
It is a mixed score indicating both the level of popularity of the attribute value within its cluster and the level of the attribute's dependency on the unified detectors.
The score value ranges from zero to one. The closer the score is to one, the higher the frequency of the value within its cluster and the lower the dependency on the unified detectors.
Equation 3.4 (Total Attributes Scoring “TAS”)
Let ATT be the attribute for which we want to calculate TAS, and let LAS and GAS represent its Local Attributes Scoring and Global Attributes Scoring. We define the TAS for a given attribute ATT in a given record to be:
$$TAS(ATT) = \frac{LAS(ATT) + GAS(ATT)}{2}, \quad \text{where } GAS(ATT) \neq 1; \text{ and } TAS(ATT) = 1 \text{ otherwise.}$$
As shown in Table 3 (b), two conflicting values have the same GAS (0.5). So, we apply our fusion rules and take the value with the highest LAS or, equivalently, the highest TAS. If, by coincidence, the LAS values of the two conflicting values are also the same, we can recalculate the LAS based on the local detectors instead of the unified detectors, and still choose the value with the highest LAS.
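The sketch below combines Equation 3.4 with the fusion rules just described; the candidate triples are hypothetical, and the LAS tie-break is encoded directly in the sort key.

```python
def tas(las_value, gas_value):
    """Equation 3.4: the mean of LAS and GAS, except that a unanimous
    value (GAS = 1) gets a TAS of 1 outright."""
    return 1.0 if gas_value == 1 else (las_value + gas_value) / 2

def fuse(candidates):
    """Fusion rule sketch: candidates are (value, LAS, GAS) triples for one
    conflicting attribute; the highest TAS wins, with LAS breaking ties."""
    return max(candidates, key=lambda c: (tas(c[1], c[2]), c[1]))[0]

# Two values tie on GAS (0.5), as in Table 3 (b); the higher LAS wins.
print(fuse([("y1", 0.8, 0.5), ("y2", 0.3, 0.5)]))  # 'y1'
```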
Finally, we use these TAS values to calculate the answers fusion score that will be shown to the
user.
Table 3: a) Clustered duplicates with LAS scores and data source for each answer; b) LAS, GAS,
TAS for each attribute within the clusters with data conflicts (grey color).
Definition 3.5 (Answers Fusion Scoring “AFS”)
Answers Fusion Scoring (AFS) is the percentage average of the TAS values of all attributes in the answer.
It determines the credibility level of the fused answer.
The score value ranges from zero to 100%. The closer the score is to 100%, the higher the credibility of the fused answer.
Equation 3.5 (Answers Fusion Scoring “AFS”)
Let ANS be the fused answer for which we want to calculate AFS, let TAS_LST represent the list of TAS values for all attributes in this answer, and let ATT(Card) denote the number of attributes in the answer. We define the AFS for the given fused answer to be:
$$AFS(ANS) = \frac{\sum_{k=1}^{ATT(Card)} TAS\_LST(k)}{ATT(Card)} \times 100$$
Consequently, the final answers are presented in Table 4 with the corresponding AFS, which gives the user a certainty percentage for each answer.
Table 4: Final answers with AFS certainty percentage
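A minimal sketch of Equation 3.5; the TAS values are hypothetical.

```python
def afs(tas_values):
    """Equation 3.5: the percentage average of the TAS values of all
    attributes in one fused answer."""
    return sum(tas_values) / len(tas_values) * 100

# Hypothetical fused answer with three attributes.
print(f"AFS = {afs([1.0, 0.75, 0.65]):.0f}%")  # AFS = 80%
```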
Using this technique, we return answers with an AFS of 100% if all of the duplicates agree on the same data values. Otherwise, our technique favors data values based on a combination of a higher relative frequency in the cluster and a lower dependency on the unified detectors over all contributing data sources.
As most real-world data, especially data stored in databases, contain attributes with some kind of dependency and correlation between each other, our technique uses this fact to resolve conflicts between duplicated records. It will therefore work well in such environments, but it might face difficulties with unstructured or semi-structured data sources. Further techniques should be used to detect semantic rules within such data sources, to be integrated with our proposed scoring methods.
4. FIVE STEPS DATA FUSION FRAMEWORK
We integrated all of the data fusion steps mentioned in the previous section into a data fusion framework. We modified and extended the common three-step data fusion process shown in Figure 2, which goes through schema mapping, duplicate detection and data fusion, into a new Five Steps Data Fusion Framework, shown in Figure 5.
Our data fusion framework consists of five modules: two new modules, namely the one-time preprocessing module and the detectors and local scoring module; a modified version of the common data fusion module; and the two other common modules (Schema Matching and Duplicate Detection), which remain unchanged.
We explain each of the five modules in detail, in order of execution.
Figure 5: Five Steps Data Fusion Framework
a) One-time Preprocessing Module.
This is one of the two new modules. It is executed once per data source update. The idea behind this module is to avoid performing its computations at run time. It works efficiently if the data sources do not receive extensive updates over their lifetime, which excludes data sources for banks, insurance, stocks and other fast-updated domains. This module is executed on each data source separately, and it consists of two components: Attributes Semantic Rules and Dependency Statistics Metadata. The first component stores the predefined semantic dependency rules about attributes within the data source; one famous example is the semantic rule between address and zip code. These semantic rules are usually defined by the data source designer and should be taken as given, without further processing, by the other component. The Dependency Statistics Metadata component is the most important component in this module, as it carries out all of the computations. It calculates the dependency between all of the attributes within each table of the data source. Then, for each attribute, the list of dependency scores with all other attributes is sorted in ascending order and passed to the Local Detectors component in the Detectors and Local Scoring Module.
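As a rough sketch of what this module precomputes, the following Python code applies Equation 3.1 to every attribute pair of one table and produces the sorted per-attribute dependency lists; the table contents and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def dependency_metadata(table):
    """One-time preprocessing sketch: Equation 3.1 for every attribute pair
    of one table, then an ascending dependency list per attribute, as fed
    to the Local Detectors component. `table` is a hypothetical dict of
    equal-length column lists."""
    n = len(next(iter(table.values())))
    score = {}
    for c1, c2 in combinations(table, 2):
        pairs = Counter(zip(table[c1], table[c2]))
        s = sum(v for v in pairs.values() if v > 1) / n
        score[(c1, c2)] = score[(c2, c1)] = s
    return {c: sorted((score[(c, o)], o) for o in table if o != c)
            for c in table}

tab_t = {"ATT_A": ["a1", "a1", "a2"],
         "ATT_X": ["x1", "x1", "x2"],
         "ATT_Y": ["y1", "y2", "y3"]}
print(dependency_metadata(tab_t))  # ATT_Y pairs never repeat, so its scores are 0
```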
b) Detectors and Local Scoring Module.
This is a novel module as well, and it is the real-time start of the five-step data fusion process. As most of the computations were done in the previous module, this module applies some metadata rules to choose the detectors, plus a small amount of computation to calculate the LAS for each attribute requested in the user query. There are three components within this module: Local Detectors, Unified Detectors and Local Attributes Scoring (LAS). After the user submits the query to the global schema, the mapping process parses and fragments the posted query and sends the query fragments to the contributing data sources. The Local Detectors component within each data source takes the query fragments and all of the attributes dependency lists from the first module, and applies the local detectors rule to the user's requested attributes only. Then the next component, Unified Detectors, runs over the local detectors of all data sources and takes the intersection of all the lists as the unified detectors for all contributing data sources. If the intersection is empty, all of the fragment answers are presented to the user for selection. The unified detectors are posted back to the contributing data sources to calculate the LAS for each attribute. Finally, the query answers, including the detectors and the LAS calculated for each attribute, are posted to the next module for integration.
c) Data Integration and Schema Matching Module.
This module is well known in the literature; it matches the data source schemas and performs data integration. We assume that all conflicts in schema matching have already been resolved, so we only apply data integration to the answers coming from the previous module. Thus, this module simply unions all of the fragmented answers into one integrated set of answers and passes it to the next module.
d) Duplicate Detection and Clustering Module.
Many systems and applications apply this module to detect possible duplicates representing the same object [20]. Some techniques use the theory of record linkage [21]; others rely on similarity measures [22]; still others use tokens instead of the whole record to detect duplicates [23], [24]. Further work has been concerned with combining records from probabilistic databases [25] and with finding duplicates in XML data models [26].
Using an enhanced smart token-based technique [27], we claim that providing the detectors attached to the integrated answers can enhance its detection of duplicates. Furthermore, the detected duplicates are clustered into groups, with a cluster number assigned to each group. Non-clustered answers do not pass through the data fusion module; they are presented to the user among the final results with a fusion certainty score of 100%. We do not need the detectors any more after this module, because they are used only for the detection and clustering of duplicates; so we can reduce the size of the answers by removing the unified detector attributes. The clustered answers, containing duplicated answers with a LAS assigned to each attribute, are passed to the next and last module for data fusion.
e) Data Fusion and Answers Scoring Module.
This module was modified to include four sequential components: Global Attributes Scoring
(GAS), Total Attributes Scoring (TAS), Fusion Rules, and Answers Fusion Scoring (AFS). These
four components are applied to each of the clusters coming from the former module. The Global Attributes Scoring component calculates the relative frequency of each answer's data value within the cluster. Where the data values of an attribute differ, we calculate the TAS as the mean of the corresponding LAS and GAS. The fusion rules are applied to resolve the conflicting attribute values within the cluster by simply selecting the value with the highest TAS. After resolving all conflicting data values, the Answers Fusion Scoring component calculates the total score for each fused answer by taking the average TAS over all attributes in that answer. Finally, this module returns the final answers with a fusion scoring certainty assigned to each one.
5. CONCLUSION AND FUTURE WORK
In this paper, we presented a new way to resolve data inconsistencies through data fusion based on statistical scores. These scores are calculated from the pair-wise dependency scores between the query's requested attributes and the duplicate detectors on one side, and from the relative frequency of the data values within their clusters on the other side. We also used the proposed scoring technique to define the local detectors for each of the data sources contributing to answering the user query, and from them the set of unified detectors over all data sources. These contributions were developed and integrated into a new five steps data fusion framework which extends the common framework used in the literature.
This technique can be applied in any matching application, such as social network or face detection applications, to give a fused answer with a matching certainty, provided it is supplied with suitable data values and appropriate attributes semantic rules. It could also be extended to help improve work done in natural language processing, for example in suggesting phrase and paragraph structures.
Another open problem is how to determine the duplicate detectors and the fusion detectors from the submitted user query using some preprocessing metadata, and whether these two types of detectors should be the same or should differ based on the submitted query. One last idea is whether the scoring methods could be based on information gain instead of dependency and relative frequency scores.
REFERENCES
[1] L. Xu and D. W. Embley, “Combining the Best of Global-as-View and Local-as-View for Data
Integration,” in In Proc. of the 3rd International Conference ISTA, 2004, pp. 123–135.
[2] A. Z. EL Qutaany, A. H. El Bastawissy, and O. Hegazy, “A Technique for Mutual
Inconsistencies Detection and Resolution in Virtual Data Integration Environment,” in
Informatics and Systems (INFOS), 2010 The 7th International Conference on. IEEE, 2010, pp.
1–8.
[3] F. Naumann and J. Bleiholder, “Data Fusion in Three Steps : Resolving Inconsistencies at
Schema- , Tuple- , and Value-level,” IEEE Data Eng., vol. 29, no. 2, pp. 21–31, 2006.
[4] L. Bertossi, J. Chomicki, and C. Guti, “Consistent Answers from Integrated Data Sources,” In
Flexible Query Answering Systems, pp. 71–85, 2002.
[5] X. L. Dong, A. Halevy, and C. Yu, “Data integration with uncertainty,” The VLDB Journal, vol.
18, no. 2, pp. 469–500, Nov. 2008.
[6] J. Bleiholder and F. Naumann, Conflict Handling Strategies in an Integrated Information System.
2006.
[7] J. Bleiholder and F. Naumann, “Data fusion,” ACM Computing Surveys, vol. 41, no. 1, pp. 1–
41, Dec. 2008.
[8] X. Dong and F. Naumann, “Data Fusion – Resolving Data Conflicts for Integration,” in
Proceedings of the VLDB Endowment 2(2), 2009, pp. 1645–1655.
[9] A. Motro and P. Anokhin, “Fusionplex: resolution of data inconsistencies in the integration of
heterogeneous information sources,” Information Fusion, vol. 7, no. 2, pp. 176–196, Jun. 2006.
[10] G. De Giacomo, R. La, V. Salaria, and I.- Roma, “Tackling Inconsistencies in Data Integration
through Source Preferences,” in In Proceedings of the 2004 international workshop on
Information quality in information systems. ACM, 2004, pp. 27–34.
[11] X. Wang, L.-P. Huang, X.-H. Xu, Y. Zhang, and J.-Q. Chen, “A Solution for Data Inconsistency in Data Integration,” Journal of Information Science and Engineering, vol. 27, pp. 681–695, 2011.
[12] P. Anokhin and A. Motro, “Use of Meta-data for Value-level Inconsistency Detection and Resolution During Data Integration,” Tech. Rep. ISETR-03-06, Department of Information and Software Engineering, George Mason University, USA, 2003.
[13] Y. Katsis, A. Deutsch, Y. Papakonstantinou, and V. Vassalos, “Inconsistency Resolution in
Online Databases,” in In Data Engineering (ICDE), 2010 IEEE 26th International Conference on
IEEE, 2010, pp. 1205–1208.
[14] A. Bilke, J. Bleiholder, F. Naumann, B. Christoph, and M. Weis, “Automatic Data Fusion with
HumMer,” in In Proceedings of the 31st international conference on Very large data bases.
VLDB Endowment., 2005, pp. 1251–1254.
[15] A. Fuxman and E. Fazli, “ConQuer : Efficient Management of Inconsistent Databases,” in In
Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
ACM, 2005, pp. 155–166.
[16] J. Bleiholder and F. Naumann, “Declarative Data Fusion – Syntax, Semantics, and Implementation,” in Advances in Databases and Information Systems, pp. 58–73, 2005.
[17] S. Maitra and R. Wason, “A Naive Approach for Handling Uncertainty Inherent in Query Optimization of Probabilistic Databases,” Institute of Information Technology and Management, 2011.
[18] X. Lian, L. Chen, and S. Song, “Consistent query answers in inconsistent probabilistic
databases,” Proceedings of the 2010 international conference on Management of data - SIGMOD
’10, vol. 8, p. 303, 2010.
[19] P. Andritsos, A. Fuxman, and R. J. Miller, “Clean Answers over Dirty Databases: A Probabilistic Approach,” in 22nd International Conference on Data Engineering (ICDE’06), pp. 30–30, 2006.
[20] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2007.
[21] I. P. Fellegi and A. B. Sunter, “A Theory for Record Linkage,” Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210, 1969.
[22] M. Bilenko and R. J. Mooney, “Adaptive duplicate detection using learnable string similarity
measures,” Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining - KDD ’03, p. 39, 2003.
[23] C. I. Ezeife and T. E. Ohanekwu, “Use of Smart Tokens in Cleaning Integrated Warehouse Data,” International Journal of Data Warehousing and Mining (IJDWM), vol. 1, no. 2, pp. 1–22, 2005.
[24] T. E. Ohanekwu and C. I. Ezeife, “A Token-Based Data Cleaning Technique for Data
Warehouse Systems,” in In Proceedings of the International Workshop on Data Quality in
Cooperative Information Systems, 2003, pp. 21–26.
[25] F. Panse and N. Ritter, “Tuple Merging in Probabilistic Databases,” in Proceedings of the Fourth International Workshop on Management of Uncertain Data (MUD), Singapore, 2010, pp. 113–127.
[26] M. Weis and F. Naumann, “DogmatiX Tracks down Duplicates in XML,” in In Proceedings of
the 2005 ACM SIGMOD international conference on Management of data,ACM, 2005, pp. 431–
442.
[27] A. S. El Zeiny, A. H. El Bastawissy, A. S. Tolba, and O. Hegazy, “An Enhanced Smart Tokens
Algorithm for Data Cleansing in Data Warehouses,” in In the Proceedings of the Fifth
International Conference on Informatics and Systems (INFOS 2007), Cairo, Egypt, 2007.