Business Intelligence is a set of methods, processes, and technologies that transform raw data into meaningful
and useful information. A recommender system is one kind of business intelligence system, used to deliver
knowledge to the active user for better decision making. Recommender systems apply data mining
techniques to the problem of making personalized recommendations for information. The growth in the
volume of information and the number of users in recent years poses challenges for recommender systems.
Collaborative, content-based, demographic, and knowledge-based filtering are four different types of recommender
systems. In this paper, a new hybrid algorithm for recommender systems is proposed, which combines
knowledge-based filtering, user profiles, and a most-frequent-item mining technique to obtain intelligence.
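The most-frequent-item component of such a hybrid can be sketched minimally as follows. This is an illustrative sketch only, not the paper's actual algorithm: the transaction format, the profile fields, and the `profile_filter` predicate are all assumptions for demonstration.

```python
from collections import Counter

def most_frequent_items(transactions, profile_filter, top_n=3):
    """Count item occurrences across transactions whose user profile
    matches the active user's profile, and return the top-n items."""
    counts = Counter()
    for user_profile, items in transactions:
        if profile_filter(user_profile):
            counts.update(items)
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical transactions: (user profile, purchased items)
transactions = [
    ({"age_group": "20s"}, ["laptop", "mouse"]),
    ({"age_group": "20s"}, ["laptop", "headphones"]),
    ({"age_group": "40s"}, ["printer"]),
]
recs = most_frequent_items(transactions, lambda p: p["age_group"] == "20s")
```

In a fuller hybrid, the profile filter would come from the knowledge-based component rather than a hand-written lambda.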
PERFORMING DATA MINING IN (SRMS) THROUGH VERTICAL APPROACH WITH ASSOCIATION R... (Editor, IJMTER)
This technique is used for efficient data mining in SRMS (Student Records Management System)
through a vertical approach with association rules in distributed databases. The current leading
technique is that of Kantarcioglu and Clifton [1]. This system deals with two challenges: one is
computing the union of private subsets that each of the interacting users holds, and the other is
testing the inclusion of an element held by one user in a subset held by another. The existing
system uses techniques such as the Apriori algorithm for data mining, and the Fast Distributed
Mining (FDM) algorithm of Cheung et al. [2], which is an unsecured distributed version of the
Apriori algorithm. The proposed system offers enhanced privacy and data mining through encryption
techniques and association rules with the FP-Growth algorithm in a private cloud (the system
contains different files of subjects organized by branch). With these techniques, the system is
expected to be simpler and more efficient in terms of communication and computational cost. They
should also improve parameters such as execution time, code length, speed of data retrieval, and
the extraction of hidden predictive information from large databases; the efficiency of the
proposed system should increase by roughly 20%.
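For background on the frequent-itemset mining that Apriori, FDM, and FP-Growth all perform, here is a minimal non-distributed, non-private Apriori sketch. The transactions and support threshold are illustrative assumptions, not data from the system described above.

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support meets min_support.
    Classic level-wise search: frequent k-itemsets are joined to
    form (k+1)-candidates, which are pruned by a counting pass."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent sets into k-sized candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

txns = [{"milk", "bread"}, {"milk", "eggs"},
        {"milk", "bread", "eggs"}, {"bread"}]
freq = apriori(txns, min_support=0.5)
```

FP-Growth reaches the same frequent itemsets without candidate generation, by compressing the transactions into a prefix tree; that is why it is often preferred at scale.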
Recommendation System Using Bloom Filter in MapReduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews. The reviews are
provided by other clients and specialists. Recommender systems provide an important response to the
information overload problem by presenting users with more practical and personalized information services.
Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality
recommendations by leveraging the preferences of a community of similar users. The collaborative filtering
method assumes that people with the same tastes choose the same items. Conventional collaborative
filtering systems have drawbacks such as the sparse-data problem and a lack of scalability, so a new
recommender system is required to deal with sparse data and produce high-quality recommendations in a
large-scale mobile environment. MapReduce is a programming model widely used for large-scale data
analysis. The described recommendation mechanism for mobile commerce is user-based collaborative
filtering using MapReduce, which reduces the scalability problem of conventional CF systems.
One of the essential operations for data analysis is the join. MapReduce, however, is not very
efficient at executing joins, as it always processes all records in the datasets even when only a small
fraction of them is relevant to the join. This problem can be reduced by applying the
bloomjoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate
records. The proposed algorithm using Bloom filters reduces the number of intermediate results and
improves join performance.
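The bloomjoin idea above rests on building a Bloom filter over one side's join keys so the other side can discard non-matching records before the join. A minimal sketch follows; the bit-array size, the hash construction, and the sample records are illustrative assumptions, not the paper's implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    Membership tests may yield false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        # Derive k positions by salting one cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Bloomjoin sketch: build the filter on one side's join keys,
# then prune the other side's records before the actual join.
left = [("u1", "Alice"), ("u2", "Bob")]
right = [("u1", "laptop"), ("u3", "printer"), ("u2", "mouse")]
bf = BloomFilter()
for key, _ in left:
    bf.add(key)
pruned = [r for r in right if bf.might_contain(r[0])]
```

In the MapReduce setting, the filter is built in one job (or distributed via the job configuration) and applied in the mappers of the join job, so non-matching records never reach the shuffle phase.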
Frequent Itemset Mining of Big Data for Social Media (IJERA Editor)
Big data is a term for massive data sets with a large, varied, and complex structure that are difficult to store, analyze, and visualize for further processing or results. Big data includes data from email, documents, pictures, audio and video files, and other sources that do not fit into a relational database; this unstructured data brings enormous challenges. The process of examining massive amounts of data to reveal hidden patterns and secret correlations is called big data analytics. Big data implementations therefore need to be analyzed and executed as accurately as possible. The proposed model structures unstructured data from social media into a structured form so that the data can be queried efficiently using the Hadoop MapReduce framework. Big data mining is essential in order to extract value from massive amounts of data, and MapReduce is a more efficient method for dealing with big data than traditional techniques. The proposed combination of the Knuth–Morris–Pratt linguistic string matching algorithm and the K-Means clustering algorithm provides a platform for extracting value from massive amounts of data and generating recommendations for users. Linguistic matching techniques such as the Knuth–Morris–Pratt string matching algorithm are very useful for returning properly matched output for a user query. The K-Means algorithm clusters data using a vector space model and can be an appropriate method for producing user recommendations.
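The Knuth–Morris–Pratt matcher mentioned above can be sketched as follows: it precomputes a failure table over the pattern so the text is scanned in linear time, never re-examining a text character.

```python
def kmp_search(text, pattern):
    """Return start indices of all occurrences of pattern in text,
    using the KMP failure function to avoid re-scanning the text."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it
    failure = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j
    # Scan the text, falling back via the table on mismatch
    matches, j = [], 0
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = failure[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = failure[j - 1]  # allow overlapping matches
    return matches
```

In the proposed pipeline such a matcher would run inside map tasks against the structured social-media text, with K-Means clustering applied downstream.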
In recent years data mining has evolved into an active area of research because of the previously unknown and interesting knowledge that can be extracted from very large database collections. Data mining is applied to a variety of applications in multiple domains, such as business, IT, and many other sectors. In data mining, a major problem that receives great attention from the community is the classification of data. Data should be classified in such a way that the results can be easily verified and easily interpreted by humans. In this paper we study various data mining techniques in order to find combinations for an enhanced hybrid technique that involves multiple techniques and improves the usability of the application. We study the CHARM algorithm, the CM-SPAM algorithm, the Apriori algorithm, the MOPNAR algorithm, and Top-K Rules.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Cluster Based Access Privilege Management Scheme for Databases (Editor, IJMTER)
Knowledge discovery is carried out using data mining techniques; association rule mining, classification, and
clustering operations are carried out under data mining. The clustering method is used to group records based on
relevancy, with distance or similarity measures used to estimate transaction relationships. Census data and medical
data are referred to as micro data. Data publishing schemes are used to provide private data for analysis, privacy
preservation is used to protect private data values, and anonymity is considered in the privacy preservation process.
Data values are made available to authorized users through access control models. The Privacy Protection
Mechanism (PPM) uses suppression and generalization of relational data to anonymize it and satisfy privacy needs. An
accuracy-constrained privacy-preserving access control framework is used to manage access control in relational
databases. The access control policies define the selection predicates available to roles, while the privacy
requirement is to satisfy k-anonymity or l-diversity. An imprecision bound constraint is assigned for each selection
predicate. k-anonymous Partitioning with Imprecision Bounds (k-PIB) is used to estimate accuracy and privacy
constraints. Role-Based Access Control (RBAC) allows defining permissions on objects based on roles in an
organization. The Top-Down Selection Mondrian (TDSM) algorithm is used for query workload-based anonymization
and is constructed using greedy heuristics and a kd-tree model. Query cuts are selected with minimum bounds in the
Top-Down Heuristic 1 (TDH1) algorithm; the query bounds are updated as partitions are added to the output in
Top-Down Heuristic 2 (TDH2); and the cost of reduced precision in the query results is used in Top-Down Heuristic 3
(TDH3). A repartitioning algorithm is used to reduce the total imprecision for the queries.
The privacy-preserved access privilege management scheme is enhanced to provide incremental mining
features. Data insert, delete, and update operations are connected with the partition management mechanism.
Cell-level access control is provided with a differential privacy method, and a dynamic role management model is
integrated with the access control policy mechanism for query predicates.
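The k-anonymity requirement central to the scheme above is simple to state: every combination of quasi-identifier values must be shared by at least k records. A minimal check can be sketched as follows; the column names and generalized micro data are illustrative assumptions.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs
    at least k times, so no record is uniquely re-identifiable
    through those attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical micro data: age and zip have been generalized
records = [
    {"age": "20-30", "zip": "130**", "disease": "flu"},
    {"age": "20-30", "zip": "130**", "disease": "cold"},
    {"age": "30-40", "zip": "148**", "disease": "flu"},
    {"age": "30-40", "zip": "148**", "disease": "asthma"},
]
```

Mondrian-style algorithms such as TDSM search for the partitioning (i.e., the generalization) that satisfies this check while keeping query imprecision low.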
Multiple Minimum Support Implementations with Dynamic Matrix Apriori Algorith... (ijsrd.com)
Data mining can be defined as the process of uncovering hidden patterns in raw data that are potentially useful. The discovery of interesting association relationships among large amounts of business transactions is currently vital for making appropriate business decisions. Association rule analysis is the task of discovering association rules that occur frequently in a given transaction data set; its task is to find certain relationships among a set of data (an itemset) in the database. It has two measurements: support and confidence. The confidence value is a measure of a rule's strength, while the support value corresponds to statistical significance. There are currently a variety of algorithms to discover association rules. Some of these algorithms depend on the use of minimum support to weed out the uninteresting rules; others look for highly correlated items, that is, rules with high confidence. Traditional association rule mining techniques employ predefined support and confidence values. However, specifying the minimum support of the mined rules in advance often leads to either too many or too few rules, which negatively impacts the performance of the overall system. This work proposes a way to efficiently mine association rules over dynamic databases using the Dynamic Matrix Apriori technique and Multiple Support Apriori (MSApriori), along with a modification of the Matrix Apriori algorithm to accommodate multiple supports. Experiments on a large set of databases have been conducted to validate the proposed framework. The achieved results show a remarkable improvement in the overall performance of the system in terms of run time, the number of generated rules, and the number of frequent items used.
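The support and confidence measures above, and the multiple-minimum-support idea of giving each item its own threshold, can be made concrete with a short sketch. The transactions and per-item thresholds are illustrative assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return (support(antecedent + consequent, transactions)
            / support(antecedent, transactions))

txns = [{"milk", "bread"}, {"milk", "eggs"},
        {"milk", "bread", "eggs"}, {"bread"}]

# Multiple minimum supports: rare items get a lower threshold, so
# an itemset is judged against the smallest threshold of its items.
min_support = {"milk": 0.5, "bread": 0.5, "eggs": 0.25}
itemset = ["bread", "eggs"]
passes = support(itemset, txns) >= min(min_support[i] for i in itemset)
```

With a single global threshold of 0.5, the rare {bread, eggs} itemset would be discarded; the per-item thresholds keep it, which is exactly the "too few rules" problem the multiple-support approach addresses.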
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated
records from the different integrated data sources. It refers to the process of selecting or fusing attribute
values from the clustered duplicates into a single record representing the real-world object. In this paper, a
statistical technique for data fusion is introduced, based on probabilistic scores from both the data
sources and the clustered duplicates.
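The fusion step described above, selecting one value per attribute from a cluster of duplicates, can be sketched with a simple score-weighted vote. This is only an illustration of the general idea: the source scores, record fields, and tie-breaking rule are assumptions, not the paper's statistical technique.

```python
from collections import defaultdict

def fuse(duplicates, source_scores):
    """Fuse clustered duplicate records into one record by choosing,
    per attribute, the value with the highest total source score."""
    fused = {}
    attrs = {a for _, rec in duplicates for a in rec}
    for attr in attrs:
        votes = defaultdict(float)
        for source, rec in duplicates:
            if rec.get(attr) is not None:
                votes[rec[attr]] += source_scores[source]
        if votes:
            fused[attr] = max(votes, key=votes.get)
    return fused

# One cluster of duplicates describing the same real-world person
cluster = [
    ("src_a", {"name": "J. Smith", "city": "Cairo"}),
    ("src_b", {"name": "John Smith", "city": "Cairo"}),
    ("src_c", {"name": "John Smith", "city": None}),
]
scores = {"src_a": 0.9, "src_b": 0.6, "src_c": 0.5}
record = fuse(cluster, scores)
```

Here the two lower-scored sources jointly outvote the higher-scored one on the name attribute, which is the essential difference between score-weighted fusion and simply trusting the single best source.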
Introduction to feature subset selection method (IJSRD)
Data mining is a computational process for discovering patterns in large data sets. It has various important techniques, one of which is classification, which has recently been receiving great attention in the database community. Classification techniques can solve problems in fields such as medicine, industry, business, and science. Particle Swarm Optimization (PSO) is an optimization technique based on social behaviour. Feature Selection (FS) involves finding a subset of prominent features in order to improve predictive accuracy and to remove redundant features. Rough Set Theory (RST) is a mathematical tool that deals with the uncertainty and vagueness of decision systems.
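Rough Set Theory, mentioned above, handles vagueness by approximating an uncertain concept with two crisp sets: the lower approximation (objects certainly in the concept) and the upper approximation (objects possibly in it). A minimal sketch follows; the patient data and attribute names are illustrative assumptions.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Lower/upper approximation of the target set under the
    indiscernibility relation induced by the chosen attributes:
    objects with identical attribute values fall in one block."""
    blocks = defaultdict(set)
    for name, desc in objects.items():
        blocks[tuple(desc[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # block lies entirely inside the concept
            lower |= block
        if block & target:       # block overlaps the concept
            upper |= block
    return lower, upper

patients = {
    "p1": {"fever": "yes", "cough": "yes"},
    "p2": {"fever": "yes", "cough": "yes"},
    "p3": {"fever": "no",  "cough": "yes"},
}
flu = {"p1", "p3"}
lower, upper = approximations(patients, ["fever", "cough"], flu)
```

The gap between the two approximations (here, the indistinguishable pair p1/p2) measures the uncertainty; RST-based feature selection looks for attribute subsets that shrink this gap while discarding redundant attributes.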
Study and Analysis of K-Means Clustering Algorithm Using RapidMiner (IJERA Editor)
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has his or her own definition of toughness and easiness, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, knowledge of data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. In this project, a live experiment was conducted on students: an exam was administered to computer science students using MOODLE (an LMS), the generated data was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling from the teacher, in order to provide a high quality of education.
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da... (theijes)
Data mining extracts previously unknown information from enormous quantities of data, which can lead to knowledge. It provides information that helps in making good decisions. The effectiveness of data mining lies in providing access to knowledge, with the goal of discovering the hidden facts contained in databases through the use of multiple technologies. Clustering is organizing data into clusters or groups such that there is high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data based on its characteristics and attributes, performing the clustering by reducing the distances between the data and the cluster centers. The algorithm is applied using the open source tool WEKA, with an insurance dataset as its input.
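Both case studies above rely on K-means, whose core loop alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal sketch follows; the deterministic seeding and the exam-score data are illustrative assumptions (real tools like WEKA and RapidMiner use smarter initialization).

```python
def kmeans(points, k, iterations=10):
    """Plain K-means on n-dimensional tuples: alternate between
    assigning points to the nearest centroid (squared Euclidean
    distance) and recomputing centroids as cluster means."""
    centroids = points[:k]  # deterministic seed, for the sketch only
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (exam score, hours studied) pairs: two visible groups
points = [(35, 1), (40, 2), (38, 1), (85, 8), (90, 9), (88, 7)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the loop converges within a few iterations, separating the low-scoring group from the high-scoring one, which mirrors how the case studies flag students who may need counselling.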
Processing the data generated from transactions that occur every day, amounting to nearly thousands of records per day, requires software that enables users to search the necessary data; data mining is a solution to this problem. To that end, many large vendors have begun creating software that can perform such data processing. Because of the high cost of data mining software from large vendors, some communities, such as universities, have created open source software that offers convenience to users who simply want to learn data mining or deepen their knowledge of it. Meanwhile, many commercial vendors market their own products. WEKA and Salford Systems are both data mining software packages, each with its advantages and disadvantages. This study compares them using several attributes, so that users can select the software more suitable for their daily activities.
We are living in a world with a vast amount of digital data, which is called big data, and the world is becoming more and more connected via the Internet of Things (IoT). The IoT has been a major influence on the big data landscape, and the analysis of such big data takes business competition to the next level of innovation and productivity.
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, several
attempts have been made by various researchers to deal with the problem of scheduling the Extract-
Transform-Load (ETL) process. In this paper we therefore present several approaches for enhancing the
extract, transform, and load stages of data warehousing. We focus on improving the performance of the
extract and transform phases by proposing two algorithms that reduce the time needed in each phase by
employing the semantic information hidden in the data. Using this semantic information, a large volume of
useless data can be pruned at an early design stage. We also address the problem of scheduling the
execution of ETL activities, with the goal of minimizing ETL execution time. We investigate this area by
choosing three scheduling techniques for ETL. Finally, we experimentally show their behavior in terms of
execution time in the sales domain, to understand the impact of implementing each of them and to choose
the one leading to the maximum performance enhancement.
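The early-pruning idea above (using semantic information to discard useless rows during extraction, before any transformation work is spent on them) can be illustrated generically. The record fields, the semantic rule, and the transformation are all hypothetical; the paper's actual algorithms are not reproduced here.

```python
def extract(rows, semantic_filter):
    """Extract phase: yield only rows that can possibly contribute
    to the warehouse, pruning useless data before transformation."""
    for row in rows:
        if semantic_filter(row):
            yield row

def transform(rows):
    """Transform phase: normalize the surviving rows."""
    for row in rows:
        yield {"region": row["region"].upper(),
               "amount": round(row["amount"], 2)}

source = [
    {"region": "eu", "amount": 10.567, "status": "completed"},
    {"region": "us", "amount": 0.0,    "status": "cancelled"},
    {"region": "us", "amount": 99.999, "status": "completed"},
]
# Hypothetical semantic rule: cancelled or zero-amount sales can
# never affect any warehouse query, so they are dropped up front.
loaded = list(transform(extract(
    source, lambda r: r["status"] == "completed" and r["amount"] > 0)))
```

Because the phases are generators, pruning happens row by row as data streams through, which is why discarding early reduces the time spent in every later phase.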
The Survey of Data Mining Applications and Feature Scope (IJCSEIT Journal)
In this paper we focus on a variety of techniques and approaches, and on different areas of research that are
helpful and marked as important fields of data mining technology. As we are aware, many multinational
companies and large organizations operate in different places in different countries, and each place of
operation may generate large volumes of data. Corporate decision makers require access to all such sources
in order to take strategic decisions. The data warehouse delivers significant business value by improving the
effectiveness of managerial decision-making. In an uncertain and highly competitive business environment,
the value of strategic information systems such as these is easily recognized; however, in today's business
environment, efficiency or speed is not the only key to competitiveness. Huge amounts of data are available,
in the range of terabytes to petabytes, which has drastically changed the areas of science and engineering.
To analyze and manage such huge amounts of data, and to make decisions from them, we need the
techniques called data mining, which are transforming many fields. This paper presents a number of
applications of data mining and also discusses the scope of data mining, which will be helpful for further
research.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone,
shopping, and individual records are regularly generated. Sharing these data has proved to be beneficial
for data mining applications. On the one hand, such data is an important asset for business decision making
when analyzed; on the other hand, data privacy concerns may prevent data owners from sharing information
for data analysis. In order to share data while preserving privacy, the data owner must come up with a
solution that achieves the dual goals of privacy preservation and accuracy of the data mining tasks,
clustering and classification. An efficient and effective approach has been proposed that aims to protect
the privacy of sensitive information while obtaining data clustering with minimum information loss.
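The general shape of a multiplicative perturbation like the one described above can be sketched as follows: each attribute value is scaled by random noise drawn near 1, so raw values are hidden while relative magnitudes (and hence much of the clustering structure) are roughly preserved. The noise distribution and its parameters are illustrative assumptions, not the paper's actual scheme.

```python
import random

def perturb(tuples, sigma=0.05, seed=42):
    """Multiplicative data perturbation: multiply every attribute
    value by an independent factor drawn near 1.0. Exact values
    are masked, but the overall structure stays approximately
    intact for downstream clustering or classification."""
    rng = random.Random(seed)
    return [
        tuple(v * rng.gauss(1.0, sigma) for v in t)
        for t in tuples
    ]

# Hypothetical medical tuples: (systolic BP, diastolic BP)
original = [(120.0, 80.0), (130.0, 85.0), (200.0, 110.0)]
masked = perturb(original)
```

The privacy/accuracy trade-off lives in `sigma`: larger noise hides values better but distorts inter-point distances more, increasing the information loss that the proposed approach tries to minimize.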
Processing of the data generated from transactions that occur every day which resulted in nearly thousands of data per day requires software capable of enabling users to conduct a search of the necessary data. Data mining becomes a solution for the problem. To that end, many large industries began creating software that can perform data processing. Due to the high cost to obtain data mining software that comes from the big industry, then eventually some communities such as universities eventually provide convenience for users who want just to learn or to deepen the data mining to create software based on open source. Meanwhile, many commercial vendors market their products respectively. WEKA and Salford System are both of data mining software. They have the advantages and the disadvantages. This study is to compare them by using several attributes. The users can select which software is more suitable for their daily activities.
We are living in a world, where a vast amount of digital data which is called big data. Plus as the world becomes more and more connected via the Internet of Things (IoT). The IoT has been a major influence on the Big Data landscape. The analysis of such big data brings ahead business competition to the next level of innovation and productivity.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Enhancement techniques for data warehouse staging areaIJDKP
Poor performance can turn a successful data warehousing project into a failure. Consequently, several
attempts have been made by various researchers to deal with the problem of scheduling the Extract-
Transform-Load (ETL) process. In this paper we therefore present several approaches in the context of
enhancing the data warehousing Extract, Transform and loading stages. We focus on enhancing the
performance of extract and transform phases by proposing two algorithms that reduce the time needed in
each phase through employing the hidden semantic information in the data. Using the semantic
information, a large volume of useless data can be pruned in early design stage. We also focus on the
problem of scheduling the execution of the ETL activities, with the goal of minimizing ETL execution time.
We explore and invest in this area by choosing three scheduling techniques for ETL. Finally, we
experimentally show their behavior in terms of execution time in the sales domain to understand the impact
of implementing any of them and choosing the one leading to maximum performance enhancement.
The Survey of Data Mining Applications And Feature Scope IJCSEIT Journal
In this paper we have focused a variety of techniques, approaches and different areas of the research which
are helpful and marked as the important field of data mining Technologies. As we are aware that many MNC’s
and large organizations are operated in different places of the different countries. Each place of operation
may generate large volumes of data. Corporate decision makers require access from all such sources and
take strategic decisions .The data warehouse is used in the significant business value by improving the
effectiveness of managerial decision-making. In an uncertain and highly competitive business
environment, the value of strategic information systems such as these are easily recognized however in
today’s business environment, efficiency or speed is not the only key for competitiveness. This type of huge
amount of data’s are available in the form of tera- to peta-bytes which has drastically changed in the areas
of science and engineering. To analyze, manage and make a decision of such type of huge amount of data
we need techniques called the data mining which will transforming in many fields. This paper imparts more
number of applications of the data mining and also o focuses scope of the data mining which will helpful in
the further research.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...IJDKP
Huge volume of data from domain specific applications such as medical, financial, library, telephone,
shopping records and individual are regularly generated. Sharing of these data is proved to be beneficial
for data mining application. On one hand such data is an important asset to business decision making by
analyzing it. On the other hand data privacy concerns may prevent data owners from sharing information
for data analysis. In order to share data while preserving privacy, data owner must come up with a solution
which achieves the dual goal of privacy preservation as well as an accuracy of data mining task –
clustering and classification. An efficient and effective approach has been proposed that aims to protect
privacy of sensitive information and obtaining data clustering with minimum information loss
Hybrid Algorithm for Analog Circuit Design AutomationNoralina A.
Analog components play important role in VLSI system especially in mixed-signal system where the analog components and digital components are integrated on the same chip. Despite their importance, design automation for analog circuit still lags behind that of digital circuits. With challenging analog circuit design problems and a few analog circuit design engineers, there are economic reasons for automating the analog design process.
A Hybrid Algorithm Using Apriori Growth and Fp-Split Tree For Web Usage Mining iosrjce
Internet is the most active and happening part of everyone’s life today. Almost every business or
service or organization has its website and performance of the site is an important issue. Web usage mining
based on web logs is an important methodology for optimizing website’s performance over the internet.
Different mining techniques like Apriori method, FP Tree methodology, K-Means method etc. have been
proposed by different researchers in order to make the data mining more effective and efficient. Many people
have modeled Apriori or FP Tree in their own way to increase data mining productiveness. Wu proposed
Apriori Growth as a hybrid of Apriori and FP Tree algorithm and improved FP Tree by mining using Apriori
and removed the complexity involved in FP Growth mining. Lee proposed FP Split Tree as a variant of FP Tree
and reduced the complexity by scanning the database only once against twice in FP Tree method. This research
proposes a new hybrid algorithm of FP Split and Apriori growth which combines the positives of both the
algorithms to create a new technique which provides with a better performance over the traditional methods.
The new proposed algorithm was implemented in java language on web logs obtained from IIS server and the
computational results of the proposed method performs better than traditional FP Tree method, Apriori
Method.
Reconstructing movement traces throug a hybrid map matching algorithmcdc2013workshop
Kevin Baker, Pascal Brackman, Philippe De Maeyer, Rik Van de Walle
University Ghent, Belgium; RouteYou, Belgium
Topic: “Reconstructing movement traces through a hybrid map-matching algorithm”
A model of hybrid genetic algorithm particle swarm optimization(hgapso) based...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Task Scheduling using Hybrid Algorithm in Cloud Computing Environmentsiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec👋 Christopher Moody
Available with notes:
http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
(Data Day 2016)
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec, LDA, and introduce our hybrid algorithm lda2vec.
A recommender system-using novel deep network collaborative filteringIAESIJAI
The recommendation model aims to predict the user’s preferred items among million through analyzing the user-item relations; furthermore, Collaborative Filtering has been utilized as one of the successful recommendation approaches in last few years; however, it has the issue of sparsity. This research work develops a deep network collaborative filtering (DeepNCF), which incorporates graph neural network (GNN), and novel network collaborative filtering (NCF) for performance enhancement. At first user-item dual network is constructed, thereafter-custom weighted dual mode modularity is developed for edge clustering. Furthermore, GNN is utilized for capturing the complex relation between user and item. DeepNCF is evaluated considering the two distinctive. The experimental analysis is carried out on two datasets for Amazon and movielens dataset for recall@20 and recall@50 and the normalized discounted cumulative gain (NDCG) metric is evaluated for Amazon Dataset for NDCG@20 and NDCG@50. The proposed method outperforms the most relevant research and is accurate enough to give personalized recommendations and diversity.
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesIJAEMSJORNAL
The implementation measures the classification accuracy on benchmark datasets after combining SIS and ANNs. In order to put a number on the gains made by using SIS as a strategic tool in data mining, extensive experiments and analyses are carried out. The predicted results of this investigation will have implications for both theoretical and applied settings. Predictive models in a wide variety of disciplines may benefit from the enhanced classification accuracy enabled by SIS inside ANNs. An invaluable resource for scholars and practitioners in the fields of AI and data mining, this study adds to the continuing conversation about how to maximize the efficacy of machine learning methods.
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...IJTET Journal
Abstract—Web mining is the amalgamation of information accumulated by traditional data mining methodologies and techniques with information collected over the World Wide Web. A Recommendation system is a profound application that comforts the user in a decision-making process, where they lack of personal experience to choose an item from the confound set of alternative products or services. The key challenge in the development of recommender system is to overcome the problems like single level recommendation and static recommendation, which are exists in the real world e-services. The goal is to achieve and enhance predicting algorithm to discover the frequent items, which are feasible to be purchasable. At this point, we examine the prior buying patterns of the customers and use the knowledge thus procured, to achieve an item set, which co-ordinates with the purchasing mentality of a particular set of customers. Potential recommendation is concerned as a link structure among the items within E-commerce website, which supports the new customers to find related products in a hurry. In Existing system, a fuzzy set consists of user preference and item features alone, so the recommendations to the customers are irrelevant and anonymous. In this paper, we suggest a recommendation technique, which practices the wild spreading and data sharing competency of a huge customer linkage and also this method follows a fuzzy tree- structured model, in which fuzzy set techniques are utilized to express user preferences and purchased items are in a clustered form to develop a user convenient recommendations. Here, an incremental association rule mining is employed to find interesting relation between variables in a large database.
Data Mining is an important aspect for any business. Most of the management level decisions are based on the process of Data Mining. One of such aspect is the association between different sale products i.e. what is the actual support of a product respected to the other product. This concept is called Association Mining. According to this concept we define the process of estimating the sale of one product respective to the other product. We are proposing an association rule based on the concept of Hardware support. In this concept we first maintain the database and compare it with systolic array after this a pruning process is being performed to filter the database and to remove the rarely used items. Finally the data is indexed according to hashing technique and the decision is performed in terms of support count. Krishan Rohilla | Shabnam Kumari | Reema"Data Mining based on Hashing Technique" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1 | Issue-4 , June 2017, URL: http://www.ijtsrd.com/papers/ijtsrd82.pdf http://www.ijtsrd.com/computer-science/data-miining/82/data-mining-based-on-hashing-technique/krishan-rohilla
Machine learning based recommender system for e-commerceIAESIJAI
Nowadays, e-commerce is becoming an essential part of business for many reasons, including the simplicity, availability, richness and diversity of products and services, flexibility of payment methods and the convenience of shopping remotely without losing time. These benefits have greatly optimized the lives of users, especially with the technological development of mobile devices and the availability of the Internet anytime and anywhere. Because of their direct impact on the revenue of e-commerce companies, recommender systems are considered a must in this field. Recommender systems detect items that match the customer's needs based on the customer's previous actions and make them appear in an interesting way. Such a customized experience helps to increase customer engagement and purchase rates as the suggested items are tailored to the customer's interests. Therefore, perfecting recommendation systems that allow for more personalized and accurate item recommendations is a major challenge in the e-marketing world. In our study, we succeeded in developing an algorithm to suggest personal recommendations to customers using association rules via the Frequent-Pattern-Growth algorithm. Our technique generated good results with a high average probability of purchasing the next product suggested by the recommendation system.
A Survey of Agent Based Pre-Processing and Knowledge RetrievalIOSR Journals
Abstract: Information retrieval is the major task in present scenario as quantum of data is increasing with a
tremendous speed. So, to manage & mine knowledge for different users as per their interest, is the goal of every
organization whether it is related to grid computing, business intelligence, distributed databases or any other.
To achieve this goal of extracting quality information from large databases, software agents have proved to be
a strong pillar. Over the decades, researchers have implemented the concept of multi agents to get the process
of data mining done by focusing on its various steps. Among which data pre-processing is found to be the most
sensitive and crucial step as the quality of knowledge to be retrieved is totally dependent on the quality of raw
data. Many methods or tools are available to pre-process the data in an automated fashion using intelligent
(self learning) mobile agents effectively in distributed as well as centralized databases but various quality
factors are still to get attention to improve the retrieved knowledge quality. This article will provide a review of
the integration of these two emerging fields of software agents and knowledge retrieval process with the focus
on data pre-processing step.
Keywords: Data Mining, Multi Agents, Mobile Agents, Preprocessing, Software Agents
An Efficient Compressed Data Structure Based Method for Frequent Item Set Miningijsrd.com
Frequent pattern mining is very important for business organizations. The major applications of frequent pattern mining include disease prediction and analysis, rain forecasting, profit maximization, etc. In this paper, we are presenting a new method for mining frequent patterns. Our method is based on a new compact data structure. This data structure will help in reducing the execution time.
Different Classification Technique for Data mining in Insurance Industry usin...IOSRjournaljce
this paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining means the effective unknown pattern discovery from a large amount database. It is an interactive knowledge discovery procedure which is includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining method for application to insurance concluding cluster discovery approaches.
A Model for Encryption of a Text Phrase using Genetic Algorithmijtsrd
"In any organization it is an essential task to protect the data from unauthorized users. Information Systems hardware, software, networks, and data resources need to be protected and secured to ensure quality, performance, and integrity. Security management deals with the accuracy, integrity, and safety of information resources. When effective security measures are in place, they can reduce errors, fraud, and losses. In the current work, the authors have proposed a model for encryption of a text phrase employing genetic algorithm. The entropy inherently available in genetic algorithm is exploited for introducing chaos in a text phrase thereby rendering it unreadable. The no of cross over points and mutation points decides the strength of the algorithm. The prototype of the model is implemented for testing the operational feasibility of the model and the few test cases are presented Dr. Poornima G. Naik | Mr. Pandurang M. More | Dr. Girish R. Naik ""A Model for Encryption of a Text Phrase using Genetic Algorithm"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Special Issue | Fostering Innovation, Integration and Inclusion Through Interdisciplinary Practices in Management , March 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23063.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-processing/23063/a-model-for-encryption-of-a-text-phrase-using-genetic-algorithm/dr-poornima-g-naik"
Applying Soft Computing Techniques in Information RetrievalIJAEMSJORNAL
There is plethora of information available over the internet on daily basis and to retrieve meaningful effective information using usual IR methods is becoming a cumbersome task. Hence this paper summarizes the different soft computing techniques available that can be applied to information retrieval systems to improve its efficiency in acquiring knowledge related to a user’s query.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 4
A new hybrid algorithm for business intelligence recommender system
International Journal of Network Security & Its Applications (IJNSA), Vol.6, No.2, March 2014
DOI: 10.5121/ijnsa.2014.6204
A NEW HYBRID ALGORITHM FOR BUSINESS
INTELLIGENCE RECOMMENDER SYSTEM
P.Prabhu¹ and N.Anbazhagan²
¹Directorate of Distance Education, Alagappa University, Karaikudi, Tamilnadu, INDIA
²Department of Mathematics, Alagappa University, Karaikudi, Tamilnadu, INDIA
ABSTRACT
Business Intelligence is a set of methods, processes and technologies that transform raw data into meaningful and useful information. A recommender system is a business intelligence system used to deliver knowledge to the active user for better decision making. Recommender systems apply data mining techniques to the problem of making personalized recommendations for information. The growth in the amount of information and the number of users in recent years poses challenges for recommender systems. Collaborative, content-based, demographic and knowledge-based are four different types of recommender systems. In this paper, a new hybrid algorithm is proposed for a recommender system that combines knowledge-based rules, user profiles and a most-frequent-item mining technique to obtain intelligence.
KEYWORDS
Business Intelligence, Frequent Itemset, k-means Clustering, Data Mining, Decision Making, Recommender System, E-commerce.
1. INTRODUCTION
Database Management Systems (DBMS) and Data Mining (DM) are two emerging technologies in this information world. Knowledge is obtained through the collection of information, and information is abundant in today's business world. To maintain this information, a systematic structure such as a database is used, in which data are organized in the form of tuples and attributes. To obtain knowledge from a collection of data, business intelligence methods are used. Data Mining is a powerful technology with great potential that helps business environments focus on only the essential information in their data warehouse. Using data mining technology, decision making becomes easier by improving business intelligence.
Frequent itemset mining finds all the frequent itemsets that satisfy minimum support and confidence thresholds. Support and confidence are the two measures used to find interesting frequent itemsets. In this paper, frequent itemset mining is used to search for frequent itemsets in the data warehouse. Based on the result, these frequent itemsets are grouped into clusters to identify the similarity of objects.
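As a sketch, the support and confidence measures defined above can be computed as follows; the transaction set and item names here are illustrative, not taken from the paper's dataset:

```python
# Toy transaction set (illustrative, not the paper's data).
transactions = [
    {"computer", "books"},
    {"computer", "jewels"},
    {"computer", "books", "shoes"},
    {"books"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y):
    """Confidence of the rule X -> Y: SUP(X u Y) / SUP(X)."""
    return support(x | y) / support(x)

print(support({"computer"}))                 # 0.75 (3 of 4 transactions)
print(confidence({"computer"}, {"books"}))   # 0.5 / 0.75 = 0.666...
```

An itemset is considered frequent when its support (and, for rules, its confidence) clears the chosen thresholds.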
Cluster analysis is an effective method of analyzing and finding useful information by grouping objects from large amounts of data. To group data into clusters, many algorithms have been proposed, such as the k-means algorithm, Fuzzy C-Means, evolutionary algorithms and the EM method. These clustering algorithms group the data into classes or clusters so that objects within a cluster are highly similar to one another and dissimilar to objects in other clusters. Thus, based on this similarity and dissimilarity, the objects are grouped into clusters.
In this paper, a new hybrid algorithm is proposed for a business intelligence recommender system based on knowledge of users and frequent items. The algorithm works in three phases, namely pre-processing, modelling and obtaining intelligence. First, the users are filtered based on the user's profile and knowledge, such as needs and preferences, defined in the form of rules. This performs feature selection and data reduction on the dataset. Second, these filtered users are clustered using the k-means clustering algorithm in the modelling phase. Third, the algorithm identifies the nearest neighbour of the active user and generates recommendations by finding the most frequent items in the identified cluster of users. The algorithm is experimentally tested with an e-commerce application for better decision making by recommending the top-n products to active users.
2. RELATED WORKS
Alexandre et al. [1] presented a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve the privacy of individual transactions. They analyzed the nature of privacy breaches and proposed a class of randomization operators that are much more effective than uniform randomization in limiting the breaches.
Jiaqi Wang et al. [4] stated that support vector machines (SVM) have been applied to build classifiers, which can help users make well-informed business decisions. The paper speeds up the response of SVM classifiers by reducing the number of support vectors. This was done by the K-means SVM (KMSVM) algorithm proposed in the paper. The KMSVM algorithm combines the K-means clustering technique with SVM and requires one more input parameter to be determined: the number of clusters.
M. H. Marghny et al. [5] stated that clustering analysis plays an important role in scientific research and commercial applications. In the article, they proposed a technique to handle large-scale data that can select initial clustering centers purposefully using genetic algorithms (GAs), reduce sensitivity to isolated points, avoid splitting big clusters, and to some degree overcome the skew in the data caused by disproportionate data partitioning owing to the adoption of multi-sampling.
Wenbin Fang et al. [11] presented two efficient Apriori implementations of Frequent Itemset Mining (FIM) that utilize new-generation graphics processing units (GPUs). The implementations take advantage of the GPU's massively multi-threaded SIMD (Single Instruction, Multiple Data) architecture. Both implementations employ a bitmap data structure to exploit the GPU's SIMD parallelism and to accelerate the frequency counting operation.
Ravindra Jain [8] explained that data clustering is a process of arranging similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is greater than that among groups. In the paper, a hybrid clustering algorithm based on K-means and K-harmonic means (KHM) was described. The results obtained from the proposed hybrid algorithm were much better than those of the traditional K-means and KHM algorithms.
David et al. [3] described a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture-likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
Risto Vaarandi [9] stated that event logs contain vast amounts of data that can easily overwhelm a human; therefore, mining patterns from event logs is an important system management task. The paper presented a novel clustering algorithm for log file data sets that helps one to detect frequent patterns in log files, build log file profiles, and identify anomalous log file lines.
R. Venu Babu and K. Srinivas [10] presented a literature survey on cluster-based collaborative filtering and an approach to constructing it. In modern e-commerce it is not easy for active users to find the most suitable goods of interest as more and more information is placed online (movies, audio, books, documents, etc.). So, in order to provide the most suitable high-value information to active users of an e-commerce business system, a customized recommender system is required. Collaborative filtering has become a popular technique for reducing this information overload. While traditional collaborative filtering systems have been a substantial success, researchers and commercial applications have identified several problems: the early-rater problem, the sparsity problem, and the scalability problem.
3. NEW HYBRID ALGORITHM
In this business world, there exists a lot of information, and it is necessary to maintain this information for decision making in the business environment. Decision making relies on two kinds of data: OnLine Analytical Processing (OLAP) and OnLine Transactional Processing (OLTP). The former contains historical data about the business from its beginning, and the latter contains only day-to-day business transactions. Based on these two kinds of data, the decision-making process can be carried out by means of a new hybrid algorithm based on frequent itemset mining, clustering using the k-means algorithm, and knowledge of users, in order to improve business intelligence. The proposed hybrid algorithm design is shown in Figure 1.
Figure 1. A New Hybrid algorithm design
The steps involved in the proposed hybrid algorithm are given below:
• Identifying the dataset
• Choose the consideration columns/features
• Filtering objects by defining the rules
• Identifying frequent items
• Cluster objects using k-means clustering
• Find nearest neighbour of active user
• Generate recommendation dataset for active user.
3.1. Identifying the dataset
To maintain the data systematically and efficiently, database and data warehouse technologies are used. The data warehouse not only deals with the business activities but also contains information about the users involved in the business. The dataset is represented as:
D = Σ(A) = {a1, a2, …, an}   (1)
where A is the collection of all attributes and a1, a2, … are the attributes of the dataset. Upon collecting the data, the dataset contains data elements aij, where i = 0,1,…,n and j = 0,1,…,m.
3.2. Choosing the consideration columns/features
Once the dataset has been identified, the next step of the proposed work is to choose the consideration columns, that is, the columns/subset of features to be considered for this work. This includes the elimination of irrelevant columns from the dataset; an irrelevant column/feature is one that provides little information about the dataset.
CC = Σ(A`) = Σ(A) − Σ(Au)   (2)
Σ(A`) = {a1, a2, …, an} − {au1, au2, …, aun}   (3)
CC = D` = Σ(A`) = {a`1, a`2, a`3, …, a`m}   (4)
where CC denotes the consideration columns, represented as Σ(A`); Σ(A) represents the set of all attributes in the chosen dataset; and Σ(Au) represents the set of all features to be eliminated to obtain the subset of features. The consideration columns of the dataset can be represented as follows:
a00  a01  a02 … a0m          a`00  a`01  a`02 … a`0m
a10  a11  a12 … a1m          a`10  a`11  a`12 … a`1m
 .                            .
 .                            .
 .                            .
an0  an1  an2 … anm          a`n0  a`n1  a`n2 … a`nm
Here, a`ij denotes the data elements in the new resultant dataset with consideration columns, where i = 0,1,…,n and j = 0,1,…,m.
3.3. Filtering objects by defining rules
From the consideration dataset, the objects can be grouped under stated conditions that are defined in terms of rules. That is, for each column that is considered, a rule specifies how to extract the necessary domain from the original dataset; this rule acts as a threshold value. The domain can be chosen by identifying the frequent items in the dataset. A rule can be defined as:
R = {x | x ∈ D`, x ≥ 20 and x ≤ 60}   (5)
Σ(R) = Σ(R) ⋈ σ((aij(A`)/D`) ≥ 20 and (aij(A`)/D`) ≤ 60)
where R denotes the rule, σ denotes selection, ⋈ denotes join, aij(A`) denotes the attribute from the dataset D`, and D` denotes the dataset with selected attributes.
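The filtering step can be sketched by modelling each rule as a predicate over one object; the user records, field names and bounds below are illustrative assumptions:

```python
# Illustrative user records (not the paper's dataset).
users = [
    {"age": 22, "occupation": "Teacher"},
    {"age": 65, "occupation": "Teacher"},
    {"age": 24, "occupation": "Engineer"},
]

# Each rule is a predicate; an object survives only if every rule holds,
# mirroring the selection-and-join of eq. (5).
rules = [
    lambda u: 20 <= u["age"] <= 60,
    lambda u: u["occupation"] == "Teacher",
]

filtered = [u for u in users if all(r(u) for r in rules)]
print(filtered)  # only the 22-year-old teacher satisfies both rules
```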
3.4 Identifying frequent items
The frequent items can be identified by analyzing the repeated values in the consideration columns:
FIS = value(S) > (SUP(S) and/or CONF(S))   (6)
where FIS represents the identified frequent itemset, and value(S) is the set of frequent items in column S satisfying SUP(S) and CONF(S). SUP(S) is defined as the percentage of objects in the dataset that contain the itemset. CONF(S) is defined as SUP(X ∪ Y) / SUP(X); i.e., the confidence of a frequent itemset can be determined by combining the X and Y values from the dataset and then normalizing by the support of X. Any objects that satisfy the criteria are selected and counted. This is carried out by:
Cn = Ƞ(aij(A) / D`)   (7)
which counts the number of domains aij of the attribute list A in the dataset D. From the counted value Cn, we can determine the frequent itemsets occurring in the dataset using the threshold value T:
Cn(aij(A)) > T   (8)
which states that the domain aij of attribute A satisfies the specified threshold value T.
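Equations (7) and (8) amount to counting each domain value of an attribute and keeping those above the threshold T; the rows, column name and threshold below are illustrative assumptions:

```python
from collections import Counter

# Illustrative rows of one consideration column (not the paper's data).
rows = [{"item": "computer"}, {"item": "books"},
        {"item": "computer"}, {"item": "computer"}, {"item": "shoes"}]

T = 2  # threshold on the raw count Cn

# Cn: count each domain value of the attribute (eq. 7).
counts = Counter(r["item"] for r in rows)

# Keep the values whose count exceeds T (eq. 8).
frequent = {v for v, c in counts.items() if c > T}
print(frequent)  # {'computer'}, since it occurs 3 times
```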
3.5 Clustering objects/users using k-means clustering
Upon forming the new dataset D``, the objects in D`` are clustered based on object similarity using k-means clustering. k-means clustering is a method of classifying or grouping objects into k clusters (where k is the number of clusters). The clustering is performed by minimizing the sum of squared distances between the objects and the corresponding centroid. The result consists of clusters of objects with their labels/classes.
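A minimal k-means sketch in pure Python follows; the points, initial centroids and iteration count are illustrative, not the paper's dataset:

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Mean of each cluster; keep the old centroid for empty clusters.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1, 1), (1, 2), (8, 8), (9, 8)]
cents, cls = kmeans(pts, [(0, 0), (10, 10)])
print(cents)  # [(1.0, 1.5), (8.5, 8.0)]
```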
3.6 Find the nearest neighbour of active user
To find the nearest neighbours of the active user, the similarity between the active user and each cluster centroid is calculated based on a distance measure. Then, the cluster with the highest similarity among all clusters is selected.
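This selection reduces to picking the centroid at minimum distance from the active user; the centroid values here are illustrative assumptions:

```python
import math

# Illustrative cluster centroids (cluster id -> centroid).
centroids = {1: (1.0, 1.5), 2: (8.5, 8.0)}
active_user = (2.0, 2.0)

# The nearest cluster is the one whose centroid minimizes the distance.
nearest = min(centroids, key=lambda cid: math.dist(active_user, centroids[cid]))
print(nearest)  # cluster 1 is closest to the active user
```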
3.7 Generate recommendation dataset for active user
Recommendations are generated for the active user from the most frequent items, determined by the specified threshold T, purchased by the users in the selected cluster. This gives intelligence to the users and the business for better decision making.
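As a sketch, the items purchased by users in the selected cluster can be ranked by support and filtered by the threshold; the purchase lists and threshold below are illustrative assumptions:

```python
from collections import Counter

# Illustrative purchases of the users in the selected cluster.
cluster_purchases = {
    41: [101, 209, 103],
    50: [101, 310, 311],
}
support_threshold = 0.6  # fraction of cluster users who bought the item
n_users = len(cluster_purchases)

# Count, per item, how many cluster users purchased it.
counts = Counter(i for items in cluster_purchases.values() for i in set(items))
recommended = [i for i, c in counts.most_common()
               if c / n_users >= support_threshold]
print(recommended)  # only item 101 clears the 60% support threshold
```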
3.8 New Hybrid Algorithm
Algorithm: New Hybrid Algorithm
Input: the number of clusters k; dataset D with n objects.
Output: a set of clusters Ck.
Begin
  Identify the dataset D = Σ(A) = {a1, a2, …, an} of attributes/objects.
  Outline the consideration columns (CC) from D:
    CC = D` = Σ(A`) = {a`1, a`2, a`3, …, a`m}
  repeat
    Formulate the rules for identifying the similar objects:
      Σ(R) = Σ(R) ⋈ σ(aij(A`)/D`), where i = 1 to n, j = 1 to m
    S = f(X) / D, where S is the sample set containing the identified columns
    FIS = value(S) > (SUP(X) and/or (SUP(X ∪ Y) / SUP(X))),
      where FIS is the set of frequent itemsets identified
    Cn(aij(A`)) > T from FIS, where T specifies the threshold value
    Generate the resultant dataset D``
  until no further partition is possible in CC
  Identify the k initial mean vectors (centroids) from the objects of D``
  repeat
    Compute the distance Ƞ between each object ai and the centroids cj
    Assign each object to the cluster with min{∂(cjk)} over all clusters
    Recalculate the k new centroids cj from the newly formed clusters
  until convergence is reached
  Find the nearest neighbour of the active object
  Generate recommendations from the most frequent items of the nearest neighbour
End
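The three phases of the algorithm can be rendered as a small executable sketch; the user records, rules, features and initial centroids below are illustrative assumptions, not the paper's data warehouse:

```python
import math
from collections import Counter

def hybrid_recommend(users, rules, initial_centroids, active, top_n=2):
    # Phase 1: pre-processing - filter users by profile/knowledge rules.
    filtered = [u for u in users if all(r(u) for r in rules)]
    # Phase 2: modelling - k-means over (age, amount-in-thousands) features.
    feats = {u["id"]: (u["age"], u["amount"] / 1000.0) for u in filtered}
    centroids = list(initial_centroids)
    groups = [[] for _ in centroids]
    for _ in range(10):
        groups = [[] for _ in centroids]
        for uid, p in feats.items():
            j = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(uid)
        centroids = [tuple(sum(feats[u][d] for u in g) / len(g) for d in (0, 1))
                     if g else centroids[i] for i, g in enumerate(groups)]
    # Phase 3: intelligence - nearest cluster for the active user, then
    # recommend the most frequent items purchased inside that cluster.
    p = (active["age"], active["amount"] / 1000.0)
    j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
    by_id = {u["id"]: u for u in filtered}
    counts = Counter(i for uid in groups[j] for i in set(by_id[uid]["items"]))
    return [item for item, _ in counts.most_common(top_n)]

users = [
    {"id": 1, "age": 22, "amount": 30000, "items": ["computer", "books"]},
    {"id": 2, "age": 23, "amount": 28000, "items": ["computer", "jewels"]},
    {"id": 3, "age": 45, "amount": 40000, "items": ["shoes"]},
    {"id": 4, "age": 70, "amount": 5000, "items": ["sports"]},  # filtered out
]
rules = [lambda u: 21 <= u["age"] <= 60, lambda u: u["amount"] > 25000]
recs = hybrid_recommend(users, rules, [(22.0, 29.0), (45.0, 40.0)],
                        {"age": 22, "amount": 29000})
print(recs[0])  # 'computer' is the most frequent item in the neighbour cluster
```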
4. EXPERIMENTAL SETUP
The proposed hybrid algorithm provides a solution for recommender systems. The methodology can be verified through various experimental setups. In this work an e-commerce dataset is used to test the proposed algorithm for recommending products purchased by users. The algorithm helps active users find items they want to buy from a business. E-commerce businesses using recommender systems include Amazon.com, CDNOW.com, Drugstore.com, eBay, MovieFinder.com and Reel.com. Table 1 shows the description of the sample dataset.
Table 1 Dataset Description

Key Element              Description
Dataset Name             E-Commerce synthetic dataset 1
Original Attribute-list  Age, Gender, Occupation, Salary, Date, Time, Item, Rate, Amount, Rating
Consideration Columns    Age, Gender, Occupation, Salary, Amount
Rules defined            Age >= 21 and Age <= 25; Occupation = "Teacher"; Amount > 25000
5. RESULTS AND DISCUSSION
Based on the identified frequent itemsets, it is clear that users aged between 21 and 25 frequently access the site, and so the frequent itemset consists of users in this age group. From the original dataset, the proposed method identified the products purchased by users belonging to this age group, as shown in Table 2.
Table 2 Frequent Items Identified

Age/User id   Item id   Frequent Items
21            1010      Computer
22            2001      Jewels
23            3001      Books
24            5010      Sports
25            4101      Shoes
The identified frequent items based on the defined rules are counted to form the resultant dataset D``. Now k-means clustering is applied to group the users based on similarities. Table 3 shows the initial centroids.

Table 3 Initial Centroids

Cluster   Age   Mean Vector (Centroid)
1         21    (1,1)
2         23    (5,7)
The resulting objects found in cluster-1 are 21 and 22; the objects found in cluster-2 are 23, 24 and 25. Table 4 shows the distance between each object and the centroid of each cluster.
Table 4 The distance between the objects and the centroid of each cluster

Age/User id   Distance to Cluster-1 centroid   Distance to Cluster-2 centroid
21            1.8                              5
22            1.8                              2
23            5.4                              3
24            4.0                              2.2
25            2.1                              1.4
Each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean. Thus objects 21 and 22 belong to cluster-1, which they are nearer to, whereas objects 23, 24 and 25 belong to cluster-2, which they are nearer to. Hence no relocation occurs in this example. From the obtained results, it is clear that the users are grouped into 2 clusters, named cluster-1 and cluster-2.
When an active user enters the site, the first step is to examine the active user's details to identify which cluster the user falls in. This is done by finding the nearest neighbour of the active user among the existing clusters. Based on the distance, the active user can easily be classified and positioned in a cluster. In our experiment, if the active user is aged 22, then they fall under cluster-1, and we conclude that this user is more likely to purchase either a computer or jewels. The incoming user can then be recommended and redirected to the web page containing the details of computers and jewels. Through this kind of redirection, the active user saves the time of searching for the desired product. Thus, by grouping users with similar behavior into a cluster and using the cluster result, the active user can be recommended and redirected to those products' web page. Hence this intelligence helps active users and the business make better decisions by recommending the top products.
Figure 2 shows a sample cluster of users generated using the proposed clustering approach for k = 5 with 41 users of synthetic dataset 2.
[3-D scatter plot of the clustered users for k = 5; axes: Feature 1, Feature 2 and Feature 3.]
Figure 2. Sample cluster of users for k =5
Table 5 shows the allocation of users to the clusters for k = 5 using the proposed method.
Table 5 Allocated user ids in the clusters for k = 5 of synthetic dataset 2

Cluster Id   Allocated user ids                         Total   Centroid   Centroid values
1            21,23,39,40,46,52,55                       7       C1         17.42 16.45 16.41
2            20,22,24,27,30,31,34,44,45,54,56,57,59     13      C2         4.88 6.67 6.16
3            41,50                                      2       C3         11.90 13.94 12.80
4            28,29,32,42,48,49                          6       C4         9.59 11.87 11.40
5            25,26,33,35,36,37,38,43,47,51,53,58,60     13      C5         8.40 9.82 8.51
Table 6 shows a sample of the identified frequent items purchased by the users.

Table 6 Sample of Identified frequent items

User Id   Item ids (support 50%)   Item ids (support 60%)   Item ids (support 70%)
20        101,310,401              101,310                  101
21        102,308                  308                      308
22        105                      105                      105
23        219,207                  219                      219
24        101,309                  101,309                  101
-         -                        -                        -
41        101,209                  101,209                  209
-         -                        -                        -
40        103,310,311              103,311                  311
60        123                      123                      123
The experimental results of frequent items identified using various support counts are presented. The active users, their identified clusters and the recommended items using k = 10 and support = 50% are tabulated in Table 7.

Table 7 Recommended items of active user

Active user   Identified Cluster Id   Allocated User ids   Recommended Item Ids
AU1           3                       41,50                support 50%: 101,209,103,310,311
                                                           support 60%: 101,209,103,311
                                                           support 70%: 209,311
In this example, cluster Id 3 is identified as the neighbour of active user AU1. Hence the frequent items (101, 209, 103, 310, 311) from the objects in cluster Id 3 are recommended to the active user. This gives intelligence to active users for selecting items based on their profile and preferences, and improves business activities.
The performance of the proposed algorithm can also be compared with existing methodologies such as [10] using metrics like precision, recall and the silhouette index. This can be carried out by measuring quality for various numbers of neighbours, iterations and clusters. The proposed method performs better than the existing methods.
6. CONCLUSIONS AND FUTURE WORK
Business intelligence is a technology for extracting information for a business from its user databases. In this paper we presented and evaluated a new hybrid algorithm for improving business intelligence for better decision making by recommending products purchased by the user. The performance of the methodology can be verified through many experimental setups. The results obtained from the experiments show that the methodology performs well. As future work, the algorithm can be tested with many real-world datasets under different metrics.
REFERENCES
[1] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal and Johannes Gehrke, "Privacy Preserving Mining of Association Rules", IBM Almaden Research Center, USA, 2002.
[2] Badrul Sarwar, George Karypis, Joseph Konstan and John Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms", Proceedings of the 10th WWW International Conference, Hong Kong, pp. 285-295, May 1-5, 2001.
[3] David M. Rocke and Jian Dai, "Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data", Data Mining and Knowledge Discovery, Vol. 7, pp. 215-232, 2003.
[4] Jiaqi Wang, Xindong Wu and Chengqi Zhang, "Support vector machines based on K-means clustering for real-time business intelligence systems", Int. J. Business Intelligence and Data Mining, Vol. 1, No. 1, 2005.
[5] M. H. Marghny, Rasha M. Abd El-Aziz and Ahmed I. Taloba, "An Effective Evolutionary Clustering Algorithm: Hepatitis C Case Study", International Journal of Computer Applications, Vol. 34, No. 6, Nov. 2011.
[6] P. Prabhu, "Method for Determining Optimum Number of Clusters for Clustering Gene Expression Cancer Dataset", International Journal of Advanced Research in Computer Science, Vol. 2, No. 4, p. 315, July-August 2011.
[7] P. Prabhu and N. Anbazhagan, "Improving the performance of k-means clustering for high dimensional dataset", International Journal of Computer Science and Engineering, Vol. 3, No. 6, pp. 2317-2322, June 2011.
[8] Ravindra Jain, "A Hybrid Clustering Algorithm for Data Mining", CCSEA, SEA, CLOUD, DKMP, CS & IT 05, pp. 387-393, 2012.
[9] Risto Vaarandi, "A Data Clustering Algorithm for Mining Patterns from Event Logs", Proceedings of the IEEE Workshop on IP Operations and Management, 2003.
[10] R. Venu Babu and K. Srinivas, "A New Approach for Cluster Based Collaborative Filters", International Journal of Engineering Science and Technology, Vol. 2, No. 11, pp. 6585-6592, 2010.
[11] Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He and Qiong Luo, "Frequent Itemset Mining on Graphics Processors", Proceedings of the Fifth International Workshop on Data Management on New Hardware, June 2009.
[12] Zan Huang, Daniel Zeng and Hsinchun Chen, "A Comparison of Collaborative-Filtering Recommendation Algorithms for E-commerce", IEEE Intelligent Systems, pp. 68-78, 2007.