Assessment of Cluster Tree Analysis based on Data Linkages (journal ijrtem)
Abstract: Data linkage is a procedure that joins two or more sets of data (surveyed or proprietary) from different organizations to generate a valuable store of information that can be used for further analysis. This allows for the real application of the data. One-to-many data linkage associates an entity from the first data set with a number of related entities from the other data sets. Prior works concentrate on accomplishing one-to-one data linkage. A two-level clustering tree known as the One-Class Clustering Tree (OCCT) with a built-in Jaccard similarity measure was previously proposed, in which each leaf contains a group of records instead of a single classified sequence. OCCT's use of the Jaccard similarity coefficient increases time complexity significantly. We therefore propose substituting the Jaccard similarity coefficient with the Jaro-Winkler similarity measure to obtain the group similarity, because it takes order into consideration, using positional indices to calculate relevance, unlike Jaccard's. An evaluation of our proposal serves as validation of an enhanced one-to-many data linkage system.
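The key claim above, that the Jaro-Winkler measure accounts for character order while the Jaccard coefficient does not, can be illustrated with a small sketch (a generic pure-Python illustration, not the OCCT implementation; the `jaccard` here operates on character sets for comparability):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0  # count transpositions among matched characters
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro boosted by the length (up to 4) of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def jaccard(s1, s2):
    """Jaccard on character sets: ignores order and repetition entirely."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 1.0

# "MARTHA" vs "MARHTA": same character set, different order
print(jaccard("MARTHA", "MARHTA"))       # 1.0 -- order is invisible to Jaccard
print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961 -- the transposition is penalized
```

Within the OCCT leaves described above, each pairwise score would feed a group-level aggregate; that aggregation step is left out here.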
Index Terms: Maximum-Weighted Bipartite Matching, Ant Colony Optimization, Graph Partitioning Technique
Novel modelling of clustering for enhanced classification performance on gene... (IJECEIAES)
Gene expression data is popular for its capability to disclose various disease conditions. However, the conventional procedure to extract gene expression data itself introduces various artifacts that pose challenges in diagnosing and classifying a complex disease indication such as cancer. A review of existing research indicates that few classification approaches have proven to be standard with respect to higher accuracy and applicability to gene expression data, apart from unaddressed problems of computational complexity. Therefore, this manuscript introduces a novel and simplified model using the Graph Fourier Transform, eigenvalues, and eigenvectors to offer better classification performance, considering a case study of a microarray database, one typical example of gene expression data. The study outcome shows that the proposed system offers comparatively better accuracy and reduced computational complexity than existing clustering approaches.
Multi Label Spatial Semi Supervised Classification using Spatial Associative ... (cscpconf)
Multi-label spatial classification based on association rules with multi-objective genetic algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal with the multiple-class-labels problem. We adapt problem transformation for multi-label classification and use a hybrid evolutionary algorithm to optimize the generation of spatial association rules, which addresses single labels. MOGA is used to combine the single labels into multi-labels under the conflicting objectives of predictive accuracy and comprehensibility. Semi-supervised learning is done through the process of rule cover clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm is simulated and the results are compared with a MOGA-based associative classifier; the proposed method outperforms the existing one.
Accounting for variance in machine learning benchmarks (Devansh16)
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
Classification By Clustering Based On Adjusted Cluster (IOSR Journals)
This document summarizes a research paper that proposes a new technique called "Classification by Clustering" (CbC) to define decision trees based on cluster analysis. The technique is tested on two large HR datasets. CbC involves running a clustering algorithm on the dataset without using the target variable, calculating the target variable distribution in each cluster, setting a threshold to classify entities, fine-tuning the results by weighting important attributes, and testing the results on new data. The paper finds that CbC can provide meaningful decision rules even when conventional decision trees fail to do so, and in some cases CbC performs better. A new evaluation measure called Weighted Group Score is also introduced to assess models when conventional measures cannot be used.
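The CbC steps summarized above, cluster without the target and then threshold each cluster's target distribution, can be sketched as follows (hypothetical cluster assignments and labels for illustration; the clustering algorithm itself and the attribute weighting are omitted):

```python
from collections import defaultdict

# hypothetical cluster assignments and binary target labels
assignments = [0, 0, 0, 1, 1, 1, 1]
labels      = [1, 1, 0, 0, 0, 0, 1]

# collect target values per cluster (the clustering step ignored the target)
by_cluster = defaultdict(list)
for c, y in zip(assignments, labels):
    by_cluster[c].append(y)

# classify each cluster by whether its positive rate exceeds a threshold
threshold = 0.5
rules = {c: int(sum(ys) / len(ys) >= threshold) for c, ys in by_cluster.items()}
print(rules)  # {0: 1, 1: 0} -- cluster 0 mostly positive, cluster 1 mostly negative
```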
11. Software modules clustering: an effective approach for reusability (Alexander Decker)
This document summarizes previous work on using clustering techniques for software module classification and reusability. It discusses hierarchical clustering and non-hierarchical clustering methods. Previous studies have used these techniques for software component classification, identifying reusable software modules, course clustering based on industry needs, mobile phone clustering based on attributes, and customer clustering based on electricity load. The document provides background on clustering analysis and its uses in various domains including software testing, pattern recognition, and software restructuring.
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I... (Devansh16)
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that these synthetic data generation pipelines can be used as an alternative to bypass privacy concerns and as an alternative way to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since our SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
This document presents a novel approach to anomaly detection in link mining based on applying mutual information. It adapts the CRISP-DM methodology for link mining and applies it to a case study using co-citation data. The methodology includes data description, preprocessing, transformation, exploration, modeling through graph mapping and hierarchical clustering, and evaluation. Mutual information is used to interpret the semantics of anomalies identified in clusters. The case study identifies collective and community anomalies and confirms mutual information can validate clustering results by showing strong links within clusters but independence between objects in one cluster.
EXTRACTING USEFUL RULES THROUGH IMPROVED DECISION TREE INDUCTION USING INFORM... (ijistjournal)
Classification is a widely used technique in the data mining domain, where scalability and efficiency are the immediate problems for classification algorithms on large databases. We suggest improvements to the existing C4.5 decision tree algorithm. In this paper, attribute-oriented induction (AOI) and relevance analysis are incorporated with concept hierarchy knowledge and the HeightBalancePriority algorithm for construction of a decision tree, along with multi-level mining. Priorities are assigned to attributes by evaluating information entropy at different levels of abstraction when building the decision tree with the HeightBalancePriority algorithm. Modified DMQL queries are used to understand and explore the shortcomings of the decision trees generated by the C4.5 classifier for an education dataset, and the results are compared with the proposed approach.
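The entropy evaluation that drives the attribute prioritization above is the standard Shannon formulation; a minimal generic sketch (not the paper's HeightBalancePriority implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction from splitting rows (list of dicts) on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# a perfectly predictive attribute recovers all of the label entropy
rows = [{"a": 0, "y": 0}, {"a": 0, "y": 0}, {"a": 1, "y": 1}, {"a": 1, "y": 1}]
print(information_gain(rows, "a", "y"))  # 1.0
```

Attributes with higher gain would receive higher priority at a given level of abstraction.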
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering (ijeei-iaes)
As a primary data mining method for knowledge discovery, clustering is a technique for classifying a dataset into groups of similar objects. The most popular method for data clustering, K-means, suffers from the drawback of requiring the number of clusters and their initial centers, which must be provided by the user. In the literature, several methods have been proposed, in the form of K-means variants, genetic algorithms, or combinations of the two, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering processes. In this paper we present an approach to automatically generate such parameters to achieve optimal clusters, using a modified genetic algorithm operating on varied individual structures and a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
Engineering Research Publication
International Journal of Engineering & Technical Research
ISSN: 2321-0869 (O), 2454-4698 (P)
www.erpublication.org
The Internet has become the most popular surfing environment, which increases the volume of service-oriented data. As data size grows, finding and retrieving the most similar data from a large volume of data becomes a more difficult task. This problem is addressed by various research methods that attempt to cluster large volumes of data. In existing work, the Clustering-based Collaborative Filtering approach (ClubCF) was introduced, whose main goal is to cluster similar kinds of data together so that retrieval time cost can be reduced considerably. However, existing methods cannot find similar reviews accurately, which must be addressed for an efficient and accurate recommendation system. The proposed research method ensures this by introducing a novel technique, Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together, so that the recommendation process can be done more easily. To handle large volumes of data, this work adopts the MapReduce framework, which divides the entire dataset into subsets assigned to separate nodes with individual key values. After clustering, the clustered outcome is merged using an inverted-index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, movie recommendations are made accurately using the logistic regression method. The overall evaluation of the proposed method is carried out in Hadoop, from which it can be shown that the proposed technique provides better outcomes than existing research techniques.
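The first MoCFCR stage, k-means over reviewers, can be sketched in one dimension (a generic toy, e.g. clustering reviewers by an average rating value; the MapReduce partitioning, inverted index, and logistic regression stages are omitted):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain 1-D k-means: assign each point to its nearest center, then recenter."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # an empty cluster keeps its old center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# two well-separated groups of "average ratings"
centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
print(sorted(centers))  # [2.0, 11.0]
```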
The development of data mining is inseparable from recent developments in information technology that enable the accumulation of large amounts of data. For example, a shopping mall records every sales transaction of goods using various POS (point of sale) terminals. The database of these sales can reach a large storage capacity, with more added each day, especially when the shopping center develops into a nationwide network. The growth of the internet has also contributed substantially to this accumulation of data. But the rapid growth of data accumulation has created conditions often referred to as "data rich but information poor", because the collected data cannot be used optimally for useful applications. Not infrequently, such data sets are simply left behind as a "data grave". There are several techniques used in data mining, including association, classification, and clustering. In this paper, the author compares the performance of two classification methods: naïve Bayes and the C4.5 algorithm.
A Formal Machine Learning or Multi Objective Decision Making System for Deter... (Editor IJCATR)
Decision-making typically needs mechanisms to compromise among opposing criteria. When multiple objectives are involved in machine learning, a vital step is to relate the weights of individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has typically been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have obvious limitations: when the determination of weights is treated as a single optimization problem, a result based on such optimization is limited, and it can even be unreliable when knowledge about the multiple objectives is incomplete, for instance because of poor data. The constraints on weights are also discussed. Variable weights are natural in decision-making processes. Here, we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making or machine learning are analyzed, and the weights are determined with the help of a standard neural network.
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel... (IJERD Editor)
This document discusses distance similarity measures that can be used for data mining classification and clustering techniques. It proposes a novel distance similarity measure called "Supervised & Unsupervised learning" that uses Euclidean distance similarity to partition training data into clusters. It then builds decision trees on each cluster to improve classification performance. The document also discusses using these measures for other applications like image processing, where k-means clustering can be used to segment images into clusters of similar pixel intensities. In conclusion, it states these similarity measures can help analyze complex datasets for business analysis purposes.
The document discusses a link mining methodology adapted from the CRISP-DM process to incorporate anomaly detection using mutual information. It applies this methodology in a case study of co-citation data. The methodology involves data description, preprocessing, transformation, exploration, modeling, and evaluation. Hierarchical clustering identified 5 clusters, with cluster 1 showing strong links and cluster 5 weak links. Mutual information validated the results, showing cluster 5 had the lowest mutual information, indicating independent variables. The case study demonstrated the approach can interpret anomalies semantically and be used with real-world data volumes and inconsistencies.
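The validation idea above, low mutual information signalling independent objects within a cluster, can be sketched for discrete variables (a generic plug-in estimator, not the paper's code):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information of two paired discrete sequences, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 -- strongly linked
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 -- independent
```

A cluster like cluster 5 above, with near-zero mutual information between its objects, would thus be flagged as weakly linked.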
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction (ijtsrd)
Data mining techniques play an important role in data analysis. For the construction of a classification model that can predict the performance of students, particularly for engineering branches, a decision tree algorithm associated with data mining techniques has been used in this research. A number of factors may affect the performance of students. Data mining technology can model student grades well, and we also used classification algorithms for prediction. In this paper, we used educational data mining to predict students' final grades based on their performance. We proposed student data classification using the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay | San San Nwe, "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
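ID3's core loop, picking the attribute with maximum information gain and recursing on each value, can be sketched on a tiny hypothetical grade table (the `attendance`/`hw` attributes and their values are invented for illustration, not the paper's dataset):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, attrs, target):
    """Build a nested-dict decision tree over rows (dicts keyed by attribute)."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    def gain(a):
        rem = 0.0
        for v in {r[a] for r in rows}:
            sub = [r[target] for r in rows if r[a] == v]
            rem += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

rows = [
    {"attendance": "high", "hw": "done",   "grade": "pass"},
    {"attendance": "high", "hw": "missed", "grade": "pass"},
    {"attendance": "low",  "hw": "done",   "grade": "pass"},
    {"attendance": "low",  "hw": "missed", "grade": "fail"},
]
tree = id3(rows, ["attendance", "hw"], "grade")
print(tree)
```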
Applying the big bang-big crunch metaheuristic to large-sized operational pro... (IJECEIAES)
In this study, we present an investigation of the capability of the big bang-big crunch metaheuristic (BBBC) for managing operational problems, including combinatorial optimization problems. The BBBC is a product of the evolution theory of the universe in physics and astronomy. Its two main phases are the big bang and the big crunch. The big bang phase involves the creation of a population of random initial solutions, while in the big crunch phase these solutions are shrunk into one elite solution represented by a mass center. This study looks into the BBBC's effectiveness on assignment and scheduling problems. It was enhanced by incorporating an elite pool of diverse and high-quality solutions; a simple descent heuristic as a local search method; implicit recombination; Euclidean distance; dynamic population size; and elitism strategies. These strategies provide a balanced search over a diverse and high-quality population. The investigation is conducted by comparing the proposed BBBC with similar metaheuristics. The BBBC is tested on three different classes of combinatorial optimization problems, namely quadratic assignment, bin packing, and job shop scheduling, where the incorporated strategies have a strong impact on the BBBC's performance. Experiments showed that the BBBC maintains a good balance between diversity and quality, produces high-quality solutions, and outperforms other identical metaheuristics (e.g., swarm intelligence and evolutionary algorithms) reported in the literature.
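The two phases described above can be sketched on a toy continuous problem (a minimal generic BBBC, assuming a fitness-weighted center of mass for the crunch and a search radius shrinking with the iteration count for the bang; the elite pool, local search, and combinatorial encodings from the paper are omitted):

```python
import random

def bbbc(f, dim, pop_size=50, iters=100, bound=10.0, seed=0):
    """Minimize f over [-bound, bound]^dim with a basic big bang-big crunch loop."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(pop_size)]
    best = min(pop, key=f)
    for k in range(1, iters + 1):
        # big crunch: contract the population to a fitness-weighted mass center
        w = [1.0 / (1e-12 + f(x)) for x in pop]
        total = sum(w)
        center = [sum(wi * x[d] for wi, x in zip(w, pop)) / total
                  for d in range(dim)]
        if f(center) < f(best):
            best = center[:]
        # big bang: scatter a fresh population around the center,
        # with the spread shrinking as iterations advance
        pop = [[center[d] + rng.gauss(0, 1) * bound / k for d in range(dim)]
               for _ in range(pop_size)]
    return best

best = bbbc(lambda x: sum(v * v for v in x), dim=2)
print(sum(v * v for v in best))  # small residual near the optimum at the origin
```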
This document provides a systematic review of educational data mining (EDM) techniques and their applications. It discusses how EDM can be used to extract hidden information from large student data repositories using clustering, classification, prediction, and recommendation algorithms. These algorithms help group similar students, categorize students, predict student outcomes, and suggest courses. The document also reviews literature applying these EDM techniques and outlines future work on semantic and opinion mining to improve adaptive learning systems.
Growth in medical informatics can be observed nowadays. Advances in different medical fields help discover various critical diseases and provide guidelines for their cure. This has been possible only because of rich medical databases and the automation of the data analysis process. This analysis requires considerable learning and intelligence, for which data mining techniques provide the basis; many techniques are available, such as decision tree induction, rule-based classification, support vector machines, stochastic classification, logistic regression, naïve Bayes, artificial neural networks, fuzzy logic, and genetic algorithms. This paper presents the basics of data mining, surveys the techniques available in the medical sciences, and reviews the efforts made on medical databases using data mining techniques for human disease diagnosis.
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...iosrjce
In distributed peer-to-peer systems, huge amounts of data are dispersed, and grouping those data from multiple sources is a tedious task. Applying effective data mining techniques eases the clustering of distributed data and reduces the hurdles of processing, storage, and transmission costs. To perform dynamic distributed clustering, a fully decentralized clustering method has been proposed. HGD Cluster can cluster a data set dispersed among a large number of nodes in a distributed environment using hierarchical and grid-based clustering techniques, and it applies even when nodes are fully asynchronous, decentralized, and subject to churn. The general design principles employed in the proposed algorithm also allow customization for other classes of clustering. It is fully capable of clustering dynamic and distributed data sets; using the algorithm, every node can maintain summarized views of the dataset. The main aim of the proposed system is to customize HGD Cluster to execute the hierarchical and grid-based clustering methods on these summarized views. Coping with dynamic data is made possible by gradually adapting the clustering model.
A Survey On Decision Tree Learning Algorithms for Knowledge Discovery IJERA Editor
The immense volumes of data are populated into repositories from various applications. Data mining techniques are very helpful for finding desired information and knowledge in large datasets. Classification is one of the knowledge discovery techniques, and within classification, decision trees are very popular in the research community due to their simplicity and easy comprehensibility. This paper presents an updated review of recent developments in the field of decision trees.
Implementation of Prototype Based Credal Classification approach For Enhanced...IRJET Journal
The document discusses a prototype-based credal classification approach for classifying patterns with incomplete data. It aims to provide accurate and efficient classification results for patterns with missing attribute values. The approach uses hierarchical clustering to identify class prototypes from the training data. It then applies an evidential reasoning technique called credal classification to determine the most likely class for each pattern after imputing missing values. Experimental results show the proposed approach performs better in terms of time and memory efficiency compared to other classification methods for incomplete patterns.
Engineering Research Publication
Best International Journals, High Impact Journals,
International Journal of Engineering & Technical Research
ISSN : 2321-0869 (O) 2454-4698 (P)
www.erpublication.org
Data warehouses are structures with large amounts of data collected from heterogeneous sources to be used in a decision support system. Data warehouse analysis identifies hidden, initially unexpected patterns, and this analysis requires great memory and computation cost. Data reduction methods have been proposed to make this analysis easier. In this paper, we present a hybrid approach based on Genetic Algorithms (GA) as evolutionary algorithms and Multiple Correspondence Analysis (MCA) as a factor analysis method to conduct this reduction. Our approach identifies a reduced subset of dimensions p' from the initial set p, where p' < p, in order to find the fact profile closest to a reference. GAs identify the candidate subsets, and the chi-squared (Khi²) formula of the MCA evaluates the quality of each subset. The study is based on a distance measurement between the reference and n fact profiles extracted from the warehouse.
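The combination of a GA over dimension subsets with a chi-squared-style quality score can be sketched as follows. The distance formula, penalty weight, and crossover/mutation scheme here are generic assumptions, not the paper's exact MCA formulation:

```python
import random

def chi2_distance(p, q, mask):
    """Chi-squared-style distance between two profiles, restricted to the
    dimensions selected by the 0/1 mask (a stand-in for the MCA Khi² score)."""
    return sum((p[d] - q[d]) ** 2 / (p[d] + q[d] + 1e-12)
               for d in range(len(p)) if mask[d])

def ga_reduce(reference, facts, n_dims, generations=40, pop_size=20):
    """Evolve bit masks selecting a reduced dimension subset p' whose fact
    profiles stay close to the reference."""
    random.seed(1)

    def score(mask):
        if not any(mask):
            return float("inf")        # an empty subset is invalid
        dist = sum(chi2_distance(reference, f, mask) for f in facts)
        return dist + 0.1 * sum(mask)  # mild pressure toward fewer dimensions

    pop = [[random.randint(0, 1) for _ in range(n_dims)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score)
        parents = pop[: pop_size // 2]            # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_dims)      # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(n_dims)] ^= 1   # single-bit mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=score)

# Dimensions 0 and 1 track the reference closely; dimension 2 is noisy.
reference = [0.2, 0.3, 0.5]
facts = [[0.2, 0.31, 0.9], [0.21, 0.29, 0.1]]
best_mask = ga_reduce(reference, facts, n_dims=3)
```

The per-dimension penalty is what drives p' below p: a dimension survives only when keeping it reduces the profile distance by more than the penalty it costs.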
This document discusses using fuzzy logic to optimize the placement of Thyristor Controlled Series Capacitors (TCSC) devices for congestion management in deregulated electricity markets. TCSC devices can control power flows and enhance transmission capacity. The document models TCSC devices and transmission lines. It develops fuzzy logic rules based on bus voltage and power loss to determine the optimal bus to place TCSC devices. The method was tested on the IEEE 14-bus system to demonstrate its effectiveness for static congestion management.
Research on Automobile Exterior Color and Interior Color MatchingIJERA Editor
This document summarizes research on matching exterior and interior colors for automobiles. It conducted surveys to determine color preferences of women in their 20s. Based on the results, it created 27 color combinations for the exterior, seats, and front panel. It analyzed the combinations using multiple regression and identified those with the highest ratings. It found the combination of red purple exterior, yellow red seats, and purple front panel had the highest average rating. It also identified other combinations with similar colors that received high ratings. Finally, it analyzed relationships between color combinations and different fashion styles using multiple regression.
EXTRACTING USEFUL RULES THROUGH IMPROVED DECISION TREE INDUCTION USING INFORM...ijistjournal
Classification is a widely used technique in the data mining domain, where scalability and efficiency are immediate problems for classification algorithms on large databases. We suggest improvements to the existing C4.5 decision tree algorithm. In this paper, attribute-oriented induction (AOI) and relevance analysis are combined with concept hierarchy knowledge and the HeightBalancePriority algorithm to construct a decision tree with multi-level mining. Priorities are assigned to attributes by evaluating information entropy at different levels of abstraction while building the decision tree with the HeightBalancePriority algorithm. Modified DMQL queries are used to understand and explore the shortcomings of the decision trees generated by the C4.5 classifier on an education dataset, and the results are compared with the proposed approach.
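The entropy evaluation that drives attribute priorities can be illustrated with the standard information-gain computation. This is the generic ID3/C4.5 criterion on a made-up toy table; the HeightBalancePriority specifics are not reproduced:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on one attribute index."""
    n = len(rows)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

# Toy dataset: (outlook, windy) -> play. Outlook perfectly predicts the
# label, windy carries no information, so attribute 0 gets top priority.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["no", "no", "yes", "yes"]
gains = [information_gain(rows, labels, a) for a in range(2)]
```

Ranking attributes by these gains at each level of abstraction is the kind of priority assignment the abstract describes.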
An Heterogeneous Population-Based Genetic Algorithm for Data Clusteringijeei-iaes
As a primary data mining method for knowledge discovery, clustering is a technique for partitioning a dataset into groups of similar objects. The most popular data clustering method, k-means, suffers from the drawback of requiring the number of clusters and their initial centers to be provided by the user. In the literature, several methods have been proposed, in the form of k-means variants, genetic algorithms, or combinations of them, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering. In this paper, we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm that operates on varied individual structures and uses a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
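The idea of evolving the number of clusters together with the centers can be sketched with variable-length chromosomes. The fitness penalty, the slice-exchange crossover, and all constants below are illustrative assumptions; the paper's new crossover operator is not reproduced:

```python
import random

def sse_penalized(centers, points):
    """Within-cluster squared error plus a per-cluster penalty, so that
    chromosomes with many centers are not trivially favored."""
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return sse + 2.0 * len(centers)

def ga_cluster(points, generations=60, pop_size=20, max_k=6, seed=0):
    """Toy GA over variable-length chromosomes (each a list of 1-D centers);
    the number of clusters and the centers evolve together."""
    random.seed(seed)
    lo, hi = min(points), max(points)

    def score(c):
        return sse_penalized(c, points)

    pop = [[random.uniform(lo, hi) for _ in range(random.randint(1, max_k))]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score)
        parents = pop[: pop_size // 2]            # elitism: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            # Crossover: splice a prefix of one parent onto a suffix of the
            # other, so the child's length (its k) can differ from both.
            child = (a[: random.randint(1, len(a))] +
                     b[random.randint(0, len(b) - 1):])[:max_k]
            child[random.randrange(len(child))] += random.gauss(0.0, 0.05 * (hi - lo))
            children.append(child)
        pop = parents + children
    return min(pop, key=score)

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three obvious groups
best = ga_cluster(points)
```

Because chromosome length is part of the genotype, no user-supplied k is needed: the penalty term lets the search trade cluster count against fit quality.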
The Internet has become the most popular surfing environment, which increases the size of service-oriented data. As the data size grows, finding and retrieving the most similar data from a large volume of data becomes more difficult. This problem is addressed by various research methods that attempt to cluster large volumes of data. The existing Clustering-based Collaborative Filtering approach (ClubCF) clusters similar data together so that retrieval time cost can be reduced considerably. However, existing methods cannot find similar reviews accurately, which must be addressed for an efficient and accurate recommendation system. The proposed research method ensures this by introducing a novel technique called Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together, so that the recommendation process becomes easier. To handle large volumes of data, this work adopts the MapReduce framework, which divides the data into subsets assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, movie recommendations are made accurately using the logistic regression method. The overall evaluation of the proposed method is done in Hadoop, demonstrating that the proposed technique provides better outcomes than the existing techniques.
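The collaborative-filtering step in a pipeline like the one above can be sketched with plain user-based filtering: rate an unseen movie for a user from the ratings of similar users. This is a stand-in for MoCFCR's clustered, regression-backed pipeline; the similarity measure and data are illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(u[i] ** 2 for i in common)) *
           sqrt(sum(v[i] ** 2 for i in common)))
    return num / den

def predict(ratings, user, item):
    """Similarity-weighted rating prediction from other users' ratings."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else 0.0

ratings = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 5, "m2": 3, "m3": 4},  # tastes like u1
    "u3": {"m1": 1, "m2": 5, "m3": 2},  # tastes unlike u1
}
```

In the full pipeline, clustering reviewers first (k-means) restricts this loop to one cluster, which is where the retrieval-time saving comes from.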
The development of data mining is inseparable from recent developments in information technology that enable the accumulation of large amounts of data. For example, a shopping mall records every sales transaction using point-of-sale (POS) systems. The database from these sales can reach a large storage capacity, with more added each day, especially when the shopping center develops into a nationwide network. The growth of the Internet has also contributed substantially to this accumulation of data. But the rapid growth of data accumulation has created conditions often referred to as "data rich but information poor", because the collected data cannot be used optimally for useful applications; not infrequently, such a data set is simply left as a "data grave". Several techniques are used in data mining, including association, classification, and clustering. In this paper, the author compares the performance of two classification methods: naïve Bayes and the C4.5 algorithm.
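The naïve Bayes side of this comparison can be shown in a few lines: class priors plus smoothed per-attribute likelihoods over categorical data. The smoothing scheme and toy table are illustrative assumptions, not the paper's experimental setup:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Fit a categorical naive Bayes model: class priors plus per-attribute
    value counts."""
    priors = Counter(labels)
    counts = defaultdict(Counter)          # (attribute index, class) -> counts
    for row, y in zip(rows, labels):
        for a, v in enumerate(row):
            counts[(a, y)][v] += 1
    return priors, counts

def classify(model, row):
    """Pick the class maximizing prior * product of smoothed likelihoods."""
    priors, counts = model
    total = sum(priors.values())
    best_class, best_p = None, -1.0
    for y, n in priors.items():
        p = n / total
        for a, v in enumerate(row):
            c = counts[(a, y)]
            # Laplace-style smoothing so unseen values get nonzero mass.
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
        if p > best_p:
            best_class, best_p = y, p
    return best_class

# Toy weather table: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
```

C4.5, the other side of the comparison, would instead build a decision tree by recursively splitting on the attribute with the best gain ratio.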
A Formal Machine Learning or Multi Objective Decision Making System for Deter...Editor IJCATR
Decision-making typically needs mechanisms to compromise among opposing criteria. When multiple objectives are involved in machine learning, a vital step is to relate the weights of the individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has often been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have some obvious limitations: when the determination of weights is treated as a single optimization problem, a result based on such an optimization is limited, and it can even be unreliable when knowledge about the multiple objectives is incomplete, for example because of poor data. The constraints on weights are also discussed. Variable weights are natural in decision-making processes, so we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making and machine learning are analyzed, and the weights are then determined with the help of a standard neural network.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
This document discusses distance similarity measures that can be used for data mining classification and clustering techniques. It proposes a novel distance similarity measure called "Supervised & Unsupervised learning" that uses Euclidean distance similarity to partition training data into clusters. It then builds decision trees on each cluster to improve classification performance. The document also discusses using these measures for other applications like image processing, where k-means clustering can be used to segment images into clusters of similar pixel intensities. In conclusion, it states these similarity measures can help analyze complex datasets for business analysis purposes.
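The Euclidean-distance partitioning step described here (and its k-means use for segmenting pixel intensities) can be sketched on 1-D data. The initialization scheme and data are illustrative assumptions:

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means with Euclidean (absolute) distance on 1-D values; real
    pixel data would be vectors, but the partition/update loop is the same."""
    pts = sorted(points)
    centers = [pts[i * len(pts) // k] for i in range(k)]  # spread-out init
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated intensity groups, as in a crude image segmentation.
pixels = [8, 9, 10, 11, 12, 198, 199, 200, 201, 202]
centers = kmeans_1d(pixels, k=2)
```

Building a decision tree per resulting cluster, as the document proposes, would then train one classifier on each partition instead of one on the whole training set.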
The document discusses a link mining methodology adapted from the CRISP-DM process to incorporate anomaly detection using mutual information. It applies this methodology in a case study of co-citation data. The methodology involves data description, preprocessing, transformation, exploration, modeling, and evaluation. Hierarchical clustering identified 5 clusters, with cluster 1 showing strong links and cluster 5 weak links. Mutual information validated the results, showing cluster 5 had the lowest mutual information, indicating independent variables. The case study demonstrated the approach can interpret anomalies semantically and be used with real-world data volumes and inconsistencies.
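The mutual-information validation used above rests on a standard computation: MI near zero indicates independent variables, which is how the weakly linked cluster is flagged. A minimal empirical version over discrete data:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete variables;
    a value near zero indicates independence."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

For perfectly co-varying binary sequences this returns 1 bit; for sequences where every (x, y) pair is equally likely it returns 0, matching the cluster-5 interpretation in the case study.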
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Predictionijtsrd
Data mining techniques play an important role in data analysis. In this research, a decision tree algorithm is used to construct a classification model that can predict the performance of students, particularly in engineering branches. A number of factors may affect student performance; educational data mining can relate these factors to student grades, and classification algorithms can predict them. In this paper, we use educational data mining to predict students' final grades based on their performance, proposing student data classification with the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay | San San Nwe, "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
Scientific principles and concepts expressed through laws, equations, and formulas are the bedrock for predicting the designed-in functionality performance of any engineering creation. However, there are no equivalents when in-service functionability predictions have to be made. Hence, Mirce Mechanics has been created at the MIRCE Akademy to fulfil that role. The main purpose of this paper is to present the development and application of the Mirce Functionability Equation, which is the bedrock for predicting the in-service motion of the functionability performance of maintainable systems.
Degradation and Microbiological Validation of Meropenem Antibiotic in Aqueous...IJERA Editor
Aqueous UV, UV/H2O2, UV/TiO2 and UV/TiO2/H2O2 mediated degradation/oxidation of the carbapenem antibiotic meropenem (MERO) was experimentally studied. Degussa P-25 titanium dioxide was used as the photocatalyst, and a UV-light source was used for activation of the TiO2. The nanosized titanium dioxide was immobilized on a glass support to improve the efficiency and economics of the photocatalytic processes. The immobilized titanium dioxide film was characterized using X-ray diffraction (XRD) and scanning electron microscopy (SEM). The antibiotic degradation study was conducted in a dedicated batch photocatalytic reactor. A MERO standard solution was used at 500 μg/ml concentration, and up to 99% of the antibiotic was degraded. A microbiological assay showed that the loss of antibacterial activity is directly proportional to the time of UV irradiation. The experiments also showed that UV irradiation by itself degrades the antibiotic, but much more slowly than the photocatalytic process. The experimental study showed that the UV/TiO2/H2O2 system is effective and efficient for the treatment of antibiotic waste.
The document presents a method for solving nonlinear time delay control systems using Fourier series. The key steps are:
1) Time functions in the system are expanded as truncated Fourier series with unknown coefficients.
2) Operational matrices of integration, delay, and product are presented and used to evaluate the Fourier coefficients for the solution.
3) This reduces solving the time-delay control system to solving a set of algebraic equations for the Fourier coefficients.
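The steps above can be sketched generically in the notation standard for operational-matrix methods (the symbols here are an assumption, not necessarily the paper's): a time function is approximated by a truncated Fourier basis $\phi(t)$ with coefficient vector $c$, and integration and delay become matrix multiplications on the coefficient space:

```latex
x(t) \approx c^{\mathsf{T}}\phi(t), \qquad
\int_0^t \phi(\tau)\,d\tau \approx P\,\phi(t), \qquad
\phi(t-\tau_d) \approx D\,\phi(t)
```

Here $P$ and $D$ are the operational matrices of integration and delay. Substituting these approximations into the delay differential equations eliminates all calculus operations, leaving algebraic relations among the unknown coefficient vectors, which is step 3.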
This document presents a study on the design and structural analysis of high speed helical gears using ANSYS. Helical gears with different numbers of teeth were modeled in Pro/Engineer and imported into ANSYS for finite element analysis. Bending stresses were calculated using the modified Lewis beam method and AGMA equations, and compared to ANSYS results which had around 6% error. Contact stresses calculated using AGMA equations were also compared to ANSYS results, with around 1% error. The study found that gear design with minimum teeth and maximum helix angle is preferred based on material strength considerations.
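The bending-stress comparison described above rests on the classical Lewis beam model, which in standard textbook notation (generic symbols, not necessarily the paper's) estimates the tooth-root bending stress from the tangential load $W_t$, face width $F$, module $m$, and Lewis form factor $Y$:

```latex
\sigma_b = \frac{W_t}{F\, m\, Y}
```

The AGMA equations refine this baseline with additional application, load-distribution, and geometry factors; the study reports that both the bending and contact estimates agree with the ANSYS finite-element results to within a few percent.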
Efficient Facial Expression and Face Recognition using Ranking MethodIJERA Editor
Expression detection is useful as a non-invasive method of lie detection and behaviour prediction; however, these facial expressions may be difficult to detect with the untrained eye. In this paper we implement facial expression recognition techniques using a ranking method. The human face plays an important role in our social interaction, conveying people's identity. Using the human face as a key to security, biometric face recognition technology has received significant attention in the past several years. Experiments are performed using a standard database covering expressions such as surprise, sadness, and happiness. The three universally accepted principal emotions to be recognized are surprise, sadness, and happiness, along with neutral.
Rain Water Harvesting and Impact of Microbial Pollutants on Ground Water Rese...IJERA Editor
Developing countries are under heavy stress due to the continuous depletion of groundwater reserves. Urban areas are developing and growing very fast due to population growth, increases in commercial and trade activities, national and international tourism development, and the local migration of the rural population seeking better job opportunities. Civic amenities are also a reason for the population explosion in urban areas, and thus there is an increase in the demand for basic needs like water, shelter, and power. The overall consumption of water in urban and rural areas has increased many-fold in the recent past, depleting subsurface water reserves because the natural recharge of the reservoirs falls short of the corresponding water demand. Groundwater is an integral part of the environment, and there has been a lack of adequate attention to water conservation, water use and reuse, groundwater recharge, and ecosystem sustainability. To meet the challenge of groundwater shortage and the lowering water table, efforts are being made to recharge the aquifer system by Rain Water Harvesting (R.W.H.). This noble act needs serious thought and follow-up so that the recharged groundwater remains free from pollutants like pesticides and bacteria, and from seepage that could infect and pollute the existing pure sources of potable water. A study has therefore been undertaken to assess possible bacterial intrusion through rainwater penetration into the deeper water-bearing aquifers.
Shortest Tree Routing With Security In Wireless Sensor NetworksIJERA Editor
We propose shortcut tree routing (STR) to resolve the main causes of overall network performance degradation in ZTR: the detour path problem and the traffic concentration problem. STR minimizes the routing path from source to destination. We prove that the 1-hop neighbor nodes used by STR improve routing path efficiency and alleviate the traffic load concentrated on tree links in ZTR, and we analyze the performance of ZTR, STR, and AODV under varying network conditions such as network density, ZigBee network constraints, traffic types, and network traffic. To protect against modification, data packets are also encrypted during transmission using the RC4 algorithm, so that intermediate nodes are not able to view the data in transit.
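Since the abstract names RC4 for packet encryption, a minimal stand-alone Python sketch of the RC4 stream cipher (key scheduling plus keystream XOR) may clarify that step. This is an illustrative sketch only, and RC4 is considered cryptographically weak for new designs:

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt/decrypt data with RC4 (same operation in both directions)."""
    # Key-scheduling algorithm (KSA): permute state array S using the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA): XOR keystream with data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)
```

Because RC4 XORs a keystream with the data, applying the same function twice with the same key recovers the plaintext, which is what lets the destination decrypt while intermediate nodes only forward opaque bytes.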
Growth and Characterization of Barium doped Potassium Hydrogen Phthalate Sing...IJERA Editor
Nonlinear optical materials have acquired new significance with the advent of a large number of devices
utilizing solid-state laser sources. Potassium Hydrogen Phthalate, a nonlinear optical material with
superior nonlinear optical properties, has been exploited for a variety of applications. In the present work, KHP
single crystals were grown by the slow evaporation technique with barium metal ions as a dopant. The grown
crystals were subjected to powder XRD analysis, and the results show that the Ba2+ ions do not alter the crystal
structure but enter into the crystal lattice of pure KHP. The optical transparency of the grown crystal was
studied by UV-Visible spectroscopy, the molecular structure was confirmed by FTIR analysis, and its thermal
stability by TG/DTA analysis. The improved SHG efficiency of the barium-doped Potassium Hydrogen Phthalate
crystal could enhance the nonlinearity behaviour. In addition, electrical parameters such as the dielectric
constant were studied in detail.
This document summarizes a study on using ferric oxide (Fe2O3) as an adsorbent to remove color from dye wastewater. Batch experiments were conducted with synthetic wastewater containing anthraquinone blue dye. The effects of pH, adsorbent dosage, dye concentration, and adsorption isotherms were evaluated. Maximum dye removal efficiency of 94% was achieved at pH 2 with 0.3 g of Fe2O3 adsorbent dosage and an initial dye concentration of 125 ppm. Equilibrium data fitted well to Freundlich, Langmuir, and Temkin isotherm models, indicating favorable adsorption of dye onto Fe2O3.
This document provides an overview of UNICEF Ireland and their 2009 fundraising campaign. It outlines UNICEF's history and mission to protect children's rights and help them survive and thrive. The 2009 campaign aimed to increase awareness of UNICEF's work, engage existing and new donors, and raise funds to support programs in over 150 countries. Direct mail, television, radio, and community engagement were used, and results included increased average donations and new donors added to the database.
A Survey Analysis on CMOS Integrated Circuits with Clock-Gated Logic StructureIJERA Editor
Various circuit design techniques have been presented to improve the noise tolerance of the proposed CGS logic families. Noise in deep submicron technology limits the reliability and performance of ICs, and continuous scaling of technology toward the nanometer range significantly increases leakage current levels and the effect of noise. The ANTE (Average Noise Threshold Energy) metric is used to analyze the noise tolerance of the proposed CGS. 2-input NAND and NOR gates were designed with the proposed technique. Simulation results for a 2-input NAND gate with clock-gated logic show that the proposed noise-tolerant circuit achieves a 1.79X ANTE improvement along with a reduction in leakage power. This research can be further extended to performance optimization in terms of power, speed, area, and noise immunity.
The document presents results from an experimental study on the effects of elevated temperature and aggressive chemical environments on the compressive strength of M30 grade concrete. Concrete cubes were tested at temperatures from 200°C to 1000°C and exposed to sodium sulfate, sulfuric acid, and sodium chloride solutions for 30 and 60 days. Testing showed that compressive strength decreased with increasing temperature and exposure time, with rapid decreases above 400°C. Sulfuric acid exposure led to gypsum crystal deposition on cube surfaces.
Clustering heterogeneous categorical data using enhanced mini batch K-means ...IJECEIAES
This document presents a proposed framework called MBKEM (Mini Batch K-means with Entropy Measure) for clustering heterogeneous categorical data. MBKEM uses an entropy-based distance measure within a mini-batch k-means algorithm. The framework is evaluated using secondary data from a public survey. Evaluation metrics show MBKEM outperforms other clustering algorithms, with high accuracy, V-measure, adjusted Rand index, and Fowlkes-Mallows index. MBKEM also has a faster average cluster-generation time than other methods. The proposed framework provides an improved solution for clustering heterogeneous categorical data.
This document summarizes an article that proposes a novel cost-free learning (CFL) approach called ABC-SVM to address the class imbalance problem. The approach aims to maximize the normalized mutual information of the predicted and actual classes to balance errors and rejects without requiring cost information. It optimizes misclassification costs, SVM parameters, and feature selection simultaneously using an artificial bee colony algorithm. Experimental results on several datasets show the method performs effectively compared to sampling techniques for class imbalance.
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
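As a concrete reference point for the surveyed algorithms, here is a minimal K-modes sketch in Python. It is illustrative only: real implementations add random restarts and smarter initialization, and the deterministic seeding used here (first k distinct rows) is an assumption made for reproducibility:

```python
from collections import Counter

def kmodes(rows, k, iters=10):
    """Minimal K-modes sketch: cluster categorical rows around modes
    using Hamming (mismatch-count) distance."""
    # Deterministic seeding for the sketch: first k distinct rows.
    modes = []
    for row in rows:
        if row not in modes:
            modes.append(row)
        if len(modes) == k:
            break
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in rows:
            # Assign each row to the mode it mismatches least.
            dists = [sum(a != b for a, b in zip(row, m)) for m in modes]
            clusters[dists.index(min(dists))].append(row)
        # New mode = most frequent category per attribute within the cluster.
        for c, members in enumerate(clusters):
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return modes, clusters
```

The key difference from k-means is that centers are modes (per-attribute majority categories) rather than means, which is what makes the method applicable to categorical data.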
Curse of Dimensionality in Paradoxical High Dimensional Clinical Datasets – A...ijcnes
The document summarizes techniques for representing paradoxical high-dimensional clinical datasets. It discusses how hierarchical clustering with a dendrogram structure can be used to group such datasets into clusters. Different distance measures that can be used for clustering are defined, including minimum, maximum, mean, and average distance. It is proposed to use hierarchical divisive clustering initially to determine the number of clusters, followed by iterative relocation to improve the clustering. Dimensionality reduction techniques like principal component analysis are also suggested to reduce the effects of the curse of dimensionality. The paper surveys various clustering approaches that have been applied to medical data mining and handling of high-dimensional data.
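The distance measures listed above correspond to the classic linkage criteria used in hierarchical clustering. A small Python sketch (with hypothetical helper names chosen for illustration) shows how each compares two clusters of points:

```python
import math

def single_link(A, B):
    """Minimum distance between any pair of points across the clusters."""
    return min(math.dist(a, b) for a in A for b in B)

def complete_link(A, B):
    """Maximum distance between any pair of points across the clusters."""
    return max(math.dist(a, b) for a in A for b in B)

def average_link(A, B):
    """Mean of all pairwise distances across the clusters."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B):
    """Distance between the clusters' centroids (mean points)."""
    ca = tuple(sum(coord) / len(A) for coord in zip(*A))
    cb = tuple(sum(coord) / len(B) for coord in zip(*B))
    return math.dist(ca, cb)
```

By construction the single-link value never exceeds the average-link value, which never exceeds the complete-link value, which is why the choice of criterion controls how "loose" or "tight" the resulting dendrogram merges are.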
Indexing based Genetic Programming Approach to Record Deduplicationidescitation
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data de-duplication is a process in which data are cleaned of duplicate records caused by misspellings, field swaps, or other mistakes and data inconsistencies. This process requires that we identify objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate the maximum number of duplications. GP with indexing is an optimization technique that helps find the maximum number of duplicates in the database. We used a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Many industries and systems depend on the accuracy and reliability of databases to carry out operations; therefore, the quality of the information stored in databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data.
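The paper's GP-evolved deduplication function is not reproduced here, but a simplified stand-in in Python, using the standard library's SequenceMatcher as the per-field similarity (an assumption of this sketch, not the paper's function), illustrates the idea of flagging replica pairs:

```python
from difflib import SequenceMatcher
from itertools import combinations

def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1, rec2, threshold=0.85):
    """Average the per-field similarities; call the pair a replica if the
    average clears the threshold."""
    scores = [field_similarity(f1, f2) for f1, f2 in zip(rec1, rec2)]
    return sum(scores) / len(scores) >= threshold

def find_duplicates(records, threshold=0.85):
    """Return index pairs of records judged to be replicas."""
    return [(i, j)
            for (i, r1), (j, r2) in combinations(enumerate(records), 2)
            if is_duplicate(r1, r2, threshold)]
```

In the paper's setting, GP would evolve the combination of field similarities instead of the fixed average used here, and indexing (blocking) would restrict which pairs are compared rather than testing all combinations.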
Privacy preservation techniques in data miningeSAT Journals
Abstract In this paper different privacy preservation techniques are compared. Classification is the most commonly applied data mining technique; it employs a set of pre-classified examples to develop a model that can classify the population of records at large. Fraud detection and credit risk applications are particularly well suited to this type of analysis. This approach frequently employs decision tree or neural network-based classification algorithms. The data classification process involves learning and classification. In learning, the training data are analyzed by a classification algorithm. In classification, test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be applied to new data tuples. For a fraud detection application, this would include complete records of both fraudulent and valid activities determined on a record-by-record basis. The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination, and then encodes these parameters into a model called a classifier. Index Terms: Data Mining, Privacy Preservation, Clustering, Classification Techniques, Naive Bayes.
This document discusses various privacy preservation techniques in data mining. It summarizes classification, clustering, and association rule learning as common privacy preservation approaches. For classification, it describes decision trees, k-nearest neighbors, artificial neural networks, support vector machines, and naive Bayes models. It provides advantages and disadvantages of these techniques. The document concludes that privacy preservation techniques have emerged to allow for efficient and effective data mining while protecting sensitive data.
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. Three main methods are used to discover patterns in data: KDD, SEMMA, and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which can be relevant to the study of link mining. In this study CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. This approach is implemented through a case study of real-world data (co-citation data). The case study uses mutual information to interpret the semantics of anomalies identified in the co-citation dataset, which can provide valuable insights in determining the nature of a given link and potentially identifying important future link relationships.
Student Performance Evaluation in Education Sector Using Prediction and Clust...IJSRD
Data mining comprises the crucial steps to find previously unknown information in large relational databases. Various techniques and algorithms are used in data mining, such as association rules, clustering, classification, and prediction; each technique has particular characteristics and behaviour. This paper focuses primarily on clustering and prediction techniques. Nowadays the amount of data stored in educational databases is increasing rapidly. A database for a particular set of students was collected, clustering and prediction were performed in a detailed manner, and the results were produced. The K-means clustering algorithm is used here to find the nearest possible cluster of a similar group. Performance in higher education is a turning point for all students in India. Academic performance is influenced by various factors; therefore, to identify the difference between high learners and slow learners, it is important to develop a predictive data mining model of student performance.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
Data mining is utilized to manage the huge measure of information stored in data warehouses and databases, to discover required information and knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, clustering, and so on, and the field has been a point of attention for many years. A well-known data mining strategy is clustering of the dataset; it is among the most effective data mining methods. It groups the dataset into a number of clusters based on certain predefined guidelines and is relied upon to discover the connection between the distinct characteristics of the data.
In the k-means clustering algorithm, features are selected on the basis of their relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed when clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy and redundancy of dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach has been proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
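For reference, the baseline that the enhanced distance formula and similarity check build on can be sketched as plain Euclidean-distance k-means. The deterministic seeding below is an assumption of this sketch (to keep it reproducible), not part of the cited work:

```python
import math

def kmeans(points, k, iters=20):
    """Baseline k-means sketch with plain Euclidean distance."""
    # Deterministic seeding for the sketch: first/last points, else first k.
    centroids = [points[0], points[-1]] if k == 2 else list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid.
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters
```

The enhancements described above would replace `math.dist` with the modified distance formula and add a similarity test before a point is admitted to a cluster.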
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
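A minimal Python sketch of the proportional, cluster-wise under-sampling step follows. The function and parameter names are assumptions for illustration, and the per-sample cluster labels are taken as precomputed (e.g. by k-means over the majority class):

```python
import random
from collections import defaultdict

def cluster_undersample(majority, labels, n_keep, seed=0):
    """Keep roughly n_keep majority-class samples, drawing from each
    cluster in proportion to its size so no region is lost entirely."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample, cluster in zip(majority, labels):
        groups[cluster].append(sample)
    kept = []
    for members in groups.values():
        # Quota proportional to cluster size; at least one representative.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept
```

Sampling per cluster rather than from the majority class as a whole is what preserves representative majority-class structure while still shrinking it toward the minority-class size.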
Multilevel techniques for the clustering problemcsandit
Data Mining is concerned with the discovery of interesting patterns and knowledge in data
repositories. Cluster Analysis which belongs to the core methods of data mining is the process
of discovering homogeneous groups called clusters. Given a data-set and some measure of
similarity between data objects, the goal in most clustering algorithms is maximizing both the
homogeneity within each cluster and the heterogeneity between different clusters. In this work,
two multilevel algorithms for the clustering problem are introduced. The multilevel
paradigm suggests looking at the clustering problem as a hierarchical optimization process
going through different levels evolving from a coarse grain to fine grain strategy. The clustering
problem is solved by first reducing the problem level by level to a coarser problem where an
initial clustering is computed. The clustering of the coarser problem is mapped back level-by-level
to obtain a better clustering of the original problem by refining the intermediate
clusterings obtained at various levels. A benchmark using a number of data sets collected from a
variety of domains is used to compare the effectiveness of the hierarchical approach against its
single-level counterpart.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
Data analysis plays a prominent role in interpreting various phenomena, and data mining is the process of hypothesizing useful knowledge from extensive data. Based on classical statistical prototypes, the data can be exploited beyond mere storage and management. Cluster analysis, a primary form of investigation requiring little or no prior knowledge, spans research and development across a wide variety of communities. Cluster ensembles are a melange of individual solutions obtained from different clusterings, combined to produce a final high-quality clustering, which is required in wider applications. The method arises from the need for increased robustness, scalability, and accuracy. This paper gives a brief overview of the generation methods and consensus functions included in cluster ensembles, analyzing the various techniques and cluster ensemble methods.
An approach for improved students’ performance prediction using homogeneous ...IJECEIAES
Web-based learning technologies of educational institutions store a massive amount of interaction data which can be helpful to predict students’ performance through the aid of machine learning algorithms. With this, various researchers focused on studying ensemble learning methods as it is known to improve the predictive accuracy of traditional classification algorithms. This study proposed an approach for enhancing the performance prediction of different single classification algorithms by using them as base classifiers of homogeneous ensembles (bagging and boosting) and heterogeneous ensembles (voting and stacking). The model utilized various single classifiers such as multilayer perceptron or neural networks (NN), random forest (RF), naïve Bayes (NB), J48, JRip, OneR, logistic regression (LR), k-nearest neighbor (KNN), and support vector machine (SVM) to determine the base classifiers of the ensembles. In addition, the study made use of the University of California Irvine (UCI) open-access student dataset to predict students’ performance. The comparative analysis of the model’s accuracy showed that the best-performing single classifier’s accuracy increased further from 93.10% to 93.68% when used as a base classifier of a voting ensemble method. Moreover, results in this study showed that voting heterogeneous ensemble performed slightly better than bagging and boosting homogeneous ensemble methods.
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUEIJDKP
The document discusses a proposed students' performance prediction system using multi-agent data mining techniques. It aims to predict student performance with high accuracy and help low-performing students. The system uses ensemble classifiers like Adaboost.M1 and LogitBoost and compares their prediction accuracy to the single classifier C4.5 decision tree. Experimental results showed SAMME boosting improved prediction accuracy over C4.5 and LogitBoost.
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four clustering algorithms: K-Means, K-Medoids, Farthest First Clustering (hierarchical algorithms), and DBSCAN (non-hierarchical algorithm). It describes the methodology of each algorithm and provides pseudocode. It also describes the datasets used to evaluate the performance of the algorithms and the evaluation metrics. The goal is to compare the performance of the clustering methods on different datasets.
This document compares two approaches for handling incomplete data and generating decision rules: 1) Rough set theory, which fills missing values and performs attribute reduction, and 2) Random tree classification in data mining, which ignores missing values. It uses a heart disease dataset with missing values to test the approaches in ROSE2 and WEKA. The results show that random tree classification ignoring missing values produces more accurate decision rules than rough set theory filling missing values.
The document proposes a hierarchical and grid-based clustering method called HGD Cluster for clustering distributed data across decentralized nodes. HGD Cluster uses hierarchical and grid-based clustering techniques to cluster datasets dispersed among large numbers of nodes in a distributed environment. It aims to efficiently perform dynamic distributed clustering and allow nodes to maintain summarized views of datasets when nodes are asynchronous and data is dynamic.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
1. Dr. C. Chandrasekar et al Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 4( Version 1), April 2014, pp.209-211
Clustering Large Databases Using GMM
Dr. C. Chandrasekar*, M. Srisankar**
* (Assistant Professor, Department of Computer Applications, Sree Narayana Guru College, Coimbatore – 105)
** (Research Scholar, Department of Computer Science, Bharathiar University, Coimbatore - 46)
ABSTRACT
Discovering such clusters of data is essential for illuminating the main links in categorical data regulatory networks. Previous clustering methods suffer from many problems, especially when grouping data with mixed attribute types. This work analyzes those existing methods and presents a new approach for clustering mixed data items. Many methods exist for clustering data of a single type, whereas only very few methods exist for clustering mixed data items, which motivates the need for a better clustering technique for the classification of mixed data. For exploratory data analysis, clustering with Gaussian mixture models is widely used.
Keywords - Clustering methods, High dimensional data, K-means clustering, GMM, Adult data set, Categorical data
I. INTRODUCTION
Data clustering is a basic data mining technique used in knowledge discovery. Clustering is one of the most widely used methods in data mining; its core is grouping the data according to similarity, based on some distance measure. The problem of clustering has become increasingly important in recent years. Data mining offers a suite of algorithms, each addressing a different task and in the process elucidating a unique facet of the data [1]. Of the many facets of data mining, we are particularly interested in clustering problems, i.e. the process of finding similarities in the data and then grouping similar data into identifiable clusters. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). Common applications of clustering include document grouping, scientific data analysis and customer/market segmentation. Usually, clustering involves partitioning a given data set of n points in m dimensions into k clusters, such that the data points within a cluster are highly similar to one another. The main problems in clustering techniques are: identifying a similarity measure for comparing different data items, finding suitable methods for identifying identical data in an unsupervised way, and deriving a description that can differentiate the data of a cluster efficiently. Clustering is a vital task of exploratory data mining and a common method for statistical data analysis, used in many areas such as machine learning, image analysis, information retrieval and pattern recognition.
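The grouping of n points in m dimensions into k clusters described above can be sketched with a minimal k-means loop. This is an illustrative sketch in Python, not the paper's implementation; the toy data, the deterministic initialization and the parameter names are ours:

```python
def kmeans(points, k, iters=20):
    """Group m-dimensional points into k clusters by alternating
    an assignment step and a center-update step."""
    # Simple deterministic init: spread initial centers across the data.
    if k > 1:
        centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    else:
        centers = [points[0]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centers, clusters

# Two well-separated 2-D groups; k-means should recover them.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centers, clusters = kmeans(data, k=2)
```

On this toy data the two recovered centers settle near (0.1, 0.1) and (5.1, 5.1), the means of the two groups.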
II. BACKGROUND STUDY
An efficient ensemble method for mixed numeric and categorical data was recommended by earlier researchers [4]. Most existing clustering algorithms focus on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points. However, much of the data in real databases is categorical, where attribute values cannot be meaningfully ordered as numerical values. Because of the differences between these kinds of attributes, attempts to define a unified distance function for mixed data [2] have not been successful. A previous method proposed a novel divide-and-conquer approach to overcome this issue. First, the original mixed dataset is divided into sub-datasets by attribute type; next, existing well-established clustering methods designed for each type of dataset are applied to produce corresponding clusters. Finally, the clustering results on the categorical and numeric sub-datasets are combined as a new categorical dataset, on which a categorical clustering method is used to produce the final result. The main contribution of that study is an algorithm for the mixed-attribute clustering problem into which existing clustering algorithms such as k-means [5], as shown in Fig. 1, can be effortlessly incorporated. The majority of existing clustering algorithms encounter serious scalability and/or accuracy problems when used on databases with a large number of records and/or attributes, and only a few methods can handle numeric, nominal, and mixed data.
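The divide-and-conquer idea above can be sketched in two small helpers: one that splits a mixed dataset into numeric and categorical sub-datasets, and one that treats the per-record cluster labels from the two sub-clusterings as a new categorical dataset. This is our own illustrative Python sketch of the workflow described in [4], with hypothetical record and key names; any numeric or categorical clusterer can supply the label lists:

```python
def split_mixed(records, numeric_keys, categorical_keys):
    """Divide a mixed dataset into a numeric sub-dataset and a
    categorical sub-dataset, keeping record order."""
    numeric = [{k: r[k] for k in numeric_keys} for r in records]
    categorical = [{k: r[k] for k in categorical_keys} for r in records]
    return numeric, categorical

def combine_as_categorical(numeric_labels, categorical_labels):
    """Pair the two per-record cluster labels; the pairs form a new
    categorical dataset for the final clustering pass."""
    return list(zip(numeric_labels, categorical_labels))

# Toy example: labels as produced by any numeric / categorical clusterer.
num_labels = [0, 0, 1, 1]
cat_labels = ["a", "b", "a", "a"]
combined = combine_as_categorical(num_labels, cat_labels)
```

The `combined` list of label pairs would then be fed to a categorical clustering method to produce the final clusters.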
Fig 1. Clustering high dimensional data using k-means
III. METHODOLOGY
Clustering in high-dimensional spaces is frequently challenging, as theoretical results [3] have questioned the meaning of the nearest neighbor in high-dimensional spaces. When the data items are categorical or mixed, similarity is difficult to measure with Euclidean distance. The enormous amounts of data collected from fields such as banking, the health sector, universities, colleges and schools, together with web-log data and biological sequence data, are largely categorical and need good clustering methods. It is highly complicated to group categorical data into meaningful categories: one needs a distance measure that captures sufficient data similarity, used in combination with an effective clustering algorithm, and for mixed numeric and categorical data only very few such methods are available. Many methods exist for clustering data of a single type, whereas only very few methods exist for clustering mixed data items, which leads to the need for a better clustering technique for the classification of mixed data. The clusters identified in the present study differ somewhat from those of earlier studies. Previous methods often cluster the adult dataset into groups of persons by age, salary, marital status and country. Even when adult dataset test performance was compared across the male and female categories, the categories obtained in the present study produced few groupings that would be considered degenerate. For exploratory data analysis, clustering with Gaussian mixture models is widely used.
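One common way to obtain a distance over mixed records, in the spirit of Huang's extensions to k-means for categorical values [5], combines squared Euclidean distance on the numeric attributes with a weighted mismatch count on the categorical ones. The sketch below is our illustration, not the paper's method; the weight `gamma` and the attribute names are hypothetical:

```python
def mixed_distance(x, y, numeric_keys, categorical_keys, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus a
    gamma-weighted count of mismatches on categorical attributes."""
    num = sum((x[k] - y[k]) ** 2 for k in numeric_keys)
    cat = sum(1 for k in categorical_keys if x[k] != y[k])
    return num + gamma * cat

a = {"age": 30, "salary": 40.0, "married": "yes", "country": "US"}
b = {"age": 32, "salary": 41.0, "married": "no",  "country": "US"}
d = mixed_distance(a, b, ["age", "salary"], ["married", "country"], gamma=2.0)
# numeric part (30-32)^2 + (40-41)^2 = 5; one categorical mismatch -> 5 + 2*1 = 7
```

The choice of `gamma` controls how strongly a categorical mismatch counts relative to numeric differences and must be tuned for the data at hand.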
Fig 2. GMM Clustering
IV. EXPERIMENTAL RESULT
The investigation of the proposed technique was carried out with the support of the Adult Data Set. The number of clusters produced by the proposed method is smaller than with the other approaches, and no outliers were found when using the proposed method. The proposed algorithm utilizes an active sampling mechanism and requires at most a single pass through the data. Many approaches have a reduced capability to generalize their findings to other instances. One possible merit identified is the operational definition of age, salary, marital status and country used in this study, as in Fig. 4. We examine the high quality of the clustering solutions found, their explanatory power, and the proposed model's good scalability. The GMM method has good accuracy, uses a single tunable parameter, and can function successfully with limited memory resources.
4.1 GMM Method
Gaussian Mixture Models [6] are among the most significant statistically grounded approaches to clustering. In this form of clustering it is assumed that each individual data point is produced by first choosing one of a set of multivariate Gaussians and then sampling from it. Learning such a model from data leads to an optimization problem, and the optimization method used to solve it is called Expectation Maximization (EM).
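The EM procedure alternates between computing each component's responsibility for each point (E-step) and re-estimating the mixture weights, means and variances from those responsibilities (M-step). Below is a minimal one-dimensional, two-component sketch of this loop in plain Python; it is our illustration of the generic EM recipe, not the paper's modified GMM, and the initialization scheme is our own choice:

```python
import math

def em_gmm_1d(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture by Expectation Maximization."""
    # Crude initialization: pick means from the lower and upper
    # quartiles of the sorted data (an arbitrary but deterministic choice).
    xs = sorted(data)
    mu = [xs[len(xs) // 4], xs[3 * len(xs) // 4]]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return pi, mu, var

# Two well-separated 1-D groups around 0 and 5.
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
pi, mu, var = em_gmm_1d(data)
```

On this toy data the fitted means settle near the two group means (about 0.01 and 5.01) and the mixture weights near 0.5 each; a full implementation would use multivariate Gaussians and a log-likelihood convergence check.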
4.2 Adult data set
The adult data set from the UCI repository contains massive data with both numeric and nominal values. There are a total of 18,322 records, split in a 2:1 ratio to form a train and a test set. The target class indicates a person's category by age, income and marital status. After converting the nominal attributes into binary values, the training data were clustered into 14 clusters. These clusters were then labeled with the predominant class. On the test data, k-means accuracy was 77.2%, whereas the Gaussian Mixture Model achieved an accuracy rate of 91%. The distance-based k-means algorithm produced clusters that did not separate the rare class well from the dominant class.
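The cluster-then-label evaluation used above (label each of the 14 clusters with its predominant class, then score the labels against the true classes) can be sketched as follows. This is a generic illustration in Python; the toy cluster ids and the ">50K"/"<=50K" class strings merely echo the Adult data set's income classes:

```python
from collections import Counter

def label_clusters(cluster_ids, true_classes):
    """Assign each cluster the predominant class among its records."""
    by_cluster = {}
    for cid, cls in zip(cluster_ids, true_classes):
        by_cluster.setdefault(cid, []).append(cls)
    return {cid: Counter(classes).most_common(1)[0][0]
            for cid, classes in by_cluster.items()}

def accuracy(cluster_ids, true_classes, labels):
    """Fraction of records whose cluster's label matches the true class."""
    hits = sum(labels[cid] == cls
               for cid, cls in zip(cluster_ids, true_classes))
    return hits / len(true_classes)

# Toy run: two clusters, each dominated by one class.
ids = [0, 0, 0, 1, 1]
cls = ["<=50K", "<=50K", ">50K", ">50K", ">50K"]
labels = label_clusters(ids, cls)   # cluster 0 -> "<=50K", cluster 1 -> ">50K"
acc = accuracy(ids, cls, labels)    # 4 of 5 records match -> 0.8
```

In the actual experiment the labels are learned on the training split and the accuracy is computed on the held-out test split.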
Fig 3. Scatter Plot of Multimodal Data
Fig 4. Plotting Multimodal Data with age, salary, married
V. CONCLUSION
This research focuses on an effective clustering method for mixed-category data. Various methods are available for clustering categorical data, but those techniques suffer from drawbacks. To overcome this issue, clustering of mixed data can be performed with many approaches, but these approaches consume more time for classification. To address this, a modified Gaussian Mixture Model technique is applied in this study to the adult dataset. The experiment is performed with the help of the Adult Data Set, and it can be observed that improved classification results are achieved by the proposed technique when compared to the previous experimental methods. The study suggests that the proposed approach can be used to cluster mixed data items with better classification accuracy.
REFERENCES
[1] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996.
[2] D. Hari Prasad and M. Punithavalli, "A review on data clustering algorithms for mixed data," Global J. Comput. Sci. Technol., 10: 43-48, 2010.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is nearest neighbor meaningful?," in Proc. of the Int. Conf. on Database Theory, pages 217-235, 1999.
[4] M.V.J. Reddy and B. Kavitha, "Efficient ensemble algorithm for mixed numeric and categorical data," in Proceedings of the IEEE International Conference on Computational Intelligence and Computing Research, Dec. 28-29, 2010, IEEE Xplore Press, Coimbatore, pp. 1-4. DOI: 10.1109/ICCIC.2010.5705738.
[5] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, 2, pp. 283-304, 1998.
[6] K. Zhang and J. T. Kwok, "Simplifying mixture models through function approximation," in B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1577-1584, MIT Press, Cambridge, MA, 2007.