This document provides a literature review of clustering techniques. It begins by defining clustering and describing the main categories of clustering methods: hierarchical, partitioning, density-based, grid-based, and model-based. It then summarizes example algorithms for each category in one or two sentences. For hierarchical methods, it discusses BIRCH, CURE, and CHAMELEON. For partitioning methods, it mentions k-means and k-medoids. For density-based methods, it lists DBSCAN, OPTICS, and DENCLUE. For grid-based methods, it lists CLIQUE, STING, MAFIA, WaveCluster, O-CLUSTER, and ASGC, among others.
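As a concrete illustration of the partitioning category mentioned above, here is a minimal k-means sketch (Lloyd's algorithm) in Python. The function and its naive deterministic initialization are illustrative choices of mine, not taken from any of the papers surveyed here; production code would typically use a library implementation with smarter seeding.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal Lloyd's algorithm: alternate assignment and mean update."""
    # Naive deterministic init: pick k points evenly spaced through the dataset.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # converged
            break
        centers = new
    return labels, centers
```

On two well-separated blobs this recovers the expected grouping; the density-based and grid-based families in the taxonomy above exist precisely because such centroid methods struggle with non-convex cluster shapes.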
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
A Novel Clustering Method for Similarity Measuring in Text Documents (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum for scholarly research related to engineering and science education.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science, and technology, covering new teaching methods, assessment, validation, and the impact of new technologies, and it continues to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. Articles published in the journal can be accessed online.
A comparative study of clustering and biclustering of microarray data (ijcsit)
There are subsets of genes that behave similarly under some subsets of conditions, in which case we say that they co-express, but behave independently under other subsets of conditions. Discovering such co-expressions can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore of utmost importance to cluster genes and conditions simultaneously, in order to identify clusters of genes that are co-expressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem; consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on clustering and biclustering of gene expression data, also called microarray data.
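Since the abstract notes that biclustering is NP-hard and attacked heuristically, it may help to see the score that one classic heuristic (Cheng and Church's) greedily minimizes: the mean squared residue of a submatrix. A submatrix whose entries are an additive combination of row and column effects, the idealized co-expression pattern, scores exactly zero. A small sketch (the function name and interface are mine, for illustration only):

```python
import numpy as np

def mean_squared_residue(X, rows, cols):
    """Cheng-Church style bicluster score: lower means more coherent."""
    sub = X[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # per-gene mean in the bicluster
    col_means = sub.mean(axis=0, keepdims=True)   # per-condition mean
    overall = sub.mean()
    # Residue of each cell after removing row, column, and overall effects.
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())
```

A greedy biclustering heuristic would repeatedly delete the row or column contributing most to this score until it falls below a threshold, then report the surviving submatrix as one bicluster.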
Mammogram image segmentation using rough clustering (eSAT Journals)
Abstract: Mammography is the most effective procedure for diagnosing breast cancer at an early stage. This paper proposes mammogram image segmentation using the Rough K-Means (RKM) clustering algorithm. A median filter, commonly used to reduce noise in an image, is applied for pre-processing. The 14 Haralick features are extracted from the mammogram image using the Gray Level Co-occurrence Matrix (GLCM) at different angles. The features are clustered by the K-Means, Fuzzy C-Means (FCM), and Rough K-Means algorithms to segment the regions of interest for classification. The results of the segmentation algorithms are compared and analyzed using Mean Square Error (MSE) and Root Mean Square Error (RMSE). It is observed that the proposed method produces better results than the existing methods. Keywords: Mammogram, Data Mining, Image Processing, Feature Extraction, Rough K-Means, Image Segmentation
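To make the GLCM step concrete, here is a generic sketch of computing a co-occurrence matrix for one pixel offset and two of the 14 Haralick features (contrast and energy). This is not the paper's implementation; in practice one would use a library routine such as scikit-image's graycomatrix, and the offsets below are illustrative.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=4):
    """Normalized gray-level co-occurrence matrix for offset (dy, dx)."""
    M = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            M[img[y, x], img[y + dy, x + dx]] += 1  # count pixel pair
    return M / M.sum()

def haralick_contrast(P):
    # Weighted by squared gray-level difference: 0 for a flat image.
    i, j = np.indices(P.shape)
    return float(((i - j) ** 2 * P).sum())

def haralick_energy(P):
    # Sum of squared probabilities: 1 when all pairs are identical.
    return float((P ** 2).sum())
```

Extracting such features at several angles (offsets), as the abstract describes, yields the texture vectors that K-Means, FCM, and RKM then cluster.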
Semi-Supervised Discriminant Analysis Based On Data Structure (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
A Combined Approach for Feature Subset Selection and Size Reduction for High ... (IJERA Editor)
Selection of relevant features from a given feature set is one of the important issues in the fields of data mining and classification. In general, a dataset may contain many features, but it is not necessary that the whole feature set is important for a particular analysis or decision-making task, because features may share common information or be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation, or because of incomplete information about the observed system. In both cases the data will contain features that merely increase the processing burden and may ultimately lead to improper outcomes when used for analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach that not only removes unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
Survey on traditional and evolutionary clustering approaches (eSAT Journals)
Abstract: Clustering deals with grouping similar objects. Unlike classification, clustering tries to group a set of objects and discover whether there is some relationship between them, whereas in classification a set of predefined classes is known and it is enough to find which class an object belongs to. Simply put, classification is a supervised learning technique and clustering is an unsupervised learning technique. Clustering techniques apply when there is no class to be predicted but rather the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which the instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than to the remaining instances. Clustering naturally requires techniques different from classification and association learning methods. Clustering has many applications in various fields. In software engineering it helps in reverse engineering, software maintenance, and re-building systems, aiming to break a larger problem into small, understandable pieces. Since clustering has no single prescribed methodology, many methods, both traditional and evolutionary, are available for carrying it out. In this paper various types of the above-mentioned methods are described and some of them are compared. Each method has its own advantages and can be used according to the needs of the user. Keywords: Clustering, Classification, Software Engineering, Traditional, Evolutionary.
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods are used to partition data values by similarity. Similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for dimensionality reduction.
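The "clustering property of the Dirichlet Process" that lets DPMFP avoid fixing K in advance is often explained through the Chinese Restaurant Process metaphor: each document joins an existing cluster with probability proportional to that cluster's size, or opens a new one with probability governed by a concentration parameter alpha. A self-contained sketch of that prior alone (the full DPMFP model additionally weighs the data likelihood, which is omitted here):

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample cluster labels for n items from a Chinese Restaurant Process.

    Item i joins existing table t with probability size(t) / (i + alpha),
    or a brand-new table with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    tables = []   # current table sizes
    assign = []   # table index per item
    for i in range(n):
        probs = [c / (i + alpha) for c in tables] + [alpha / (i + alpha)]
        r, acc, t = rng.random(), 0.0, 0
        for t, p in enumerate(probs):
            acc += p
            if r < acc:
                break
        if t == len(tables):
            tables.append(1)      # open a new table
        else:
            tables[t] += 1        # join table t
        assign.append(t)
    return assign
```

The number of distinct labels the sampler produces grows with alpha and (slowly) with n, which is exactly why the number of clusters never needs to be supplied as input.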
REPRESENTATION OF UNCERTAIN DATA USING POSSIBILISTIC NETWORK MODELS (cscpconf)
Uncertainty is pervasive in real-world environments due to vagueness, which is associated with the difficulty of making sharp distinctions, and ambiguity, which is associated with situations in which the choice among several precise alternatives cannot be perfectly resolved. Analysis of large collections of uncertain data is a primary task in real-world applications, because data is often incomplete and inaccurate. Uncertain data can be represented in various forms, such as data stream models, linkage models, and graphical models, which provide a simple, natural way to process it and produce optimized results through query processing. In this paper, we propose that the uncertain data model can be represented as a possibilistic data model, and vice versa, for processing uncertain data using various data models such as the possibilistic linkage model, data streams, and possibilistic graphs. This paper presents the representation and processing of the possibilistic linkage model through possible worlds with the use of a product-based operator.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (ijdkp)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional data. Many significant subspace clustering algorithms exist, each with different characteristics arising from the use of different techniques, assumptions, and heuristics. A comprehensive classification scheme that considers all such characteristics is essential for dividing subspace clustering approaches into families; algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers better understand which quality criteria to use and which similar algorithms to compare against their proposed clustering algorithms. In this paper, we first propose the concept of a SCAF (Subspace Clustering Algorithms' Family), whose characteristics are based on classes such as cluster orientation and overlap of dimensions. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "axis-parallel, overlapping, density-based" SCAF.
Some forms of N-closed Maps in supra Topological spaces (IOSR Journals)
In this paper, we introduce the concept of N-closed maps and we obtain the basic properties and
their relationships with other forms of N-closed maps in supra topological spaces.
Simulation of IEEE 802.16e Physical Layer (IOSR Journals)
Abstract: Growth in technology has led to unprecedented demand for high-speed Internet access. IEEE 802.16e (Mobile WiMAX) is a wireless communication standard with high data transfer rates and good performance. Not only is it efficient compared to its counterpart technologies of today (Wi-Fi and 3G), but it also lays the foundation for 4G mobile communication. In 4G wireless communication systems, bandwidth is a precious resource, and service providers are continuously met with the challenge of accommodating more users within a limited allocated bandwidth. To increase the data rate of the wireless medium with higher performance, Mobile WiMAX uses Orthogonal Frequency Division Multiple Access (OFDMA). This paper describes the simulation of the physical layer of IEEE 802.16e using Simulink in Matlab 7.0 (R2010a). The system performance is evaluated considering the Signal-to-Noise Ratio (SNR) and Bit Error Rate (BER) parameters.
Keywords: 802.16e, OFDMA, Mobile WiMAX.
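Physical-layer simulations like the one this abstract describes are typically validated against a theoretical BER-versus-SNR reference curve. As a simplified illustration (BPSK over an AWGN channel, not the full OFDMA chain with higher-order modulations that 802.16e actually uses, and in Python rather than the paper's Simulink):

```python
import math

def bpsk_ber(ebn0_db):
    """Theoretical BER for BPSK over AWGN: Q(sqrt(2*Eb/N0)) = 0.5*erfc(sqrt(Eb/N0))."""
    ebn0 = 10 ** (ebn0_db / 10)   # convert dB to linear Eb/N0
    return 0.5 * math.erfc(math.sqrt(ebn0))
```

Plotting the simulated BER of the modeled physical layer over the same SNR range and checking it tracks (or degrades gracefully from) such a reference curve is the standard sanity check for this kind of study.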
Effects of Harness Running, Sand Running, Weight-Jacket Running and Weight ... (IOSR Journals)
Abstract: Background: The purpose of the study was to find the effects of harness running, sand running, weight-jacket running, and weight training on the dribbling and kicking performance of school-going soccer players of Burdwan district.
Method: 100 male students aged 14-18 years from different schools of Burdwan district were randomly selected as subjects and divided into five groups: the first served as the Harness Running group (HRG), the second as the Sand Running group (SRG), the third as the Weight-Jacket Running group (WJRG), the fourth as the Weight Training group (WTG), and the fifth as the Control group (CTG). Ten weeks of training were given accordingly; the control group was given no training beyond its routine. The selected subjects were measured on the soccer skills of kicking and dribbling. ANCOVA was used for the statistical treatment.
Findings: The findings imply that the Weight-Jacket group was more effective in decreasing the time taken and increasing the distance than all other training programs after ten weeks of training on dribbling and kicking.
Conclusions: In dribbling, the Weight-Jacket group showed a higher adjusted post-test mean difference with the Control group than the other three training groups (0.8, higher than the critical difference of 0.51 required to be significant at the 0.05 level). In kicking, the Weight-Jacket Running group showed a higher adjusted post-test mean difference with the Control group than the other three training groups (2.50, higher than the critical difference of 1.60 required to be significant at the 0.05 level).
Keywords: Harness Running, Sand Running, Weight-Jacket Running, Weight Training, Agility, Dribbling, Kicking
Motor Fitness of Rural Primary School Girls In Comparison To Boys (IOSR Journals)
Abstract: Differences between males and females in physical, physiological, motor, psychological, social, and emotional dimensions have been confirmed by many researchers over time (Tanner, 1978; Overman & Williams, 2004; Linda, 2005). The causes have been identified as genetic, social, and cultural. It has also been reported, however, that sex differences do not become prominent before puberty (Gustafsson & Lindenfors, 2008). The purpose of the study was to compare the motor fitness status of boys and girls attending a primary school in a rural setting. 118 boys and girls (9-10 years) were selected as subjects from Bardhaman district, West Bengal. Speed, cardio-respiratory endurance, muscular strength-endurance, flexibility, agility, coordination, and anaerobic power were chosen as the motor fitness variables for the study. The results revealed that in speed, coordination, power, and agility no significant difference exists between the groups (p > 0.05). In cardio-respiratory endurance, boys were better than the girls, while girls had higher scores than boys in flexibility and abdominal muscular strength-endurance.
Keywords: Motor fitness, primary level, sex difference.
Wireless and uninstrumented communication by gestures for deaf and mute based...IOSR Journals
Abstract: The fact that technology is advancing as per Moore’s law, the attention towards deaf and mute individuals with hi-tech technology is not much. Deaf and mute have to communicate through sign language even for pithy things. And also many people did not understand this language. Now-a-days gesture is becoming an increasingly popular means of interacting with computers. This paper sheds light of an proposed potential idea relying on latest technology named Wi-See which was developed in Washington, US. This technology actually uses our conventional Wi-Fi signals for home automation by gesture recognition. So, depending upon this hi-tech technology, my modified application idea is towards deaf and dumb, especially, one who cannot speak, but knows English language for communication. Since wireless signals do not require line-of-sight and can traverse through walls, proposed idea can be very useful to expressed views by speechless people without requiring instrumentation of the human body with sensing devices. The whole idea is based on Doppler shift in frequency of Wi-Fi signals. Instead of controlling home appliances as by Wi-See, this idea extends its view for speech or words through speakers installed. Each successive pattern of English alphabet generated by Doppler shift by gestures in air, can be recorded and matched with predefined pattern, which when processed, be outputed through speaker as combined letter word ,inspired by English digital dictionary having prediction and correction algorithm. Keywords: Wi-Fi, Wi-See, Doppler shift, Gestures, Communication
A Quantified Approach for large Dataset Compression in Association MiningIOSR Journals
Abstract: With the rapid development of computer and information technology in the last several decades, an
enormous amount of data in science and engineering will continuously be generated in massive scale; data
compression is needed to reduce the cost and storage space. Compression and discovering association rules by
identifying relationships among sets of items in a transaction database is an important problem in Data Mining.
Finding frequent itemsets is computationally the most expensive step in association rule discovery and therefore
it has attracted significant research attention. However, existing compression algorithms are not appropriate in
data mining for large data sets. In this research a new approach is describe in which the original dataset is
sorted in lexicographical order and desired number of groups are formed to generate the quantification tables.
These quantification tables are used to generate the compressed dataset, which is more efficient algorithm for
mining complete frequent itemsets from compressed dataset. The experimental results show that the proposed
algorithm performs better when comparing it with the mining merge algorithm with different supports and
execution time.
Keywords: Apriori Algorithm, mining merge Algorithm, quantification table
A Comparative Study on Recovery Pulse Rate after 12 Minute Run and Walk TestIOSR Journals
Abstract: The sports are a world-wide phenomenon today. In the world history, sports was a popular
organization and important as today. It has been an interesting aspect for human amusement and a cultural
phenomenon at great magnitude. It has got mass participation, as it attracts people either for recreations,
physical fitness or performance.The effectiveness of using the heart rate (HR) as an indicator of exercise
intensity to monitor of all games and sports and any physical activity. However, recently new regulations and a
trend towards a more conditional games have prompted a need to revise field study procedures and demand the
increased specialization of games’ concerned. Yoga has been become increasingly popular in the world as a
method to reduce stress and as a means of exercise and fitness training also for recovery purpose as well. The
purpose of the study was to compare the differences on recovery pulse rate among, Yoga Nidra group ,Savasana
group and Control group, (25 of each group). For the purpose of the study 75 male B.P.Ed students from
P.G.G.I.P.E Banipur North 24 pgs, West Bengal were selected as the subjects for this study. The age of the
subjects was between 22-25 years. Recovery Pulse Rate was only the variable of the study. ‘ANOVA’ was
applied to calculate the collected data at 0.05 level of significance and to indentify the significance differences
among the means critical difference was used as a Post-hoc test. The result showed that there was no
significant difference between Control group and Savasana group but significance difference were observed
between Control group and Yoga Nidra group, and between Savasana group and Yoga Nidra group.
Key Words: Recovery pulse rate, Yoga Nidra group, Savasana group, Control group
Hepatoprotective Activity of Chara Parpam in Ccl4 Induced RatsIOSR Journals
Siddha system of medicine provides most frequently and to the extent possible and promising therapy for the relief of signs and symptoms of liver disorder over the generations. Their high therapeutic quality and lack of toxicity are exceptional. The present experimental work was to evaluate the hepatoprotective properties of Siddha herbo-mineral formulation Chara Parpam by CCl4 induced hepatotoxicity in albino rats. Two doses of Chara Parpam (5 mg/kg and 10 mg/kg) were administered to rats. Protection of hepatocytes was evaluated by estimate the level of ALT, AST, ALP, serum bilirubin, total protein, serum albumin, sodium and potassium during the exposure of CCL4 on wistar albino rats and to evaluate the effect of different doses of Chara Parpam against hepatotoxicity induced by CCL4. Liver histology was performed 24 hours after the administration of trial drug Chara Parpam. The result indicated that the concentration of ALT, AST, and ALP, released by hepatocytes were significantly reduced in the presence of Chara Parpam. The cytoprotective effects of the Chara Parpam are dose-dependent. Through this work, we demonstrate for the first time the direct protection of liver cells by administration of Chara Parpam confirming its hepatoprotective properties.
Submerged fermentation of laccase producing Streptomyces chartreusis using bo...IOSR Journals
Response surface methodology was engaged for the optimization of diverse nutritional and physical parameters for laccase production by Streptomyces chartreusis strain NBRC 12753 in the submerged fermentation process. Screening of production parameters was executed using Plackett–Burman design and the variables with statistically momentous effects on laccase production were recognized. Variables such as Cupric sulphate, Pyrogallol and Yeast extract were selected for further optimization studies using Box-Behnken design. The multiple regression coefficients (R2) had a value of 0.9606, indicating that the model could explain up to 96.06 % of the variability of the response. This methodology facilitated analysis of the experimental data to establish the optimum conditions for the process and understand the contribution of individual factors to evaluate the response under optimal conditions. Thus application of Box-Behnken approach appears to have potential usage in process application.
Square Microstrip Antenna with Dual Probe for Dual Polarization in ISM BandIOSR Journals
Abstract: This paper presents the design of antenna operating in ISM band at 2.4 GHz. The designed square patch antenna is dual polarized with two rectangle shaped slot inserted on the patch. The FR4 dielectric material is used for the antenna consist of Dual probe feed with ground plane. HFSS software is used for the simulation which shows the result for isolation as 28 dB, antenna gain of 5.96 dB and bandwidth 222MHz. Keywords: Dual feed, Dual polarization, ISM Band, Probe Feed, Square MSA
IOSR Journal of Applied Physics (IOSR-JAP) is an open access international journal that provides rapid publication (within a month) of articles in all areas of physics and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in applied physics. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Efficient Utilization of Bandwidth in Location Aided RoutingIOSR Journals
Abstract : Earlier work on routing MANETs developed several routing protocols, which finds available route from source to destination without taking into the consideration of Band width availability for data transfer, and they frequently fails to discover stable routes between source and destination. As a result of that there is a large numbers of discarding of data packets as well as overloading of packets as the consequences of that large wastage of band width. EUBLAR (Efficient Utilization of band width in Location Aided Routing) protocol is introduced in this proposed work, which is capable of calculating the available band width of all the intermediate nodes between source and destination. In this proposed protocol find the minimum available band width of all the intermediate nodes between source and destination and then according to that band width sends the data packets over that path. The EUBLAR can effectively utilized the wastage of band width and every single band width can be used for data transfer can be used over entirely configured network. In this way we can increase the quality of service of the Ad- hoc network in terms of bandwidth. Keywords: Ad Hoc Networks, Global Positioning System, Maximum & Minimum slopes, Minimum available Bandwidth, Time to Live
Effect of astaxanthin on ethylene glycol induced nephrolithiasisIOSR Journals
Nephrolithiasis is one of the most common and painful of urological disorders with a high prevalence rate. The role of calcium oxalate crystals, which are the predominant component of kidney stones in generating oxidative stress, have been clearly demonstrated in previous studies. Astaxanthin, found in marine organisms is a dietary xanthophyll carotenoid with enhanced antioxidative properties and pharmacological effects. In the present study, we have investigated the effect of this natural antioxidant, at a daily dose of 25mg/kg in experimental calcium oxalate nephrolithiasis in male Wistar rats. Liver function markers, hepatic antioxidants, albumin creatinine ratios, renal calcium content and changes in body and kidney weight have been studied to evaluate the effect of this carotenoid in vivo. The effect of citrate, a component of most pharmaceutical drugs for management of nephrolithiasis has also been evaluated for the purpose of comparison with astaxanthin treatment. Astaxanthin is seen to exert a protective effect on the liver and kidney tissues in ethylene glycol treated rats by improving the liver function, restoring the activity of the hepatic antioxidant enzymes, decreasing the albumin creatinine ratios and calcium levels and maintaining the organ to body weight ratio. Our results also indicate that astaxanthin administration is more beneficial than citrate treatment
Implementation of error correcting methods to the asynchronous Delay Insensit...IOSR Journals
Abstract: This Paper provides an approach for reducing delay and area in asynchronous communication. A new class of error correcting Delay Insensitive (ie., unordered) codes is introduced for global asynchronous communication.It simultaneously provides timing-robustness and fault tolerance for the codes.A systematic and weighted code is targeted. The proposed error correcting unordered (ECU) code, called zero-sum can provide 1-bit correction.The extensions to the zero-sum code are given.The zero_sum⁺ code provides 3-bit error detection,or it can provide 2-bit detection and 1-bit correction.The zero_sum* code support 2-bit correction,while still guaranteeing 2-bit detection under different strategies of weight assignments. Zero_sum* code provides 2-bit correction coverage (50 % to 70%) of all 2-bit errors. The proposed method reduces delay occurred, due to the transfer of corrupted bits in a packet on the channel by the removal of timer and also reduces the area with the proposed Completion Detector (CD). Keywords : Asynchronous communication, Four phase protocol , error-correcting codes, delay insensitive and unordered.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text Document Clustering is one of the fastest growing research areas because of availability of huge amount of information in an electronic form. There are several number of techniques launched for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search in effectively navigating, summarizing, and organizing information. A global optimal solution can be obtained by applying high-speed and high-quality optimization algorithms. The optimization technique performs a globalized search in the entire solution space. In this paper, a brief survey on optimization approaches to text document clustering is turned out.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
automatic classification in information retrievalBasma Gamal
automatic classification in information retrieval-automatic classification of documents
Chapter 3 from IR_VAN_Book
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN B.Sc., Ph.D., M.B.C.S.
Comparison Between Clustering Algorithms for Microarray Data AnalysisIOSR Journals
Currently, there are two techniques used for large-scale gene-expression profiling; microarray and
RNA-Sequence (RNA-Seq).This paper is intended to study and compare different clustering algorithms that used
in microarray data analysis. Microarray is a DNA molecules array which allows multiple hybridization
experiments to be carried out simultaneously and trace expression levels of thousands of genes. It is a highthroughput
technology for gene expression analysis and becomes an effective tool for biomedical research.
Microarray analysis aims to interpret the data produced from experiments on DNA, RNA, and protein
microarrays, which enable researchers to investigate the expression state of a large number of genes. Data
clustering represents the first and main process in microarray data analysis. The k-means, fuzzy c-mean, selforganizing
map, and hierarchical clustering algorithms are under investigation in this paper. These algorithms
are compared based on their clustering model.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Multilevel techniques for the clustering problemcsandit
Data Mining is concerned with the discovery of interesting patterns and knowledge in data
repositories. Cluster Analysis which belongs to the core methods of data mining is the process
of discovering homogeneous groups called clusters. Given a data-set and some measure of
similarity between data objects, the goal in most clustering algorithms is maximizing both the
homogeneity within each cluster and the heterogeneity between different clusters. In this work,
two multilevel algorithms for the clustering problem are introduced. The multilevel
paradigm suggests looking at the clustering problem as a hierarchical optimization process
going through different levels evolving from a coarse grain to fine grain strategy. The clustering
problem is solved by first reducing the problem level by level to a coarser problem where an
initial clustering is computed. The clustering of the coarser problem is mapped back level-bylevel
to obtain a better clustering of the original problem by refining the intermediate different
clustering obtained at various levels. A benchmark using a number of data sets collected from a
variety of domains is used to compare the effectiveness of the hierarchical approach against its
single-level counterpart.
Data Science - Part VII - Cluster AnalysisDerek Kane
This lecture provides an overview of clustering techniques, including K-Means, Hierarchical Clustering, and Gaussian Mixed Models. We will go through some methods of calibration and diagnostics and then apply the technique on a recognizable dataset.
Clustering Algorithm Based On Correlation Preserving IndexingIOSR Journals
Abstract: Fast retrieval of the relevant information from the databases has always been a significant issue.
Different techniques have been developed for this purpose; one of them is Data Clustering. In this paper Data
Clustering is discussed along with the applications of Data Clustering and Correlation Preserving Indexing. We
proposed a CPI (Correlation Preserving Indexing) algorithm and relate it to structural differences between the
data sets.
Keywords: Data Clustering, Data Mining, Clustering techniques, Correlation Preserving Indexing.
Step by step operations by which we make a group of objects in which attributes
of all the objects are nearly similar, known as clustering. So, a cluster is a collection of
objects that acquire nearly same attribute values. The property of an object in a cluster is
similar to other objects in same cluster but different with objects of other clusters.
Clustering is used in wide range of applications like pattern recognition, image processing,
data analysis, machine learning etc. Nowadays, more attention has been put on categorical
data rather than numerical data. Where, the range of numerical attributes organizes in a
class like small, medium, high, and so on. There is wide range of algorithm that used to
make clusters of given categorical data. Our approach is to enhance the working on well-
known clustering algorithm k-modes to improve accuracy of algorithm. We proposed a new
approach named “High Accuracy Clustering Algorithm for Categorical datasets”.
Similar to Literature Survey On Clustering Techniques (20)
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath New York Community Day in-person eventDianaGray10
UiPath Community Day is a unique gathering designed to foster collaboration, learning, and networking with automation enthusiasts. Whether you're an automation developer, business analyst, IT professional, solution architect, CoE lead, practitioner or a student/educator excited about the prospects of artificial intelligence and automation technologies in the United States, then the UiPath Community Day is definitely the place you want to be.
Join UiPath leaders, experts from the industry, and the amazing community members and let's connect over expert sessions, demos and use cases around AI in automation as we highlight our technology with a special speaker on Document Understanding.
📌Agenda
3:00 PM Registrations
3:30 PM Welcome note and Introductions | Corina Gheonea (Senior Director of Global UiPath Community)
4:00 PM Introduction to Document Understanding
How to build and deploy Document Understanding process
Where would Document Understanding be used.
Demo
Q&A
4:45 PM Customer/Partner showcase
Accelirate
Intro to Accelirate and history with UiPath
Why are we excited about the new AI features of UiPath?
Customer highlight
a. Document Understanding – BJs Case Study
b. Document Understanding + generative AI
5.30 PM Networking
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
КАТЕРИНА АБЗЯТОВА «Ефективне планування тестування ключові аспекти та практ...QADay
Lviv Direction QADay 2024 (Professional Development)
КАТЕРИНА АБЗЯТОВА
«Ефективне планування тестування ключові аспекти та практичні поради»
https://linktr.ee/qadayua
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
IOSR Journal of Computer Engineering (IOSRJCE)
ISSN: 2278-0661, Volume 3, Issue 1 (July-Aug. 2012), PP 01-12
www.iosrjournals.org
Literature Survey On Clustering Techniques
B. G. Obula Reddy1, Dr. Maligela Ussenaiah2
1Associate Professor, Lakireddy Balireddy College of Engineering, L.B. Reddy Nagar, Mylavaram, Krishna (Dist.): 521 230
2Assistant Professor, Computer Science, Vikrama Simhapuri University, Nellore, Andhra Pradesh, India
Abstract: Clustering is the assignment of data objects (records) into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters. Clustering techniques have been discussed extensively in similarity search, segmentation, statistics, machine learning, trend analysis, pattern recognition, and classification. Clustering methods can be classified into i) partitioning methods, ii) hierarchical methods, iii) density-based methods, iv) grid-based methods, and v) model-based methods. In this paper, we review clustering methods by taking some examples for each classification. We also provide a comparative statement over the constraints of data type, cluster shape, complexity, data set, measure, advantages, and disadvantages.
Keywords: clustering; partitioning; hierarchical; density-based; grid-based; model-based
I. Introduction
Cluster analysis is an unsupervised learning method that constitutes a cornerstone of an intelligent data analysis process. It is useful for exploring the inter-relationships among a collection of patterns by organizing them into homogeneous clusters. It is called unsupervised learning because no a priori labeling of some patterns is available to use in categorizing others and inferring the cluster structure of the whole data. Intra-connectivity is a measure of the density within a cluster. High intra-connectivity indicates a good clustering arrangement because the instances grouped within the same cluster are highly dependent on each other. Inter-connectivity is a measure of the connectivity between distinct clusters. A low degree of inter-connectivity is advantageous because it indicates that individual clusters are largely independent of each other.
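As an illustration (not part of the original paper), intra- and inter-connectivity can be approximated with mean pairwise Euclidean distances: a small mean intra-cluster distance corresponds to high intra-connectivity (a dense cluster), while a large mean inter-cluster distance corresponds to low inter-connectivity (well-separated clusters). A minimal Python sketch, with made-up points:

```python
import numpy as np

def mean_intra_distance(points):
    """Mean pairwise distance within one cluster (small value = dense cluster)."""
    n = len(points)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def mean_inter_distance(a, b):
    """Mean pairwise distance between two clusters (large value = well separated)."""
    dists = [np.linalg.norm(p - q) for p in a for q in b]
    return float(np.mean(dists))

cluster_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
cluster_b = np.array([[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

print(mean_intra_distance(cluster_a))              # small: tight cluster
print(mean_inter_distance(cluster_a, cluster_b))   # large: well separated
```

On these toy points the intra-cluster mean is far smaller than the inter-cluster mean, which is the signature of a good clustering arrangement.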
Every instance in the data set can be represented using the same set of attributes; the attributes may be categorical. To induce a hypothesis from a given data set, a learning system needs to make assumptions about the hypothesis to be learned. These assumptions are called biases. Since every learning algorithm uses some biases, it reacts well in domains where its biases are appropriate and performs poorly in other domains. A problem with clustering methods is that the interpretation of the clusters may be difficult: the algorithms will always assign the data to clusters, even if there were no clusters in the data.
Cluster analysis is a difficult problem because many factors come into play in devising a well-tuned clustering technique for a given clustering problem: 1. effective similarity measures, 2. criterion functions, and 3. algorithms. Moreover, it is well known that no clustering method can adequately handle all sorts of cluster structures (shape, size, and density). Sometimes the quality of the clusters that are found can be improved by preprocessing the given data; it is not uncommon to find noisy values and eliminate them in a preprocessing step. A common technique is to use postprocessing steps to try to fix up the clusters that have been found.
Clustering is not a recent invention, nor is its relevance to computational toxicology a recent application. Its theory, however, is often lost within the black-box treatments used by QSAR programs. Clustering, in the general sense, is the grouping of objects together based on their similarity, while excluding objects that are dissimilar. One of the first applications of cluster analysis to drug discovery was by Harrison, who asserted that locales exist within the chemical space which favor biological activity. Consequently, these localities form clusters of structurally similar compounds. This idea that structure confers activity is also the fundamental premise of all QSAR analyses. The basic framework for compound clustering consists of three main steps: the computation of structural features, the selection of a difference metric, and the application of the clustering algorithm.
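The three-step framework above can be sketched concretely. The binary fingerprints, the Jaccard metric, and the average-link choice below are illustrative assumptions, not the paper's prescription:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Step 1: structural features -- here, hypothetical binary fingerprints
# (one row per compound, one column per structural feature).
fingerprints = np.array([[1, 1, 0, 0, 1],
                         [1, 1, 0, 1, 1],
                         [0, 0, 1, 1, 0],
                         [0, 1, 1, 1, 0]], dtype=bool)

# Step 2: difference metric -- Jaccard distance is common for binary fingerprints.
dists = pdist(fingerprints, metric="jaccard")

# Step 3: clustering algorithm -- average-link agglomerative clustering,
# cut into two clusters of structurally similar compounds.
labels = fcluster(linkage(dists, method="average"), t=2, criterion="maxclust")
print(labels)
```

Any of the clustering algorithms surveyed in this paper could serve as step three; only the feature and metric choices are specific to the compound-clustering setting.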
Generally clustering algorithms can be categorized into hierarchical clustering methods, partitioning clustering
methods, density-based clustering methods, grid-based clustering methods, and model-based clustering methods.
An excellent survey of clustering techniques can be found in (Jain et al., 1999). Section II deals with hierarchical methods, with BIRCH, CURE, and CHAMELEON as examples; Section III with partitioning methods, with k-means and k-medoids as examples; Section IV with density-based clustering, with DBSCAN, OPTICS, and DENCLUE as examples; Section V with grid-based methods, with CLIQUE, STING, MAFIA, WaveCluster, O-Cluster, ASGC, and AMR as examples; and Section VI with model-based methods, with RBMNs, SOM, and ensembles of clustering algorithms as examples.
2. Literature Survey On Clustering Techniques
www.iosrjournals.org 2 | Page
II. Hierarchical methods:
A hierarchical clustering algorithm divides the given data set into smaller subsets in a hierarchical manner, grouping the data instances into a tree of clusters. Two major methods are available under this category: the agglomerative method, which forms clusters in a bottom-up fashion until all data instances belong to the same cluster, and the divisive method, which splits the data set into smaller clusters in a top-down fashion until each cluster contains only one instance. Both divisive and agglomerative algorithms can be represented by dendrograms.
Hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be joined. For agglomerative hierarchical techniques, the principle is typically to merge the "closest" pair of clusters, where "close" is defined by a specified measure of cluster closeness. Three common definitions of the closeness between two clusters are single link, complete link, and average link. The single link similarity of two clusters is the similarity between their two most similar instances; the complete link similarity is the similarity between their two least similar instances; and the average link similarity is the mean pairwise similarity. Single link is good at handling non-elliptical shapes; complete link is less susceptible to noise and outliers but has trouble with non-convex shapes. Some well-known hierarchical clustering algorithms are Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Clustering Using REpresentatives (CURE), and CHAMELEON.
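As an illustration of the three linkage rules, the following is a minimal pure-Python sketch of bottom-up agglomerative clustering (the function names and tiny example data are ours; a production implementation would maintain a distance matrix rather than recomputing pairs):

```python
import math

def linkage_distance(c1, c2, method="single"):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairs = [math.dist(p, q) for p in c1 for q in c2]
    if method == "single":            # distance of the two closest members
        return min(pairs)
    if method == "complete":          # distance of the two farthest members
        return max(pairs)
    return sum(pairs) / len(pairs)    # average link: mean pairwise distance

def agglomerative(points, k, method="single"):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage_distance(clusters[ab[0]], clusters[ab[1]], method),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

clusters = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2, method="single")
```

The same call with method="complete" or "average" changes only the local merge decision, which is exactly the sense in which these techniques "decide locally at each step".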
2.1 CURE:
CURE is a hierarchical clustering algorithm that employs the features of both the centroid-based algorithms and the all-points algorithms. CURE [7] obtains a data sample from the given database, divides it into groups, and identifies some representative points from each group of the data sample. In the first phase, the algorithm considers a set of widely spaced points from the given dataset. In the next phase, the selected dispersed points are moved towards the center of the cluster by a specified shrinking factor α. As a result of this process, arbitrarily shaped clusters are obtained from the dataset, and outliers are identified and eliminated along the way. In the next phase, the representative points of the clusters are checked for proximity against a threshold value, and the clusters that are close to each other are grouped together to form the next set of clusters. In this hierarchical algorithm, the value of the factor α may vary between 0 and 1. The utilization of the shrinking factor α allows CURE to overcome the limitations of the centroid-based and all-points approaches: as the representative points are moved through the clustering space, the ill effects of outliers are reduced to a great extent. Thus the feasibility of CURE is enhanced by the shrinking factor α. The worst-case time complexity of CURE is O(n² log n).
Figure: Overview of the CURE implementation.
A random sample of data objects is drawn from the given datasets. Partial clusters are obtained by
partitioning the sample dataset and outliers are identified and removed in this stage. Final refined clusters are
formed from the partial cluster set.
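The shrinking of scattered representative points toward the cluster center, the heart of CURE, can be sketched as follows (a simplified illustration rather than the published implementation; the farthest-point heuristic and all names are ours):

```python
def shrink_representatives(cluster, c=3, alpha=0.5):
    """Pick up to c well-scattered points of a cluster, shrink them toward the centroid by alpha."""
    n, d = len(cluster), len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / n for i in range(d))

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # farthest-point heuristic: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from the points already chosen
    reps = [max(cluster, key=lambda p: dist2(p, centroid))]
    while len(reps) < min(c, n):
        reps.append(max(cluster, key=lambda p: min(dist2(p, r) for r in reps)))

    # move each representative a fraction alpha of the way toward the centroid
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid)) for p in reps]
```

With α = 0 the representatives stay on the cluster boundary (all-points behavior); with α = 1 they collapse onto the centroid (centroid-based behavior).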
2.2 BIRCH:
The clustering algorithm BIRCH is a main-memory-based algorithm, i.e., the clustering process is carried out under a memory constraint. BIRCH's incremental clustering is based on the concepts of the clustering feature and the CF tree. A clustering feature is a triple that summarizes the information of a cluster. Given N d-dimensional points {xi}, i = 1, 2, ..., N, in a cluster, the clustering feature (CF) vector of the cluster is defined as
CF = (N, LS, SS)
where N is the number of points in the cluster, LS is the linear sum of the N points, i.e., Σi=1..N xi, and SS is the square sum of the data points, i.e., Σi=1..N xi².
A clustering feature tree (CF tree) contains the CFs that hold the summary of the clusters. A CF tree is a height-balanced tree with two parameters: a branching factor B and a threshold T. A non-leaf node can be represented as {(CFi, childi) : i = 1, 2, ..., B}, where childi is a pointer to its i-th child node and CFi is the clustering feature of the subcluster represented by the i-th child.
A non-leaf node thus represents a cluster made up of all the subclusters its entries point to. In the same manner, a leaf node's entries represent its subclusters, and each must conform to the threshold value T. The BIRCH clustering algorithm is implemented in four phases. In phase 1, the initial CF tree is built from the database based on the branching factor B and the threshold value T. Phase 2 is an optional phase in which the initial CF tree is reduced in size to obtain a smaller CF tree. Global clustering of the data points is performed in phase 3, from either the initial CF tree or the smaller tree of phase 2. As the published evaluation shows, good clusters can already be obtained from phase 3 of the algorithm. If it is required to improve the quality of the clusters, phase 4 of the algorithm is added to the clustering process. The execution of phase 1 begins with a threshold value T. The procedure reads the entire set of data points in this phase and selects data points based on a distance function; the selected points are stored in the nodes of the CF tree. Data points that are closely spaced are considered to form clusters and are selected, while data points that are widely spaced are considered outliers and are discarded from clustering. If the threshold limit is exceeded before the complete scan of the database, the threshold value is increased and a much smaller tree with all the chosen data points is rebuilt. An optimum value for the threshold T is necessary in order to obtain good-quality clusters from the algorithm. If it is required to fine-tune the quality of the clusters, further scans of the database through phase 4 are recommended. The worst-case time complexity of this algorithm is O(n); the time needed for execution varies linearly with the dataset size.
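The CF triple and its key property, additivity, can be made concrete in a few lines (a sketch with our own names; `cf_radius` uses the standard identity SS/N − ‖LS/N‖²):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    ls = tuple(sum(p[i] for p in points) for i in range(d))
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(a, b):
    """CF additivity: two subclusters merge by component-wise addition of their CFs."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(feature):
    n, ls, _ = feature
    return tuple(x / n for x in ls)

def cf_radius(feature):
    """Root-mean-square distance of members from the centroid, from the CF alone."""
    n, ls, ss = feature
    c = [x / n for x in ls]
    return (ss / n - sum(x * x for x in c)) ** 0.5
```

Additivity is what lets BIRCH absorb a new point or merge two tree entries without revisiting the raw data.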
2.3 CHAMELEON:
The major disadvantage of agglomerative hierarchical approaches is that they are based on a static, user-specified interconnectivity model, which either underestimates or overestimates the interconnectivity of objects and clusters. This limitation is overcome by the CHAMELEON algorithm. CHAMELEON makes use of a sparse graph, where the nodes represent data objects and the edge weights represent similarities between the data objects. This sparse-graph implementation lets CHAMELEON scale to large databases effectively, and it is based on the frequently used k-nearest-neighbor graph representation. CHAMELEON determines the similarity between a pair of clusters Ci and Cj by evaluating their relative interconnectivity RI(Ci, Cj) and relative closeness RC(Ci, Cj). When the values of both RI(Ci, Cj) and RC(Ci, Cj) are high for a pair of clusters, CHAMELEON's agglomerative algorithm merges those two clusters.
The relative interconnectivity of two clusters Ci and Cj is defined as
RI(Ci, Cj) = |EC(Ci, Cj)| / ((|EC(Ci)| + |EC(Cj)|) / 2)
where |EC(Ci, Cj)| is the edge-cut between Ci and Cj, i.e., the total weight of the edges connecting the two clusters, and |EC(Ci)| and |EC(Cj)| are the weights of the min-cut bisectors indicating the internal interconnectivity of clusters Ci and Cj, respectively.
The relative closeness of two clusters Ci and Cj is defined as
RC(Ci, Cj) = S(Ci, Cj) / ((|Ci| / (|Ci| + |Cj|)) S(Ci) + (|Cj| / (|Ci| + |Cj|)) S(Cj))
where S(Ci) and S(Cj) are the average edge weights of the min-cut bisectors of clusters Ci and Cj, and S(Ci, Cj) is the average weight of the edges connecting the vertices of cluster Ci with those of cluster Cj.
The CHAMELEON agglomerative hierarchical approach operates in two separate phases. In the first phase, the data objects are clustered into subclusters by dynamic modeling. In the second phase, a dynamic modeling framework is employed on the data objects to merge the subclusters in a
hierarchical manner to obtain good-quality clusters. The dynamic framework model can be implemented by two different methods. In the first method, the algorithm checks whether the relative interconnectivity and relative closeness between a pair of clusters cross user-specified threshold values; for this purpose, the two parameters must satisfy the following conditions:
RI(Ci, Cj) ≥ T_RI and RC(Ci, Cj) ≥ T_RC
In the second method, CHAMELEON chooses the pair of clusters that maximizes the function
RI(Ci, Cj) · RC(Ci, Cj)^α
where α is a user-specified parameter that takes values between 0 and 1.
III. Partitioning Methods:
Partitioning methods are divided into two subcategories: centroid algorithms and medoid algorithms. Centroid algorithms represent each cluster by the gravity center of its instances, while medoid algorithms represent each cluster by the instance closest to its gravity center. The best-known centroid algorithm is k-means. The k-means method partitions the data set into k subsets such that all points in a given subset are closest to the same center.
In detail, it randomly selects k of the instances to represent the clusters. Based on the selected attributes, the remaining instances are assigned to their closest center. K-means then computes the new centers by taking the mean of all data points belonging to the same cluster. The process is iterated until there is no change in the gravity centers. If k is not known ahead of time, various values of k can be evaluated until the most suitable one is found. The effectiveness of this method, as of others, relies heavily on the objective function used in measuring the distance between instances.
The difficulty lies in finding a distance measure that works well with all types of data; there are several procedures to define the distance between instances. Generally, the k-means algorithm has the following important characteristics: 1. it is efficient in processing large data sets; 2. it terminates at a local optimum; 3. the clusters it finds are spherical in shape; 4. it is sensitive to noise. There are, however, variants of the k-means procedure that get around this last problem. Choosing the proper initial centroids is the key step of the basic k-means procedure.
The k-modes algorithm is a more recent partitioning algorithm that uses the simple matching coefficient measure to deal with categorical attributes. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow the clustering of instances described by mixed attributes. More recently, another generalization of the conventional k-means algorithm has been presented, one applicable to ellipse-shaped as well as ball-shaped data clusters, free of the dead-unit problem, and able to perform correct clustering without pre-determining the exact cluster number.
Traditional clustering approaches generate partitions in which each pattern belongs to exactly one cluster; the clusters in such a hard clustering are disjoint. Fuzzy clustering extends this notion by associating each pattern with every cluster through a membership function: larger membership values indicate higher confidence in the assignment of the pattern to the cluster. One widely used algorithm is Fuzzy C-Means (FCM), which is based on k-means. FCM attempts to find the most characteristic point in each cluster, which can be considered the center of the cluster, and then the grade of membership of each instance in the clusters.
Other soft clustering algorithms have been developed, most of them based on the Expectation-Maximization (EM) algorithm. They assume an underlying probability model with parameters that describe the probability that an instance belongs to a certain cluster. The procedure starts with initial guesses for the mixture model parameters; these values are used to calculate the cluster probabilities for each instance, the probabilities are in turn used to re-estimate the parameters, and the process is repeated.
A drawback of such algorithms is that they tend to be computationally more expensive. Another problem with this approach is overfitting, which may be caused by two factors: on the one hand, a large number of clusters may be specified; on the other, the distributions of probabilities may have too many parameters. One possible solution is to adopt a fully Bayesian approach, in which every parameter has a prior probability distribution.
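A minimal EM sketch for a one-dimensional mixture of two unit-variance Gaussians shows the guess/estimate cycle described above (the initialization, the fixed variance, and all names are simplifications of ours):

```python
import math

def em_1d(data, iters=50):
    """EM for a 1-D mixture of two unit-variance Gaussians (a minimal sketch)."""
    mus = [min(data), max(data)]       # crude initial guesses for the two means
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, mus)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate the weights and means from the responsibilities
        for j in (0, 1):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
    return mus, weights

mus, weights = em_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```

A full implementation would also re-estimate the variances and monitor the log-likelihood for convergence; this sketch keeps only the two alternating steps.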
How do partitioning algorithms work?
They construct a partition of a data set of n objects into a set of k clusters so as to minimize a criterion (e.g., the sum of squared distances). The goal is, given k, to find the partition into k clusters that optimizes the chosen partitioning criterion. A global optimum would require exhaustively enumerating all partitions; in practice, heuristic methods such as k-means and k-medoids are used:
1. Pick a number (K) of cluster centers (at random).
2. Assign every item to its nearest cluster center (e.g., using Euclidean distance).
3. Move each cluster center to the mean of its assigned items.
4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).
3.1 K-means Clustering:
A partitional clustering approach:
a) each cluster is associated with a centroid (center point);
b) each point is assigned to the cluster with the closest centroid;
c) the number of clusters, K, must be specified.
The basic algorithm is very simple:
1: Select K points as the initial centroids.
2: repeat
3: Form K clusters by assigning every point to its closest centroid.
4: Recompute the centroid of each cluster.
5: until the centroids do not change
K-means example (steps 1–6 are illustrated in the accompanying figures).
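The pseudocode above translates almost line for line into a runnable sketch (pure Python, with our own names; random initialization as in step 1):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means: random initial centroids, assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: K random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 3: form K clusters by assigning each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # step 4: recompute each centroid as the mean of its assigned points
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # step 5: stop when centroids are stable
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, cls = kmeans(pts, 2)
```

On this well-separated toy data any random initialization converges to the two blob means; in general, k-means only reaches a local optimum that depends on the initial centroids.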
3.2 k-Medoids:
In the k-medoids algorithm, rather than calculating the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. The medoid of each cluster is found as the object i within the cluster that minimizes
Σ_{j ∈ Ci} d(i, j)
where Ci is the cluster containing object i and d(i, j) is the distance between objects i and j.
The k-medoids algorithm can be summarized as follows:
1. Choose k objects at random to be the initial cluster medoids.
2. Assign each object to the cluster associated with the closest medoid.
3. Recalculate the positions of the k medoids.
4. Repeat Steps 2 and 3 until the medoids become fixed.
Step 3 could be performed by calculating this sum for each object i from scratch at each iteration. However, many objects remain in the same cluster from one iteration to the next, so improvements in speed can be obtained by adjusting the sums whenever an object leaves or enters a cluster. Step 2 can also be made faster for larger values of k: for each object, an array of the other objects, sorted by distance, is maintained, and the closest medoid can be found by scanning through this array until a medoid is encountered, rather than computing the distance to every medoid.
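A minimal k-medoids sketch following steps 1–4 (the deterministic initialization, in place of the random choice in step 1, and the names are ours; the speed-ups described above are omitted for clarity):

```python
import math

def kmedoids(points, k, dist, iters=100):
    """Basic k-medoids: the medoid is the member minimizing total in-cluster distance."""
    medoids = points[:k]                      # step 1 (simplified: deterministic init)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 2: assign each object to the cluster of its closest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[j].append(p)
        # step 3: recompute each medoid as the member with minimal total distance
        new = [min(cl, key=lambda c: sum(dist(c, q) for q in cl)) for cl in clusters]
        if new == medoids:                    # step 4: stop once the medoids are fixed
            break
        medoids = new
    return medoids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
meds, cls = kmedoids(pts, 2, math.dist)
```

Because medoids are always actual data objects, the method works with any distance function and is less sensitive to outliers than k-means.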
IV. Density-based Clustering:
Density-based clustering algorithms try to find clusters based on the density of data points in a region. The main idea of density-based clustering is that, for each instance of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts). One of the best-known density-based clustering algorithms is DBSCAN. DBSCAN separates data points into three classes:
• Core points: points in the interior of a cluster.
• Border points: points that are not core points but fall within the neighborhood of a core point.
• Noise points: points that are neither core points nor border points.
4.1 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DBSCAN is a density-based clustering algorithm that uses an R*-tree to support the process. Its basic concept is that, within a given cluster, the neighborhood of a specified radius around every point must contain a minimum number of points; that is, the density in the neighborhood of every point in a cluster has to exceed a threshold value. The shape of the neighborhood is determined by the choice of a distance function dist(p, q) between two points p and q.
Figure: p and q are density-connected, connected via o. Density-connectivity is depicted in the figure.
The DBSCAN algorithm identifies clusters of data objects based on the density-reachability and density-connectivity of the core and border points present in a cluster. Its primary operation can be stated as follows: given the parameters Eps and MinPts, a cluster is identified by a two-step method: 1) select an arbitrary data point from the database that satisfies the core-point condition as a seed; 2) fetch all data points that are density-reachable from the seed, forming a cluster that includes the seed.
Ideally, the algorithm would require the user to know, for each cluster, suitable values of Eps and MinPts and at least one point from that cluster. Since this is not feasible for every cluster, the algorithm uses global values for these two parameters. DBSCAN begins the clustering with an arbitrary data point p and retrieves the data points that are density-reachable from p with respect to Eps and MinPts. This approach leads to the following inferences: 1) if p is a core point, this method yields a cluster with respect to Eps and MinPts; 2) if p is a border point, no points are density-reachable from p and the algorithm proceeds to the next data point in the database.
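The two-step seed-and-expand procedure can be sketched as follows (an O(n²) illustration with our own names, not the R*-tree-backed original; labels are cluster ids, with −1 for noise):

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = {i: None for i in range(len(points))}

    def neighbors(i):
        # Eps-neighborhood of point i, including i itself (O(n) scan per call)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seed = neighbors(i)
        if len(seed) < min_pts:          # not a core point (may become border later)
            labels[i] = -1
            continue
        labels[i] = cid                  # step 1: a core point seeds a new cluster
        queue = [j for j in seed if j != i]
        while queue:                     # step 2: expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid          # previously noise: now a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbors(j)
            if len(nb) >= min_pts:       # j is also core: keep expanding through it
                queue.extend(nb)
        cid += 1
    return [labels[i] for i in range(len(points))]
```

Border points are labeled but never expanded, which is exactly inference 2) above.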
4.2 OPTICS:
OPTICS is a density-based clustering algorithm that identifies the implicit clustering in a given dataset. Unlike DBSCAN, which depends on a single global parameter setting for cluster identification, OPTICS effectively covers multiple parameter settings; in that sense, OPTICS is an extension of the DBSCAN algorithm. DBSCAN requires two parameters: ε, the radius of the neighborhood around a given data object, and MinPts, the threshold on the number of data objects in that neighborhood. OPTICS is built on the concept of density-based cluster ordering, which works on the principle that a sparsely populated cluster for a higher value of ε contains densely populated clusters for lower values of ε; multiple distance parameters ε are thereby handled when processing the data objects. OPTICS ensures good-quality clustering by maintaining the order in which the data objects are processed, i.e., high-density clusters are given priority over lower-density clusters. The cluster information maintained in memory consists of two values for every processed object: the core-distance and the reachability-distance.
The figure illustrates the concepts of core-distance and reachability-distance: the reachability-distances of data objects p1 and p2 are r(p1) and r(p2), respectively, evaluated with respect to the Eps-neighborhood. The core-distance of a data object p is the smallest distance Eps′ between p and an object in its Eps-neighborhood such that p is a core object with respect to Eps′, provided this neighbor is contained in NEps(p). The reachability-distance of a data object p with respect to another data object o is the smallest distance such that p is directly density-reachable from o if o is a core object. Thus OPTICS produces an ordering of the given database and, along with the ordering, stores the core-distance and reachability-distance of each data object, resulting in better-quality clusters. The OPTICS algorithm thereby provides an efficient cluster ordering of the data objects together with their reachability and core-distance values. OPTICS also implements pixel-oriented visualization techniques for large multidimensional
data sets. OPTICS utilizes automatic techniques to identify the start and end of cluster structures, and later groups them together to determine a set of nested clusters.
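The two per-object values can be computed directly from their definitions (a sketch with our own names; we count a point inside its own Eps-neighborhood, which is one common convention):

```python
import math

def core_distance(points, i, eps, min_pts, dist):
    """Distance from point i to its MinPts-th nearest neighbor within Eps, else None."""
    d = sorted(dist(points[i], p) for p in points)   # includes the point itself at 0.0
    return d[min_pts - 1] if len(d) >= min_pts and d[min_pts - 1] <= eps else None

def reachability_distance(points, o, p, eps, min_pts, dist):
    """Reachability of p w.r.t. o: max(core-distance(o), d(o, p)); None if o is not core."""
    cd = core_distance(points, o, eps, min_pts, dist)
    return None if cd is None else max(cd, dist(points[o], points[p]))
```

The full OPTICS algorithm repeatedly extracts the object with the smallest reachability-distance from a priority queue, which produces the cluster ordering described above.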
4.3 DENCLUE (DENsity-based CLUstEring):
DENCLUE is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence functions of all data points; and (3) clusters can then be determined mathematically by identifying density attractors, the local maxima of the overall density function.
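Ideas (1) and (2) can be stated in a few lines with a Gaussian influence function (the choice of kernel and of σ is illustrative, not prescribed by DENCLUE):

```python
import math

def influence(x, y, sigma=1.0):
    """Gaussian influence of data point y felt at location x (idea 1)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: sum of the influence of every data point (idea 2)."""
    return sum(influence(x, y, sigma) for y in data)
```

Density attractors (idea 3) are the local maxima of `density`, typically found by hill climbing along its gradient from each data point.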
V. Grid-Based Methods:
Grid-based clustering algorithms first quantize the clustering space into a finite number of cells and then perform the required operations on the quantized space. Cells that contain more than a certain number of points are treated as dense, and the dense cells are connected to form the clusters. Some of the grid-based clustering algorithms are: the STatistical INformation Grid-based method (STING), WaveCluster, and CLustering In QUEst (CLIQUE). STING first divides the spatial area into several levels of rectangular cells in order to form a hierarchical structure.
5.1 CLIQUE (CLustering In QUEst)
Empirical evaluation shows that CLIQUE scales linearly with the number of instances and has good scalability as the number of attributes is increased. Among the other grid-based methods, WaveCluster does not require users to give the number of clusters. WaveCluster uses a wavelet transformation to transform the original feature space. In a wavelet transform, convolution with an appropriate kernel function results in a transformed space where the natural clusters in the data become distinguishable. It is a very powerful process; however, it is not efficient in high-dimensional spaces.
CLIQUE makes use of concepts from both density-based and grid-based methods. In the first step, CLIQUE partitions the n-dimensional data space S into non-overlapping rectangular units (grids). The units are obtained by partitioning every dimension into ξ intervals of equal length, where ξ is an input parameter. The selectivity of a unit is defined as the total number of data points contained in it. A unit u is dense if selectivity(u) is greater than γ, where the density threshold γ is another input parameter. A unit in a subspace is the intersection of an interval from each of the k attributes, and a cluster is a maximal set of connected dense units: two k-dimensional units u1, u2 are connected if they have a common face. The dense units are then connected to form clusters. CLIQUE uses an Apriori-style algorithm to find the dense units, exploiting the fact that if a k-dimensional unit (a1, b1) × (a2, b2) × ... × (ak, bk) is dense, then any of its (k−1)-dimensional projections is also dense, where (ai, bi) is the interval of the unit in the i-th dimension.
Given a set of data points and the input parameters ξ and γ, CLIQUE finds clusters in all subspaces of the original data space and presents a minimal description of each cluster in the form of a DNF expression. The steps involved in CLIQUE are: i) identification of the subspaces (dense units) that contain clusters; ii) merging of dense units to form clusters; and iii) generation of minimal descriptions for the clusters.
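Step i), identifying the dense grid units, can be sketched as follows (assuming data scaled to [0, 1] in every dimension; ξ and the density threshold γ are the two input parameters, named `xi` and `tau` here):

```python
from collections import Counter

def dense_units(points, xi, tau, lo=0.0, hi=1.0):
    """Map each point to its grid unit (xi intervals per dimension); keep dense units."""
    width = (hi - lo) / xi

    def unit(p):
        # clamp so that points exactly at hi fall into the last interval
        return tuple(min(int((c - lo) / width), xi - 1) for c in p)

    counts = Counter(unit(p) for p in points)
    return {u for u, n in counts.items() if n > tau}   # dense iff selectivity(u) > tau
```

Steps ii) and iii) would then connect units sharing a common face and summarize each connected component as a DNF expression over the intervals.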
5.2 STING (A STatistical INformation Grid approach to spatial data mining):
Spatial data mining is the extraction of implicit knowledge, spatial relations, and interesting characteristics and patterns that are not explicitly represented in spatial databases. STING [9] is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells (using latitude and longitude) organized in a hierarchical structure: several levels of such rectangular cells represent different levels of resolution. Each cell is partitioned into child cells at the next lower level, so a cell at level i corresponds to the union of its children at level i + 1. Each cell (except the leaves) has 4 children, each corresponding to one quadrant of the parent cell. Statistical information about the attributes in each grid cell (such as the mean, standard deviation, and maximum and minimum values) is precomputed and stored, and the statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. For each cell there are attribute-independent and attribute-dependent parameters:
i. Attribute-independent parameter: count.
ii. Attribute-dependent parameters:
M: mean of all values in the cell;
S: standard deviation of all values in the cell;
Min: minimum value of the attribute in the cell;
Max: maximum value of the attribute in the cell;
Distribution: the type of distribution the attribute values follow (normal, uniform, exponential, or none); the distribution may either be assigned by the user or obtained by hypothesis tests such as the χ² test.
When data are loaded into the database, the parameters count, M, S, Min, and Max of the bottom-level cells are calculated directly. To answer a query, a layer is first determined from which the query processing is to start. This
layer may consist of a small number of cells. For each cell in this layer, the relevancy of the cell is checked by computing a confidence interval. Irrelevant cells are removed, and this process is repeated on the children of the relevant cells until the bottom layer is reached.
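The bottom-up computation of a parent cell's statistics from its children uses only the stored parameters, never the raw data. A sketch (cells are illustrative (count, mean, std, min, max) tuples of ours; the pooled-variance identity E[X²] = σ² + μ² does the work):

```python
import math

def merge_cells(children):
    """Combine (count, mean, std, min, max) of child cells into the parent cell."""
    n = sum(c[0] for c in children)
    mean = sum(c[0] * c[1] for c in children) / n
    # recover E[X^2] per child from its mean and (population) std, then pool
    ex2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return (n, mean, std, min(c[3] for c in children), max(c[4] for c in children))
```

This is why STING can answer region queries top-down without rescanning the data: every level already holds exact aggregate statistics.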
5.3 MAFIA (Merging of Adaptive Intervals Approach to Spatial Data Mining):
MAFIA proposes adaptive grids for fast subspace clustering and introduces a scalable parallel framework on a shared-nothing architecture to handle massive data sets [4]. Whereas most grid-based algorithms use uniform grids, MAFIA uses adaptive grids. MAFIA [10] proposes a technique for the adaptive computation of the finite intervals (bins) in each dimension, which are merged to explore clusters in higher dimensions. The adaptive grid size reduces the computation and improves the clustering quality by concentrating on the portions of the data space that have more points and thus a higher likelihood of containing clusters. Performance results show that MAFIA is 40 to 50 times faster than CLIQUE, owing to the use of adaptive grids. MAFIA also introduces parallelism to obtain a highly scalable clustering algorithm for large data sets. It proposes an adaptive interval size that partitions each dimension depending on the distribution of data in that dimension: using a histogram constructed in one initial pass over the data, MAFIA determines the minimum number of bins for a dimension, and contiguous bins with similar histogram values are combined to form larger bins. Bins and cells that have a low density of data are pruned, limiting the eligible candidate dense units and thereby reducing the computation. Since the bin boundaries are not rigid, the cluster boundaries are delineated more accurately in each dimension, which improves the quality of the clustering results.
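The merging of contiguous histogram bins with similar values, the step that makes the grid adaptive, can be sketched as follows (the relative-tolerance rule and names are ours, not MAFIA's published criterion):

```python
def adaptive_bins(hist, tol=0.2):
    """Merge contiguous histogram bins whose counts differ by at most tol (relative)."""
    merged = [[hist[0]]]
    for h in hist[1:]:
        prev = merged[-1]
        avg_prev = sum(prev) / len(prev)
        if abs(h - avg_prev) <= tol * max(avg_prev, 1):
            prev.append(h)        # similar density: grow the current adaptive bin
        else:
            merged.append([h])    # density changed: start a new adaptive bin
    return [sum(b) for b in merged]
```

Dimensions whose histogram collapses to one nearly uniform bin carry no cluster structure and can be pruned early, which is where the 40–50x speed-up over a fixed grid comes from.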
5.4 WaveCluster:
WaveCluster is a multi-resolution clustering algorithm used to find clusters in very large spatial databases. Given a set of spatial objects Oi, 1 ≤ i ≤ N, the goal of the algorithm is to detect clusters. It first summarizes the data by imposing a multi-dimensional grid structure on the data space. The main idea is to transform the original feature space by applying a wavelet transform and then find the dense regions in the new space. A wavelet transform is a signal-processing technique that decomposes a signal into different frequency sub-bands. The first step of the WaveCluster algorithm is to quantize the feature space. In the second step, a discrete wavelet transform is applied to the quantized feature space, generating new units. WaveCluster then finds the connected components among these units and treats them as clusters; corresponding to each resolution r of the wavelet transform, there is a corresponding set of clusters Cr. In the final step, WaveCluster labels the units in the feature space that are included in each cluster.
5.5 O-Cluster (Orthogonal partitioning CLUSTERing):
This clustering method combines a novel active-sampling partitioning technique with an axis-parallel strategy to identify continuous areas of high density in the input space. O-Cluster builds upon the contracting-projection concept introduced by OptiGrid and makes two major contributions: 1) it uses a statistical test to validate the quality of a cutting plane; 2) it can operate on a small buffer containing a random sample of the original data set. Partitions that have no ambiguities are frozen, and the data points associated with them are removed from the active buffer. O-Cluster operates iteratively: it evaluates the possible splitting points for all projections in a partition, selects the best one, and splits the data into new partitions. The algorithm continues by searching for good cutting planes inside the newly created partitions, creating a hierarchical tree structure that translates the input space into rectangular regions. The stages of the process are: (1) load the data buffer; (2) compute histograms for the active partitions; (3) find the "best" splitting points for the active partitions; (4) flag ambiguous and "frozen" partitions; (5) split the active partitions; (6) reload the buffer. O-Cluster functions best on large data sets with many records and high dimensionality.
5.6 Adaptive Mesh Refinement (AMR):
Adaptive Mesh Refinement is a type of multi-resolution algorithm that achieves high resolution in localized regions. Instead of using a single-resolution mesh grid, the AMR clustering algorithm adaptively creates grids of different resolution based on the regional density. The algorithm considers each leaf as the center of an individual cluster and recursively assigns membership to the data objects located in the parent nodes until the root node is reached, so AMR clustering can detect nested clusters at different levels of resolution. AMR is a technique that starts with a coarse uniform grid covering the entire computational volume and automatically refines certain regions by adding finer subgrids. New child grids are created from the connected parent-grid cells whose attributes (density, for instance) exceed given thresholds. Refinement is performed on each grid separately and recursively until all regions are captured with the desired accuracy.
10. Literature Survey On Clustering Techniques
www.iosrjournals.org 10 | Page
Figure-1 shows an example of an AMR tree in which each tree node uses a different-resolution mesh. The
root grid, with the coarsest granularity, covers the entire domain and contains two sub-grids, grid 1 and
grid 2. Grid 2 at level 1 also contains two sub-grids discovered using a finer mesh. The deeper a node sits in
the tree, the finer the mesh it uses.
The AMR clustering procedure connects the grid-based and density-based approaches through AMR
techniques and hence preserves the advantages of both.
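The refinement idea can be sketched as follows; this is a minimal illustration that assumes a fixed number of cells per axis and a simple point-count density threshold, with all names and parameters chosen for the example.

```python
import numpy as np

def amr_refine(points, lo, hi, threshold, depth=0, max_depth=3, cells=4):
    """Build one AMR tree node: bin `points` on a coarse `cells`-per-axis grid
    over [lo, hi] and recursively add a finer sub-grid for every cell whose
    point count exceeds `threshold`."""
    node = {"bounds": (lo, hi), "depth": depth,
            "n_points": len(points), "children": []}
    if depth == max_depth or len(points) <= threshold:
        return node
    dims = len(lo)
    edges = [np.linspace(lo[d], hi[d], cells + 1) for d in range(dims)]
    # cell index of every point along every axis
    idx = np.stack([np.clip(np.searchsorted(edges[d], points[:, d],
                                            side="right") - 1, 0, cells - 1)
                    for d in range(dims)], axis=1)
    for cell in sorted(set(map(tuple, idx))):
        mask = np.all(idx == np.array(cell), axis=1)
        if mask.sum() > threshold:  # dense cell -> refine with a child grid
            c_lo = np.array([edges[d][cell[d]] for d in range(dims)])
            c_hi = np.array([edges[d][cell[d] + 1] for d in range(dims)])
            node["children"].append(amr_refine(points[mask], c_lo, c_hi,
                                               threshold, depth + 1,
                                               max_depth, cells))
    return node
```

Each subtree corresponds to a nested region of higher density, mirroring how the AMR tree in Figure-1 detects nested clusters at successive resolutions.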
VI. Model-Based Clustering Methods:
AutoClass uses the Bayesian approach: starting from a random initialization of the parameters, it
incrementally adjusts them in an attempt to find their maximum-likelihood estimates. Moreover, it is assumed
that, in addition to the predictive attributes, there is a hidden variable. This unobserved variable reflects the
cluster membership of every case in the data set. The data-clustering problem is thus an example of unsupervised
learning from incomplete data, due to the existence of such a hidden variable.
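The hidden-membership idea can be illustrated with a minimal EM loop for a one-dimensional two-component Gaussian mixture. This is not AutoClass itself (which is fully Bayesian and handles mixed attribute types); the quantile-based initialization here replaces random initialization purely for reproducibility.

```python
import numpy as np

def em_gaussian_mixture(x, k=2, iters=50):
    """Minimal EM for a 1-D Gaussian mixture. The hidden variable is the
    per-point cluster membership; the E-step estimates its posterior
    ('responsibilities'), the M-step re-fits the parameters."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic init
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior probability of the hidden membership variable
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood parameter updates
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi, resp
```

The responsibilities `resp` are exactly the estimated values of the unobserved membership variable: clustering emerges from learning with that variable missing.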
Their approach to learning has been called RBMNs. Another model-based method is the SOM net.
The SOM net can be thought of as a two-layer neural network. Each neuron is represented by an n-dimensional
weight vector, m = (m1, ..., mn), where n is equal to the dimension of the input vectors. The neurons of the SOM
are themselves cluster centres, but to aid interpretation the map units can be combined to form bigger
clusters. The SOM is trained iteratively.
In each training step, one sample vector x is chosen randomly from the input data set. The distance
between it and all the weight vectors of the SOM is calculated using a distance measure. After the Best-Matching
Unit (BMU) is found, the weight vectors of the SOM are updated so that the BMU moves closer to
the input vector in the input space. The topological neighbours of the BMU are treated in a similar way.
An important property of the SOM is that it is very robust.
Outliers can be easily detected from the map, since their distance in the input space from other units is
large. The SOM can also deal with missing data values. Many applications require the clustering of large amounts
of high-dimensional data. However, most automated clustering techniques do not work effectively and/or
efficiently on high-dimensional data.
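The SOM training step described above can be sketched as follows, assuming a rectangular map, Euclidean distance, and an exponentially decaying learning rate and Gaussian neighbourhood; all parameter names are illustrative.

```python
import numpy as np

def train_som(data, grid=(5, 5), iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Iterative SOM training: pick a random sample, find the Best-Matching
    Unit (BMU), and pull the BMU and its topological neighbours toward it."""
    rng = np.random.default_rng(seed)
    h, w = grid
    n = data.shape[1]
    # initialise weight vectors uniformly inside the data bounding box
    weights = rng.random((h, w, n)) * (data.max(0) - data.min(0)) + data.min(0)
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # BMU: unit whose weight vector is closest to x in input space
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # learning rate and neighbourhood radius decay over time
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
        influence = np.exp(-grid_dist2 / (2 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights
```

After training, each unit's weight vector acts as a cluster centre, and the per-point distance to the nearest unit (the quantization error) measures how well the map covers the data.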
VII. Ensembles of Clustering Algorithms:
The theoretical foundation of combining multiple clustering algorithms is still in its early stages.
Combining multiple clusterings is a more challenging problem than combining multiple classifiers. One
obstacle that has impeded the study of clustering combination is that different clustering algorithms produce
different results, and because their cluster labels carry no common meaning, the results cannot simply be
combined with integration rules such as sum, product, or median. Cluster ensembles can be
formed in different ways: (1) using a number of different clustering techniques; (2) using a single
technique many times with different initial conditions; (3) using different partial subsets of features or
patterns. In one approach, a split-and-merge strategy is followed.
The first step is to decompose complex data into small compact clusters; the k-means algorithm
serves this purpose. The data partitions present in these clusterings are mapped into a new similarity matrix between
patterns based on a voting mechanism. This matrix, which is independent of data sparseness, is used to extract the
natural clusters using the single-link algorithm. Recently, the idea of combining multiple different clustering
algorithms over a set of data patterns based on a Weighted Shared nearest neighbours Graph (WSnnG) has been
introduced.
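The voting mechanism can be sketched as follows. This is a simplified evidence-accumulation illustration in which repeated k-means runs fill a co-association matrix, and connected components over a thresholded linkage graph stand in for the single-link extraction; the parameter values are illustrative.

```python
import numpy as np

def kmeans(x, k, iters=20, rng=None):
    """Plain k-means, used here only as the 'split' stage."""
    rng = rng or np.random.default_rng()
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(x[:, None] - centers, axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def evidence_accumulation(x, k=10, runs=20, threshold=0.45, seed=0):
    """Voting mechanism: accumulate a co-association matrix over many k-means
    runs; pairs co-clustered in more than `threshold` of the runs are linked,
    and connected components give the final clusters."""
    rng = np.random.default_rng(seed)
    n = len(x)
    coassoc = np.zeros((n, n))
    for _ in range(runs):                   # split: small compact clusters
        labels = kmeans(x, k, rng=rng)
        coassoc += labels[:, None] == labels[None, :]
    linked = coassoc / runs > threshold     # merge: thresholded voting matrix
    final = -np.ones(n, dtype=int)
    cid = 0
    for i in range(n):                      # connected components (DFS)
        if final[i] == -1:
            stack = [i]
            while stack:
                j = stack.pop()
                if final[j] == -1:
                    final[j] = cid
                    stack.extend(np.nonzero(linked[j])[0].tolist())
            cid += 1
    return final
```

The co-association matrix depends only on how often pairs of patterns are clustered together, not on the raw feature values, which is why this kind of voting matrix is insensitive to data sparseness.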
Comparative statement of various clustering techniques (for each technique: data type, cluster shape, complexity, data set, measure, advantages, disadvantages):

Hierarchical methods

CURE -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(N^2); Data set: Large; Measure: Similarity measure.
Advantages: Attempts to address the scalability problem and to improve the quality of the clustering results.
Disadvantages: These methods do not scale well with the number of data objects.

BIRCH -- Data type: Numerical; Cluster shape: Spherical; Complexity: O(N) (time); Data set: Large; Measure: Feature tree.
Advantages: 1. Suited to large data sets held in main memory. 2. Minimizes the number of scans. 3. I/O costs are very low.
Disadvantages: Identifies only convex or spherical clusters of uniform size.

CHAMELEON -- Data type: Discrete; Cluster shape: Arbitrary; Complexity: O(N^2); Data set: Large; Measure: Similarity measure.
Advantages: 1. Its high-quality hierarchical clustering comes at the cost of lower efficiency. 2. CHAMELEON strongly relies on the graph partitioning implemented in the hMETIS library.
Disadvantages: Has an issue scaling to large data sets that cannot fit in main memory.

ROCK -- Data type: Mixed; Cluster shape: Graph; Complexity: O(KN^2); Data set: Small; Measure: Similarity measure.
Advantages: 1. Employs links rather than distances when merging points. 2. Introduces a global property and thus provides better quality.
Disadvantages: Can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters.

Partitioning methods

K-means -- Data type: Numerical; Cluster shape: Spherical; Complexity: O(NKd) (time), O(N+K) (space); Data set: Large; Measure: Mean.
Advantages: Relatively scalable and efficient in processing large data sets.
Disadvantages: Handles poorly: 1. clusters of different sizes; 2. clusters of different densities; 3. non-globular clusters; 4. a wrong number of clusters; 5. outliers and empty clusters.

K-medoids -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(TKN); Data set: Large; Measure: Medoid.
Advantages: 1. More robust than k-means in the presence of noise and outliers. 2. Seems to perform better for large data sets.
Disadvantages: 1. More costly than the k-means method. 2. Like k-means, requires the user to specify k. 3. Does not scale well for very large data sets.

CLARA -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(K(40+K)^2 + K(N-K)); Data set: Sample; Measure: Medoid.
Advantages: 1. Deals with larger data sets than PAM. 2. More efficient and scalable than PAM.
Disadvantages: 1. The best k medoids may not be selected during the sampling process; in that case CLARA will never find the best clustering. 2. If the sampling is biased, a good clustering cannot be obtained. 3. Trades clustering quality for efficiency.

CLARANS -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: Quadratic in total performance; Data set: Sample; Measure: Medoid.
Advantages: 1. Experiments show that CLARANS is more effective than both PAM and CLARA. 2. Handles outliers.
Disadvantages: 1. The computational complexity of CLARANS is O(N^2), where N is the number of objects. 2. The clustering quality depends on the sampling method.

Density-based methods

DBSCAN -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(N log N) (time); Data set: High-dimensional; Measure: Density based.
Advantages: Performs efficiently for low-dimensional data.
Disadvantages: 1. Highly sensitive to the user parameters MinPts and Eps. 2. The data set cannot be sampled, as sampling would affect the density measures. 3. The algorithm is not partitionable for multiprocessing systems.

OPTICS -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(N log N); Data set: Low-dimensional; Measure: Density based.
Advantages: 1. Discovers clusters of irregular shape with an uncertain amount of noise. 2. Discovers high-density data embedded in low-density groups. 3. The final clustering structure is insensitive to the parameters.
Disadvantages: 1. Expects some kind of density drop to detect cluster borders. 2. Less sensitive to outliers.

DENCLUE -- Data type: Numerical; Cluster shape: Arbitrary; Complexity: O(N^2); Data set: Low-dimensional; Measure: Density based.
Advantages: 1. Has a solid mathematical foundation and generalizes various clustering methods. 2. Has good clustering properties for data sets with large amounts of noise.
Disadvantages: Less sensitive to outliers.

Grid-based methods

CLIQUE -- Data type: Mixed; Cluster shape: Arbitrary; Complexity: O(C^k + mk), where k is the highest dimensionality, m the number of input points, and C the number of clusters; Data set: High-dimensional; Measure: Cosine similarity, Jaccard index.
Advantages: 1. Automatically finds the subspaces of highest dimensionality in which high-density clusters exist. 2. Quite efficient. 3. Insensitive to the order of the input records and does not presume any canonical data distribution.
Disadvantages: The accuracy of the clustering result may be degraded for the sake of the method's simplicity.

STING -- Data type: Numerical; Cluster shape: Rectangular; Complexity: O(K), where K is the number of grid cells; Data set: Any size; Measure: Statistical.
Advantages: 1. The grid-based computation is query-independent. 2. The grid structure facilitates parallel processing and incremental updating.
Disadvantages: All cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.