This document provides an overview of supervised and unsupervised learning, with a focus on clustering as an unsupervised learning technique. It describes the basic concepts of clustering, including how clustering groups similar data points together without labeled categories. It then covers two main clustering algorithms - k-means, a partitional clustering method, and hierarchical clustering. It discusses aspects like cluster representation, distance functions, strengths and weaknesses of different approaches. The document aims to introduce clustering and compare it with supervised learning.
This document provides an overview of unsupervised learning and clustering algorithms. It begins with definitions of supervised vs. unsupervised learning. Clustering is introduced as an unsupervised learning technique that groups similar data points together without predefined labels. The k-means algorithm is described in detail, including its steps, strengths, and weaknesses. Hierarchical clustering is also covered at a high level. Various distance functions and approaches to representing clusters are discussed.
This document provides an overview of unsupervised learning and clustering algorithms. It introduces key concepts in unsupervised learning and how it differs from supervised learning. Clustering is defined as grouping similar data points together into clusters with the goal of maximizing inter-cluster distance and minimizing intra-cluster distance. The k-means algorithm and hierarchical clustering are described as two common clustering techniques. K-means partitions data into k clusters by iteratively assigning points to centroids while hierarchical clustering builds a nested sequence of clusters in a tree structure.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses key concepts in unsupervised learning and clustering such as k-means clustering and hierarchical clustering. K-means clustering partitions data into k clusters by minimizing distances between data points and cluster centroids. Hierarchical clustering builds nested clusters by successively merging or dividing clusters based on distance measures. Representing clusters and evaluating clustering results are also covered.
This document provides an overview of unsupervised learning and clustering algorithms. It discusses the motivation for clustering as grouping similar data points without labels. It introduces common clustering algorithms like K-means, hierarchical clustering, and fuzzy C-means. It covers clustering criteria such as similarity functions, stopping criteria, and cluster quality. It also discusses techniques like data normalization and challenges in evaluating clusters without ground truths. The document aims to explain the concepts and applications of unsupervised learning for clustering unlabeled data.
Clustering is an unsupervised machine learning technique that groups similar objects together. There are several clustering techniques including partitioning methods like k-means, hierarchical methods like agglomerative and divisive clustering, density-based methods, graph-based methods, and model-based clustering. Clustering is used in applications such as pattern recognition, image processing, bioinformatics, and more to split data into meaningful subgroups.
Cluster analysis is an unsupervised learning technique used to group unlabeled data points so that objects in the same cluster are more similar to each other than objects in different clusters. There are various clustering algorithms that differ in how they define clusters and find them efficiently. Popular cluster models include connectivity models based on distance, centroid models that represent each cluster by a single mean, and density models that define clusters as connected dense regions of data. Clustering is used in applications like market segmentation, social network analysis, and image segmentation.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
This document summarizes clustering analysis techniques described in Chapter 10 of the book "Data Mining: Concepts and Techniques". It introduces the basic concepts of cluster analysis including partitioning, hierarchical, density-based, and grid-based methods. It then describes the k-means and k-medoids partitioning algorithms in more detail, noting that k-means can be sensitive to outliers while k-medoids uses actual data points as cluster representatives.
Chapter 10. Cluster Analysis: Basic Concepts and Methods.ppt — Subrata Kumer Paul
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
This chapter discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. Partitioning methods like k-means and k-medoids aim to partition observations into k clusters by optimizing some objective function. Hierarchical clustering builds a hierarchy of clusters based on distance between observations. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into finite number of cells that form clusters.
This document summarizes chapter 10 of the book "Data Mining: Concepts and Techniques" which discusses cluster analysis. The chapter covers basic concepts of cluster analysis including partitioning, hierarchical, density-based and grid-based methods. It describes popular partitioning algorithms like k-means and k-medoids, and notes that k-means can be sensitive to outliers while k-medoids uses medoids, which are less sensitive to outliers. The chapter also discusses evaluating clustering quality and major considerations for cluster analysis.
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx — Anupama Kate
The document discusses K-means clustering, an unsupervised learning algorithm that partitions data into K clusters by minimizing variance. It works by iteratively assigning data points to the nearest centroid and updating centroid positions based on cluster means until convergence. The example shows initializing random centroids, assigning points to the closest centroid to form initial clusters, then recalculating centroids as the means of points in each cluster. This iterative process refines clusters with each update.
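The assign-then-update loop described above can be sketched in a few lines of Python (a minimal 1-D sketch with hypothetical sample data; real implementations such as scikit-learn's KMeans add smarter initialization and vectorization):

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# then recompute each centroid as the mean of its cluster, until stable.
def kmeans(points, centroids, max_iter=100):
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)),
                      key=lambda c: abs(p - centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its points
        new = [sum(p for p, l in zip(points, labels) if l == c) /
               max(1, sum(1 for l in labels if l == c))
               for c in range(len(centroids))]
        if new == centroids:  # converged: no centroid moved
            break
        centroids = new
    return centroids, labels

# Two well-separated groups around 1 and 10 (hypothetical data)
centers, labels = kmeans([1.0, 1.2, 0.8, 9.8, 10.0, 10.2], [0.0, 5.0])
print(centers, labels)
```

Each iteration can only reduce the within-cluster variance, which is why the loop terminates; the result still depends on the initial centroid positions.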
Comparison Between Clustering Algorithms for Microarray Data Analysis — IOSR Journals
Currently, two techniques are used for large-scale gene-expression profiling: microarray and RNA sequencing (RNA-Seq). This paper studies and compares different clustering algorithms used in microarray data analysis. A microarray is an array of DNA molecules that allows multiple hybridization experiments to be carried out simultaneously, tracing the expression levels of thousands of genes. It is a high-throughput technology for gene-expression analysis and has become an effective tool for biomedical research. Microarray analysis aims to interpret the data produced from experiments on DNA, RNA, and protein microarrays, which enable researchers to investigate the expression state of a large number of genes. Data clustering is the first and main step in microarray data analysis. The k-means, fuzzy c-means, self-organizing map, and hierarchical clustering algorithms are investigated in this paper and compared based on their clustering models.
The document describes different types of clustering algorithms, including partitioning, hierarchical, density-based, and grid-based methods. Partitioning methods like k-means and k-medoids aim to partition objects into k clusters by optimizing an objective function. Hierarchical clustering builds a hierarchy of clusters based on distance, either through an agglomerative (bottom-up) or divisive (top-down) approach. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into a finite number of cells that form a grid structure.
This document summarizes Chapter 10 of the book "Data Mining: Concepts and Techniques (3rd ed.)" which covers cluster analysis. The chapter introduces different types of clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods, density-based methods, and grid-based methods. It discusses how to evaluate the quality of clustering results and highlights considerations for cluster analysis such as similarity measures, clustering space, and challenges like scalability and high dimensionality.
A Comparative Study Of Various Clustering Algorithms In Data Mining — Natasha Grant
This document provides an overview and comparison of various clustering algorithms used in data mining. It discusses the key types of clustering algorithms: partition-based (such as k-means and k-medoids), hierarchical-based, density-based, and grid-based. For partition-based algorithms, it describes k-means and k-medoids in more detail. It also discusses hierarchical clustering approaches like agglomerative nesting. The document aims to provide insights into different clustering techniques for segmenting and grouping data in an unsupervised manner.
Unsupervised Learning Algorithms and Assumptions — refedey275
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
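The distance metrics in the list above differ in what they treat as "close". A small pure-Python sketch of three of them, applied to hypothetical vectors (Mahalanobis is omitted since it additionally requires a covariance matrix):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction, ignores magnitude
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

u, v = [0, 3], [4, 0]
print(euclidean(u, v))        # 5.0
print(manhattan(u, v))        # 7
print(cosine_distance(u, v))  # 1.0 (orthogonal vectors)
```

Which metric is appropriate depends on the data: Euclidean suits dense numeric features on comparable scales, Manhattan is more robust to single large coordinate differences, and cosine distance is common for high-dimensional sparse data such as text vectors.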
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
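The agglomerative (bottom-up) variant can be sketched directly: start with singleton clusters and repeatedly merge the closest pair. This is a minimal single-linkage sketch on hypothetical 1-D points; library implementations such as `scipy.cluster.hierarchy` also record each merge so the full dendrogram can be drawn:

```python
def agglomerative(points, k):
    # Start with every point in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage
        # distance (closest pair of points across the two clusters)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair and drop the absorbed cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 9, 10, 25], 2))  # [[1, 2, 9, 10], [25]]
```

Divisive clustering runs the same idea in reverse, starting from one all-inclusive cluster and splitting top-down; agglomerative is far more common in practice.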
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most significant continuous probability distribution.
It is sometimes also called a bell curve.
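The bell curve is described by the Gaussian probability density function, which peaks at the mean and falls off symmetrically. A short pure-Python sketch:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density: (1 / (sigma*sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2))
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density is highest at the mean and symmetric around it
print(normal_pdf(0.0))                      # ~0.3989 for the standard normal
print(normal_pdf(1.0) == normal_pdf(-1.0))  # True by symmetry
```

In a Gaussian mixture model, each cluster contributes one such density with its own mean and variance, and points are assigned soft memberships proportional to each component's density at that point.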
The document provides a literature review of different clustering techniques. It begins by defining clustering and its applications. It then categorizes and describes several clustering methods including hierarchical (BIRCH, CURE, CHAMELEON), partitioning (k-means, k-medoids), density-based (DBSCAN, OPTICS, DENCLUE), grid-based (CLIQUE, STING, MAFIA), and model-based (RBMN, SOM) methods. For each method, it discusses the algorithm, advantages, disadvantages and time complexity. The document aims to provide an overview of various clustering techniques for classification and comparison.
Cluster analysis is an unsupervised learning technique used to group unlabeled data points into meaningful clusters. There are several approaches to cluster analysis including partitioning methods like k-means, hierarchical clustering methods like agglomerative nesting (AGNES), and density-based methods like DBSCAN. The quality of clusters is evaluated based on intra-cluster similarity and inter-cluster dissimilarity. Cluster analysis has applications in fields like pattern recognition, image processing, and market segmentation.
An Efficient Clustering Method for Aggregation on Data Fragments — IJMER
Clustering is an important step in data analysis with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality clustering. Existing clustering-aggregation algorithms are applied directly to the data points and become inefficient when the number of points is large. This project defines an efficient approach to clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase efficiency, aggregation is performed directly on the fragments under a comparison measure and a normalized mutual information measure, and enhanced versions of the aggregation algorithms (Agglomerative, Furthest, and Local Search) are described that lower computational complexity while increasing accuracy.
Clustering is an unsupervised machine learning technique used to group unlabeled data points. There are two main approaches: hierarchical clustering and partitioning clustering. Partitioning clustering algorithms like k-means and k-medoids attempt to partition data into k clusters by optimizing a criterion function. Hierarchical clustering creates nested clusters by merging or splitting clusters. Examples of hierarchical algorithms include agglomerative clustering, which builds clusters from bottom-up, and divisive clustering, which separates clusters from top-down. Clustering can group both numerical and categorical data.
Big Data Clustering Algorithms and Strategies — Farzad Nozarian
The document discusses various algorithms for big data clustering. It begins by covering preprocessing techniques such as data reduction. It then covers hierarchical, prototype-based, density-based, grid-based, and scalability clustering algorithms. Specific algorithms discussed include K-means, K-medoids, PAM, CLARA/CLARANS, DBSCAN, OPTICS, MR-DBSCAN, DBCURE, and hierarchical algorithms like PINK and l-SL. The document emphasizes techniques for scaling these algorithms to large datasets, including partitioning, sampling, approximation strategies, and MapReduce implementations.
This presentation introduces clustering analysis and the k-means clustering technique. It defines clustering as an unsupervised method to segment data into groups with similar traits. The presentation outlines different clustering types (hard vs soft), techniques (partitioning, hierarchical, etc.), and describes the k-means algorithm in detail through multiple steps. It discusses requirements for clustering, provides examples of applications, and reviews advantages and disadvantages of k-means clustering.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP — RAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.

Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels.
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur naturally.
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
2. Road map
Basic concepts
K-means algorithm
Representation of clusters
Hierarchical clustering
Distance functions
Data standardization
Handling mixed attributes
Which clustering algorithm to use?
Cluster evaluation
Summary
2
3. Supervised learning vs. unsupervised
learning
Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target attribute
in future data instances.
Unsupervised learning: The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
3
4. Clustering
Clustering is a technique for finding similarity groups
in data, called clusters. That is,
it groups data instances that are similar to (near) each
other into one cluster, and data instances that are very
different (far away) from each other into different clusters.
Clustering is often called an unsupervised learning
task, as no class values denoting an a priori grouping
of the data instances are given, as they are in
supervised learning.
For historical reasons, clustering is often
considered synonymous with unsupervised learning.
In fact, association rule mining is also unsupervised.
This chapter focuses on clustering. 4
5. An illustration
The data set has three natural groups of data
points, i.e., 3 natural clusters.
5
6. What is clustering for?
Let us see some real-life examples
Example 1: group people of similar sizes together to make “small”,
“medium” and “large” T-shirts.
Tailor-made for each person: too expensive.
One-size-fits-all: does not fit all.
Example 2: In marketing, segment customers according to their
similarities
To do targeted marketing.
6
7. What is clustering for? (cont…)
Example 3: Given a collection of text documents, we want to organize
them according to their content similarities,
To produce a topic hierarchy
In fact, clustering is one of the most utilized data mining techniques.
It has a long history and is used in almost every field, e.g., medicine, psychology,
botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
In recent years, due to the rapid increase of online documents, text clustering
has become important.
7
8. Aspects of clustering
A clustering algorithm
Partitional clustering
Hierarchical clustering
…
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance
function, and the application.
8
10. K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be
{x_1, x_2, ..., x_n},
where x_i = (x_{i1}, x_{i2}, ..., x_{ir}) is a vector in a real-valued space X ⊆ R^r, and r is the
number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called centroid.
k is specified by the user
10
11. K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids, cluster
centers
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
11
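The four steps above can be sketched in Python. This is a minimal illustration, not the definitive implementation: the squared-Euclidean distance, the convergence test, and the iteration cap are choices made for the example.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Centroid: the component-wise mean of the points in the cluster."""
    r = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(r))

def kmeans(points, k, max_iters=50):
    centroids = random.sample(points, k)              # 1) random seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # 2) assign to closest centroid
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        new = [mean(c) if c else centroids[i]         # 3) re-compute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:                          # 4) stop when centroids are stable
            break
        centroids = new
    return centroids, clusters
```

With well-separated data this converges in a few iterations; a fuller implementation would also track the SSE as a stopping criterion.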
13. Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),
C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the
data points in C_j), and dist(x, m_j) is the distance between data point x and
centroid m_j.
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2        (1)
13
17. A disk version of k-means
K-means can be implemented with data on disk
In each iteration, it scans the data once.
as the centroids can be computed incrementally
It can be used to cluster large datasets that do not fit in main memory
We need to control the number of iterations
In practice, a limit is set (e.g., < 50).
Not the best method. There are other scale-up algorithms, e.g., BIRCH.
17
19. Strengths of k-means
Strengths:
Simple: easy to understand and to implement
Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
Since both k and t are usually small, k-means is considered a
linear algorithm.
K-means is the most popular clustering algorithm.
Note that: it terminates at a local optimum if SSE is
used. The global optimum is hard to find due to
complexity.
19
20. Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
For categorical data, the k-modes variant can be used, where the centroid is
represented by the most frequent values.
The user needs to specify k.
The algorithm is sensitive to outliers
Outliers are data points that are very far away from other data points.
Outliers could be errors in the data recording or some special data points with
very different values.
20
22. Weaknesses of k-means: To deal with
outliers
One method is to remove some data points in the
clustering process that are much further away from
the centroids than other data points.
To be safe, we may want to monitor these possible
outliers over a few iterations and then decide to remove
them.
Another method is to perform random sampling.
Since in sampling we only choose a small subset of
the data points, the chance of selecting an outlier
is very small.
Assign the rest of the data points to the clusters by
distance or similarity comparison, or classification
22
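The first method can be sketched as follows; the specific threshold rule (three times the cluster's average distance to the centroid) is an assumption made for illustration, not a rule from the slides.

```python
import math

def filter_outliers(cluster, centroid, factor=3.0):
    """Drop points whose distance to the centroid is more than
    `factor` times the cluster's average distance to it."""
    dists = [math.dist(p, centroid) for p in cluster]
    avg = sum(dists) / len(dists)
    return [p for p, d in zip(cluster, dists) if d <= factor * avg]
```

As the slide suggests, such candidates would be monitored over a few iterations before being removed for good.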
24. Weaknesses of k-means (cont …)
If we use different seeds: good results
24
There are some
methods to help
choose good
seeds
25. Weaknesses of k-means (cont …)
The k-means algorithm is not suitable for
discovering clusters that are not hyper-ellipsoids
(or hyper-spheres).
25
26. K-means summary
Despite its weaknesses, k-means is still the most popular algorithm due to its
simplicity and efficiency, and because
other clustering algorithms have their own lists of weaknesses.
There is no clear evidence that any other clustering algorithm performs better in
general,
although they may be more suitable for some specific types of data or
applications.
Comparing different clustering algorithms is a difficult task. No one knows
the correct clusters!
26
28. Common ways to represent clusters
Use the centroid of each cluster to represent the cluster.
compute the radius and
standard deviation of the cluster to determine its spread in each dimension
The centroid representation alone works well if the clusters are of the hyper-
spherical shape.
If clusters are elongated or are of other shapes, centroids are not sufficient
28
29. Using classification model
All the data points in a
cluster are regarded as
having the same class
label, e.g., the cluster
ID.
Then run a supervised
learning algorithm on the
data to find a
classification model.
29
30. Use frequent values to represent
cluster
This method is mainly for clustering of categorical data (e.g., k-modes
clustering).
Main method used in text clustering, where a small set of frequent words
in each cluster is selected to represent the cluster.
30
31. Clusters of arbitrary shapes
Hyper-elliptical and hyper-
spherical clusters are usually
easy to represent, using their
centroid together with spreads.
Irregular shape clusters are hard
to represent. They may not be
useful in some applications.
Using centroids is not suitable
(upper figure) in general
K-means clusters may be more
useful (lower figure), e.g., for
making two sizes of T-shirts.
31
34. Types of hierarchical clustering
Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
merges the most similar (or nearest) pair of clusters
stops when all the data points are merged into a single
cluster (i.e., the root cluster).
Divisive (top down) clustering: It starts with all
data points in one cluster, the root.
Splits the root into a set of child clusters. Each child
cluster is recursively divided further
stops when only singleton clusters of individual data
points remain, i.e., each cluster with only a single point
34
35. Agglomerative clustering
It is more popular than divisive methods.
At the beginning, each data point forms a cluster (also called a node).
Merge nodes/clusters that have the least distance.
Go on merging
Eventually all nodes belong to one cluster
35
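The merge loop above can be sketched in Python. Single-link distance is chosen here for concreteness, and a `target_k` stopping point stands in for building the full dendrogram; both are choices for the example.

```python
import math

def agglomerative_single_link(points, target_k=1):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest clusters (single-link distance) until only
    target_k clusters remain."""
    clusters = [[p] for p in points]          # each point forms a cluster (node)
    while len(clusters) > target_k:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair of nodes
    return clusters
```

Setting `target_k=1` runs the merging to a single root cluster, as in the slide.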
38. Measuring the distance of two clusters
A few ways to measure distances of two clusters.
Results in different variations of the algorithm.
Single link
Complete link
Average link
Centroids
…
38
39. Single link method
The distance between
two clusters is the
distance between two
closest data points in
the two clusters, one
data point from each
cluster.
It can find arbitrarily
shaped clusters, but
it may cause the
undesirable “chain
effect” through noisy points
39
Two natural clusters are
split into two
40. Complete link method
The distance between two clusters is the
distance of two furthest data points in the two
clusters.
It is sensitive to outliers because they are far
away
40
41. Average link and centroid methods
Average link: A compromise between
the sensitivity of complete-link clustering to outliers and
the tendency of single-link clustering to form long chains that do not
correspond to the intuitive notion of clusters as compact, spherical objects.
In this method, the distance between two clusters is the average distance of
all pair-wise distances between the data points in two clusters.
Centroid method: In this method, the distance between two clusters is the
distance between their centroids
41
42. The complexity
All the algorithms are at least O(n2). n is the number of data points.
Single link can be done in O(n2).
Complete and average links can be done in O(n2logn).
Due to the complexity, hierarchical methods are hard to use for large data sets. Remedies:
Sampling
Scale-up methods (e.g., BIRCH).
42
44. Distance functions
Distance functions are key to clustering; “similarity” and “dissimilarity” are also commonly used
terms.
There are numerous distance functions for
Different types of data
Numeric data
Nominal data
Different specific applications
44
45. Distance functions for numeric
attributes
Most commonly used functions are
Euclidean distance and
Manhattan (city block) distance
We denote distance with: dist(xi, xj), where xi and xj are data points
(vectors)
They are special cases of the Minkowski distance, where h is a positive integer.
dist(x_i, x_j) = (|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + ... + |x_{ir} - x_{jr}|^h)^{1/h}
45
46. Euclidean distance and Manhattan
distance
If h = 2, it is the Euclidean distance
If h = 1, it is the Manhattan distance
Weighted Euclidean distance
Euclidean distance: dist(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ... + (x_{ir} - x_{jr})^2}
Manhattan distance: dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ir} - x_{jr}|
Weighted Euclidean distance: dist(x_i, x_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + ... + w_r (x_{ir} - x_{jr})^2}
46
47. Squared distance and Chebychev
distance
Squared Euclidean distance: to place progressively greater weight on data
points that are further apart.
Chebychev distance: used when one wants to define two data points as "different" if
they differ on any one of the attributes.
Squared Euclidean distance: dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ... + (x_{ir} - x_{jr})^2
Chebychev distance: dist(x_i, x_j) = max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, ..., |x_{ir} - x_{jr}|)
47
48. Distance functions for binary and
nominal attributes
Binary attribute: has two values or states but no ordering relationships,
e.g.,
Gender: male and female.
We use a confusion matrix to introduce the distance functions/measures.
Let the ith and jth data points be xi and xj (vectors)
48
50. Symmetric binary attributes
A binary attribute is symmetric if both of its states (0 and 1) have equal
importance, and carry the same weights, e.g., male and female of the
attribute Gender
Distance function: Simple Matching Coefficient, proportion of mismatches
of their values
dist(x_i, x_j) = (b + c) / (a + b + c + d)
where a is the number of attributes with x_i = x_j = 1, b with x_i = 1 and x_j = 0, c with x_i = 0 and x_j = 1, and d with x_i = x_j = 0.
50
52. Asymmetric binary attributes
Asymmetric: if one of the states is more important or more valuable than
the other.
By convention, state 1 represents the more important state, which is typically
the rare or infrequent state.
Jaccard coefficient is a popular measure
We can have some variations, adding weights
dist(x_i, x_j) = (b + c) / (a + b + c)
52
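Both binary distance measures can be computed from the same confusion-matrix counts; a small sketch (a: both values 1, b: x_i = 1 and x_j = 0, c: x_i = 0 and x_j = 1, d: both 0):

```python
def binary_distances(x, y):
    """Simple matching and Jaccard distances for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    smc = (b + c) / (a + b + c + d)     # symmetric: 0-0 matches count
    jaccard = (b + c) / (a + b + c)     # asymmetric: 0-0 matches ignored
    return smc, jaccard
```

The only difference between the two is whether the 0-0 matches (d) appear in the denominator.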
53. Nominal attributes
Nominal attributes: with more than two states or values.
the commonly used distance measure is also based on the simple matching method.
Given two data points xi and xj, let the number of attributes be r, and the number
of values that match in xi and xj be q.
dist(x_i, x_j) = (r - q) / r
53
54. Distance function for text documents
A text document consists of a sequence of
sentences and each sentence consists of a
sequence of words.
To simplify: a document is usually considered a
“bag” of words in document clustering.
Sequence and position of words are ignored.
A document is represented with a vector just like a
normal data point.
It is common to use similarity to compare two
documents rather than distance.
The most commonly used similarity function is the cosine
similarity. We will study this later. 54
56. Data standardization
In the Euclidean space, standardization of attributes is
recommended so that all attributes can have equal
impact on the computation of distances.
Consider the following pair of data points
xi: (0.1, 20) and xj: (0.9, 720).
The distance is almost completely dominated by (720-
20) = 700.
Standardize attributes: to force the attributes to have
a common value range
dist(x_i, x_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457
56
57. Interval-scaled attributes
Their values are real numbers following a linear scale.
The difference in Age between 10 and 20 is the same as that between 40 and
50.
The key idea is that intervals keep the same importance throughout the scale
Two main approaches to standardize interval scaled attributes, range and
z-score. f is an attribute
range(x_{if}) = (x_{if} - min(f)) / (max(f) - min(f))
57
58. Interval-scaled attributes (cont …)
Z-score: transforms the attribute values so that
they have a mean of zero and a mean absolute
deviation of 1. The mean absolute deviation of
attribute f, denoted by sf, is computed as follows
s_f = (1/n)(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|),
where m_f = (1/n)(x_{1f} + x_{2f} + ... + x_{nf}).
Z-score: z(x_{if}) = (x_{if} - m_f) / s_f
58
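Both standardizations can be sketched in a few lines of Python; note that this z-score divides by the mean absolute deviation, as the slide defines it, not by the standard deviation:

```python
def range_scale(values):
    """Range standardization: map the values of an attribute into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Z-score using the mean m_f and mean absolute deviation s_f."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(v - m) for v in values) / n
    return [(v - m) / s for v in values]
```

Applied to the earlier example attribute with values 20 and 720, range scaling maps them to 0 and 1, so the attribute no longer dominates the distance.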
59. Ratio-scaled attributes
Numeric attributes, but unlike interval-scaled attributes, their scales are
exponential,
For example, the total amount of microorganisms that evolve in a time t
is approximately given by
AeBt,
where A and B are some positive constants.
Do the log transform: x_{if} → log(x_{if})
Then treat it as an interval-scaled attribute.
59
60. Nominal attributes
Sometime, we need to transform nominal attributes to numeric
attributes.
Transform nominal attributes to binary attributes.
The number of values of a nominal attribute is v.
Create v binary attributes to represent them.
If a data instance for the nominal attribute takes a particular value, the value
of its binary attribute is set to 1, otherwise it is set to 0.
The resulting binary attributes can be used as numeric attributes, with
two values, 0 and 1.
60
61. Nominal attributes: an example
Nominal attribute fruit: has three values,
Apple, Orange, and Pear
We create three binary attributes called, Apple, Orange, and Pear in the
new data.
If a particular data instance in the original data has Apple as the value for
fruit,
then in the transformed data, we set the value of the attribute Apple to 1, and
the values of attributes Orange and Pear to 0
61
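The fruit example maps directly to code; a minimal sketch (the default domain tuple is just the slide's example):

```python
def one_hot(value, domain=("Apple", "Orange", "Pear")):
    """Create one binary attribute per nominal value; the attribute
    matching the value taken is set to 1, all others to 0."""
    return [1 if value == d else 0 for d in domain]
```

So an instance with fruit = Apple becomes (1, 0, 0) on the three new attributes.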
62. Ordinal attributes
Ordinal attribute: an ordinal attribute is like a nominal attribute, but its
values have a numerical ordering. E.g.,
Age attribute with values: Young, MiddleAge and Old. They are ordered.
Common approach to standardization: treat it as an interval-scaled attribute.
62
64. Mixed attributes
Our distance functions given are for data with all numeric attributes, or
all nominal attributes, etc.
Practical data has different types:
Any subset of the 6 types of attributes,
interval-scaled,
symmetric binary,
asymmetric binary,
ratio-scaled,
ordinal and
nominal
64
65. Convert to a single type
One common way of dealing with mixed attributes is to
Decide the dominant attribute type, and
Convert the other types to this type.
E.g., if most attributes in a data set are interval-scaled,
we convert ordinal attributes and ratio-scaled attributes to interval-scaled
attributes.
It is also appropriate to treat symmetric binary attributes as interval-scaled
attributes.
65
66. Convert to a single type (cont …)
It does not make much sense to convert a nominal attribute or an
asymmetric binary attribute to an interval-scaled attribute,
but it is still frequently done in practice by assigning some numbers to them
according to some hidden ordering, e.g., prices of the fruits
Alternatively, a nominal attribute can be converted to a set of
(symmetric) binary attributes, which are then treated as numeric
attributes.
66
67. Combining individual distances
This approach computes
individual attribute
distances and then
combine them.
dist(x_i, x_j) = (\sum_{f=1}^{r} \delta_{ij}^f d_{ij}^f) / (\sum_{f=1}^{r} \delta_{ij}^f)
where d_{ij}^f is the distance contribution of attribute f, and \delta_{ij}^f is 1 if attribute f contributes to the distance and 0 otherwise (e.g., when a value is missing).
67
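A hedged sketch of this combination: numeric attributes contribute a range-normalized difference, nominal attributes a 0/1 mismatch, and attributes with a missing value (None) are skipped via the delta indicator. The encoding of attribute kinds and ranges here is an assumption for the example, not the slide's notation.

```python
def mixed_distance(x, y, kinds, ranges):
    """Combined distance over mixed attributes.
    kinds[f] is "numeric" or "nominal"; ranges[f] is the value range of
    numeric attribute f (ignored for nominal attributes)."""
    num, den = 0.0, 0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:       # delta = 0: attribute is skipped
            continue
        if kinds[f] == "numeric":
            d = abs(xf - yf) / ranges[f]   # individual distance, in [0, 1]
        else:
            d = 0.0 if xf == yf else 1.0   # simple matching for nominal
        num += d
        den += 1
    return num / den
```

Each per-attribute distance lies in [0, 1], so attributes of different types combine on an equal footing.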
69. How to choose a clustering algorithm
Clustering research has a long history. A vast
collection of algorithms are available.
We only introduced several main algorithms.
Choosing the “best” algorithm is a challenge.
Every algorithm has limitations and works well with
certain data distributions.
It is very hard, if not impossible, to know what
distribution the application data follow. The data may not
fully follow any “ideal” structure or distribution required
by the algorithms.
One also needs to decide how to standardize the data, to
choose a suitable distance function and to select other
parameter values.
69
70. Choose a clustering algorithm (cont …)
Due to these complexities, the common practice is
to
run several algorithms using different distance functions
and parameter settings, and
then carefully analyze and compare the results.
The interpretation of the results must be based on
insight into the meaning of the original data
together with knowledge of the algorithms used.
Clustering is highly application dependent and to a
certain extent subjective (personal preferences).
70
72. Cluster Evaluation: hard problem
The quality of a clustering is very hard to evaluate because
We do not know the correct clusters
Some methods are used:
User inspection
Study centroids, and spreads
Rules from a decision tree.
For text documents, one can read some documents in clusters.
72
73. Cluster evaluation: ground truth
We use some labeled data (for classification)
Assumption: Each class is a cluster.
After clustering, a confusion matrix is constructed. From the matrix, we
compute various measurements, entropy, purity, precision, recall and F-
score.
Let the classes in the data D be C = (c1, c2, …, ck). The clustering method
produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.
73
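The slide lists the measures without their formulas; as one concrete example, purity is conventionally the fraction of points that belong to the majority class of their cluster, and can be sketched as:

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of true class labels per discovered cluster D_j.
    Purity = (1/n) * sum over clusters of the majority-class count."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / n
```

A purity of 1 means every cluster contains points of a single class.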
77. A remark about ground truth evaluation
Commonly used to compare different clustering
algorithms.
A real-life data set for clustering has no class labels.
Thus, although an algorithm may perform very well on
some labeled data sets, there is no guarantee that it will
perform well on the actual application data at hand.
The fact that it performs well on some labeled data
sets does give us some confidence in the quality of
the algorithm.
This evaluation method is said to be based on
external data or information.
77
78. Evaluation based on internal
information
Intra-cluster cohesion (compactness):
Cohesion measures how near the data points in a cluster are to the cluster
centroid.
Sum of squared error (SSE) is a commonly used measure.
Inter-cluster separation (isolation):
Separation means that different cluster centroids should be far away from one
another.
In most applications, expert judgments are still the key.
78
79. Indirect evaluation
In some applications, clustering is not the primary
task, but used to help perform another task.
We can use the performance on the primary task to
compare clustering methods.
For instance, in an application, the primary task is
to provide recommendations on book purchasing to
online shoppers.
If we can cluster books according to their features, we
might be able to provide better recommendations.
We can evaluate different clustering algorithms based on
how well they help with the recommendation task.
Here, we assume that the recommendation can be
reliably evaluated.
79
81. Summary
Clustering has a long history and is still an active research area
There are a huge number of clustering algorithms
More are still coming every year.
We only introduced several main algorithms. There
are many others, e.g.,
density based algorithm, sub-space clustering, scale-up
methods, neural networks based methods, fuzzy clustering,
co-clustering, etc.
Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
Clustering is highly application dependent and to
some extent subjective.
81