Automated Clustering Project
Miklos Vasarhelyi, Paul Byrnes, and Yunsen Wang
Presented by Deniz Appelbaum
Motivation
 The motivation is to develop a program that automatically performs
clustering and outlier detection for a wide variety of numerically represented data.
Outline of program features
 Normalizes all data to be clustered
 Creates normalized principal components from the normalized data
 Automatically selects the necessary normalized principal components for use in actual
clustering and outlier detection
 Compares a variety of algorithms based upon the selected set of normalized principal
components
 Adopts the top performing model based upon silhouette coefficient values to perform
the final clustering and outlier detection procedures
 Produces relevant information and outputs throughout the process
Data normalization
 Data normalization
 Converts each numerically represented dimension to be clustered into the range [0,1].
 A desirable procedure for preparing numeric attributes for clustering
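A minimal sketch of the [0,1] rescaling described above, assuming ordinary min-max normalization (the slides do not name the exact formula). The function name and the sample values are hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers into the range [0, 1].

    Assumption: min-max scaling, (x - min) / (max - min).
    A constant column has no spread, so it is mapped to all zeros.
    """
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]


# Hypothetical credit limits, for illustration only.
credit_limits = [500, 2000, 3500]
normalized = min_max_normalize(credit_limits)
print(normalized)
```

Each numeric dimension would be rescaled independently in this way before PCA.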
Principal component analysis
 Principal component analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal
components.
 In this way, PCA can both reduce dimensionality as well as eliminate inherent
problems associated with clustering data whose attributes are correlated
 In the following slides, a random sample of 5,000 credit card customers is used to
demonstrate the automated clustering and outlier detection program
Principal component analysis
 PCA initially results in four principal
components being generated from
the original data
 Using a cumulative data variability
threshold of 80% (default
specification), three principal
components are automatically
selected for analysis – they explain
the vast majority of data variability
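The selection rule above can be sketched as follows: keep the smallest number of leading components whose cumulative share of variance reaches the threshold. The function name and the component variances are hypothetical; they are chosen only so that three of four components are kept, as in the example on this slide.

```python
def select_components(variances, threshold=0.80):
    """Return how many leading principal components to keep.

    `variances` holds the variance explained by each component.
    Components are taken in descending order of variance until their
    cumulative share of the total reaches `threshold`.
    """
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical variances for four components (not the deck's actual values).
component_variances = [2.0, 1.0, 0.6, 0.4]
kept = select_components(component_variances)
print(kept)
```

With these illustrative numbers the cumulative shares are 50%, 75%, and 90%, so the third component is the first to cross the 80% default.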
Principal component analysis
 Scatter plot of PC1 and PC2
 In this view, the top 2 principal
components are plotted for each object in
two-dimensional space.
 As can be seen, a small subset of records
appears significantly more distant/different
from the vast majority of objects.
Clustering exploration/simulation process - examples
 Ward method
 Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for
choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.
 Complete link method
 This method is also known as farthest neighbor clustering: the distance between two clusters is the
distance between their two most distant members. The result of the clustering can be visualized as a
dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.
 PAM (partitioning around medoids)
 The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid shift
algorithm. It is considered more robust to outliers than k-means because each cluster is represented by
a medoid (an actual data point) rather than by a mean.
 K-means
 k-means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a prototype of the cluster.
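To illustrate the difference between the prototypes used by k-means and PAM, the sketch below computes both the mean and the medoid of one small cluster that contains an outlier. The data and function names are hypothetical, and this is not a full clustering implementation.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the points (the k-means prototype)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


def medoid(points):
    """The member point with the smallest total distance to all others
    (the prototype used by k-medoids / PAM)."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical cluster: four nearby points plus one distant outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (20.0, 20.0)]
print(centroid(cluster))  # pulled toward the outlier
print(medoid(cluster))    # stays on one of the four nearby points
```

The mean is dragged away from the bulk of the points by the single outlier, while the medoid remains an actual record inside the dense group.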
Clustering exploration results
 The result shown below is based upon a simulation exercise, whereby all four
algorithms are automatically compared on the data set (i.e., a random sample of 5,000
records from the credit card customer data). In this particular case, the best model is
found to be a two-cluster solution using the complete link hierarchical method. This is
the final model and is used for subsequent clustering and outlier detection.
 Best clustering result:
 The silhouette value can theoretically range from -1 to +1, with higher values indicative
of better cluster quality in terms of both cohesion and separation.
Best Method                  Number Of Clusters   Silhouette Value
complete link hierarchical   2                    0.753754205720575
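A rough sketch of silhouette-based model selection, assuming the standard silhouette definition with Euclidean distance: for each point, a is the mean distance to its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The points, candidate labelings, and names below are hypothetical stand-ins for the outputs of the compared algorithms.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical data and candidate cluster assignments.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    "separated": [0, 0, 0, 1, 1, 1],
    "mixed": [0, 1, 0, 1, 0, 1],
}
results = {name: silhouette_score(points, lab) for name, lab in candidates.items()}
best = max(results, key=results.get)
print(best, results[best])
```

The candidate with the highest mean silhouette value is adopted as the final model, mirroring the selection rule stated in the outline.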
Complete-link Hierarchical clustering (1/2)
 The 5,000 instances are on the
x-axis. In moving vertically from
the x-axis, one can begin to see
how the actual clusters are
formed.
Plot of PCs with cluster assignment labels (1/3)
 In this view, the top two principal
components (i.e., PC1 and PC2) are
plotted for each object in two-
dimensional space.
 In the graph, there are two clusters, one
dark blue and the other light blue.
 The small subset of three records appears
substantially more different from the
majority of objects.
Plot of PCs with cluster assignment labels (2/3)
 In this view, PC1 and PC3 are plotted for
each object in two-dimensional space.
 In the graph, the two clusters are again
shown.
 It is once again evident that the small
subset of three records appears more
different from the majority of other
objects.
Plot of PCs with cluster assignment labels (3/3)
 In this view, PC2 and PC3 are
plotted for each object in two-
dimensional space.
 Cluster differences appear less
prominent from this perspective.
Principal components 3D scatterplot
 Cluster one represents the majority
class (black) while cluster two
represents the rare class (red).
 In this view, one can clearly see the
subset of three records (in red)
appearing more isolated from the other
objects.
Cluster 1 outlier plot
 In this view, an arbitrary cutoff is
inserted at the 99.9th percentile (red
horizontal line) so as to provide for
efficient identification of very irregular
records.
 Objects further from the x-axis are
more questionable.
 While all objects distant from the x-
axis might be worth investigating,
points above the cutoff should be
viewed as particularly suspicious.
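A hedged sketch of percentile-based flagging within one cluster, assuming each record is scored by its Mahalanobis distance from the cluster mean (the slides list Mahalanobis distance among the outputs but do not show the computation). It is restricted to two dimensions so the covariance inverse can be written in closed form; the data, function names, and nearest-rank percentile rule are all assumptions.

```python
import math
import random


def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T S^-1 d using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical cluster: 1,000 ordinary records plus one irregular record.
random.seed(0)
records = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
records.append((10.0, 10.0))

distances = mahalanobis_2d(records)
cutoff = percentile(distances, 99.9)
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print(cutoff, flagged)
```

Records above the cutoff correspond to the points above the red line in the plot; the cutoff itself remains an arbitrary choice.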
Conclusion of Process
 At the conclusion of outlier detection, an output file for each cluster containing the unique
record identifier, original variables, normalized variables, principal components, normalized
principal components, cluster assignments, and Mahalanobis distance information can be
exported to facilitate further analyses and investigations.
 Cluster 2 – final output file of a subset of fields:
 Distinguishing features of cluster 2 records: 1) New accounts (age = 1 month), 2)
Very high incidence of late payments, and 3) Relatively high credit limits,
particularly given the account age and late payment issues.
Record  AccountAge  CreditLimit  AdditionalAssets  LatePayments  model.cluster  md
32430   1           2500         1                 3             2              5.83E-05
65470   1           8500         1                 4             2              0.002371778
78772   1           2200         0                 3             2              0.000442305

More Related Content

What's hot

DATA MINING:Clustering Types
DATA MINING:Clustering TypesDATA MINING:Clustering Types
DATA MINING:Clustering Types
Ashwin Shenoy M
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
Krish_ver2
 
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
Zac Darcy
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Birch
BirchBirch
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
Abdullah Masoud
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
Mason Ziemer
 
presentation 2019 04_09_rev1
presentation 2019 04_09_rev1presentation 2019 04_09_rev1
presentation 2019 04_09_rev1
Hyun Wong Choi
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
Laura Petrosanu
 
A046010107
A046010107A046010107
A046010107
IJERA Editor
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
Editor IJCATR
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
ijscmcj
 
Pillar k means
Pillar k meansPillar k means
Pillar k means
swathi b
 
IRJET- Different Data Mining Techniques for Weather Prediction
IRJET-  	  Different Data Mining Techniques for Weather PredictionIRJET-  	  Different Data Mining Techniques for Weather Prediction
IRJET- Different Data Mining Techniques for Weather Prediction
IRJET Journal
 
Application of stochastic modelling in bioinformatics
Application of stochastic modelling in bioinformaticsApplication of stochastic modelling in bioinformatics
Application of stochastic modelling in bioinformatics
Spyros Ktenas
 
Canopy clustering algorithm
Canopy clustering algorithmCanopy clustering algorithm
Canopy clustering algorithm
Ashish Karki
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
Krish_ver2
 
Clustering using kernel entropy principal component analysis and variable ker...
Clustering using kernel entropy principal component analysis and variable ker...Clustering using kernel entropy principal component analysis and variable ker...
Clustering using kernel entropy principal component analysis and variable ker...
IJECEIAES
 
Clustering
ClusteringClustering
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 

What's hot (20)

DATA MINING:Clustering Types
DATA MINING:Clustering TypesDATA MINING:Clustering Types
DATA MINING:Clustering Types
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE...
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Birch
BirchBirch
Birch
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
presentation 2019 04_09_rev1
presentation 2019 04_09_rev1presentation 2019 04_09_rev1
presentation 2019 04_09_rev1
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
A046010107
A046010107A046010107
A046010107
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSSVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
 
Pillar k means
Pillar k meansPillar k means
Pillar k means
 
IRJET- Different Data Mining Techniques for Weather Prediction
IRJET-  	  Different Data Mining Techniques for Weather PredictionIRJET-  	  Different Data Mining Techniques for Weather Prediction
IRJET- Different Data Mining Techniques for Weather Prediction
 
Application of stochastic modelling in bioinformatics
Application of stochastic modelling in bioinformaticsApplication of stochastic modelling in bioinformatics
Application of stochastic modelling in bioinformatics
 
Canopy clustering algorithm
Canopy clustering algorithmCanopy clustering algorithm
Canopy clustering algorithm
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
 
Clustering using kernel entropy principal component analysis and variable ker...
Clustering using kernel entropy principal component analysis and variable ker...Clustering using kernel entropy principal component analysis and variable ker...
Clustering using kernel entropy principal component analysis and variable ker...
 
Clustering
ClusteringClustering
Clustering
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 

Viewers also liked

Not Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software QualityNot Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software Quality
Rocco Oliveto
 
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
asd123456789123
 
A2DataDive workshop: Introduction to R
A2DataDive workshop: Introduction to RA2DataDive workshop: Introduction to R
A2DataDive workshop: Introduction to R
Open.Michigan
 
Preliminary Study of Engineering Self
Preliminary Study of Engineering SelfPreliminary Study of Engineering Self
Preliminary Study of Engineering Self
Dan Tetrick
 
Kent ro systems
Kent ro systemsKent ro systems
Kent ro systems
Aqua-Tech Service
 
Selected ion flow tube MS - Online quantitative VOC analysis
Selected ion flow tube MS - Online quantitative VOC analysisSelected ion flow tube MS - Online quantitative VOC analysis
Selected ion flow tube MS - Online quantitative VOC analysis
IS-X
 

Viewers also liked (6)

Not Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software QualityNot Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software Quality
 
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
Consulting whitepaper enterprise-architecture-transformation-pharmaceutical-c...
 
A2DataDive workshop: Introduction to R
A2DataDive workshop: Introduction to RA2DataDive workshop: Introduction to R
A2DataDive workshop: Introduction to R
 
Preliminary Study of Engineering Self
Preliminary Study of Engineering SelfPreliminary Study of Engineering Self
Preliminary Study of Engineering Self
 
Kent ro systems
Kent ro systemsKent ro systems
Kent ro systems
 
Selected ion flow tube MS - Online quantitative VOC analysis
Selected ion flow tube MS - Online quantitative VOC analysisSelected ion flow tube MS - Online quantitative VOC analysis
Selected ion flow tube MS - Online quantitative VOC analysis
 

Similar to Automated Clustering Project - 12th CONTECSI 34th WCARS

An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clustering
ijcsity
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
ijpla
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster Analysis
Suman Mia
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Seval Çapraz
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
Waqas Tariq
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
CSCJournals
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
Happiest Minds Technologies
 
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATIONCONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
CSEIJJournal
 
Convolutional Neural Network based Retinal Vessel Segmentation
Convolutional Neural Network based Retinal Vessel SegmentationConvolutional Neural Network based Retinal Vessel Segmentation
Convolutional Neural Network based Retinal Vessel Segmentation
CSEIJJournal
 
IRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET- Customer Segmentation from Massive Customer Transaction DataIRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET Journal
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
IOSR Journals
 
F017132529
F017132529F017132529
F017132529
IOSR Journals
 
Az36311316
Az36311316Az36311316
Az36311316
IJERA Editor
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
ijcsit
 

Similar to Automated Clustering Project - 12th CONTECSI 34th WCARS (20)

An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
 
Mine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means ClusteringMine Blood Donors Information through Improved K-Means Clustering
Mine Blood Donors Information through Improved K-Means Clustering
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster Analysis
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
 
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATIONCONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
CONVOLUTIONAL NEURAL NETWORK BASED RETINAL VESSEL SEGMENTATION
 
Convolutional Neural Network based Retinal Vessel Segmentation
Convolutional Neural Network based Retinal Vessel SegmentationConvolutional Neural Network based Retinal Vessel Segmentation
Convolutional Neural Network based Retinal Vessel Segmentation
 
IRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET- Customer Segmentation from Massive Customer Transaction DataIRJET- Customer Segmentation from Massive Customer Transaction Data
IRJET- Customer Segmentation from Massive Customer Transaction Data
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
 
F017132529
F017132529F017132529
F017132529
 
Az36311316
Az36311316Az36311316
Az36311316
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
 

More from TECSI FEA USP

12th CONTECSI USP - Guia para publicar Andre Jun Emerald
12th CONTECSI USP - Guia para publicar  Andre Jun Emerald12th CONTECSI USP - Guia para publicar  Andre Jun Emerald
12th CONTECSI USP - Guia para publicar Andre Jun Emerald
TECSI FEA USP
 
12 contecsi IT Management GAESI USP Rastreabilidade de Medicamentos - Elcio...
12 contecsi  IT Management GAESI USP  Rastreabilidade de Medicamentos - Elcio...12 contecsi  IT Management GAESI USP  Rastreabilidade de Medicamentos - Elcio...
12 contecsi IT Management GAESI USP Rastreabilidade de Medicamentos - Elcio...
TECSI FEA USP
 
12 contecsi Workshop Mendeley Ligia Capobianco
12 contecsi   Workshop Mendeley Ligia Capobianco12 contecsi   Workshop Mendeley Ligia Capobianco
12 contecsi Workshop Mendeley Ligia Capobianco
TECSI FEA USP
 
Planejamento e gestão de indicadores para projetos digitais - Workshop 12th C...
Planejamento e gestão de indicadores para projetos digitais - Workshop 12th C...Planejamento e gestão de indicadores para projetos digitais - Workshop 12th C...
Planejamento e gestão de indicadores para projetos digitais - Workshop 12th C...
TECSI FEA USP
 
Programa de Apoio às Publicações Científicas Periódicas da USP - 12th CONTECSI
Programa de Apoio às Publicações Científicas Periódicas da USP - 12th CONTECSI Programa de Apoio às Publicações Científicas Periódicas da USP - 12th CONTECSI
Programa de Apoio às Publicações Científicas Periódicas da USP - 12th CONTECSI
TECSI FEA USP
 
Centro Integrado de Mobilidade Urbana CIMU - 12th CONTECSI
 Centro Integrado de Mobilidade Urbana CIMU - 12th CONTECSI   Centro Integrado de Mobilidade Urbana CIMU - 12th CONTECSI
Centro Integrado de Mobilidade Urbana CIMU - 12th CONTECSI
TECSI FEA USP
 
The ladder of Citizens ‘smartness’ Citizens participation in smart cities - 1...
The ladder of Citizens ‘smartness’ Citizens participation in smart cities - 1...The ladder of Citizens ‘smartness’ Citizens participation in smart cities - 1...
The ladder of Citizens ‘smartness’ Citizens participation in smart cities - 1...
TECSI FEA USP
 
Papel de los vocabularios semánticos en la economía en red - 12th CONTECSI
 Papel de los vocabularios semánticos en la economía en red - 12th CONTECSI  Papel de los vocabularios semánticos en la economía en red - 12th CONTECSI
Papel de los vocabularios semánticos en la economía en red - 12th CONTECSI
TECSI FEA USP
 
Balance Innovations in Backoffice Improvement and Service Delivery A study ca...
Balance Innovations in Backoffice Improvement and Service Delivery A study ca...Balance Innovations in Backoffice Improvement and Service Delivery A study ca...
Balance Innovations in Backoffice Improvement and Service Delivery A study ca...
TECSI FEA USP
 
Sistema Autenticador e Transmissor (SAT): modelo tecnológico de automação e c...
Sistema Autenticador e Transmissor (SAT): modelo tecnológico de automação e c...Sistema Autenticador e Transmissor (SAT): modelo tecnológico de automação e c...
Sistema Autenticador e Transmissor (SAT): modelo tecnológico de automação e c...
TECSI FEA USP
 
GAESI - Gestão em Automação e TI - 12th CONTECSI
GAESI - Gestão em Automação e TI - 12th CONTECSI GAESI - Gestão em Automação e TI - 12th CONTECSI
GAESI - Gestão em Automação e TI - 12th CONTECSI
TECSI FEA USP
 
Co-production: an opportunity toward better digital governance - 12th CONTECSI
 Co-production: an opportunity toward better digital governance - 12th CONTECSI  Co-production: an opportunity toward better digital governance - 12th CONTECSI
Co-production: an opportunity toward better digital governance - 12th CONTECSI
TECSI FEA USP
 
The Digital Transformation - Challenges and Opportunities for IS researchers ...
The Digital Transformation - Challenges and Opportunities for IS researchers ...The Digital Transformation - Challenges and Opportunities for IS researchers ...
The Digital Transformation - Challenges and Opportunities for IS researchers ...
TECSI FEA USP
 
Auditoria Contínua e o Sistema de Controle da Administração Pública Federal -...
Auditoria Contínua e o Sistema de Controle da Administração Pública Federal -...Auditoria Contínua e o Sistema de Controle da Administração Pública Federal -...
Auditoria Contínua e o Sistema de Controle da Administração Pública Federal -...
TECSI FEA USP
 
Big (huge) Data and a continuous and predictive audit: new evidence, new met...
 Big (huge) Data and a continuous and predictive audit: new evidence, new met... Big (huge) Data and a continuous and predictive audit: new evidence, new met...
Big (huge) Data and a continuous and predictive audit: new evidence, new met...
TECSI FEA USP
 
Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS
Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARSText Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS
Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Federal Audit in Relation to Continuous Audit - 12th CONTECSI 34th WCARS
 Federal Audit in Relation to Continuous Audit - 12th CONTECSI 34th WCARS Federal Audit in Relation to Continuous Audit - 12th CONTECSI 34th WCARS
Federal Audit in Relation to Continuous Audit - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
O Tribunal de Contas da União e a Auditoria Contínua - 12th CONTECSI 34th WCARS
O Tribunal de Contas da União e a Auditoria Contínua - 12th CONTECSI 34th WCARSO Tribunal de Contas da União e a Auditoria Contínua - 12th CONTECSI 34th WCARS
O Tribunal de Contas da União e a Auditoria Contínua - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Cenário atual da Auditoria Contínua em Bancos no Brasil Itaú-Unibanco Holding...
Cenário atual da Auditoria Contínua em Bancos no Brasil Itaú-Unibanco Holding...Cenário atual da Auditoria Contínua em Bancos no Brasil Itaú-Unibanco Holding...
Cenário atual da Auditoria Contínua em Bancos no Brasil Itaú-Unibanco Holding...
TECSI FEA USP
 
Auditoria Eletrônica: Automatização de procedimentos de auditoria através do ...
Auditoria Eletrônica: Automatização de procedimentos de auditoria através do ...Auditoria Eletrônica: Automatização de procedimentos de auditoria através do ...
Auditoria Eletrônica: Automatização de procedimentos de auditoria através do ...
TECSI FEA USP
 

Automated Clustering Project - 12th CONTECSI 34th WCARS

  • 1. Automated Clustering Project
       Miklos Vasarhelyi, Paul Byrnes, and Yunsen Wang
       Presented by Deniz Appelbaum
  • 2. Motivation
       - Develop a program that automatically performs clustering and outlier detection for a wide variety of numerically represented data.
  • 3. Outline of program features
       - Normalizes all data to be clustered
       - Creates normalized principal components from the normalized data
       - Automatically selects the normalized principal components needed for the actual clustering and outlier detection
       - Compares a variety of algorithms on the selected set of normalized principal components
       - Adopts the top-performing model, as measured by silhouette coefficient values, for the final clustering and outlier detection procedures
       - Produces relevant information and outputs throughout the process
  • 4. Data normalization
       - Converts each numerically represented dimension to be clustered into the range [0,1]
       - A desirable step when preparing numeric attributes for clustering, since it keeps attributes with large numeric ranges from dominating distance calculations
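The slides state only that each dimension is converted into the range [0,1]; they do not give the formula. A minimal sketch, assuming ordinary min-max scaling is what is meant, could look like the following. The function name `min_max_normalize` and the toy records are hypothetical and are not taken from the presented program.

```python
import numpy as np

def min_max_normalize(data):
    """Rescale each column of a 2-D array into the range [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Constant columns have zero range; map them to 0 instead of dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

# Hypothetical toy records: account age (months), credit limit, late payments.
records = np.array([[1, 2500, 3],
                    [24, 8500, 0],
                    [60, 2200, 1]])
normalized = min_max_normalize(records)
print(normalized)
```

After scaling, the smallest value in every column becomes 0 and the largest becomes 1, so no single attribute dominates a Euclidean distance purely because of its units.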
  • 5. Principal component analysis
       - Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
       - In this way, PCA can both reduce dimensionality and eliminate the problems that arise when clustering data whose attributes are correlated.
       - In the following slides, a random sample of 5,000 credit card customers is used to demonstrate the automated clustering and outlier detection program.
  • 6. Principal component analysis
       - PCA initially generates four principal components from the original data.
       - With the default cumulative data variability threshold of 80%, three principal components are automatically selected for analysis; together they explain the vast majority of the data variability.
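The selection rule described above (keep the fewest leading components whose cumulative share of variance reaches 80%) can be sketched as follows. This is an illustration under stated assumptions rather than the presenters' code: PCA is computed here through an SVD of the mean-centered data, the function name `select_components` is hypothetical, and the random data merely stands in for the credit card sample. The further normalization of the selected components that the outline mentions is not shown.

```python
import numpy as np

def select_components(data, threshold=0.80):
    """Keep the fewest leading principal components whose cumulative
    explained-variance share reaches `threshold`."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    # Index of the first component at which the cumulative share reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    n_keep = min(n_keep, len(explained))  # guard against rounding at threshold = 1.0
    scores = centered @ vt[:n_keep].T
    return scores, explained, n_keep

# Hypothetical data: four attributes, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    0.2 * rng.normal(size=200),
])
scores, explained, n_keep = select_components(data)
print(n_keep, np.round(np.cumsum(explained), 3))
```

The returned `scores` are the coordinates of each record on the retained components, which is what the later clustering steps would consume.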
  • 7. Principal component analysis
       - Scatter plot of PC1 and PC2
       - In this view, the top two principal components are plotted for each object in two-dimensional space.
       - As can be seen, a small subset of records appears significantly more distant from, and different than, the vast majority of objects.
  • 8. Clustering exploration/simulation process - examples
       - Ward method: Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.
       - Complete link method: This method is also known as farthest neighbor clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.
       - PAM (partitioning around medoids): The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid shift algorithm. It is considered more robust than k-means to noise and outliers because each cluster is represented by a medoid (an actual data point) rather than by a mean.
       - K-means: k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
  • 9. Clustering exploration results
       - The result shown below is based upon a simulation exercise in which all four algorithms are automatically compared on the data set (i.e., a random sample of 5,000 records from the credit card customer data). In this particular case, the best model is found to be a two-cluster solution using the complete link hierarchical method. This is the final model and is used for subsequent clustering and outlier detection.
       - Best clustering result:

       Best Method                  Number Of Clusters   Silhouette Value
       complete link hierarchical   2                    0.753754205720575

       - The silhouette value can theoretically range from -1 to +1, with higher values indicative of better cluster quality in terms of both cohesion and separation.
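The comparison step can be sketched as: run each candidate method for several cluster counts, score every labeling with the mean silhouette coefficient, and keep the highest-scoring combination. The block below is a simplified illustration, not the presented program. It assumes Euclidean distance, covers only two of the four methods (a naive complete-link procedure and plain k-means; Ward and PAM are left out to keep it short), and every function name is hypothetical. For each point, the silhouette is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance matrix between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def silhouette(x, labels):
    """Mean silhouette coefficient (needs at least two clusters)."""
    dist = pairwise_distances(x)
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (own.sum() - 1)  # dist[i, i] is 0, so self is excluded
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def complete_link(x, k):
    """Naive complete-link agglomerative clustering, cut at k clusters."""
    dist = pairwise_distances(x)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: cluster distance is that of the farthest pair of members.
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(x), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

def best_model(x, methods, k_values=(2, 3, 4)):
    """Return (silhouette, method name, k) for the best-scoring combination."""
    return max((silhouette(x, fn(x, k)), name, k)
               for name, fn in methods.items() for k in k_values)

# Hypothetical toy data: a dense majority group plus three distant records.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.05, size=(27, 2)),
                    rng.normal(1.0, 0.05, size=(3, 2))])
methods = {"complete link hierarchical": complete_link, "k-means": kmeans}
best_score, best_name, best_k = best_model(points, methods)
print(best_name, best_k, round(best_score, 3))
```

The naive complete-link loop is far too slow for 5,000 records; it is written for readability only, and a production run would rely on an optimized library implementation.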
  • 10. Complete-link hierarchical clustering (1/2)
       - The 5,000 instances are on the x-axis. Moving vertically from the x-axis, one can begin to see how the actual clusters are formed.
  • 11. Plot of PCs with cluster assignment labels (1/3)
       - In this view, the top two principal components (i.e., PC1 and PC2) are plotted for each object in two-dimensional space.
       - In the graph, there are two clusters, one dark blue and the other light blue.
       - The small subset of three records appears substantially different from the majority of objects.
  • 12. Plot of PCs with cluster assignment labels (2/3)
       - In this view, PC1 and PC3 are plotted for each object in two-dimensional space.
       - In the graph, the two clusters are again shown.
       - It is once again evident that the small subset of three records differs from the majority of other objects.
  • 13. Plot of PCs with cluster assignment labels (3/3)
       - In this view, PC2 and PC3 are plotted for each object in two-dimensional space.
       - Cluster differences appear less prominent from this perspective.
  • 14. Principal components 3D scatterplot
       - Cluster one represents the majority class (black) while cluster two represents the rare class (red).
       - In this view, one can clearly see the subset of three records (in red) appearing more isolated from the other objects.
  • 15. Cluster 1 outlier plot
       - In this view, an arbitrary cutoff is inserted at the 99.9th percentile (red horizontal line) to allow efficient identification of very irregular records.
       - Objects farther from the x-axis are more questionable.
       - While all objects distant from the x-axis might be worth investigating, points above the cutoff should be viewed as particularly suspicious.
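The percentile cutoff can be illustrated with a short sketch. The slides do not say how the plotted outlier score is computed; the block below assumes it is the Mahalanobis distance of each record from the centroid of its cluster (the final output file does list Mahalanobis distance information), and the function names and synthetic data are hypothetical.

```python
import numpy as np

def mahalanobis_distances(x):
    """Mahalanobis distance of each row from the sample mean of x."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    # The pseudo-inverse tolerates a singular covariance matrix.
    inv_cov = np.linalg.pinv(cov)
    squared = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(squared, 0.0))

def flag_outliers(distances, percentile=99.9):
    """Return the cutoff value and a boolean mask of records above it."""
    cutoff = np.percentile(distances, percentile)
    return cutoff, distances > cutoff

# Hypothetical cluster: 2,000 ordinary records plus one extreme record.
rng = np.random.default_rng(0)
cluster = np.vstack([rng.normal(size=(2000, 3)), [[10.0, 10.0, 10.0]]])
distances = mahalanobis_distances(cluster)
cutoff, flags = flag_outliers(distances)
print(round(float(cutoff), 3), np.flatnonzero(flags))
```

Because the cutoff is a percentile of the observed distances, a handful of records is always flagged regardless of how unusual they are in absolute terms, which matches the slides' description of the cutoff as arbitrary.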
  • 16. Conclusion of Process
       - At the conclusion of outlier detection, an output file for each cluster containing the unique record identifier, original variables, normalized variables, principal components, normalized principal components, cluster assignments, and Mahalanobis distance information can be exported to facilitate further analyses and investigations.
       - Cluster 2, final output file (subset of fields):

       Record   AccountAge   CreditLimit   AdditionalAssets   LatePayments   model.cluster   md
       32430    1            2500          1                  3              2               5.83E-05
       65470    1            8500          1                  4              2               0.002371778
       78772    1            2200          0                  3              2               0.000442305

       - Distinguishing features of cluster 2 records: 1) new accounts (age = 1 month), 2) very high incidence of late payments, and 3) relatively high credit limits, particularly given the account age and late payment issues.