A PROJECT ON
CLUSTERING SHAPLEY GALAXY DATASET
In partial fulfillment for the award of the degree
Of
Master of Science (Statistics)
Submitted By:
SRIJAN PAUL
Regn. No.: 2014003137
West Bengal State University
Year: 2014-2016
CONTENTS
1. Introduction
   1.1) Astronomical Background
   1.2) Identity of the Dataset
   1.3) Target for This Dataset
   1.4) Why Clustering
2. Methodology
   2.1) Brief Idea of Different Hierarchical Clustering Algorithms
   2.2) Single Linkage Clustering
   2.3) Complete Linkage Clustering
   2.4) Average Linkage Clustering
3. Analysis
4. Methodology for Further Analysis
   4.1) Model Based Clustering
   4.2) EM Algorithm for the Mixture of Gaussians
5. Further Analysis
6. Conclusion
7. Appendix
   7.1) R Code for Cluster Analysis
   7.2) R Code for Gaussian Mixture Model Analysis
8. Acknowledgement
9. References
1. INTRODUCTION
In statistics, multivariate analysis is very common: almost all modern
datasets are multivariate or high dimensional. The dataset in my project
is likewise multidimensional. It is astronomy related, i.e. astronomical
data on 4215 galaxies in space, and is called the Shapley Galaxy dataset.
1.1) Astronomical Background:-
The distribution of galaxies in space is strongly clustered. The Milky
Way Galaxy resides in its Local Group which lies on the outskirts of the
Virgo Cluster of galaxies, which in turn is part of the Local Supercluster.
Similar structures of galaxies are seen at greater distances, and collectively
the phenomenon is known as the Large Scale Structure (LSS) of the
Universe. The clustering is hierarchical, nonlinear, and anisotropic. The
latter property is manifested as galaxies concentrating in huge flattened,
curved superclusters surrounding "voids", resembling a collection of soap
bubbles.
The basic characteristics of the LSS are now understood
astrophysically as arising from the gravitational attraction of matter in the
Universe expanding from the Big Bang approximately 14 billion years ago.
The particular three-dimensional patterns are well-reproduced by
simulations requiring that attractive Cold Dark Matter and repulsive Dark
Energy are present in addition to attractive baryonic (ordinary) matter.
The properties of baryonic and dark components needed to explain LSS
agree very well with those needed to explain the fluctuations of the cosmic
microwave background and other results from observational cosmology.
Despite this fundamental understanding, there is considerable
interest in the details of galaxy clustering, e.g. the processes
of collision and merging of rich galaxy clusters. The richest nearby
supercluster of interacting galaxy clusters is called the Shapley
Concentration. It includes several clusters from the Abell catalog of rich
galaxy clusters seen in the optical band, and a complex and massive hot
gaseous medium seen in the X-ray band. Optical measurements of galaxy
redshifts provide crucial information but represent an uncertain
convolution of the galaxy distance and the gravitational effects of the
clusters in which the galaxies reside. The distance effect comes from the
universal expansion from the Big Bang, where the recessional velocity
(galaxy redshift) follows Hubble's Law $v = H_0 d$, where $v$ is the
velocity in km/s, $d$ is the galaxy distance from us in Mpc (million
parsecs, 1 pc ≈ 3 light years), and $H_0$ is Hubble's constant, known to
be about 72 km/s/Mpc. The cluster gravitational effects must be estimated
or simulated for individual galaxies.
1.2) Identity of the Dataset:-
The dataset consists of 5 variables which are as follows –
1) R.A. i.e. Right Ascension: Coordinate in the sky similar to longitude
on Earth, 0 to 360 degrees.
2) Dec. i.e. Declination: Coordinate in the sky similar to latitude on
Earth, -90 to +90 degrees.
3) Mag i.e. Magnitude: An inverted logarithmic measure of galaxy
brightness in the optical band. A Mag=17 galaxy is 100 times fainter than
a Mag=12 galaxy. The value is missing for some galaxies (recorded as 0 in
this dataset).
4) V i.e. Velocity: Speed of the galaxy moving away from Earth, after
various corrections are applied.
5) SigV i.e. Sigma of velocity: Heteroscedastic measurement error known
for each individual velocity measurement.
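A quick way to inspect these five variables in R (a minimal sketch, assuming the same dataset.txt file read in the Appendix):

data = read.table("dataset.txt", header = T)  # columns: R.A., Dec., Mag, V, SigV
str(data)       # 4215 observations of 5 variables
summary(data)   # ranges of each variable; missing Mag values appear as 0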
1.3) Target for This Dataset:-
For such astrostatistical datasets, astronomers generally use various
hierarchical clustering algorithms. They often use single-linkage
nonparametric hierarchical agglomeration, which they call the
"friends-of-friends" algorithm.

Hence I am interested in applying a variety of multivariate clustering
algorithms and comparing them where possible.
1.4) Why Clustering:-
In astrostatistical analysis we are generally interested in finding
astronomical bodies with similar characteristics. Here, too, our aim is
to analyze how strongly the galaxies cluster and how many clusters they
form on the basis of the above 5 variables. Within a given cluster we can
then say the galaxies have similar characteristics based on these variables.
2. METHODOLOGY
2.1) Brief Idea of Different Hierarchical Clustering
Algorithms:-
The following are the steps in the agglomerative hierarchical
clustering algorithm for grouping N objects (items or variables):
1. Start with N clusters, each containing a single entity, and an N × N
symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.
2. Search the distance matrix for the nearest (most similar) pair of clusters.
Let the distance between the "most similar" clusters U and V be $d_{UV}$.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update
the entries in the distance matrix by deleting the rows and columns
corresponding to clusters U and V and adding a row and column giving
the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N−1 times. (All objects will be in a single
cluster after the algorithm terminates.) Record the identity of clusters that
are merged and the levels (distances or similarities) at which the mergers
take place.
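As a concrete illustration of these four steps, the following R sketch (the toy data here are my own invention, not the galaxy dataset) runs the agglomerative procedure with hclust() and prints the merge record and merge levels described in Step 4:

set.seed(1)
toy = matrix(rnorm(10), nrow = 5)   # 5 toy objects, 2 variables
D = dist(toy)                       # the N x N distance matrix D = {d_ik}
hc = hclust(D, method = "single")   # agglomerate (single linkage here)
hc$merge    # which clusters were merged at each of the N-1 steps
hc$height   # the distance level at which each merger took place
plot(hc)    # dendrogram of the merge history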
2.2) Single Linkage Clustering:-
The inputs to a single linkage algorithm can be distances or
similarities between pairs of objects. Groups are formed from the
individual entities by merging nearest neighbors, where the term nearest
neighbor connotes the smallest distance or largest similarity.
Initially, we must find the smallest distance in $D = \{d_{ik}\}$ and merge
the corresponding objects, say, U and V, to get the cluster (UV). For Step
3 of the above general algorithm, the distances between (UV) and any
other cluster W are computed by

$$d_{(UV)W} = \min\{d_{UW},\, d_{VW}\}$$

Here the quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest
neighbors of clusters U and W and clusters V and W, respectively.
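For example (with invented numbers), if after the first merge the remaining cluster W satisfies $d_{UW} = 2$ and $d_{VW} = 5$, the updated entry is $d_{(UV)W} = \min\{2,\, 5\} = 2$, so (UV) sits as close to W as its nearest member does.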
The results of single linkage clustering can be graphically displayed
in the form of a dendrogram, or tree diagram. The branches in the tree
represent clusters. The branches come together (merge) at nodes whose
positions along a distance (or similarity) axis indicate the level at which
the fusions occur.
2.3) Complete Linkage Clustering:-
Complete linkage clustering proceeds in much the same manner as
single linkage clustering, with one important exception: at each stage, the
distance (similarity) between clusters is determined by the distance
(similarity) between the two elements, one from each cluster, that are most
distant. Thus, complete linkage ensures that all items in a cluster are
within some maximum distance (or minimum similarity) of each other.

The general agglomerative algorithm again starts by finding the
minimum entry in $D = \{d_{ik}\}$ and merging the corresponding objects, such
as U and V, to get cluster (UV). For Step 3 of the above general algorithm,
the distances between (UV) and any other cluster W are computed by

$$d_{(UV)W} = \max\{d_{UW},\, d_{VW}\}$$

Here $d_{UW}$ and $d_{VW}$ are the distances between the most distant members
of clusters U and W and clusters V and W, respectively.
2.4) Average Linkage Clustering:-
Average linkage treats the distance between two clusters as the
average distance between all pairs of items where one member of a pair
belongs to each cluster.
Again, the input to the average linkage algorithm may be distances
or similarities, and the method can be used to group objects or variables.
The average linkage algorithm proceeds in the manner of the above
general algorithm. We begin by searching the distance matrix $D = \{d_{ik}\}$
to find the nearest (most similar) objects, for example, U and V. These
objects are merged to form the cluster (UV). For Step 3 of the above
general agglomerative algorithm, the distances between (UV) and any
other cluster W are determined by

$$d_{(UV)W} = \frac{\sum_{i}\sum_{k} d_{ik}}{N_{(UV)}\, N_{W}}$$

where $d_{ik}$ is the distance between object i in the cluster (UV) and object
k in the cluster W, and $N_{(UV)}$ and $N_{W}$ are the number of items in clusters
(UV) and W, respectively.
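To see how the three update rules behave on the same data, this R sketch (again with toy data of my own, not the galaxy dataset) clusters one distance matrix under each linkage and compares the merge levels:

set.seed(2)
toy = matrix(rnorm(12), nrow = 6)      # 6 toy objects, 2 variables
D = dist(toy)
hc1 = hclust(D, method = "single")     # min of pairwise distances
hc2 = hclust(D, method = "complete")   # max of pairwise distances
hc3 = hclust(D, method = "average")    # mean of pairwise distances
rbind(single = hc1$height,
      complete = hc2$height,
      average = hc3$height)   # compare the N-1 merge levels under each rule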
3. ANALYSIS
I applied the three clustering schemes mentioned above, i.e. the single,
complete and average linkage algorithms, and obtained the following
dendrograms.
Dendrogram of Single Linkage
Dendrogram of Complete Linkage
Dendrogram of Average Linkage
From the above dendrograms one cannot say how much the galaxies
cluster among each other or how many clusters they form.
Hence further analyses are required.
4. METHODOLOGY FOR FURTHER
ANALYSIS
4.1) Model Based Clustering:-
The single linkage, complete linkage and average linkage clustering
methods are intuitively reasonable procedures but that is as much as we
can say without having a model to explain how the observations were
produced. Major advances in clustering methods have been made through
the introduction of statistical models that indicate how the collection of
(p × 1) measurements $\mathbf{x}_j$, from the N objects, was generated. The most
common model is one where cluster k has expected proportion $p_k$ of the
objects and the corresponding measurements are generated by a
probability density function $f_k(\mathbf{x})$. Then, if there are K clusters, the
observation vector for a single object is modeled as arising from the mixing
distribution

$$f(\mathbf{x}) = \sum_{k=1}^{K} p_k f_k(\mathbf{x})$$

where each $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$. This distribution is called a
mixture of the K distributions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})$ because the
observation is generated from the component distribution $f_k(\mathbf{x})$ with
probability $p_k$. The collection of N observation vectors generated from
this distribution will be a mixture of observations from the component
distributions.
The most common mixture model is a mixture of multivariate
normal distributions, where the k-th component $f_k(\mathbf{x})$ is the
$N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ density function. This is known as the Gaussian (or
maximum likelihood) mixture model, assuming the individual clusters are
multivariate normal.

The normal mixture model for one observation $\mathbf{x}$ is

$$f(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K) = \sum_{k=1}^{K} \frac{p_k}{(2\pi)^{p/2} \lvert\boldsymbol{\Sigma}_k\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right)$$

Clusters generated by this model are ellipsoidal in shape with the heaviest
concentration of observations near the center.

Inferences are based on the likelihood, which for N objects and a
fixed number of clusters K is

$$L(p_1, \ldots, p_K, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K) = \prod_{j=1}^{N} f(\mathbf{x}_j \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)$$

$$= \prod_{j=1}^{N} \left( \sum_{k=1}^{K} \frac{p_k}{(2\pi)^{p/2} \lvert\boldsymbol{\Sigma}_k\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}_j-\boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_k)\right) \right)$$

where the proportions $p_1, p_2, \ldots, p_K$, the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K$, and
the covariance matrices $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2, \ldots, \boldsymbol{\Sigma}_K$ are unknown. The measurements
for different objects are treated as independent and identically distributed
observations from the mixture distribution.
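As a sketch of how this likelihood can be evaluated numerically (the parameter values below are invented for illustration; dmvnorm() and rmvnorm() come from the mvtnorm package, which is assumed to be installed):

library(mvtnorm)   # multivariate normal density and sampler

p_mix = c(0.3, 0.7)                                  # mixing proportions
mu    = list(c(0, 0), c(3, 3))                       # component means
Sigma = list(diag(2), matrix(c(2, .5, .5, 1), 2))    # component covariances

# log-likelihood of a sample X (rows = observations) under the mixture
mixture_loglik = function(X, p_mix, mu, Sigma) {
  dens = sapply(seq_along(p_mix), function(k)
    p_mix[k] * dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]]))
  sum(log(rowSums(dens)))
}

X = rbind(rmvnorm(30, mu[[1]], Sigma[[1]]),
          rmvnorm(70, mu[[2]], Sigma[[2]]))
mixture_loglik(X, p_mix, mu, Sigma)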
Most importantly, under the above sequence of mixture models for
different K, the problems of choosing the number of clusters and choosing
an appropriate clustering method have been reduced to the problem of
selecting an appropriate statistical model. This is a major advance.

A good approach to selecting a model is to first obtain the maximum
likelihood estimates $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\mu}}_K, \hat{\boldsymbol{\Sigma}}_K$ for a fixed number
of clusters K. These estimates must be obtained numerically using special
purpose software. The resulting value of the maximum of the likelihood,

$$L_{\max} = L(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\mu}}_K, \hat{\boldsymbol{\Sigma}}_K),$$
provides the basis for model selection. How do we decide on a reasonable
value for the number of clusters K? In order to compare models with
different numbers of parameters, a penalty is subtracted from twice the
maximized value of the log-likelihood to give

$$2 \ln L_{\max} - \text{penalty}$$

where the penalty depends on the number of parameters estimated and
the number of observations N. Since the probabilities $p_k$ sum to 1, only
K−1 of them must be estimated, together with K × p means and
K × p(p + 1)/2 variances and covariances, for a total of $K(p+1)(p+2)/2 - 1$
parameters. For the Akaike information criterion (AIC), the penalty is
2 × (number of parameters), so

$$\text{AIC} = 2 \ln L_{\max} - 2\left(\frac{K(p+1)(p+2)}{2} - 1\right)$$

The Bayesian information criterion (BIC) is similar but multiplies the
number of parameters by the logarithm of the number of observations in
the penalty:

$$\text{BIC} = 2 \ln L_{\max} - \ln N \left(\frac{K(p+1)(p+2)}{2} - 1\right)$$
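These formulas can be checked directly in R against the fitted model reported later in this project (log-likelihood −91344.25, N = 4215; note that the df of 139 reported there is smaller than the unconstrained parameter count because the selected VEV model constrains the covariance matrices):

loglik = -91344.25   # maximized log-likelihood from the mclust output below
N  = 4215            # number of galaxies
df = 139             # free parameters of the constrained VEV model
2 * loglik - log(N) * df   # -183848.6, the BIC reported by mclust
2 * loglik - 2 * df        # the corresponding AIC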
Even for a fixed number of clusters, the estimation of a mixture
model is complicated. One current software package, MCLUST, available
in the R software library, combines hierarchical clustering, the EM
algorithm and the BIC criterion to develop an appropriate model for
clustering. In the E-step of the EM algorithm, an (N × K) matrix is
created whose j-th row contains estimates of the conditional (on the
current parameter estimates) probabilities that observation $\mathbf{x}_j$ belongs to
cluster 1, 2, ..., K. So, at convergence, the j-th observation (object) is
assigned to the cluster k for which the conditional probability of
membership,

$$\hat{p}(k \mid \mathbf{x}_j) = \frac{\hat{p}_k \hat{f}_k(\mathbf{x}_j)}{\hat{f}(\mathbf{x}_j)},$$

is the largest.
4.2) EM Algorithm for the Mixture of Gaussians:-

Parameters estimated at the r-th iteration are marked by a
superscript (r).

1. Initialize the parameters (taken arbitrarily by the software).

2. E-step:- Compute the posterior probabilities for all j = 1, ..., n and
k = 1, ..., K:

$$p^{(r)}(k \mid \mathbf{x}_j) = \frac{p_k^{(r)} f_k^{(r)}(\mathbf{x}_j)}{f^{(r)}(\mathbf{x}_j)}$$

3. M-step:- Update the parameters:

$$p_k^{(r+1)} = \frac{1}{n}\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j), \qquad \boldsymbol{\mu}_k^{(r+1)} = \frac{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)\,\mathbf{x}_j}{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)},$$

$$\boldsymbol{\Sigma}_k^{(r+1)} = \frac{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)\,(\mathbf{x}_j - \boldsymbol{\mu}_k^{(r+1)})(\mathbf{x}_j - \boldsymbol{\mu}_k^{(r+1)})'}{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)}$$

Repeat steps 2 and 3 until convergence.
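For concreteness, here is a minimal R implementation of one E-step/M-step pass for an unconstrained Gaussian mixture (my own illustrative sketch, not the internal code of MCLUST; dmvnorm() is from the mvtnorm package):

library(mvtnorm)

# X is an n x p data matrix; p_mix, mu, Sigma hold the iteration-r estimates;
# the returned list holds the iteration-(r+1) estimates.
em_step = function(X, p_mix, mu, Sigma) {
  n = nrow(X); K = length(p_mix)
  # E-step: n x K matrix of posterior probabilities p(k | x_j)
  post = sapply(1:K, function(k)
    p_mix[k] * dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]]))
  post = post / rowSums(post)
  # M-step: reweighted proportions, means and covariances
  for (k in 1:K) {
    w = post[, k]
    p_mix[k] = mean(w)
    mu[[k]]  = colSums(w * X) / sum(w)
    Xc = sweep(X, 2, mu[[k]])
    Sigma[[k]] = crossprod(Xc * sqrt(w)) / sum(w)
  }
  list(p_mix = p_mix, mu = mu, Sigma = Sigma)
}
# Iterating em_step() until the log-likelihood stabilizes gives the MLEs.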
5. FURTHER ANALYSIS
Using the MCLUST package in the R software library, and specifically
the Mclust() and clustCombi() functions, I first fit the p = 5
dimensional normal mixture model.

Using the BIC criterion, the software chooses K = 8 clusters with
estimated centers $\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_8$, variance-covariance matrices
$\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_8$ and mixing probabilities $\hat{p}_1, \ldots, \hat{p}_8$ (see Appendix 7.2).

The scatter plots from the above analysis are given below.
Multiple scatter plots of K=8 clusters for the data
Multiple scatter plots of K=7 clusters for the data
Multiple scatter plots of K=6 clusters for the data
Multiple scatter plots of K=5 clusters for the data
Multiple scatter plots of K=4 clusters for the data
Multiple scatter plots of K=3 clusters for the data
Multiple scatter plots of K=2 clusters for the data
Multiple scatter plots of K=1 cluster for the data
The cluster classification plot is as follows.
The BIC plot is also given below, where the model codes are as follows:
“EII” = spherical, equal volume
“VII” = spherical, unequal volume
“EEI” = diagonal, equal volume and shape
“VEI” = diagonal, varying volume, equal shape
“EVI” = diagonal, equal volume, varying shape
“VVI” = diagonal, varying volume and shape
“EEE” = ellipsoidal, equal volume, shape, and orientation
“EVE” = ellipsoidal, equal volume and orientation
“VEE” = ellipsoidal, equal shape and orientation
“VVE” = ellipsoidal, equal orientation
“EEV” = ellipsoidal, equal volume and equal shape
“VEV” = ellipsoidal, equal shape
“EVV” = ellipsoidal, equal volume
“VVV” = ellipsoidal, varying volume, shape, and orientation
From the mclustBIC() function, and also from the above plot, we find
that BIC is maximized by the “VEV” model, i.e. ellipsoidal clusters of
equal shape, and that BIC is maximum for 8 cluster components.

Hence the Gaussian finite mixture model with 8 cluster components
fits our dataset well, with the cluster parameters $\hat{p}_1, \ldots, \hat{p}_8$,
$\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_8$, $\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_8$ (see Appendix 7.2).
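A minimal sketch of the BIC-based model search described above (mclustBIC() and its plot method are part of the mclust package used in the Appendix):

library(mclust)
bic = mclustBIC(data)   # BIC for every model code and number of components
summary(bic)            # top-ranked models; VEV with 8 components leads here
plot(bic)               # reproduces the BIC plot shown above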
6. CONCLUSION

I have dealt with the Shapley Galaxy dataset by fitting a parametric,
model-based cluster analysis, because from the dendrograms of the usual
clustering algorithms (i.e. single, complete and average linkage) I could
not conclude how the galaxies cluster or how many clusters they form, in
the sense that one could say the galaxies in a given cluster share similar
characteristics based on the given variables. I fitted a Gaussian mixture
model, selected via the Bayesian information criterion (BIC), assuming
each cluster has a multivariate normal distribution. BIC is maximized at
8 cluster components; hence the dataset is a mixture of 8 normal
populations with the estimated parameters.

The analysis carried out in this project can also be applied to similar
astrostatistical data, or to any large dataset for which the dendrograms of
the usual clustering algorithms do not support a valid conclusion; in such
cases one can fit model-based clustering to the data instead. For this
reason I believe this project will be useful for statistical analysis in the
future.
7. APPENDIX
7.1) R Code for Cluster Analysis:-
>data=read.table("dataset.txt",header=T) ### read the data
>d=dist(as.matrix(data)) ### build the distance matrix
>hc1=hclust(d,"complete") ### complete linkage
>hc2=hclust(d,"single") ### single linkage
>hc3=hclust(d,"average") ### average linkage
>plot(hc1,xlab="Objects",ylab="Distance") ### dendrogram of complete linkage clustering
>plot(hc2,xlab="Objects",ylab="Distance") ### dendrogram of single linkage clustering
>plot(hc3,xlab="Objects",ylab="Distance") ### dendrogram of average linkage clustering
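If one further wanted explicit cluster memberships from these trees (not part of the original analysis, shown only as an illustrative extension), cutree() would cut a dendrogram at a chosen number of clusters:

>memb=cutree(hc2,k=8) ### cut the single-linkage tree into 8 clusters
>table(memb) ### sizes of the resulting clusters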
7.2) R Code for Gaussian Mixture Model Analysis:-
>install.packages("mclust") ### install the "mclust" package
>library(mclust) ### load package "mclust"
>summary(Mclust(data),parameters=TRUE) ### fit the mixture model and print parameter values
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 8 components:
log.likelihood n df BIC ICL
-91344.25 4215 139 -183848.6 -184592.5
Clustering table:
1 2 3 4 5 6 7 8
642 380 365 68 1664 311 252 533
Mixing probabilities:
1 2 3 4 5 6 7
0.14844329 0.08948725 0.08623729 0.01747060 0.39351658 0.07365214 0.05926123
8
0.13193162
Means:
[,1] [,2] [,3] [,4] [,5] [,6]
R.A. 202.30359 193.85946 198.48879 203.040838 202.19427 2.020683e+02
Dec. -31.62415 -29.87713 -31.19507 -32.339459 -32.22335 -3.214705e+01
Mag 17.68570 17.80851 17.61867 5.948804 15.92424 3.763733e-16
V 14235.45711 16459.78793 31277.52284 30855.133968 12483.56623 1.236849e+04
SigV 43.91562 44.11998 60.79286 186.819014 65.27643 6.296875e+01
[,7] [,8]
R.A. 194.82766 208.84368
Dec. -29.14227 -31.42455
Mag 17.61551 15.00507
V 20238.86560 7154.17168
SigV 67.83738 49.68235
Variances:
[,,1]
R.A. Dec. Mag V SigV
R.A. 7.287574e-01 -0.006095364 -0.20837492 -306.19632 -0.1675741
Dec. -6.095364e-03 0.096109142 0.00520636 89.80991 0.1991862
Mag -2.083749e-01 0.005206360 1.47481240 39.31490 2.9049595
V -3.061963e+02 89.809908618 39.31490404 2505189.73769 6826.4215651
SigV -1.675741e-01 0.199186214 2.90495950 6826.42157 239.2164453
[,,2]
R.A. Dec. Mag V SigV
R.A. 0.10208391 -0.1039204 0.01686484 8.253725e+01 -5.055577e-01
Dec. -0.10392037 0.5575132 -0.18991703 3.576764e+00 1.159739e+00
Mag 0.01686484 -0.1899170 1.16151860 3.301307e+01 6.897552e-02
V 82.53724504 3.5767638 33.01307187 2.040891e+06 -7.381334e+03
SigV -0.50555774 1.1597385 0.06897552 -7.381334e+03 2.064483e+02
[,,3]
R.A. Dec. Mag V SigV
R.A. 11.6035707 -1.574402 -0.6919389 2.817419e+03 -4.004506
Dec. -1.5744015 5.003039 1.6040287 7.138022e+02 9.742544
Mag -0.6919389 1.604029 1.3414609 6.797225e+02 3.762460
V 2817.4192582 713.802245 679.7224588 1.965393e+07 2531.912748
SigV -4.0045060 9.742544 3.7624604 2.531913e+03 1731.312164
[,,4]
R.A. Dec. Mag V SigV
R.A. 74.035386 7.550342 30.98499 4171.425 -1.681867e+00
Dec. 7.550342 10.345911 14.04735 -8518.099 8.707003e+01
Mag 30.984993 14.047346 126.75194 -47601.402 4.410704e+02
V 4171.425135 -8518.099472 -47601.40192 224442780.655 -1.250386e+06
SigV -1.681867 87.070034 441.07038 -1250386.089 2.673377e+04
[,,5]
R.A. Dec. Mag V SigV
R.A. 18.3679545 -0.7246259 -0.3916425 8192.9301 -5.069535
Dec. -0.7246259 6.6309625 0.6971234 -1344.4685 2.235558
Mag -0.3916425 0.6971234 1.2054110 748.0295 17.488057
V 8192.9301477 -1344.4685000 748.0295188 26212974.6194 51547.113121
SigV -5.0695352 2.2355582 17.4880569 51547.1131 2409.825921
[,,6]
R.A. Dec. Mag V SigV
R.A. 1.617655e+01 -1.366427e-01 -4.049577e-15 -6.690651e+03 3.010281e+01
Dec. -1.366427e-01 5.821269e+00 -1.728647e-14 -6.578589e+02 1.218975e+01
Mag -4.049577e-15 -1.728647e-14 8.296247e-01 -2.067025e-12 -3.177826e-13
V -6.690651e+03 -6.578589e+02 -2.067025e-12 2.239437e+07 3.727633e+04
SigV 3.010281e+01 1.218975e+01 -3.177826e-13 3.727633e+04 2.033529e+03
[,,7]
R.A. Dec. Mag V SigV
R.A. 4.94712752 -0.01616675 -0.9052305 149.4066 -6.258482
Dec. -0.01616675 0.36436345 0.2510239 52.2886 2.212348
Mag -0.90523055 0.25102388 2.5169624 430.3722 -3.275991
V 149.40664650 52.28860010 430.3721890 8746849.8127 -39512.273470
SigV -6.25848182 2.21234752 -3.2759908 -39512.2735 948.865966
[,,8]
R.A. Dec. Mag V SigV
R.A. 11.62256349 -0.09353045 1.2133975 3902.249 37.31454
Dec. -0.09353045 4.66961672 0.4492924 1381.551 16.14600
Mag 1.21339748 0.44929245 1.1493867 1621.072 27.30945
V 3902.24896851 1381.55109212 1621.0715134 17492495.260 56416.61420
SigV 37.31454451 16.14600173 27.3094540 56416.614 1721.86659
>plot(clustCombi(data),data) ### multiple scatter plots of different cluster combinations
>plot(Mclust(data)) ### BIC and classification plots
8. ACKNOWLEDGEMENT
I am very thankful to the Department of Statistics, West Bengal
State University, for their continuous guidance in realizing this project. I
am also thankful to the Department of Astrostatistics and the Eberly
College of Science at Penn State University.
9. REFERENCES
1. Dataset at the Department of Astrostatistics, Penn State University.
URL: http://astrostatistics.psu.edu/
2. Feigelson, E.D. and Babu, G.J. (2012): “Modern Statistical Methods
for Astronomy with R Applications”, Cambridge University Press.
3. Johnson, R.A. and Wichern, D.W. (1998): “Applied Multivariate
Statistical Analysis”, New Jersey: Prentice Hall.
4. Rocke, D.M. and Dai, J., Center for Image Processing and
Integrated Computing, University of California, Davis, CA 95616,
USA: “Sampling and Subsampling for Cluster Analysis in Data
Mining: With Applications to Sky Survey Data”.
5. Li, J., Department of Statistics, The Pennsylvania State University:
“Mixture Models”.
6. Fraley, C. and Raftery, A. (2009): mclust: “Model-Based Clustering
and Normal Mixture Modeling”.