A PROJECT ON
CLUSTERING SHAPLEY GALAXY DATASET
In partial fulfillment for the award of the degree
Of
Master of Science (Statistics)
Submitted By:
SRIJAN PAUL
Regn. No.: 2014003137
West Bengal State University
Year: 2014-2016
CONTENTS
1. Introduction
   1.1) Astronomical Background
   1.2) Identity of the Dataset
   1.3) Target for This Dataset
   1.4) Why Clustering
2. Methodology
   2.1) Brief Idea of Different Hierarchical Clustering Algorithms
   2.2) Single Linkage Clustering
   2.3) Complete Linkage Clustering
   2.4) Average Linkage Clustering
3. Analysis
4. Methodology for Further Analysis
   4.1) Model Based Clustering
   4.2) EM Algorithm for the Mixture of Gaussians
5. Further Analysis
6. Conclusion
7. Appendix
   7.1) R Code for Cluster Analysis
   7.2) R Code for Gaussian Mixture Model Analysis
8. Acknowledgement
9. References
1. INTRODUCTION
In statistics, multivariate analysis is very common: almost all modern
datasets are multivariate or high dimensional. The dataset in my project
is likewise multidimensional. It is astronomy related, i.e. astronomical
data on 4215 galaxies in space, and is called the Shapley Galaxy dataset.
1.1) Astronomical Background:-
The distribution of galaxies in space is strongly clustered. The Milky
Way Galaxy resides in its Local Group which lies on the outskirts of the
Virgo Cluster of galaxies, which in turn is part of the Local Supercluster.
Similar structures of galaxies are seen at greater distances, and collectively
the phenomenon is known as the Large Scale Structure (LSS) of the
Universe. The clustering is hierarchical, nonlinear, and anisotropic. The
latter property is manifested as galaxies concentrating in huge flattened,
curved superclusters surrounding "voids", resembling a collection of soap
bubbles.
The basic characteristics of the LSS are now understood
astrophysically as arising from the gravitational attraction of matter in the
Universe expanding from the Big Bang approximately 14 billion years ago.
The particular three-dimensional patterns are well-reproduced by
simulations requiring that attractive Cold Dark Matter and repulsive Dark
Energy are present in addition to attractive baryonic (ordinary) matter.
The properties of baryonic and dark components needed to explain LSS
agree very well with those needed to explain the fluctuations of the cosmic
microwave background and other results from observational cosmology.
Despite this fundamental understanding, there is considerable
interest in the details of galaxy clustering, e.g. the processes
of collision and merging of rich galaxy clusters. The richest nearby
supercluster of interacting galaxy clusters is called the Shapley
Concentration. It includes several clusters from the Abell catalog of rich
galaxy clusters seen in the optical band, and a complex and massive hot
gaseous medium seen in the X-ray band. Optical measurements of galaxy
redshifts provide crucial information but represent an uncertain
convolution of the galaxy distance and the gravitational effects of the
clusters in which the galaxies reside. The distance effect comes from the
universal expansion from the Big Bang, where the recessional velocity
(galaxy redshift) follows Hubble's Law $v = H_0 d$, where $v$ is the
velocity in km/s, $d$ is the galaxy distance from us in Mpc (million
parsecs, 1 pc ≈ 3 light years), and $H_0$ is Hubble's constant, known to
be about 72 km/s/Mpc. The cluster gravitational effects must be estimated
or simulated for individual galaxies.
1.2) Identity of the Dataset:-
The dataset consists of 5 variables which are as follows –
1) R.A. i.e. Right Ascension: Coordinate in the sky similar to longitude
on Earth, 0 to 360 degrees.
2) Dec. i.e. Declination: Coordinate in the sky similar to latitude on
Earth, -90 to +90 degrees.
3) Mag i.e. Magnitude: An inverted logarithmic measure of galaxy
brightness in the optical band. A Mag=17 galaxy is 100 times fainter than
a Mag=12 galaxy. The value is missing for some galaxies (recorded as 0 in
this dataset).
4) V i.e. Velocity: Speed of the galaxy moving away from Earth, after
various corrections are applied.
5) SigV i.e. Sigma of velocity: Heteroscedastic measurement error known
for each individual velocity measurement.
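A quick way to inspect these five variables in R (a minimal sketch, assuming the same dataset.txt file read in the Appendix):

data = read.table("dataset.txt", header = T)  # columns: R.A., Dec., Mag, V, SigV
str(data)       # 4215 observations of 5 variables
summary(data)   # ranges of each variable; missing Mag values appear as 0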
1.3) Target for This Dataset:-
For such astrostatistical datasets, astronomers generally use various
hierarchical clustering algorithms. They often use single-linkage
nonparametric hierarchical agglomeration, which they call the
"friends-of-friends" algorithm.

Hence I am interested in applying a variety of multivariate clustering
algorithms and comparing them where possible.
1.4) Why Clustering:-
In astrostatistical analysis we are generally interested in finding
astronomical bodies with similar characteristics. Here, too, our aim is
to analyze how strongly the galaxies cluster and how many clusters they
form on the basis of the above 5 variables. Within a given cluster we can
then say the galaxies have similar characteristics based on these variables.
2. METHODOLOGY
2.1) Brief Idea of Different Hierarchical Clustering
Algorithms:-
The following are the steps in the agglomerative hierarchical
clustering algorithm for grouping N objects (items or variables):
1. Start with N clusters, each containing a single entity, and an N × N
symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.
2. Search the distance matrix for the nearest (most similar) pair of clusters.
Let the distance between the "most similar" clusters U and V be $d_{UV}$.
3. Merge clusters U and V. Label the newly formed cluster (UV). Update
the entries in the distance matrix by deleting the rows and columns
corresponding to clusters U and V and adding a row and column giving
the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N−1 times. (All objects will be in a single
cluster after the algorithm terminates.) Record the identity of clusters that
are merged and the levels (distances or similarities) at which the mergers
take place.
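As a concrete illustration of these four steps, the following R sketch (the toy data here are my own invention, not the galaxy dataset) runs the agglomerative procedure with hclust() and prints the merge record and merge levels described in Step 4:

set.seed(1)
toy = matrix(rnorm(10), nrow = 5)   # 5 toy objects, 2 variables
D = dist(toy)                       # the N x N distance matrix D = {d_ik}
hc = hclust(D, method = "single")   # agglomerate (single linkage here)
hc$merge    # which clusters were merged at each of the N-1 steps
hc$height   # the distance level at which each merger took place
plot(hc)    # dendrogram of the merge history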
2.2) Single Linkage Clustering:-
The inputs to a single linkage algorithm can be distances or
similarities between pairs of objects. Groups are formed from the
individual entities by merging nearest neighbors, where the term nearest
neighbor connotes the smallest distance or largest similarity.
Initially, we must find the smallest distance in $D = \{d_{ik}\}$ and merge
the corresponding objects, say, U and V, to get the cluster (UV). For Step
3 of the above general algorithm, the distances between (UV) and any
other cluster W are computed by

$$d_{(UV)W} = \min\{d_{UW},\, d_{VW}\}$$

Here the quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest
neighbors of clusters U and W and clusters V and W, respectively.
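For example (with invented numbers), if after the first merge the remaining cluster W satisfies $d_{UW} = 2$ and $d_{VW} = 5$, the updated entry is $d_{(UV)W} = \min\{2,\, 5\} = 2$, so (UV) sits as close to W as its nearest member does.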
The results of single linkage clustering can be graphically displayed
in the form of a dendrogram, or tree diagram. The branches in the tree
represent clusters. The branches come together (merge) at nodes whose
positions along a distance (or similarity) axis indicate the level at which
the fusions occur.
2.3) Complete Linkage Clustering:-
Complete linkage clustering proceeds in much the same manner as
single linkage clustering, with one important exception: at each stage, the
distance (similarity) between clusters is determined by the distance
(similarity) between the two elements, one from each cluster, that are most
distant. Thus, complete linkage ensures that all items in a cluster are
within some maximum distance (or minimum similarity) of each other.

The general agglomerative algorithm again starts by finding the
minimum entry in $D = \{d_{ik}\}$ and merging the corresponding objects, such
as U and V, to get cluster (UV). For Step 3 of the above general algorithm,
the distances between (UV) and any other cluster W are computed by

$$d_{(UV)W} = \max\{d_{UW},\, d_{VW}\}$$

Here $d_{UW}$ and $d_{VW}$ are the distances between the most distant members
of clusters U and W and clusters V and W, respectively.
2.4) Average Linkage Clustering:-
Average linkage treats the distance between two clusters as the
average distance between all pairs of items where one member of a pair
belongs to each cluster.
Again, the input to the average linkage algorithm may be distances
or similarities, and the method can be used to group objects or variables.
The average linkage algorithm proceeds in the manner of the above
general algorithm. We begin by searching the distance matrix $D = \{d_{ik}\}$
to find the nearest (most similar) objects, for example, U and V. These
objects are merged to form the cluster (UV). For Step 3 of the above
general agglomerative algorithm, the distances between (UV) and any
other cluster W are determined by

$$d_{(UV)W} = \frac{\sum_{i}\sum_{k} d_{ik}}{N_{(UV)}\, N_{W}}$$

where $d_{ik}$ is the distance between object i in the cluster (UV) and object
k in the cluster W, and $N_{(UV)}$ and $N_{W}$ are the number of items in clusters
(UV) and W, respectively.
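To see how the three update rules behave on the same data, this R sketch (again with toy data of my own, not the galaxy dataset) clusters one distance matrix under each linkage and compares the merge levels:

set.seed(2)
toy = matrix(rnorm(12), nrow = 6)      # 6 toy objects, 2 variables
D = dist(toy)
hc1 = hclust(D, method = "single")     # min of pairwise distances
hc2 = hclust(D, method = "complete")   # max of pairwise distances
hc3 = hclust(D, method = "average")    # mean of pairwise distances
rbind(single = hc1$height,
      complete = hc2$height,
      average = hc3$height)   # compare the N-1 merge levels under each rule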
3. ANALYSIS
I applied the three clustering schemes mentioned above, i.e. the single,
complete and average linkage algorithms, and obtained the following
dendrograms.
Dendrogram of Single Linkage
Dendrogram of Complete Linkage
Dendrogram of Average Linkage
From the above dendrograms one cannot say how much the galaxies
cluster among each other or how many clusters they form.
Hence further analyses are required.
4. METHODOLOGY FOR FURTHER
ANALYSIS
4.1) Model Based Clustering:-
The single linkage, complete linkage and average linkage clustering
methods are intuitively reasonable procedures but that is as much as we
can say without having a model to explain how the observations were
produced. Major advances in clustering methods have been made through
the introduction of statistical models that indicate how the collection of
(p × 1) measurements $\mathbf{x}_j$, from the N objects, was generated. The most
common model is one where cluster k has expected proportion $p_k$ of the
objects and the corresponding measurements are generated by a
probability density function $f_k(\mathbf{x})$. Then, if there are K clusters, the
observation vector for a single object is modeled as arising from the mixing
distribution

$$f(\mathbf{x}) = \sum_{k=1}^{K} p_k f_k(\mathbf{x})$$

where each $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$. This distribution is called a
mixture of the K distributions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})$ because the
observation is generated from the component distribution $f_k(\mathbf{x})$ with
probability $p_k$. The collection of N observation vectors generated from
this distribution will be a mixture of observations from the component
distributions.
The most common mixture model is a mixture of multivariate
normal distributions, where the k-th component $f_k(\mathbf{x})$ is the
$N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ density function. This is known as the Gaussian (or
maximum likelihood) mixture model, assuming the individual clusters are
multivariate normal.

The normal mixture model for one observation $\mathbf{x}$ is

$$f(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K) = \sum_{k=1}^{K} \frac{p_k}{(2\pi)^{p/2} \lvert\boldsymbol{\Sigma}_k\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right)$$

Clusters generated by this model are ellipsoidal in shape with the heaviest
concentration of observations near the center.

Inferences are based on the likelihood, which for N objects and a
fixed number of clusters K is

$$L(p_1, \ldots, p_K, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K) = \prod_{j=1}^{N} f(\mathbf{x}_j \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_K)$$

$$= \prod_{j=1}^{N} \left( \sum_{k=1}^{K} \frac{p_k}{(2\pi)^{p/2} \lvert\boldsymbol{\Sigma}_k\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}_j-\boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_k)\right) \right)$$

where the proportions $p_1, p_2, \ldots, p_K$, the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K$, and
the covariance matrices $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2, \ldots, \boldsymbol{\Sigma}_K$ are unknown. The measurements
for different objects are treated as independent and identically distributed
observations from the mixture distribution.
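As a sketch of how this likelihood can be evaluated numerically (the parameter values below are invented for illustration; dmvnorm() and rmvnorm() come from the mvtnorm package, which is assumed to be installed):

library(mvtnorm)   # multivariate normal density and sampler

p_mix = c(0.3, 0.7)                                  # mixing proportions
mu    = list(c(0, 0), c(3, 3))                       # component means
Sigma = list(diag(2), matrix(c(2, .5, .5, 1), 2))    # component covariances

# log-likelihood of a sample X (rows = observations) under the mixture
mixture_loglik = function(X, p_mix, mu, Sigma) {
  dens = sapply(seq_along(p_mix), function(k)
    p_mix[k] * dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]]))
  sum(log(rowSums(dens)))
}

X = rbind(rmvnorm(30, mu[[1]], Sigma[[1]]),
          rmvnorm(70, mu[[2]], Sigma[[2]]))
mixture_loglik(X, p_mix, mu, Sigma)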
Most importantly, under the above sequence of mixture models for
different K, the problems of choosing the number of clusters and choosing
an appropriate clustering method have been reduced to the problem of
selecting an appropriate statistical model. This is a major advance.

A good approach to selecting a model is to first obtain the maximum
likelihood estimates $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\mu}}_K, \hat{\boldsymbol{\Sigma}}_K$ for a fixed number
of clusters K. These estimates must be obtained numerically using special
purpose software. The resulting value of the maximum of the likelihood,

$$L_{\max} = L(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\mu}}_K, \hat{\boldsymbol{\Sigma}}_K),$$
provides the basis for model selection. How do we decide on a reasonable
value for the number of clusters K? In order to compare models with
different numbers of parameters, a penalty is subtracted from twice the
maximized value of the log-likelihood to give

$$2 \ln L_{\max} - \text{penalty}$$

where the penalty depends on the number of parameters estimated and
the number of observations N. Since the probabilities $p_k$ sum to 1, only
K−1 of them must be estimated, together with K × p means and
K × p(p + 1)/2 variances and covariances, for a total of $K(p+1)(p+2)/2 - 1$
parameters. For the Akaike information criterion (AIC), the penalty is
2 × (number of parameters), so

$$\text{AIC} = 2 \ln L_{\max} - 2\left(\frac{K(p+1)(p+2)}{2} - 1\right)$$

The Bayesian information criterion (BIC) is similar but multiplies the
number of parameters by the logarithm of the number of observations in
the penalty:

$$\text{BIC} = 2 \ln L_{\max} - \ln N \left(\frac{K(p+1)(p+2)}{2} - 1\right)$$
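These formulas can be checked directly in R against the fitted model reported later in this project (log-likelihood −91344.25, N = 4215; note that the df of 139 reported there is smaller than the unconstrained parameter count because the selected VEV model constrains the covariance matrices):

loglik = -91344.25   # maximized log-likelihood from the mclust output below
N  = 4215            # number of galaxies
df = 139             # free parameters of the constrained VEV model
2 * loglik - log(N) * df   # -183848.6, the BIC reported by mclust
2 * loglik - 2 * df        # the corresponding AIC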
Even for a fixed number of clusters, the estimation of a mixture
model is complicated. One current software package, MCLUST, available
in the R software library, combines hierarchical clustering, the EM
algorithm and the BIC criterion to develop an appropriate model for
clustering. In the E-step of the EM algorithm, an (N × K) matrix is
created whose j-th row contains estimates of the conditional (on the
current parameter estimates) probabilities that observation $\mathbf{x}_j$ belongs to
cluster 1, 2, ..., K. So, at convergence, the j-th observation (object) is
assigned to the cluster k for which the conditional probability of
membership,

$$\hat{p}(k \mid \mathbf{x}_j) = \frac{\hat{p}_k \hat{f}_k(\mathbf{x}_j)}{\hat{f}(\mathbf{x}_j)},$$

is the largest.
4.2) EM Algorithm for the Mixture of Gaussians:-

Parameters estimated at the r-th iteration are marked by a
superscript (r).

1. Initialize the parameters (taken arbitrarily by the software).

2. E-step:- Compute the posterior probabilities for all j = 1, ..., n and
k = 1, ..., K:

$$p^{(r)}(k \mid \mathbf{x}_j) = \frac{p_k^{(r)} f_k^{(r)}(\mathbf{x}_j)}{f^{(r)}(\mathbf{x}_j)}$$

3. M-step:- Update the parameters:

$$p_k^{(r+1)} = \frac{1}{n}\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j), \qquad \boldsymbol{\mu}_k^{(r+1)} = \frac{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)\,\mathbf{x}_j}{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)},$$

$$\boldsymbol{\Sigma}_k^{(r+1)} = \frac{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)\,(\mathbf{x}_j - \boldsymbol{\mu}_k^{(r+1)})(\mathbf{x}_j - \boldsymbol{\mu}_k^{(r+1)})'}{\sum_{j=1}^{n} p^{(r)}(k \mid \mathbf{x}_j)}$$

Repeat steps 2 and 3 until convergence.
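For concreteness, here is a minimal R implementation of one E-step/M-step pass for an unconstrained Gaussian mixture (my own illustrative sketch, not the internal code of MCLUST; dmvnorm() is from the mvtnorm package):

library(mvtnorm)

# X is an n x p data matrix; p_mix, mu, Sigma hold the iteration-r estimates;
# the returned list holds the iteration-(r+1) estimates.
em_step = function(X, p_mix, mu, Sigma) {
  n = nrow(X); K = length(p_mix)
  # E-step: n x K matrix of posterior probabilities p(k | x_j)
  post = sapply(1:K, function(k)
    p_mix[k] * dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]]))
  post = post / rowSums(post)
  # M-step: reweighted proportions, means and covariances
  for (k in 1:K) {
    w = post[, k]
    p_mix[k] = mean(w)
    mu[[k]]  = colSums(w * X) / sum(w)
    Xc = sweep(X, 2, mu[[k]])
    Sigma[[k]] = crossprod(Xc * sqrt(w)) / sum(w)
  }
  list(p_mix = p_mix, mu = mu, Sigma = Sigma)
}
# Iterating em_step() until the log-likelihood stabilizes gives the MLEs.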
5. FURTHER ANALYSIS
Using the MCLUST package in the R software library, and specifically
the Mclust() and clustCombi() functions, I first fit the p = 5
dimensional normal mixture model.

Using the BIC criterion, the software chooses K = 8 clusters with
estimated centers $\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_8$, variance-covariance matrices
$\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_8$ and mixing probabilities $\hat{p}_1, \ldots, \hat{p}_8$ (see Appendix 7.2).

The scatter plots from the above analysis are given below.
Multiple scatter plots of K=8 clusters for the data
Multiple scatter plots of K=7 clusters for the data
Multiple scatter plots of K=6 clusters for the data
Multiple scatter plots of K=5 clusters for the data
Multiple scatter plots of K=4 clusters for the data
Multiple scatter plots of K=3 clusters for the data
Multiple scatter plots of K=2 clusters for the data
Multiple scatter plots of K=1 cluster for the data
The cluster classification plot is as follows.
The BIC plot is also given below, where the model codes are as follows:
“EII” = spherical, equal volume
“VII” = spherical, unequal volume
“EEI” = diagonal, equal volume and shape
“VEI” = diagonal, varying volume, equal shape
“EVI” = diagonal, equal volume, varying shape
“VVI” = diagonal, varying volume and shape
“EEE” = ellipsoidal, equal volume, shape, and orientation
“EVE” = ellipsoidal, equal volume and orientation
“VEE” = ellipsoidal, equal shape and orientation
“VVE” = ellipsoidal, equal orientation
“EEV” = ellipsoidal, equal volume and equal shape
“VEV” = ellipsoidal, equal shape
“EVV” = ellipsoidal, equal volume
“VVV” = ellipsoidal, varying volume, shape, and orientation
From the mclustBIC() function, and also from the above plot, we find
that BIC is maximized by the “VEV” model, i.e. ellipsoidal clusters of
equal shape, and that BIC is maximum for 8 cluster components.

Hence the Gaussian finite mixture model with 8 cluster components
fits our dataset well, with the cluster parameters $\hat{p}_1, \ldots, \hat{p}_8$,
$\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_8$, $\hat{\boldsymbol{\Sigma}}_1, \ldots, \hat{\boldsymbol{\Sigma}}_8$ (see Appendix 7.2).
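A minimal sketch of the BIC-based model search described above (mclustBIC() and its plot method are part of the mclust package used in the Appendix):

library(mclust)
bic = mclustBIC(data)   # BIC for every model code and number of components
summary(bic)            # top-ranked models; VEV with 8 components leads here
plot(bic)               # reproduces the BIC plot shown above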
6. CONCLUSION

I have dealt with the Shapley Galaxy dataset by fitting a parametric,
model-based cluster analysis, because from the dendrograms of the usual
clustering algorithms (i.e. single, complete and average linkage) I could
not conclude how the galaxies cluster or how many clusters they form, in
the sense that one could say the galaxies in a given cluster share similar
characteristics based on the given variables. I fitted a Gaussian mixture
model, selected via the Bayesian information criterion (BIC), assuming
each cluster has a multivariate normal distribution. BIC is maximized at
8 cluster components; hence the dataset is a mixture of 8 normal
populations with the estimated parameters.

The analysis carried out in this project can also be applied to similar
astrostatistical data, or to any large dataset for which the dendrograms of
the usual clustering algorithms do not support a valid conclusion; in such
cases one can fit model-based clustering to the data instead. For this
reason I believe this project will be useful for statistical analysis in the
future.
7. APPENDIX
7.1) R Code for Cluster Analysis:-
>data=read.table("dataset.txt",header=T) ### read the data
>d=dist(as.matrix(data)) ### build the distance matrix
>hc1=hclust(d,"complete") ### complete linkage
>hc2=hclust(d,"single") ### single linkage
>hc3=hclust(d,"average") ### average linkage
>plot(hc1,xlab="Objects",ylab="Distance") ### dendrogram of complete linkage clustering
>plot(hc2,xlab="Objects",ylab="Distance") ### dendrogram of single linkage clustering
>plot(hc3,xlab="Objects",ylab="Distance") ### dendrogram of average linkage clustering
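If one further wanted explicit cluster memberships from these trees (not part of the original analysis, shown only as an illustrative extension), cutree() would cut a dendrogram at a chosen number of clusters:

>memb=cutree(hc2,k=8) ### cut the single-linkage tree into 8 clusters
>table(memb) ### sizes of the resulting clusters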
7.2) R Code for Gaussian Mixture Model Analysis:-
>install.packages("mclust") ### install the "mclust" package
>library(mclust) ### load package "mclust"
>summary(Mclust(data),parameters=TRUE) ### fit the mixture model and print parameter values
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 8 components:
log.likelihood n df BIC ICL
-91344.25 4215 139 -183848.6 -184592.5
Clustering table:
1 2 3 4 5 6 7 8
642 380 365 68 1664 311 252 533
Mixing probabilities:
1 2 3 4 5 6 7
0.14844329 0.08948725 0.08623729 0.01747060 0.39351658 0.07365214 0.05926123
8
0.13193162
Means:
[,1] [,2] [,3] [,4] [,5] [,6]
R.A. 202.30359 193.85946 198.48879 203.040838 202.19427 2.020683e+02
Dec. -31.62415 -29.87713 -31.19507 -32.339459 -32.22335 -3.214705e+01
Mag 17.68570 17.80851 17.61867 5.948804 15.92424 3.763733e-16
V 14235.45711 16459.78793 31277.52284 30855.133968 12483.56623 1.236849e+04
SigV 43.91562 44.11998 60.79286 186.819014 65.27643 6.296875e+01
[,7] [,8]
R.A. 194.82766 208.84368
Dec. -29.14227 -31.42455
Mag 17.61551 15.00507
V 20238.86560 7154.17168
SigV 67.83738 49.68235
Variances:
[,,1]
R.A. Dec. Mag V SigV
R.A. 7.287574e-01 -0.006095364 -0.20837492 -306.19632 -0.1675741
Dec. -6.095364e-03 0.096109142 0.00520636 89.80991 0.1991862
Mag -2.083749e-01 0.005206360 1.47481240 39.31490 2.9049595
V -3.061963e+02 89.809908618 39.31490404 2505189.73769 6826.4215651
SigV -1.675741e-01 0.199186214 2.90495950 6826.42157 239.2164453
[,,2]
R.A. Dec. Mag V SigV
R.A. 0.10208391 -0.1039204 0.01686484 8.253725e+01 -5.055577e-01
Dec. -0.10392037 0.5575132 -0.18991703 3.576764e+00 1.159739e+00
Mag 0.01686484 -0.1899170 1.16151860 3.301307e+01 6.897552e-02
V 82.53724504 3.5767638 33.01307187 2.040891e+06 -7.381334e+03
SigV -0.50555774 1.1597385 0.06897552 -7.381334e+03 2.064483e+02
[,,3]
R.A. Dec. Mag V SigV
R.A. 11.6035707 -1.574402 -0.6919389 2.817419e+03 -4.004506
Dec. -1.5744015 5.003039 1.6040287 7.138022e+02 9.742544
Mag -0.6919389 1.604029 1.3414609 6.797225e+02 3.762460
V 2817.4192582 713.802245 679.7224588 1.965393e+07 2531.912748
SigV -4.0045060 9.742544 3.7624604 2.531913e+03 1731.312164
[,,4]
R.A. Dec. Mag V SigV
R.A. 74.035386 7.550342 30.98499 4171.425 -1.681867e+00
Dec. 7.550342 10.345911 14.04735 -8518.099 8.707003e+01
Mag 30.984993 14.047346 126.75194 -47601.402 4.410704e+02
V 4171.425135 -8518.099472 -47601.40192 224442780.655 -1.250386e+06
SigV -1.681867 87.070034 441.07038 -1250386.089 2.673377e+04
[,,5]
R.A. Dec. Mag V SigV
R.A. 18.3679545 -0.7246259 -0.3916425 8192.9301 -5.069535
Dec. -0.7246259 6.6309625 0.6971234 -1344.4685 2.235558
Mag -0.3916425 0.6971234 1.2054110 748.0295 17.488057
V 8192.9301477 -1344.4685000 748.0295188 26212974.6194 51547.113121
SigV -5.0695352 2.2355582 17.4880569 51547.1131 2409.825921
[,,6]
R.A. Dec. Mag V SigV
R.A. 1.617655e+01 -1.366427e-01 -4.049577e-15 -6.690651e+03 3.010281e+01
Dec. -1.366427e-01 5.821269e+00 -1.728647e-14 -6.578589e+02 1.218975e+01
Mag -4.049577e-15 -1.728647e-14 8.296247e-01 -2.067025e-12 -3.177826e-13
V -6.690651e+03 -6.578589e+02 -2.067025e-12 2.239437e+07 3.727633e+04
SigV 3.010281e+01 1.218975e+01 -3.177826e-13 3.727633e+04 2.033529e+03
[,,7]
R.A. Dec. Mag V SigV
R.A. 4.94712752 -0.01616675 -0.9052305 149.4066 -6.258482
Dec. -0.01616675 0.36436345 0.2510239 52.2886 2.212348
Mag -0.90523055 0.25102388 2.5169624 430.3722 -3.275991
V 149.40664650 52.28860010 430.3721890 8746849.8127 -39512.273470
SigV -6.25848182 2.21234752 -3.2759908 -39512.2735 948.865966
[,,8]
R.A. Dec. Mag V SigV
R.A. 11.62256349 -0.09353045 1.2133975 3902.249 37.31454
Dec. -0.09353045 4.66961672 0.4492924 1381.551 16.14600
Mag 1.21339748 0.44929245 1.1493867 1621.072 27.30945
V 3902.24896851 1381.55109212 1621.0715134 17492495.260 56416.61420
SigV 37.31454451 16.14600173 27.3094540 56416.614 1721.86659
>plot(clustCombi(data),data) ### multiple scatter plots of different cluster combinations
>plot(Mclust(data)) ### BIC and classification plots
8. ACKNOWLEDGEMENT
I am very thankful to the Department of Statistics, West Bengal
State University, for their continuous guidance in realizing this project. I
am also thankful to the Department of Astrostatistics and the Eberly
College of Science at Penn State University.
9. REFERENCES
1. Dataset at the Department of Astrostatistics, Penn State University.
URL: http://astrostatistics.psu.edu/
2. Feigelson, E.D. and Babu, G.J. (2012): “Modern Statistical Methods
for Astronomy with R Applications”, Cambridge University Press.
3. Johnson, R.A. and Wichern, D.W. (1998): “Applied Multivariate
Statistical Analysis”, New Jersey: Prentice Hall.
4. Rocke, D.M. and Dai, J., Center for Image Processing and
Integrated Computing, University of California, Davis, CA 95616,
USA: “Sampling and Subsampling for Cluster Analysis in Data
Mining: With Applications to Sky Survey Data”.
5. Li, J., Department of Statistics, The Pennsylvania State University:
“Mixture Models”.
6. Fraley, C. and Raftery, A. (2009): mclust: “Model-Based Clustering
and Normal Mixture Modeling”.