Analyzing Interstellar Data
CAMCOS Report
San José State University
by
Teresa Baral Gianna Fusaro Wesley Ha Aerin Kim
Vaibhav Kishore Yunzhi Lin Gulrez Pathan Xiaoli Tong
Mine Zhao Xinling Zhang
Bradley W. Jackson (Leader)
Fall 2014
1. Introduction
1.1 Voronoi Diagram
1.2 Delaunay Triangulation
1.3 Bayesian Blocks
1.4 HOP
1.4.1 HOP Algorithm
1.4.2 HOP Benefits & Drawbacks
1.4.3 HOP Variants
2. Properties of Dr. Scargle's New Objective Function
2.1 Convexity
2.2 The Equal Density Property
2.3 The Intermediate Density Property
2.4 The Last Change Conjecture
2.5 Experiments with Bayesian Blocks Algorithm
2.5.1. Intuitive Block
2.5.2. How to Find the Block Faster
3. Arbitrarily Connected Bayesian Blocks
3.2 Connected Components
3.3 Connected Components: Input/Output
3.4 Connected Components: Algorithm
3.5 Connected Bayesian Blocks: 2 Density Levels
3.6 Testing and Analysis
4. Clusters
4.1 Dataset
4.2 Bayesian Blocks
4.2.1 Voronoi Diagram
4.2.2 Delaunay Triangulation
4.3 Finding Clusters
4.3.1 Connected components
4.3.2 Clusters
4.3.3 Fewer clusters
4.4 Comparison
4.5 Conclusion
5. Future Directions of Research
5.1 Properties of New Objective Function
5.2 Clusters of Galaxies and Stars
6. References
7. Appendix
7.1 MATLAB Code
7.1.1. Vectorized Bayesian Block Algorithm
7.1.2. Last Change Conjecture Implementation of Bayesian Block Algorithm
7.1.3. Code for finding Bayesian Blocks and connected components using the Bayesian Block algorithm
7.1.4. Finding clusters using the HOP algorithm
7.1.5. Finding Bayesian Blocks based on the area of each triangle
7.2 PYTHON Code
7.2.1. How to Find the Block Faster - Folding Method
1. Introduction
Given a sample of N points in D-dimensional space, there are two classes of problems
that we would like to solve: density estimation and cluster finding.
Inferring the probability distribution function (pdf) from a sample of data is known as
density estimation. It is one of the most critical components of extracting knowledge
from data. For example, given a pdf estimated from point data, we can generate
simulated distributions of data and compare them against observations. If we can
identify regions of low probability within the pdf, we can also detect unusual events.
Clustering is grouping a set of data such that data in the same group are similar to each
other, while data in different groups are very different. There are many algorithms for
finding clusters, such as K-means clustering, hierarchical clustering, and DBSCAN. In our
project we derive a new approach to finding clusters.
1.1 Voronoi Diagram
A Voronoi diagram partitions space into regions called Voronoi cells. From its
properties we can infer the shortest Euclidean distance from any point in space to a data
point, as well as the density of any Voronoi cell.
Figure 1.1 Voronoi diagram where each color region is a Voronoi cell containing exactly one point.
Definitions:
● Let R be a set of distinct points p1, p2, p3,…,pn in the plane, then the Voronoi
diagram of R is a subdivision of the plane into n Voronoi cells such that each
cell contains exactly one point.
● Every location in a given cell is closer to its generating point than to any other.
○ If a point q lies in the same region as point pi, then the Euclidean
distance from pi to q will be shorter than the Euclidean distance from
pj to q, where pj is any other point in R.
In a Voronoi diagram, a Voronoi edge is a subset of the locus of points equidistant from
two generating points of R. A Voronoi vertex is positioned at the center of an
empty circle generated by three or more neighboring points in the plane.
The density of a Voronoi cell is calculated by dividing the number of data points it
contains (one) by the volume (area) of the cell.
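As an illustration, the following MATLAB sketch computes these per-cell densities for a 2-D point set; the variable name p for the n-by-2 array of points is our own assumption, and unbounded cells on the boundary are simply skipped.
% Sketch: per-cell density estimates from a 2-D Voronoi diagram.
% p is an n-by-2 array of data points; unbounded cells are skipped.
[V, C] = voronoin(p);               % V: Voronoi vertices, C: vertex lists per cell
density = nan(size(p, 1), 1);
for i = 1:numel(C)
    idx = C{i};
    if all(idx ~= 1)                % row 1 of V is the vertex at infinity
        vx = V(idx, 1); vy = V(idx, 2);
        k = convhull(vx, vy);       % order the vertices of the (convex) cell
        density(i) = 1 / polyarea(vx(k), vy(k));   % one data point per cell
    end
end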
1.2 Delaunay Triangulation
A Delaunay triangulation is, by definition, a triangulation of the convex hull of the points
in a diagram in which every circumcircle of a triangle is an empty circle. It is used as a
method for binning data. It is also defined as the dual of a Voronoi diagram; therefore
a Delaunay triangulation can be constructed by taking the dual of a Voronoi diagram.
This construction can be seen by examining the image below (fig. 1.2). Each edge of a
Delaunay triangle is created by connecting the data points of two adjacent Voronoi cells
that share a common edge, as shown by the data points Pi and Pj in fig. 1.2. The vertices
of a Delaunay triangle therefore consist of data points. In the example given, the circle
shows a Delaunay triangle formed by connecting the data points Pi, Pj, and an unnamed
data point. A Delaunay triangle has three data points as its vertices; each vertex
contributes its interior angle out of the full 360 degrees surrounding that data point, and
the angles of a triangle sum to 180 degrees, so a Delaunay triangle effectively contains
half a data point. If a Delaunay triangle is considered as a cell, then its nucleus is a
Voronoi vertex. Each Voronoi vertex is adjacent to three Voronoi cells, so it has three
edges connected to it, which form the triangular shape when the dual is taken.
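To make the half-a-point convention concrete, here is a minimal MATLAB sketch, again assuming a point array p, that assigns each Delaunay triangle the density of half a data point divided by its area.
% Sketch: Delaunay triangles as data cells, each holding half a data point.
% p is an n-by-2 array of data points.
tri = delaunay(p(:, 1), p(:, 2));
m = size(tri, 1);
triDensity = zeros(m, 1);
for i = 1:m
    x = p(tri(i, :), 1);  y = p(tri(i, :), 2);
    triDensity(i) = 0.5 / polyarea(x, y);   % half a data point per triangle
end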
Figure 1.2 Delaunay Triangulation (purple diagram) created by taking the dual of a Voronoi Diagram
(blue diagram)
1.3 Bayesian Blocks
Real data rarely follow one simple distribution and nonparametric methods can offer a
useful way of analyzing the data. For example, one of the simplest nonparametric
methods to analyze a one-dimensional data set is a histogram. To construct a histogram,
we need to specify a bin size, and we assume that the estimated distribution function is
piecewise constant within each bin. However, for a massive astronomy dataset this
assumption may not hold.
A histogram can fit any shape of distribution, given enough bins. When the number of
data points is small, the number of bins should be small. As the number of data points
grows, the number of bins should also grow to capture the increasing amount of detail
in the distribution’s shape. This is a basic idea of Dr. Scargle’s Bayesian Block
methods—they are composed of simple pieces, and the number of pieces grows with
the number of data points.
Choosing the right number of bins is critical. It can have a drastic effect: for example,
the number of bins could change the conclusion that a distribution is bimodal rather
than unimodal. Intuitively, we expect that a large bin width will destroy fine-scale
features in the data distribution, while a small width will result in increased counting
noise per bin. Contrary to common practice, it is unnecessary to choose the bin size
before reading the data.
Dr. Scargle’s nonparametric Bayesian method was originally developed to detect bursts
in space and characterize shape of the astronomical data. The algorithm produces the
most probable segmentation of the observation into time intervals in which there is no
statistically significant variation in the observations. The method he first described
utilized a greedy algorithm.
A more efficient algorithm is described in An Algorithm for Optimal Partitioning of
Data on an Interval (2004). A dynamic programming algorithm is implemented. This
algorithm finds a more efficient way of calculating the optimal interval without having
to calculate all possible intervals. The algorithm calculates the value of the fitness
function of the first n cells by using previously calculated values of the first n-1 cells
from the previous iterations plus the calculated value of the last block itself. This
algorithm can be applied to piecewise constant models, piecewise linear models, and
piecewise exponential models. This algorithm can be thought of as an intelligent
binning method where the bin sizes and the number of bins are adapted to the data.
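Schematically, if opt(i) denotes the value of the best partition of the first i cells and F denotes the block fitness, the dynamic program computes
opt(0) = 0,   opt(i) = max over 1 ≤ j ≤ i of [ opt(j − 1) + F(N(j..i), M(j..i)) ],
where N(j..i) and M(j..i) are the total count and total length of cells j through i, and the maximizing j is stored as a change point so that the blocks can be recovered by backtracking. (This is our paraphrase of the recurrence; it matches the structure of the optinterval1 code in Appendix 7.1.1.)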
Bayesian blocks are useful in different contexts. For the pulse problem they were
originally used for, Bayesian blocks were used to determine pulse attributes without
needing to specify models for the pulse shapes. This information was then used to infer
further properties of the pulses. For our research, Bayesian blocks were used to
determine blocks of Voronoi cells and Delaunay triangles with like densities.
1.4 HOP
The HOP algorithm was formulated by Daniel J. Eisenstein and Piet Hut in 1997. It is a
method for finding groups of particles in N-body simulations. It is a grouping algorithm
that divides a set of data points into equivalence classes, such that each point is a member
of only one group. It has many uses in astronomy: finding clusters of stars, finding
clusters of galaxies, locating dwarf galaxies, and determining whether galaxies merge.
Figure 1.3: Voronoi Tessellation and Delaunay Triangulation on a set of data points
1.4.1 HOP Algorithm
The HOP Algorithm is aimed at grouping data points into clusters so that the high
density regions can be distinguished from the low density regions. The HOP process
involves assigning an estimate of the density to every data point. The density function
used to come up with these density estimates divides the number of data points in a
Voronoi cell by its area.
Figure 1.4: The HOP Algorithm begins with randomly selecting a data point
Figure 1.5: Neighborhood densities of a selected data point are estimated
A random point is chosen, and the point in its neighborhood with the highest density is
identified. Each data point is linked to its densest neighbor, and the process of hopping
to higher and higher densities continues until a point is reached that is its own densest
neighbor. This point is called a local maximum, and all the points that hop to a given
local maximum form a max class.
Figure 1.6: The selected data point is linked to its densest neighbor
Figure 1.7: Continue hopping to higher and higher densities
Figure 1.8: Data point which is its own densest neighbor is called local maximum
Figure 1.9: Data points hopping to a local maximum form a max class
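A minimal MATLAB sketch of this hopping procedure is shown below; it assumes a precomputed density vector and a cell array nbrs of neighbor index lists (for example, built from the Voronoi or Delaunay adjacencies), and the variable names are ours.
% Sketch of the basic HOP step: each cell points to its densest neighbor,
% then follows these pointers up to a local maximum (a cell that is its
% own densest neighbor); cells sharing a local maximum form a max class.
n = numel(density);
hopTo = (1:n)';
for i = 1:n
    cand = [i; nbrs{i}(:)];          % the cell itself and its neighbors
    [~, k] = max(density(cand));
    hopTo(i) = cand(k);
end
group = zeros(n, 1);
for i = 1:n
    j = i;
    while hopTo(j) ~= j              % hop to higher and higher densities
        j = hopTo(j);
    end
    group(i) = j;                    % label each cell by its local maximum
end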
1.4.2 HOP Benefits & Drawbacks
The HOP algorithm is a relatively simple algorithm as compared to other grouping
algorithms since it is fast and is based on easy computations. However there are certain
drawbacks to this algorithm, mainly, it is overly sensitive to small fluctuations. This
leads to a grouping style wherein the groups are not optimally distinguished since
minute fluctuations result in the formation of a lot of mini-clusters.
1.4.3 HOP Variants
The HOP algorithm we used in this project is not the only one; there are many variants
to it, some simple and some very complex. A couple of variations to the HOP algorithm
that exist but we did not have the time to explore are:
· EHOP
· JHOP
· Joining a data point to its closest neighbor
· Joining a data point to another along the steepest gradient
· Joining a data point to a denser neighbor with the lowest density
· Repeated HOP algorithm
2. Properties of Dr. Scargle’s New Objective Function
The Bayesian Block algorithm implemented in this project makes use of Dr. Scargle's
new objective function (1),
F(N, M) = N log(N/M) − c.     (1)
The new objective function takes three parameters: N, M, and c. The parameter N is the
number of data points in the data cell and M is its area (volume). The parameter c is
related to the prior distribution on the number of blocks; varying the value of this
constant changes the number of blocks returned in the optimal partition.
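For reference, a minimal MATLAB sketch of this fitness function is given below. The helper name matches the newobjective call in the appendix code, but the implementation shown here is our own reconstruction from the formula rather than Dr. Scargle's code.
% Sketch of the block fitness F(N, M) = N*log(N/M) - c, where N is the
% number of data points in a block, M its area (volume), and c the
% prior-related constant. The appendix routines evaluate this helper once
% per candidate block via arrayfun, so scalar inputs are assumed here.
function val = newobjective(N, M, c)
    val = N * log(N / M) - c;
end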
The derivation of the new objective function can be found in Scargle et al. (2012). The
basic idea is that a prior Poisson distribution was used to derive the formula. This differs
from Dr. Scargle's "old" objective function, which used a Beta function as the prior.
The idea of the new objective function is to model the likelihood that the data cells in a
given block have the same density. It is easy to see why this is an ideal measure for the
Bayesian Block algorithm: the algorithm dynamically finds the partition of the data
space into blocks whose data cells have roughly equal density. The algorithm uses the
values computed for the first i cells to compute the value of the new objective function
for the first i+1 cells, and then takes the partition with the maximum value of the new
objective function as the optimal partition.
2.1 Convexity
Dr. Scargle’s new objective function (1) has many properties, which are applied in the
Bayesian Block algorithm. One of these properties is that the new objective function is
convex. For a function to be convex means that the line segment between any two
points evaluated by the function always lies above the values of the function between
those two points. In other words, if we were to graph the new objective function, it
would look concave up in two and three dimensions.
Another property of a convex function is that it satisfies the inequality
λ f(x1, y1) + (1 − λ) f(x2, y2) ≥ f(λ(x1, y1) + (1 − λ)(x2, y2)) for 0 ≤ λ ≤ 1.
This result of convexity was used in the subsequent proofs of the other properties that
the new objective function satisfies.
Theorem:
The new objective function F(N, M) = N log(N/M) − c is convex, and is strictly convex
when, for two given points (N1, M1) and (N2, M2), the equality N1/M1 = N2/M2 fails
(i.e., f''(t) > 0 for 0 < t < 1).
Proof:
Let (N1, M1) and (N2, M2) be two fixed data points where N1, N2, M1, M2 > 0.
For any value t with 0 < t < 1,
tN1 + (1 − t)N2 > 0 and tM1 + (1 − t)M2 > 0.
Thus we can write f(t) as
f(t) = [tN1 + (1 − t)N2] log( [tN1 + (1 − t)N2] / [tM1 + (1 − t)M2] ) − c
It suffices to show that f''(t) ≥ 0, which implies that Dr. Scargle's new objective
function is convex.
From the derivative properties we get
f''(t) = (N1·M2 − N2·M1)^2 / [ (tM1 + (1 − t)M2)^2 (tN1 + (1 − t)N2) ]
Since N1, N2, M1, M2 > 0, the denominator (tM1 + (1 − t)M2)^2 (tN1 + (1 − t)N2) > 0.
Thus the sign of the second derivative depends on the term (N1·M2 − N2·M1)^2,
which implies f''(t) > 0 if N1/M1 ≠ N2/M2, and otherwise f''(t) ≥ 0.
∴ Dr. Scargle's new objective function is convex, and is strictly convex if the densities
of the two data cells are not equal.
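As a quick numerical sanity check of this property, the following snippet evaluates f(t) on a grid for two arbitrarily chosen cells and confirms that its discrete second differences are nonnegative (the specific numbers are illustrative only).
% Numerical spot-check of convexity of f(t) for two fixed cells.
N1 = 3; M1 = 2; N2 = 5; M2 = 9; c = 1;       % arbitrary positive values
t  = linspace(0, 1, 201);
Nt = t*N1 + (1 - t)*N2;
Mt = t*M1 + (1 - t)*M2;
f  = Nt .* log(Nt ./ Mt) - c;
disp(all(diff(f, 2) > -1e-12))               % prints 1 when f is convex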
2.2 The Equal Density Property
The Equal Density Property states that in an optimal partition of a collection of data
cells into arbitrary blocks, if C1 and C2 are cells with the density of C1 = the density of
C2, then C1 and C2 are in the same block of the optimal partition.
Proof:
We use a proof by contradiction.
Let C1 be a cell with N1 data points and area M1.
Similarly, let C2 be a cell with N2 data points and area M2.
First assume that C1 and C2 are in different blocks of the optimal partition of the data
space, X, such that the density of C1 and C2 are equal to each other.
Call their respective blocks A and B.
Let C3 represent the remainder of block A, where the remainder of block A has N3 data
points and M3 area.
Let C4 represent the remainder of block B, where the remainder of block B has N4 data
points and M4 area.
From our assumption, density of C1 = density of C2, or N1/M1 = N2/M2.
If the partition of X is indeed optimal, then
f(C1 U C3) + f(C2 U C4) > Max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2 U C4) + f(C3)}
, where f is Dr. Scargle's new objective function.
So we achieve a contradiction by showing that max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2
U C4) + f(C3) } ≥ λ[f(C1 U C2 U C3) + f(C4)] + (1 − λ)[f(C3) + f(C1 U C2 U C4)] ≥ f(C1 U
C3) + f(C2 U C4), where the last inequality follows from convexity.
We must find a λ such that 0 ≤ λ ≤ 1 and
1) λ(N1 + N2 + N3, M1 + M2 + M3) + (1- λ)(N3,M3) = (N1 + N3, M1 + M3) and
2) λ(N4, M4) + (1- λ)(N1 + N2+N4, M1 + M2 + M4) = (N2 + N4, M2 + M4)
We can see that λ(N1 + N2 + N3) + (1- λ)N3 = N1 + N3
and λ(M1 + M2 + M3) + (1- λ)M3 = M1 + M3.
Simplifying the equations gives us
3) λ = N1 / (N1+N2)
and
4) λ = M1 / (M1 + M2)
Recall the assumption that the density of C1 = density of C2, or N1/M1 = N2/M2.
Thus N1 = (N2·M1)/M2. Substituting this expression for N1 into 3) shows that it equals 4),
so the same λ satisfies both conditions. Substituting 3) into 1) and 2) provides the desired
results.
This contradicts the assumed strict inequality, so there is an optimal partition with C1
and C2 in the same block, and the Equal Density Property is proved.
2.3 The Intermediate Density Property
The Intermediate Density Property says that in an optimal partition of a collection of
data cells into arbitrary blocks, if C1, C2, and C3 are cells with density of C1 > density of
C2 > density of C3, then when C1 and C3 are in the same block B of the optimal partition,
so is C2.
Proof:
Suppose C1, C2, C3 are cells with density of C1 > density of C2 > density of C3 among a
larger collection of cells for which we are trying to find the optimal partition into
connected blocks. Let C1 U C3 U C4 | C2 U C5 be the part of an optimal partition with C1
and C3 in one block, where C4 denotes the remainder of that block, and C2 in a second
block, where C5 denotes the remainder of that block.
We need to prove that f(C1 U C3 U C4) + f(C2 U C5) ≥ max{ f(C2 U C3 U C5) + f(C1 U
C4), f(C1 U C2 U C5) + f(C3 U C4), f(C1 U C2 U C3 U C4) + f(C5) }, where f represents
Dr. Scargle's new objective function.
Let Ni be the number of data points in Ci and let Mi represent the area of Ci, with
Ni, Mi > 0.
It suffices to show that for some 0 ≤ λ1, λ2, λ3 ≤ 1 with λ1 + λ2 + λ3 = 1 the following will
be true:
N1 + N3 + N4 = λ1(N1 + N4) + λ2(N3 + N4) + λ3(N1 + N2 + N3 + N4)
and N2 + N5 = λ1(N2 + N3 + N5) + λ2(N1 + N2 + N5) + λ3(N5).
Substituting 1 − λ1 − λ2 for λ3 in both equations reduces them to the same condition:
λ1(N2 + N3) + λ2(N1 + N2) = N2
One solution of this system satisfying the constraints is
λ1 = 0, λ2 = N2 / (N1 + N2), λ3 = N1 / (N1 + N2)
Thus max{ f(C2 U C3 U C5) + f(C1 U C4), f(C1 U C2 U C5) + f(C3 U C4), f(C1 U C2 U C3
U C4) + f(C5) } ≥ λ1[f(C2 U C3 U C5) + f(C1 U C4)] + λ2[f(C1 U C2 U C5) + f(C3 U C4)] +
λ3[f(C1 U C2 U C3 U C4) + f(C5)] ≥ f(C1 U C3 U C4) + f(C2 U C5).
The Intermediate Density Property is the basis of the Bayesian Blocks algorithm, which
finds the optimal partition of a collection of cells into arbitrary blocks (regardless of the
dimension). The intermediate density property allows us to find this optimal partition
by first sorting the cells by their densities and then only considering partitions into
blocks of consecutive cells in this order, thus essentially reducing the problem to a
1-dimensional problem.
2.4 The Last Change Conjecture
The Last Change Conjecture says that the entries in the last change vector are always
nondecreasing when finding the optimal partition of a set of cells sorted by density into
arbitrary blocks.
In proving this conjecture, we need to make use of several properties of Dr. Scargle's
new objective function (1),
1) Scalar multiplication property
k·f(x, y) = k(x log(x/y) − c) = kx log(kx/ky) − kc = [kx log(kx/ky) − c] − (k − 1)c =
f(kx, ky) − (k − 1)c
2) Additive property
f(a, b) + f(d, e) = 2[f(a, b)/2 + f(d, e)/2] ≥ 2f((a+d)/2, (b+e)/2) = f(a+d, b+e) − c
3) 2-dimensional convexity property
f((x, y) + ε(x0, y0)) + f((x, y) − ε(x0, y0)) ≥ 2f(x, y)
4) 2-dimensional convexity property
for λ1 > λ2 > 0, f((x, y) + λ1(x0, y0)) + f((x, y) − λ1(x0, y0)) ≥ f((x, y) + λ2(x0, y0)) + f((x, y) −
λ2(x0, y0))
4') 2-dimensional convexity property
for λ1 > λ2 > 0 and λ3 > λ4 > 0,
f((x, y) + (λ1x0, λ3y0)) + f((x, y) − (λ1x0, λ3y0)) ≥ f((x, y) + (λ2x0, λ4y0)) + f((x, y) − (λ2x0, λ4y0))
Proof:
Let Ci be the ith cell, with Ni data points and Mi area such that Ni, Mi > 0.
Assume we have a data set sorted by its density such that the density of C1 > density of
C2 > density of C3 > density of C4.
Then by our assumption, N1/M1 > N2/M2 > N3/M3 > N4/M4.
We must show that if
f(N1 + N2,M1 + M2) + f(N3,M3) ≥ f(N1,M1) + f(N2 + N3, M2 + M3)
then f(N1 + N2, M1 + M2) + f(N3 + N4,M3 + M4) ≥ f(N1,M1) + f(N2 + N3 + N4, M2 + M3
+ M4).
Assume not; that is, suppose
f(N1 + N2, M1 + M2) + f(N3, M3) ≥ f(N1, M1) + f(N2 + N3, M2 + M3)
but f(N1 + N2, M1 + M2) + f(N3 + N4, M3 + M4) < f(N1, M1) + f(N2 + N3 + N4, M2 + M3 +
M4).
Equivalently f(N1+N2,M1+M2) – f(N1,M1) < f(N2+N3+N4,M2+M3+M4) –
f(N3+N4,M3+M4)
and f(N1+N2,M1+M2) – f(N1,M1) ≥ f(N2+N3,M2+M3) – f(N3,M3)
thus f(N2+N3+N4,M2+M3+M4) – f(N3+N4,M3+M4) > f(N2+N3,M2+M3) – f(N3,M3)
which implies f(N2+N3+N4,M2+M3+M4) + f(N3,M3) > f(N2+N3,M2+M3) +
f(N3+N4,M3+M4)
To find a contradiction, it will be sufficient to find λ1 and λ2 such that λ1(N2+N3,M2+M3)
+ λ2(N3+N4,M3+M4) = (N2+N3+N4,M2+M3+M4)
and (1-λ1)(N2+N3,M2+M3) + (1-λ2)(N3+N4,M3+M4) = (N3,M3)
From those equations, we get λ1(N2+N3) + λ2(N3+N4) = N2 + N3 +N4
and λ1(M2+M3) + λ2(M3+M4) = M2 + M3 + M4
This is a system of 2 equations with 2 unknowns (assuming the Ni and Mi are known
quantities), so solving for those 2 unknowns gives
λ1 = [ (N2+N3+N4)(M3+M4) − (N3+N4)(M2+M3+M4) ] / [ (N2+N3)(M3+M4) −
(N3+N4)(M2+M3) ]
λ2 = [ −(N2+N3+N4)(M2+M3) + (N2+N3)(M2+M3+M4) ] / [ (N2+N3)(M3+M4) −
(N3+N4)(M2+M3) ]
After proving the last change conjecture we then went on to implement it in the
Bayesian Block algorithm. The MATLAB code can be found in Appendix A2. The
changes to the code only required altering a few lines of code.
We then did an efficiency comparison between the original Bayesian Block algorithm
and the one implementing the last change conjecture. We used a uniform example of
size 10,000 where each data point was evenly separated on the interval from zero to one.
The value of the constant c was varied to change the number of blocks. The results
from this comparison can be seen in Table 2.1 and Figure 2.1.
Table 2.1: Time Comparison between original Bayesian Block algorithm and last change conjecture
algorithms
Figure 2.1: Graphical Representation of Table 2.1
The results of the example showed that, for every number of blocks in the optimal
partition, the last change conjecture algorithm outperformed the original. The efficiency
of the last change conjecture implementation is O(N2/B), where B is the number of
blocks, so we can see from Figure 2.1 that the running time improves in proportion to
the number of blocks.
2.5 Experiments with Bayesian Blocks Algorithm
Here we would like to introduce some interesting experiments with Dr. Scargle’s
Bayesian Block Algorithm.
2.5.1. Intuitive Block
One way to understand an algorithm better is to run it on a small dataset. Here we ran
the Bayesian Block algorithm on a data set consisting of 5 data points,
[2, 2.01, 2.03, 10.1, 10.11]. There are four possible cut positions in the dataset, and the
most 'intuitive' cut would be between 2.03 and 10.1. However, when you run the
Bayesian Block algorithm, it gives an unintuitive result (two cuts, at 2.005 and 10.105).
Figure 2.2: intuitive cut Figure 2.3: Algorithm’s result
The best partition of the dataset would be a high intensity interval near 2, a high
intensity interval near 10 and a low intensity interval between 2 and 10 which is similar
to the partition that Dr. Scargle’s function describes as the most likely.
To find out what causes the cut to be unintuitive, we checked every possible cut and its
objective values.
Figure 2.4: Each element of the list represents a cut position in the dataset; 0 means there is no cut at
that position and 1 means there is a cut. For example, [0, 0, 1] means there is a cut in the 3rd position.
Surprisingly, our intuitive cut turned out to have the worst objective function value. The
reason is that the intuitive cut is penalized significantly because of the length factor M
in the objective function (1): the objective value gets very small when M (the distance
between the two points) gets large.
The goal of Dr. Scargle's objective function is to partition the whole data space, not just
the data points, and it does a very good job of that. Still, this specific dataset illustrates
the problem that arises when a low intensity region is adjacent to high intensity regions
on either end. We realized this is a shortcoming of the Voronoi diagram, because every
Voronoi cell must contain one data point.
a. Space Dust (Pseudo Points) Method
To avoid penalizing a block that happens to be next to a low density region, we decided
to add some extra points that are not data points; to distinguish them from regular data
points, we call them space dust.
For example, true data points would have weight 1.0 but we would give space dust a
very small weight such as 1/1,000,000. Now we can form the usual Voronoi diagram
in the following modified way. If two points x and y are both data points or both space
dust, then for points between x and y, the points closest to x go to the cell containing x,
and the points closest to y go to the cell containing y, as usual. However, when x is
space dust and y is a regular data point, a point z between x and y goes to the cell
containing x if the distance from z to y is more than 1,000,000 times larger than the
distance from z to x; otherwise z goes to the cell containing y. This is a sort of weighted
version of the Voronoi diagram.
In theory, the weighting of the Voronoi diagram is meant for situations where there is a
particle of space dust between two regular data points. If you have a dust particle z
between two data points x and y that are roughly distance d apart, then the cell
containing z will have length roughly d/1,000,000. This means the density of the cell
containing z will likely be about the same as the densities of the cells containing x and y,
so that x, y, and z end up in the same block of the optimal partition; a dust particle
between two data points is therefore unlikely to change the optimal partition.
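In one dimension the weighted boundary can be written down explicitly; the tiny sketch below, with illustrative positions, shows how small the dust cell becomes under this 1,000,000:1 weighting.
% Sketch: boundary between a dust particle x and a data point y on a line.
% The boundary sits where dist(z, y) = 1e6 * dist(z, x), i.e. at distance
% d/(1e6 + 1) from x, so the dust cell has length of order d/1e6.
w = 1e6;             % weight ratio of data points to space dust
x = 0; y = 1;        % illustrative positions: dust at 0, data point at 1
d = abs(y - x);
boundary = x + d / (w + 1);   % approximately 1e-6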
This method raised two questions, which offer directions for future research: first, how
many particles of space dust should be added, and second, how to choose the window
of dust.
2.5.2. How to Find the Block Faster
Suppose there is a symmetrical gamma ray burst. Gamma ray bursts emit highly
energetic photons, and we can think of a burst as 1-dimensional time-tagged event (TTE)
data that is symmetrical about its maximum emission. Under the assumption that the
TTE data is symmetrical, we can use some 'tricks' to identify significantly distinct pulses
more efficiently.
a. Folding Method
Assuming that we know the center of symmetry, the Bayesian Block algorithm can
run much faster than usual.
For example, suppose we have a set of times t1, t2, …, tn at which photons are detected
on an interval [0, 1], and suppose we know that the gamma ray burst is symmetrical
about t = ½. If the times satisfy t1 < t2 < … < tm < ½ < tm+1 < tm+2 < … < tn, we can fold
the interval at t = ½, average matching data points, and apply the usual 1-dimensional
dynamic programming algorithm on the interval [0, ½]. After that, we unfold the data
and use the partition obtained on [0, ½] and its reverse on [½, 1].
Python code for this Folding method is given in Appendix 7.2.
3. Arbitrarily Connected Bayesian Blocks
Given a 2-dimensional dataset, we can find the optimal partition using Dr. Scargle’s
new objective function. This allows us to adaptively bin areas of space that are roughly
equal in density. The optimal partition is chosen using dynamic programming to decide
which arrangement of cells maximizes Dr. Scargle’s fitness function. Using a binning
method dependent on the composition of the data, also known as “Bayesian Blocks”,
allows for more detailed and significant results.
Before applying ‘optinterval1.m’ to a dataset, included in Appendix A3, we first
partition the data into either a Voronoi diagram or Delaunay triangulation so we can
determine the cells of the dataset and their respective sizes. The “optinterval” function
takes inputs 'n', 'a', and 'c', where 'n' is the number of data points per cell, 'a' is the
sorted areas of each cell, and 'c' is a constant based on the prior distribution. Dr.
Scargle’s new objective function is a 1-dimensional application, thus in order to apply
the function to any multi-dimensional data, the data must be reduced to a single
dimension. By manipulating input ‘a’ to be either areas, volumes, or a side of a polygon
or triangle, this 1-dimensional algorithm can be applied to multi-dimensional data.
After running the objective function, the resulting blocks in the optimal partition
contain cells of roughly equal density, and thus the density of each block is nearly
constant. These Bayesian Blocks are organized by highest to lowest density, such that
the first block has the highest relative density and the last block has the lowest relative
density.
While the optimal configuration of the dataset groups cells together based on size, cells
do not need to be connected in order to belong to the same block. This is why the results
of ‘optinterval1.m’ are arbitrarily connected Bayesian Blocks; the algorithm doesn’t
take connectedness of cells into account.
3.2 Connected Components
Our research focused on finding connected components of space that are roughly equal
or similar in density. To accomplish finding optimal or near-optimal partitions of a
2-dimensional dataset into connected components, we used Dr. Scargle’s new objective
function to first organize our data by density, and then examined connected
components (unions of cells) within k density levels.
The basic idea for finding connected components of relative density is to take the data
binned into blocks by relative density, where
Density(Block 1) >= Density(Block 2) >= …>= Density(Block n)
Each block represents a different density level. We then iterate over each block to find
connected components within k density levels.
3.3 Connected Components: Input/Output
The 'conncomponents2.m' function written by Dr. Bradley Jackson, found in Appendix
A3, can be called as
[x, y, z, w] = conncomponents2(blocks, n, a, adjmat)
The function takes in the inputs: blocks, n, a, and adjmat. The ‘blocks’ input is a cell
array that contains the cells for each Bayesian Block. Input ‘n’ is an array that contains
the number of data points per cell. The ‘a’ input is a sorted array that contains the areas
of each cell. The last input, ’adjmat’, is a sparse matrix containing adjacent pairs of
cells. Inputs ‘n’, ‘a’, and ‘adjmat’ share the same index for each respective cell. The
‘adjmat’ input was obtained using a function written by Dr. Jackson called
‘Delaunay2Adjacencies2.m’.
The ‘conncomponents2’ function has four outputs, x, y, z, and w. Output ‘x’ is a cell
array of the resulting connected components. ‘y’ is an array containing the number of
data points in each component. The ‘z’ output is an array containing the areas of each
component. Lastly, ‘w’ is a sparse matrix of component adjacency pairs.
3.4 Connected Components: Algorithm
To determine whether two cells within k density levels belong in the same connected
component, we first calculate which cells are adjacent pairs. If we use a dual Voronoi
and Delaunay diagram, we can use Voronoi edges to determine which Delaunay
triangles are adjacent. We then store the adjacent pair information in a large sparse
matrix, storing a ‘1’ for adjacencies, and a ‘0’ for non-adjacencies. We used Dr.
Jackson’s ‘Delaunay2adjacencies.m’ to calculate and store adjacencies. The code for
‘Delaunay2adjacencies.m’ is listed in Appendix A3.
To compute connected components, we worked from a program written by Dr. Jackson
called ‘conncomps2.m’. We made a few modifications to the code to obtain connected
components within two density levels.
After running the Bayesian Block partition on the dataset and computing its adjacency
matrix, we can start looking for connected components within the same density level.
Please note in the following pseudocode that Ci refers to cell Ci and {Ci} refers to the
component that Ci belongs to.
1. Assign each cell to its own component.
2. For each Bayesian Block:
a. Find all adjacent pairs that exist in the same block.
i. For every pair of adjacent cells (Ci, Cj):
1. If Ci belongs to a smaller component than Cj, then {Ci} is added to {Cj},
and {Ci} is set to [ ] (empty).
2. Else, {Cj} is added to {Ci}, and {Cj} is set to [ ] (empty).
3. Select all non-empty components and calculate the number of data points in each
component and the area of each component.
4. Calculate the sparse matrix of adjacent components from the matrix of cell
adjacencies: if two adjacent cells do not belong to the same component, then
their components are adjacent.
This algorithm finds connected components of cells in the same Bayesian Block. Since
our criterion for components is strict, in that cells must be connected and in the same
Bayesian Block, there is a high possibility that many components are generated. We
therefore extend our reach to include connected cells within 2 density levels, i.e., at
most 1 density level apart, in the same component.
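The merging step (items 1 and 2 of the pseudocode above) can be sketched in MATLAB as follows. This is our own simplified illustration, not Dr. Jackson's conncomponents2.m; it assumes blocks, n, a, and adjmat have the meanings described in Section 3.3.
% Sketch: merge adjacent cells that lie in the same Bayesian block.
nCells = size(adjmat, 1);
comp = num2cell((1:nCells)');             % each cell starts as its own component
belongs = (1:nCells)';                    % component label of each cell
for b = 1:numel(blocks)
    inBlock = blocks{b}(:);
    [ii, jj] = find(triu(adjmat(inBlock, inBlock)));   % adjacent pairs in block b
    for e = 1:numel(ii)
        ci = belongs(inBlock(ii(e)));  cj = belongs(inBlock(jj(e)));
        if ci == cj, continue; end
        if numel(comp{ci}) < numel(comp{cj})            % smaller joins larger
            [ci, cj] = deal(cj, ci);
        end
        comp{ci} = [comp{ci}; comp{cj}];
        belongs(comp{cj}) = ci;
        comp{cj} = [];
    end
end
components = comp(~cellfun(@isempty, comp));            % connected components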
3.5 Connected Bayesian Blocks: 2 Density Levels
We made several modifications to Dr. Jackson’s “conncomps2” function to find
connected components in the same density level or 1 density level higher. The basic
idea is the same as searching for connected cells in the same density level, but we
change where we look for adjacencies to include in components. We looked for
adjacencies twice: 1) adjacent cells within the same block, and 2) adjacent cells 1
density level apart. We also included an additional check to ensure that only cells in the
same level or 1 density level apart are grouped into the same component. The code for
‘conncomps2mod.m’ can be found in Appendix A3.
3.6 Testing and Analysis
We used two different 2-dimensional datasets to test and analyze optimal or
near-optimal partitions into connected components. The ”Counties of California By
Population Density” (CCPD) dataset has dimensions 58 x 2, numbering all California
counties with their population densities. The Sloan Digital Sky Survey (SDSS) dataset
has dimensions 1523 x 2. We used CCPD as a small, known dataset to test our logic
when making changes to the algorithms.
The CCPD data set is already binned and sorted, but the SDSS data is not. We used
Delaunay triangulation to format the SDSS data, limiting the number of cell
adjacencies to 3 per cell. This is preferred over using a Voronoi diagram, which can
return 3 or more adjacencies per cell.
To call and plot ‘optinterval1’, ‘conncomps2’, and ‘conncomps2mod’, we used
‘conncomp_script.m’, ‘bayesian_plot.m’, ‘population_density.m’, and ‘belongs_to.m’.
All helper functions and scripts are included in Appendix A3.
After running 'optinterval1' on both CCPD and SDSS, we obtained a similar number of
partitions relative to the dataset size. CCPD yields 8 blocks, and SDSS is represented by
6 blocks. Both datasets have smaller blocks that represent the highest and lowest
density regions, while the intermediate blocks contain the majority of cells. For
example, the block partition for SDSS is [215 779 1458 2129 2731 3017], and these
blocks have the following sizes, respectively: [215, 564, 679, 671, 602, 286]. The
lowest and highest density regions, block 1 and block 6, have significantly fewer cells
than blocks 2, 3, 4, and 5. Thus, the arbitrarily connected blocks algorithm finds
significance in blocks that have a sufficiently dissimilar density (either higher or lower)
from the majority. However, these blocks contain cells that are not necessarily
connected.
After running ‘conncomps2.m’ on both CCPD and SDSS, the resulting partitions
varied greatly from the arbitrarily connected Bayesian Blocks. In CCPD, 27 connected
components in the same density level were found. For the larger dataset, SDSS returned
835 connected components. The number of components increased considerably relative
to the size of our data. Unlike the arbitrarily connected blocks, which only filter blocks
by relative density, the connected Bayesian Blocks must be both relatively close in
density and connected. The strict criteria for grouping cells increased the number of
partitions to such a degree that we lost significance in our results.
We ran 'conncomps2mod.m' to add another density level to our connected components.
Again, both datasets had similar results, with a decrease in the number of partitions
compared to 'conncomps2.m': 11 connected components were found in CCPD, along
with 205 connected components in SDSS. Because we included more cells in our search
for adjacencies, more connected cells with similar relative densities were grouped
together. At 2 density levels, the number of components is higher than that of the
arbitrarily connected blocks, yet lower than at density level 1. This is because the
connected components at 2 density levels reflect both connectedness and relative
density, with less restriction than at density level 1.
The CCPD and SDSS datasets were used to test three algorithms: arbitrarily connected
Bayesian blocks, connected Bayesian blocks at 1 density level, and connected Bayesian
blocks at 2 density levels. Fig. 3.1 shows the results of running these algorithms on
CCPD data. The arbitrarily connected Bayesian blocks have the smallest number of
partitions. In connected Bayesian blocks at 1 density level, cells in the same component
must not only have the same density but also be connected. This causes an increase in
the number of components. When CCPD is analyzed for connected Bayesian blocks at
2 density levels, the number of components decreases from density level 1, but still has
a higher number of components than the arbitrarily connected Bayesian blocks. The
criteria for cells in each component is relaxed from requiring the same level density to
similar level densities, allowing more cells in the same component. The same
observation was made with the SDSS data, as shown in figure 3.2.
Figure 3.1: Output for the Counties of California by Population Density dataset
Figure 3.2: Output for the Sloan Digital Sky Survey dataset
4. Clusters
A galaxy is a huge collection of gas, dust and stars. In this section, we are interested in
finding clusters of galaxies in a collection of stars using the Bayesian Blocks algorithm
and the HOP algorithm.
4.1 Dataset
The data set we use in this section is a 2-dimensional data slice obtained from the Sloan
Digital Sky Survey. It is worth mentioning that the Sloan Digital Sky Survey has been
one of the most successful surveys in the history of astronomy. Our dataset contains
1523 data points, which are shown in figure 4.1 below.
Figure 4.1: 2-dimensional Sloan Digital Sky Survey raw data
We can see in figure 4.1 that the data points are not scattered uniformly: there are
some high density regions and some low density regions. We will classify the regions
into different clusters using algorithms based on the calculation of density.
4.2 Bayesian Blocks
We use the Bayesian Block algorithm to find the optimal partition based on the equal
density property. There are two ways to obtain the density of the data points: the
Voronoi diagram and the Delaunay triangulation.
4.2.1 Voronoi Diagram
As discussed in the previous section, “a Voronoi diagram is a partition of a plane into
regions based on distance to points in a specific subset of the plane”(wiki, 2014). In
each Voronoi cell, there is only one data point. The density of each Voronoi cell is
obtained by the number of the data points, which is one, divided by the area of the
Voronoi cell. Using the Bayesian Blocks algorithm, cells with the same density are
grouped into one block. This gives 8 blocks for our data set, shown in figure 4.2 below.
Cells of the same color are in the same block: the red blocks are low density blocks,
while the blue blocks are high density blocks. We notice that the cells in the same block
are not necessarily connected.
In addition, in this part we ignore the Voronoi cells that have infinite edges or whose
edges extend beyond the boundary. This gives finite cell areas, but the method has a
limitation: we lose some data points near the boundaries.
Figure 4.2: Blocks based on Voronoi diagram
4.2.2 Delaunay Triangulation
The Delaunay triangulation is dual to the Voronoi diagram and there is one Voronoi
vertex in the center of each Delaunay triangle. A Delaunay triangle has three data
points as its vertices. In every Delaunay triangle, there is one half data point. By using
a Delaunay triangle, we would not lose a data point from our original data set. One way
to calculate the density is to take the number of data points, which is one half, and
divide it by the area of each triangle. Like the Voronoi diagram, triangulations with
the same density are put into one block. Then we got 10 blocks, which is shown in
figure 3. The triangulations with the same color are in one block. Also, different colors
represent different blocks. However, some triangulations with long edges are separated
into different blocks because of the small areas.
Figure 4.3: Bayesian blocks based on Delaunay triangulation using area
Another way to calculate the density is to take the number of data points and divide it
by the maximum edge length of each triangle. Applying the Bayesian Block algorithm
with this density, we obtained 6 Bayesian Blocks. Figure 4.4 shows the 6 different
blocks, each with a different color. It is clear that the red block represents a low density
level and the blue represents a high density level.
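A sketch of this max-edge-length density is shown below; it mirrors the maxdist computation in the appendix code of Section 7.1.3, with p the point set and tri the list of Delaunay triangles.
% Sketch: density of a Delaunay triangle taken as (1/2) / (longest edge length).
m = size(tri, 1);
maxEdge = zeros(m, 1);
for i = 1:m
    v = p(tri(i, :), :);                      % the triangle's three vertices
    maxEdge(i) = max([norm(v(1,:) - v(2,:)), ...
                      norm(v(1,:) - v(3,:)), ...
                      norm(v(2,:) - v(3,:))]);
end
triDensity = 0.5 ./ maxEdge;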
Figure 4.4: Blocks based on Delaunay triangulation with maximum edge length
Figure 4.5 below is a zoomed-in view of Figure 4.4, which illustrates the advantage of
this method. On the left side of the picture, the triangles with long edges are classified
into the same block as their neighbors. As a result, the number of Bayesian Blocks is
reduced from 10 to 6.
Figure 4.5: Zoomed-in view of Figure 4.4
4.3 Finding Clusters
As we have seen, the Bayesian Blocks method is very efficient at finding optimal
partitions. It works especially well for partitions whose components are not necessarily
connected. It has found many applications, such as detecting gamma ray bursts, and
flares in active galactic nuclei and other variable objects such as the Crab Nebula.
Although the Bayesian Blocks method can be used to find partitions efficiently, its
application is restricted when we want to find clusters, in which the components are
connected. So we introduce a different algorithm (HOP) for identifying clusters of
connected components based on the results from the Bayesian Blocks algorithm.
To find clusters using the HOP algorithm, we need two steps. First, we find the
connected components of the Bayesian blocks. Second, we apply the HOP algorithm to
the connected components to find clusters. In addition, if we want to reduce the number
of clusters, we can apply the HOP algorithm again to those clusters. The details of
each step are described below.
4.3.1 Connected components
There are many ways to define a connected component. In our research, we use the
most straightforward definition: two cells in the same block that share an edge are in the
same connected component.
Using this definition, we found 835 connected components, illustrated in figure 4.6.
Different colors represent different Bayesian blocks, and the number on each cell gives
its connected component index. We can see that the three highlighted triangles have
the same background color, which means they are in the same block. Notice that the
numbers on the orange one and the white one are the same, which means they are in the
same connected component, because they share a common edge. But the number
on the purple triangle is different from the other two, which means this triangle belongs
to a different connected component, because it does not share a common edge with
either of the other two.
Figure 4.6: Connected Components
4.3.2 Clusters
How can we find clusters? We apply the HOP algorithm to those connected
components. The density of a connected component is defined to be the number of data
points in the component divided by the maximum length of that component. Every
connected component is joined to its densest neighbor, provided that neighbor has
higher density.
Taking the circled area in figure 4.7 as an example, the dark blue area has higher
density than the light blue area, and they are in different connected components.
Visually, we think they might be in the same cluster. After applying HOP, they are
eventually in the same cluster, as shown in figure 4.8. In figure 4.8, the number on each
cell indicates its cluster number.
We finally get 249 clusters. This seems like too many clusters, so we need to merge
some of them to reduce the number of clusters.
Figure 4.7: Connected Components Figure 4.8: HOP on Connected Components
4.3.3 Fewer clusters
To reduce the number of clusters, we apply HOP again to the clusters found. The
density of a cluster is defined as the number of data points in the cluster divided by the
maximum length of that cluster. Every cluster is joined to its densest neighbor, provided
that neighbor has higher density. Every cell is then assigned a new cluster number,
shown in figure 4.9. In the end, we reduced the number of clusters to 46.
Figure 4.9: HOP on Clusters
4.4 Comparison
We also compared the HOP-only algorithm with the HOP algorithm applied to Bayesian
Blocks. HOP alone is overly sensitive to minor fluctuations in the data, which can create
a large number of local maxima. In our research, it gives 390 clusters, while the HOP
algorithm on Bayesian Blocks gives 249 clusters. If we examine a small area, we can
also see how the number of HOP-only clusters is affected by the density of each cell.
For instance, in the highlighted area in figure 4.10, three triangles form three different
clusters according to their densities. However, in figure 4.11, they are in the same
cluster. To sum up, the HOP-on-Bayesian-Blocks approach is more robust.
Figure 4.10: HOP only (390 clusters) Figure 4.11: HOP on Bayesian Blocks(249 clusters)
4.5 Conclusion
The HOP algorithm is probably the most efficient cluster-finding algorithm. But when it
is applied alone, for astronomical data in particular, it seems overly sensitive to small
fluctuations in the density and often results in a lot of mini clusters.
However, when we apply the HOP algorithm to the connected components of the
Bayesian Blocks, it leads to fewer, larger clusters, although it still takes a large amount
of time to run the Bayesian Blocks algorithm.
Compared with the 835 connected components of the Bayesian Blocks, the method of
"HOP + connected components" gives only 249 clusters. Nevertheless, further reducing
the number of clusters is still desirable for analyzing the astronomical data. Therefore,
we repeat the HOP algorithm on the connected components of the Bayesian Blocks and
obtain 46 clusters.
Repeating the HOP algorithm on connected components might be the best method we
used during our research. But there is a risk with such repeated use of the HOP
algorithm: we may end up joining two clusters, each with a very high local maximum,
at a very low density level. In most cases, it might be more appropriate for them to
remain separate.
Furthermore, both HOP and repeated HOP on the connected components of the Bayesian
Blocks are probably not effective at finding filamentary clusters, because they partition
based only on density without considering the shape of the clusters, while filamentary
clusters usually make up a significant portion of astronomical data.
5. Future Directions of Research
In this section we present topics that we were interested in, but did not have enough
time to complete. These ideas could be implemented by a curious researcher.
5.1 Properties of New Objective Function
One possible avenue of future research is to look into other ways of increasing the
efficiency of the Bayesian Block algorithm. The two methods that we looked into,
vectorization and implementing the last change conjecture, were good methods, but
definitely not the only ones.
Another aspect of the project that we would have liked to spend more time on was
determining which algorithm for the symmetric Bayesian Blocks problem performed
the best. We would also have liked to find other, possibly more efficient, algorithms
for the symmetric Bayesian Block problem.
5.2 Clusters of Galaxies and Stars
In our research, we focused on 1-d and 2-d Voronoi diagrams. Using the Bayesian
Blocks and HOP algorithms on 3-d Voronoi diagrams to find clusters is therefore one
possible future direction for this research. Which HOP algorithm to apply is also worthy
of further research. Besides the HOP algorithm implemented in our research, there are
other versions, for example the HOP algorithm of
a) Joining a cell to its closest neighbor;
b) Or joining a cell to another along the steepest gradient, which is an improvement on
hopping to a cell’s densest neighbor;
c) Or joining a cell to a denser neighbor with the lowest density.
Moreover, it is worth exploring whether the Bayesian Block algorithm, to whose
connected components we apply repeated HOP, can be sped up further.
Finally, because repeatedly applying the HOP algorithm would eventually result in a
single cluster, we are curious about when to stop using the HOP algorithm so as to
avoid combining two clusters that should remain separate.
6. References
Eisenstein, D. J., & Hut, P. (1998). HOP: A New Group-Finding Algorithm for N-Body
Simulations. The Astrophysical Journal, 498, 137-142.
Jackson, B., et al. (2004). An Algorithm for Optimal Partitioning of Data on an Interval.
IEEE Signal Processing Letters, 12 (2), 105-108.
Jackson, B., et al. (2003). Optimal Partitions of Data in Higher Dimensions.
Scargle, J. D., et al. (1998). Studies in Astronomical Time Series Analysis. V. Bayesian
Blocks, a New Method to Analyze Structure in Photon Counting Data. The
Astrophysical Journal, 504, 405.
Scargle, J. D., et al. (2012). Studies in Astronomical Time Series Analysis. VI. Bayesian
Block Representation. The Astrophysical Journal, 764 (2).
7. Appendix
7.1 MATLAB Code
This section includes all of the MATLAB code that we created or implemented during
this project. All of the code in this section (and the Python code section) should suffice
to replicate our work.
7.1.1. Vectorized Bayesian Block Algorithm
To improve the efficiency of an algorithm one easy solution is to implement a
vectorized version of the original. This algorithm performs the same operations as the
original Bayesian Block algorithm, but replaces the inner for loop with a vectorized
function call and a vectorized calculation.
%
%%Vectorized version of the Bayesian Block Algorithm
%This vectorized version should increase the efficiency
%of the original BB algorithm
%
function [part,opt,lastchange] = optinterval1(N,A,c)
%
% N is a vector containing the number of data points in each interval
% A is a vector containing the length of each interval
% c is a specified constant
%
% An O(N^2) algorithm for finding the optimal partition of N data points on an interval
n = length(N);%gets the number of cells
opt = zeros(1,n+1); %initialization
lastchange = ones(1,n); %initialization
changeA = zeros(1,n+1); changeN = zeros(1,n+1);
changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N);
endobj = zeros(1,n); optint = zeros(1,n);
% Consider all possible last blocks for the optimal partition
%
% endN = the number of points in the possible last blocks
% endA = the lengths of the possible last blocks
for i = 1:n
endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i);
endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i);
% Computing the values for all possible last blocks in endobj
newc=c*ones(size(endN));
endobj(1:i)= arrayfun(@newobjective,endN(1:i),endA(1:i),newc);
% Computing the values for all possible optimal partitions in optint
optint(1:i) = opt(1:i)+endobj(1:i);
% The optimal partition is the one with the maximum value
% opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0.
% lastchange(i) is the first cell in the last block of an optimal partition
[opt(i+1),lastchange(i)] = max(optint(1:i));
end
% backtracking to find the blocks of the optimal partition and its change points
i = 1; part(1,1)=n; k = n;
while k > 0
i=i+1;
k = lastchange(k)-1; if k > 0 part(i,1) = k; end
end
7.1.2. Last Change Conjecture Implementation of Bayesian Block Algorithm
Another way to increase the efficiency of the Bayesian Block algorithm was to code in
a property of the new objective function. The last change conjecture (Section 2.4) says
that when the data cells are sorted by density, the last change point is always
nondecreasing. This property allows us to eliminate some of the unnecessary
computation in the inner loop.
%
%%Last Change Conjecture implementation of Bayesian Block Algorithm
%a dynamic programming algorithm with efficiency O(N^2/B)
%where B is the number of blocks
%Makes use of vectorization
%
function [part,opt,lastchange] = newoptinterval1(N,A,c)
% N is a vector containing the number of data points in each interval
% A is a vector containing the length of each interval
% c is a specified constant
%
% An O(N^2) algorithm for finding the optimal partition of N data points on an interval
%gets the number of cells
n = length(N);
%initialization
opt = zeros(1,n+1);
lastchange = ones(1,n);
changeA = zeros(1,n+1); changeN = zeros(1,n+1);
changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N);
endobj = zeros(1,n); optint = zeros(1,n);
% Consider all possible last blocks for the optimal partition
%
% endN = the number of points in the possible last blocks
% endA = the lengths of the possible last blocks
%
% k = the value of the last change point
%
% algorithm only checks up to k for the current change point
for i = 1:n
if i==1 k=1; else k= lastchange(i-1);
end
endN = ones(1,i); endN(k:i) = endN(k:i)*changeN(i+1)-changeN(k:i);
endA = ones(1,i); endA(k:i) = endA(k:i)*changeA(i+1)-changeA(k:i);
% Computing the values for all possible last blocks in endobj
newc=c*ones(size(endN(k:i)));
endobj(k:i)= arrayfun(@newobjective,endN(k:i),endA(k:i),newc);
% Computing the values for all possible optimal partitions in optint
optint(k:i) = opt(k:i)+endobj(k:i);
% The optimal partition is the one with the maximum value
% opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0.
% lastchange(i) is the first cell in the last block of an optimal partition
[opt(i+1),lastchange(i)] = max(optint(k:i));
%Adjust last change point values to correspond to correct points
lastchange(i)=lastchange(i)+k-1;
end
% backtracking to find the blocks of the optimal partition and its changepoints
i = 1; part(1,1)=n; k = n;
while k > 0
i=i+1;
k = lastchange(k)-1; if k > 0 part(i,1) = k; end
end
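The driver used for the timing comparison in Table 2.1 is not reproduced in this appendix; a rough sketch of such a comparison on a uniform example, with illustrative sizes and constant, is given below. It assumes optinterval1, newoptinterval1, and newobjective are all on the path.
% Illustrative timing comparison on a uniform example (made-up parameters)
n = 10000; % number of cells
N = ones(1,n); % one data point per cell
A = ones(1,n)/n; % equal cell lengths on [0,1]
c = 10; % vary c to change the number of blocks in the optimal partition
tic; [p1,o1,l1] = optinterval1(N,A,c); t1 = toc;
tic; [p2,o2,l2] = newoptinterval1(N,A,c); t2 = toc;
fprintf('original: %.2f s, last change version: %.2f s\n', t1, t2);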
7.1.3. Code for finding Bayesian Blocks and connected components using Bayesian Block algorithm
p=presentation2d;
tri=delaunay(p);
[n,~]=size(p);
[m,~]=size(tri);
for i=1:m
maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)), norm(p(tri(i,1),:)-p(tri(i,3),:)), norm(p(tri(i,3),:)-p(tri(i,2),:))]);
end
[r,s] = sort(maxdist);
[x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10);
dt = DelaunayTri(p);
triplot(dt)
blockno = 6; blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
if i == 1 first = 1; else first = blockends(i-1)+1; end
last = blockends(i);
for j = first:last
patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i.
end
end
blockno = 6; blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
if i == 1 first = 1; else first = blockends(i-1)+1; end
last = blockends(i);
cellblock(s(first:last))=i ;
end
voronoi(p(:,1),p(:,2))
for i = 1:3017
pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3];
end
triplot(dt)
% Assign labels to the triangles.
numtri = size(tri,1); triangleno = [1:3017];
plabels = arrayfun(@(n) {sprintf('%d', triangleno(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
'bold', 'HorizontalAlignment','center', ...
'BackgroundColor', 'none');
hold off
%computing the adjacencies of the Delaunay triangulation and the components of
%Bayesian blocks
[adjmat,adjcell] = Delaunay2dadjacencies(p);
n =.5*ones(3017,1);
a = transpose(maxdist);
[r,s] = sort(a./n);
blocks = cell(6,1);
for i = 1:6
if i == 1 first = 1;
else first = blockends(i-1)+1;
end;
last = blockends(i);
blocks{i} = s(first:last);
end
[w,x,y,z] = conncomponents2(blocks,n,a,adjmat);
% Assign labels to the triangles.
for i=1:835
cellcomp(w{i}) = i;
end
for i = 1:3017
pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3];
end
triplot(dt)
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
'bold', 'HorizontalAlignment','center', ...
'BackgroundColor', 'none');
hold off
blockno = 6; blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
if i == 1 first = 1; else first = blockends(i-1)+1; end
last = blockends(i);
for j = first:last
patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i.
end
end
function [part,opt,lastchange] = optinterval1(N,A,c)
% An O(N^2) algorithm for finding the optimal partition of N data points on an interval
% N is a vector containing the number of data points in each interval
% A is a vector containing the length of each interval
n = length(N); opt = zeros(1,n+1); lastchange = ones(1,n);
changeA = zeros(1,n+1); changeN = zeros(1,n+1);
changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N);
endobj = zeros(1,n+1); optint = zeros(1,n+1);
for i = 1:n
endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i);
endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i);
for j = 1:i
endobj(j) = newobjective(endN(j),endA(j),c);
optint(j) = opt(j)+endobj(j);
end
[opt(i+1),lastchange(i)] = max(optint(1:i));
end
i = 1; part(1,1)=n; k = n;
while k > 0
i=i+1; k = lastchange(k)-1; if k > 0 part(i) = k; end
end
function [value] = newobjective(x,y,c)
% A function that computes the objective function value of a block
% x is the number of data points in the block and y is the length (area) of the block
value = x*(log(x/y)) - c;
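For orientation (this check is not part of the original listing), a single call on made-up values shows the form of the returned value: a block with 4 data points, area 0.5, and constant c = 10 gives 4*log(4/0.5) - 10, which is approximately -1.68.
% Illustrative evaluation with made-up values
val = newobjective(4, 0.5, 10); % 4*log(8) - 10, approximately -1.68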
function [adjmat,adjcell] = Delaunay2dadjacencies(p)
% p is an n x 2 matrix of 2-d points
% computing adjacencies of Delaunay triangles from Voronoi edges
[vert,bound] = voronoin(p); tri = delaunay(p); [j,~] = size(vert);
adjmat = sparse(j,j); adjcell = cell(j,1); [n,~] = size(p);
%storing adjacencies in a sparse adjacency matrix
for i=1:n
num = length(bound{i});
for h = 1:num
if h < num
adjmat(bound{i}(h),bound{i}(h+1))=1;
adjmat(bound{i}(h+1),bound{i}(h))=1;
else
adjmat(bound{i}(h),bound{i}(1))=1;
adjmat(bound{i}(1),bound{i}(h))=1;
end
end
end
for i=1:j-1
trino(i) = intersect(bound{tri(i,1)},intersect(bound{tri(i,2)},bound{tri(i,3)}));
end
adjmat = adjmat(trino,trino);
%storing adjacencies as a cell array
for i = 1:j-1
adjcell{i}=find(adjmat(i,:));
end
function [comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,n,a,adjmat)
% finding the connected components of a partition given by cell array blocks
% two cells with the same color are in the same component
cellno = length(n); color = [1:cellno];
for i=1:cellno comps{i} = [i]; end;
% going through adjacencies to update components
for i = 1:length(blocks);
[row,col] = find(adjmat(blocks{i},blocks{i}));
row = blocks{i}(row); col = blocks{i}(col);
for j = 1:length(row)
if color(row(j)) ~= color(col(j))
if length(comps{color(row(j))}) >= length(comps{color(col(j))});
comps{color(row(j))} = union(comps{color(row(j))},comps{color(col(j))});
x = comps{color(col(j))}; y = color(col(j));
color(x) = color(row(j));
comps{y} = [];
else
comps{color(col(j))} = union(comps{color(row(j))},comps{color(col(j))});
x = comps{color(row(j))}; y = color(row(j));
color(x) = color(col(j));
comps{y} = [];
end
end
end
end
nonempty = [];
for i = 1:cellno
if length(comps{i}) > 0
nonempty = [nonempty i];
end
end
comps = comps(nonempty);
for i = 1:length(comps)
ncomps(i) = sum(n(comps{i}));
acomps(i) = sum(a(comps{i}));
color(comps{i}) = i;
end
[row,col] = find(adjmat);
adjcomps = sparse(length(row),length(row));
for i = 1:length(row)
if color(row(i)) ~= color(col(i))
adjcomps(color(row(i)),color(col(i))) = 1;
adjcomps(color(col(i)),color(row(i))) = 1;
end
end
%%% conncomp_script.m
% Import Data from F14PresentationData
p = presentation2d;
% computing the Bayesian Blocks partition of the Delaunay triangulation using maxdist for the size of a cell
tri = delaunay(p);
[tri_size, ~] = size(tri);
for i = 1:tri_size
maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)), norm(p(tri(i,1),:)-p(tri(i,3),:)), norm(p(tri(i,3),:)-p(tri(i,2),:))]);
end
% use maxdist between points in the triangle as area and sort area
[r,s] = sort(maxdist);
% create optimal partition
[x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10);
% blockends
blockno = length(x); blockends = fliplr(transpose(x));
% conncomps2 takes transpose
n =.5*ones(3017,1); a = transpose(maxdist);
[r,s] = sort(a./n);
% initialize blocks
blocks = cell(blockno,1);
% assign cells blocks cell array
for i = 1:blockno
if i == 1
first = 1;
else
first = blockends(i-1)+1;
end
last = blockends(i);
blocks{i} = s(first:last);
end
% Computing the adjacencies of the Delaunay triangulation and the components of the Bayesian Blocks
[adjmat,adjcell] = Delaunay2dadjacencies(p);
% calculate connected components
[comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,n,a,adjmat);
% Assign labels to the triangles
for i=1:length(comps)
cellcomp(comps{i}) = i;
end
% draw and color connected components
dt = DelaunayTri(p);
triplot(dt)
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)');
map = colormap(hsv(length(comps)));
for i = 1:length(comps)
current_comp = comps{i};
for j = 1:length(current_comp)
patch(p(tri(current_comp(j),:),1),p(tri(current_comp(j),:),2),map(i,:)); % use color i.
end
end
%% conncomps2mod function
%% finds connected cells within 2 density levels and groups them into components
%% returns components, number of data points in components, area of components,
%% and adjacent components
function [comps,ncomps,acomps,adjcomps] = conncomps2mod(blocks,n,a,adjmat)
% figuring out references to blockends
blockends=[];
for temp=1:length(blocks)
blockends=[blockends blocks{temp}(length(blocks{temp}))];
end
% finding the connected components of a partition given by cell array blocks
% two cells with the same color are in the same component
cellno = length(n); color = [1:cellno];
for i=1:cellno comps{i} = [i]; end;
% going through adjacencies to update components
for i = 1:length(blocks)
% find all pairs of cells in current block that are adjacent
% and store their cell number
[row,col] = find(adjmat(blocks{i},blocks{i}));
% finding adjacencies between blocks{i} and blocks{i+1}
if i<length(blocks)
[row2, col2] = find(adjmat(blocks{i},blocks{i+1}));
% reindexing row2 and col2 -> all indexes in row2 and col2 reference
% a cell in either block(i) or block(i+1)
for k=1:length(row2)
row2_block_number = belongs_to(blocks{i}(row2(k)), blocks);
col2_block_number = belongs_to(blocks{i+1}(col2(k)), blocks);
if row2_block_number
row2(k) = blocks{row2_block_number}(row2(k));
else
row2(k) = blocks{i+1}(row2(k));
end
if col2_block_number
col2(k) = blocks{col2_block_number}(col2(k));
else
col2(k) = blocks{i+1}(col2(k));
end
end
end
% grab cell number of adjacency index from block cell array
row = blocks{i}(row); col = blocks{i}(col);
row = vertcat(row, row2);
col = vertcat(col, col2);
% comment out above two lines and uncomment the below two lines
% depending if the data needs horzcat versus vertcat
% row = horzcat(row,transpose(row2));
% col = horzcat(col, transpose(col2));
for j = 1:length(row)
% if the color of the two adjacent cells don’t already belong to same component
if color(row(j)) ~= color(col(j))
% number of components in color array for cell in row is greater than
% number of components in color array for cell in col
if length(comps{color(row(j))}) >= length(comps{color(col(j))});
% check to see if the incoming cell(s) are within one
% level density apart
head_cell_block_number = belongs_to(color(row(j)),blocks);
possible_guests = belongs_to(color(col(j)),blocks);
difference = head_cell_block_number - possible_guests;
% do nothing if components contain cells from blocks of
% more than one density apart
if(abs(difference) > 1)
comps{color(col(j))} = comps{color(col(j))};
else
% place union of components for adjacent cells in components
% for row cell
comps{color(row(j))} = union(comps{color(row(j))},comps{color(col(j))});
x = comps{color(col(j))}; y = color(col(j));
% store new location of cell in color
color(x) = color(row(j));
comps{y} = [];
end
else
% check to see if the incoming cell(s) are within one
% level density apart
head_cell_block_number = belongs_to(color(col(j)),blocks);
possible_guests = belongs_to(color(row(j)),blocks);
difference = head_cell_block_number - possible_guests;
% do nothing if components contain cells from blocks of
% more than one density apart
if(abs(difference) > 1)
comps{color(row(j))} = comps{color(row(j))};
else
% place union of components for adjacent cells in components
% for col cell
comps{color(col(j))} = union(comps{color(row(j))},comps{color(col(j))});
x = comps{color(row(j))}; y = color(row(j));
% store new location of cell in color
color(x) = color(col(j));
comps{y} = [];
end
end
end
end
end
nonempty = [];
for i = 1:cellno
if length(comps{i}) > 0
nonempty = [nonempty i];
end
end
% keep only nonempty connected components
comps = comps(nonempty);
for i = 1:length(comps)
ncomps(i) = sum(n(comps{i}));
acomps(i) = sum(a(comps{i}));
% re-indexing color to reference new component array
color(comps{i}) = i;
end
% all adjacencies
[row,col] = find(adjmat);
% create sparse matrix with same size as all adjacencies
adjcomps = sparse(length(row),length(row));
for i = 1:length(row)
% cells do not belong to same component
if color(row(i)) ~= color(col(i))
adjcomps(color(row(i)),color(col(i))) = 1;
adjcomps(color(col(i)),color(row(i))) = 1;
end
end
%% belongs_to.m
function [ block_number ] = belongs_to( cell_number, blocks )
% belongs_to: helper function to return the block number that the cell
% belongs to (returns 0 if the cell is not found in any block)
block_number = 0;
for i=1:length(blocks)
for k=1:length(blocks{i})
if blocks{i}(k) == cell_number
block_number = i;
return;
end
end
end
end
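As a side note, and only as a sketch that is not part of the report's code, the same lookup could be precomputed once per call to conncomps2mod instead of scanning every block on each query. The name cell2block below is hypothetical, and cellno is the total number of cells as defined in conncomps2mod.
% Optional precomputed lookup table (illustrative sketch only)
cell2block = zeros(1, cellno); % cellno = total number of cells
for b = 1:length(blocks)
cell2block(blocks{b}) = b; % block number of every cell in block b
end
% cell2block(cell_number) then plays the role of belongs_to(cell_number, blocks)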
%% population_density.m
%% takes data from population density .xls file and finds
%% the bayesian blocks and connected components
% Data is already sorted on import
[x,y,z] = optinterval1(pop/10000,area,15)
% import neighbors into matrix called Adjacencies
[num_cols, num_rows] = size(adjacencies);
adjmat = sparse(num_cols, num_cols);
for i=1:num_cols
for j=1:num_rows
if isnan(adjacencies(i,j)) == 0
adjmat(i,adjacencies(i,j)) = 1;
end
end
end
blockno = length(x);
blocks = cell(blockno,1);
x = flipud(x);
for i = 1:blockno
if i == 1
first = 1;
else first = x(i-1)+1;
end;
last = x(i);
blocks{i} = (first:last);
end
[comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,pop/1000,area,adjmat);
%% bayesian_plot.m script
%% plots data and colors cells in the same block the same color
% importing data from CAMCOS2ddataset
p = presentation2d;
% computing the triangles of the Delaunay triangulation
tri = delaunay(p);
[n,~] = size(p);
[m,~] = size(tri);
% computing the areas of the triangles
for i = 1:m [~,area(i,1)] = convhull(p(tri(i,:),:)); end
% sorting the triangles by density/area
[r,s] = sort(area);
% applying the Bayesian Blocks algorithm to the sorted data
tic
[x,y,z] = optinterval1(.5*ones(m,1),area(s),10);
toc
% use colormap for pretty colors
map = colormap(hsv(length(x)));
blockno = length(x); blockends = x(blockno:-1:1);
dt = DelaunayTri(p);
% Drawing the Delaunay triangulation
triplot(dt)
% coloring the blocks of the optimal partition
for i = 1:blockno
if i == 1
first = 1; else first = blockends(i-1)+1;
end
last = blockends(i);
for j = first:last
% use color i.
patch(p(tri(s(j),:),1),p(tri(s(j),:),2),map(i,:));
end
end
7.1.4. finding clusters using HOP algorithm
% Computing the adjacencies of the components of the Bayesian Blocks and the HOP partition of these components
for i = 1:835
cc{i} = find(z(i,:));
end
[w1,x1,y1,z1] = HOPPlus2Fall14(x,y,cc);
% size(w1) = 249
for i = 1:249 w2{i} = [];
for j = 1:length(w1{i}) w2{i} = union(w2{i},w{w1{i}(j)});
end;
cellpart(w2{i}) = i;
i
w2{i}
end
for i = 1:3017
pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3];
end
triplot(dt)
% Assign labels to the triangles.
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellpart(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
'bold', 'HorizontalAlignment','center', ...
'BackgroundColor', 'none');
hold off
blockno = 6; blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
if i == 1 first = 1; else first = blockends(i-1)+1; end
last = blockends(i);
for j = first:last
patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i.
end
end
% computing the cells www{j} in each component j and the Bayesian block cellblock(k) that each cell k is in
for i = 1:835
www{i} = s(w{i});
end
for i = 1:blockno
if i == 1 j = 1;
else j = blockends(i-1)+1;
end;
cellblock(s(j:blockends(i))) = i;
end
Index = [];
for i = 1:length(w1)
for j = 1:length(w1{i});
Index(w1{i}(j)) = i;
end
end
% find adjacencies of the clusters
adjclusters = sparse(length(w1),length(w1));
for i = 1:length(w1)
for j = 1:length(w1{i});
adjs = cc{w1{i}(j)};
for k = 1:length(adjs);
adjclusters(i, Index(adjs(k))) = 1;
adjclusters(Index(adjs(k)), i) = 1;
end
end
end
for i = 1:length(w1)
ss{i} = find(adjclusters(i,:));
end
[w3,x3,y3,z3] = HOPPlusFall14(x1,y1,ss);
% figure
%length(w3)=46
for i = 1:46 w4{i} = [];
for j = 1:length(w3{i})
for k =1: length (w1{w3{i}(j)})
w4{i} = union(w4{i},w{w1{w3{i}(j)}(k)});
end;
end;
cellsecond(w4{i}) = i;
i
w4{i}
end
for i = 1:3017
pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3];
end
triplot(dt)
% Assign labels to the triangles.
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellsecond(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
'bold', 'HorizontalAlignment','center', ...
'BackgroundColor', 'none');
hold off
blockno = 6; blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
if i == 1 first = 1; else first = blockends(i-1)+1; end
last = blockends(i);
for j = first:last
patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i.
end
end
[ww,xx,yy,zz] = HOPPlus2Fall14(n,a,adjcell);
%%% hop plot without color
dt = DelaunayTri(p);
triplot(dt)
for i = 1:length(ww)
cellP(ww{i})=i;
end
% Assign labels to the triangles.
for i = 1:3017
pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3];
end
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellP(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
'bold', 'HorizontalAlignment','center', ...
'BackgroundColor', 'none');
hold off
function [ X,NX,AX,cX] = HOPPlusFall14(N,A,c)
% HOPPlus Partitions the data cells so that each cell is joined to its most dense neighbor
% Each part of the partition contains one local maximum
% sorting the cells by their densities which is O(Nlog N)
[~,I] = sort(A./N); %sorting by decreasing density
% changing to new coordinates obtained by sorting
N=N(I); A=A(I); c=c(I);
% X{i} = cells in the ith part of the HOP partition
% R(i) = part of the HOP partition containing cell i
% cX{i} = neighboring cells of the ith part of the HOP partition
% NX(i) = number of data points in the ith part of the HOP partition
% AX(i) = area/volume of the ith part of the HOP partition
j = length(N);
X = cell(j); R=ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j);
k=0; % k = number of parts in the partition
for i=1:j
c{i}=I(c{i}); % changing neighbors of cell i to the new coordinates
m=min(c{i}); % finding the most dense neighbor of cell i
if m < i % cell i is joined to its most dense neighbor in part kk
kk=R(m); R(i)=kk; X{kk} = union(X{kk},[i]);
AX(kk) = AX(kk)+A(i); NX(kk) = NX(kk)+N(i);
cX{kk} = union(cX{kk},c{i});
else % cell i is a local maximum and a new part k is started
k=k+1; R(i)=k; X{k}=[i]; cX{k} = c{i};
AX(k)= A(i); NX(k) = N(i);
end
end
cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k);
% Returning to the original indices before sorting
for i = 1:k
X{i} = I(X{i});
cX{i} = I(cX{i}); cX{i} = setdiff(cX{i},X{i});
end
function [ X,NX,AX,cX] = HOPPlus2Fall14(N,A,c)
% HOPPlus Partitions the data cells so that each cell is joined to its most dense neighbor
% Each part of the partition contains one local maximum
% Inputs N = number of data points in each cell
% A = area/vol of each cell
% c = cell array containing adjacencies of each cell
% sorting the cells by their densities which is O(Nlog N)
[~,I] = sort(A./N); %sorting by decreasing density
% changing to new coordinates obtained by sorting
N=N(I); A=A(I); c=c(I);
% X{i} = cells in the ith part of the HOP partition
% R(i) = part of the HOP partition containing cell i
% cX{i} = neighboring cells of the ith part of the HOP partition
% NX(i) = number of data points in the ith part of the HOP partition
% AX(i) = area/volume of the ith part of the HOP partition
j = length(N);
X = cell(j); R=ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j);
k=0; % k = number of parts in the partition
for i = 1:j II(I(i)) = i; end
for i = 1:j
c{i}=II(c{i}); % changing neighbors of cell i to the new coordinates
end
for i=1:j
m=min(c{i}); % finding the most dense neighbor of cell i
if m < i % cell i is joined to its most dense neighbor in part kk
kk=R(m); R(i)=kk; X{kk} = union(X{kk},[i]);
AX(kk) = AX(kk)+A(i); NX(kk) = NX(kk)+N(i);
cX{kk} = union(cX{kk},c{i});
else % cell i is a local maximum and a new part k is started
k=k+1; R(i)=k; X{k}=[i]; cX{k} = c{i};
% localmax = [localmax; i];
AX(k)= A(i); NX(k) = N(i);
end
end
cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k);
% Returning to the original indices before sorting
for i = 1:k
X{i} = I(X{i});
cX{i} = I(cX{i}); cX{i} = setdiff(cX{i},X{i});
end
7.1.5. Finding Bayesian Blocks based on the area of each triangle
matrix=presentation2d;
tri=delaunay(matrix);
[n,~]=size(matrix);
[m,~]=size(tri);
dt=DelaunayTri(matrix(:,1), matrix(:,2));
%triplot(dt);
for i= 1:m
[~,area(i,1)]= convhull(matrix(tri(i,:),:));
end
[r,s]=sort(area);
[blockno,blockends,opt,lastchange,cellblock]=bayesianblocks2d(.5*ones(m,1),area,10)
dt = DelaunayTri(matrix);
triplot(dt)
for i = 1:blockno
if i == 1
first = 1;
else first = blockends(i-1)+1;
end
last = blockends(i);
for j = first:last
patch(matrix(tri(s(j),:),1),matrix(tri(s(j),:),2),i); % use color i.
end
end
7.2 PYTHON Code
The code for Section 2.5.2 is written in Python because of the language's extensibility. We hope to use
the AstroML module in future research: AstroML contains a growing library of statistical
and machine learning routines for analyzing astronomical data, loaders for several open
astronomical datasets, and a large suite of examples of analyzing and visualizing
astronomical data.
7.2.1. How to find the block faster - Folding Method
import numpy as np
from scipy import stats
import pylab as pl
import pdb
import b_blocks as bb
def BB_folding(a, cntr_index):  # a: dataset, cntr_index: index of center
    a = np.sort(a)
    a = np.asarray(a)
    # We truncate dataset around center
    if cntr_index > len(a)/2:
        a = a[(2*cntr_index - len(a))+1:]
        cntr_index = len(a)//2
    if cntr_index < len(a)/2:
        a = a[:2*cntr_index+1]
        cntr_index = len(a)//2
    edges_whole = np.concatenate([a[:1],
                                  0.5 * (a[1:] + a[:-1]),
                                  a[-1:]])
    temp = a[:cntr_index+1]  # cut the midpoint of the data
    temp = np.asarray(temp)
    temp_reverse = a[cntr_index:] - a[cntr_index]  # other half - center value
    temp_reverse = np.asarray(temp_reverse)
    temp_reverse = a[cntr_index] - temp_reverse[::-1]  # reverse
    t = (temp + temp_reverse)/2  # t: averaged data (half) of symmetrical dataset
    t = np.sort(t)
    N = t.size - 1
    edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]])
    block_length = t[-1] - edges
    # arrays needed for the iteration
    nn_vec = np.ones(N)
    best = np.zeros(N, dtype=float)
    last = np.zeros(N, dtype=int)
    # Start with first data cell; add one cell at each iteration
    for K in range(N):
        # Compute the width and count of the final bin for all possible
        # locations of the K^th changepoint
        width = block_length[:K + 1] - block_length[K + 1]
        count_vec = np.cumsum(nn_vec[:K + 1][::-1])[::-1]  # no. of data points in all possible blocks
        # evaluate fitness function for these possibilities
        fit_vec = count_vec * (np.log(count_vec) - np.log(width))  # objective function
        fit_vec -= 4  # 4 comes from the prior on the number of changepoints
        fit_vec[1:] += best[:K]  # additive property of obj function
        # find the max of the fitness: this is the K^th changepoint
        i_max = np.argmax(fit_vec)  # index of the maximum value
        last[K] = i_max
        best[K] = fit_vec[i_max]
    # Recover changepoints by iteratively peeling off the last block
    change_points = np.zeros(N, dtype=int)
    i_cp = N
    ind = N
    while True:
        i_cp -= 1
        change_points[i_cp] = ind
        if ind == 0:
            break
        ind = last[ind - 1]
    change_points = change_points[i_cp:]
    # UNFOLDING begins
    temp1 = cntr_index - change_points
    temp1_reverse = cntr_index + temp1[::-1]
    cp_whole = np.concatenate((change_points, temp1_reverse[1:]))
    res = edges_whole[cp_whole]
    return res
More Related Content

What's hot

Suprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
Suprises In Higher DImensions - Ahluwalia, Chou, Conant, VitaliSuprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
Suprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
Elias Vitali
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
Editor Jacotech
 
Pak eko 4412ijdms01
Pak eko 4412ijdms01Pak eko 4412ijdms01
Pak eko 4412ijdms01
hyuviridvic
 
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Pourya Jafarzadeh
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
Editor IJARCET
 

What's hot (19)

Suprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
Suprises In Higher DImensions - Ahluwalia, Chou, Conant, VitaliSuprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
Suprises In Higher DImensions - Ahluwalia, Chou, Conant, Vitali
 
Accurate time series classification using shapelets
Accurate time series classification using shapeletsAccurate time series classification using shapelets
Accurate time series classification using shapelets
 
3 平均・分散・相関
3 平均・分散・相関3 平均・分散・相関
3 平均・分散・相関
 
I0341042048
I0341042048I0341042048
I0341042048
 
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
 
POSITION ESTIMATION OF AUTONOMOUS UNDERWATER SENSORS USING THE VIRTUAL LONG B...
POSITION ESTIMATION OF AUTONOMOUS UNDERWATER SENSORS USING THE VIRTUAL LONG B...POSITION ESTIMATION OF AUTONOMOUS UNDERWATER SENSORS USING THE VIRTUAL LONG B...
POSITION ESTIMATION OF AUTONOMOUS UNDERWATER SENSORS USING THE VIRTUAL LONG B...
 
Spatial data mining
Spatial data miningSpatial data mining
Spatial data mining
 
L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
 
Cardinal direction relations in qualitative spatial reasoning
Cardinal direction relations in qualitative spatial reasoningCardinal direction relations in qualitative spatial reasoning
Cardinal direction relations in qualitative spatial reasoning
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text mining
 
Ch24
Ch24Ch24
Ch24
 
A0360109
A0360109A0360109
A0360109
 
NavMesh
NavMeshNavMesh
NavMesh
 
Pak eko 4412ijdms01
Pak eko 4412ijdms01Pak eko 4412ijdms01
Pak eko 4412ijdms01
 
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
Online Multi-Person Tracking Using Variance Magnitude of Image colors and Sol...
 
FUZZY STATISTICAL DATABASE AND ITS PHYSICAL ORGANIZATION
FUZZY STATISTICAL DATABASE AND ITS PHYSICAL ORGANIZATIONFUZZY STATISTICAL DATABASE AND ITS PHYSICAL ORGANIZATION
FUZZY STATISTICAL DATABASE AND ITS PHYSICAL ORGANIZATION
 
lec14_ref.pdf
lec14_ref.pdflec14_ref.pdf
lec14_ref.pdf
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
 

Similar to Written_report_Math_203_v2

Molecular Space FramesV1 2
Molecular Space FramesV1 2Molecular Space FramesV1 2
Molecular Space FramesV1 2
Paul Melnyk
 
Kevin_Park_OSU_ Master_Project Report
Kevin_Park_OSU_ Master_Project ReportKevin_Park_OSU_ Master_Project Report
Kevin_Park_OSU_ Master_Project Report
Kevin Park
 
Non-Normally Distributed Errors In Regression Diagnostics.docx
Non-Normally Distributed Errors In Regression Diagnostics.docxNon-Normally Distributed Errors In Regression Diagnostics.docx
Non-Normally Distributed Errors In Regression Diagnostics.docx
vannagoforth
 
3. frequency distribution
3. frequency distribution3. frequency distribution
3. frequency distribution
Ilham Bashir
 
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding TechniqueHyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
CSCJournals
 
Gupte - first year paper_approved (1)
Gupte - first year paper_approved (1)Gupte - first year paper_approved (1)
Gupte - first year paper_approved (1)
Shweta Gupte
 
Binary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan ChenBinary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan Chen
Xuan Chen
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clusters
perfj
 

Similar to Written_report_Math_203_v2 (20)

COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACHCOLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH
 
Molecular Space FramesV1 2
Molecular Space FramesV1 2Molecular Space FramesV1 2
Molecular Space FramesV1 2
 
Study on atome probe
Study on atome probe Study on atome probe
Study on atome probe
 
Kevin_Park_OSU_ Master_Project Report
Kevin_Park_OSU_ Master_Project ReportKevin_Park_OSU_ Master_Project Report
Kevin_Park_OSU_ Master_Project Report
 
B0343011014
B0343011014B0343011014
B0343011014
 
poster09
poster09poster09
poster09
 
50134 10
50134 1050134 10
50134 10
 
Non-Normally Distributed Errors In Regression Diagnostics.docx
Non-Normally Distributed Errors In Regression Diagnostics.docxNon-Normally Distributed Errors In Regression Diagnostics.docx
Non-Normally Distributed Errors In Regression Diagnostics.docx
 
[PPT]
[PPT][PPT]
[PPT]
 
3. frequency distribution
3. frequency distribution3. frequency distribution
3. frequency distribution
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding TechniqueHyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
Hyperspectral Data Compression Using Spatial-Spectral Lossless Coding Technique
 
Gupte - first year paper_approved (1)
Gupte - first year paper_approved (1)Gupte - first year paper_approved (1)
Gupte - first year paper_approved (1)
 
Binary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan ChenBinary Classification with Models and Data Density Distribution by Xuan Chen
Binary Classification with Models and Data Density Distribution by Xuan Chen
 
Statistical Clustering
Statistical ClusteringStatistical Clustering
Statistical Clustering
 
Multi fractal analysis of human brain mr image
Multi fractal analysis of human brain mr imageMulti fractal analysis of human brain mr image
Multi fractal analysis of human brain mr image
 
Multi fractal analysis of human brain mr image
Multi fractal analysis of human brain mr imageMulti fractal analysis of human brain mr image
Multi fractal analysis of human brain mr image
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clusters
 
Working with Numerical Data
Working with  Numerical DataWorking with  Numerical Data
Working with Numerical Data
 
Bachelor's Thesis
Bachelor's ThesisBachelor's Thesis
Bachelor's Thesis
 

Written_report_Math_203_v2

  • 1. 1 Analyzing Interstellar Data CAMCOS Report San José State University by Teresa Baral Gianna Fusaro Wesley Ha Aerin Kim Vaibhav Kishore Yunzhi Lin Gulrez Pathan Xiaoli Tong Mine Zhao Xinling Zhang Bradley W. Jackson (Leader) Fall 2014
  • 2. 2 1. Introduction························································································4 1.1 Voronoi Diagram ··············································································4 1.2 Delaunay Triangulation ······································································6 1.3 Bayesian Blocks ···············································································7 1.4 HOP ······························································································9 1.4.1 HOP Algorithm ·········································································10 1.4.2 HOP Benefits & Drawbacks ·························································12 1.4.3 HOP Variants············································································13 2. Properties of Dr. Scargle’s New Objective Function ·································14 2.1 Convexity ·····················································································15 2.2 The Equal Density Property·······························································17 2.3 The Intermediate Density Property ······················································19 2.4 The Last Change Conjecture ······························································20 2.5 Experiments with Bayesian Blocks Algorithm········································24 2.5.1. Intuitive Block··········································································24 2.5.2. How to Find the Block Faster·······················································27 3. Arbitrarily Connected Bayesian Blocks··················································28 3.2 Connected Components ····································································29 3.3 Connected Components: Input/Output··················································30 3.4 Connected Components: Algorithm ·····················································31 3.5 Connected Bayesian Blocks: 2 Density Level·········································32 3.6 Testing and Analysis········································································33 4. Clusters ····························································································37 4.1 Dataset ·························································································37 4.2 Bayesian Blocks ·············································································38 4.2.1 Voronoi Diagram·······································································38 4.2.2 Delaunay Triangulation ·······························································39 4.3 Finding Clusters··············································································41 4.3.1 Connected components································································42 4.3.2 Clusters ···················································································43
  • 3. 3 4.3.3 Fewer clusters ···········································································44 4.4 Comparison ···················································································44 4.5 Conclusion ····················································································45 5. Future Directions of Research ······························································47 5.1 Properties of New Objective Function ··················································47 5.2 Clusters of Galaxies and Stars ····························································47 6. References ························································································49 7. Appendix ··························································································50 7.1 MATLAB Code··············································································50 7.1.1. Vectorized Bayesian Block Algorithm············································50 7.1.2. Last Change Conjecture Implementation of Bayesian Block Algorithm ··51 7.1.3. Code for finding Bayesian Blocks and connected components using Bayesian Block algorithm ···································································53 7.1.4. finding clusters using HOP algorithm·············································65 7.1.5. Finding Bayesian Blocks based on the area of each triangle ·················71 7.2 PYTHON Code ··············································································72 7.2.1. How to find the block faster - Folding Method ·································72
  • 4. 4 1. Introduction Given a sample of N points in D-dimensional space, there are two classes of problems that we would like to solve: density estimation and cluster finding. To infer the probability distribution function (pdf) from a sample of data is known as density estimation. It is one of the most critical components of extracting knowledge from data. For example, given a pdf estimation from point data, we can generate simulated distributions of data and compare them against observations. If we can identify regions of low probability within the pdf, we can also detect the unusual events. Clustering is grouping a set of data such that data in the same group are similar to each other, but between groups are very different. There are manyalgorithms contributing to find clusters, such as K-mean clustering, hierarchical clustering and DBSCAN. In our project we will derive a new approach to find clusters. 1.1 Voronoi Diagram A Voronoi diagram partitions space into ordered regions called Voronoi cells which allow us to infer from its properties the shortest Euclideandistance ofany point in space or the density of any Voronoi cell.
  • 5. 5 Figure 1.1 Voronoi diagram where each color region is a Voronoi cell containing exactly one point. Definitions: ● Let R be a set of distinct points p1, p2, p3,…,pn in the plane, then the Voronoi diagram of R is a subdivision of the plane into n Voronoi cells such that each cell contains exactly one point. ● Area in any given cell is closer to its generating point than to any other. ○ If a point q lies in the same region as point pi then the Euclidean distance from pi to q will be shorter than the Euclidean distance from Pj to q, where Pj is any other point in R which is not pi. In a Voronoi diagram, the Voronoi edge is a subset of locus of points equidistant from any two generating points on R. A Voronoi vertex is positioned at the circle of any empty circle generated by three or more neighboring points on the plane. Density of Voronoi cell is calculated by taking the data point and dividing it by the volume of cell.
  • 6. 6 1.2 DelaunayTriangulation Delaunay triangulationbydefinition is the triangulationof the convex hull of the points in a diagram wherein every circumcircle of a triangle is an empty circle. It is used as a method for binning data. It is also defined as the dual of a Voronoi diagram. Therefore a Delaunay triangulation can be formulated by taking the dual of a Voronoi diagram. The formulation of a Delaunay triangulation by taking the dual of a Voronoi Diagram can be seen by examining the image below (fig. 1.2). The edges of a Delaunay triangle is created by connecting the data points between two adjacent voronoi cells sharing a common edge as shown by the data points Pi and Pj in fig. 1.2. The vertices of the Delaunay triangle thenconsist of the data points. In the example given, the circle shows a Delaunay triangle which is connected by the data points Pi, Pj, and an unnamed data point. A Delaunay triangle consists of three data points, each assumed to have a 360 degree angle, and a triangle in general has an 180 degree angle; therefore, a Delaunay triangle contains half a data point. If a Delaunay triangle is considered as a cell, then its nucleus would be a Voronoi vertex. Each Voronoi vertex is adjacent to three Voronoi cells. Since a Voronoi vertex is adjacent to three cells, then it would have three edges connected to it which forms the triangular shape when its dual is taken.
  • 7. 7 Figure 1.2 Delaunay Triangulation (purple diagram) created by taking the dual of a Voronoi Diagram (blue diagram) 1.3 BayesianBlocks Real data rarely follow one simple distribution and nonparametric methods can offer a useful way of analyzing the data. For example, one of the simplest nonparametric methods to analyze a one-dimensional data set is a histogram. To construct a histogram, we need to specify a bin size, and we assume that the estimated distribution function is piecewise constant within each bin. However in a massive astronomy dataset, it sometimes may not be a true assumption. A histogram can fit any shape of distribution, given enough bins. When the number of data points is small, the number of bins should be small. As the number of data points grows, the number of bins should also grow to capture the increasing amount of detail in the distribution’s shape. This is a basic idea of Dr. Scargle’s Bayesian Block methods—they are composed of simple pieces, and the number of pieces grows with the number of data points.
  • 8. 8 Choosing the right number of bins is critical. It can have a drastic effect, for example the number of bins could change the conclusion that a distribution is bimodal as opposed to single mode. Intuitively, we expect that a large bin width will destroy fine-scale features in the data distribution, while a small width will result in increased counting noise per bin. Opposite to custom, it is unnecessary to choose bin size before reading the data. Dr. Scargle’s nonparametric Bayesian method was originally developed to detect bursts in space and characterize shape of the astronomical data. The algorithm produces the most probable segmentation of the observation into time intervals in which there is no statistically significant variation in the observations. The method he first described utilized a greedy algorithm. A more efficient algorithm is described in An Algorithm for Optimal Partitioning of Data on an Interval (2004). A dynamic programming algorithm is implemented. This algorithm finds a more efficient way of calculating the optimal interval without having to calculate all possible intervals. The algorithm calculates the value of the fitness function of the first n cells by using previously calculated values of the first n-1 cells from the previous iterations plus the calculated value of the last block itself. This algorithm can be applied to piecewise constant models, piecewise linear models, and piecewise exponential models. This algorithm can be thought of as an intelligent binning method where the bin sizes and the number of bins are adapted to the data.
  • 9. 9 Bayesian blocks are useful in different contexts. For the pulse problem that it was originally used for, Bayesian blocks was used to determine pulse attributes without needing to determine specific models for the pulse shapes. This information was then used to guess further attributes of the pulse attributes. For our research, Bayesianblocks was used to determine blocks of voronoi tessellations and delaunay triangles with like densities. 1.4 HOP The HOP Algorithm was formulated by Daniel J. Eisenstein and Piet Hut in 1997. It is considered to be a method for finding groups of particles for N-body simulations. It is a grouping algorithm that divides a set of data points into equivalence classes, such that, each point is a member of only one group. It has many uses in the field of astronomy for finding clusters of stars, clusters of galaxies, locating dwarf galaxies and determining whether galaxies merge. Figure 1.3: Voronoi Tessellation and Delaunay Triangulation on a set of data points
  • 10. 10 1.4.1 HOP Algorithm The HOP Algorithm is aimed at grouping data points into clusters so that the high density regions can be distinguished from the low density regions. The HOP process involves assigning an estimate of the density to every data point. The density function used to come up with these density estimates involves dividing the number of data points in a voronoi cell by its area. Figure 1.4: The HOP Algorithm begins with randomly selecting a data point Figure 1.5: Neighborhood densities of a selected data point are estimated
  • 11. 11 A random point is chosen and another point in its neighborhood which has the highest density is chosen. The data points are linked to their densest neighbor and the process of hopping to higher and higher densities continues until a point which is its own densest neighbor. This point which is its own densest neighbor is called the local maximum and all the points that hop to a local maximum form a max class. Figure 1.6: The selected data point is linked to its densest neighbor Figure 1.7: Continue hopping to higher and higher densities
  • 12. 12 Figure 1.8: Data point which is its own densest neighbor is called local maximum Figure 1.9: Data points hopping to a local maximum form a max class 1.4.2 HOP Benefits & Drawbacks The HOP algorithm is a relatively simple algorithm as compared to other grouping algorithms since it is fast and is based on easy computations. However there are certain drawbacks to this algorithm, mainly, it is overly sensitive to small fluctuations. This leads to a grouping style wherein the groups are not optimally distinguished since minute fluctuations result in the formation of a lot of mini-clusters.
  • 13. 13 1.4.3 HOP Variants The HOP algorithm we used in this project is not the only one; there are many variants to it, some simple and some very complex. A couple of variations to the HOP algorithm that exist but we did not have the time to explore are: · EHOP · JHOP · Joining a data point to its closest neighbor · Joining a data point to another along the steepest gradient · Joining a data point to a denser neighbor with the lowest density · Repeated HOP algorithm
  • 14. 14 2. Properties of Dr. Scargle’s New Objective Function The Bayesian Block algorithm that was implemented in this project made use of Dr. Scargle’s new objective function (1). The new objective function takes on three parameters, namely N, M, and c. The parameter N is the number of data points and M is the area (volume) in the data cell. The parameter c is related to the prior distribution of the number of blocks. Varying the value of the constant will change the return number of blocks in the optimal partition. The derivations of the new objective functioncan be found inScargle et. al. (2012). The basic idea is that a prior Poisson distribution was used to derive the formula. This is different from Dr. Scargle’s “old” objective function in that he previously used a Beta function as the prior. The idea of the new objective function is to model the likelihood that the data cells in a given block have the same density. This can easily be seen why it is an ideal measure for the Bayesian Block algorithm. The idea behind the Bayesian Block algorithm is to dynamically find the partition of the data space, which contains data cells of roughly equal density. The algorithm uses the values from the previously computed i cells to compute the value of the new objective function for the i+1 cell. The algorithm then
  • 15. 15 takes the maximum value returned by the new objective function as the optimal partition. 2.1 Convexity Dr. Scargle’s new objective function (1) has many properties, which are applied in the Bayesian Block algorithm. One of these properties is that the new objective function is convex. For a function to be convex means that the line segment between any two points evaluated by the function will always lie above the values of the function between those two points. In other words if we were to graph the new objective function would look concave up in two and three dimensions. Another property that is a result of a function being convex is that it satisfies the equation: 𝜆𝑓( 𝑥1, 𝑦1) + (1 − 𝜆) 𝑓( 𝑥2, 𝑦2) ≥ 𝑓(𝜆( 𝑥1, 𝑦1) + (1 − 𝜆)( 𝑥2, 𝑦2)) 𝑓𝑜𝑟 0 ≥ 𝜆 ≥ 1 This result of convexity was used in the subsequent proofs of the other properties that the new objective function satisfies. Theorem: The new objective function 𝐹( 𝑁, 𝑀) = 𝑁𝑙𝑜𝑔 𝑁 𝑀 − 𝑐 is convex and is actually strictly convex when for given two points say (𝑁1, 𝑀1) and (𝑁2, 𝑀2) this equality is false 𝑁1 𝑀1 = 𝑁2 𝑀2 (i.e. 𝑓′′( 𝑡) > 0 𝑓𝑜𝑟 0 < 𝑡 < 1).
  • 16. 16 Proof: Let (𝑁1, 𝑀1) and (𝑁2, 𝑀2) be two fixed data points where 𝑁1, 𝑁2, 𝑀1, 𝑀2 > 0 For some value 𝑡, 0 < 𝑡 < 1 𝑡𝑁1 + (1 − 𝑡) 𝑁2 > 0 and 𝑡𝑀1 + (1 − 𝑡) 𝑀2 > 0 Thus we can rewrite f(t) as 𝑓( 𝑡) = [ 𝑡𝑁1 + (1 − 𝑡) 𝑁2]𝑙𝑜𝑔 𝑡𝑁1 + (1 − 𝑡) 𝑁2 𝑡𝑀1 + (1 − 𝑡) 𝑀2 − 𝑐 It suffices to show that if 𝑓′′( 𝑡) ≥ 0 then the Dr. Scargle’s new objective function is convex. From derivative properties we get 𝑓′′( 𝑡) = ( 𝑁1 ∗ 𝑀2 − 𝑁2 ∗ 𝑀1)2 ( 𝑡𝑀1 + (1 − 𝑡) 𝑀2)2( 𝑡𝑁1+ (1 − 𝑡) 𝑁2) Since N1, N2, M1, M2 > 0 this implies ( 𝑡𝑀1+ (1 − 𝑡) 𝑀2)2( 𝑡𝑁1+ (1 − 𝑡) 𝑁2) > 0 Thus the second derivative depends on the term ( 𝑁1 ∗ 𝑀2 − 𝑁2 ∗ 𝑀1)2 which implies 𝑓′′( 𝑡) > 0 if 𝑁1 𝑀1 ≠ 𝑁2 𝑀2 otherwise 𝑓′′( 𝑡) ≥ 0 ∴ Dr. Scargle’s new objective function is convex and is strictly convex if the densities of the two data cells are not equal.
  • 17. 17 2.2 The Equal Density Property The Equal Density Property states that in an optimal partition of a collection of data cells into arbitrary blocks, if C1 and C2 are cells with the density of C1 = the density of C2, then C1 and C2 are in the same block of the optimal partition. Proof: We use a proof by contradiction. Let C1 be a cell with N1 data points and area M1. Similarly, let C2 be a cell with N2 data points and area M2. First assume that C1 and C2 are in different blocks of the optimal partition of the data space, X, such that the density of C1 and C2 are equal to each other. Call their respective blocks A and B. Let C3 represent the remainder of block A, where the remainder of block A has N3 data points and M3 area. Let C4 represent the remainder of block B, where the remainder of block B has N4 data points and M4 area. From our assumption, density of C1 = density C2, or N1/M1 = N2/M2. If the partition of X is indeed optimal, then f(C1 U C3) + f(C2 U C4) > Max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2 U C4) + f(C3)} , where f is Dr. Scargle's new objective function.
  • 18. 18 So we achieve a contradictionby showing that max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2 U C4) + f(C3)} ≥ λ[f(C1 U C2 U C3) + f(C4)] + (1- λ)[f(C3) + f(C1 U C2 U C4)] = f(C1 U C3) + f(C2 U C4). We must find a λ such that 0 ≤ λ ≤ 1 and 1) λ(N1 + N2 + N3, M1 + M2 + M3) + (1- λ)(N3,M3) = (N1 + N3, M1 + M3) and 2) λ(N4, M4) + (1- λ)(N1 + N2+N4, M1 + M2 + M4) = (N2 + N4, M2 + M4) We can see that λ(N1 + N2 + N3) + (1- λ)N3 = N1 + N3 and λ(M1 + M2 + M3) + (1- λ)M3 = M1 + M3. Simplifying the equations gives us 3) λ = N1 / (N1+N2) and 4) λ = M1 / (M1 + M2) Recall the assumption that the density of C1 = density of C2, or N1 / M1 = N2 / M2. Thus N1 = (N2 * M1)/ M2. Substituting this expression for λ into 3 gives us 4. Substituting 3 into 1 and 2 provides the desired results. Thus there is an optimal partition with C1 and C2 in the same block. Thus the equal density property is proved.
  • 19. 19 2.3 The Intermediate Density Property The Intermediate Density property says that in an optimal partition of a collection of data cells into arbitrary blocks, if C1, C2, and C3 are cells with densityof C1 > density of C2 > densityof C3, then whenC1 and C3 are in the same block Bof the optimal partition, so is C2. Proof: Suppose C1,C2,C3 are cells with density of C1 > density of C2 > density of C3 among a larger collection of cells for which we are trying to find the optimal partition into connected blocks. Let C1 U C3 U C4 | C2 U C5 be the part of anoptimalpartition withC1 and C3 inone block where C4 denoted the remainder of that block, and C2 is in a second block where C5 denotes the remainder of that block. We need to prove that f( C1 U C3 U C4) + f(C2 U C5) ≥ max{f( C2 U C3 U C5) + f(C1 U C4), f( C1 U C2 U C5) + f(C3 U C4), f( C1 U C2 U C3 U C4) + f(C5)}, where f represents Dr. Scargle’s new objective function. Let Ni be the number of data points in Ci and let Mi represent the area of Ci and Ni,Mi > 0. It suffices to show that for some 0 ≤ λ1, λ2, λ3 ≤ 1 withλ1 + λ2 + λ3 = 1 the following will be true N1 + N3 + N4 = λ1(N1 + N4) + λ2(N3 + N4) + λ3(N1 + N2 + N3 +N4) and N2 + N5 = λ1(N2+N3+N5) + λ2(N1+N2+N5) + λ3(N5).
  • 20. 20 Substituting 1 - λ1 - λ2 for λ3 gives the following λ1N3 + λ2(N1 + N2) = N2 and λ1(N2 + N3) + λ2(N1 + N2) = N2 Solving this system of equations gives the results λ1 = 0, λ2 = N2 / (N1 + N2), λ3 = N1 / (N1 + N2) Thus Max{f(C2 U C3 U C5) + f(C1 U C4), f(C1 U C2 U C5) + f(C3 U C4), f(C1 U C2 U C3 U C4) + f(C5) ≥ λ1[f(C2 U C3U C5) + f(C1 U C4)] + λ2[f(C1 U C2 U C5) + f(C3 U C4)] + λ3[f(C1 U C2 U C3 U C4) + f(C5)] ≥ f(C1 U C3 U C4) + f(C2 U C5). The Intermediate Density Property is the basis of the Bayesian Blocks algorithm which finds the optimal partitionof a collectionof cells into arbitrary blocks (regardless of the dimension). The intermediate density property allows us to find this optimal partition by first sorting the cells by their densities and then only considering partitions into blocks of consecutive cells in this order thus essentially reducing the problem down to a 1-dimensional problem. 2.4 The Last Change Conjecture The Last Change Conjecture says that the entries in the last change vector are always nondecreasing when finding the optimal partition of a set of cells sorted by density into arbitrary blocks. In proving this conjecture, we need to make use of several properties of Dr. Scargle's new objective function (1), 1) Vector multiplication
  • 21. 21 k*f(x,y) = k(x*log(x/y)-c) = kx * log(kx/ky) – kc = kxlog(kx/ky) – c – (k – 1)c = f(kx,ky) – (k-1)c 2) Additive property f(a,b) + f(d,e) = 2f(a,b)/2 + f(d,e)/2) ≥ 2f((a+d)/2, (b+e)/2) = f(a+d, b+e) – c 3) 2 dimensional Convexity property f(x,y) + e(x0,y0)) + f((x,y) – e(x0,y0)) ≥ 2f(x,y) 4) 2 dimensional Convexity property for λ1 > λ2>0, f((x,y) + λ1(x0,y0)) + f((x,y) – λ1(x0,y0)) ≥ f((x,y) + λ2(x0,y0)) + f((x,y) – λ2(x0,y0)) 4’') 2 dimensional Convexity property for λ1 > λ2 > 0 and λ3 > λ4 > 0, f((x,y) + (λ1x0, λ3y0)) + f((x,y) – (λ1x0, λ3y0)) + f((x,y) – λ1x0, λ3y0)) ≥ f((x,y) + (λ2x0, λ4y0)) + f((x,y) – (λ2x0, λ4y0)) Proof:: Let Ci be the ith cell, with Ni data points and Mi area such that Ni, Mi > 0. Assume we have a data set sorted by its density such that the density of C1 > density of C2 > density of C3 > density of C4. Then by our assumption, N1/M1 > N2/M2 > N3/M3 > N4/M4. We must show that if
  • 22. 22 f(N1 + N2,M1 + M2) + f(N3,M3) ≥ f(N1,M1) + f(N2 + N3, M2 + M3) then f(N1 + N2, M1 + M2) + f(N3 + N4,M3 + M4) ≥ f(N1,M1) + f(N2 + N3 + N4, M2 + M3 + M4). Assume not, then if f(N1 + N2,M1 + M2) + f(N3,M3) ≥ f(N1,M1) + f(N2 + N3, M2 + M3) then f(N1 + N2, M1+ M2) + f(N3 + N4,M3 + M4) < f(N1,M1) + f(N2 + N3 + N4, N2 + M3 + M4). Equivalently f(N1+N2,M1+M2) – f(N1,M1) < f(N2+N3+N4,M2+M3+M4) – f(N3+N4,M3+M4) and f(N1+N2,M1+M2) – f(N1,M1) ≥ f(N2+N3,M2+M3) – f(N3,M3) thus f(N2+N3+N4,M2+M3+M4) – f(N3+N4,M3+M4) > f(N2+N3,M2+M3) – f(N3,M3) which implies f(N2+N3+N4,M2+M3+M4) + f(N3,M3) > f(N2+N3,M2+M3) + f(N3+N4,M3+M4) To find a contradiction, it will be sufficient to find λ1 and λ2 such that λ1(N2+N3,M2+M3) + λ2(N3+N4,M3+M4) = (N2+N3+N4,M2+M3+M4) and (1-λ1)(N2+N3,M2+M3) + (1-λ2)(N3+N4,M3+M4) = (N3,M3) From those equations, we get λ1(N2+N3) + λ2(N3+N4) = N2 + N3 +N4 and λ1(M2+M3) + λ2(M3+M4) = M2 + M3 + M4
  • 23. 23 This is a system of 2 equations with 2 unknowns(assuming Ni and Mi are known quantities), so upon solving for those 2 unknowns, we get λ1=((N2+N3+N4)(M3+M4) – (N3 +N4)(M2+M3+M4) ) / ( (N2 + N3) (M3+M4) - (N3+N4)(M2+M3)) λ2=(-(N2+N3+N4)(M2+M3)+(N2 +N3)(M2+M3+M4) ) / ( (N2 + N3) (M3+M4) - (N3+N4)(M2+M3)) After proving the last change conjecture we then went on to implement it in the Bayesian Block algorithm. The MATLAB code can be found in Appendix A2. The changes to the code only required altering a few lines of code. We then did an efficiency comparison between the original Bayesian Block algorithm and the one implementing the last change conjecture. We used a uniform example of size 10,000 where each data point was evenly separated on the interval from zero to one. The value of the constant, c was varied to change the number of blocks. The results from this comparison can be seen in figure 2.1. Table 2.1: Time Comparison between original Bayesian Block algorithm and last change conjecture algorithms
• 24. 24 Figure 2.1: Graphical representation of Table 2.1 The results of the example showed that, for every number of blocks in the optimal partition, the last change conjecture algorithm outperformed the original. The efficiency of the last change conjecture implementation is O(N^2/B), where B is the number of blocks, so we can see from Figure 2.1 that the running time improves roughly in proportion to the number of blocks. 2.5 Experiments with the Bayesian Blocks Algorithm Here we would like to introduce some interesting experiments with Dr. Scargle's Bayesian Blocks algorithm. 2.5.1. Intuitive Block One simple way to understand an algorithm better is to run it on a small dataset. Here we ran the Bayesian Blocks algorithm on a data set consisting of five
• 25. 25 data points [2, 2.01, 2.03, 10.1, 10.11]. There are four possible cut positions in this dataset, and the most 'intuitive' cut would be between 2.03 and 10.1. However, when you run the Bayesian Blocks algorithm, it gives an unintuitive-looking result (two cuts, at 2.005 and 10.105). Figure 2.2: Intuitive cut Figure 2.3: Algorithm's result The best partition of the dataset would be a high intensity interval near 2, a high intensity interval near 10, and a low intensity interval between 2 and 10, which is similar to the partition that Dr. Scargle's function describes as the most likely. To find out what causes the cut to be unintuitive, we checked every possible cut and its objective value. Figure 2.4: Each element of the list represents one of the four cut positions in the dataset; 0 means there is no cut at that position and 1 means there is a cut. For example, [0, 0, 1] means there is a cut in the 3rd position.
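The exhaustive check behind Figure 2.4 is easy to reproduce. The Python sketch below is only an illustration, not the code used for the report: the cell edges (taken here to be the interval endpoints plus the midpoints between consecutive points) and the prior constant c = 4 are assumed values, so the individual numbers will not match Figure 2.4 exactly, but under these assumptions the intuitive pattern [0, 0, 1, 0] does come out with the lowest objective value.

import numpy as np
from itertools import product

t = np.array([2.0, 2.01, 2.03, 10.1, 10.11])   # the five event times
c = 4.0                                          # assumed prior constant (the value used for Figure 2.4 is not stated)

# assumed cell edges: the interval endpoints plus the midpoints between consecutive points
edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]])

def block_value(n_pts, length):
    # Dr. Scargle's objective for a single block: N*log(N/M) - c
    return n_pts * np.log(n_pts / length) - c

results = []
for cuts in product([0, 1], repeat=len(t) - 1):      # a 1 in position i means a cut between points i and i+1
    bounds = [0] + [i + 1 for i, flag in enumerate(cuts) if flag] + [len(t)]
    value = sum(block_value(hi - lo, edges[hi] - edges[lo])
                for lo, hi in zip(bounds[:-1], bounds[1:]))
    results.append((cuts, value))

for cuts, value in sorted(results, key=lambda r: r[1]):
    print(cuts, round(value, 3))   # with these assumed edges, (0, 0, 1, 0) scores lowest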
• 26. 26 Surprisingly, our intuitive cut turned out to have the worst objective function value. The reason is that the intuitive cut gets penalized heavily because of the length factor M in the objective function (1): the objective value becomes very small when M (the length of the block) gets bigger. The goal of Dr. Scargle's objective function is to partition the whole data space, not just the data points, and it does a very good job of that. Still, this specific dataset illustrates the problem that arises when a low intensity region is adjacent to high intensity regions on either end. We realized this is a shortcoming of the Voronoi diagram, because every Voronoi cell needs to contain one data point. a. Space Dust (Pseudo Points) Method To avoid penalizing a block that happens to be next to a low density region, we decided to add some extra points that are not data points; to distinguish them from regular data points, we call them space dust. True data points are given weight 1.0, while space dust is given a very small weight such as 1/1,000,000. Now we can form the usual Voronoi diagram in the following modified way. If two points x and y are both data points or both space dust, then for points between x and y, the points closest to x go to the cell containing x and the points closest to y go to the cell containing y, as usual. However, when x is space dust and y is a regular data point, then for a point z between x and y, z goes to the cell containing x if the distance from z to y is more than 1,000,000 times larger than the
• 27. 27 distance from z to x; otherwise z goes to the cell containing y. This is a sort of weighted version of the Voronoi diagram. In theory, the weighting is meant for situations where a particle of space dust lies between two regular data points. If you have a dust particle z between two data points x and y that are roughly distance d apart, then the cell containing z will have length roughly d/1,000,000, which means that the density of the cell containing z (its weight of 1/1,000,000 divided by a length of roughly d/1,000,000, or about 1/d) will likely be about the same as the densities of the cells containing x and y. Thus x, y, and z end up in the same block of the optimal partition, so a dust particle between two data points is not likely to change the optimal partition. This method raised two questions, which offer directions for future research: first, how many particles of space dust should be added, and second, how should the window of dust be chosen? 2.5.2. How to Find the Block Faster Suppose there is a symmetrical gamma ray burst. Gamma ray bursts emit highly energetic photons, and we can think of the burst as 1-dimensional time-tagged event (TTE) data which is symmetrical about its maximum emission. Under the assumption that the TTE data is symmetrical, we can use some 'tricks' to identify significantly distinct pulses more efficiently. a. Folding Method Assuming that we know the center of symmetry, the Bayesian Blocks algorithm can run much faster than usual.
• 28. 28 For example, suppose we have a set of times t1, t2, … , tn at which photons are detected on an interval [0,1], and suppose we know that the gamma ray burst is symmetrical at t = ½. If the times satisfy t1 < t2 < … < tm < ½ < tm+1 < tm+2 < … < tn, we can fold the interval at t = ½, average the matching data points, and apply the usual 1-dimensional dynamic programming algorithm on the interval [0, ½]. After that, we unfold the data and use the partition obtained on [0, ½] together with its reverse on [½, 1]. Python code for this folding method is given in Appendix 7.2. 3. Arbitrarily Connected Bayesian Blocks Given a 2-dimensional dataset, we can find the optimal partition using Dr. Scargle's new objective function. This allows us to adaptively bin areas of space that are roughly equal in density. The optimal partition is chosen using dynamic programming to decide which arrangement of cells maximizes Dr. Scargle's fitness function. Using a binning method dependent on the composition of the data, also known as "Bayesian Blocks", allows for more detailed and significant results. Before applying 'optinterval1.m' (included in Appendix A3) to a dataset, we first partition the data into either a Voronoi diagram or a Delaunay triangulation so we can determine the cells of the dataset and their respective sizes. The 'optinterval' function takes inputs 'n', 'a', and 'c', where 'n' is the number of data points per cell, 'a' is the
• 29. 29 sorted areas of each cell, and 'c' is a constant based on the prior distribution. Dr. Scargle's new objective function is a 1-dimensional algorithm, so in order to apply it to multi-dimensional data, the data must be reduced to a single dimension. By letting input 'a' be areas, volumes, or a side length of a polygon or triangle, this 1-dimensional algorithm can be applied to multi-dimensional data. After running the objective function, the resulting blocks in the optimal partition contain cells of roughly equal density, and thus the density of each block is nearly constant. These Bayesian Blocks are ordered from highest to lowest density, such that the first block has the highest relative density and the last block has the lowest relative density. While the optimal configuration of the dataset groups cells together based on size, cells do not need to be connected in order to belong to the same block. This is why the results of 'optinterval1.m' are arbitrarily connected Bayesian Blocks; the algorithm does not take connectedness of cells into account. 3.2 Connected Components Our research focused on finding connected components of space that are roughly equal or similar in density. To find optimal or near-optimal partitions of a 2-dimensional dataset into connected components, we used Dr. Scargle's new objective function to first organize our data by density, and then examined connected components (unions of cells) within k density levels.
• 30. 30 The basic idea for finding connected components of relative density is to take the data binned into blocks by relative density, where Density(Block 1) >= Density(Block 2) >= … >= Density(Block n). Each block represents a different density level. We then iterate over each block to find connected components within k density levels. 3.3 Connected Components: Input/Output The 'conncomponents2.m' function written by Dr. Bradley Jackson, found in Appendix A3, can be called as [x, y, z, w] = conncomponents2(blocks, n, a, adjmat) The function takes in the inputs blocks, n, a, and adjmat. The 'blocks' input is a cell array that contains the cells for each Bayesian Block. Input 'n' is an array that contains the number of data points per cell. The 'a' input is a sorted array that contains the areas of each cell. The last input, 'adjmat', is a sparse matrix containing adjacent pairs of cells. Inputs 'n', 'a', and 'adjmat' share the same index for each respective cell. The 'adjmat' input was obtained using a function written by Dr. Jackson called 'Delaunay2dadjacencies.m'. The 'conncomponents2' function has four outputs: x, y, z, and w. Output 'x' is a cell array of the resulting connected components. Output 'y' is an array containing the number of data points in each component. The 'z' output is an array containing the areas of each component. Lastly, 'w' is a sparse matrix of component adjacency pairs.
• 31. 31 3.4 Connected Components: Algorithm To determine whether two cells within k density levels belong to the same connected component, we first calculate which cells are adjacent pairs. If we use the dual Voronoi and Delaunay diagrams, we can use Voronoi edges to determine which Delaunay triangles are adjacent. We then store the adjacent pair information in a large sparse matrix, storing a '1' for adjacencies and a '0' for non-adjacencies. We used Dr. Jackson's 'Delaunay2dadjacencies.m' to calculate and store adjacencies; the code is listed in Appendix A3. To compute connected components, we worked from a program written by Dr. Jackson called 'conncomps2.m'. We made a few modifications to the code to obtain connected components within two density levels. After running the Bayesian Blocks partition on the dataset and computing its adjacency matrix, we can start looking for connected components within the same density level. Please note in the following pseudocode that Ci refers to cell Ci and {Ci} refers to the component that Ci belongs to. 1. Assign each cell to belong to its own component 2. For each Bayesian Block: a. Find all adjacent pairs that exist in the same block i. For every pair of adjacent cells, (Ci, Cj): 1. If Ci belongs to a smaller component than Cj, then {Ci} is added to {Cj}
• 32. 32 2. Set {Ci} = [ ] (empty) 3. Else, {Cj} is added to {Ci} a. Set {Cj} = [ ] (empty) 3. Select all non-empty components and calculate the number of data points and the area of each component. 4. Calculate the sparse matrix of adjacent components from the matrix of cell adjacencies: if two adjacent cells do not belong to the same component, then their components are adjacent. This algorithm finds connected components of cells in the same Bayesian Block. Since our criteria for components are strict, in that the cells must be connected and in the same Bayesian Block, many components are likely to be generated. We therefore extend our reach by allowing connected cells within 2 density levels, i.e. at most 1 density level apart, in the same component; a small Python sketch of this merging step appears at the end of this section. 3.5 Connected Bayesian Blocks: 2 Density Levels We made several modifications to Dr. Jackson's 'conncomps2' function to find connected components in the same density level or 1 density level higher. The basic idea is the same as searching for connected cells in the same density level, but we change where we look for adjacencies to include in components. We looked for adjacencies twice: 1) adjacent cells within the same block, and 2) adjacent cells 1 density level apart. We also included an additional check to ensure that only cells in the same level or 1 density level apart are grouped into the same component. The code for 'conncomps2mod.m' can be found in Appendix A3.
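The following is a minimal Python sketch of the merging idea in the pseudocode above, not the MATLAB implementation in Appendix A3. The inputs block_of (the Bayesian Block index of each cell), adj_pairs (the list of adjacent cell pairs), and the level tolerance k are assumed to come from the earlier steps; k = 0 gives the same-level components, while k = 1 corresponds to the 2-density-level version (the sketch checks only each adjacent pair's levels, so it is a simplification of the additional component-level check described above).

def connected_components(n_cells, block_of, adj_pairs, k=0):
    # block_of[i]: Bayesian Block index of cell i; adj_pairs: list of adjacent cell pairs (i, j).
    # k = 0 merges only cells in the same block; k = 1 also merges cells one density level apart.
    comp = {i: [i] for i in range(n_cells)}      # component id -> list of cells
    label = list(range(n_cells))                 # cell -> component id
    for i, j in adj_pairs:
        if label[i] == label[j]:
            continue                             # already in the same component
        if abs(block_of[i] - block_of[j]) > k:
            continue                             # more than k density levels apart
        a, b = label[i], label[j]
        if len(comp[a]) < len(comp[b]):          # merge the smaller component into the larger
            a, b = b, a
        comp[a].extend(comp[b])
        for cell in comp[b]:
            label[cell] = a
        del comp[b]
    return list(comp.values())

# toy example: four cells in a row, cells 0-1 in block 0 and cells 2-3 in block 1
print(connected_components(4, [0, 0, 1, 1], [(0, 1), (1, 2), (2, 3)], k=0))  # [[0, 1], [2, 3]]
print(connected_components(4, [0, 0, 1, 1], [(0, 1), (1, 2), (2, 3)], k=1))  # [[0, 1, 2, 3]]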
• 33. 33 3.6 Testing and Analysis We used two different 2-dimensional datasets to test and analyze optimal or near-optimal partitions into connected components. The "Counties of California By Population Density" (CCPD) dataset has dimensions 58 x 2, listing all California counties with their population densities. The Sloan Digital Sky Survey (SDSS) dataset has dimensions 1523 x 2. We used CCPD as a small, known dataset to test our logic when making changes to the algorithms. The CCPD data set is already binned and sorted, but the SDSS data is not. We used a Delaunay triangulation to format the SDSS data, limiting the number of cell adjacencies to 3 per cell. This is preferred over using a Voronoi diagram, which can return 3 or more adjacencies per cell. To call and plot 'optinterval1', 'conncomps2', and 'conncomps2mod', we used 'conncomp_script.m', 'bayesian_plot.m', 'population_density.m', and 'belongs_to.m'. All helper functions and scripts are included in Appendix A3. After running 'optinterval1' on both CCPD and SDSS, we obtained a similar number of partitions relative to the size of each dataset: CCPD yields 8 blocks, and SDSS is represented by 6 blocks. Both datasets have smaller blocks representing the highest and lowest density regions, while the intermediate blocks contain the majority of cells. For example, the block partition for SDSS is [215 779 1458 2129 2731 3017], and these blocks have the following sizes, respectively: [215, 564, 679, 671, 602, 286]. The lowest and highest density regions, block 1 and block 6, have significantly fewer cells
• 34. 34 compared to blocks 2, 3, 4, and 5. Thus, the arbitrarily connected blocks algorithm finds significance in blocks that have a sufficiently dissimilar density (either higher or lower) from the majority. However, these blocks contain cells that are not necessarily connected. After running 'conncomps2.m' on both CCPD and SDSS, the resulting partitions varied greatly from the arbitrarily connected Bayesian Blocks. In CCPD, 27 connected components in the same density level were found. For the larger dataset, SDSS returned 835 connected components. The number of components increased considerably relative to the size of our data. Unlike the arbitrarily connected blocks, which only group cells by relative density, the connected Bayesian Blocks must be both relatively close in density and connected. The strict criteria for grouping cells increased the number of partitions to such a degree that we lost significance in our results. We ran 'conncomps2mod.m' to add another density level to our connected components. Again, both datasets showed similar behavior, with a decrease in the number of partitions relative to 'conncomps2.m': 11 connected components were found in CCPD, along with 205 connected components in SDSS. Because we included more cells in our search for adjacencies, more connected cells with similar relative densities were grouped together. At 2 density levels, the number of components is higher than that of the arbitrarily connected blocks, yet lower than at 1 density level. This is because the connected components at 2 density levels reflect both connectedness and relative density, with less restriction than at 1 density level.
• 35. 35 The CCPD and SDSS datasets were used to test three algorithms: arbitrarily connected Bayesian Blocks, connected Bayesian Blocks at 1 density level, and connected Bayesian Blocks at 2 density levels. Figure 3.1 shows the results of running these algorithms on the CCPD data. The arbitrarily connected Bayesian Blocks have the smallest number of partitions. In connected Bayesian Blocks at 1 density level, cells in the same component must not only have the same density level but also be connected, which causes an increase in the number of components. When CCPD is analyzed for connected Bayesian Blocks at 2 density levels, the number of components decreases from the 1 density level case, but is still higher than for the arbitrarily connected Bayesian Blocks. The criteria for cells in each component are relaxed from requiring the same density level to allowing similar density levels, admitting more cells into the same component. The same observation holds for the SDSS data, as shown in Figure 3.2. Figure 3.1: Output for the Counties of California by Population Density dataset
  • 36. 36 Figure 3.2: Output for the Sloan Digital Sky Survey dataset
• 37. 37 4. Clusters A galaxy is a huge collection of gas, dust, and stars. In this section, we are interested in finding clusters of galaxies in a collection of stars using the Bayesian Blocks algorithm and the HOP algorithm. 4.1 Dataset The data set we use in this section is a 2-dimensional data slice obtained from the Sloan Digital Sky Survey. It is worth mentioning that the Sloan Digital Sky Survey has been one of the most successful surveys in the history of astronomy. Our dataset contains 1523 data points, which are shown in Figure 4.1 below. Figure 4.1: 2-dimensional Sloan Digital Sky Survey raw data
• 38. 38 We can see in Figure 4.1 that the data points are not scattered uniformly: there are some high density regions and some low density regions. We will classify the regions into different clusters using algorithms based on the calculation of density. 4.2 Bayesian Blocks We use the Bayesian Blocks algorithm to find the optimal partitions based on the equal density property. There are two ways to get the density of the data points: the Voronoi diagram and the Delaunay triangulation. 4.2.1 Voronoi Diagram As discussed in the previous section, "a Voronoi diagram is a partition of a plane into regions based on distance to points in a specific subset of the plane" (wiki, 2014). Each Voronoi cell contains exactly one data point, so the density of a Voronoi cell is obtained by dividing the number of data points it contains, which is one, by the area of the cell. By using the Bayesian Blocks algorithm, the cells with roughly the same density are grouped into one block; this gives 8 blocks for our dataset. The blocks are shown in Figure 4.2 below: cells of the same color are in the same block, with red indicating low density blocks and blue indicating high density blocks. We notice that the cells in the same block are not necessarily connected. In addition, in this part we ignore the Voronoi cells that have infinite edges or whose edges extend beyond the boundary. By doing this we obtain finite cell areas, but the method has a limitation: we lose some data points near the boundaries.
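As an illustration, the following minimal Python sketch (using scipy, not the MATLAB code in Appendix A3) computes the density of each bounded Voronoi cell as one data point divided by the cell area, skipping the unbounded cells near the boundary as described above.

import numpy as np
from scipy.spatial import Voronoi, ConvexHull

def voronoi_densities(points):
    # Density of each bounded Voronoi cell: one data point divided by the cell area.
    # Cells with a vertex at infinity (index -1) are skipped, as described in the text.
    vor = Voronoi(points)
    densities = {}
    for i, region_index in enumerate(vor.point_region):
        region = vor.regions[region_index]
        if len(region) == 0 or -1 in region:
            continue                                    # unbounded cell near the boundary
        area = ConvexHull(vor.vertices[region]).volume  # for 2-d hulls, .volume is the area
        densities[i] = 1.0 / area
    return densities

# small example with random points standing in for the SDSS slice
rng = np.random.default_rng(0)
print(len(voronoi_densities(rng.random((50, 2)))))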
• 39. 39 Figure 4.2: Blocks based on the Voronoi diagram 4.2.2 Delaunay Triangulation The Delaunay triangulation is dual to the Voronoi diagram, and each Voronoi vertex is the circumcenter of a Delaunay triangle. A Delaunay triangle has three data points as its vertices, and each triangle is credited with one half of a data point. By using the Delaunay triangulation, we do not lose any data points from our original data set. One way to calculate the density is to take the number of data points in a triangle, which is one half, and divide it by the area of the triangle. As with the Voronoi diagram, triangles with roughly the same density are put into one block. This gives 10 blocks, shown in Figure 4.3; triangles of the same color are in the same block, and different colors represent different blocks. However, some triangles with long edges but small areas are separated into different blocks from their neighbors.
• 40. 40 Figure 4.3: Bayesian blocks based on the Delaunay triangulation using area Another way to calculate the density is to take the number of data points and divide it by the maximum edge length of each triangle. Applying the Bayesian Blocks algorithm to these densities gives 6 Bayesian Blocks. Figure 4.4 shows the 6 different blocks, each with a different color; the red blocks represent low density levels and the blue blocks represent high density levels. Figure 4.4: Blocks based on the Delaunay triangulation with maximum edge length
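Both densities are straightforward to compute. The Python sketch below (an illustration with scipy, not the report's MATLAB code) credits each Delaunay triangle with half a data point and divides either by the triangle's area or by its maximum edge length, the two choices just described.

import numpy as np
from scipy.spatial import Delaunay

def triangle_densities(points):
    # Each Delaunay triangle is credited with half a data point; divide by either
    # the triangle area or its maximum edge length (the two choices described above).
    tri = Delaunay(points)
    a, b, c = (points[tri.simplices[:, k]] for k in range(3))
    area = 0.5 * np.abs((b[:, 0] - a[:, 0]) * (c[:, 1] - a[:, 1])
                        - (b[:, 1] - a[:, 1]) * (c[:, 0] - a[:, 0]))
    maxlen = np.maximum.reduce([np.linalg.norm(b - a, axis=1),
                                np.linalg.norm(c - b, axis=1),
                                np.linalg.norm(a - c, axis=1)])
    return 0.5 / area, 0.5 / maxlen

rng = np.random.default_rng(1)
density_by_area, density_by_maxlen = triangle_densities(rng.random((200, 2)))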
• 41. 41 Figure 4.5 below is a zoomed-in view of Figure 4.4, and it illustrates the advantage of this method. On the left side of the picture, the triangles with long edges are classified into the same block as their neighbors. By doing this, we reduce the number of Bayesian Blocks from 10 to 6. Figure 4.5: Zoomed-in view of Figure 4.4 4.3 Finding Clusters As we can see, the Bayesian Blocks method is very efficient at finding optimal partitions, and it works especially well for partitions with components that are not necessarily connected. Many applications have been found, such as detecting gamma ray bursts and flares in active galactic nuclei and other variable objects such as the Crab Nebula. Although the Bayesian Blocks method can be used to find partitions efficiently, its application is limited when we want to find clusters, in which the components must be connected. So we introduce a different algorithm (HOP) for identifying clusters of connected components based on the results of the Bayesian Blocks algorithm.
• 42. 42 To find clusters using the HOP algorithm, we need two steps. First, we find the connected components of the Bayesian Blocks. Second, we apply the HOP algorithm on the connected components to find clusters. In addition, if we want to reduce the number of clusters, we can apply the HOP algorithm again on those clusters. The details of each step are described below. 4.3.1 Connected components There are many ways to define a connected component. In our research, we use the most straightforward definition: two cells in the same block that share an edge belong to the same connected component. Using this definition, we found 835 connected components, illustrated in Figure 4.6. Different colors represent different Bayesian Blocks, and the number on each cell gives its connected component index. We can see that the three highlighted triangles have the same background color, which means they are in the same block. Notice that the numbers on the orange triangle and the white triangle are the same, which means they are in the same connected component, because they share a common edge. But the number on the purple triangle is different from the other two, which means this triangle belongs to a different connected component, because it does not share a common edge with either of the other two.
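The shared-edge adjacencies are easy to extract from the triangulation. The Python sketch below is only an illustration using scipy's Delaunay neighbors, not the Voronoi-edge approach of Appendix A3; block_of, which maps each triangle to its Bayesian Block, is an assumed input from the blocks step.

import numpy as np
from scipy.spatial import Delaunay

def same_block_adjacencies(points, block_of):
    # Pairs of Delaunay triangles that share an edge and lie in the same Bayesian Block.
    tri = Delaunay(points)
    pairs = []
    for t, nbrs in enumerate(tri.neighbors):            # neighbors[t][j] = triangle across the edge opposite vertex j, or -1
        for nb in nbrs:
            if nb > t and block_of[nb] == block_of[t]:  # nb > t skips -1 and avoids double counting
                pairs.append((t, int(nb)))
    return pairs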
• 43. 43 Figure 4.6: Connected components 4.3.2 Clusters How can we find clusters? We apply the HOP algorithm to those connected components. The density of a connected component is defined to be the number of data points in the component divided by the maximum length of that component. Every connected component is joined to its densest neighbor, provided that neighbor has a higher density. Taking the circled area in Figure 4.7 as an example, the dark blue area has a higher density than the light blue area, and they are in different connected components. Visually, we would expect them to be in the same cluster, and after applying HOP they do end up in the same cluster, as shown in Figure 4.8. In Figure 4.8, the number on each cell is its cluster number. We finally get 249 clusters. This still seems like too many clusters, so we need to merge some of them; a sketch of the joining rule is given below.
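Here is a minimal Python sketch of this HOP step (illustrative only; the MATLAB functions HOPPlusFall14 and HOPPlus2Fall14 in Appendix A3 are the versions used in the report). Each component hops to its densest neighbor when that neighbor is denser, and the local maximum reached by following the hops labels the cluster; densities and neighbors are assumed inputs from the connected components step.

def hop(densities, neighbors):
    # densities[i]: density of component i (data points / maximum length);
    # neighbors[i]: list of components adjacent to component i.
    target = list(range(len(densities)))
    for i, nbrs in enumerate(neighbors):
        if nbrs:
            j = max(nbrs, key=lambda n: densities[n])  # densest neighbor of component i
            if densities[j] > densities[i]:            # hop only if that neighbor is denser
                target[i] = j
    clusters = []
    for i in range(len(densities)):
        j = i
        while target[j] != j:                          # follow hops up to a local maximum
            j = target[j]
        clusters.append(j)                             # the local maximum labels the cluster
    return clusters

Running the same function a second time, on the cluster densities and cluster adjacencies, gives the repeated-HOP step described in Section 4.3.3.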
• 44. 44 Figure 4.7: Connected components Figure 4.8: HOP on connected components 4.3.3 Fewer clusters In order to reduce the number of clusters, we apply HOP again on the clusters found. The density of a cluster is defined as the number of data points in the cluster divided by the maximum length of that cluster, and every cluster is joined to its densest neighbor with higher density. Every cell is then assigned a new cluster number, shown in Figure 4.9. In the end, we reduced the number of clusters down to 46. Figure 4.9: HOP on clusters 4.4 Comparison We also did a comparison between using the HOP-only algorithm and the HOP algorithm
• 45. 45 on Bayesian Blocks. The HOP-only algorithm seems to be overly sensitive to minor fluctuations in the data, which can create a large number of local maxima. In our research, it gives 390 clusters, while the HOP algorithm on Bayesian Blocks gives 249 clusters. If we examine a small area, we can also see how the number of HOP-only clusters is affected by the density of each cell. For instance, in the highlighted area in Figure 4.10, three triangles form three different clusters according to their densities; in Figure 4.11, however, they are in the same cluster. To sum up, the HOP on Bayesian Blocks approach is more robust. Figure 4.10: HOP only (390 clusters) Figure 4.11: HOP on Bayesian Blocks (249 clusters) 4.5 Conclusion The HOP algorithm is probably the most efficient cluster-finding algorithm, but when it is applied alone, for astronomical data in particular, it seems overly sensitive to small fluctuations in the density and often results in a lot of mini clusters. However, when we apply the HOP algorithm on the connected components of the Bayesian Blocks, it leads to fewer, larger clusters, although it still takes a huge amount of time to run
• 46. 46 the Bayesian Blocks algorithm. Compared with the 390 clusters from the HOP algorithm alone (and the 835 connected components of the Bayesian Blocks), the method of "HOP + connected components" gives only 249 clusters. Nevertheless, further reducing the number of clusters is still desirable when analyzing astronomical data. Therefore, we repeat the HOP algorithm on the connected components of the Bayesian Blocks and obtain 46 clusters. Repeating the HOP algorithm on connected components might be the best method we used during our research, but there is a risk with such repeated use of the HOP algorithm: we may end up joining, at a very low density level, two clusters that each have a very high local maximum. In most cases, it might be more appropriate for them to remain separate. Furthermore, both HOP and repeated HOP on the connected components of the Bayesian Blocks are probably not effective at finding filamentary clusters, because they partition based only on density without considering the shape of clusters, while filamentary clusters are usually a significant part of astronomical data.
• 47. 47 5. Future Directions of Research In this section we present topics that we were interested in but did not have enough time to complete. These ideas could be pursued by a curious researcher. 5.1 Properties of the New Objective Function One possible avenue of future research is to look into other ways of increasing the efficiency of the Bayesian Blocks algorithm. The two methods that we looked into, vectorization and implementing the last change conjecture, worked well, but they are certainly not the only ones. Another aspect of the project that we would have liked to spend more time on is determining which algorithm for the symmetric Bayesian Blocks problem performs best. We would also have liked to explore other, possibly more efficient, algorithms for the symmetric Bayesian Blocks problem. 5.2 Clusters of Galaxies and Stars In our research, we focused on 1-d and 2-d Voronoi diagrams. Using the Bayesian Blocks and HOP algorithms on 3-d Voronoi diagrams to find clusters is therefore one possible future direction for this research. Which variant of the HOP algorithm to apply is also worth further study. Other than the HOP algorithm implemented in our research, there are other versions, for example: a) joining a cell to its closest neighbor;
• 48. 48 b) joining a cell to another along the steepest gradient, which is an improvement on hopping to a cell's densest neighbor; or c) joining a cell to a denser neighbor with the lowest density. Moreover, it is worth looking into the possibility of further speeding up the Bayesian Blocks algorithm to which we apply repeated HOP on connected components. Finally, because repeatedly applying the HOP algorithm would eventually result in a single cluster, we are curious about when to stop using the HOP algorithm so as to avoid combining two clusters that should remain separate.
• 49. 49 6. References Eisenstein, D. J. & Hut, P. (1998). HOP: A New Group-Finding Algorithm for N-Body Simulations. The Astrophysical Journal, 498, 137-142. Jackson, B., et al. (2004). An Algorithm for Optimal Partitioning of Data on an Interval. IEEE Signal Processing Letters, 12 (2), 105-108. Jackson, B., et al. (2003). Optimal Partitions of Data in Higher Dimensions. Scargle, J. D., et al. (1998). Studies in Astronomical Time Series Analysis V. Bayesian Blocks, a New Method to Analyze Structure in Photon Counting Data. The Astrophysical Journal, 504, 405. Scargle, J. D., et al. (2012). Studies in Astronomical Time Series Analysis VI. Bayesian Block Representations. The Astrophysical Journal, 764 (2).
  • 50. 50 7. Appendix 7.1 MATLAB Code This section includes all of the MATLAB code that we created or implemented during this project. All of the code in this section (and the Python code section) should suffice to replicate our work. 7.1.1. Vectorized BayesianBlock Algorithm To improve the efficiency of an algorithm one easy solution is to implement a vectorized version of the original. This algorithm performs the same operations as the original Bayesian Block algorithm, but replaces the inner for loop with a vectorized function call and a vectorized calculation. % %%Vectorized version of the Bayesian Block Algorithm %This vectorized version should increase the efficiency %of the original BB algorithm % function [part,opt,lastchange] = optinterval1(N,A,c) % % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval % c is a specified constant % % An O(N^2) algorithm for finding the optimal partition of N data points on an interval n = length(N);%gets the number of cells opt = zeros(1,n+1); %initialization lastchange = ones(1,n); %initialization changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n); optint = zeros(1,n);
  • 51. 51 % Consider all possible last blocks for the optimal partition % % endN = the number of points in the possible last blocks % endA = the lengths of the possible last blocks for i = 1:n endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i); endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i); % Computing the values for all possible last blocks in endobj newc=c*ones(size(endN)); endobj(1:i)= arrayfun(@newobjective,endN(1:i),endA(1:i),newc); % Computing the values for all possible optimal partitions in optint optint(1:i) = opt(1:i)+endobj(1:i); % The optimal partition is the one with the maximum value % opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0. % lastchange(i) is the first cell in the last block of an optimal partition [opt(i+1),lastchange(i)] = max(optint(1:i)); end % backtracking to find the blocks of the optimal partition and its change points i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i,1) = k; end end 7.1.2. Last Change Conjecture Implementation of BayesianBlock Algorithm Another solution to increase the efficiencyof the Bayesian Block algorithm was to code in a property of the new objective function. The last change conjecture (Section 2.4) says that when the data set (data cells) are in increasing order the last chance point is always nondecreasing. This property allows us to eliminate some of the unnecessary computation in the inner loop. % %%Last Change Conjecture implementation of Bayesian Block Algorithm %a dynamic programming algorithm with efficiency O(N^2/B)
  • 52. 52 %where B is the number of blocks %Makes use of vectorization % function [part,opt,lastchange] = newoptinterval1(N,A,c) % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval % c is a specified constant % % An O(N^2) algorithm for finding the optimal partition of N data points on an interval %gets the number of cells n = length(N); %initialization opt = zeros(1,n+1); lastchange = ones(1,n); changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n); optint = zeros(1,n); % Consider all possible last blocks for the optimal partition % % endN = the number of points in the possible last blocks % endA = the lengths of the possible last blocks % % k = the value of the last change point % % algorithm only checks up to k for the current change point for i = 1:n if i==1 k=1; else k= lastchange(i-1); end endN = ones(1,i); endN(k:i) = endN(k:i)*changeN(i+1)-changeN(k:i); endA = ones(1,i); endA(k:i) = endA(k:i)*changeA(i+1)-changeA(k:i); % Computing the values for all possible last blocks in endobj newc=c*ones(size(endN(k:i))); endobj(k:i)= arrayfun(@newobjective,endN(k:i),endA(k:i),newc);
  • 53. 53 % Computing the values for all possible optimal partitions in optint optint(k:i) = opt(k:i)+endobj(k:i); % The optimal partition is the one with the maximum value % opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0. % lastchange(i) is the first cell in the last block of an optimal partition [opt(i+1),lastchange(i)] = max(optint(k:i)); %Adjust last change point values to correspond to correct points lastchange(i)=lastchange(i)+k-1; end % backtracking to find the blocks of the optimal partition and its changepoints i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i,1) = k; end end 7.1.3. Code for finding Bayesian Blocks and connected components using BayesianBlock algorithm p=presentation2d; tri=delaunay(p); [n,~]=size(p); [m,~]=size(tri); for i=1:m maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)),norm(p(tri(i,1),:)-p(tri(i,3),:)),norm(p(tri(i,3),:)-p(t ri(i,2),:))]); end [r,s] = sort(maxdist); [x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10); dt = DelaunayTri(p); triplot(dt) blockno = 6; blockends = [215 779 1458 2129 2731 3017];
  • 54. 54 for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); cellblock(s(first:last))=i ; end voronoi(p(:,1),p(:,2)) for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) % Assign labels to the triangles. numtri = size(tri,1); triangleno = [1:3017]; plabels = arrayfun(@(n) {sprintf('%d', triangleno(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off %computing the adjacies of the Delaunay triangulion and the components of %Baysian blocks [adjmat,adjcell] = Delaunay2dadjacencies(p); n =.5*ones(3017,1); a = transpose(maxdist); [r,s] = sort(a./n); blocks = cell(6,1); for i = 1:6 if i == 1 first = 1;
  • 55. 55 else first = blockends(i-1)+1; end; last = blockends(i); blocks{i} = s(first:last); end [w,x,y,z] = conncomponents2(blocks,n,a,adjmat); Assign labels to the triangles. for i=1:835 cellcomp(w{i}) = i; end for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... hold off blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end function [part,opt,lastchange] = optinterval1(N,A,c) % An O(N^2) algorithm for finding the optimal partition of N data points on an interval % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval
  • 56. 56 n = length(N); opt = zeros(1,n+1); lastchange = ones(1,n); changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n+1); optint = zeros(1,n+1); for i = 1:n endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i); endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i); for j = 1:i endobj(j) = newobjective(endN(j),endA(j),c); optint(j) = opt(j)+endobj(j); end [opt(i+1),lastchange(i)] = max(optint(1:i)); end i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i) = k; end end function [value] = newobjective(x,y,c) % A function that computes the objective function value of a block % x is the length of the block and y is the number of data points in the block value = x*(log(x/y)) - c; function [adjmat,adjcell] = Delaunay2dadjacencies(p) % p is a 2 x n matrix of 2-d points % computing adjacencies of Delaunay triangles from Voronoi edges [vert,bound] = voronoin(p); tri = delaunay(p); [j,~] = size(vert); adjmat = sparse(j,j); adjcell = cell(j,1); [n,~] = size(p); %storing adjacencies in a sparse adjacency matrix for i=1:n num = length(bound{i}); for h = 1:num if h < num adjmat(bound{i}(h),bound{i}(h+1))=1; adjmat(bound{i}(h+1),bound{i}(h))=1; else
  • 57. 57 adjmat(bound{i}(h),bound{i}(1))=1; adjmat(bound{i}(1),bound{i}(h))=1; end end end for i=1:j-1 trino(i) = intersect(bound{tri(i,1)},intersect(bound{tri(i,2)},bound{tri(i,3)})); end adjmat = adjmat(trino,trino); %storing adjacencies as a cell array for i = 1:j-1 adjcell{i}=find(adjmat(i,:)); end function [comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,n,a,adjmat) % finding the connected components of a partition given by cell array blocks % two cells with the same color are in the same component cellno = length(n); color = [1:cellno]; for i=1:cellno comps{i} = [i]; end; % going through adjacencies to update components for i = 1:length(blocks); [row,col] = find(adjmat(blocks{i},blocks{i})); row = blocks{i}(row); col = blocks{i}(col); for j = 1:length(row) if color(row(j)) ~= color(col(j)) if length(comps{color(row(j))}) >= length(comps{color(col(j))}); comps{color(row(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(col(j))}; y = color(col(j)); color(x) = color(row(j)); comps{y} = []; else comps{color(col(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(row(j))}; y = color(row(j)); color(x) = color(col(j)); comps{y} = []; end end
  • 58. 58 end end nonempty = []; for i = 1:cellno if length(comps{i}) > 0 nonempty = [nonempty i]; end end comps = comps(nonempty); for i = 1:length(comps) ncomps(i) = sum(n(comps{i})); acomps(i) = sum(a(comps{i})); color(comps{i}) = i; end [row,col] = find(adjmat); adjcomps = sparse(length(row),length(row)); for i = 1:length(row) if color(row(i)) ~= color(col(i)) adjcomps(color(row(i)),color(col(i))) = 1; adjcomps(color(col(i)),color(row(i))) = 1; end end %%% conncomp_script.m % Import Data from F14PresentationData p = presentation2d; % computing the Bayesian Blocks partition of the Delaunay triangulation using maxdist for the size of a cell tri = delaunay(p); [tri_size, ~] = size(tri); for i = 1:tri_size maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)),norm(p(tri(i,1),:)-p(tri(i,3),:)),norm(p(tri(i,3),:)-p(t ri(i,2),:))]); end % use maxdist between points in the triangle as area and sort area [r,s] = sort(maxdist); % create optimal partition
  • 59. 59 [x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10); % blockends blockno = length(x); blockends = fliplr(transpose(x)); % conncomps2 takes transpose n =.5*ones(3017,1); a = transpose(maxdist); [r,s] = sort(a./n); % initialize blocks blocks = cell(blockno,1); % assign cells blocks cell array for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); blocks{i} = s(first:last); end % Computing the adjacencies of the Delaunay triangulation and the components of the Bayesian Blocks [adjmat,adjcell] = Delaunay2dadjacencies(p); % calculate connected components [comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,n,a,adjmat); % Assign labels to the triangles for i=1:length(comps) cellcomp(comps{i}) = i; end % draw and color connected components triplot(dt) numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)'); map = colormap(hsv(length(comps))); for i = 1:length(comps) current_comp = comps{i}; for j = 1:length(current_comp) patch(p(tri(current_comp(j),:),1),p(tri(current_comp(j),:),2),map(i,:)); % use color i. end end
  • 60. 60 %% conncomps2mod function %% finds connected cells within 2 density levels and groups them into components %% returns components, number of data points in components, area of components, %% and adjacent components function [comps,ncomps,acomps,adjcomps] = conncomps2mod(blocks,n,a,adjmat) % figuring out references to blockends blockends=[]; for temp=1:length(blocks) blockends=[blockends blocks{temp}(length(blocks{temp}))]; end % finding the connected components of a partition given by cell array blocks % two cells with the same color are in the same component cellno = length(n); color = [1:cellno]; for i=1:cellno comps{i} = [i]; end; % going through adjacencies to update components for i = 1:length(blocks) % find all pairs of cells in current block that are adjacent % and store their cell number [row,col] = find(adjmat(blocks{i},blocks{i})); % finding adjacencies between blocks{i} and blocks{i+1} if i<length(blocks) [row2, col2] = find(adjmat(blocks{i},blocks{i+1})); % reindexing row2 and col2 -> all indexes in row2 and col2 reference % a cell in either block(i) or block(i+1) for k=1:length(row2) row2_block_number = belongs_to(blocks{i}(row2(k)), blocks); col2_block_number = belongs_to(blocks{i+1}(col2(k)), blocks); if row2_block_number row2(k) = blocks{row2_block_number}(row2(k)); else row2(k) = blocks{i+1}(row2(k)); end if col2_block_number col2(k) = blocks{col2_block_number}(col2(k)); else col2(k) = blocks{i+1}(col2(k)); end
  • 61. 61 end end % grab cell number of adjacency index from block cell array row = blocks{i}(row); col = blocks{i}(col); row = vertcat(row, row2); col = vertcat(col, col2); % comment out above two lines and uncomment the below two lines % depending if the data needs horzcat versus vertcat % row = horzcat(row,transpose(row2)); % col = horzcat(col, transpose(col2)); for j = 1:length(row) % if the color of the two adjacent cells don’t already belong to same component if color(row(j)) ~= color(col(j)) % number of components in color array for cell in row is greater than number of % components in color array for cell in col if length(comps{color(row(j))}) >= length(comps{color(col(j))}); % check to see if the incoming cell(s) are within one % level density apart head_cell_block_number = belongs_to(color(row(j)),blocks); possible_guests = belongs_to(color(col(j)),blocks); difference = head_cell_block_number - possible_guests; % do nothing if components contain cells from blocks of % more than one density apart if(abs(difference) > 1) comps{color(col(j))} = comps{color(col(j))}; else % place union of components for adjacent cells in components % for row cell comps{color(row(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(col(j))}; y = color(col(j)); % store new location of cell in color color(x) = color(row(j)); comps{y} = []; end else
  • 62. 62 % check to see if the incoming cell(s) are within one % level density apart head_cell_block_number = belongs_to(color(col(j)),blocks); possible_guests = belongs_to(color(row(j)),blocks); difference = head_cell_block_number - possible_guests; % do nothing if components contain cells from blocks of % more than one density apart if(abs(difference) > 1) comps{color(row(j))} = comps{color(row(j))}; else % place union of components for adjacent cells in components % for col cell comps{color(col(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(row(j))}; y = color(row(j)); % store new location of cell in color color(x) = color(col(j)); comps{y} = []; end end end end end nonempty = []; for i = 1:cellno if length(comps{i}) > 0 nonempty = [nonempty i]; end end % keep only nonempty connected components comps = comps(nonempty);
  • 63. 63 for i = 1:length(comps) ncomps(i) = sum(n(comps{i})); acomps(i) = sum(a(comps{i})); % re-indexing color to reference new component array color(comps{i}) = i; end % all adjacencies [row,col] = find(adjmat); % create sparse matrix with same size as all adjacencies adjcomps = sparse(length(row),length(row)); for i = 1:length(row) % cells do not belong to same component if color(row(i)) ~= color(col(i)) adjcomps(color(row(i)),color(col(i))) = 1; adjcomps(color(col(i)),color(row(i))) = 1; end end %% belongs_to.m function [ block_number ] = belongs_to( cell_number, blocks ) % belongs_to: helper function to return the block number that the cell % belongs to for i=1:length(blocks) for k=1:length(blocks{i}) if blocks{i}(k) == cell_number block_number = i; return; end end end end
  • 64. 64 %% population_density.m %% takes data from population density .xls file and finds %% the bayesian blocks and connected components % Data is already sorted on import [x,y,z] = optinterval1(pop/10000,area,15) % import neighbors into matrix called Adjacencies [num_cols, num_rows] = size(adjacencies); adjmat = sparse(num_cols, num_cols); for i=1:num_cols for j=1:num_rows if isnan(adjacencies(i,j)) == 0 adjmat(i,adjacencies(i,j)) = 1; end end end blockno = length(x); blocks = cell(blockno,1); x = flipud(x); for i = 1:blockno if i == 1 first = 1; else first = x(i-1)+1; end; last = x(i); blocks{i} = (first:last); end [comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,pop/1000,area,adjmat); %% bayesian_plot.m script %% plots data and colors cells in the same block the same color % importing data from CAMCOS2ddataset p = presentation2d; % computing the triangles of the Delaunay triangulation tri = delaunay(p); [n,~] = size(p); [m,~] = size(tri);
  • 65. 65 % computing the areas of the triangles for i = 1:m [~,area(i,1)] = convhull(p(tri(i,:),:)); end % sorting the triangles by density/area [r,s] = sort(area); % applying the Bayesian Blocks algorithm to the sorted data tic [x,y,z] = optinterval1(.5*ones(m,1),area(s),10); toc % use colormap for pretty colors map = colormap(hsv(length(x))); blockno = length(x); blockends = x(blockno:-1:1); dt = DelaunayTri(p); % Drawing the Delaunay triangulation triplot(dt) % coloring the blocks of the optimal partition for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last % use color i. patch(p(tri(s(j),:),1),p(tri(s(j),:),2),map(i,:)); end end 7.1.4. finding clusters using HOP algorithm % Computing the adjacencies of the components of the Bayesian Blocks and the HOP partition of these components for i = 1:835 cc{i} = find(z(i,:)); end [w1,x1,y1,z1] = HOPPlus2Fall14(x,y,cc); % size(w1) = 249 for i = 1:249 w2{i} = []; for j = 1:length(w1{i}) w2{i} = union(w2{i},w{w1{i}(j)}); end;
  • 66. 66 cellpart(w2{i}) = i; i w2{i} end for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) % Assign labels to the triangles. numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellpart(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end % computing the cells www{j} in each component j and the Bayesian block = cellblock(k) each cell k is in for i = 1:835 www{i} = s(w{i}); end for i = 1:blockno if i == 1 j = 1; else j = blockends(i-1)+1; end; cellblock(s(j:blockends(i))) = i; end
  • 67. 67 Index = []; for i = 1:length(w1) for j = 1:length(w1{i}); Index(w1{i}(j)) = i; end end % find adjacecies of the cluster adjclusters = sparse(length(w1),length(w1)); for i = 1:length(w1) for j = 1:length(w1{i}); adjs = cc{w1{i}(j)}; for k = 1:length(adjs); adjclusters(i, Index(adjs(k))) = 1; adjclusters(Index(adjs(k)), i) = 1; end end end for i = 1:length(w1) ss{i} = find(adjclusters(i,:)); end [w3,x3,y3,z3] = HOPPlusFall14(x1,y1,ss); % figure %length(w3)=46 for i = 1:46 w4{i} = []; for j = 1:length(w3{i}) for k =1: length (w1{w3{i}(j)}) w4{i} = union(w4{i},w{w1{w3{i}(j)}(k)}); end; end; cellsecond(w4{i}) = i; i w4{i} end
  • 68. 68 for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) % Assign labels to the triangles. numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellsecond(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end [ww,xx,yy,zz] = HOPPlus2Fall14(n,a,adjcell); %%% hop plot without color dt = DelaunayTri(p); triplot(dt) for i = 1:length(ww) cellP(ww{i})=i; end % Assign labels to the triangles. for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end numtri = size(tri,1);
  • 69. 69 plabels = arrayfun(@(n) {sprintf('%d', cellP(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off function [ X,NX,AX,cX] = HOPPlusFall14(N,A,c) % HOPPlus Partitions the data cells so that each cell is joined to its most dense neighbor % Each part of the partition contains one local maximum % sorting the cells by their densities which is O(Nlog N) [~,I] = sort(A./N); %sorting by decreasing density % changing to new coordinates obtained by sorting N=N(I); A=A(I); c=c(I); % X{i} = cells in the ith part of the HOP partition % R(i) = part of the HOP partition containing cell i % cX{i} = neighboring cells of the ith part of the HOP partition % NX(i) = number of data points in the ith part of the HOP partition % AX(i) = area/volume of the ith part of the HOP partition j = length(N); X = cell(j); R=ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j); k=0; % k = number of parts in the partition for i=1:j c{i}=I(c{i}); % changing neighbors of cell i to the new coordinates m=min(c{i}); % finding the most dense neighbor of cell i if m < i % cell i is joined to its most dense neighbor in part kk kk=R(m); R(i)=kk; X{kk} = union(X{kk},[i]); AX(kk) = AX(kk)+A(i); NX(kk) = NX(kk)+N(i); cX{kk} = union(cX{kk},c{i}); else % cell i is a local maximum and a new part k is started k=k+1; R(i)=k; X{k}=[i]; cX{k} = c{i}; AX(k)= A(i); NX(k) = N(i); end end cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k); % Returning to the original indices before sorting for i = 1:k X{i} = I(X{i});
  • 70. 70 cX{i} = I(cX{i}); cX{i} = setdiff(cX{i},X{i}); end function [ X,NX,AX,cX] = HOPPlus2Fall14(N,A,c) % HOPPlus Partitions the data cells so that each cell is joined to its most dense neighbor % Each part of the partition contains one local maximum % Inputs N = number of data points in each cell % A = area/vol of each cell % c = cell array containing adjacencies of each cell % sorting the cells by their densities which is O(Nlog N) [~,I] = sort(A./N); %sorting by decreasing density % changing to new coordinates obtained by sorting N=N(I); A=A(I); c=c(I); % X{i} = cells in the ith part of the HOP partition % R(i) = part of the HOP partition containing cell i % cX{i} = neighboring cells of the ith part of the HOP partition % NX(i) = number of data points in the ith part of the HOP partition % AX(i) = area/volume of the ith part of the HOP partition j = length(N); X = cell(j); R=ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j); k=0; % k = number of parts in the partition for i = 1:j II(I(i)) = i; end for i = 1:j c{i}=II(c{i}); % changing neighbors of cell i to the new coordinates end for i=1:j m=min(c{i}); % finding the most dense neighbor of cell i if m < i % cell i is joined to its most dense neighbor in part kk kk=R(m); R(i)=kk; X{kk} = union(X{kk},[i]); AX(kk) = AX(kk)+A(i); NX(kk) = NX(kk)+N(i); cX{kk} = union(cX{kk},c{i}); else % cell i is a local maximum and a new part k is started k=k+1; R(i)=k; X{k}=[i]; cX{k} = c{i}; % localmax = [localmax; i]; AX(k)= A(i); NX(k) = N(i); end end
  • 71. 71 cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k); % Returning to the original indices before sorting for i = 1:k X{i} = I(X{i}); cX{i} = I(cX{i}); cX{i} = setdiff(cX{i},X{i}); end 7.1.5. Finding Bayesian Blocks based on the area of each triangle matrix=presentation2d; tri=delaunay(matrix); [n,~]=size(matrix); [m,~]=size(tri); dt=DelaunayTri(matrix(:,1), matrix(:,2)); %triplot(dt); for i= 1:m [~,area(i,1)]= convhull(matrix(tri(i,:),:)); end [r,s]=sort(area); [blockno,blockends,opt,lastchange,cellblock]=bayesianblocks2d(.5*ones(m,1),area,10 ) dt = DelaunayTri(matrix); triplot(dt) for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(matrix(tri(s(j),:),1),matrix(tri(s(j),:),2),i); % use color i. end end
  • 72. 72 7.2 PYTHON Code Chapter 2.5.2 code is written in Python because of its expandability. We hope to use AstroML module for future research. AstroML contains a growing library of statistical and machine learning routines for analyzing astronomical data, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets. 7.2.1. How to find the block faster - Folding Method import numpy as np from scipy import stats import pylab as pl import pdb import b_blocks as bb def BB_folding(a, cntr_index): # a: dataset, cntr_index: index of center a = np.sort(a) a = np.asarray(a) # We truncate dataset around center if cntr_index > len(a)/2: a=a[(2*cntr_index -len(a))+1:] cntr_index=len(a)//2 if cntr_index < len(a)/2: a=a[:2*cntr_index+1] cntr_index=len(a)//2 edges_whole = np.concatenate([a[:1], 0.5 * (a[1:] + a[:-1]), a[-1:]]) temp= a[:cntr_index+1] # cut the midpoint of the data temp= np.asarray(temp) temp_reverse=a[cntr_index:] - a[cntr_index] # other half - center value temp_reverse= np.asarray(temp_reverse) temp_reverse= a[cntr_index] - temp_reverse[::-1] # reverse t= (temp +temp_reverse)/2 # t: averaged data (half) of symmetrical dataset t = np.sort(t) N = t.size -1 edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]]) block_length = t[-1] - edges
  • 73. 73 # arrays needed for the iteration nn_vec = np.ones(N) best = np.zeros(N, dtype=float) last = np.zeros(N, dtype=int) # Start with first data cell; add one cell at each iteration for K in range(N): # Compute the width and count of the final bin for all possible # locations of the K^th changepoint width = block_length[:K + 1] - block_length[K + 1] count_vec = np.cumsum(nn_vec[:K + 1][::-1])[::-1] # calculates no. of data #points in all possible blocks # evaluate fitness function for these possibilities fit_vec = count_vec * (np.log(count_vec) - np.log(width)) # objective function fit_vec -= 4 # 4 comes from the prior on the number of changepoints fit_vec[1:] += best[:K] # additive property of obj function n # find the max of the fitness: this is the K^th changepoint i_max = np.argmax(fit_vec) # Indices of the maximum values along an axis last[K] = i_max best[K] = fit_vec[i_max] # Recover changepoints by iteratively peeling off the last block change_points = np.zeros(N, dtype=int) i_cp = N ind = N while True: i_cp -= 1 change_points[i_cp] = ind if ind == 0: break ind = last[ind - 1] change_points = change_points[i_cp:] #UNFOLDING begins temp1= cntr_index - change_points temp1_reverse= cntr_index + temp1[::-1] cp_whole=np.concatenate((change_points,temp1_reverse[1:])) res = edges_whole[cp_whole] return res