
# Written_report_Math_203_v2


Analyzing Interstellar Data

CAMCOS Report, San José State University

by Teresa Baral, Gianna Fusaro, Wesley Ha, Aerin Kim, Vaibhav Kishore, Yunzhi Lin, Gulrez Pathan, Xiaoli Tong, Mine Zhao, Xinling Zhang, and Bradley W. Jackson (Leader)

Fall 2014
Contents

1. Introduction
   1.1 Voronoi Diagram
   1.2 Delaunay Triangulation
   1.3 Bayesian Blocks
   1.4 HOP
      1.4.1 HOP Algorithm
      1.4.2 HOP Benefits & Drawbacks
      1.4.3 HOP Variants
2. Properties of Dr. Scargle's New Objective Function
   2.1 Convexity
   2.2 The Equal Density Property
   2.3 The Intermediate Density Property
   2.4 The Last Change Conjecture
   2.5 Experiments with Bayesian Blocks Algorithm
      2.5.1 Intuitive Block
      2.5.2 How to Find the Block Faster
3. Arbitrarily Connected Bayesian Blocks
   3.2 Connected Components
   3.3 Connected Components: Input/Output
   3.4 Connected Components: Algorithm
   3.5 Connected Bayesian Blocks: 2 Density Levels
   3.6 Testing and Analysis
4. Clusters
   4.1 Dataset
   4.2 Bayesian Blocks
      4.2.1 Voronoi Diagram
      4.2.2 Delaunay Triangulation
   4.3 Finding Clusters
      4.3.1 Connected Components
      4.3.2 Clusters
      4.3.3 Fewer Clusters
   4.4 Comparison
   4.5 Conclusion
5. Future Directions of Research
   5.1 Properties of New Objective Function
   5.2 Clusters of Galaxies and Stars
6. References
7. Appendix
   7.1 MATLAB Code
      7.1.1 Vectorized Bayesian Block Algorithm
      7.1.2 Last Change Conjecture Implementation of Bayesian Block Algorithm
      7.1.3 Code for Finding Bayesian Blocks and Connected Components Using the Bayesian Block Algorithm
      7.1.4 Finding Clusters Using the HOP Algorithm
      7.1.5 Finding Bayesian Blocks Based on the Area of Each Triangle
   7.2 Python Code
      7.2.1 How to Find the Block Faster - Folding Method
1. Introduction

Given a sample of N points in D-dimensional space, there are two classes of problems we would like to solve: density estimation and cluster finding. Inferring the probability distribution function (pdf) from a sample of data is known as density estimation, and it is one of the most critical components of extracting knowledge from data. For example, given a pdf estimated from point data, we can generate simulated distributions of data and compare them against observations. If we can identify regions of low probability within the pdf, we can also detect unusual events. Clustering is grouping a set of data such that data in the same group are similar to each other, while data in different groups are very different. Many algorithms exist for finding clusters, such as K-means clustering, hierarchical clustering, and DBSCAN. In our project we derive a new approach to finding clusters.

1.1 Voronoi Diagram

A Voronoi diagram partitions space into ordered regions called Voronoi cells, whose properties let us infer, for any point in space, the shortest Euclidean distance to a generating point, as well as the density of any Voronoi cell.
Figure 1.1: Voronoi diagram where each colored region is a Voronoi cell containing exactly one point.

Definitions:

● Let R be a set of distinct points p1, p2, p3, …, pn in the plane. The Voronoi diagram of R is a subdivision of the plane into n Voronoi cells such that each cell contains exactly one point.
● The area in any given cell is closer to its generating point than to any other.
○ If a point q lies in the same region as point pi, then the Euclidean distance from pi to q is shorter than the Euclidean distance from pj to q, where pj is any other point in R.

In a Voronoi diagram, a Voronoi edge is a subset of the locus of points equidistant from two generating points of R. A Voronoi vertex is positioned at the center of an empty circle generated by three or more neighboring points in the plane. The density of a Voronoi cell is calculated by dividing the number of data points in the cell (one) by the volume of the cell.
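The cell-density calculation just described can be sketched in a few lines of Python. This is an illustrative example only (not the project's MATLAB code); it assumes SciPy is available and scores only bounded cells, since unbounded cells have no finite area:

```python
import numpy as np
from scipy.spatial import Voronoi

def polygon_area(pts):
    # Shoelace formula. Voronoi cells are convex, so sorting the vertices
    # by angle about the centroid puts them in cyclic order.
    c = pts.mean(axis=0)
    ang = np.arctan2(pts[:, 1] - c[1], pts[:, 0] - c[0])
    pts = pts[np.argsort(ang)]
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def voronoi_densities(points):
    """Density of each bounded Voronoi cell: 1 data point / cell area."""
    vor = Voronoi(points)
    densities = {}
    for i, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if len(region) == 0 or -1 in region:
            continue  # unbounded cell: area (hence density) undefined
        densities[i] = 1.0 / polygon_area(vor.vertices[region])
    return densities
```

On a regular unit grid, for instance, every interior cell is a unit square, so its density comes out as exactly one point per unit area.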
1.2 Delaunay Triangulation

A Delaunay triangulation is, by definition, the triangulation of the convex hull of the points in a diagram in which every circumcircle of a triangle is an empty circle. It is used as a method for binning data. It is also defined as the dual of a Voronoi diagram; therefore a Delaunay triangulation can be formed by taking the dual of a Voronoi diagram. This construction can be seen in the image below (fig. 1.2). An edge of the Delaunay triangulation is created by connecting the data points of two adjacent Voronoi cells that share a common edge, as shown by the data points Pi and Pj in fig. 1.2; the vertices of the Delaunay triangles are thus the data points themselves. In the example given, the circle marks a Delaunay triangle formed by the data points Pi, Pj, and an unnamed data point. Each data point at a corner of a Delaunay triangle carries a full 360 degrees of angle around it, while the interior angles of any triangle sum to 180 degrees; in this counting sense, therefore, a Delaunay triangle contains half a data point. If a Delaunay triangle is considered as a cell, then its nucleus is a Voronoi vertex (the triangle's circumcenter). Each Voronoi vertex is adjacent to three Voronoi cells and therefore has three edges connected to it, which form the triangular shape when the dual is taken.
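As a quick illustration of triangulation as a binning structure (a hypothetical example, not taken from the report, assuming SciPy), `scipy.spatial.Delaunay` triangulates a small point set and exposes the triangles as rows of point indices. A convex quadrilateral with one interior point always yields four triangles (by Euler's formula, T = 2n − h − 2 for n points with h on the hull):

```python
import numpy as np
from scipy.spatial import Delaunay

# Four corners of the unit square plus its center point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
tri = Delaunay(points)

# Each row of `tri.simplices` holds the three point indices of one triangle;
# `tri.neighbors` gives the triangle adjacent across each edge (or -1).
print(len(tri.simplices))  # 4: the center is joined to every corner
```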
Figure 1.2: Delaunay triangulation (purple diagram) created by taking the dual of a Voronoi diagram (blue diagram).

1.3 Bayesian Blocks

Real data rarely follow one simple distribution, and nonparametric methods can offer a useful way of analyzing the data. For example, one of the simplest nonparametric methods for analyzing a one-dimensional data set is a histogram. To construct a histogram, we need to specify a bin size, and we assume that the estimated distribution function is piecewise constant within each bin. In a massive astronomy dataset, however, this assumption may not hold. A histogram can fit any shape of distribution, given enough bins. When the number of data points is small, the number of bins should be small; as the number of data points grows, the number of bins should also grow to capture the increasing amount of detail in the distribution's shape. This is the basic idea of Dr. Scargle's Bayesian Block methods: they are composed of simple pieces, and the number of pieces grows with the number of data points.
Choosing the right number of bins is critical and can have a drastic effect; for example, the number of bins could change the conclusion that a distribution is bimodal rather than unimodal. Intuitively, we expect that a large bin width will destroy fine-scale features in the data distribution, while a small width will result in increased counting noise per bin. Contrary to custom, it is unnecessary to choose the bin size before reading the data. Dr. Scargle's nonparametric Bayesian method was originally developed to detect bursts in space and characterize the shape of astronomical data. The algorithm produces the most probable segmentation of the observation into time intervals in which there is no statistically significant variation in the observations. The method he first described used a greedy algorithm; a more efficient, dynamic programming algorithm is described in "An Algorithm for Optimal Partitioning of Data on an Interval" (2004). This algorithm finds the optimal partition without having to evaluate all possible intervals: it calculates the value of the fitness function for the first n cells by using the previously calculated values for the first n−1 cells plus the value of the last block itself. The algorithm can be applied to piecewise constant, piecewise linear, and piecewise exponential models, and it can be thought of as an intelligent binning method in which the bin sizes and the number of bins adapt to the data.
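The dynamic programming recurrence just described can be sketched as follows. This is a simplified illustration (not Dr. Scargle's or the project's MATLAB code) using the block fitness N log(N/M) − c discussed in Section 2, where `n[i]` and `m[i]` are the counts and sizes of cells already placed in the required order, and every cell is assumed to contain at least one data point:

```python
import numpy as np

def optimal_partition(n, m, c):
    """Optimal partition of cells into blocks of consecutive cells.
    n[i] = data points in cell i, m[i] = size of cell i, c = prior constant.
    Fitness of a block with N points and total size M is N*log(N/M) - c.
    Returns the change points (indices where a new block starts)."""
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    N = len(n)
    cn = np.concatenate(([0.0], np.cumsum(n)))  # prefix sums of counts
    cm = np.concatenate(([0.0], np.cumsum(m)))  # prefix sums of sizes
    best = np.zeros(N + 1)             # best[k]: best fitness of first k cells
    last = np.zeros(N + 1, dtype=int)  # last[k]: start of the final block
    for k in range(1, N + 1):
        j = np.arange(k)               # candidate starts of the last block
        bn, bm = cn[k] - cn[j], cm[k] - cm[j]
        fit = best[j] + bn * np.log(bn / bm) - c
        last[k] = int(np.argmax(fit))
        best[k] = fit[last[k]]
    cps, k = [], N                     # walk back through `last`
    while last[k] > 0:
        cps.append(int(last[k]))
        k = last[k]
    return sorted(cps)
```

For example, six unit-count cells whose first three sizes are ten times smaller than the last three split into a high-density block and a low-density block at index 3 for a moderate prior constant.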
Bayesian Blocks are useful in different contexts. For the pulse problem the method was originally used for, Bayesian Blocks determined pulse attributes without needing specific models for the pulse shapes; this information was then used to infer further properties of the pulses. For our research, Bayesian Blocks was used to determine blocks of Voronoi tessellations and Delaunay triangles with like densities.

1.4 HOP

The HOP algorithm was formulated by Daniel J. Eisenstein and Piet Hut in 1997 as a method for finding groups of particles in N-body simulations. It is a grouping algorithm that divides a set of data points into equivalence classes, such that each point is a member of exactly one group. It has many uses in astronomy: finding clusters of stars, finding clusters of galaxies, locating dwarf galaxies, and determining whether galaxies merge.

Figure 1.3: Voronoi tessellation and Delaunay triangulation on a set of data points.
1.4.1 HOP Algorithm

The HOP algorithm groups data points into clusters so that high density regions can be distinguished from low density regions. The HOP process involves assigning a density estimate to every data point; the density function used to produce these estimates divides the number of data points in a Voronoi cell by its area.

Figure 1.4: The HOP algorithm begins by randomly selecting a data point.

Figure 1.5: Neighborhood densities of a selected data point are estimated.
A random point is chosen, and the point in its neighborhood with the highest density is identified. Each data point is linked to its densest neighbor, and the process of hopping to higher and higher densities continues until it reaches a point that is its own densest neighbor. Such a point is called a local maximum, and all the points that hop to a given local maximum form a max class.

Figure 1.6: The selected data point is linked to its densest neighbor.

Figure 1.7: Continue hopping to higher and higher densities.
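The hopping procedure above can be sketched as follows. This is a toy illustration (not Eisenstein and Hut's implementation) that assumes a precomputed density estimate and neighbor list for each point; labels are cached along each hop path so every point is resolved only once:

```python
def hop(densities, neighbors):
    """Assign each point to the local maximum reached by repeatedly hopping
    to its densest neighbor (a point counts as its own neighbor).
    densities: density estimate per point; neighbors: index lists per point.
    Returns a list mapping each point to its local maximum's index."""
    def densest(i):
        # The densest point among i and its neighbors.
        return max(neighbors[i] + [i], key=lambda j: densities[j])

    labels = [None] * len(densities)
    for i in range(len(densities)):
        path, j = [], i
        # Hop until we reach a labeled point or a local maximum.
        while labels[j] is None and densest(j) != j:
            path.append(j)
            j = densest(j)
        root = labels[j] if labels[j] is not None else j
        for p in path + [j]:
            labels[p] = root
    return labels
```

On a one-dimensional chain of points whose densities rise to a single peak, every point hops to that peak, so the whole chain forms one max class.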
Figure 1.8: A data point which is its own densest neighbor is called a local maximum.

Figure 1.9: Data points hopping to a local maximum form a max class.

1.4.2 HOP Benefits & Drawbacks

The HOP algorithm is relatively simple compared to other grouping algorithms: it is fast and based on easy computations. However, it has certain drawbacks, chiefly that it is overly sensitive to small fluctuations. This leads to a grouping style in which the groups are not optimally distinguished, since minute fluctuations in the density estimates result in the formation of many mini-clusters.
1.4.3 HOP Variants

The HOP algorithm we used in this project is not the only one; there are many variants, some simple and some very complex. A few variations that exist but that we did not have time to explore are:

· EHOP
· JHOP
· Joining a data point to its closest neighbor
· Joining a data point to another along the steepest gradient
· Joining a data point to a denser neighbor with the lowest density
· Repeated HOP algorithm
2. Properties of Dr. Scargle's New Objective Function

The Bayesian Block algorithm implemented in this project makes use of Dr. Scargle's new objective function (1). The new objective function takes three parameters: N, M, and c. The parameter N is the number of data points and M is the area (volume) of the data cell. The parameter c is related to the prior distribution of the number of blocks; varying this constant changes the number of blocks returned in the optimal partition. The derivation of the new objective function can be found in Scargle et al. (2012). The basic idea is that a Poisson prior distribution was used to derive the formula; this differs from Dr. Scargle's "old" objective function, which used a Beta function as the prior. The new objective function models the likelihood that the data cells in a given block have the same density, which makes it a natural measure for the Bayesian Block algorithm: the algorithm's goal is to dynamically find the partition of the data space into blocks containing data cells of roughly equal density. The algorithm uses the values computed for the first i cells to compute the value of the new objective function for the first i+1 cells. The algorithm then
takes the maximum value returned by the new objective function as the optimal partition.

2.1 Convexity

Dr. Scargle's new objective function (1) has several properties that are applied in the Bayesian Block algorithm. One of these is that the new objective function is convex. For a function to be convex means that the line segment between any two points on the graph of the function always lies above the function between those two points; in other words, the graph of the new objective function is concave up in two and three dimensions. Equivalently, a convex function satisfies

λf(x1, y1) + (1 − λ)f(x2, y2) ≥ f(λ(x1, y1) + (1 − λ)(x2, y2)) for 0 ≤ λ ≤ 1.

This consequence of convexity is used in the subsequent proofs of the other properties that the new objective function satisfies.

Theorem: The new objective function F(N, M) = N log(N/M) − c is convex, and is strictly convex for two points (N1, M1) and (N2, M2) with N1/M1 ≠ N2/M2 (i.e. f''(t) > 0 for 0 < t < 1).
Proof: Let (N1, M1) and (N2, M2) be two fixed data points with N1, N2, M1, M2 > 0. For any t with 0 < t < 1, tN1 + (1 − t)N2 > 0 and tM1 + (1 − t)M2 > 0, so along the segment between the two points we can write

f(t) = [tN1 + (1 − t)N2] log( [tN1 + (1 − t)N2] / [tM1 + (1 − t)M2] ) − c.

It suffices to show that f''(t) ≥ 0, which implies that Dr. Scargle's new objective function is convex. Differentiating twice gives

f''(t) = (N1·M2 − N2·M1)² / [ (tM1 + (1 − t)M2)² (tN1 + (1 − t)N2) ].

Since N1, N2, M1, M2 > 0, the denominator (tM1 + (1 − t)M2)²(tN1 + (1 − t)N2) > 0. The sign of the second derivative therefore depends on the term (N1·M2 − N2·M1)², which implies f''(t) > 0 if N1/M1 ≠ N2/M2, and otherwise f''(t) ≥ 0.

∴ Dr. Scargle's new objective function is convex, and is strictly convex if the densities of the two data cells are not equal.
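The convexity inequality can also be spot-checked numerically. The sketch below is an illustration (not a proof): it samples random positive points and verifies λF(p1) + (1 − λ)F(p2) ≥ F(λp1 + (1 − λ)p2) up to floating-point tolerance:

```python
import math
import random

def F(N, M, c=1.0):
    # Dr. Scargle's new objective function: F(N, M) = N log(N/M) - c.
    return N * math.log(N / M) - c

def convexity_holds(trials=1000, seed=0):
    """Randomly test lam*F(p1) + (1-lam)*F(p2) >= F(lam*p1 + (1-lam)*p2)."""
    rng = random.Random(seed)
    for _ in range(trials):
        N1, M1, N2, M2 = (rng.uniform(0.1, 10.0) for _ in range(4))
        lam = rng.random()
        lhs = lam * F(N1, M1) + (1 - lam) * F(N2, M2)
        rhs = F(lam * N1 + (1 - lam) * N2, lam * M1 + (1 - lam) * M2)
        if lhs < rhs - 1e-9:
            return False
    return True
```

Note that the constant c cancels from both sides, so the check is independent of the prior constant.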
2.2 The Equal Density Property

The Equal Density Property states that in an optimal partition of a collection of data cells into arbitrary blocks, if C1 and C2 are cells with the density of C1 equal to the density of C2, then C1 and C2 are in the same block of the optimal partition.

Proof: We use a proof by contradiction. Let C1 be a cell with N1 data points and area M1, and let C2 be a cell with N2 data points and area M2. Assume that C1 and C2 have equal densities but lie in different blocks of the optimal partition of the data space X; call their respective blocks A and B. Let C3 represent the remainder of block A, with N3 data points and area M3, and let C4 represent the remainder of block B, with N4 data points and area M4. From our assumption, density of C1 = density of C2, i.e. N1/M1 = N2/M2. If the partition of X is indeed optimal, then

f(C1 U C3) + f(C2 U C4) > max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2 U C4) + f(C3) },

where f is Dr. Scargle's new objective function.
So we achieve a contradiction by showing that

max{ f(C1 U C2 U C3) + f(C4), f(C1 U C2 U C4) + f(C3) } ≥ λ[f(C1 U C2 U C3) + f(C4)] + (1 − λ)[f(C3) + f(C1 U C2 U C4)] ≥ f(C1 U C3) + f(C2 U C4).

We must find a λ with 0 ≤ λ ≤ 1 such that

1) λ(N1 + N2 + N3, M1 + M2 + M3) + (1 − λ)(N3, M3) = (N1 + N3, M1 + M3), and
2) λ(N4, M4) + (1 − λ)(N1 + N2 + N4, M1 + M2 + M4) = (N2 + N4, M2 + M4).

From the components of 1), λ(N1 + N2 + N3) + (1 − λ)N3 = N1 + N3 and λ(M1 + M2 + M3) + (1 − λ)M3 = M1 + M3. Simplifying these equations gives

3) λ = N1 / (N1 + N2), and
4) λ = M1 / (M1 + M2).

Recall the assumption that the density of C1 equals the density of C2, so N1/M1 = N2/M2, and thus N1 = (N2 · M1) / M2. Substituting this expression for N1 into 3) gives 4), so the two requirements on λ agree. With this λ, equations 1) and 2) hold, and the convexity of f then gives the final inequality above. Thus there is an optimal partition with C1 and C2 in the same block, and the equal density property is proved.
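As with convexity, the equal density property can be spot-checked numerically. This sketch (an illustration only, with randomly chosen cell counts and areas) constructs C2 with the same density as C1 and verifies that keeping C1 and C2 in one block is never worse than splitting them:

```python
import math
import random

def F(N, M, c=1.0):
    # Dr. Scargle's new objective function: F(N, M) = N log(N/M) - c.
    return N * math.log(N / M) - c

def equal_density_holds(trials=1000, seed=1):
    """Check max{f(C1uC2uC3)+f(C4), f(C1uC2uC4)+f(C3)} >= f(C1uC3)+f(C2uC4)
    whenever density(C1) == density(C2)."""
    rng = random.Random(seed)
    for _ in range(trials):
        N1, M1, M2, N3, M3, N4, M4 = (rng.uniform(0.1, 5.0) for _ in range(7))
        N2 = (N1 / M1) * M2            # force density(C2) == density(C1)
        split = F(N1 + N3, M1 + M3) + F(N2 + N4, M2 + M4)
        merged = max(F(N1 + N2 + N3, M1 + M2 + M3) + F(N4, M4),
                     F(N1 + N2 + N4, M1 + M2 + M4) + F(N3, M3))
        if merged < split - 1e-9:
            return False
    return True
```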
2.3 The Intermediate Density Property

The Intermediate Density Property says that in an optimal partition of a collection of data cells into arbitrary blocks, if C1, C2, and C3 are cells with density of C1 > density of C2 > density of C3, then when C1 and C3 are in the same block B of the optimal partition, so is C2.

Proof: Suppose C1, C2, C3 are cells with density of C1 > density of C2 > density of C3 among a larger collection of cells for which we are trying to find the optimal partition into blocks. Let C1 U C3 U C4 | C2 U C5 be the part of an optimal partition with C1 and C3 in one block, where C4 denotes the remainder of that block, and C2 in a second block, where C5 denotes the remainder of that block. We need to show that

max{ f(C2 U C3 U C5) + f(C1 U C4), f(C1 U C2 U C5) + f(C3 U C4), f(C1 U C2 U C3 U C4) + f(C5) } ≥ f(C1 U C3 U C4) + f(C2 U C5),

where f represents Dr. Scargle's new objective function; that is, separating C2 from C1 and C3 is never strictly better. Let Ni be the number of data points in Ci and Mi the area of Ci, with Ni, Mi > 0. It suffices to show that for some 0 ≤ λ1, λ2, λ3 ≤ 1 with λ1 + λ2 + λ3 = 1 the following hold:

N1 + N3 + N4 = λ1(N1 + N4) + λ2(N3 + N4) + λ3(N1 + N2 + N3 + N4), and
N2 + N5 = λ1(N2 + N3 + N5) + λ2(N1 + N2 + N5) + λ3(N5).
Substituting 1 − λ1 − λ2 for λ3 gives

λ1N3 + λ2(N1 + N2) = N2 and λ1(N2 + N3) + λ2(N1 + N2) = N2.

Solving this system of equations gives

λ1 = 0, λ2 = N2 / (N1 + N2), λ3 = N1 / (N1 + N2).

Thus

max{ f(C2 U C3 U C5) + f(C1 U C4), f(C1 U C2 U C5) + f(C3 U C4), f(C1 U C2 U C3 U C4) + f(C5) } ≥ λ1[f(C2 U C3 U C5) + f(C1 U C4)] + λ2[f(C1 U C2 U C5) + f(C3 U C4)] + λ3[f(C1 U C2 U C3 U C4) + f(C5)] ≥ f(C1 U C3 U C4) + f(C2 U C5).

The Intermediate Density Property is the basis of the Bayesian Blocks algorithm, which finds the optimal partition of a collection of cells into arbitrary blocks (regardless of the dimension). The property allows us to find this optimal partition by first sorting the cells by density and then considering only partitions into blocks of consecutive cells in this order, essentially reducing the problem to a 1-dimensional problem.

2.4 The Last Change Conjecture

The Last Change Conjecture says that the entries of the last change vector are always nondecreasing when finding the optimal partition of a set of cells sorted by density into arbitrary blocks. In proving this conjecture, we make use of several properties of Dr. Scargle's new objective function (1):
1) Scalar multiplication property:
k·f(x, y) = k(x log(x/y) − c) = kx log(kx/ky) − kc = kx log(kx/ky) − c − (k − 1)c = f(kx, ky) − (k − 1)c.

2) Additive property:
f(a, b) + f(d, e) = 2[f(a, b)/2 + f(d, e)/2] ≥ 2f((a + d)/2, (b + e)/2) = f(a + d, b + e) − c.

3) 2-dimensional convexity property:
f((x, y) + e(x0, y0)) + f((x, y) − e(x0, y0)) ≥ 2f(x, y).

4) 2-dimensional convexity property: for λ1 > λ2 > 0,
f((x, y) + λ1(x0, y0)) + f((x, y) − λ1(x0, y0)) ≥ f((x, y) + λ2(x0, y0)) + f((x, y) − λ2(x0, y0)).

4') 2-dimensional convexity property: for λ1 > λ2 > 0 and λ3 > λ4 > 0,
f((x, y) + (λ1x0, λ3y0)) + f((x, y) − (λ1x0, λ3y0)) ≥ f((x, y) + (λ2x0, λ4y0)) + f((x, y) − (λ2x0, λ4y0)).

Proof: Let Ci be the ith cell, with Ni data points and Mi area, Ni, Mi > 0. Assume we have a data set sorted by density such that density of C1 > density of C2 > density of C3 > density of C4; that is, N1/M1 > N2/M2 > N3/M3 > N4/M4. We must show that if
f(N1 + N2, M1 + M2) + f(N3, M3) ≥ f(N1, M1) + f(N2 + N3, M2 + M3), then
f(N1 + N2, M1 + M2) + f(N3 + N4, M3 + M4) ≥ f(N1, M1) + f(N2 + N3 + N4, M2 + M3 + M4).

Assume not; then f(N1 + N2, M1 + M2) + f(N3, M3) ≥ f(N1, M1) + f(N2 + N3, M2 + M3) while f(N1 + N2, M1 + M2) + f(N3 + N4, M3 + M4) < f(N1, M1) + f(N2 + N3 + N4, M2 + M3 + M4). Equivalently,

f(N1 + N2, M1 + M2) − f(N1, M1) < f(N2 + N3 + N4, M2 + M3 + M4) − f(N3 + N4, M3 + M4)

and

f(N1 + N2, M1 + M2) − f(N1, M1) ≥ f(N2 + N3, M2 + M3) − f(N3, M3),

thus

f(N2 + N3 + N4, M2 + M3 + M4) − f(N3 + N4, M3 + M4) > f(N2 + N3, M2 + M3) − f(N3, M3),

which implies

f(N2 + N3 + N4, M2 + M3 + M4) + f(N3, M3) > f(N2 + N3, M2 + M3) + f(N3 + N4, M3 + M4).

To find a contradiction, it is sufficient to find λ1 and λ2 such that

λ1(N2 + N3, M2 + M3) + λ2(N3 + N4, M3 + M4) = (N2 + N3 + N4, M2 + M3 + M4), and
(1 − λ1)(N2 + N3, M2 + M3) + (1 − λ2)(N3 + N4, M3 + M4) = (N3, M3).

From those equations we get

λ1(N2 + N3) + λ2(N3 + N4) = N2 + N3 + N4 and λ1(M2 + M3) + λ2(M3 + M4) = M2 + M3 + M4.
This is a system of 2 equations in 2 unknowns (assuming the Ni and Mi are known quantities), so solving for those 2 unknowns gives

λ1 = [ (N2 + N3 + N4)(M3 + M4) − (N3 + N4)(M2 + M3 + M4) ] / [ (N2 + N3)(M3 + M4) − (N3 + N4)(M2 + M3) ]
λ2 = [ −(N2 + N3 + N4)(M2 + M3) + (N2 + N3)(M2 + M3 + M4) ] / [ (N2 + N3)(M3 + M4) − (N3 + N4)(M2 + M3) ]

After proving the last change conjecture, we implemented it in the Bayesian Block algorithm; the MATLAB code can be found in Appendix A2, and the changes required altering only a few lines of code. We then compared the efficiency of the original Bayesian Block algorithm against the one implementing the last change conjecture, using a uniform example of size 10,000 where each data point is evenly spaced on the interval from zero to one. The value of the constant c was varied to change the number of blocks. The results of this comparison can be seen in Table 2.1 and Figure 2.1.

Table 2.1: Time comparison between the original Bayesian Block algorithm and the last change conjecture algorithm.
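The "few lines" of change can be illustrated as follows: in the dynamic programming loop, the search for the start of the final block begins at the previous step's last-change entry rather than at zero. This is a Python sketch (the report's actual implementation is the MATLAB code in the appendix), assuming cells sorted by density with at least one data point each:

```python
import numpy as np

def optimal_partition_lcc(n, m, c):
    """Partition search using the Last Change Conjecture: because the last
    change vector is nondecreasing, step k only scans starts >= last[k-1].
    n, m: counts and sizes of cells sorted by density; c: prior constant."""
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    N = len(n)
    cn = np.concatenate(([0.0], np.cumsum(n)))
    cm = np.concatenate(([0.0], np.cumsum(m)))
    best = np.zeros(N + 1)
    last = np.zeros(N + 1, dtype=int)
    for k in range(1, N + 1):
        j = np.arange(last[k - 1], k)        # restricted search window
        bn, bm = cn[k] - cn[j], cm[k] - cm[j]
        fit = best[j] + bn * np.log(bn / bm) - c
        idx = int(np.argmax(fit))
        last[k] = last[k - 1] + idx
        best[k] = fit[idx]
    cps, k = [], N
    while last[k] > 0:
        cps.append(int(last[k]))
        k = last[k]
    return sorted(cps)
```

When the conjecture holds, each block's cells are scanned only within their own block's window, which is the source of the O(N²/B) behavior discussed below.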
Figure 2.1: Graphical representation of Table 2.1.

The results of the example showed that for every number of blocks in the optimal partition, the last change conjecture algorithm outperformed the original. The efficiency of the last change conjecture implementation is O(N²/B), where B is the number of blocks, so we can see from Figure 2.1 that the efficiency improves in proportion to the number of blocks.

2.5 Experiments with Bayesian Blocks Algorithm

Here we would like to introduce some interesting experiments with Dr. Scargle's Bayesian Block algorithm.

2.5.1 Intuitive Block

One of the best ways to understand an algorithm is to run it on a small dataset. Here we ran the Bayesian Block algorithm on a data set consisting of 5
data points: [2, 2.01, 2.03, 10.1, 10.11]. There are four possible cut positions in the dataset, and the most 'intuitive' cut is between 2.03 and 10.1. However, running the Bayesian Block algorithm gives an unintuitive answer: two cuts, at 2.005 and 10.105.

Figure 2.2: Intuitive cut.

Figure 2.3: Algorithm's result.

The best partition of the dataset is a high intensity interval near 2, a high intensity interval near 10, and a low intensity interval between 2 and 10, which is the partition that Dr. Scargle's function describes as the most likely. To find out what causes the cut to be unintuitive, we checked every possible cut and its objective value.

Figure 2.4: Each element of the list represents one of the four cut positions in the dataset; 0 means there is no cut at that position and 1 means there is a cut. For example, [0, 0, 1] means there is a cut in the 3rd position.
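The brute-force check can be sketched as follows. This is a toy illustration using assumed cell counts and sizes (the report's exact Voronoi cells depend on interval endpoints not given here); it scores every cut pattern with the block fitness N log(N/M) − c:

```python
import math
from itertools import product

def block_fitness(n, m, c):
    # Fitness of one block with n points and total size m: n*log(n/m) - c.
    return n * math.log(n / m) - c

def best_partition_exhaustive(n, m, c):
    """Try every possible set of cuts between consecutive cells and return
    (best cut pattern, its total fitness). Exponential in the number of
    cells, so only usable on tiny examples like this one."""
    k = len(n)
    best = None
    for cuts in product([0, 1], repeat=k - 1):
        total, start = 0.0, 0
        for i, cut in enumerate(list(cuts) + [1]):  # implicit final cut
            if cut:
                total += block_fitness(sum(n[start:i + 1]),
                                       sum(m[start:i + 1]), c)
                start = i + 1
        if best is None or total > best[1]:
            best = (cuts, total)
    return best
```

For instance, on six unit-count cells where the first three are ten times smaller than the last three, the exhaustive search places a single cut between the dense and sparse groups.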
Surprisingly, our intuitive cut turned out to have the worst objective function value. The reason is that the intuitive cut is penalized heavily by the length factor M in the objective function (1): the objective value becomes very small when M (the distance spanned by a block) gets large. The goal of Dr. Scargle's objective function is to partition the whole data space, not just the data points, and it does a very good job of that. Still, this specific dataset illustrates the problem that arises when a low intensity region is adjacent to high intensity regions on either end. We realized this is a shortcoming of the Voronoi diagram, because every Voronoi cell must contain one data point.

a. Space Dust (Pseudo Points) Method

To avoid penalizing a block that happens to be next to a low density region, we decided to add extra points that are not data points; to distinguish them from regular data points, we give them a very small weight. We call them space dust. For example, true data points have weight 1.0, while space dust has a very small weight such as 1/1,000,000. We can then form the usual Voronoi diagram in the following modified way. If two points x and y are both data points or both space dust, then the points between x and y are assigned as usual: those closest to x go to the cell containing x, and those closest to y go to the cell containing y. However, when x is space dust and y is a regular data point, a point z between x and y goes to the cell containing x only if the distance from z to y is more than 1,000,000 times larger than the
distance from z to x; otherwise z goes to the cell containing y. This is a sort of weighted version of the Voronoi diagram. In theory, the weighting of the Voronoi diagram is meant for situations where a particle of space dust lies between two regular data points. If a dust particle z sits between two data points x and y that are roughly distance d apart, then the cell containing z has length roughly d/1,000,000, which means the density of the cell containing z will likely be about the same as the densities of the cells containing x and y. Then x, y, and z end up in the same block of the optimal partition, so a dust particle between two data points is unlikely to change the optimal partition. This method raised two questions, which offer perspectives for future research: first, how many particles of space dust should be added, and second, how to choose the window in which to place the dust.

2.5.2 How to Find the Block Faster

Suppose there is a symmetrical gamma ray burst. Gamma ray bursts emit highly energetic photons, and we can think of one as 1-dimensional time-tagged event (TTE) data that is symmetrical about its maximum emission. Under the assumption that the TTE data is symmetrical, we can use some 'tricks' to identify significantly distinct pulses more efficiently.

a. Folding Method

Assuming that we know the center of symmetry, the Bayesian Block algorithm can run much faster than usual.
For example, suppose we have a set of times t1, t2, …, tn at which photons are detected on the interval [0, 1], and suppose we know that the gamma ray burst is symmetrical at t = ½. If the times satisfy t1 < t2 < … < tm < ½ < tm+1 < tm+2 < … < tn, we can fold the interval at t = ½, average matching data points, and apply the usual 1-dimensional dynamic programming algorithm on the interval [0, ½]. After that, we unfold the data and use the partition obtained on [0, ½] together with its reverse on [½, 1]. Python code for this folding method is given in Appendix 7.2.

3. Arbitrarily Connected Bayesian Blocks

Given a 2-dimensional dataset, we can find the optimal partition using Dr. Scargle's new objective function. This allows us to adaptively bin areas of space that are roughly equal in density. The optimal partition is chosen using dynamic programming to decide which arrangement of cells maximizes Dr. Scargle's fitness function. Using a binning method dependent on the composition of the data, also known as "Bayesian Blocks", allows for more detailed and significant results.

Before applying 'optinterval1.m' (included in Appendix A3) to a dataset, we first partition the data into either a Voronoi diagram or a Delaunay triangulation so we can determine the cells of the dataset and their respective sizes. The 'optinterval' function takes inputs 'n', 'a', and 'c', where 'n' is the number of data points per cell, 'a' is the
sorted areas of each cell, and 'c' is a constant based on the prior distribution. Dr. Scargle's new objective function is a 1-dimensional application; in order to apply it to multi-dimensional data, the data must be reduced to a single dimension. By letting input 'a' be areas, volumes, or a side of a polygon or triangle, this 1-dimensional algorithm can be applied to multi-dimensional data. After running the objective function, the resulting blocks of the optimal partition contain cells of roughly equal density, so the density within each block is nearly constant. The Bayesian Blocks are ordered from highest to lowest density, such that the first block has the highest relative density and the last block the lowest. While the optimal configuration of the dataset groups cells together based on size, cells do not need to be connected in order to belong to the same block. This is why the results of 'optinterval1.m' are arbitrarily connected Bayesian Blocks: the algorithm does not take the connectedness of cells into account.

3.2 Connected Components

Our research focused on finding connected components of space that are roughly equal or similar in density. To find optimal or near-optimal partitions of a 2-dimensional dataset into connected components, we used Dr. Scargle's new objective function to first organize our data by density, and then examined connected components (unions of cells) within k density levels.
The basic idea for finding connected components of relative density is to take the data binned into blocks by relative density, where

Density(Block 1) ≥ Density(Block 2) ≥ … ≥ Density(Block n)

Each block represents a different density level. We then iterate over the blocks to find connected components within k density levels.

3.3 Connected Components: Input/Output

The ‘conncomponents2.m’ function written by Dr. Bradley Jackson, found in Appendix A3, is called as

[x, y, z, w] = conncomponents2(blocks, n, a, adjmat)

The function takes four inputs: blocks, n, a, and adjmat. The ‘blocks’ input is a cell array containing the cells of each Bayesian Block. Input ‘n’ is an array containing the number of data points per cell. The ‘a’ input is a sorted array containing the areas of each cell. The last input, ‘adjmat’, is a sparse matrix of adjacent pairs of cells. Inputs ‘n’, ‘a’, and ‘adjmat’ share the same index for each cell. The ‘adjmat’ input was obtained using a function written by Dr. Jackson called ‘Delaunay2dadjacencies.m’. The ‘conncomponents2’ function has four outputs: x, y, z, and w. Output ‘x’ is a cell array of the resulting connected components. ‘y’ is an array containing the number of data points in each component. The ‘z’ output is an array containing the areas of each component. Lastly, ‘w’ is a sparse matrix of component adjacency pairs.
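The grouping rule can be sketched in Python as a depth-first search (an illustration, not the MATLAB implementation; here components are restricted to a single density level, as in ‘conncomponents2.m’, and all names are ours):

```python
def connected_components(blocks, adjacency):
    """Group cells into components: two cells join the same component
    only if they are adjacent AND lie in the same density block.
    blocks    : list of lists of cell indices, one list per block
    adjacency : dict mapping a cell index to a set of neighbor cells"""
    block_of = {c: b for b, cells in enumerate(blocks) for c in cells}
    seen, comps = set(), []
    for start in block_of:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = [], [start]
        while stack:
            c = stack.pop()
            comp.append(c)
            for nb in adjacency.get(c, ()):
                # expand only to unseen neighbors in the same block
                if nb not in seen and block_of.get(nb) == block_of[c]:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(sorted(comp))
    return comps
```

Allowing cells within k density levels, as ‘conncomps2mod.m’ does for k = 2, amounts to relaxing the equality test on `block_of` to a bound on the block-index difference.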
3.6 Testing and Analysis

We used two different 2-dimensional datasets to test and analyze optimal or near-optimal partitions into connected components. The “Counties of California by Population Density” (CCPD) dataset has dimensions 58 x 2, listing all California counties with their population densities. The Sloan Digital Sky Survey (SDSS) dataset has dimensions 1523 x 2. We used CCPD as a small, known dataset to test our logic when making changes to the algorithms. The CCPD dataset is already binned and sorted, but the SDSS data is not. We used a Delaunay triangulation to format the SDSS data, which limits the number of edge adjacencies to at most 3 per cell. This is preferred over using a Voronoi diagram, in which a cell can have 3 or more adjacent cells. To call and plot ‘optinterval1’, ‘conncomps2’, and ‘conncomps2mod’, we used ‘conncomp_script.m’, ‘bayesian_plot.m’, ‘population_density.m’, and ‘belongs_to.m’. All helper functions and scripts are included in Appendix A3. After running ‘optinterval1’ on both CCPD and SDSS, we obtained a similar number of partitions relative to the dataset size. CCPD yields 8 blocks, and SDSS is represented by 6 blocks. In both datasets, smaller blocks represent the highest and lowest density regions, while the intermediate blocks contain the majority of cells. For example, the block partition for SDSS is [215 779 1458 2129 2731 3017], and these blocks have sizes [215, 564, 679, 671, 602, 286], respectively. The lowest and highest density regions, block 1 and block 6, have significantly fewer cells
compared to blocks 2, 3, 4, and 5. Thus, the arbitrarily connected blocks algorithm finds significance in blocks whose density is sufficiently dissimilar (either higher or lower) from the majority. However, these blocks contain cells that are not necessarily connected. After running ‘conncomps2.m’ on both CCPD and SDSS, the resulting partitions differed greatly from the arbitrarily connected Bayesian Blocks. In CCPD, 27 connected components in the same density level were found; for the larger dataset, SDSS returned 835 connected components. The number of components increased considerably relative to the size of the data. Unlike the arbitrarily connected blocks, which only filter blocks by relative density, the connected Bayesian Blocks must be both relatively close in density and connected. This stricter criterion for grouping cells increased the number of partitions to such a degree that we lost significance in our results. We ran ‘conncomps2mod.m’ to add another density level to our connected components. Again, both datasets behaved similarly, with a decrease in the number of partitions relative to ‘conncomps2.m’: 11 connected components were found in CCPD, along with 205 in SDSS. Because we included more cells in our search for adjacencies, more connected cells with similar relative densities were grouped together. At 2 density levels, the number of components is higher than that of the arbitrarily connected blocks, yet lower than at density level 1. This is because the connected components at 2 density levels reflect both connectedness and relative density, with less restriction than at density level 1.
The CCPD and SDSS datasets were used to test three algorithms: arbitrarily connected Bayesian blocks, connected Bayesian blocks at 1 density level, and connected Bayesian blocks at 2 density levels. Figure 3.1 shows the results of running these algorithms on the CCPD data. The arbitrarily connected Bayesian blocks have the smallest number of partitions. For connected Bayesian blocks at 1 density level, cells in the same component must not only have the same density but also be connected, which increases the number of components. When CCPD is analyzed for connected Bayesian blocks at 2 density levels, the number of components decreases from density level 1, but is still higher than for the arbitrarily connected Bayesian blocks. The criterion for cells in each component is relaxed from requiring the same density level to allowing similar density levels, admitting more cells into the same component. The same pattern is seen in the SDSS data, as shown in Figure 3.2.

Figure 3.1: Output for the Counties of California by Population Density dataset
Figure 3.2: Output for the Sloan Digital Sky Survey dataset
4. Clusters

A galaxy is a huge collection of gas, dust, and stars. In this section, we are interested in finding clusters of galaxies in a collection of stars using the Bayesian Blocks algorithm and the HOP algorithm.

4.1 Dataset

The dataset we use in this section is a 2-dimensional data slice obtained from the Sloan Digital Sky Survey, which has been one of the most successful surveys in the history of astronomy. Our dataset contains 1523 data points, shown in Figure 4.1 below.

Figure 4.1: 2-dimensional Sloan Digital Sky Survey raw data
As Figure 4.1 shows, the data points are not scattered uniformly: there are some high-density regions and some low-density regions. We will classify the regions into different clusters using algorithms based on density calculations.

4.2 Bayesian Blocks

We use the Bayesian Blocks algorithm to find the optimal partitions based on the equal density property. There are two ways to get the density of the data points: the Voronoi diagram and the Delaunay triangulation.

4.2.1 Voronoi Diagram

As discussed in the previous section, “a Voronoi diagram is a partition of a plane into regions based on distance to points in a specific subset of the plane” (Wikipedia, 2014). Each Voronoi cell contains exactly one data point, so the density of a Voronoi cell is the number of data points it contains, namely one, divided by the area of the cell. Using the Bayesian Blocks algorithm, cells with the same density are grouped into one block, giving 8 blocks for our dataset. The blocks are shown in Figure 4.2 below; cells of the same color are in the same block. Red blocks are low-density blocks, while blue blocks are high-density blocks. We notice that the cells in the same block are not necessarily connected. In addition, in this part we ignore Voronoi cells that have infinite edges or whose edges extend beyond the boundary. This guarantees finite cell areas, but it is also a limitation of the method: we lose some data points near the boundaries.
Figure 4.2: Blocks based on the Voronoi diagram

4.2.2 Delaunay Triangulation

The Delaunay triangulation is dual to the Voronoi diagram, with one Voronoi vertex at the circumcenter of each Delaunay triangle. A Delaunay triangle has three data points as its vertices, and each triangle is counted as containing half a data point, since each vertex is shared among neighboring triangles. Using the Delaunay triangulation, we do not lose any data points from our original dataset. One way to calculate the density is to take the number of data points, one half, and divide it by the area of each triangle. As with the Voronoi diagram, triangles with the same density are put into one block. This gives 10 blocks, shown in Figure 4.3. Triangles of the same color are in the same block, and different colors represent different blocks. However, some triangles with long edges are separated from their neighbors into different blocks because their areas are small.
Figure 4.3: Bayesian blocks based on the Delaunay triangulation using area

Another way to calculate the density is to take the number of data points and divide it by the maximum edge length of each triangle. Applying the Bayesian Blocks algorithm this way, we obtain 6 Bayesian Blocks. Figure 4.4 shows the 6 blocks, each in a different color; red represents a low density level and blue a high density level.

Figure 4.4: Blocks based on the Delaunay triangulation with maximum edge length
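Both triangle density measures can be written as a small Python helper (an illustrative sketch, not part of the report’s MATLAB code; `use_max_edge` switches between dividing the half data point by the triangle’s area and by its longest edge):

```python
import math

def triangle_density(p1, p2, p3, use_max_edge=False):
    """Density of one Delaunay triangle: each triangle is counted as
    holding half a data point, divided by a size measure of the triangle."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # cross-product (shoelace) formula for the triangle's area
    area = 0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    if use_max_edge:
        size = max(math.dist(p1, p2), math.dist(p2, p3), math.dist(p3, p1))
    else:
        size = area
    return 0.5 / size
```

For a long, thin triangle the max-edge density is much closer to its neighbors’ densities than the area density is, which is why the max-edge measure keeps such triangles in the same block as their neighbors.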
Figure 4.5 below is a zoomed-in view of Figure 4.4, which illustrates the advantage of this method. On the left side of the picture, the triangles with long edges are classified into the same block as their neighbors. This reduces the number of Bayesian Blocks from 10 to 6.

Figure 4.5: Zoomed-in view of Figure 4.4

4.3 Finding Clusters

As we have seen, the Bayesian Blocks method is very efficient at finding optimal partitions, and it works especially well for partitions whose components are not necessarily connected. It has found many applications, such as detecting gamma-ray bursts, flares in active galactic nuclei, and other variable objects such as the Crab Nebula. Although the Bayesian Blocks method finds partitions efficiently, it is of limited use when we want to find clusters, in which the components must be connected. So we introduce a different algorithm (HOP) for identifying clusters of connected components based on the results of the Bayesian Blocks algorithm.
Finding clusters with the HOP algorithm takes two steps. First, we find the connected components of the Bayesian blocks. Second, we apply the HOP algorithm to the connected components to find clusters. In addition, if we want to reduce the number of clusters, we can apply the HOP algorithm again to those clusters. The details of each step are described below.

4.3.1 Connected components

There are many ways to define a connected component. In our research, we use the most straightforward definition: two cells in the same block that share an edge belong to the same connected component. Using this definition, we found 835 connected components, illustrated in Figure 4.6. Different colors represent different Bayesian blocks, and the number on each cell gives its connected-component index. The three highlighted triangles have the same background color, which means they are in the same block. The numbers on the orange and white triangles are the same, so they are in the same connected component, because they share a common edge. But the number on the purple triangle is different from the other two: it belongs to a different connected component, because it does not share an edge with either of them.
Figure 4.6: Connected Components

4.3.2 Clusters

How can we find clusters? We apply the HOP algorithm to the connected components. The density of a component is defined as the number of data points in the connected component divided by the maximum length of that component. Every connected component is joined to its densest neighbor of higher density. Take the circled area in Figure 4.7 as an example: the dark blue area has a higher density than the light blue area, and they are in different connected components. Visually, we would expect them to be in the same cluster, and after applying HOP they are, as shown in Figure 4.8. In Figure 4.8, the number on each cell gives its cluster number. We finally get 249 clusters. This seems like too many clusters, so we need to merge some of them.
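The hopping rule can be sketched in Python (an illustrative version, not the report’s implementation; `density[i]` and `neighbors[i]` would come from the connected components described above):

```python
def hop_clusters(density, neighbors):
    """HOP sketch: each component points to its densest neighbor that is
    denser than itself; following the hops, every component drains to a
    local density maximum, and components sharing a maximum form a cluster.
    density   : density of each component
    neighbors : neighbors[i] = indices of components adjacent to i
    Returns the index of the local maximum each component hops to."""
    hop = list(range(len(density)))
    for i, nbrs in enumerate(neighbors):
        denser = [j for j in nbrs if density[j] > density[i]]
        if denser:  # local maxima keep hop[i] == i
            hop[i] = max(denser, key=lambda j: density[j])

    def peak(i):
        # follow hops until reaching a local density maximum
        while hop[i] != i:
            i = hop[i]
        return i

    return [peak(i) for i in range(len(density))]
```

Running the same function again, with each cluster’s density and adjacency in place of the components’, gives the repeated-HOP reduction described in Section 4.3.3.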
Figure 4.7: Connected Components

Figure 4.8: HOP on Connected Components

4.3.3 Fewer clusters

To reduce the number of clusters, we apply HOP again to the clusters already found. The density is defined as the number of data points in the cluster divided by the maximum length of that cluster, and every cluster is joined to its densest neighbor of higher density. Every cell is then assigned a new cluster number, shown in Figure 4.9. In the end, we reduced the number of clusters to 46.

Figure 4.9: HOP on Clusters

4.4 Comparison

We also compared the HOP-only algorithm against the HOP algorithm
on Bayesian Blocks. HOP alone seems overly sensitive to minor fluctuations in the data, which can create a large number of local maxima. In our research, it gives 390 clusters, while the HOP algorithm on Bayesian Blocks gives 249 clusters. Examining a small area also shows how the HOP-only clusters are affected by the density of each cell. For instance, in the highlighted area in Figure 4.10, three triangles form three different clusters according to their densities; in Figure 4.11, however, they are in the same cluster. In sum, the HOP-on-Bayesian-Blocks approach is more robust.

Figure 4.10: HOP only (390 clusters)

Figure 4.11: HOP on Bayesian Blocks (249 clusters)

4.5 Conclusion

The HOP algorithm is probably the most efficient cluster-finding algorithm. But when it is applied alone, for astronomical data in particular, it seems overly sensitive to small fluctuations in the density and often produces many tiny clusters. However, when we apply the HOP algorithm to the connected components of Bayesian Blocks, it leads to fewer, larger clusters, although it still takes a large amount of time to run
the Bayesian Blocks algorithm. Compared with the 835 connected components of the Bayesian Blocks, the “HOP + connected components” method gives only 249 clusters. Nevertheless, further reducing the number of clusters is still desirable for analyzing astronomical data, so we repeat the HOP algorithm on the connected components of the Bayesian Blocks and obtain 46 clusters. Repeating the HOP algorithm on connected components may be the best method we used during our research, but such repetition carries risk. By repeating the HOP algorithm, we may end up joining two clusters, each with a very high local maximum, at a very low density level; in most cases it would be more appropriate for them to remain separate. Furthermore, both HOP and repeated HOP on the connected components of Bayesian Blocks are probably not effective at finding filamentary clusters, because they partition based only on density without considering the shape of clusters, while filamentary clusters are usually a significant portion of astronomical data.
5. Future Directions of Research

In this section we present topics that we were interested in but did not have time to complete. These ideas could be pursued by a curious researcher.

5.1 Properties of the New Objective Function

One avenue of future research is to look into other ways of increasing the efficiency of the Bayesian Blocks algorithm. The two methods we investigated, vectorization and implementing the last change property, worked well, but they are certainly not the only options. Another aspect of the project we would have liked to spend more time on is determining which algorithm for the symmetric Bayesian Blocks problem performs best. We would also have liked to find other, possibly more efficient, algorithms for the symmetric Bayesian Blocks problem.

5.2 Clusters of Galaxies and Stars

Our research focused on 1-dimensional and 2-dimensional Voronoi diagrams, so using the Bayesian Blocks and HOP algorithms on 3-dimensional Voronoi diagrams to find clusters is one possible future direction. Which HOP variant to apply is also worth further research. Besides the version implemented in our research, other variants of the HOP algorithm include:

a) joining a cell to its closest neighbor;
b) joining a cell to another along the steepest gradient, which is an improvement on hopping to a cell’s densest neighbor;

c) joining a cell to a denser neighbor with the lowest density.

Moreover, it is worth exploring whether the Bayesian Blocks algorithm, on whose connected components we apply repeated HOP, can be sped up further. Finally, because repeated application of the HOP algorithm eventually results in a single cluster, it would be useful to know when to stop applying it so as to avoid combining two clusters that should remain separate.
6. References

Eisenstein, D. J. & Hut, P. (1998). HOP: A New Group-Finding Algorithm for N-Body Simulations. The Astrophysical Journal, 498, 137-142.

Jackson, B., et al. (2004). An Algorithm for Optimal Partitioning of Data on an Interval. IEEE Signal Processing Letters, 12 (2), 105-108.

Jackson, B., et al. (2003). Optimal Partitions of Data in Higher Dimensions.

Scargle, J. D. (1998). Studies in Astronomical Time Series Analysis. V. Bayesian Blocks, a New Method to Analyze Structure in Photon Counting Data. The Astrophysical Journal, 504, 405.

Scargle, J. D., et al. (2012). Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations. The Astrophysical Journal, 764 (2).
50. 50. 50 7. Appendix 7.1 MATLAB Code This section includes all of the MATLAB code that we created or implemented during this project. All of the code in this section (and the Python code section) should suffice to replicate our work. 7.1.1. Vectorized BayesianBlock Algorithm To improve the efficiency of an algorithm one easy solution is to implement a vectorized version of the original. This algorithm performs the same operations as the original Bayesian Block algorithm, but replaces the inner for loop with a vectorized function call and a vectorized calculation. % %%Vectorized version of the Bayesian Block Algorithm %This vectorized version should increase the efficiency %of the original BB algorithm % function [part,opt,lastchange] = optinterval1(N,A,c) % % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval % c is a specified constant % % An O(N^2) algorithm for finding the optimal partition of N data points on an interval n = length(N);%gets the number of cells opt = zeros(1,n+1); %initialization lastchange = ones(1,n); %initialization changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n); optint = zeros(1,n);
51. 51. 51 % Consider all possible last blocks for the optimal partition % % endN = the number of points in the possible last blocks % endA = the lengths of the possible last blocks for i = 1:n endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i); endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i); % Computing the values for all possible last blocks in endobj newc=c*ones(size(endN)); endobj(1:i)= arrayfun(@newobjective,endN(1:i),endA(1:i),newc); % Computing the values for all possible optimal partitions in optint optint(1:i) = opt(1:i)+endobj(1:i); % The optimal partition is the one with the maximum value % opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0. % lastchange(i) is the first cell in the last block of an optimal partition [opt(i+1),lastchange(i)] = max(optint(1:i)); end % backtracking to find the blocks of the optimal partition and its change points i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i,1) = k; end end 7.1.2. Last Change Conjecture Implementation of BayesianBlock Algorithm Another solution to increase the efficiencyof the Bayesian Block algorithm was to code in a property of the new objective function. The last change conjecture (Section 2.4) says that when the data set (data cells) are in increasing order the last chance point is always nondecreasing. This property allows us to eliminate some of the unnecessary computation in the inner loop. % %%Last Change Conjecture implementation of Bayesian Block Algorithm %a dynamic programming algorithm with efficiency O(N^2/B)
52. 52. 52 %where B is the number of blocks %Makes use of vectorization % function [part,opt,lastchange] = newoptinterval1(N,A,c) % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval % c is a specified constant % % An O(N^2) algorithm for finding the optimal partition of N data points on an interval %gets the number of cells n = length(N); %initialization opt = zeros(1,n+1); lastchange = ones(1,n); changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n); optint = zeros(1,n); % Consider all possible last blocks for the optimal partition % % endN = the number of points in the possible last blocks % endA = the lengths of the possible last blocks % % k = the value of the last change point % % algorithm only checks up to k for the current change point for i = 1:n if i==1 k=1; else k= lastchange(i-1); end endN = ones(1,i); endN(k:i) = endN(k:i)*changeN(i+1)-changeN(k:i); endA = ones(1,i); endA(k:i) = endA(k:i)*changeA(i+1)-changeA(k:i); % Computing the values for all possible last blocks in endobj newc=c*ones(size(endN(k:i))); endobj(k:i)= arrayfun(@newobjective,endN(k:i),endA(k:i),newc);
53. 53. 53 % Computing the values for all possible optimal partitions in optint optint(k:i) = opt(k:i)+endobj(k:i); % The optimal partition is the one with the maximum value % opt(i+1) is the optimal value of a partition of the first i cells, opt(1) = 0. % lastchange(i) is the first cell in the last block of an optimal partition [opt(i+1),lastchange(i)] = max(optint(k:i)); %Adjust last change point values to correspond to correct points lastchange(i)=lastchange(i)+k-1; end % backtracking to find the blocks of the optimal partition and its changepoints i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i,1) = k; end end 7.1.3. Code for finding Bayesian Blocks and connected components using BayesianBlock algorithm p=presentation2d; tri=delaunay(p); [n,~]=size(p); [m,~]=size(tri); for i=1:m maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)),norm(p(tri(i,1),:)-p(tri(i,3),:)),norm(p(tri(i,3),:)-p(t ri(i,2),:))]); end [r,s] = sort(maxdist); [x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10); dt = DelaunayTri(p); triplot(dt) blockno = 6; blockends = [215 779 1458 2129 2731 3017];
54. 54. 54 for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); cellblock(s(first:last))=i ; end voronoi(p(:,1),p(:,2)) for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) % Assign labels to the triangles. numtri = size(tri,1); triangleno = [1:3017]; plabels = arrayfun(@(n) {sprintf('%d', triangleno(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off %computing the adjacies of the Delaunay triangulion and the components of %Baysian blocks [adjmat,adjcell] = Delaunay2dadjacencies(p); n =.5*ones(3017,1); a = transpose(maxdist); [r,s] = sort(a./n); blocks = cell(6,1); for i = 1:6 if i == 1 first = 1;
else first = blockends(i-1)+1; end; last = blockends(i); blocks{i} = s(first:last); end [w,x,y,z] = conncomponents2(blocks,n,a,adjmat); % Assign labels to the triangles. for i=1:835 cellcomp(w{i}) = i; end for i = 1:3017 pp(i,:) = [(p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3]; end triplot(dt) numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)'); hold on Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ... 'bold', 'HorizontalAlignment','center', ... 'BackgroundColor', 'none'); hold off blockno = 6; blockends = [215 779 1458 2129 2731 3017]; for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); for j = first:last patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i); % use color i. end end function [part,opt,lastchange] = optinterval1(N,A,c) % An O(N^2) algorithm for finding the optimal partition of N data points on an interval % N is a vector containing the number of data points in each interval % A is a vector containing the length of each interval
56. 56. 56 n = length(N); opt = zeros(1,n+1); lastchange = ones(1,n); changeA = zeros(1,n+1); changeN = zeros(1,n+1); changeA(2:n+1)=cumsum(A); changeN(2:n+1) = cumsum(N); endobj = zeros(1,n+1); optint = zeros(1,n+1); for i = 1:n endN = ones(1,i); endN = endN*changeN(i+1)-changeN(1:i); endA = ones(1,i); endA = endA*changeA(i+1)-changeA(1:i); for j = 1:i endobj(j) = newobjective(endN(j),endA(j),c); optint(j) = opt(j)+endobj(j); end [opt(i+1),lastchange(i)] = max(optint(1:i)); end i = 1; part(1,1)=n; k = n; while k > 0 i=i+1; k = lastchange(k)-1; if k > 0 part(i) = k; end end function [value] = newobjective(x,y,c) % A function that computes the objective function value of a block % x is the length of the block and y is the number of data points in the block value = x*(log(x/y)) - c; function [adjmat,adjcell] = Delaunay2dadjacencies(p) % p is a 2 x n matrix of 2-d points % computing adjacencies of Delaunay triangles from Voronoi edges [vert,bound] = voronoin(p); tri = delaunay(p); [j,~] = size(vert); adjmat = sparse(j,j); adjcell = cell(j,1); [n,~] = size(p); %storing adjacencies in a sparse adjacency matrix for i=1:n num = length(bound{i}); for h = 1:num if h < num adjmat(bound{i}(h),bound{i}(h+1))=1; adjmat(bound{i}(h+1),bound{i}(h))=1; else
58. 58. 58 end end nonempty = []; for i = 1:cellno if length(comps{i}) > 0 nonempty = [nonempty i]; end end comps = comps(nonempty); for i = 1:length(comps) ncomps(i) = sum(n(comps{i})); acomps(i) = sum(a(comps{i})); color(comps{i}) = i; end [row,col] = find(adjmat); adjcomps = sparse(length(row),length(row)); for i = 1:length(row) if color(row(i)) ~= color(col(i)) adjcomps(color(row(i)),color(col(i))) = 1; adjcomps(color(col(i)),color(row(i))) = 1; end end %%% conncomp_script.m % Import Data from F14PresentationData p = presentation2d; % computing the Bayesian Blocks partition of the Delaunay triangulation using maxdist for the size of a cell tri = delaunay(p); [tri_size, ~] = size(tri); for i = 1:tri_size maxdist(i) = max([norm(p(tri(i,1),:)-p(tri(i,2),:)),norm(p(tri(i,1),:)-p(tri(i,3),:)),norm(p(tri(i,3),:)-p(t ri(i,2),:))]); end % use maxdist between points in the triangle as area and sort area [r,s] = sort(maxdist); % create optimal partition
59. 59. 59 [x,y,z] = optinterval1(.5*ones(length(s),1),maxdist(s),10); % blockends blockno = length(x); blockends = fliplr(transpose(x)); % conncomps2 takes transpose n =.5*ones(3017,1); a = transpose(maxdist); [r,s] = sort(a./n); % initialize blocks blocks = cell(blockno,1); % assign cells blocks cell array for i = 1:blockno if i == 1 first = 1; else first = blockends(i-1)+1; end last = blockends(i); blocks{i} = s(first:last); end % Computing the adjacencies of the Delaunay triangulation and the components of the Bayesian Blocks [adjmat,adjcell] = Delaunay2dadjacencies(p); % calculate connected components [comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,n,a,adjmat); % Assign labels to the triangles for i=1:length(comps) cellcomp(comps{i}) = i; end % draw and color connected components triplot(dt) numtri = size(tri,1); plabels = arrayfun(@(n) {sprintf('%d', cellcomp(n))}, (1:numtri)'); map = colormap(hsv(length(comps))); for i = 1:length(comps) current_comp = comps{i}; for j = 1:length(current_comp) patch(p(tri(current_comp(j),:),1),p(tri(current_comp(j),:),2),map(i,:)); % use color i. end end
60. 60. 60 %% conncomps2mod function %% finds connected cells within 2 density levels and groups them into components %% returns components, number of data points in components, area of components, %% and adjacent components function [comps,ncomps,acomps,adjcomps] = conncomps2mod(blocks,n,a,adjmat) % figuring out references to blockends blockends=[]; for temp=1:length(blocks) blockends=[blockends blocks{temp}(length(blocks{temp}))]; end % finding the connected components of a partition given by cell array blocks % two cells with the same color are in the same component cellno = length(n); color = [1:cellno]; for i=1:cellno comps{i} = [i]; end; % going through adjacencies to update components for i = 1:length(blocks) % find all pairs of cells in current block that are adjacent % and store their cell number [row,col] = find(adjmat(blocks{i},blocks{i})); % finding adjacencies between blocks{i} and blocks{i+1} if i<length(blocks) [row2, col2] = find(adjmat(blocks{i},blocks{i+1})); % reindexing row2 and col2 -> all indexes in row2 and col2 reference % a cell in either block(i) or block(i+1) for k=1:length(row2) row2_block_number = belongs_to(blocks{i}(row2(k)), blocks); col2_block_number = belongs_to(blocks{i+1}(col2(k)), blocks); if row2_block_number row2(k) = blocks{row2_block_number}(row2(k)); else row2(k) = blocks{i+1}(row2(k)); end if col2_block_number col2(k) = blocks{col2_block_number}(col2(k)); else col2(k) = blocks{i+1}(col2(k)); end
61. 61. 61 end end % grab cell number of adjacency index from block cell array row = blocks{i}(row); col = blocks{i}(col); row = vertcat(row, row2); col = vertcat(col, col2); % comment out above two lines and uncomment the below two lines % depending if the data needs horzcat versus vertcat % row = horzcat(row,transpose(row2)); % col = horzcat(col, transpose(col2)); for j = 1:length(row) % if the color of the two adjacent cells don’t already belong to same component if color(row(j)) ~= color(col(j)) % number of components in color array for cell in row is greater than number of % components in color array for cell in col if length(comps{color(row(j))}) >= length(comps{color(col(j))}); % check to see if the incoming cell(s) are within one % level density apart head_cell_block_number = belongs_to(color(row(j)),blocks); possible_guests = belongs_to(color(col(j)),blocks); difference = head_cell_block_number - possible_guests; % do nothing if components contain cells from blocks of % more than one density apart if(abs(difference) > 1) comps{color(col(j))} = comps{color(col(j))}; else % place union of components for adjacent cells in components % for row cell comps{color(row(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(col(j))}; y = color(col(j)); % store new location of cell in color color(x) = color(row(j)); comps{y} = []; end else
62. 62. 62 % check to see if the incoming cell(s) are within one % level density apart head_cell_block_number = belongs_to(color(col(j)),blocks); possible_guests = belongs_to(color(row(j)),blocks); difference = head_cell_block_number - possible_guests; % do nothing if components contain cells from blocks of % more than one density apart if(abs(difference) > 1) comps{color(row(j))} = comps{color(row(j))}; else % place union of components for adjacent cells in components % for col cell comps{color(col(j))} = union(comps{color(row(j))},comps{color(col(j))}); x = comps{color(row(j))}; y = color(row(j)); % store new location of cell in color color(x) = color(col(j)); comps{y} = []; end end end end end nonempty = []; for i = 1:cellno if length(comps{i}) > 0 nonempty = [nonempty i]; end end % keep only nonempty connected components comps = comps(nonempty);
for i = 1:length(comps)
    ncomps(i) = sum(n(comps{i}));
    acomps(i) = sum(a(comps{i}));
    % re-indexing color to reference the new component array
    color(comps{i}) = i;
end
% all adjacencies
[row,col] = find(adjmat);
% create a sparse matrix with the same size as all adjacencies
adjcomps = sparse(length(row),length(row));
for i = 1:length(row)
    % cells do not belong to the same component
    if color(row(i)) ~= color(col(i))
        adjcomps(color(row(i)),color(col(i))) = 1;
        adjcomps(color(col(i)),color(row(i))) = 1;
    end
end

%% belongs_to.m
function [ block_number ] = belongs_to( cell_number, blocks )
% belongs_to: helper function to return the number of the block
% that the cell belongs to
for i = 1:length(blocks)
    for k = 1:length(blocks{i})
        if blocks{i}(k) == cell_number
            block_number = i;
            return;
        end
    end
end
end
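The belongs_to helper is a plain membership lookup: scan the blocks in order and return the index of the first block containing the cell. A minimal Python sketch of the same logic (our own function and variable names, 0-based indices rather than MATLAB's 1-based):

```python
def belongs_to(cell_number, blocks):
    """Return the index of the block containing cell_number, or None."""
    for block_number, block in enumerate(blocks):
        if cell_number in block:
            return block_number
    return None  # cell not found in any block

# Example: three blocks of cells
blocks = [[1, 2, 3], [4, 5], [6, 7, 8]]
result = belongs_to(5, blocks)  # -> 1 (0-based)
```
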
%% population_density.m
%% takes data from the population density .xls file and finds
%% the Bayesian blocks and connected components

% Data is already sorted on import
[x,y,z] = optinterval1(pop/10000,area,15)
% import the neighbors into a matrix called adjacencies
[num_cols, num_rows] = size(adjacencies);
adjmat = sparse(num_cols, num_cols);
for i = 1:num_cols
    for j = 1:num_rows
        if isnan(adjacencies(i,j)) == 0
            adjmat(i,adjacencies(i,j)) = 1;
        end
    end
end
blockno = length(x);
blocks = cell(blockno,1);
x = flipud(x);
for i = 1:blockno
    if i == 1
        first = 1;
    else
        first = x(i-1)+1;
    end
    last = x(i);
    blocks{i} = (first:last);
end
[comps,ncomps,acomps,adjcomps] = conncomponents2(blocks,pop/1000,area,adjmat);

%% bayesian_plot.m script
%% plots the data and colors cells in the same block the same color

% importing data from CAMCOS2ddataset
p = presentation2d;
% computing the triangles of the Delaunay triangulation
tri = delaunay(p);
[n,~] = size(p);
[m,~] = size(tri);
% computing the areas of the triangles
for i = 1:m
    [~,area(i,1)] = convhull(p(tri(i,:),:));
end
% sorting the triangles by density/area
[r,s] = sort(area);
% applying the Bayesian Blocks algorithm to the sorted data
tic
[x,y,z] = optinterval1(.5*ones(m,1),area(s),10);
toc
% use a colormap for pretty colors
map = colormap(hsv(length(x)));
blockno = length(x);
blockends = x(blockno:-1:1);
dt = DelaunayTri(p);
% drawing the Delaunay triangulation
triplot(dt)
% coloring the blocks of the optimal partition
for i = 1:blockno
    if i == 1
        first = 1;
    else
        first = blockends(i-1)+1;
    end
    last = blockends(i);
    for j = first:last
        patch(p(tri(s(j),:),1),p(tri(s(j),:),2),map(i,:));   % use color i
    end
end

7.1.4. Finding clusters using the HOP algorithm

% Computing the adjacencies of the components of the Bayesian blocks
% and the HOP partition of these components
for i = 1:835
    cc{i} = find(z(i,:));
end
[w1,x1,y1,z1] = HOPPlus2Fall14(x,y,cc);
% size(w1) = 249
for i = 1:249
    w2{i} = [];
    for j = 1:length(w1{i})
        w2{i} = union(w2{i},w{w1{i}(j)});
    end
    cellpart(w2{i}) = i;
    i        % displayed for progress (no terminating semicolon)
    w2{i}
end
for i = 1:3017
    pp(i,:) = (p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3;
end
triplot(dt)
% Assign labels to the triangles.
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellpart(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
    'bold', 'HorizontalAlignment','center', ...
    'BackgroundColor', 'none');
hold off
blockno = 6;
blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
    if i == 1
        first = 1;
    else
        first = blockends(i-1)+1;
    end
    last = blockends(i);
    for j = first:last
        patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i);   % use color i
    end
end
% computing the cells www{j} in each component j and the Bayesian
% block cellblock(k) that each cell k is in
for i = 1:835
    www{i} = s(w{i});
end
for i = 1:blockno
    if i == 1
        j = 1;
    else
        j = blockends(i-1)+1;
    end
    cellblock(s(j:blockends(i))) = i;
end
Index = [];
for i = 1:length(w1)
    for j = 1:length(w1{i})
        Index(w1{i}(j)) = i;
    end
end
% find the adjacencies of the clusters
adjclusters = sparse(length(w1),length(w1));
for i = 1:length(w1)
    for j = 1:length(w1{i})
        adjs = cc{w1{i}(j)};
        for k = 1:length(adjs)
            adjclusters(i, Index(adjs(k))) = 1;
            adjclusters(Index(adjs(k)), i) = 1;
        end
    end
end
for i = 1:length(w1)
    ss{i} = find(adjclusters(i,:));
end
[w3,x3,y3,z3] = HOPPlusFall14(x1,y1,ss);
% figure
% length(w3) = 46
for i = 1:46
    w4{i} = [];
    for j = 1:length(w3{i})
        for k = 1:length(w1{w3{i}(j)})
            w4{i} = union(w4{i},w{w1{w3{i}(j)}(k)});
        end
    end
    cellsecond(w4{i}) = i;
    i        % displayed for progress (no terminating semicolon)
    w4{i}
end
for i = 1:3017
    pp(i,:) = (p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3;
end
triplot(dt)
% Assign labels to the triangles.
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellsecond(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
    'bold', 'HorizontalAlignment','center', ...
    'BackgroundColor', 'none');
hold off
blockno = 6;
blockends = [215 779 1458 2129 2731 3017];
for i = 1:blockno
    if i == 1
        first = 1;
    else
        first = blockends(i-1)+1;
    end
    last = blockends(i);
    for j = first:last
        patch(p(tri(s(j),:),1),p(tri(s(j),:),2),i);   % use color i
    end
end
[ww,xx,yy,zz] = HOPPlus2Fall14(n,a,adjcell);
%%% HOP plot without color
dt = DelaunayTri(p);
triplot(dt)
for i = 1:length(ww)
    cellP(ww{i}) = i;
end
% Assign labels to the triangles.
for i = 1:3017
    pp(i,:) = (p(tri(i,1),:) + p(tri(i,2),:) + p(tri(i,3),:))/3;
end
numtri = size(tri,1);
plabels = arrayfun(@(n) {sprintf('%d', cellP(n))}, (1:numtri)');
hold on
Hpl = text(pp(:,1), pp(:,2), plabels, 'FontWeight', ...
    'bold', 'HorizontalAlignment','center', ...
    'BackgroundColor', 'none');
hold off

function [ X,NX,AX,cX ] = HOPPlusFall14(N,A,c)
% HOPPlus: partitions the data cells so that each cell is joined to its most
% dense neighbor; each part of the partition contains one local maximum

% sorting the cells by their densities, which is O(N log N)
[~,I] = sort(A./N);   % sorting by decreasing density
% changing to the new coordinates obtained by sorting
N = N(I); A = A(I); c = c(I);
% X{i}  = cells in the ith part of the HOP partition
% R(i)  = part of the HOP partition containing cell i
% cX{i} = neighboring cells of the ith part of the HOP partition
% NX(i) = number of data points in the ith part of the HOP partition
% AX(i) = area/volume of the ith part of the HOP partition
j = length(N);
X = cell(j); R = ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j);
k = 0;   % k = number of parts in the partition
for i = 1:j
    c{i} = I(c{i});   % changing the neighbors of cell i to the new coordinates
    m = min(c{i});    % finding the most dense neighbor of cell i
    if m < i
        % cell i is joined to its most dense neighbor in part kk
        kk = R(m);
        R(i) = kk;
        X{kk} = union(X{kk},[i]);
        AX(kk) = AX(kk)+A(i);
        NX(kk) = NX(kk)+N(i);
        cX{kk} = union(cX{kk},c{i});
    else
        % cell i is a local maximum and a new part k is started
        k = k+1;
        R(i) = k;
        X{k} = [i];
        cX{k} = c{i};
        AX(k) = A(i);
        NX(k) = N(i);
    end
end
cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k);
% Returning to the original indices before sorting
for i = 1:k
    X{i} = I(X{i});
    cX{i} = I(cX{i});
    cX{i} = setdiff(cX{i},X{i});
end

function [ X,NX,AX,cX ] = HOPPlus2Fall14(N,A,c)
% HOPPlus2: partitions the data cells so that each cell is joined to its most
% dense neighbor; each part of the partition contains one local maximum
% Inputs: N = number of data points in each cell
%         A = area/volume of each cell
%         c = cell array containing the adjacencies of each cell

% sorting the cells by their densities, which is O(N log N)
[~,I] = sort(A./N);   % sorting by decreasing density
% changing to the new coordinates obtained by sorting
N = N(I); A = A(I); c = c(I);
% X{i}  = cells in the ith part of the HOP partition
% R(i)  = part of the HOP partition containing cell i
% cX{i} = neighboring cells of the ith part of the HOP partition
% NX(i) = number of data points in the ith part of the HOP partition
% AX(i) = area/volume of the ith part of the HOP partition
j = length(N);
X = cell(j); R = ones(1,j); cX = cell(j); NX = zeros(1,j); AX = zeros(1,j);
k = 0;   % k = number of parts in the partition
% II is the inverse of the sorting permutation I
for i = 1:j
    II(I(i)) = i;
end
for i = 1:j
    c{i} = II(c{i});   % changing the neighbors of cell i to the new coordinates
end
for i = 1:j
    m = min(c{i});   % finding the most dense neighbor of cell i
    if m < i
        % cell i is joined to its most dense neighbor in part kk
        kk = R(m);
        R(i) = kk;
        X{kk} = union(X{kk},[i]);
        AX(kk) = AX(kk)+A(i);
        NX(kk) = NX(kk)+N(i);
        cX{kk} = union(cX{kk},c{i});
    else
        % cell i is a local maximum and a new part k is started
        k = k+1;
        R(i) = k;
        X{k} = [i];
        cX{k} = c{i};
        % localmax = [localmax; i];
        AX(k) = A(i);
        NX(k) = N(i);
    end
end
cX = cX(1:k); AX = AX(1:k); NX = NX(1:k); X = X(1:k);
% Returning to the original indices before sorting
for i = 1:k
    X{i} = I(X{i});
    cX{i} = I(cX{i});
    cX{i} = setdiff(cX{i},X{i});
end

7.1.5. Finding Bayesian Blocks based on the area of each triangle

matrix = presentation2d;
tri = delaunay(matrix);
[n,~] = size(matrix);
[m,~] = size(tri);
dt = DelaunayTri(matrix(:,1), matrix(:,2));
% triplot(dt);
for i = 1:m
    [~,area(i,1)] = convhull(matrix(tri(i,:),:));
end
[r,s] = sort(area);
[blockno,blockends,opt,lastchange,cellblock] = bayesianblocks2d(.5*ones(m,1),area,10)
dt = DelaunayTri(matrix);
triplot(dt)
for i = 1:blockno
    if i == 1
        first = 1;
    else
        first = blockends(i-1)+1;
    end
    last = blockends(i);
    for j = first:last
        patch(matrix(tri(s(j),:),1),matrix(tri(s(j),:),2),i);   % use color i
    end
end
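The HOP step implemented by the HOPPlusFall14 and HOPPlus2Fall14 routines in section 7.1.4 — sort the cells by decreasing density, then join each cell to the part of its densest neighbor, starting a new part at each local density maximum — can be sketched compactly in Python. All names below are our own, and the sketch returns only the partition labels, not the per-part counts, areas, and adjacencies the MATLAB routines also compute:

```python
import numpy as np

def hop_partition(n_points, areas, neighbors):
    """Label each cell with the HOP part it belongs to.

    n_points[i]  - number of data points in cell i
    areas[i]     - area/volume of cell i
    neighbors[i] - list of cells adjacent to cell i
    """
    density = np.asarray(n_points, dtype=float) / np.asarray(areas, dtype=float)
    order = np.argsort(-density)             # cell indices, densest first
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))      # rank[i] = position of cell i (0 = densest)
    part = np.empty(len(order), dtype=int)
    k = 0                                    # number of parts so far
    for r, i in enumerate(order):            # process cells in decreasing density
        m = min(rank[j] for j in neighbors[i])   # rank of the densest neighbor
        if m < r:                            # a denser neighbor exists:
            part[i] = part[order[m]]         # join its (already assigned) part
        else:                                # cell i is a local density maximum:
            part[i] = k                      # start a new part
            k += 1
    return part

# Five cells in a row with densities 1, 3, 2, 5, 4 (unit areas);
# local maxima at cells 1 and 3 give two parts
part = hop_partition([1, 3, 2, 5, 4], [1.0] * 5,
                     [[1], [0, 2], [1, 3], [2, 4], [3]])
```

Because cells are processed from densest to least dense, the densest neighbor of any non-maximal cell has already been assigned a part, which is what makes the single pass sufficient.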
7.2 PYTHON Code

The code for Chapter 2.5.2 is written in Python because of its extensibility; we hope to use the AstroML module for future research. AstroML contains a growing library of statistical and machine learning routines for analyzing astronomical data, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets.

7.2.1. How to Find the Block Faster - Folding Method

import numpy as np
from scipy import stats
import pylab as pl
import pdb
import b_blocks as bb

def BB_folding(a, cntr_index):
    # a: dataset, cntr_index: index of the center
    a = np.sort(a)
    a = np.asarray(a)
    # truncate the dataset around the center
    if cntr_index > len(a)/2:
        a = a[(2*cntr_index - len(a)) + 1:]
        cntr_index = len(a)//2
    if cntr_index < len(a)/2:
        a = a[:2*cntr_index + 1]
        cntr_index = len(a)//2
    edges_whole = np.concatenate([a[:1], 0.5 * (a[1:] + a[:-1]), a[-1:]])
    temp = a[:cntr_index + 1]                 # cut at the midpoint of the data
    temp = np.asarray(temp)
    temp_reverse = a[cntr_index:] - a[cntr_index]        # other half minus the center value
    temp_reverse = np.asarray(temp_reverse)
    temp_reverse = a[cntr_index] - temp_reverse[::-1]    # reverse
    t = (temp + temp_reverse)/2    # t: averaged (half) data of the symmetrized dataset
    t = np.sort(t)
    N = t.size - 1
    edges = np.concatenate([t[:1], 0.5 * (t[1:] + t[:-1]), t[-1:]])
    block_length = t[-1] - edges
    # arrays needed for the iteration
    nn_vec = np.ones(N)
    best = np.zeros(N, dtype=float)
    last = np.zeros(N, dtype=int)

    # Start with the first data cell; add one cell at each iteration
    for K in range(N):
        # Compute the width and count of the final bin for all possible
        # locations of the K^th changepoint
        width = block_length[:K + 1] - block_length[K + 1]
        # number of data points in all possible final blocks
        count_vec = np.cumsum(nn_vec[:K + 1][::-1])[::-1]

        # evaluate the fitness function for these possibilities
        fit_vec = count_vec * (np.log(count_vec) - np.log(width))  # objective function
        fit_vec -= 4             # 4 comes from the prior on the number of changepoints
        fit_vec[1:] += best[:K]  # additive property of the objective function

        # find the max of the fitness: this is the K^th changepoint
        i_max = np.argmax(fit_vec)
        last[K] = i_max
        best[K] = fit_vec[i_max]

    # Recover the changepoints by iteratively peeling off the last block
    change_points = np.zeros(N, dtype=int)
    i_cp = N
    ind = N
    while True:
        i_cp -= 1
        change_points[i_cp] = ind
        if ind == 0:
            break
        ind = last[ind - 1]
    change_points = change_points[i_cp:]

    # UNFOLDING begins: mirror the changepoints back onto the whole dataset
    temp1 = cntr_index - change_points
    temp1_reverse = cntr_index + temp1[::-1]
    cp_whole = np.concatenate((change_points, temp1_reverse[1:]))
    res = edges_whole[cp_whole]
    return res
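The folding step at the heart of BB_folding can be exercised in isolation. A small sketch with our own toy data: for a dataset exactly symmetric about its center, reflecting the right half onto the left and averaging reproduces the left half, so Bayesian Blocks only needs to run on an array of half the size before the changepoints are unfolded:

```python
import numpy as np

# Toy dataset: 11 sorted points, exactly symmetric about x = 5
a = np.array([0.0, 1.1, 1.9, 3.0, 4.2, 5.0, 5.8, 7.0, 8.1, 8.9, 10.0])
cntr_index = len(a) // 2                   # index of the center point

left = a[:cntr_index + 1]                  # left half, including the center
# reflect the right half onto the left, as in BB_folding above
right_reflected = a[cntr_index] - (a[cntr_index:] - a[cntr_index])[::-1]
t = (left + right_reflected) / 2           # averaged half-size dataset
```

For exactly symmetric data t equals the left half of a; for approximately symmetric data the averaging smooths the two halves together, which is the source of the speedup in section 2.5.2.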