The document describes research on scalable machine learning algorithms for massive astronomical datasets. It discusses 10 common data analysis problems in astronomy like querying, density estimation, regression, classification, etc. For each problem, it lists relevant statistical methods and their computational costs. It emphasizes choosing the right method for the scientific question and developing fast algorithms for bottleneck operations like generalized N-body problems and graphical model inference to enable machine learning on datasets with billions of data points and features.
1. How to do Machine Learning on Massive Astronomical Datasets. Alexander Gray, Georgia Institute of Technology, Computational Science and Engineering, College of Computing. FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
23. kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]. How can we compute this efficiently?
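As a rough illustration of the kd-tree construction cited on this slide, here is a minimal sketch in the spirit of [Bentley 1975]: split on the widest dimension at the median and recurse until leaves hold at most `leaf_size` points. The class and parameter names are mine, not from the talk's actual code.

```python
# Hypothetical minimal kd-tree build: median split on the widest
# dimension, recursing until small leaves remain.

class KDNode:
    def __init__(self, points, leaf_size=2):
        self.points = points          # every node keeps its point set
        self.left = self.right = None
        if len(points) > leaf_size:
            dims = range(len(points[0]))
            # split on the dimension with the largest spread
            dim = max(dims, key=lambda d: max(p[d] for p in points)
                                        - min(p[d] for p in points))
            pts = sorted(points, key=lambda p: p[dim])
            mid = len(pts) // 2
            self.left = KDNode(pts[:mid], leaf_size)
            self.right = KDNode(pts[mid:], leaf_size)
```

Each node's two children partition its points, which is what makes the count recursions on the later slides (count over a node = count over its left child plus count over its right child) exact.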
54. “How many valid triangles a-b-c (where …) could there be?” [diagram: nodes A, B, C; radius r] count{A, B, C} = ?
55. “How many valid triangles a-b-c (where …) could there be?” count{A, B, C} = count{A, B, C.left} + count{A, B, C.right}
56. “How many valid triangles a-b-c (where …) could there be?” count{A, B, C} = count{A, B, C.left} + count{A, B, C.right}
57. “How many valid triangles a-b-c (where …) could there be?” count{A, B, C} = ?
58. “How many valid triangles a-b-c (where …) could there be?” Exclusion: count{A, B, C} = 0!
59. “How many valid triangles a-b-c (where …) could there be?” count{A, B, C} = ?
60. “How many valid triangles a-b-c (where …) could there be?” Inclusion: count{A, B, C} = |A| × |B| × |C|
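The exclusion/inclusion recursion on slides 54-60 can be sketched end to end. This is my reconstruction, not the talk's code: I assume a triangle is "valid" when all three pairwise distances are at most r (the slide's exact matcher condition is elided), and the `Box` node and helper names are illustrative.

```python
import itertools, math

class Box:
    """Hypothetical tree node: a bounding box over its point list,
    split on the widest dimension so children partition the points."""
    def __init__(self, points):
        self.points = points
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.left = self.right = None
        if len(points) > 1:
            dim = max(dims, key=lambda d: self.hi[d] - self.lo[d])
            pts = sorted(points, key=lambda p: p[dim])
            mid = len(pts) // 2
            self.left, self.right = Box(pts[:mid]), Box(pts[mid:])

def min_dist(a, b):
    # smallest possible distance between any point of a and any of b
    return math.sqrt(sum(max(0.0, a.lo[d] - b.hi[d], b.lo[d] - a.hi[d]) ** 2
                         for d in range(len(a.lo))))

def max_dist(a, b):
    # largest possible distance between any point of a and any of b
    return math.sqrt(sum(max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]) ** 2
                         for d in range(len(a.lo))))

def count(A, B, C, r):
    pairs = [(A, B), (A, C), (B, C)]
    if any(min_dist(x, y) > r for x, y in pairs):
        return 0                                   # Exclusion: prune whole tuple
    if all(max_dist(x, y) <= r for x, y in pairs):
        return len(A.points) * len(B.points) * len(C.points)   # Inclusion
    big = max((A, B, C), key=lambda n: len(n.points))
    if big.left is None:   # all three are singletons: test the triple directly
        a, b, c = A.points[0], B.points[0], C.points[0]
        ok = all(math.dist(p, q) <= r for p, q in [(a, b), (a, c), (b, c)])
        return 1 if ok else 0
    # otherwise split the largest node and recurse, as on the slides
    if big is C:
        return count(A, B, C.left, r) + count(A, B, C.right, r)
    if big is B:
        return count(A, B.left, C, r) + count(A, B.right, C, r)
    return count(A.left, B, C, r) + count(A.right, B, C, r)
```

This is the cross-correlation setting (one point drawn from each of three sets); the bounds make exclusion and inclusion exact, so the result matches the exhaustive triple loop.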
61. 3-point runtime (biggest previous: 20K). VIRGO simulation data, N = 75,000,000. Naïve: 5×10^9 sec. (~150 years); multi-tree: 55 sec. (exact). n = 2: O(N); n = 3: O(N^log 3); n = 4: O(N^2)
71. QUIC-SVD speedup: 38 days → 1.4 hrs (10% rel. error); 40 days → 2 min (10% rel. error)
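QUIC-SVD itself is a cosine-tree, Monte Carlo algorithm with a relative-error stopping rule; as a rough illustration of how sampling-based methods get speedups of this kind, here is a generic randomized-SVD sketch (a stand-in of my own, not the talk's algorithm; all names and parameters are assumptions).

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Sketch-based approximate SVD: sample the column space with a
    random test matrix, then do an exact SVD of a small projection."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # random Gaussian sketch of the range of A
    Y = A @ rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for range(Y)
    B = Q.T @ A                         # small (rank+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]
```

The expensive SVD is done only on the small projected matrix B, which is where the days-to-minutes behavior on the slide comes from when the input is (numerically) low-rank.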
Editor's Notes
The immediate consequence is similar growth in dataset sizes. Notice that these are all different kinds of datasets, corresponding to different kinds of instruments.
Here's one of the instruments responsible: this is the Sloan Digital Sky Survey, which I'm affiliated with.
In bio! Walmart transactions! The trend in science is toward data; the trend in data is more of it, with more measurements. Once it's about data, the focus starts to fall upon tools for analyzing data. [Sciences are becoming more data-driven; statistics can directly solve science problems.]
Here's a cartoon of a conversation that is implicitly happening in all of these areas. This is Bob Nichol, who wants to use this new data to ask astrophysicist-type questions. Here's a machine learning/statistics type, who says: "Ah, that's a density estimation problem; why don't you try a kernel density estimator? That's the best method available. Here are some of our best methods for different problems, and, by the way, these are their computational costs, where N is the number of data points." Generally the best methods require the most computation.
Then Bob says: yeah, but for scientific reasons this analysis requires at least 1 million points. And the conversation pretty much stops there.
There are two types of challenges at the edge of statistics and machine learning. The first is the usual set of issues in statistical theory, which have to do with the formulation and validation of models. The other challenge, which hits us right in the face with massive datasets but is in fact always present, is the computational one. Of the two, this is where the least progress has been made, by far. My interests are in both categories, but this talk is going to be all about the computational problems: we can keep working on sophisticated statistics, but we already have sophisticated methods that we can't compute.
So we're going to take the hard road, not the cop-out, which is to make faster algorithms: compute the same thing, but in less CPU time. The goal of this talk is to consider how we can speed up each of these seven machine learning and statistics methods.
So these are statistics that characterize a dataset, and you can use them to infer all sorts of things about distributions. The 2-point, which appears under many names in many fields, is just the n=2 term of the n-point correlation function hierarchy, which forms the foundation of spatial statistics and is used heavily in many sciences.
Our situation is that we have the data, and we get to pre-organize it however we want, for example by building a data structure on it, so that whenever someone comes along with a query point and a radius, we can quickly return the count to them. The kd-tree is the most popular data structure for this kind of task.
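The pre-organize-then-query idea in this note can be sketched concretely: build a tree of bounding boxes once, then answer each (query point, radius) count by pruning whole subtrees with the same exclusion/inclusion logic the slides use. A hedged sketch of my own, assuming Euclidean distance; names are illustrative.

```python
import math

class BoxNode:
    """Hypothetical kd-tree-style node with a bounding box, built once
    so that ball-count queries (point q, radius r) can prune subtrees."""
    def __init__(self, points, leaf_size=4):
        self.points = points
        d = len(points[0])
        self.lo = [min(p[i] for p in points) for i in range(d)]
        self.hi = [max(p[i] for p in points) for i in range(d)]
        self.left = self.right = None
        if len(points) > leaf_size:
            dim = max(range(d), key=lambda i: self.hi[i] - self.lo[i])
            pts = sorted(points, key=lambda p: p[dim])
            mid = len(pts) // 2
            self.left = BoxNode(pts[:mid], leaf_size)
            self.right = BoxNode(pts[mid:], leaf_size)

def ball_count(node, q, r):
    d = len(q)
    # closest and farthest possible distance from q to the node's box
    dmin = math.sqrt(sum(max(node.lo[i] - q[i], q[i] - node.hi[i], 0.0) ** 2
                         for i in range(d)))
    dmax = math.sqrt(sum(max(q[i] - node.lo[i], node.hi[i] - q[i]) ** 2
                         for i in range(d)))
    if dmin > r:
        return 0                      # exclusion: box entirely outside the ball
    if dmax <= r:
        return len(node.points)       # inclusion: box entirely inside the ball
    if node.left is None:             # leaf: check the few points directly
        return sum(1 for p in node.points if math.dist(p, q) <= r)
    return ball_count(node.left, q, r) + ball_count(node.right, q, r)
```

The build cost is paid once; after that, most queries touch only the few boxes that straddle the ball's boundary.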
Now, that was a pointwise concept: we were asking for the density, the height of the distribution, at a query point q. Now let's look at a way to get a global summarization of an entire distribution. We can ask how many pairs of points in the dataset have distance less than some radius r. If we plot this count for different values of the radius, we get a curve like this, which we call the 2-point correlation function. This function is a characterization of the entire distribution.
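The quantity described in this note, pair counts as a function of radius, is easy to write down in its naive O(N^2) form, which is exactly what the tree methods in the talk are built to beat. A minimal sketch, with function and argument names of my own:

```python
import math

def two_point(points, radii):
    """Raw 2-point statistic: for each radius r, the number of
    unordered pairs of points at distance <= r. This is the N^2
    definition; the talk's tree algorithms compute it far faster."""
    n = len(points)
    dists = sorted(math.dist(points[i], points[j])
                   for i in range(n) for j in range(i + 1, n))
    counts, k = [], 0
    for r in sorted(radii):
        while k < len(dists) and dists[k] <= r:
            k += 1
        counts.append(k)     # cumulative pairs within radius r
    return counts
```

Plotting `counts` against `radii` gives the curve the note describes, the empirical 2-point correlation function of the dataset.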
It turns out that n=2 is not enough for our purposes. Here are two very different distributions that have exactly the same 2-point function and can only be distinguished by examining their 3-point function. The 3-point is very important to us because, if the standard model of cosmology and physics is correct, when we compute the 3-point function of the observed universe we should see that the 3-point (actually the reduced 3-point) function and all higher terms are zero.
We're going to create a recursive algorithm where, at every point in the recursion, we're looking at 3 nodes, one from each tree (if we're doing cross-correlations they could correspond to different datasets; otherwise they may just be pointers to different nodes of the same tree). Now we're going to ask how many matching triplets there are drawn from these 3 nodes, one point from each.
The biggest previous scientifically valid 3-point computation was on about 20,000 points. Bob ran our code on 75 million points, which we estimate would take about 150 years using the exhaustive method. Now, as computer scientists we'd love to know its runtime analytically, but we made a painful choice when we decided to go with these adaptive data structures. We went with them because they're the fastest practical approach in general dimension; however, there does not exist a mathematical expression that realistically describes the runtime of a proximity search with such a structure. Nonetheless, we can postulate a runtime based on the recurrence suggested by our algorithm, which would be N to the log of the tuple order. That turns out to match what we see empirically pretty well. So that looks pretty good; in fact, the only thing wrong with an 8-orders-of-magnitude speedup is that you might never be able to do it again in your whole career.
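The "postulated runtime based on the recurrence" can be made explicit. One recurrence consistent with the figures on slide 61 (an assumption on my part; the note states only the result, N to the log of the tuple order) is:

```latex
% Postulated: each step splits one of the n nodes in the tuple, giving
% n half-sized subproblems per full round of splits.
T(N) = n \, T\!\left(\frac{N}{2}\right) + O(1)
\;\Longrightarrow\;
T(N) = O\!\left(N^{\log_2 n}\right)
% n = 2:\ O(N), \quad
% n = 3:\ O\!\left(N^{\log_2 3}\right) \approx O\!\left(N^{1.585}\right), \quad
% n = 4:\ O\!\left(N^2\right)
```

This reproduces the slide's three cases: linear for the 2-point, roughly N^1.585 for the 3-point, and quadratic for the 4-point.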