How to do Machine Learning on Massive Astronomical Datasets
Alexander Gray, Georgia Institute of Technology
Computational Science and Engineering, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
Arkadas Ozakin: Research scientist, PhD Theoretical Physics
Dong Ryeol Lee: PhD student, CS + Math
Ryan Riegel: PhD student, CS + Math
Parikshit Ram: PhD student, CS + Math
William March: PhD student, Math + CS
James Waters: PhD student, Physics + CS
Hua Ouyang: PhD student, CS
Sooraj Bhat: PhD student, CS
Ravi Sastry: PhD student, CS
Long Tran: PhD student, CS
Michael Holmes: PhD student, CS + Physics (co-supervised)
Nikolaos Vasiloglou: PhD student, EE (co-supervised)
Wei Guan: PhD student, CS (co-supervised)
Nishant Mehta: PhD student, CS (co-supervised)
Wee Chin Wong: PhD student, ChemE (co-supervised)
Abhimanyu Aditya: MS student, CS
Yatin Kanetkar: MS student, CS
Praveen Krishnaiah: MS student, CS
Devika Karnik: MS student, CS
Prasad Jakka: MS student, CS
Exponential growth in dataset sizes [Science, Szalay & J. Gray, 2001]
CMB maps: 1990 COBE 1,000; 2000 Boomerang 10,000; 2002 CBI 50,000; 2003 WMAP 1 million; 2008 Planck 10 million
Local redshift surveys: 1986 CfA 3,500; 1996 LCRS 23,000; 2003 2dF 250,000; 2005 SDSS 800,000
Angular surveys: 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2010 LSST 2B
1993-1999: DPOSS. 1999-2008: SDSS. Coming: Pan-STARRS, LSST.
Happening everywhere!
Molecular biology (cancer): microarray chips
Particle events (LHC): particle colliders
Simulations (Millennium): microprocessors
Network traffic (spam): fiber optics
300M/day, 1B, 1M/sec
How did galaxies evolve? What was the early universe like? Does dark energy exist? Is our model (GR+inflation) right?
Astrophysicist vs. machine learning/statistics guy
R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovski, Caltech; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
How did galaxies evolve? What was the early universe like? Does dark energy exist? Is our model (GR+inflation) right?
Astrophysicist vs. machine learning/statistics guy
Methods: kernel density estimator, n-point spatial statistics, nonparametric Bayes classifier, support vector machine, nearest-neighbor statistics, Gaussian process regression, hierarchical clustering
Costs: O(N^n), O(N^2), O(N^2), O(N^2), O(N^2), O(N^3), O(c^D T(N))
R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovski, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
How did galaxies evolve? What was the early universe like? Does dark energy exist? Is our model (GR+inflation) right?
Astrophysicist vs. machine learning/statistics guy
Methods: kernel density estimator, n-point spatial statistics, nonparametric Bayes classifier, support vector machine, nearest-neighbor statistics, Gaussian process regression, hierarchical clustering
Costs: O(N^n), O(N^2), O(N^2), O(N^3), O(N^2), O(N^3), O(N^3)
R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovski, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
How did galaxies evolve? What was the early universe like? Does dark energy exist? Is our model (GR+inflation) right?
"But I have 1 million points!"
Astrophysicist vs. machine learning/statistics guy
Methods: kernel density estimator, n-point spatial statistics, nonparametric Bayes classifier, support vector machine, nearest-neighbor statistics, Gaussian process regression, hierarchical clustering
Costs: O(N^n), O(N^2), O(N^2), O(N^3), O(N^2), O(N^3), O(N^3)
R. Nichol, Inst. Cosmol. Grav.; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovski, Caltech; G. Kulkarni, Inst. Cosmol. Grav.; D. Wake, Inst. Cosmol. Grav.; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astro.; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
The challenge: state-of-the-art statistical methods (best accuracy with fewest assumptions), with orders-of-magnitude more efficiency, for large N (#data), D (#features), and M (#models).
Reduce the data? Use a simpler model? Approximation with poor/no error bounds? → Poor results.
How to do Machine Learning on Massive Astronomical Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software
How to do Machine Learning on Massive Astronomical Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software
10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space): spherical range-search O(N); orthogonal range-search O(N); k-nearest-neighbors O(N); all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types): mixture of Gaussians; kernel density estimation O(N^2) (see the note just below); L2 density tree [Ram and Gray, in prep]; manifold kernel density estimation O(N^3) [Ozakin and Gray 2008, to be submitted]; hyper-kernel density estimation O(N^4) [Sastry and Gray 2008, submitted]
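Recall where the quadratic cost above comes from: the standard kernel density estimate at a query point x is
\[
\hat{f}(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} K_h\!\left(\lVert x - x_i\rVert\right),
\]
where K_h is a kernel with bandwidth h, so evaluating it naively at every one of the N data points (as in the comparisons above) takes N^2 kernel evaluations. The tree-based algorithms discussed later reduce this while keeping the relative error under control.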
10 data analysis problems, and scalable tools we’d like for them
3. Regression (e.g. photometric redshifts): linear regression O(D^2); kernel regression O(N^2); Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation): k-nearest-neighbor classifier O(N^2); nonparametric Bayes classifier O(N^2); support vector machine (SVM) O(N^3); non-negative SVM O(N^3) [Guan and Gray, in prep]; false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]; separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]
10 data analysis problems, and scalable tools we’d like for them
5. Dimension reduction (e.g. galaxy or spectra characterization): principal component analysis O(D^2); non-negative matrix factorization; kernel PCA O(N^3); maximum variance unfolding O(N^3); co-occurrence embedding O(N^3) [Ozakin and Gray, in prep]; rank-based manifolds O(N^3) [Ouyang and Gray 2008, ICML]; isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]
6. Outlier detection (e.g. new object types, data cleaning): by density estimation; by dimension reduction; by robust Lp estimation [Ram, Riegel and Gray, in prep]
10 data analysis problems, and scalable tools we’d like for them
7. Clustering (e.g. automatic Hubble sequence): by dimension reduction; by density estimation; k-means; mean-shift segmentation O(N^2); hierarchical clustering (“friends-of-friends”) O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects): Kalman filter O(D^2); hidden Markov model O(D^2); trajectory tracking O(N^n); Markov matrix factorization [Tran, Wong, and Gray 2008, submitted]; functional independent component analysis [Mehta and Gray 2008, submitted]
10 data analysis problems, and scalable tools we’d like for them
9. Feature selection and causality (e.g. which features predict star/galaxy): LASSO regression; L1 SVM; Gaussian graphical model inference and structure search; discrete graphical model inference and structure search; 0-1 feature-selecting SVM [Guan and Gray, in prep]; L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys): minimum spanning tree O(N^3); n-point correlation O(N^n); bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]
How to do Machine Learning on Massive Astronomical Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software
Core computational problems: What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for?
Core computational problems:
Aggregations
Generalized N-body problems
Graphical model inference
Linear algebra
Optimization
Core computational problems: aggregations, GNPs, graphical models, linear algebra, optimization
Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nn
Density estimation: kernel density estimation, mixture of Gaussians
Regression: linear regression, kernel regression, Gaussian process regression
Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
Clustering: k-means, mean-shift, hierarchical clustering (“friends-of-friends”), by dimension reduction, by density estimation
Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation, bipartite matching
Aggregations
How it appears: nearest-neighbor, spherical range-search, orthogonal range-search
Common methods: locality-sensitive hashing, kd-trees, metric trees, disk-based trees
Mathematical challenges: high dimensions, provable runtime, distribution-dependent analysis, parallel indexing
Mathematical topics: computational geometry, randomized algorithms
kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]. How can we compute this efficiently?
A kd-tree: levels 1 through 6 (figure slides showing the successive splits of the data)
Range-count recursive algorithm (figure slides stepping through the recursion: nodes lying entirely inside the query radius are pruned by inclusion, nodes lying entirely outside are pruned by exclusion)
Range-count recursive algorithm: the fastest practical algorithm [Bentley 1975]. Our algorithms can use any tree.
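To make the pruning rules concrete, here is a minimal Python sketch of the range-count recursion on a kd-tree. It is an illustration of the idea, not the FASTlab implementation; the Node class, leaf size, and bounding-box helpers are choices made for the sketch.

```python
import numpy as np

class Node:
    """A kd-tree node: an axis-aligned bounding box plus (optionally) two children."""
    def __init__(self, points, leaf_size=32):
        self.points = points
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
        self.left = self.right = None
        if len(points) > leaf_size:
            d = np.argmax(self.hi - self.lo)          # split the widest dimension
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.left = Node(points[order[:mid]], leaf_size)
            self.right = Node(points[order[mid:]], leaf_size)

def min_dist(node, q):
    """Smallest possible distance from query q to any point in the node's box."""
    return np.linalg.norm(np.maximum(node.lo - q, 0) + np.maximum(q - node.hi, 0))

def max_dist(node, q):
    """Largest possible distance from query q to any point in the node's box."""
    return np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))

def range_count(node, q, r):
    """Count points within radius r of q, pruning whole subtrees when possible."""
    if min_dist(node, q) > r:      # exclusion: the box cannot contain a matching point
        return 0
    if max_dist(node, q) <= r:     # inclusion: every point in the box matches
        return len(node.points)
    if node.left is None:          # leaf: check the points directly
        return int(np.sum(np.linalg.norm(node.points - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)
```

Called as range_count(Node(X), q, r), this returns the same count as a brute-force scan of X, but large portions of the tree are typically never visited.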
Aggregations
Interesting approach: cover trees [Beygelzimer et al. 2004]. Provable runtime; consistently good performance, even in higher dimensions.
Interesting approach: learning trees [Cayton et al. 2007]. Learning data-optimal data structures; improves performance over kd-trees.
Interesting approach: MapReduce [Dean and Ghemawat 2004]. Brute-force, but makes HPC automatic for a certain problem form.
Interesting approach: approximation in rank [Ram, Ouyang and Gray]. Approximate NN in terms of distance conflicts with known theoretical results; is approximation in rank feasible?
Generalized N-body Problems
How it appears: kernel density estimation, mixture of Gaussians, kernel regression, Gaussian process regression, nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine, kernel PCA, hierarchical clustering, trajectory tracking, n-point correlation
Common methods: FFT, Fast Gauss Transform, Well-Separated Pair Decomposition
Mathematical challenges: high dimensions, query-dependent relative error guarantees, parallelism, beyond pairwise potentials
Mathematical topics: approximation theory, computational physics, computational geometry
Generalized N-body Problems
Interesting approach: the Generalized Fast Multipole Method, aka multi-tree methods [Gray and Moore 2001, NIPS; Riegel, Boyer and Gray]. The fastest practical algorithms for the problems to which it has been applied; hard query-dependent relative error bounds; automatic parallelization (THOR: Tree-based Higher-Order Reduce) [Boyer, Riegel and Gray, to be submitted].
Characterization of an entire distribution? The 2-point correlation: “How many pairs have distance < r?” Plotted as a function of r, this count gives the 2-point correlation function.
The n-point correlation functions
Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
Used heavily in biostatistics, cosmology, particle physics, statistical physics
2pcf and 3pcf definitions: see below
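The slide states the definitions as equations; in the standard form used in cosmology (following Peebles), they are defined by the joint probability of finding objects in volume elements dV at the given separations:
\[
dP_{12} = \bar{n}^{2}\left[1 + \xi(r_{12})\right] dV_{1}\, dV_{2}
\]
\[
dP_{123} = \bar{n}^{3}\left[1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12},r_{23},r_{13})\right] dV_{1}\, dV_{2}\, dV_{3}
\]
Here \bar{n} is the mean number density, \xi is the 2-point function (the excess probability of a pair over a Poisson process), and \zeta is the reduced (connected) 3-point function.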
3-point correlation: “How many triples have pairwise distances < r?” (the three triangle sides r1, r2, r3). If the standard model is right, the reduced 3-point and all higher-order terms should be zero!
How can we count n-tuples efficiently? “How many triples have pairwise distances < r?”
Use n trees! [Gray & Moore, NIPS 2000]
“How many valid triangles a-b-c (with all pairwise distances < r, one point drawn from each of the nodes A, B, C) could there be?”
In general, recurse by splitting one node: count{A, B, C} = count{A, B, C.left} + count{A, B, C.right}
Exclusion: if some pair of nodes is too far apart for any cross-pair of points to be within r, then count{A, B, C} = 0 (prune!)
Inclusion: if every cross-pair of points from the three nodes must be within r, then count{A, B, C} = |A| x |B| x |C| (prune!)
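A minimal sketch of this three-tree recursion in Python, reusing the Node class and box bounds from the kd-tree sketch above. It illustrates the exclusion and inclusion prunes only; it is not the production n-point code, and it ignores the extra bookkeeping needed when A, B, and C overlap (i.e. come from the same tree).

```python
import itertools
import numpy as np

def min_node_dist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    return np.linalg.norm(np.maximum(a.lo - b.hi, 0) + np.maximum(b.lo - a.hi, 0))

def max_node_dist(a, b):
    """Largest possible distance between a point in box a and a point in box b."""
    return np.linalg.norm(np.maximum(np.abs(a.hi - b.lo), np.abs(b.hi - a.lo)))

def triangle_count(A, B, C, r):
    """Count triples (a, b, c), one point from each node, with all pairwise distances < r."""
    pairs = [(A, B), (A, C), (B, C)]
    if any(min_node_dist(x, y) >= r for x, y in pairs):
        return 0                                                  # exclusion prune
    if all(max_node_dist(x, y) < r for x, y in pairs):
        return len(A.points) * len(B.points) * len(C.points)      # inclusion prune
    splittable = [n for n in (A, B, C) if n.left is not None]
    if not splittable:                                            # all leaves: brute force
        return sum(1 for a, b, c in itertools.product(A.points, B.points, C.points)
                   if np.linalg.norm(a - b) < r
                   and np.linalg.norm(a - c) < r
                   and np.linalg.norm(b - c) < r)
    # Otherwise split the largest splittable node and recurse on its two children.
    node = max(splittable, key=lambda n: len(n.points))
    return sum(triangle_count(*(child if n is node else n for n in (A, B, C)), r)
               for child in (node.left, node.right))
```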
3-point runtime (biggest previous: 20K): VIRGO simulation data, N = 75,000,000. Naïve: 5x10^9 sec. (~150 years); multi-tree: 55 sec. (exact). Conjectured scaling: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2).
Generalized N-body Problems
Interesting approach (for n-point): n-tree algorithms [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky]. First efficient exact algorithm for n-point correlations.
Interesting approach (for n-point): Monte Carlo n-tree [Waters, Riegel and Gray]. Orders of magnitude faster.
Interesting approach (for EMST): dual-tree Boruvka algorithm [March and Gray]. Note this is a cubic problem.
Interesting approach (N-body decision problems): dual-tree bounding with hybrid tree expansion [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]. An exact classification algorithm.
Interesting approach (Gaussian kernel): dual-tree with multipole/Hermite expansions [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI]. Ultra-accurate fast kernel summations.
Interesting approach (arbitrary kernel): automatic derivation of hierarchical series expansions [Lee and Gray]. For a large class of kernel functions.
Interesting approach (summative forms): multi-scale Monte Carlo [Holmes, Gray, Isbell 2006, NIPS; Holmes, Gray, Isbell 2007, UAI]. Very fast bandwidth learning.
Interesting approach (summative forms): Monte Carlo multipole methods [Lee and Gray 2008, NIPS]. Uses an SVD tree.
Interesting approach (for multi-body potentials in physics): higher-order multipole methods [Lee, Waters, Ozakin, and Gray, et al.]. First fast algorithm for higher-order potentials.
Interesting approach (for quantum-level simulation): 4-body treatment of Hartree-Fock [March and Gray, et al.].
Graphical model inference
How it appears: hidden Markov models, bipartite matching, Gaussian and discrete graphical models
Common methods: belief propagation, expectation propagation
Mathematical challenges: large cliques, upper and lower bounds, graphs with loops, parallelism
Mathematical topics: variational methods, statistical physics, turbo codes
Interesting method (for discrete models): survey propagation [Mezard et al. 2002]. Good results for combinatorial optimization; based on statistical-physics ideas.
Interesting method (for discrete models): expectation propagation [Minka 2001]. A variational method based on a moment-matching idea.
Interesting method (for Gaussian models): Lp structure search, then solve a linear system for inference [Tran, Lee, Holmes, and Gray].
Linear algebra
How it appears: linear regression, Gaussian process regression, PCA, kernel PCA, Kalman filter
Common methods: QR, Krylov, …
Mathematical challenges: numerical stability, sparsity preservation, …
Mathematical topics: linear algebra, randomized algorithms, Monte Carlo
Interesting method (for probably-approximate k-rank SVD): Monte Carlo k-rank SVD [Frieze, Drineas, et al. 1998-2008]. Sample either columns or rows from the squared-length distribution; for rank-k matrix approximation, k must be known in advance.
Interesting method (for probably-approximate full SVD): QUIC-SVD [Holmes, Gray, Isbell 2008, NIPS]; QUIK-SVD [Holmes and Gray]. Sample using cosine trees and stratification; builds the tree as needed; for the full SVD, automatically sets the rank based on the desired error.
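A minimal sketch of the length-squared column-sampling idea behind the Monte Carlo k-rank SVD (the cosine-tree machinery of QUIC-SVD is more involved and is not reproduced here). The sample size c, rank k, and test matrix below are choices of the sketch, not values from the slides.

```python
import numpy as np

def monte_carlo_svd(A, k, c, seed=0):
    """Approximate rank-k SVD of A by sampling c columns with probability
    proportional to their squared lengths (Frieze/Drineas-style sketching)."""
    rng = np.random.default_rng(seed)
    col_norms2 = (A ** 2).sum(axis=0)
    p = col_norms2 / col_norms2.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    # Rescale the sampled columns so that C @ C.T is an unbiased estimate of A @ A.T.
    C = A[:, idx] / np.sqrt(c * p[idx])
    # Left singular vectors of the small sketch approximate those of A.
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    U_k = U[:, :k]
    return U_k, U_k.T @ A      # rank-k approximation is U_k @ (U_k.T @ A)

# Example: a 500 x 2000 matrix of rank 20, recovered to small relative error.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 2000))
U_k, coeffs = monte_carlo_svd(A, k=20, c=300)
print(np.linalg.norm(A - U_k @ coeffs) / np.linalg.norm(A))
```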
QUIC-SVD speedup: 38 days → 1.4 hrs at 10% relative error; 40 days → 2 min at 10% relative error.
Optimization
How it appears: support vector machine, maximum variance unfolding, robust L2 estimation
Common methods: interior point, Newton’s method
Mathematical challenges: ML-specific objective functions, large numbers of variables/constraints, global optimization, parallelism
Mathematical topics: optimization theory, linear algebra, convex analysis
Interesting method: sequential minimal optimization (SMO) [Platt 1999]. Much more efficient than interior-point methods for SVM QPs.
Interesting method: stochastic quasi-Newton [Schraudolph 2007]. Does not require a scan of the entire dataset.
Interesting method: sub-gradient methods [Vishwanathan and Smola 2006]. Handle kinks in regularized risk functionals.
Interesting method: approximate inverse preconditioning using QUIC-SVD for energy minimization and interior-point methods [March, Vasiloglou, Holmes, Gray]. Could potentially treat a large number of optimization problems.
Now fast! (very fast / as fast as possible, conjectured)
Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nn
Density estimation: kernel density estimation, mixture of Gaussians
Regression: linear regression, kernel regression, Gaussian process regression
Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
Outlier detection: by robust L2 estimation
Clustering: k-means, mean-shift, hierarchical clustering (“friends-of-friends”)
Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation, bipartite matching
Astronomical applications
All-k-nearest-neighbors: O(N^2) → O(N), exact. Used in [Budavari et al., in prep]
Kernel density estimation: O(N^2) → O(N), relative error. Used in [Balogh et al. 2004]
Nonparametric Bayes classifier (KDA): O(N^2) → O(N), exact. Used in [Richards et al. 2004, 2009], [Scranton et al. 2005]
n-point correlations: O(N^n) → O(N^(log n)), exact. Used in [Wake et al. 2004], [Giannantonio et al. 2006], [Kulkarni et al. 2007]
Astronomical highlights
Dark energy evidence, Science 2003, Top Scientific Breakthrough of the Year (n-point); 2007: biggest 3-point calculation ever
Cosmic magnification verification, Nature 2005 (nonparametric Bayes classifier); 2008: largest quasar catalog ever
A few others to note… (very fast / as fast as possible, conjectured)
Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nn
Density estimation: kernel density estimation, mixture of Gaussians
Regression: linear regression, kernel regression, Gaussian process regression
Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
Outlier detection: by robust L2 estimation
Clustering: k-means, mean-shift, hierarchical clustering (“friends-of-friends”)
Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
2-sample testing and matching: n-point correlation, bipartite matching
How to do Machine Learning on Massive Astronomical Datasets?
1. Choose the appropriate statistical task and method for the scientific question
2. Use the fastest algorithm and data structure for the statistical method
3. Put it in software
Keep in mind the machine
Memory hierarchy: cache, RAM, out-of-core
Dataset bigger than one machine: parallel/distributed
Everything is becoming multicore
Cloud computing: software as a service
Keep in mind the overall system
Databases can be more useful than ASCII files (e.g. CAS)
Workflows can be more useful than brittle perl scripts
Visual analytics connects visualization/HCI with data analysis (e.g. In-SPIRE)
Our upcoming products
MLPACK: “the LAPACK of machine learning” – Dec. 2008 [FASTlab]
THOR: “the MapReduce of Generalized N-body Problems” – Apr. 2009 [Boyer, Riegel, Gray]
CAS Analytics: fast data analysis in CAS (SQL Server) – Apr. 2009 [Riegel, Aditya, Krishnaiah, Jakka, Karnik, Gray]
LogicBlox: all-in-one business intelligence [Kanetkar, Riegel, Gray]
Keep in mind the software complexity
Automatic code generation (e.g. MapReduce)
Automatic tuning (e.g. OSKI)
Automatic algorithm derivation (e.g. SPIRAL, AutoBayes) [Gray et al. 2004; Bhat, Riegel, Gray, Agarwal]
The end
We have/will have fast algorithms for most data analysis methods in MLPACK
Many opportunities for applied math and computer science in large-scale data analysis
Caveat: must treat the right problem
Computational astronomy workshop and large-scale data analysis workshop coming soon
Alexander Gray [email_address] (email is best; webpage sorely out of date)


Editor's Notes

  • #4 The immediate consequence is a similar growth for datasets. Notice that these are all different kinds of datasets, corresponding to different kinds of instruments.
  • #5 Here's one of the instruments responsible: this is the Sloan Digital Sky Survey, which I'm affiliated with.
  • #6 In biology! Walmart transactions! The trend in science: toward data; the trend in data: more of it, more measurements. Once it's about data, the focus starts to fall upon tools for analyzing data. [Sciences are becoming more data-driven; statistics can directly solve science problems.]
  • #7-#8 Here's a cartoon of a conversation which is implicitly happening in all of these areas. This is Bob Nichol, who wants to use this new data to ask astrophysicist-type questions. Here's a machine learning/statistics type, who says: ah, that's a density estimation problem; why don't you try a kernel density estimator, that's the best method available. Here are some of our best methods for different problems, and by the way, these are their computational costs, where N is the number of data. Generally the best methods require the most computation.
  • #9-#10 Then Bob says: yeah, but for scientific reasons this analysis requires at least 1 million points. And the conversation pretty much stops there.
  • #11 There are two types of challenges at the edge of statistics and machine learning. One is the usual issues of statistical theory, which have to do with the formulation and validation of models. The other challenge, which is always present but hits us right in the face with massive datasets, is the computational challenge. Of the two, this is where the least progress has been made, by far. My interests are in both categories, but this talk is going to be all about the computational problems: we can keep working on sophisticated statistics, but we already have sophisticated methods that we can't compute.
  • #12-#13 So we're going to take the hard road, and not cop out, which is to make faster algorithms: compute the same thing, but in less CPU time. The goal of this talk is to consider how we can speed up each of these 7 machine learning and statistics methods.
  • #14-#18 So these are statistics that characterize a dataset, and you can use them to infer all sorts of things about distributions. The 2-point, which appears under many names in many fields, is just the n=2 term of the n-point correlation function hierarchy, which forms the foundation for spatial statistics and is used heavily in many sciences.
  • #24 Our situation is that we have the data, and we get to pre-organize it however we want, for example build a data structure on it, so that whenever someone comes along with a query point and a radius, we can quickly return the count to them. This is the most popular data structure for this kind of task.
  • #50 Now, that was a pointwise concept: we were asking for the density, the height of the distribution, at a query point q. Now let's look at a way to get a global summarization of an entire distribution. We can ask how many pairs of points in the dataset have distance less than some radius r. If we plot this count for different values of the radius we get a curve, which we call the 2-point correlation function. This function is a characterization of the entire distribution.
  • #52 It turns out that n=2 is not enough for our purposes. Here are two very different distributions that have exactly the same 2-point function and can only be distinguished by examining their 3-point function. The 3-point is very important to us because, if the standard model of cosmology and physics is correct, when we compute the 3-point function of the observed universe we should see that the 3-point (actually the reduced 3-point) function and all higher terms are zero.
  • #55-#61 We're going to create a recursive algorithm where at every point in the recursion we're looking at 3 nodes, one from each tree (if we're doing cross-correlations they could correspond to different datasets; otherwise they may just be pointers to different nodes of the same tree). Now we ask how many matching triplets there are drawn from these 3 nodes, one point from each.
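a rough sketch of that 3-node recursion, reusing the Node boxes from the range-count sketch above (my own simplification: it counts triplets whose three pairwise distances all fall below a single radius r, whereas the real 3-point code matches configurable distance bins and handles same-dataset bookkeeping): bound the node-to-node distances, prune when any bound already excludes every triplet, accept the whole product when the bounds include every triplet, and otherwise split the largest node and recurse.

```python
# sketch: three-tree recursion for counting matching triplets (cross-correlation case,
# i.e. three distinct datasets, so the inclusion case is a simple product of counts).
import numpy as np

def box_dmin(a, b):
    # minimum distance between the two nodes' bounding boxes
    gap = np.maximum(a.lo - b.hi, 0) + np.maximum(b.lo - a.hi, 0)
    return np.linalg.norm(gap)

def box_dmax(a, b):
    # maximum distance between points drawn from the two boxes
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(span)

def count_triples(A, B, C, r):
    """Triplets (a, b, c), one point from each node, with all 3 pairwise distances < r."""
    pairs = [(A, B), (A, C), (B, C)]
    if any(box_dmin(x, y) >= r for x, y in pairs):    # exclusion: no triplet can match
        return 0
    if all(box_dmax(x, y) < r for x, y in pairs):     # inclusion: every triplet matches
        return A.count * B.count * C.count
    splittable = [n for n in (A, B, C) if n.left is not None]
    if not splittable:                                # all leaves: brute-force the base case
        total = 0
        for a in A.pts:
            for b in B.pts:
                if np.linalg.norm(a - b) >= r:
                    continue
                near_a = np.linalg.norm(C.pts - a, axis=1) < r
                near_b = np.linalg.norm(C.pts - b, axis=1) < r
                total += int((near_a & near_b).sum())
        return total
    big = max(splittable, key=lambda n: n.count)      # split the largest remaining node
    total = 0
    for half in (big.left, big.right):
        sub = [half if n is big else n for n in (A, B, C)]
        total += count_triples(sub[0], sub[1], sub[2], r)
    return total
```

called on the three tree roots (e.g. count_triples(tree_A, tree_B, tree_C, r=0.1)); in the same spirit, recursing over n nodes handles the general n-point case.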
  • #62 the biggest previous scientifically valid 3-point computation was on about 20,000 points. bob ran our code on 75 million points, which we estimate would take about 150 years with the exhaustive method. now, as computer scientists we’d love to know the runtime analytically, but we made a painful choice when we went with these adaptive data structures: they’re the fastest practical approach in general dimension, yet no mathematical expression realistically describes the runtime of a proximity search on such a structure. nonetheless, we can postulate a runtime from the recurrence suggested by the algorithm, which comes out to N raised to the log of the tuple order, and that matches what we see empirically pretty well. so that looks pretty good – in fact the only thing wrong with an 8-orders-of-magnitude speedup is that you might never be able to do it again in your whole career.
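a back-of-envelope reading of that postulated scaling (my own guess at the recurrence behind “N to the log of the tuple order”, not something spelled out in the talk):

```latex
% one recurrence that reproduces the stated scaling: splitting a node roughly
% halves the subproblem while the branching grows with the tuple order n
T(N) \approx n\,T(N/2) + O(1)
\quad\Longrightarrow\quad
T(N) = O\!\left(N^{\log_2 n}\right)
% for the 3-point (n = 3): N^{\log_2 3} \approx N^{1.59}, versus N^3 for the exhaustive count
```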
  • #79 so we’re going to take the hard road rather than cop out: make faster algorithms – compute the same thing, but in less cpu time. the goal of this talk is to consider how we can speed up each of these 7 machine learning and statistics methods.