How to do Machine Learning on Massive Astronomical Datasets Alexander Gray Georgia Institute of Technology Computational Science and Engineering College of Computing FASTlab:  Fundamental Algorithmic and Statistical Tools Laboratory
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
Exponential growth in dataset sizes
Data: CMB Maps
  1990 COBE 1,000
  2000 Boomerang 10,000
  2002 CBI 50,000
  2003 WMAP 1 Million
  2008 Planck 10 Million
Data: Local Redshift Surveys
  1986 CfA 3,500
  1996 LCRS 23,000
  2003 2dF 250,000
  2005 SDSS 800,000
Data: Angular Surveys
  1970 Lick 1M
  1990 APM 2M
  2005 SDSS 200M
  2010 LSST 2B
Instruments [Science, Szalay & J. Gray, 2001]
1993-1999: DPOSS 1999-2008: SDSS Coming: Pan-STARRS, LSST
Happening everywhere! Molecular biology (cancer) microarray chips Particle events (LHC) particle colliders microprocessors Simulations  (Millennium) Network traffic (spam) fiber optics 300M/day 1B 1M/sec
Astrophysicist. Machine learning/statistics guy. R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
Astrophysicist. Machine learning/statistics guy: O(N^n), O(N^2), O(N^2), O(N^2), O(N^2), O(N^3), O(c^D T(N))
Astrophysicist: "But I have 1 million points." Machine learning/statistics guy: O(N^n), O(N^2), O(N^2), O(N^3), O(N^2), O(N^3), O(N^3)
The challenge: D, N, M. Reduce data? Use simpler model? Approximation with poor/no error bounds? → Poor results
How to do Machine Learning on Massive Astronomical Datasets?
10 data analysis problems, and scalable tools we’d like for them
How to do Machine Learning on Massive Astronomical Datasets?
Core computational problems: aggregations, GNPs (generalized N-body problems), graphical models, linear algebra, optimization
Aggregations
kd-trees: most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]. How can we compute this efficiently?
A kd-tree: levels 1 to 6 (figure sequence, one slide per level)
Range-count recursive algorithm (animation frames, including "Pruned! (inclusion)" and "Pruned! (exclusion)" steps). Fastest practical algorithm [Bentley 1975]; our algorithms can use any tree.
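The range-count recursion with its two pruning rules can be sketched as follows. This is a minimal illustration, not the FASTlab implementation: the class name `KDNode`, the function `range_count`, the median split on the widest dimension, and the `leaf_size` parameter are all assumptions made for the example. A node whose bounding box lies entirely outside the query radius contributes 0 (exclusion), and a node whose box lies entirely inside contributes its stored count without visiting its points (inclusion).

```python
import numpy as np


class KDNode:
    """kd-tree node storing its bounding box, its point count, and its points."""

    def __init__(self, points, leaf_size=16):
        self.points = points
        self.count = len(points)
        self.lo = points.min(axis=0)  # lower corner of the bounding box
        self.hi = points.max(axis=0)  # upper corner of the bounding box
        self.left = self.right = None
        if self.count > leaf_size:
            dim = int(np.argmax(self.hi - self.lo))  # split the widest dimension
            order = np.argsort(points[:, dim])
            mid = self.count // 2  # median split
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)


def range_count(node, query, radius):
    """Count points under `node` whose distance to `query` is < radius."""
    # Smallest and largest possible distance from the query to the node's box.
    nearest = np.clip(query, node.lo, node.hi)
    farthest = np.where(
        np.abs(query - node.lo) > np.abs(query - node.hi), node.lo, node.hi
    )
    d_min = np.linalg.norm(query - nearest)
    d_max = np.linalg.norm(query - farthest)
    if d_min >= radius:
        return 0  # exclusion: no point in this node can be in range
    if d_max < radius:
        return node.count  # inclusion: every point in this node is in range
    if node.left is None:
        # Leaf that straddles the boundary: check its points directly.
        dists = np.linalg.norm(node.points - query, axis=1)
        return int(np.sum(dists < radius))
    return range_count(node.left, query, radius) + range_count(
        node.right, query, radius
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((2000, 2))
    tree = KDNode(data)
    print(range_count(tree, np.array([0.5, 0.5]), 0.1))
```

Only leaves that straddle the query boundary fall back to point-by-point checks, which is where the savings over a linear scan come from.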
Aggregations
Generalized N-body Problems
Characterization of an entire distribution? 2-point correlation function: “How many pairs have distance < r?”
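As a baseline, the pair-counting question can be answered by direct enumeration, which costs on the order of N^2 distance computations. The sketch below is only that naive baseline; the function name `pair_count_naive` and the choice to count unordered pairs with a strict `< r` test are assumptions made for illustration.

```python
import numpy as np


def pair_count_naive(points, r):
    """Count unordered pairs (i < j) whose Euclidean distance is < r."""
    n = len(points)
    total = 0
    for i in range(n - 1):
        # Distances from point i to every later point, so each pair is seen once.
        d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        total += int(np.sum(d < r))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.random((1000, 3))
    print(pair_count_naive(sample, 0.1))
```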
The n-point correlation functions (the 2pcf and 3pcf definitions appear as equations on the slide)
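The equation images for the two definitions did not survive in this text. For reference, the standard textbook forms are sketched below; that the slide used exactly this notation is an assumption. Here \bar{n} is the mean number density, dV_i are infinitesimal volume elements, dP is the joint probability of finding one object in each element, \xi is the 2-point correlation function, and \zeta is the reduced 3-point correlation function.

```latex
% 2pcf: excess probability of a pair at separation r_{12}
dP_{12} = \bar{n}^{2}\,\bigl[\,1 + \xi(r_{12})\,\bigr]\,dV_1\,dV_2

% 3pcf: excess probability of a triple beyond what the pairwise terms explain
dP_{123} = \bar{n}^{3}\,\bigl[\,1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13})
           + \zeta(r_{12}, r_{23}, r_{13})\,\bigr]\,dV_1\,dV_2\,dV_3
```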
3-point correlation: “How many triples have pairwise distances < r?” (triangle with sides r1, r2, r3). Standard model: n>0 terms should be zero!
How can we count n-tuples efficiently? “How many triples have pairwise distances < r?”
Use n trees! [Gray & Moore, NIPS 2000]
“How many valid triangles a-b-c (where all pairwise distances are < r) could there be?” for tree nodes A, B, C:
count{A, B, C} = ?
Recursion: count{A, B, C} = count{A, B, C.left} + count{A, B, C.right}
Exclusion: count{A, B, C} = 0!
Inclusion: count{A, B, C} = |A| x |B| x |C|
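The three rules above (recursion, exclusion, inclusion) can be put together in a small multi-tree sketch for n = 3. It is written under stated assumptions and is not the implementation from [Gray & Moore, NIPS 2000]: the names `TreeNode`, `box_distance_bounds`, `brute_triples`, and `count_triples` are hypothetical, the tree is a simple median-split kd-tree, and the function counts ordered triples (a, b, c) in which a point may appear more than once (a point is at distance 0 from itself). Converting to distinct unordered triangles would need an extra correction step that is not shown.

```python
import numpy as np


class TreeNode:
    """kd-tree node with a bounding box, a point count, and optional children."""

    def __init__(self, points, leaf_size=8):
        self.points = points
        self.count = len(points)
        self.lo = points.min(axis=0)
        self.hi = points.max(axis=0)
        self.left = self.right = None
        if self.count > leaf_size:
            dim = int(np.argmax(self.hi - self.lo))  # split the widest dimension
            order = np.argsort(points[:, dim])
            mid = self.count // 2
            self.left = TreeNode(points[order[:mid]], leaf_size)
            self.right = TreeNode(points[order[mid:]], leaf_size)


def box_distance_bounds(a, b):
    """Smallest and largest possible distance between a point in a and one in b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)


def brute_triples(pa, pb, pc, r):
    """Directly count ordered triples with all three pairwise distances < r."""
    ab = (np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2) < r).astype(np.int64)
    bc = (np.linalg.norm(pb[:, None, :] - pc[None, :, :], axis=2) < r).astype(np.int64)
    ac = (np.linalg.norm(pa[:, None, :] - pc[None, :, :], axis=2) < r).astype(np.int64)
    # Sum over i, j, k of ab[i, j] * bc[j, k] * ac[i, k].
    return int(np.einsum("ij,jk,ik->", ab, bc, ac))


def count_triples(A, B, C, r):
    """Count ordered triples (a in A, b in B, c in C), all pairwise distances < r."""
    bounds = [box_distance_bounds(x, y) for x, y in ((A, B), (B, C), (A, C))]
    if any(d_min >= r for d_min, _ in bounds):
        return 0  # exclusion: some pair of nodes is too far apart
    if all(d_max < r for _, d_max in bounds):
        return A.count * B.count * C.count  # inclusion: every triple is valid
    nodes = [A, B, C]
    splittable = [i for i, node in enumerate(nodes) if node.left is not None]
    if not splittable:
        return brute_triples(A.points, B.points, C.points, r)  # all leaves
    i = max(splittable, key=lambda k: nodes[k].count)  # split the largest node
    total = 0
    for child in (nodes[i].left, nodes[i].right):
        trial = list(nodes)
        trial[i] = child
        total += count_triples(trial[0], trial[1], trial[2], r)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    root = TreeNode(rng.random((300, 2)))
    print(count_triples(root, root, root, 0.1))
```

Splitting the largest node first is one simple heuristic; the slide shows the split on C only, and other orderings give the same count.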
3-point runtime (biggest previous: 20K). VIRGO simulation data, N = 75,000,000. Naïve: 5x10^9 sec. (~150 years); multi-tree: 55 sec. (exact). n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)
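The quoted figures can be checked with a few lines of arithmetic. This snippet uses only the two runtimes stated on the slide; the use of a 365.25-day year for the conversion is an assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, about 3.16e7 seconds

naive_seconds = 5e9        # naive runtime quoted on the slide
multitree_seconds = 55.0   # multi-tree runtime quoted on the slide

naive_years = naive_seconds / SECONDS_PER_YEAR
speedup = naive_seconds / multitree_seconds

print(f"naive runtime: {naive_years:.0f} years")   # prints "naive runtime: 158 years"
print(f"speedup: {speedup:.1e}x")                  # prints "speedup: 9.1e+07x"
```

So the "~150 years" on the slide is a rounded figure, and the multi-tree run is roughly 90 million times faster on this dataset.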
Generalized N-body Problems
Graphical model inference
Linear algebra
QUIC-SVD speedup: 38 days → 1.4 hrs, 10% rel. error; 40 days → 2 min, 10% rel. error
Optimization
Now fast! very fast; as fast as possible (conjecture)
Astronomical applications
Astronomical highlights
A few others to note… very fast; as fast as possible (conjecture)
How to do Machine Learning on Massive Astronomical Datasets?
Keep in mind the machine
Keep in mind the overall system
Our upcoming products
Keep in mind the software complexity
The end

Machine Learning Massive Astronomical Data

Editor's Notes

  1. The immediate consequence is a similar growth in dataset sizes. Notice that these are all different kinds of datasets, corresponding to different kinds of instruments.
  2. Here’s one of the instruments responsible: this is the Sloan Digital Sky Survey, which I’m affiliated with.
  3. In bio! Walmart transactions! Trend in science: toward data. Trend in data: more of it, more measurements. Once it’s about data, the focus starts to fall upon tools for analyzing data. [Sciences are becoming more data-driven; statistics can directly solve science problems.]
  4-5. Here’s a cartoon of a conversation which is implicitly happening in all of these areas. This is Bob Nichol, who wants to use this new data to ask astrophysicist-type questions. Here’s a machine learning/statistics type, who says, “Ah, that’s a density estimation problem. Why don’t you try a kernel density estimator? That’s the best method available. Here are some of our best methods for different problems. Ah, by the way, these are their computational costs, where N is the number of data points.” Generally the best methods require the most computation.
  6-7. Then Bob says, “Yeah, but for scientific reasons this analysis requires at least 1 million points,” and the conversation pretty much stops there.
  8. There are 2 types of challenges at the edge of statistics and machine learning. The first is the usual issues of statistical theory, which have to do with formulation and validation of models. The other is the computational challenge, which is actually always present but hits us right in the face with massive datasets. Of the 2 types, this is where the least progress has been made, by far. My interests are in both categories, but this talk is going to be all about the computational problems. We can keep working on sophisticated statistics, but we already have sophisticated methods that we can’t compute.
  9-10. So we’re going to take the hard road, and not cop out, which means making faster algorithms: compute the same thing, but in less CPU time. The goal of this talk is to consider how we can speed up each of these 7 machine learning and statistics methods.
  11. so these are statistics that characterize a dataset, and you can use them to infer all sorts of things about distributions. the 2-point, which appears under many names in many fields, is just the n=2 term of the n-point correlation function hierarchy, which forms the foundation for spatial statistics, and is used heavily in many sciences.
  21. our situation is that we have the data, and we get to pre-organize it however we want, for example build a data structure on it, so that whenever someone comes along with a query point and a radius, we can quickly return the count to them. this is the most popular data structure for this kind of task.
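to make the query concrete, here is a minimal Python sketch of counting the points within a radius of a query point. it assumes the data structure is a kd-tree with bounding boxes; the note does not name the structure, and the names `Node`, `range_count`, `min_sq` and `max_sq` are made up for illustration, not taken from the talk's code.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

class Node:
    """kd-tree node: bounding box, point count, and children or leaf points."""
    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.count = len(points)
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points              # leaf: keep the raw points
        else:
            axis = depth % dim                # cycle through the dimensions
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2               # split at the median
            self.left = Node(pts[:mid], leaf_size, depth + 1)
            self.right = Node(pts[mid:], leaf_size, depth + 1)

def min_sq(node, q):
    """Smallest possible squared distance from q to any point in the box."""
    return sum(max(lo - x, x - hi, 0.0) ** 2
               for lo, hi, x in zip(node.lo, node.hi, q))

def max_sq(node, q):
    """Largest possible squared distance from q to any point in the box."""
    return sum(max(x - lo, hi - x) ** 2
               for lo, hi, x in zip(node.lo, node.hi, q))

def range_count(node, q, r):
    """Count points within distance r of q, pruning whole boxes when possible."""
    r2 = r * r
    if min_sq(node, q) > r2:
        return 0                              # box entirely outside the radius
    if max_sq(node, q) <= r2:
        return node.count                     # box entirely inside the radius
    if node.points is not None:
        return sum(1 for p in node.points if sq_dist(p, q) <= r2)
    return range_count(node.left, q, r) + range_count(node.right, q, r)

random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]
tree = Node(data)                             # pre-organize the data once
print(range_count(tree, (0.5, 0.5), 0.1))     # then answer queries quickly
```

the two pruning tests are where the savings come from: a box that lies entirely outside or entirely inside the radius is handled without looking at any of its points.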
  22. now here’s
  26. now that was a pointwise concept – we were asking for the density, or the height of the distribution, at a query point q. now let’s look at a way to get a global summarization of an entire distribution. we can ask how many pairs of points in the dataset have distance less than some radius r. if we plot this count for different values of the radius we get a curve like this, which we call the 2-point correlation function. this function is a characterization of the entire distribution.
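as a small illustration of the pair count described here, the sketch below does it the exhaustive way, checking every pair once. the function name `pair_counts` is made up for this example, and it returns raw counts only; estimators used in practice also normalize against a random catalog, which is not shown.

```python
import math

def pair_counts(points, radii):
    """For each radius r, count unordered pairs with distance < r (O(N^2))."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):   # each unordered pair once
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d < r:
                    counts[k] += 1
    return counts

# four corners of a unit square: 4 sides of length 1, 2 diagonals of length sqrt(2)
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(pair_counts(square, [0.5, 1.1, 1.5]))   # → [0, 4, 6]
```

plotting these counts against the radius gives the kind of curve the note describes. the double loop over points is the O(N 2 ) cost that the tree-based methods are meant to avoid.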
  28. it turns out that n=2 is not enough for our purposes. here are two very different distributions that have exactly the same 2-point function and can only be distinguished by examining their 3-point function. the 3-point is very important to us because it turns out that if the standard model of cosmology and physics is correct, when we compute the 3-point function of the observed universe we should see that the 3-point (actually the reduced 3-point) function and all higher terms are zero.
  30. we’re going to create a recursive algorithm where at every point in the recursion we’re looking at 3 nodes, one from each tree (if we’re doing cross-correlations they could correspond to different datasets, otherwise they may just be pointers to different nodes of the same tree). now we’re going to ask how many matching triplets can be drawn from these 3 nodes, taking one point from each.
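the recursion over 3 nodes can be sketched in Python as follows. this is a sketch under stated assumptions, not the code from the talk: it assumes kd-trees with bounding boxes, treats the cross-correlation case with 3 separate datasets, and takes "matching" to mean that all 3 pairwise distances are below a single radius r. the names `Node` and `count_triplets` are hypothetical.

```python
import random

class Node:
    """kd-tree node with a bounding box; every node keeps its points for brevity."""
    def __init__(self, points, leaf_size=4, depth=0):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points = points
        self.count = len(points)
        self.children = ()
        if len(points) > leaf_size:
            axis = depth % dim
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.children = (Node(pts[:mid], leaf_size, depth + 1),
                             Node(pts[mid:], leaf_size, depth + 1))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def min_sq(a, b):
    """Smallest possible squared distance between a point of a and a point of b."""
    return sum(max(al - bh, bl - ah, 0.0) ** 2
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))

def max_sq(a, b):
    """Largest possible squared distance between a point of a and a point of b."""
    return sum(max(ah - bl, bh - al) ** 2
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))

def count_triplets(a, b, c, r):
    """Count triplets (one point per node) with all pairwise distances < r."""
    r2 = r * r
    pairs = ((a, b), (a, c), (b, c))
    if any(min_sq(x, y) >= r2 for x, y in pairs):
        return 0                              # some pair of nodes is too far apart
    if all(max_sq(x, y) < r2 for x, y in pairs):
        return a.count * b.count * c.count    # every triplet matches
    nodes = [a, b, c]
    splittable = [i for i in range(3) if nodes[i].children]
    if not splittable:                        # three leaves: check exhaustively
        return sum(1 for p in a.points for q in b.points for s in c.points
                   if sq_dist(p, q) < r2 and sq_dist(p, s) < r2
                   and sq_dist(q, s) < r2)
    i = max(splittable, key=lambda k: nodes[k].count)  # split the biggest node
    total = 0
    for child in nodes[i].children:
        trio = list(nodes)
        trio[i] = child
        total += count_triplets(trio[0], trio[1], trio[2], r)
    return total

random.seed(0)
sets = [[(random.random(), random.random()) for _ in range(200)] for _ in range(3)]
trees = [Node(s) for s in sets]
print(count_triplets(trees[0], trees[1], trees[2], 0.1))
```

when all 3 nodes point into the same tree, the counts would also need to be corrected for repeated points and permutations of the same triplet, which this sketch leaves out.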
  37. the biggest previous scientifically valid 3-point computation was on about 20,000 points. bob ran our code on 75 million points, which we estimate would take about 150 years using the exhaustive method. now, as computer scientists, we’d love to know its runtime analytically, but we made a painful choice when we decided to go with these adaptive data structures. we went with them because they’re the fastest practical approach in general dimension. however, there does not exist a mathematical expression which realistically describes the runtime of a proximity search with such a structure. nonetheless, we can postulate a runtime based on the recurrence suggested by our algorithm, which would be N to the log of the tuple order. that turns out to match what we see empirically pretty well. so that looks pretty good. in fact, the only thing wrong with an 8 orders of magnitude speedup is that you might not ever be able to do it again in your whole career.