Large-Scale Inference in Time Domain Astrophysics

Presented at the 2014 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2014), June 19, 2014 (Berkeley, CA):

The scientific promise of modern astrophysical surveys - from exoplanets to gravity waves - is palpable. Yet extracting insight from the data deluge is neither guaranteed nor trivial: existing paradigms for analysis are already beginning to break down under the data velocity. I will describe our efforts to apply statistical machine learning to large-scale astronomy datasets in both batch and streaming mode. From the discovery of supernovae to the characterization of tens of thousands of variable stars, such approaches are leading the way to novel inference. Specific discoveries concerning precision distance measurements and using LSST as a pseudo-spectrograph will be discussed.

Transcript

  • 1. “I love working with astronomers, since their data is worthless.” - Jim Gray, Microsoft
  • 2. Large-Scale Inference in Time-Domain Astrophysics. Joshua Bloom, UC Berkeley, Astronomy, @profjsb. MMDS 2014 - Berkeley - 19 June 2014
  • 3. c. 1890, Harvard College Observatory
  • 4. c. 1890, Harvard College Observatory. Binary stars deeply interesting: reveal the fundamental properties of stars
  • 5. Astronomical Data Deluge
      Large Synoptic Survey Telescope (LSST) - 2020+
        Light curves for 800M sources every 3 days
        10^6 supernovae/yr, 10^5 eclipsing binaries
        3.2 gigapixel camera, 20 TB/night
      Gaia space astrometry mission - 2015+
        1 billion stars observed ∼70 times over 5 years
        Will observe 20K supernovae
      Many other astronomical surveys are already producing data
  • 6. Data challenge. Goal: all-sky radio map of “epoch of reionization” (2006 simulation). Prof. A. Parsons (PI; UC Berkeley)
      Array      Antenna Data Rate  Max Internal Data Rate  Correlated Data Volume  Compressed Data Volume
      PAPER-128  0.4 Tbps           210 Tbps                200 TB                  30 TB
      HERA-331   2.6 Tbps           3500 Tbps               2800 TB                 140 TB
      HERA-576   4.6 Tbps           10,000 Tbps             8400 TB                 420 TB
      Hybrid Compute Solution: custom/FPGA, GPU cluster
  • 7. Data Inference Space
      Bayesian / Frequentist; Theory/Hypothesis Driven / Data Driven; non-parametric / parametric
      Hardware: laptops → clusters/supercomputers
      Software: Python/SciPy, R, ...
      Carbonware: (astro) grad students, postdocs
  • 8. Variable Source Taxonomy: A Mess
  • 9. Considerable Complications with Time-Series Data: noisy, irregularly sampled
  • 10. Considerable Complications with Time-Series Data: noisy, irregularly sampled; spurious data
  • 11. Considerable Complications with Time-Series Data: noisy, irregularly sampled; spurious data; telltale signature event may not have happened yet
  • 12. Machine-Learning Approach to Classification
      Engineered features homogenize data: each source → p features that describe its time-domain characteristics & context
        variability metrics: e.g. Stetson indices, χ²/dof (constant hypothesis)
        periodic metrics: e.g. dominant frequencies in Lomb-Scargle, phase offsets between periods
        shape analysis: e.g. skewness, kurtosis, Gaussianity
        context metrics: e.g. distance to nearest galaxy, type of nearest galaxy, location in the ecliptic plane
      p ≈ 100 features computed in < 1 sec per 8-core machine (including periodogram analysis)
      Woźniak et al. 2004; Protopapas+06; Willemsen & Eyer 2007; Debosscher et al. 2007; Mahabal et al. 2008; Sarro et al. 2009; Blomme et al. 2010; Kim+11; Richards+11
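A few of the feature families named on slide 12 can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the synthetic light curve, and the frequency grid are hypothetical, and the periodogram is the textbook Lomb-Scargle formula rather than whatever optimized implementation the talk's pipeline used.

```python
import math
import random

def chi2_per_dof(mags, errs):
    """Reduced chi-square of the magnitudes against a constant (weighted-mean) model."""
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    return chi2 / (len(mags) - 1)

def skewness_and_kurtosis(mags):
    """Sample skewness and excess kurtosis computed from central moments."""
    n = len(mags)
    mean = sum(mags) / n
    m2 = sum((m - mean) ** 2 for m in mags) / n
    m3 = sum((m - mean) ** 3 for m in mags) / n
    m4 = sum((m - mean) ** 4 for m in mags) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def lomb_scargle_power(times, mags, freq):
    """Classical (unnormalized) Lomb-Scargle power at one frequency (cycles per time unit)."""
    omega = 2.0 * math.pi * freq
    mean = sum(mags) / len(mags)
    # Time offset tau makes the sine and cosine terms orthogonal for irregular sampling.
    tau = math.atan2(sum(math.sin(2 * omega * t) for t in times),
                     sum(math.cos(2 * omega * t) for t in times)) / (2 * omega)
    yc = ys = cc = ss = 0.0
    for t, m in zip(times, mags):
        c = math.cos(omega * (t - tau))
        s = math.sin(omega * (t - tau))
        yc += (m - mean) * c
        ys += (m - mean) * s
        cc += c * c
        ss += s * s
    return 0.5 * (yc ** 2 / cc + ys ** 2 / ss)

def dominant_frequency(times, mags, freqs):
    """Trial frequency with the highest Lomb-Scargle power."""
    return max(freqs, key=lambda f: lomb_scargle_power(times, mags, f))

# Hypothetical noisy, irregularly sampled light curve with a 0.37 cycles/day signal.
rng = random.Random(0)
times = sorted(rng.uniform(0.0, 100.0) for _ in range(200))
errs = [0.05] * len(times)
mags = [15.0 + 0.5 * math.sin(2 * math.pi * 0.37 * t) + rng.gauss(0.0, 0.05) for t in times]
freqs = [0.01 + 0.001 * k for k in range(990)]

skew, kurt = skewness_and_kurtosis(mags)
features = {
    "chi2_per_dof": chi2_per_dof(mags, errs),
    "skewness": skew,
    "excess_kurtosis": kurt,
    "dominant_frequency": dominant_frequency(times, mags, freqs),
}
print(features)
```

Each source is reduced to a fixed-length dictionary of numbers regardless of how many epochs it has, which is the sense in which features "homogenize" irregular light curves.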
  • 13. Structured Learning (Richards+11). Structured Classification: let class taxonomy guide classifier. 5% gross mis-classification rate!
      HSC: Hierarchical single-label classification. Fit separate classifier at each non-terminal node.
      HMC: Hierarchical multi-label classification. Fit one classifier, where $L(y, f(x)) \propto w_0^{\mathrm{depth}}$.
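The HSC idea on slide 13 (one classifier per non-terminal node, prediction by walking down the taxonomy) can be sketched as follows. Everything here is an assumption made for illustration: the three-node taxonomy, the two toy features, and the nearest-centroid classifier, which stands in for whatever per-node classifier the cited work actually fits.

```python
import math

# Hypothetical toy taxonomy: each non-terminal node maps to its children.
TAXONOMY = {
    "variable": ["pulsating", "eclipsing"],
    "pulsating": ["RR Lyrae", "Mira"],
    "eclipsing": ["Beta Persei", "W Ursae Maj"],
}

def leaves_under(node):
    """All terminal classes below a node (a leaf is its own only descendant)."""
    children = TAXONOMY.get(node)
    if children is None:
        return {node}
    found = set()
    for child in children:
        found |= leaves_under(child)
    return found

def fit_centroids(features, labels):
    """Nearest-centroid 'classifier': the mean feature vector of each label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def fit_hsc(features, leaf_labels):
    """Fit one classifier per non-terminal node, trained to pick among its children."""
    models = {}
    for node, children in TAXONOMY.items():
        child_of = {leaf: child for child in children for leaf in leaves_under(child)}
        rows = [(x, child_of[y]) for x, y in zip(features, leaf_labels) if y in child_of]
        models[node] = fit_centroids([x for x, _ in rows], [c for _, c in rows])
    return models

def predict_hsc(models, x, root="variable"):
    """Descend from the root, choosing the closest child centroid at every node."""
    node = root
    while node in models:
        centroids = models[node]
        node = min(centroids, key=lambda child: math.dist(x, centroids[child]))
    return node

# Hypothetical training set: (feature_0, feature_1) per source.
train_x = [(1.0, -0.3), (1.1, -0.2), (0.9, 2.5), (1.0, 2.4),
           (-1.0, 0.5), (-0.9, 0.6), (-1.1, -0.4), (-1.0, -0.5)]
train_y = ["RR Lyrae", "RR Lyrae", "Mira", "Mira",
           "Beta Persei", "Beta Persei", "W Ursae Maj", "W Ursae Maj"]

models = fit_hsc(train_x, train_y)
print(predict_hsc(models, (1.05, -0.25)))
print(predict_hsc(models, (-1.0, -0.45)))
```

A mistake made near the root cannot be undone lower down, which is one reason a single classifier with a depth-weighted loss (the HMC variant on the slide) is an alternative worth comparing.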
  • 14. Active Learning. Results: All-Sky Automated Survey Classifications. 28-class variable star classification problem with 50,000 stars. Off-the-shelf RF error rate = 34.5%; RF w/ Active Learning error rate = 20.5%. 3-fold increase in classifier confidence. Note: no other method yielded improvement in classification. [Plots: percent agreement with ACVS (0.66 to 0.80) and percent of confident ASAS RF labels (0.15 to 0.40), each vs. AL iteration (0 to 8)]
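One iteration of an active-learning loop like the one summarized on slide 14 needs a rule for choosing which unlabeled sources to send for manual labeling. The sketch below uses margin-based uncertainty sampling, a common generic heuristic; it is an assumption for illustration and not necessarily the query rule behind the numbers on the slide. The source IDs, probabilities, and the 0.5 confidence cut are hypothetical.

```python
def margin(probs):
    """Gap between the two largest class probabilities; a small gap means an uncertain source."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_queries(class_probs, k):
    """IDs of the k sources whose predicted class is least certain."""
    ranked = sorted(class_probs, key=lambda source_id: margin(class_probs[source_id]))
    return ranked[:k]

def confident_fraction(class_probs, threshold=0.5):
    """Fraction of sources whose top class probability exceeds the threshold."""
    confident = sum(1 for probs in class_probs.values() if max(probs) > threshold)
    return confident / len(class_probs)

# Hypothetical random-forest class probabilities for four unlabeled sources.
class_probs = {
    "src_a": [0.90, 0.05, 0.05],
    "src_b": [0.45, 0.35, 0.20],
    "src_c": [0.55, 0.40, 0.05],
    "src_d": [0.34, 0.33, 0.33],
}

to_label = select_queries(class_probs, k=2)
print(to_label, confident_fraction(class_probs))
```

After the selected sources are labeled and added to the training set, the classifier is refit and the confident fraction is recomputed, which is the quantity tracked per iteration in the slide's second plot.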
  • 15. Machine-Learned Classification. 25-class variable star. Data: 50k from ASAS, 810 with known labels (timeseries, colors). P_RRL = 0.94. Richards+12
  • 16. Machine-Learned Classification. 25-class variable star. Data: 50k from ASAS, 810 with known labels (timeseries, colors). P_RRL = 0.94. Richards+12. 74 dimensional feature set for learning; featurization is the bottleneck (but embarrassingly parallel).
      [Figure from Richards et al., Astrophysical Journal Supplement Series, 203:32 (27pp), 2012 December: confusion matrix, Predicted Class vs. True Class. Classes: a. Mira; b1. Semireg PV; b2. SARG A; b3. SARG B; b4. LSP; c. RV Tauri; d. Classical Cepheid; e. Pop. II Cepheid; f. Multi. Mode Cepheid; g. RR Lyrae, FM; h. RR Lyrae, FO; i. RR Lyrae, DM; j. Delta Scuti; j1. SX Phe; l. Beta Cephei; o. Pulsating Be; p. RSG; q. Chem. Peculiar; r1. RCB; s1. Class. T Tauri; s2. Weak-line T Tauri; s3. RS CVn; t. Herbig AE/BE; u. S Doradus; v. Ellipsoidal; w. Beta Persei; x. Beta Lyrae; y. W Ursae Maj.]
      Figure 5. Cross-validated confusion matrix for all 810 ASAS training sources. Columns are normalized to sum to unity, with the total number of true objects of each [class] listed along the bottom axis. The overall correspondence rate for these sources is 80.25%, with at least 70% correspondence for half of the classes. Classes with [low] correspondence are those with fewer than 10 training sources or classes which are easily confused. Red giant classes tend to be confused with other red giant classes and eclipsing classes with other eclipsing classes. There is substantial power in the top-right quadrant, where rotational and eruptive classes are misclassified …
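The layout the Figure 5 caption describes (rows are predicted classes, columns are true classes, each column normalized to sum to unity) can be reproduced with a few lines of Python. The three-class label lists below are hypothetical and only illustrate the bookkeeping.

```python
def confusion_counts(true_labels, pred_labels, classes):
    """Count matrix with rows = predicted class and columns = true class."""
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for true, pred in zip(true_labels, pred_labels):
        counts[index[pred]][index[true]] += 1
    return counts

def normalize_columns(counts):
    """Divide each column by the number of true objects of that class."""
    n = len(counts)
    totals = [sum(counts[row][col] for row in range(n)) for col in range(n)]
    normalized = [[counts[row][col] / totals[col] if totals[col] else 0.0
                   for col in range(n)] for row in range(n)]
    return normalized, totals

def correspondence_rate(counts):
    """Overall fraction of sources whose predicted class matches the true class."""
    total = sum(sum(row) for row in counts)
    return sum(counts[i][i] for i in range(len(counts))) / total

# Hypothetical cross-validated predictions for six training sources.
classes = ["Mira", "RR Lyrae", "Delta Scuti"]
true_labels = ["Mira", "Mira", "RR Lyrae", "RR Lyrae", "RR Lyrae", "Delta Scuti"]
pred_labels = ["Mira", "Mira", "RR Lyrae", "RR Lyrae", "Delta Scuti", "Delta Scuti"]

counts = confusion_counts(true_labels, pred_labels, classes)
normalized, totals = normalize_columns(counts)
print(normalized, totals, correspondence_rate(counts))
```

With column normalization, each diagonal entry is the per-class efficiency, so classes with few training sources show up as noisy columns rather than dominating the picture.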
  • 17. Machine-learned Variable Star catalog: http://bigmacc.info
  • 18. Doing Science with Probabilistic Catalogs
      Demographics (with little followup): trading high purity at the cost of lower efficiency; e.g., using RRL to find new Galactic structure
      Novelty Discovery (with lots of followup): trading high efficiency for lower purity; e.g., discovering new instances of rare classes
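The purity versus efficiency trade-off on slide 18 comes down to where the probability cut is placed in a probabilistic catalog. A minimal sketch, assuming a hypothetical list of per-source class probabilities and known truth flags:

```python
def purity_and_efficiency(probs, is_member, threshold):
    """Purity = TP / selected; efficiency = TP / all true members, for a cut probs > threshold."""
    selected = [member for p, member in zip(probs, is_member) if p > threshold]
    true_pos = sum(selected)
    purity = true_pos / len(selected) if selected else 1.0
    efficiency = true_pos / sum(is_member)
    return purity, efficiency

# Hypothetical catalog: probability of being an RR Lyrae star, and whether it truly is one.
probs = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
is_rrl = [True, True, True, False, True, False, True, False]

strict = purity_and_efficiency(probs, is_rrl, threshold=0.75)  # demographics-style cut
loose = purity_and_efficiency(probs, is_rrl, threshold=0.25)   # discovery-style cut
print(strict, loose)
```

The strict cut keeps only high-probability sources, which suits population studies with little followup; the loose cut recovers every true member at the price of contaminants that followup observations must weed out.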
  • 19. 17 known Galactic RCB/DY Per. Paper shown: “Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables: Rare Gems Mined from ASAS”, A. A. Miller (1,*), J. W. Richards (1,2), J. S. Bloom (1), S. B. Cenko (1), J. M. Silverman (1), D. L. Starr (1), and K. G. Stassun (3,4); arXiv:1204.4181v1 [astro-ph.SR], 18 Apr 2012.
      ABSTRACT: We present the results of a machine-learning (ML) based search for new R Coronae Borealis (RCB) stars and DY Persei-like stars (DYPers) in the Galaxy using cataloged light curves obtained by the All-Sky Automated Survey (ASAS). RCB stars, a rare class of hydrogen-deficient carbon-rich supergiants, are of great interest owing to the insights they can provide on the late stages of stellar evolution. DYPers are possibly the low-temperature, low-luminosity analogs to the RCB phenomenon, though additional examples are needed to fully establish this connection. While RCB stars and DYPers are traditionally identified by epochs of extreme dimming that occur without regularity, the ML search framework more fully captures the richness and diversity of their photometric behavior. We demonstrate that our ML method recovers ASAS candidates that would have been missed by traditional search methods employing hard cuts on amplitude and periodicity. Our search yields 13 candidates that we consider likely RCB stars/DYPers: new and archival spectroscopic observations confirm that four of these candidates are RCB stars and four are DYPers. Our discovery of four new DYPers increases the number of known Galactic DYPers from two to six; noteworthy is that one of the new DYPers has a measured parallax and is m ≈ 7 mag, making it the brightest known DYPer to date. Future observations of these new DYPers should prove instrumental in establishing the RCB connection. We consider these results, derived from a machine-learned probabilistic […]
      Affiliations: (1) Department of Astronomy, University of California, Berkeley, CA 94720-3411, USA; (2) Statistics Department, University of California, Berkeley, CA 94720-7450, USA; (3) […]
      Fig. 2. ASAS V-band light curves of newly discovered RCB stars and DYPers. Note the differing magnitude ranges shown for each light curve. Spectroscopic observations confirm the top four candidates to be RCB stars, while the bottom four are DYPers.
  • 20. Extragalactic Transient Universe: Explosive Systems. [Figure, E. Ramirez-Ruiz (UCSC): −log(brightness), M_H from −10 to −22, vs. days since explosion (50 to 200) for Type Ia, NS + NS mergers, Type IIp, NS + RSG collision, IMBH + WD collision, and pair production supernovae; reference levels at z=0.45 and 200 Mpc]
  • 21. Towards a Fully Automated Scientific Stack for Transients. Stack: strategy, scheduling, observing, reduction, finding, discovery, classification, followup, inference, typing, papers. Legend: current state-of-the-art stack; automated (e.g. iPTF); not (yet) automated. NSF/CDI, NSF/BIGDATA
  • 22. Discovery Engine. “bogus” vs. “real” PTF subtractions: a 1000 to 1 needle in the haystack problem.
      Goal: build a framework to discover variable/transient sources without people
        • fast (compared to people)
        • parallelizable
        • transparent
        • deterministic
        • versionable
      Excerpt from H. Brink et al. (truncated on the slide):
      Figure 1. Examples of bogus (top) and real (bottom) thumbnails. Note that the shapes of the bogus sources can be quite varied, which poses a challenge in developing features that can accurately […]
      […] values lie between −1 and 1. As the pixel values for […] candidates can take on a wide range of values depending on the astrophysical source and observing conditions, this normalization ensures that our features are not overly sensitive to the peak brightness of the residual nor the residual background flux, and instead capture the sizes and […] the subtraction residual. Starting with the raw subtraction thumbnail, $I$, normalization is achieved by first subtracting the median pixel value from the subtraction and then dividing by the maximum absolute value […] median-subtracted pixels via $I_N(x, y) = \{ I(x, y) - \mathrm{med}[I(x, y)] \} / \max\{\mathrm{abs}[I(x, y)]\}$ (1). Analysis of the features derived from these normalized […] and bogus subtraction images showed that the transformation in (1) is superior to other alternatives […] the Frobenius norm ($\sqrt{\mathrm{trace}(I^T I)}$) and truncation […] where extreme pixel values are removed. Using Figure 1 as a guide, our first intuition […] real candidates is that their subtractions are typically azimuthally symmetric in nature, and well-represented […] 2-dimensional Gaussian function, whereas bogus candidates are not well behaved. To this end, we define a sp[…] Gaussian, $G(x, y)$, over pixels $x, y$ as $G(x, y) = A \cdot \exp\{ -\tfrac{1}{2} [ (c_x - x)^2 + (c_y - y)^2 \ldots ] \}$, which we fit to the normalized PTF subtraction image of each candidate by minimizing the sum-of-squared difference between the model Gaussian image and the […]
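The thumbnail normalization described in the Brink et al. excerpt on slide 22 (subtract the median pixel value, then divide by the maximum absolute value of the median-subtracted pixels) can be sketched as below. The function name and the 3x3 thumbnail are hypothetical; the denominator follows the prose description, and the Gaussian fit that follows in the excerpt is not attempted because its full form is cut off on the slide.

```python
from statistics import median

def normalize_thumbnail(image):
    """Median-subtract a 2-D thumbnail and scale it so every pixel lies in [-1, 1]."""
    pixels = [value for row in image for value in row]
    med = median(pixels)
    shifted = [[value - med for value in row] for row in image]
    peak = max(abs(value) for row in shifted for value in row)
    if peak == 0:
        # A perfectly flat thumbnail carries no shape information.
        return [[0.0 for _ in row] for row in image]
    return [[value / peak for value in row] for row in shifted]

# Hypothetical subtraction thumbnail: flat background, one bright pixel, one negative artifact.
thumbnail = [[10, 10, 10],
             [10, 50, 10],
             [10, 10, 0]]

# The same residual seen with 3x the flux on a brighter background.
brighter = [[3 * value + 100 for value in row] for row in thumbnail]

normalized = normalize_thumbnail(thumbnail)
print(normalized)
print(normalize_thumbnail(brighter) == normalized)
```

Both thumbnails map to the same normalized image, which is the stated purpose: features computed afterwards respond to the size and shape of the residual rather than to its peak brightness or background level.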
  • 23. “Discovery” is Imperfect. Real and Bogus objects in our training set of 78k detections, 42-dimensional image and context features on each candidate (Brink+2012), but some classifiers work better than others.
      Excerpts from Brink+2012, “Real or Bogus?” (truncated on the slide):
      Fig. 2. Histogram of a selection of features divided in real (purple) and bogus (cyan) populations. First two newly introduced features gauss and amp, the goodness-of-fit and amplitude of the Gaussian fit. Then mag_ref, the magnitude of the source in the reference image, flux_ratio, the ratio of the fluxes in the new and reference images and lastly, ccid, the ID of the camera CCD where the source was detected. The fact that this feature is useful at all is surprising, but we can clearly see that there is a higher probability of the candidates being real or bogus on some of the CCDs.
      […] levels of performance in the astronomy literature ( | joey: add refs | ). A description of the algorithm can be found in Breiman (2001). Briefly, the method aggregates a collection of hundreds to thousands of classification trees, and for a given new candidate, outputs the fraction of classifiers that vote real. If this fraction is greater than some threshold τ, random forest classifies the candidate as real; otherwise it is deemed to be bogus. While an ideal classifier will have no missed detections (i.e., no real identified as bogus), with zero false positives (bogus identified as real), a realistic classifier will typically offer a trade-off between the two types of errors. A receiver operating characteristic (ROC) curve is a commonly used diagram which displays the missed detection rate (MDR) versus the false positive rate (FPR) of a classifier. With any classifier, we face a trade-off between MDR and FPR: the larger the threshold τ by which we […] the lower the MDR but […] Varying τ maps out the […] classifier, and we can com[…] classifiers by comparing […] the lower the curve the […] figure of merit (FoM) for selecting […]
      Fig. 3. Comparison of a few well known classification algorithms applied to the full dataset. ROC curves enable a trade-off between false positives and missed detections, but the best classifier pushes closer towards the origin. Linear models (Logistic Regression or Linear SVMs) perform poorly as expected, while non-linear models (SVMs with radial basis function kernels or Random […] SVM with a radial basis kernel, a common alternative for non-linear classification problems. A line is plotted to show the 1% FPR to which our figure of merit is fixed.
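The thresholding rule and error rates described in the Brink+2012 excerpt on slide 23 can be sketched in a few lines: a candidate is called real when the fraction of trees voting real exceeds τ, and sweeping τ traces out the MDR versus FPR curve. The scores below are hypothetical, and reading the figure of merit as "the lowest MDR achievable with FPR at or below 1%" is an assumption based on the Fig. 3 caption.

```python
def error_rates(scores, is_real, tau):
    """(MDR, FPR) when a candidate is classified real if its vote fraction exceeds tau."""
    n_real = sum(is_real)
    n_bogus = len(is_real) - n_real
    missed = sum(1 for s, real in zip(scores, is_real) if real and s <= tau)
    false_pos = sum(1 for s, real in zip(scores, is_real) if not real and s > tau)
    return missed / n_real, false_pos / n_bogus

def figure_of_merit(scores, is_real, max_fpr=0.01, n_steps=100):
    """Lowest MDR (and the threshold giving it) among thresholds with FPR <= max_fpr."""
    best = None
    for k in range(n_steps + 1):
        tau = k / n_steps
        mdr, fpr = error_rates(scores, is_real, tau)
        if fpr <= max_fpr + 1e-12 and (best is None or mdr < best[0]):
            best = (mdr, tau)
    return best

# Hypothetical vote fractions: 10 real candidates and 100 bogus ones.
real_scores = [0.2, 0.6, 0.7, 0.8, 0.9, 0.95, 0.97, 0.98, 0.99, 1.0]
bogus_scores = [0.0] * 50 + [0.1] * 30 + [0.3] * 15 + [0.5] * 4 + [0.85]
scores = real_scores + bogus_scores
is_real = [True] * len(real_scores) + [False] * len(bogus_scores)

fom_mdr, fom_tau = figure_of_merit(scores, is_real)
print(fom_mdr, fom_tau)
print(error_rates(scores, is_real, 0.85))
```

Fixing the FPR and comparing MDR gives a single number per classifier, which makes it straightforward to rank the algorithms compared in the excerpt's ROC figure.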
  • 24. Learning without Feature Engineering?
      Code: Caffe: Convolutional Architecture for Fast Feature Embedding. C++/CUDA framework for deep learning & vision. https://github.com/BVLC/caffe/pulse
      Hardware: “Dirac” 50 node cluster with NVIDIA Fermi GPU; “Carver” IBM iDataPlex, 9,984 cores. http://www.nersc.gov/users/computational-systems/testbeds/dirac/node-and-gpu-configuration/
  • 25. Supernova Discovery in the Pinwheel Galaxy. PTF11kly (SN 2011fe), ©Peter Nugent. 11 hr after explosion; nearest SN Ia in >3 decades. Discovered by our machine-learning framework. In PTF: >10,000 events in > 0.2 PB of imaging → 50+ journal articles
  • 26. Bloom+2012
  • 27. Real-time Classifications...
  • 28. Real-time Classifications...
  • 29. Last month... LETTER, doi:10.1038/nature13304: “A Wolf–Rayet-like progenitor of SN 2013cu from spectral observations of a stellar wind”, Avishay Gal-Yam, I. Arcavi, E. O. Ofek, S. Ben-Ami, S. B. Cenko, M. M. Kasliwal, Y. Cao, O. Yaron, D. Tal, J. M. Silverman, […]
      [Figure: absolute magnitude (−19 to −13) vs. MJD−56414.93 [days] (−5 to 40); inset: observed magnitude (17 to 22) vs. time since explosion [hours] (0 to 48); legend: SN2013cu r, 3σ upper limits, Swift U, UVW1, parabolic fit, Keck. Spectral line labels: He II 5,412; He I 6,678; [S II] 6,716; N IV 7,123; N IV 7,109; N IV 5,047]
      Extended Data Figure 1 | The r-band light curve of SN 2013cu. A parabolic model of the flux–time (red solid curve) describes the pre-peak data (1σ error bars) very well. Backward extrapolation indicates an explosion date of UTC 2013 May 2.93 ± 0.11 (MJD = 56414.93; 5.7 h before the first iPTF detection; see inset); we estimate the uncertainty from the scatter generated by modifying the subset of points used in the fit. Our first Keck spectrum […] obtained about 15.5 h after explosion (vertical dotted line). Early Swift ultraviolet photometry (diamonds) places a lower limit of T = 25,000 […] black-body temperature measured 40 h after explosion.
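The backward extrapolation in the SN 2013cu caption rests on a parabolic rise, flux = a (t − t0)², so the explosion time t0 is where the fitted curve reaches zero flux. One simple way to illustrate this, assumed here for brevity and not taken from the paper, is to note that the square root of the flux is linear in time and to fit a straight line by ordinary least squares. The function name and the synthetic, noise-free data are hypothetical, and measurement uncertainties are ignored.

```python
import math

def fit_explosion_time(times, fluxes):
    """Fit sqrt(flux) = k * (t - t0) by least squares; return (t0, a) with a = k**2."""
    roots = [math.sqrt(f) for f in fluxes]
    n = len(times)
    t_mean = sum(times) / n
    r_mean = sum(roots) / n
    slope = (sum((t - t_mean) * (r - r_mean) for t, r in zip(times, roots))
             / sum((t - t_mean) ** 2 for t in times))
    intercept = r_mean - slope * t_mean
    # The line crosses zero flux at t0 = -intercept / slope.
    return -intercept / slope, slope ** 2

# Synthetic pre-peak light curve: explosion 5.7 h before the first detection at t = 0 days.
t0_true = -5.7 / 24.0
times = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
fluxes = [2.0 * (t - t0_true) ** 2 for t in times]

t0_fit, amplitude = fit_explosion_time(times, fluxes)
print(t0_fit * 24.0, amplitude)
```

Repeating the fit on different subsets of the points and looking at the scatter in t0 mirrors how the caption says the uncertainty on the explosion date was estimated.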
  • 30. Machine Learning Workflows for the Sake of Science
  • 31. Berkeley Institute for Data Science. http://bitly.com/bundles/fperezorg/1 “Bold new partnership launches to harness potential of data scientists and big data”
      Founded in December 2013 as a result of a year+ long national selection process
      $37.8M over 5 years, along with University of Washington & NYU
      ‣ An accelerator for data-driven discovery
      ‣ An agent of change in the modern university as Data Science takes hold
      ‣ An incubator for the next generation of Data Science technology and practice
  • 32. Parting Thoughts
      • Astronomy’s data deluge demands an abstraction of the traditional roles in the scientific process.
      • Parallel, Distributed Computing/Algorithms and Machine Learning at the heart of what we do
      • Berkeley Institute for Data Science (BIDS) to be an intersection point between physical science & computational/algorithmic efforts
      MMDS 2014 - Berkeley - June 19 2014
  • 33. @profjsb Thank you. MMDS
  • 34. Turning Imagers into Spectrographs (Miller, JSB+14). Data: 5000 variables in SDSS Stripe 82 with spectra. ~80 dimensional regression with Random Forest. Time variability + colors → fundamental stellar parameters