Presented at the 2014 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2014), June 19, 2014 (Berkeley, CA):
The scientific promise of modern astrophysical surveys - from exoplanets to gravity waves - is palpable. Yet extracting insight from the data deluge is neither guaranteed nor trivial: existing paradigms for analysis are already beginning to break down under the data velocity. I will describe our efforts to apply statistical machine learning to large-scale astronomy datasets in both batch and streaming mode. From the discovery of supernovae to the characterization of tens of thousands of variable stars, such approaches are leading the way to novel inference. Specific discoveries concerning precision distance measurements and the use of LSST as a pseudo-spectrograph will be discussed.
13. Structured Classification
Structured Classification: Let class taxonomy guide classifier.
HSC: Hierarchical single-label classification.
• Fit separate classifier at each non-terminal node.
HMC: Hierarchical multi-label classification.
• Fit one classifier, where L(y, f(x)) is weighted by w_0^depth.
Structured Learning (Richards+11): 5% gross misclassification rate!
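To make the depth weighting concrete, here is a minimal sketch of a taxonomy-aware loss in which each node where the true and predicted label paths disagree contributes w0**depth. The toy taxonomy, the function names, and the choice w0 = 0.5 are all assumptions for illustration; the exact loss used in Richards+11 may differ.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "variable": None,
    "pulsating": "variable",
    "eclipsing": "variable",
    "rr_lyrae": "pulsating",
    "mira": "pulsating",
    "beta_lyrae": "eclipsing",
}


def depth(node):
    """Number of edges between `node` and the root."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d


def path_to_root(node):
    """Set containing `node` and all of its ancestors."""
    out = set()
    while node is not None:
        out.add(node)
        node = PARENT[node]
    return out


def hierarchical_loss(true_leaf, pred_leaf, w0=0.5):
    """Sum of w0**depth over nodes where the two label paths disagree."""
    disagree = path_to_root(true_leaf) ^ path_to_root(pred_leaf)
    return sum(w0 ** depth(n) for n in disagree)


within_branch = hierarchical_loss("rr_lyrae", "mira")        # siblings
across_branch = hierarchical_loss("rr_lyrae", "beta_lyrae")  # different branches
```

With w0 < 1, confusing two pulsating classes costs less than confusing a pulsating class with an eclipsing one, which is the sense in which the taxonomy guides the classifier.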
14. Results: All-Sky Automated Survey Classifications
28-class variable star classification problem with 50,000 stars
[Figure: two panels versus AL Iteration (0 to 8). Left: Percent Agreement with ACVS (0.66 to 0.80). Right: Percent of Confident ASAS RF Labels (0.15 to 0.40).]
Off-the-shelf RF: Error Rate = 34.5%
RF w/ Active Learning: Error Rate = 20.5%
3-fold increase in classifier confidence
Note: No other method yielded improvement in classification
Active Learning
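One step of an active-learning loop can be sketched as follows: given class probabilities for unlabeled sources, pick the sources the classifier is least sure about and send them for manual labeling. This uses generic uncertainty sampling as a stand-in; the selection criterion actually used for the ASAS results is not stated on the slide, and the source ids and probabilities below are hypothetical.

```python
def select_queries(class_probs, n_queries):
    """Return ids of the n_queries sources whose top class probability is lowest."""
    confidence = {sid: max(p.values()) for sid, p in class_probs.items()}
    return sorted(confidence, key=confidence.get)[:n_queries]


# Hypothetical classifier outputs for three unlabeled sources.
probs = {
    "src_a": {"mira": 0.90, "rr_lyrae": 0.05, "beta_lyrae": 0.05},
    "src_b": {"mira": 0.40, "rr_lyrae": 0.35, "beta_lyrae": 0.25},
    "src_c": {"mira": 0.10, "rr_lyrae": 0.60, "beta_lyrae": 0.30},
}
queries = select_queries(probs, 2)
```

The queried sources would then be labeled, added to the training set, and the classifier refit before the next iteration.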
16. Machine-Learned Classification
25-class variable star
Data: 50k from ASAS, 810 with known labels (timeseries, colors)
PRRL = 0.94
Richards+12
74-dimensional feature set for learning
featurization is the bottleneck (but embarrassingly parallel)
Astrophysical Journal Supplement Series, 203:32 (27pp), 2012 December Richards et al.
[Figure: confusion matrix, Predicted Class (vertical axis) versus True Class (horizontal axis). The individual cell values are not recoverable from the extraction. Classes, with the number of true objects of each:]
a. Mira: 91
b1. Semireg PV: 68
b2. SARG A: 15
b3. SARG B: 29
b4. LSP: 54
c. RV Tauri: 24
d. Classical Cepheid: 89
e. Pop. II Cepheid: 13
f. Multi. Mode Cepheid: 4
g. RR Lyrae, FM: 86
h. RR Lyrae, FO: 29
i. RR Lyrae, DM: 2
j. Delta Scuti: 28
j1. SX Phe: 1
l. Beta Cephei: 18
o. Pulsating Be: 5
p. RSG: 35
q. Chem. Peculiar: 23
r1. RCB: 17
s1. Class. T Tauri: 4
s2. Weak-line T Tauri: 20
s3. RS CVn: 17
t. Herbig AE/BE: 8
u. S Doradus: 1
v. Ellipsoidal: 1
w. Beta Persei: 27
x. Beta Lyrae: 33
y. W Ursae Maj.: 68
Figure 5. Cross-validated confusion matrix for all 810 ASAS training sources. Columns are normalized to sum to unity, with the total number of true objects of each
class listed along the bottom axis. The overall correspondence rate for these sources is 80.25%, with at least 70% correspondence for half of the classes. Classes with
low correspondence are those with fewer than 10 training sources or classes which are easily confused. Red giant classes tend to be confused with other red giant
classes and eclipsing classes with other eclipsing classes. There is substantial power in the top-right quadrant, where rotational and eruptive classes are misclassified
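The column normalization and the overall correspondence rate described in the caption can be sketched as below. The 3-class count matrix is hypothetical and the function names are illustrative; the layout follows the figure, with predicted class indexing rows and true class indexing columns. For scale, a rate of 80.25% over 810 sources corresponds to 650 sources on the diagonal.

```python
def column_normalize(counts):
    """Divide each column by its total so every column sums to 1."""
    n = len(counts)
    totals = [sum(counts[r][c] for r in range(n)) for c in range(n)]
    return [[counts[r][c] / totals[c] for c in range(n)] for r in range(n)]


def overall_agreement(counts):
    """Fraction of all sources on the diagonal (predicted == true)."""
    n = len(counts)
    total = sum(sum(row) for row in counts)
    return sum(counts[i][i] for i in range(n)) / total


# Hypothetical 3-class counts, indexed as counts[predicted][true].
counts = [
    [8, 1, 0],
    [2, 6, 1],
    [0, 1, 1],
]
normalized = column_normalize(counts)
agreement = overall_agreement(counts)
```

After normalization each diagonal entry is the per-class recovery rate, so a class with very few training sources (the last column here) can swing between 0 and 1 on a handful of objects.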
19. Discovery of Bright Galactic R Coronae Borealis and DY Persei
Variables: Rare Gems Mined from ASAS
A. A. Miller (1,*), J. W. Richards (1,2), J. S. Bloom (1), S. B. Cenko (1), J. M. Silverman (1), D. L. Starr (1), and K. G. Stassun (3,4)
ABSTRACT
We present the results of a machine-learning (ML) based search for new R
Coronae Borealis (RCB) stars and DY Persei-like stars (DYPers) in the Galaxy
using cataloged light curves obtained by the All-Sky Automated Survey (ASAS).
RCB stars—a rare class of hydrogen-deficient carbon-rich supergiants—are of
great interest owing to the insights they can provide on the late stages of stellar
evolution. DYPers are possibly the low-temperature, low-luminosity analogs to
the RCB phenomenon, though additional examples are needed to fully establish
this connection. While RCB stars and DYPers are traditionally identified
by epochs of extreme dimming that occur without regularity, the ML search
framework more fully captures the richness and diversity of their photometric
behavior. We demonstrate that our ML method recovers ASAS candidates that
would have been missed by traditional search methods employing hard cuts on
amplitude and periodicity. Our search yields 13 candidates that we consider
likely RCB stars/DYPers: new and archival spectroscopic observations confirm
that four of these candidates are RCB stars and four are DYPers. Our discovery
of four new DYPers increases the number of known Galactic DYPers from two
to six; noteworthy is that one of the new DYPers has a measured parallax and is
m ≈ 7 mag, making it the brightest known DYPer to date. Future observations
of these new DYPers should prove instrumental in establishing the RCB connection.
We consider these results, derived from a machine-learned probabilistic
(1) Department of Astronomy, University of California, Berkeley, CA 94720-3411, USA
(2) Statistics Department, University of California, Berkeley, CA, 94720-7450, USA
arXiv:1204.4181v1 [astro-ph.SR] 18 Apr 2012
Fig. 2.— ASAS V-band light curves of newly discovered RCB stars and DYPers. Note the
differing magnitude ranges shown for each light curve. Spectroscopic observations confirm
the top four candidates to be RCB stars, while the bottom four are DYPers.
17 known Galactic RCB/DY Per
20. Extragalactic Transient Universe: Explosive Systems
E. Ramirez-Ruiz (UCSC)
[Figure: MH (-10 to -22; axis labeled -log(brightness)) versus Days Since Explosion (50 to 200) for Type Ia, NS + NS Mergers, Type IIp, NS + RSG Collision, IMBH + WD Collision, and Pair Production Supernovae. Annotations: z=0.45, 200Mpc.]
23. 4 H. Brink et al.
Figure 1. Examples of bogus (top) and real (bottom) thumbnails.
Note that the shapes of the bogus sources can be quite varied,
which poses a challenge in developing features that can accurately
values lie between −1 and 1. As the pixel values fo
didates can take on a wide range of values depend
astrophysical source and observing conditions, th
ization ensures that our features are not overly se
the peak brightness of the residual nor the residu
background flux, and instead capture the sizes and
the subtraction residual. Starting with the raw su
thumbnail, I, normalization is achieved by first
ing the median pixel value from the subtraction
and then dividing by the maximum absolute value
median-subtracted pixels via
I_N(x, y) = \frac{I(x, y) - \mathrm{med}[I(x, y)]}{\max\{\mathrm{abs}[I(x, y)]\}}.
Analysis of the features derived from these norm
and bogus subtraction images showed that the
mation in (1) is superior to other alternatives
the Frobenius norm (\sqrt{\mathrm{trace}(I^T I)}) and truncatio
where extreme pixel values are removed.
Using Figure 1 as a guide, our first intuit
real candidates is that their subtractions are typ
imuthally symmetric in nature, and well-represe
2-dimensional Gaussian function, whereas bogus c
are not well behaved. To this end, we define a sp
Gaussian, G(x, y), over pixels x, y as
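G(x, y) = A \cdot \exp\left\{ -\frac{1}{2} \left[ (c_x - x)^2 + (c_y - y)^2 \right] \right\} [any width parameter in the exponent is not recoverable from the extraction]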
which we fit to the normalized PTF subtraction i
of each candidate by minimizing the sum-of-squa
ence between the model Gaussian image and the
“bogus”
“real”
PTF subtractions
Goal: build a framework to discover variable/transient sources without people
• fast (compared to people)
• parallelizable
• transparent
• deterministic
• versionable
1000 to 1 needle in the haystack problem
Discovery Engine
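The thumbnail normalization and Gaussian-fit features from the Brink et al. excerpt on this slide can be sketched roughly as follows. Several things here are assumptions rather than the paper's procedure: the width parameter `sigma` (lost in the extraction), the Gaussian centre fixed at the stamp centre, the scan over a small list of candidate widths, and all function names. The normalization divides by the largest absolute median-subtracted pixel, as the prose describes, so the output lies between −1 and 1.

```python
import math
from statistics import median


def normalize(image):
    """Subtract the median pixel, then divide by the largest absolute residual."""
    med = median(v for row in image for v in row)
    shifted = [[v - med for v in row] for row in image]
    peak = max(abs(v) for row in shifted for v in row)
    if peak == 0:  # flat stamp: nothing to scale
        return shifted
    return [[v / peak for v in row] for row in shifted]


def gaussian(x, y, cx, cy, sigma):
    """Unit-amplitude circular Gaussian; sigma is an assumed width parameter."""
    return math.exp(-0.5 * ((cx - x) ** 2 + (cy - y) ** 2) / sigma ** 2)


def fit_gaussian(image, sigmas):
    """Least-squares fit of A * gaussian centred on the stamp, scanning sigma.

    Returns (sum of squared residuals, amplitude, sigma) for the best sigma.
    """
    ny, nx = len(image), len(image[0])
    cx, cy = (nx - 1) / 2, (ny - 1) / 2
    cells = [(x, y) for y in range(ny) for x in range(nx)]
    best = None
    for sigma in sigmas:
        model = {(x, y): gaussian(x, y, cx, cy, sigma) for x, y in cells}
        # For a fixed sigma the best amplitude has a closed form.
        amp = (sum(image[y][x] * model[x, y] for x, y in cells)
               / sum(m * m for m in model.values()))
        sse = sum((image[y][x] - amp * model[x, y]) ** 2 for x, y in cells)
        if best is None or sse < best[0]:
            best = (sse, amp, sigma)
    return best


# Synthetic "real" stamp (hypothetical data): a blob on a flat background.
stamp = [[10.0 + 5.0 * gaussian(x, y, 4, 4, 1.5) for x in range(9)]
         for y in range(9)]
norm = normalize(stamp)
gauss_feature, amp_feature, width = fit_gaussian(norm, [0.5, 1.0, 1.5, 2.0, 3.0])
```

The residual sum of squares and the fitted amplitude play the role of goodness-of-fit and amplitude features for each candidate; a full fit would also let the centre vary.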
24. “Discovery” is Imperfect
he lower the MDR but
Varying τ maps out the
sifier, and we can com-
classifiers by comparing
the lower the curve the
erit (FoM) for selecting
Fig. 3.— Comparison of a few well known classification algorithms
applied to the full dataset. ROC curves enable a trade-off
between false positives and missed detections, but the best classifier
pushes closer towards the origin. Linear models (Logistic Regression
or Linear SVMs) perform poorly as expected, while non-linear
models (SVMs with radial basis function kernels or Random
Real or Bogus? 5
Fig. 2.— Histogram of a selection of features divided into real (purple) and bogus (cyan) populations. The first two are the newly introduced features
gauss and amp, the goodness-of-fit and amplitude of the Gaussian fit. Then mag ref, the magnitude of the source in the reference image,
flux ratio, the ratio of the fluxes in the new and reference images and lastly, ccid, the ID of the camera CCD where the source was
detected. The fact that this feature is useful at all is surprising, but we can clearly see that there is a higher probability of the candidates
being real or bogus on some of the CCDs.
els of performance in the astronomy literature ( | joey:
add refs | ). A description of the algorithm can be found
in Breiman (2001). Briefly, the method aggregates a collection
of hundreds to thousands of classification trees,
and for a given new candidate, outputs the fraction of
classifiers that vote real. If this fraction is greater than
some threshold τ, random forest classifies the candidate
as real; otherwise it is deemed to be bogus.
While an ideal classifier will have no missed detections
(i.e., no real identified as bogus), with zero false positives
(bogus identified as real), a realistic classifier will typically
offer a trade-off between the two types of errors. A
receiver operating characteristic (ROC) curve is a commonly
used diagram which displays the missed detection
rate (MDR) versus the false positive rate (FPR) of a classifier.
With any classifier, we face a trade-off between
MDR and FPR: the larger the threshold τ by which we
SVM with a radial basis kernel, a common alternative
for non-linear classification problems. A line is plotted
to show the 1% FPR to which our figure of merit is fixed.
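The threshold trade-off and the fixed-FPR figure of merit in the excerpt above can be sketched as follows: a candidate is called real when its vote fraction exceeds τ, and the figure of merit is the smallest missed detection rate among thresholds whose false positive rate stays at or below the chosen limit. The scores, labels, and function names are hypothetical, and scanning only the observed score values as thresholds is a simplification.

```python
def rates(scores, is_real, tau):
    """Missed detection rate and false positive rate at vote threshold tau."""
    n_real = sum(is_real)
    n_bogus = len(is_real) - n_real
    # A real source is missed when its vote fraction does not exceed tau.
    missed = sum(1 for s, r in zip(scores, is_real) if r and s <= tau)
    # A bogus source is a false positive when its vote fraction exceeds tau.
    false_pos = sum(1 for s, r in zip(scores, is_real) if not r and s > tau)
    return missed / n_real, false_pos / n_bogus


def mdr_at_fpr(scores, is_real, max_fpr=0.01):
    """Smallest MDR (and its tau) over thresholds with FPR <= max_fpr."""
    best = None
    for tau in sorted(set(scores)):
        mdr, fpr = rates(scores, is_real, tau)
        if fpr <= max_fpr and (best is None or mdr < best[0]):
            best = (mdr, tau)
    return best


# Hypothetical vote fractions and truth labels (1 = real, 0 = bogus).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
is_real = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
strict = mdr_at_fpr(scores, is_real, max_fpr=0.01)
loose = mdr_at_fpr(scores, is_real, max_fpr=0.2)
```

Tightening the allowed FPR pushes τ up and raises the MDR, which is the trade-off the ROC curve traces out; comparing classifiers at one fixed FPR reduces each curve to a single number.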
Brink+2012
Real and Bogus objects in our training set of 78k detections, 42-dimensional image and context features on each candidate
but some classifiers work better than others
30. LETTER doi:10.1038/nature13304
A Wolf–Rayet-like progenitor of SN 2013cu from
spectral observations of a stellar wind
Avishay Gal-Yam, I. Arcavi, E. O. Ofek, S. Ben-Ami, S. B. Cenko, M. M. Kasliwal, Y. Cao, O. Yaron, D. Tal, J. M. Silverman, ...
[Figure: Absolute magnitude (−13 to −19) versus MJD−56414.93 [days] (−5 to 40). Legend: SN2013cu r, 3σ upper limits, Swift U,UVW1, parabolic fit. Inset: Observed magnitude (17 to 22) versus Time since explosion [hours] (0 to 48), with two Keck epochs marked.]
Extended Data Figure 1 | The r-band light curve of SN 2013cu. A parabolic
model of the flux–time (red solid curve) describes the pre-peak data (1σ
error bars) very well. Backward extrapolation indicates an explosion date of
UTC 2013 May 2.93 ± 0.11 (MJD = 56414.93; 5.7 h before the first iPTF
detection; see inset); we estimate the uncertainty from the scatter generated by
modifying the subset of points used in the fit. Our first Keck spectrum [...]
obtained about 15.5 h after explosion (vertical dotted line). Early Swift [...]
ultraviolet photometry (diamonds) places a lower limit of T = 25,000 [...]
black-body temperature measured 40 h after explosion.
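The backward extrapolation in the caption can be illustrated with a simplified sketch. It assumes the parabolic model has the form flux = a (t − t0)^2, so that sqrt(flux) is linear in time and t0 is where the fitted line crosses zero; it also estimates the scatter by leaving out one point at a time. Both the model form and the leave-one-out scheme are assumptions, not the authors' stated procedure, and the data are synthetic.

```python
import math


def explosion_time(times, fluxes):
    """Fit sqrt(flux) = m * t + b by least squares and return t0 = -b / m."""
    roots = [math.sqrt(f) for f in fluxes]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(roots) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(times, roots))
             / sum((t - t_mean) ** 2 for t in times))
    intercept = y_mean - slope * t_mean
    return -intercept / slope


def subset_scatter(times, fluxes):
    """Spread of t0 when each point is left out of the fit in turn."""
    estimates = [
        explosion_time(times[:i] + times[i + 1:], fluxes[:i] + fluxes[i + 1:])
        for i in range(len(times))
    ]
    return max(estimates) - min(estimates)


# Synthetic pre-peak data (hypothetical): flux = 2 * (t - 1.5)**2, t in days.
times = [2.0, 3.0, 4.0, 5.0, 6.0]
fluxes = [2.0 * (t - 1.5) ** 2 for t in times]
t0 = explosion_time(times, fluxes)
spread = subset_scatter(times, fluxes)
```

With noisy photometry the spread across subsets gives a rough handle on how well t0 is constrained; a weighted fit in flux space would be the more careful choice when error bars vary from point to point.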
[Figure: spectrum with line labels He II 5,412; He I 6,678; [S II] 6,716; N IV 7,123; N IV 7,109; N IV 5,047.]
Last month...
32. Berkeley Institute for Data Science
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded in December 2013 as a result of a year+ long national selection process
$37.8M over 5 years, along with University of Washington & NYU
‣ An accelerator for data-driven discovery
‣ An agent of change in the modern university as Data Science takes hold
‣ An incubator for the next generation of Data Science technology and practice