University of New Mexico
Data Driven Sample Generator Model with Application to
Classification
Supervisor
Dr. Erik Erhardt
Candidate
Alvaro Ulloa
April 15, 2016

Outline
Introduction
Motivation
Thesis statement
Contributions
Materials
Machine Learning methods
Random Variable Samplers
Matrix Factorization
Data Driven Sample Generator
Case Study
Results
Conclusion
2 of 51
Alvaro Ulloa - Data Driven Sample Generator with App to classification

Introduction
• Machine Learning
◦ Automate decision making
◦ Learn from experience
◦ Generalize data properties from a
subset
• Regularization
◦ Weight sparseness: L1, L2
◦ Weight averaging: Dropout
◦ Weight variation: Noise insertion.
• Rely on design and previous
knowledge of the data
• Data size: Big and Small

Introduction
Big Data
Large number of samples vs
number of features.
Crowd sourced.
Cheap to collect.
Images, text, video, and
sound.
Generally helps ML
methods avoid overfitting.
Expensive to compute.
Small data
Small number of samples vs
number of features
Expensive to collect.
ML methods often overfit.
Biomedical data
Not necessarily expensive to
compute

Motivation
• Mental Illness
◦ In 2014, there were an estimated 9.8
million adults in the US with severe
mental illness. [1]
• Structural MRI
◦ Large number of voxels (∼50,000)
◦ Small number of samples (∼400)
◦ Small data scenario
Need for regularization models to
alleviate overfitting effects when
investigating SMRI for mental illness

Thesis statement
• Augmenting a small dataset artificially may lead to improved
classification scores.
• ML methods may benefit from the induced variability, avoid
overfitting, and improve classification scores.

Contributions
• Data-driven sample generation technique
• Optimized rejection sampler
• Enable deep learning for classification of SMRI data

Materials
Machine Learning Methods

Materials: Non-Parametric Classifiers
Nearest Neighbors
• Search for the k-closest points
and vote
Decision Tree
• Sequence of decision rules
based on each feature
Random Forest
• Several decision trees that vote
[Figure: example decision tree that classifies coffee as Good or Bad, with splits on age (≤ 1 year or not), origin (Tropical, Polar, Mediterranean), and Organic vs. Non-organic.]

Materials: Linear Classifiers
Logistic Regression
$\log \frac{p(y|x)}{1 - p(y|x)} = c + x^T w$
$\min_{w,c} \|w\|_L + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^T w + c)\right) + 1\right)$
Linear SVM
Search for a plane $w^T x + c = 0$
Primal: $\min_{c,w,\zeta} \|w\|_L + C \sum_{i=1}^{n} \zeta_i$ subject to
$y_i (w^T \phi(x_i) + c) \geq 1 - \zeta_i$, $\zeta_i \geq 0$, $i = 1, \ldots, n$

Materials: Non-Linear Classifiers
Naive Bayes
• $p(y|x) = \frac{p(y)\,p(x|y)}{p(x)}$
• Assumes Gaussian
distribution and
independence
Polynomial, Radial SVM
• Polynomial:
$K(x, x') = (x^T x' + c)^d$
• Radial:
$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
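As a minimal sketch of the two kernels above, the following evaluates them on plain Python lists. The function names and the default values of c, d, and σ are illustrative assumptions, not values taken from the slides.

```python
import math

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x . z + c)^d (c and d are assumed defaults)."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** d

def radial_kernel(x, z, sigma=1.0):
    """Radial kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = [1.0, 2.0], [3.0, 0.0]
k_poly = polynomial_kernel(x, z)   # (1*3 + 2*0 + 1)^2
k_rbf = radial_kernel(x, z)        # exp(-(4 + 4) / 2)
```

The radial kernel equals 1 when both arguments coincide and decays with distance, which is why the sign inside the exponential matters.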
Multilayer Perceptron
• Flexible
• Hard to train
• Each layer improves its ability to
fit more complex data
• Highly prone to overfitting
[Figure: multilayer perceptron with an input layer (4 inputs), three hidden layers, and an output layer (3 outputs).]

Materials
Random Variable Samplers

Materials: Rejection Sampler
e(x): Envelope function,
$e(x) = \alpha h(x|\theta)$
α: Scale
h(x|θ): PDF that is easy to
sample from
repeat
  Sample y ∼ h(y|θ)
  Sample u ∼ Uniform(0, e(y))
  if u > f(y) then
    Reject y
  else
    Accept y as a sample from f(x)
  end if
until the desired number of samples is
accepted
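The loop above can be sketched in Python using only the standard library. The target and envelope here are assumptions chosen for illustration: f is the Beta(2, 2) density, h is Uniform(0, 1), and α = 1.5; the helper name `rejection_sample` is hypothetical.

```python
import random

def rejection_sample(f, sample_h, h, alpha, n, rng):
    """Draw n samples from density f using the envelope e(x) = alpha * h(x)."""
    accepted = []
    proposed = 0
    while len(accepted) < n:
        y = sample_h(rng)                    # y ~ h
        u = rng.uniform(0.0, alpha * h(y))   # u ~ Uniform(0, e(y))
        proposed += 1
        if u <= f(y):                        # accept when u falls under f(y)
            accepted.append(y)
    return accepted, len(accepted) / proposed

# Assumed example: target Beta(2, 2), proposal Uniform(0, 1), alpha = 1.5.
f = lambda x: 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0
h = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
rng = random.Random(0)
samples, rate = rejection_sample(f, lambda r: r.random(), h, 1.5, 5000, rng)
mean = sum(samples) / len(samples)
```

When f and h are both normalized densities, the expected acceptance rate is 1/α, so a tighter envelope wastes fewer proposals.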

Materials: Rejection Sampler
[Figure: left, target density $f(x) = \exp\left(-\frac{(x-1)^2}{2x}\right)\frac{x+1}{12}$ for x from 0 to 16; right, histogram of generated samples overlaid on f(x).]

Materials: Rejection Sampler
[Figure: accepted and rejected (y, u) pairs under two envelopes e(x) over f(x); the right panel uses a more efficient e(x).]

Materials: Optimized Rejection Sampler
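[Figure: accepted and rejected (y, u) pairs under a more efficient envelope e(x) over f(x).]
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \int (\alpha h(x|\theta) - f(x))\,dx$, s.t. $e(x) - f(x) \geq 0$, $\forall x \in \mathbb{R}$
Since h(·) and f(·) are PDFs, the integral equals α − 1, so it reduces to
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - f(x) \geq 0$, $\forall x \in \mathrm{Domain}\{f\}$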

Materials: Optimized Rejection Sampler
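Let $f(x) = \mathrm{Beta}(2, 2) = 6x(1 - x)$, $x \in [0, 1]$, and
$h(x|\theta) = \mathrm{Uniform}(0, \theta) = 1/\theta$ if $0 \leq x \leq \theta$, and 0 otherwise.
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - 6x(1 - x) \geq 0$, $\forall x \in [0, 1]$
For θ < 1, there is no solution, so θ ≥ 1 is required for the constraint to hold.
Since 6x(1 − x) attains its maximum of 1.5 at x = 1/2,
$\frac{\alpha}{\theta} \geq 6x(1 - x)\ \forall x \in [0, 1] \Rightarrow \frac{\alpha}{\theta} \geq 1.5$.
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\frac{\alpha}{\theta} \geq 1.5$ and $\theta \geq 1$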

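Using Lagrange multipliers,
$L(\alpha, \theta, \lambda, \gamma) = \alpha - \lambda\left(\frac{\alpha}{\theta} - 1.5\right) - \gamma(\theta - 1)$, we
solve
$\frac{\partial L}{\partial \alpha} = 1 - \frac{\lambda}{\theta} = 0$
$\frac{\partial L}{\partial \theta} = \lambda \frac{\alpha}{\theta^2} - \gamma = 0$
$\frac{\partial L}{\partial \lambda} = -\left(\frac{\alpha}{\theta} - 1.5\right) = 0$
$\frac{\partial L}{\partial \gamma} = -(\theta - 1) = 0$
Then, the solution is θ = 1 and
α = 1.5 (with λ = 1 and γ = 1.5),
which results in the optimal
e(x) = 1.5 Uniform(0, 1).
This is correct since the maximum
value of the Beta(2, 2) density is 1.5.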
[Figure: optimal envelope e(x) = 1.5 over f(x) = Beta(2, 2) on [0, 1].]
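The Beta(2, 2) example can also be checked numerically. The sketch below is a brute-force search over a small grid of θ values, not the Lagrangian derivation itself; the helper name `min_alpha` and both grids are assumptions made for illustration.

```python
def beta22(x):
    """Beta(2, 2) density on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def min_alpha(theta, grid_size=10001):
    """Smallest alpha with alpha * Uniform(0, theta) >= Beta(2, 2) on [0, 1]."""
    if theta < 1.0:
        return float("inf")  # envelope is zero on (theta, 1], so no alpha works
    peak = max(beta22(i / (grid_size - 1)) for i in range(grid_size))
    return theta * peak      # alpha / theta must be at least max f(x)

thetas = [0.5 + 0.25 * k for k in range(11)]   # 0.5, 0.75, ..., 3.0
alphas = {theta: min_alpha(theta) for theta in thetas}
best_theta = min(alphas, key=alphas.get)
best_alpha = alphas[best_theta]
acceptance_rate = 1.0 / best_alpha
```

On this grid the search selects θ = 1 and α = 1.5, which corresponds to an expected acceptance rate of 2/3.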

Materials: Multivariate Normal
• Compute the sample mean and sample covariance matrix
• Generate samples with the same mean and covariance
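A minimal sketch of these two steps with NumPy follows. The function name `sample_mvn_like` and the toy data are assumptions; the generated samples match the estimated mean and covariance only approximately, up to sampling error.

```python
import numpy as np

def sample_mvn_like(data, n_new, seed=0):
    """Draw n_new rows from a multivariate normal fitted to the rows of data."""
    rng = np.random.default_rng(seed)
    mean = data.mean(axis=0)             # sample mean
    cov = np.cov(data, rowvar=False)     # sample covariance (features in columns)
    return rng.multivariate_normal(mean, cov, size=n_new)

# Toy data with correlated columns (illustrative only).
rng = np.random.default_rng(1)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
toy = rng.normal(size=(200, 3)) @ mixing
new = sample_mvn_like(toy, n_new=20000)
```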

Materials
Matrix Factorization
X = AS

Materials: PCA
• Introduced by Hotelling in 1933, still widely used.
• Algebraically: linear combinations of X.
• Geometrically: coordinate system rotation.
• $E[XX^T] = U \Lambda U^T$, $S = \Lambda^{-1/2} U^T X$, and $A = U \Lambda^{1/2}$
[Figure: scatter plots of true sources (S), mixed sources (X), and estimated sources (Ŝ).]
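The PCA factorization X = AS can be sketched with NumPy as below. This assumes X holds one variable per row and has been centered, and it estimates E[XX^T] by the sample average; the function name `pca_factorize` and the toy mixing matrix are illustrative assumptions.

```python
import numpy as np

def pca_factorize(X):
    """Return (A, S) with X = A @ S; rows of X are centered variables."""
    n = X.shape[1]
    cov = X @ X.T / n                        # sample estimate of E[X X^T]
    eigvals, U = np.linalg.eigh(cov)         # cov = U diag(eigvals) U^T
    S = np.diag(eigvals ** -0.5) @ U.T @ X   # S = Lambda^{-1/2} U^T X
    A = U @ np.diag(eigvals ** 0.5)          # A = U Lambda^{1/2}
    return A, S

rng = np.random.default_rng(0)
sources = rng.uniform(-1.0, 1.0, size=(2, 1000))
mixed = np.array([[2.0, 1.0], [1.0, 1.5]]) @ sources
mixed -= mixed.mean(axis=1, keepdims=True)   # center each variable
A, S = pca_factorize(mixed)
```

The estimated sources S are uncorrelated with unit variance, but PCA fixes them only up to a rotation, which is the gap ICA aims to close.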

Materials: ICA
• Introduced by Hérault et al. in 1983 as an extension of PCA.
• ICA searches for independence
• Assumes independent sources, with no more than one Gaussian-distributed
source
[Figure: scatter plots of true sources (S), mixed sources (X), ICA estimate (Ŝ), and PCA estimate (Ŝ).]

Materials: Infomax ICA
• Joint entropy: $H(x) = -\int f(x) \log f(x)\,dx$, where f(x) is a joint
PDF.
• Mutual information: $I(x) = -H(g(x)) + E\left[\sum_i \log \frac{|g_i'(x_i)|}{f_i(x_i)}\right]$, where
$g(x) = \frac{1}{1 + \exp(-x)}$
• Infomax: $W = \arg\max_W H(g(WX))$, where $W = A^{-1}$

Proposed method
Data Driven Sample
Generator

• Generate augmented datasets for ML methods to train on.
• ML methods tend to fail on datasets that are rich in features but short on
samples.
• Two assumptions:
◦ The input dataset is reducible, i.e., the reconstruction error from matrix
factorization is minimal.
◦ A group of samples with a common diagnosis shares statistical
properties that are reflected in their loading coefficients (A).
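One way to read the two assumptions is sketched below: factorize the data as X ≈ AS, fit a sampler to the loading coefficients of each diagnostic group, draw new loadings, and map them back to feature space. This is a sketch under stated assumptions, not the author's implementation: a truncated SVD stands in for the matrix factorization, a multivariate normal stands in for the random variable sampler, and the name `generate_samples` and all parameter values are hypothetical.

```python
import numpy as np

def generate_samples(X, y, n_components, n_new_per_class, seed=0):
    """Sketch: factorize X ~ A @ S, then sample new loadings per class."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    # Truncated SVD as a stand-in factorization (assumption; ICA is another option).
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]   # loadings, one row per sample
    S = Vt[:n_components]                        # components shared by all samples
    X_new, y_new = [], []
    for label in np.unique(y):
        A_c = A[y == label]
        # Multivariate normal fitted to this class's loading coefficients.
        A_gen = rng.multivariate_normal(A_c.mean(axis=0),
                                        np.cov(A_c, rowvar=False),
                                        size=n_new_per_class)
        X_new.append(A_gen @ S + mean)           # map loadings back to feature space
        y_new.append(np.full(n_new_per_class, label))
    return np.vstack(X_new), np.concatenate(y_new)

# Toy "small data" problem: 40 samples, 500 features, two classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 500)) + y[:, None] * 0.5
X_aug, y_aug = generate_samples(X, y, n_components=5, n_new_per_class=100)
```

The generated rows can then be used as additional training data, while evaluation stays on held-out real samples.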

Block diagram

Classification framework

Case Study
Case Study: Schizophrenia

Case Study: Dataset
         Patient         Control         Total
Male     121             97              218
Female   77              94              171
Age      39.68 ± 12.12   40.26 ± 15.02
Total    198             191             389

Case Study: ANOVA
Age Grouping
                 Healthy          Patient
Age              Male   Female    Male   Female   Total
Young (16-33)    39     35        37     19       130
Adult (34-43)    27     25        51     25       128
Senior (44-81)   31     34        33     33       131
Total            97     94        121    77       389

Case Study: ANOVA
Full Model:
GMC = µ + Diagnosis + Age + Gender + Diagnosis ∗ Age +
Diagnosis ∗ Gender + Age ∗ Gender + Diagnosis ∗ Age ∗ Gender + ε
Reduced Model:
• Check the significance level of the three-way interaction
(age-gender-diagnosis).
• If the three-way interaction is not significant, conduct a model
comparison test (generalized linear F-test) to assess the reduced
model.
• If the test supports the reduced model, adopt it.
• Repeat with the least significant two-way interaction.
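The model comparison step can be illustrated with an F statistic for two nested linear models. The sketch below uses ordinary least squares on toy data with two binary factors; the function name `nested_f_statistic` and the data are assumptions, and both design matrices are assumed to have full column rank.

```python
import numpy as np

def nested_f_statistic(y, X_full, X_reduced):
    """F statistic and degrees of freedom for comparing nested linear models."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)          # residual sum of squares
    n = len(y)
    p_full, p_red = X_full.shape[1], X_reduced.shape[1]
    df_num, df_den = p_full - p_red, n - p_full
    f_stat = ((rss(X_reduced) - rss(X_full)) / df_num) / (rss(X_full) / df_den)
    return f_stat, df_num, df_den

# Toy data: the response depends on the main effects only.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
b = rng.integers(0, 2, size=200)
y = 1.0 + 0.5 * a - 0.3 * b + rng.normal(scale=0.1, size=200)
ones = np.ones(200)
X_reduced = np.column_stack([ones, a, b])       # main effects
X_full = np.column_stack([ones, a, b, a * b])   # plus the interaction
f_stat, df_num, df_den = nested_f_statistic(y, X_full, X_reduced)
```

A p-value follows from the upper tail of the F distribution with (df_num, df_den) degrees of freedom, for example via `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available; a large p-value supports dropping the interaction term.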

Case Study: Classification Framework
Method                    Parameter                    Values
Nearest Neighbors         Number of neighbors          [1, 5, 10, 20]
Decision Tree             Maximum number of features   'auto'
Random Forest             Number of estimators         [5...20]
Naive Bayes               Kernel                       Gaussian
Logistic Regression       C                            [0.001, 0.1, 1]
Support Vector Machines   Kernel                       [radial, polynomial]
                          C                            [0.01, 0.1, 1]
Linear SVM                C                            [0.01, 0.1, 1]
                          Penalty                      ['L1', 'L2']
Multilayer Perceptron     Depth                        [3, 4, 5]
                          Number of hidden units       [50, 100, 200]

Results

Results: Generator sample
[Figure: three image panels on a common color scale (0.00 to 0.72); one is labeled "Real sample" and the slide asks "Which is real?"]

Results: ANOVA
[Figure: brain maps with color bars for the effects Diagnosis, Age, Gender, Gender-Diagnosis, Age-Diagnosis, and Age-Gender-Diagnosis.]

Results: ANOVA

Results: ANOVA
Three-way ANOVA group means for the main effects on the schizophrenia dataset.
Group means (×10^-2)
Effect      Brain Region                    Control   Patient
Diagnosis   Right Superior Temporal Gyrus   57.7      52.9
            Left Superior Temporal Gyrus    60.2      55.4
            Superior Frontal Gyrus          39.3      35.7
Effect      Brain Region                    Young     Adult     Senior
Age         Left Thalamus                   30.8      33.7      35.9
            Right Thalamus                  34.1      37.5      39.8
            Right Parahippocampal Gyrus     68.3      73.7∗     73.3∗
            Left Parahippocampal Gyrus      66.8      71.3∗     71.7∗
Gender      None
∗ Not statistically different.

Results: ANOVA
Three-way ANOVA group means for the effects of interactions on the
schizophrenia dataset.
Group means (×10^-2)
Gender-Diagnosis: Right Fusiform Gyrus
          Control      Patient
Male      27.7 (a)     29.7 (b)
Female    29.4 (a,b)   28.1 (a,b)
Age-Diagnosis: Right Inferior Parietal Lobule
          Young        Adult        Senior
Control   54.5 (c)     53.0 (b,c)   47.1 (a)
Patient   47.2 (a)     50.7 (a,b)   48.8 (a)
Age-Diagnosis: Left Inferior Parietal Lobule
          Young        Adult        Senior
Control   51.9 (b,c)   52.2 (c)     47.1 (a)
Patient   45.4 (a)     50.2 (a,b)   48.3 (a,b)
Age-Gender-Diagnosis: Left Precuneus
Senior Female Patient   Others
43.0                    47.1

Results: Classification
Method                  Raw           ICA          PCA          Augmented
Logistic Regression     72.1 ± 3.5    66.4 ± 7.6   67.5 ± 3.9   71.0 ± 3.0
Multilayer Perceptron   60.2 ± 12.5   67.9 ± 5.2   66.6 ± 3.7   75.0 ± 4.5
SVM (radial, poly)      70.5 ± 5.9    57.0 ± 4.7   64.0 ± 5.5   70.1 ± 4.0
Linear SVM              69.1 ± 6.7    68.2 ± 7.5   67.4 ± 4.3   71.3 ± 3.9
Naive Bayes             60.3 ± 6.0    59.8 ± 8.6   65.2 ± 5.8   58.3 ± 3.7
Decision Tree           55.5 ± 4.9    54.3 ± 5.1   56.0 ± 5.6   55.2 ± 3.3
Random Forest           60.1 ± 3.4    62.3 ± 5.7   65.6 ± 3.9   63.3 ± 2.3
Nearest Neighbors       62.7 ± 3.5    58.6 ± 6.2   65.1 ± 3.8   60.3 ± 3.5

Results: Classification
[Figure: AUC (axis from 50 to 80) for Raw, ICA, PCA, and Generator data, grouped by classifier family (non-parametric, linear, non-linear); methods shown: Decision Tree, Linear SVM, Logistic Regression, Naive Bayes, Nearest Neighbors, Random Forest, SVM.]

Results: Non-Parametric Classifiers Size Effect
[Figure: train and test ROC AUC and its standard deviation versus number of generated samples (10^1 to 10^4) for Nearest Neighbors, Decision Tree, and Random Forest.]

Results: Linear Classifiers Size Effect
[Figure: train and test ROC AUC and its standard deviation versus number of generated samples (10^1 to 10^4) for Linear SVM and Logistic Regression.]

Results: Non-Linear Size Effect
[Figure: train and test ROC AUC and its standard deviation versus number of generated samples (10^1 to 10^4) for MLP, Poly SVM, and Naive Bayes.]

Conclusion

Conclusion
• The generator produces reasonable-looking data.
• ANOVA results replicate findings.
• MLP benefits the most from the augmented dataset.
• The augmented dataset yields scores comparable to those on the raw
data.
• The proposed method enables deep-learning methods for
classification of small datasets.
• More components → more likely to find correlated components →
use MVN
• Few components → less likely to find correlated components →
use the rejection sampler

Software

Polyssifier: http://github.com/alvarouc/polyssifier
• Bash:
poly data.npy label.npy --name schizophrenia --concurrency 8
• Python:
from polyssifier import poly, plot
scores, confusions, predictions = poly(data, label, n_folds=8,
concurrency=4)
plot(scores)

MLP: http://github.com/alvarouc/mlp
from mlp import MLP
# scikit-learn < 0.20 exposed this as sklearn.cross_validation
from sklearn.model_selection import cross_val_score
clf = MLP(n_hidden=10, n_deep=3, l1_norm=0, drop=0.1,
verbose=0)
scores = cross_val_score(clf, data, label, cv=5, n_jobs=1,
scoring='roc_auc')

Brain Graphics: http://github.com/alvarouc/brain_utils
from brain_utils import plot_source
plot_source(source, template, np.where(mask), th=th, vmin=th,
vmax=np.max(t), cmap='hot', xyz=xyz)
[Figure: example output, a brain map of the Diagnosis effect with a color bar from 2.0 to 5.2.]

Funding and Acknowledgements
• This project was funded by grants P20GM103472 and
NIH-R01EB005846.
• We gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Tesla K40 GPUs used for this research.

Bibliography
[1] Center for Behavioral Health Statistics and Quality.
Behavioral health trends in the United States: Results from the 2014
National Survey on Drug Use and Health (HHS Publication No. SMA
15-4927, NSDUH Series H-50), 2015.
Retrieved from http://www.samhsa.gov/data/.
