1. University of New Mexico
Data Driven Sample Generator Model with Application to Classification
Supervisor: Dr. Erik Erhardt
Candidate: Alvaro Ulloa
April 15, 2016
3. Introduction
• Machine Learning
◦ Automate decision making
◦ Learn from experience
◦ Generalize data properties from a subset
• Regularization
◦ Weight penalties: L1 (sparsity), L2 (shrinkage)
◦ Weight averaging: Dropout
◦ Weight variation: Noise insertion
• Rely on design and previous knowledge of the data
• Data size: Big and Small
Alvaro Ulloa - Data Driven Sample Generator with App to classification
4. Introduction
Big data
• Large number of samples relative to the number of features
• Crowd sourced
• Cheap to collect
• Images, text, video, and sound
• Generally helps ML methods avoid overfitting
• Expensive to compute
Small data
• Small number of samples relative to the number of features
• Expensive to collect
• ML methods often overfit
• Biomedical data
• Not necessarily expensive to compute
5. Motivation
• Mental Illness
◦ In 2014, there were an estimated 9.8 million adults in the US with severe mental illness. [1]
• Structural MRI
◦ Large number of voxels (∼50,000)
◦ Small number of samples (∼400)
◦ Small data scenario
Need for regularization models to alleviate overfitting effects when investigating SMRI for mental illness
6. Thesis statement
• Augmenting a small dataset artificially may lead to improved
classification scores.
• ML methods may benefit from the induced variability, avoid
overfitting, and improve classification scores.
7. Contributions
• Data-driven sample generation technique
• Optimized rejection sampler
• Enable deep-learning for classification of SMRI data
9. Materials: Non Parametric Classifiers
Nearest Neighbors
• Search for the k closest points and vote
Decision Tree
• Sequence of decision rules based on each feature
Random Forest
• Several decision trees that vote
[Figure: example decision tree that labels coffee as Good or Bad, with splits on age (threshold at 1 year), climate (Tropical, Polar, Mediterranean), and Organic versus Non organic]
10. Materials: Linear Classifiers
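Logistic Regression
$\log \frac{p(y|x)}{1 - p(y|x)} = c + x^T w$
$\min_{w,c} \|w\|_L + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^T w + c)\right) + 1\right)$
Linear SVM
Search for a plane $w^T x + c = 0$
Primal: $\min_{c,w,\zeta} \|w\|_L + C \sum_{i=1}^{n} \zeta_i$ subject to
$y_i (w^T \phi(x_i) + c) \geq 1 - \zeta_i$, $\zeta_i \geq 0$, $i = 1, \ldots, n$
Here $\|w\|_L$ is the chosen penalty norm (L1 or L2), and $\phi$ is the identity map for the linear SVM.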
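The penalized logistic regression objective on this slide can be evaluated directly. The following is a minimal sketch using only the standard library; the function name `logistic_objective`, its arguments, and the toy data are assumptions made for illustration, and labels are taken to be in {-1, +1} as the formula implies.

```python
import math

def logistic_objective(w, c, X, y, C=1.0, norm="l2"):
    """Return ||w|| + C * sum_i log(exp(-y_i (x_i . w + c)) + 1), with y_i in {-1, +1}."""
    if norm == "l1":
        penalty = sum(abs(wj) for wj in w)
    else:
        penalty = math.sqrt(sum(wj * wj for wj in w))
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(a * b for a, b in zip(xi, w)) + c)
        # log(exp(-margin) + 1), rearranged so exp() never overflows
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return penalty + C * loss

# Hypothetical toy data: two samples, two features
X_toy = [[1.0, 2.0], [3.0, 4.0]]
y_toy = [1, -1]
value_at_zero = logistic_objective([0.0, 0.0], 0.0, X_toy, y_toy)
```

With all weights at zero every sample contributes log(2), so the objective is n·log(2); a smaller C weights the penalty more heavily relative to the data term.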
11. Materials: Non-Linear Classifiers
Naive Bayes
• $p(y|x) = \frac{p(y)\,p(x|y)}{p(x)}$
• Assumes Gaussian-distributed features that are independent given the class
Polynomial, Radial SVM
• Polynomial: $K(x, x') = (x^T x' + c)^d$
• Radial: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
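As a small illustration of the two kernels, the sketch below computes them for plain Python lists. The function names and default parameter values are assumptions; the radial kernel uses the standard negative exponent, exp(-||x - x'||^2 / (2σ^2)).

```python
import math

def polynomial_kernel(x, z, c=1.0, d=3):
    """K(x, z) = (x . z + c)^d"""
    return (sum(a * b for a, b in zip(x, z)) + c) ** d

def radial_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

k_poly = polynomial_kernel([1.0, 2.0], [3.0, 4.0], c=1.0, d=2)  # (3 + 8 + 1)^2
k_same = radial_kernel([1.0, 2.0], [1.0, 2.0])                  # identical points
k_far = radial_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)        # squared distance 2
```

The radial kernel equals 1 for identical points and decays toward 0 as the points move apart, with σ controlling how quickly.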
Multilayer Perceptron
• Flexible
• Hard to train
• Each layer improves its ability to fit more complex data
• Highly prone to overfitting
[Figure: multilayer perceptron with an input layer (inputs 1 to 4), three hidden layers, and an output layer with three outputs]
13. Materials: Rejection Sampler
e(x): Envelope function, e(x) = αh(x|θ)
α: Scale
h(x|θ): PDF that is easy to sample from
repeat
    Sample y ∼ h(y|θ)
    Sample u ∼ Uniform(0, e(y))
    if u > f(y) then
        Reject y
    else
        Accept y as a sample from f(x)
    end if
until the desired number of samples is accepted
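The loop above can be sketched in Python with the standard library. The function name `rejection_sample` and its arguments are hypothetical; the Beta(2, 2) target with a Uniform(0, 1) proposal and α = 1.5 is borrowed from the worked example later in the deck. A proposal y is kept when the uniform draw u falls at or below f(y).

```python
import random

def rejection_sample(f, sample_h, h, alpha, n, rng):
    """Draw n samples from density f using the envelope e(x) = alpha * h(x)."""
    accepted = []
    proposals = 0
    while len(accepted) < n:
        y = sample_h(rng)                       # y ~ h
        u = rng.uniform(0.0, alpha * h(y))      # u ~ Uniform(0, e(y))
        proposals += 1
        if u <= f(y):                           # accept; otherwise reject and retry
            accepted.append(y)
    return accepted, proposals

def beta22(x):
    return 6.0 * x * (1.0 - x)                  # Beta(2, 2) density on [0, 1]

rng = random.Random(0)
samples, proposals = rejection_sample(
    beta22, lambda r: r.random(), lambda x: 1.0, 1.5, 2000, rng
)
acceptance_rate = len(samples) / proposals
```

Because both f and h integrate to 1, the long-run acceptance rate is 1/α, which is why a smaller valid α gives a more efficient sampler.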
15. Materials: Rejection Sampler
[Figure: two panels of proposals plotted as u versus y (y from 0 to 16), marked accepted or rejected, with the envelope e(x) and the target f(x); the second panel is titled "More efficient e(x)"]
16. Materials: Optimized Rejection Sampler
[Figure: "More efficient e(x)": accepted and rejected proposals (u versus y) with the envelope e(x) and the target f(x)]
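$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \int \left(\alpha h(x|\theta) - f(x)\right) dx$, s.t. $e(x) - f(x) \geq 0$, $\forall x \in \mathbb{R}$
Since $h(\cdot)$ and $f(\cdot)$ are PDFs, each integrates to 1, so the objective equals $\alpha - 1$ and the problem reduces to
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - f(x) \geq 0$, $\forall x \in \mathrm{Domain}\{f\}$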
17. Materials: Optimized Rejection Sampler
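Let $f(x) = \mathrm{Beta}(2, 2) = 6x(1 - x)$, $x \in [0, 1]$, and
$h(x|\theta) = \mathrm{Uniform}(0, \theta) = \begin{cases} 1/\theta, & \text{if } 0 \leq x \leq \theta \\ 0, & \text{otherwise} \end{cases}$
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - 6x(1 - x) \geq 0$, $\forall x \in [0, 1]$
For $\theta < 1$ there is no solution, so $\theta \geq 1$ is required for the constraint to hold. Then
$\frac{\alpha}{\theta} \geq 6x(1 - x)\ \forall x \in [0, 1] \;\rightarrow\; \frac{\alpha}{\theta} \geq 1.5$, since $6x(1 - x)$ peaks at 1.5 when $x = 0.5$.
$\hat{\theta}, \hat{\alpha} = \arg\min_{\theta,\alpha} \alpha$, s.t. $\frac{\alpha}{\theta} \geq 1.5$ and $\theta \geq 1$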
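18. Using Lagrange multipliers, with
$L(\alpha, \theta, \lambda, \gamma) = \alpha - \lambda\left(\frac{\alpha}{\theta} - 1.5\right) - \gamma(\theta - 1)$, we solve
$\frac{\partial L}{\partial \alpha} = 1 - \frac{\lambda}{\theta} = 0$
$\frac{\partial L}{\partial \theta} = \lambda \frac{\alpha}{\theta^2} - \gamma = 0$
$\frac{\partial L}{\partial \lambda} = -\left(\frac{\alpha}{\theta} - 1.5\right) = 0$
$\frac{\partial L}{\partial \gamma} = -(\theta - 1) = 0$
Then the solution is $\theta = 1$ and $\alpha = 1.5$ (with $\lambda = 1$ and $\gamma = 1.5$), which results in the optimal
$e(x) = 1.5\,\mathrm{Uniform}(0, 1)$
This agrees with the maximum value of the Beta(2, 2) density, which is 1.5.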
[Figure: "Optimal e(x)": the envelope e(x) and the target f(x) plotted for x in [0, 1]]
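The closed-form answer can be cross-checked numerically. The sketch below is a brute-force illustration rather than the optimizer used in the talk: for each candidate θ on a hypothetical grid it computes the smallest α that keeps the Uniform(0, θ) envelope above Beta(2, 2), then picks the θ with the smallest α. All names are assumptions.

```python
def beta22(x):
    return 6.0 * x * (1.0 - x)          # Beta(2, 2) density on [0, 1]

# Peak of f on a grid that contains x = 0.5 exactly
grid = [i / 1000.0 for i in range(1001)]
f_max = max(beta22(x) for x in grid)

def min_alpha(theta):
    """Smallest alpha such that alpha / theta >= f(x) for all x in [0, 1]."""
    if theta < 1.0:
        return float("inf")             # envelope is 0 on (theta, 1], so it cannot cover f
    return theta * f_max

thetas = [k / 20.0 for k in range(10, 41)]   # candidate theta values from 0.5 to 2.0
best_theta = min(thetas, key=min_alpha)
best_alpha = min_alpha(best_theta)
expected_acceptance = 1.0 / best_alpha
```

The search lands on θ = 1 and α = 1.5, matching the Lagrangian solution, and the implied acceptance probability is 1/α = 2/3.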
19. Materials: Multivariate Normal
• Compute the sample mean and sample covariance matrix
• Generate samples with the same mean and covariance
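The two steps above might look like the following with NumPy. This is a minimal sketch under assumptions: the function name `mvn_generate`, the row-per-sample layout, and the toy data are all hypothetical, and the generated samples match the input mean and covariance only approximately, up to sampling error.

```python
import numpy as np

def mvn_generate(X, n_new, seed=0):
    """Draw n_new rows from a normal with the sample mean and covariance of X (rows are samples)."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)                 # sample mean
    cov = np.cov(X, rowvar=False)         # sample covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_new)

# Hypothetical toy data: 200 correlated samples with 3 features
toy_rng = np.random.default_rng(1)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
X = toy_rng.normal(size=(200, 3)) @ mixing + np.array([1.0, -2.0, 0.5])
X_new = mvn_generate(X, 5000)
```

A multivariate normal reproduces only first and second moments, so any non-Gaussian structure in the original samples is lost.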
21. Materials: PCA
• Introduced by Hotelling in 1933, still widely used.
• Algebraically: linear combinations of X.
• Geometrically: coordinate system rotation.
• $E[XX^T] = U \Lambda U^T$, $S = \Lambda^{-1/2} U^T X$, and $A = U \Lambda^{1/2}$
[Figure: three scatter plots on axes from -3 to 3: True Sources (S) on s1, s2; Mixed Sources (X) on x1, x2; Estimated Sources (Ŝ) on s1, s2]
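The decomposition E[XX^T] = UΛU^T, S = Λ^(-1/2) U^T X, A = UΛ^(1/2) can be checked numerically. The sketch below assumes X is stored as features by samples with zero mean and a full-rank covariance; the function name `pca_decompose` and the toy mixing matrix are hypothetical.

```python
import numpy as np

def pca_decompose(X):
    """X has shape (features, samples). Returns sources S and mixing matrix A with X = A @ S."""
    n = X.shape[1]
    cov = X @ X.T / n                          # sample estimate of E[X X^T]
    eigvals, U = np.linalg.eigh(cov)           # cov = U diag(eigvals) U^T
    S = np.diag(eigvals ** -0.5) @ U.T @ X     # whitened sources
    A = U @ np.diag(eigvals ** 0.5)            # mixing (loading) matrix
    return S, A

# Hypothetical example: two uniform sources mixed linearly
rng = np.random.default_rng(0)
sources = rng.uniform(-1.0, 1.0, size=(2, 1000))
X = np.array([[1.0, 0.5], [0.3, 1.0]]) @ sources
S, A = pca_decompose(X)
```

By construction A @ S reproduces X and the estimated sources have identity covariance, though they are only uncorrelated and may still be a rotation of the true sources.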
22. Materials: ICA
• Introduced by Herault et al. in 1983 as an extension of PCA.
• ICA searches for independence
• Assumes independent sources, of which no more than one is Gaussian distributed
[Figure: four scatter plots on axes from -3 to 3: True sources (S) on s1, s2; Mixed sources (X) on x1, x2; ICA estimate (Ŝ) on s1, s2; PCA estimate (Ŝ) on s1, s2]
23. Materials: Infomax ICA
• Joint entropy: $H(x) = -\int f(x) \log f(x)\,dx$, where $f(x)$ is a joint PDF.
• Mutual information: $I(x) = -H(g(x)) + E\left[\sum_i \log \frac{|g_i'(x_i)|}{f_i(x_i)}\right]$, where $g(x) = \frac{1}{1 + \exp(-x)}$.
• Infomax: $W = \arg\max_W H(g(WX))$, where $W = A^{-1}$.
24. Proposed method
Data Driven Sample
Generator
25. Data Driven Sample Generator
• Generate augmented datasets for ML methods to train on.
• ML methods tend to fail on datasets that are rich in features but short on samples.
• Two assumptions:
◦ The input dataset is reducible, i.e., the reconstruction error from matrix factorization is minimal.
◦ A group of samples with a common diagnosis shares statistical properties that are reflected in their loading coefficients (A).
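One way to read the two assumptions above is as a three-step recipe: factorize the data, model the loading coefficients of each diagnostic group, and map newly sampled loadings back to feature space. The sketch below is an illustration under stated assumptions, not the implementation from the talk: it uses an SVD-based PCA for the factorization and a multivariate normal for the loadings, and the function name `generate_samples`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def generate_samples(X, y, n_components, n_new_per_class, seed=0):
    """Hypothetical sketch: sample new loadings per class and map them back to feature space."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    # Factorize the centered data: X - mean ~= A @ S (A: loadings, S: components)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]
    S = Vt[:n_components]
    X_new, y_new = [], []
    for label in np.unique(y):
        A_c = A[y == label]
        # Assumption: one class's loadings are summarized by their mean and covariance
        cov = np.atleast_2d(np.cov(A_c, rowvar=False))
        A_gen = rng.multivariate_normal(A_c.mean(axis=0), cov, size=n_new_per_class)
        X_new.append(A_gen @ S + mean)      # reconstruct synthetic samples
        y_new.append(np.full(n_new_per_class, label))
    return np.vstack(X_new), np.concatenate(y_new)

# Hypothetical toy data: 40 samples, 100 features, class 1 shifted upward
toy_rng = np.random.default_rng(2)
y = np.repeat([0, 1], 20)
X = toy_rng.normal(size=(40, 100))
X[y == 1] += 1.0
X_aug, y_aug = generate_samples(X, y, n_components=5, n_new_per_class=100)
```

The augmented set can be made much larger than the original while staying in the span of the retained components, which is where the reducibility assumption matters.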
26. Data Driven Sample Generator
Block diagram
27. Data Driven Sample Generator
Classification framework
28. Case Study
Case Study: Schizophrenia
29. Case Study: Dataset
| | Patient | Control | Total |
| --- | --- | --- | --- |
| Male | 121 | 97 | 218 |
| Female | 77 | 94 | 171 |
| Age | 39.68 ± 12.12 | 40.26 ± 15.02 | |
| Total | 198 | 191 | 389 |
30. Case Study: ANOVA
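Age Grouping
| Age | Healthy Male | Healthy Female | Patient Male | Patient Female | Total |
| --- | --- | --- | --- | --- | --- |
| Young (16-33) | 39 | 35 | 37 | 19 | 130 |
| Adult (34-43) | 27 | 25 | 51 | 25 | 128 |
| Senior (44-81) | 31 | 34 | 33 | 33 | 131 |
| Total | 97 | 94 | 121 | 77 | 389 |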
31. Case Study: ANOVA
Full Model:
GMC = µ + Diagnosis + Age + Gender + Diagnosis ∗ Age + Diagnosis ∗ Gender + Age ∗ Gender + Diagnosis ∗ Age ∗ Gender + ε
Reduced Model:
• Check the significance level of the three-way interaction (age-gender-diagnosis).
• If the three-way interaction is not significant, conduct a model comparison test (generalized linear F-test) to assess the reduced model.
• If the test supports the reduced model, adopt it.
• Repeat with the least significant two-way interaction.
32. Case Study: Classification Framework
| Method | Parameter | Values |
| --- | --- | --- |
| Nearest Neighbors | Number of neighbors | [1, 5, 10, 20] |
| Decision Tree | Maximum number of features | 'auto' |
| Random Forest | Number of estimators | [5...20] |
| Naive Bayes | Kernel | Gaussian |
| Logistic Regression | C | [0.001, 0.1, 1] |
| Support Vector Machines | Kernel | [radial, polynomial] |
| | C | [0.01, 0.1, 1] |
| Linear SVM | C | [0.01, 0.1, 1] |
| | Penalty | ['L1', 'L2'] |
| Multilayer Perceptron | Depth | [3, 4, 5] |
| | Number of hidden units | [50, 100, 200] |
36. Results: ANOVA
37. Results: ANOVA
Three-way ANOVA group means (×10⁻²) for the main effects on the schizophrenia dataset.
Effect: Diagnosis
| Brain Region | Control | Patient |
| --- | --- | --- |
| Right Superior Temporal Gyrus | 57.7 | 52.9 |
| Left Superior Temporal Gyrus | 60.2 | 55.4 |
| Superior Frontal Gyrus | 39.3 | 35.7 |
Effect: Age
| Brain Region | Young | Adult | Senior |
| --- | --- | --- | --- |
| Left Thalamus | 30.8 | 33.7 | 35.9 |
| Right Thalamus | 34.1 | 37.5 | 39.8 |
| Right Parahippocampal Gyrus | 68.3 | 73.7∗ | 73.3∗ |
| Left Parahippocampal Gyrus | 66.8 | 71.3∗ | 71.7∗ |
Effect: Gender
None
∗ Not statistically different.
38. Results: ANOVA
Three-way ANOVA group means (×10⁻²) for the effects of interactions on the schizophrenia dataset.
Effect: Gender-Diagnosis, Right Fusiform Gyrus
| | Control | Patient |
| --- | --- | --- |
| Male | 27.7 (a) | 29.7 (b) |
| Female | 29.4 (a,b) | 28.1 (a,b) |
Effect: Age-Diagnosis, Right Inferior Parietal Lobule
| | Young | Adult | Senior |
| --- | --- | --- | --- |
| Control | 54.5 (c) | 53.0 (b,c) | 47.1 (a) |
| Patient | 47.2 (a) | 50.7 (a,b) | 48.8 (a) |
Effect: Age-Diagnosis, Left Inferior Parietal Lobule
| | Young | Adult | Senior |
| --- | --- | --- | --- |
| Control | 51.9 (b,c) | 52.2 (c) | 47.1 (a) |
| Patient | 45.4 (a) | 50.2 (a,b) | 48.3 (a,b) |
Effect: Age-Gender-Diagnosis, Left Precuneus
| Senior Female Patient | Others |
| --- | --- |
| 43.0 | 47.1 |
40. Results: Classification
[Figure: AUC (axis from 50 to 80) for Raw, ICA, PCA, and Generator inputs in three panels (Non parametric, Linear, Non Linear). Legend, Classification Method: Decision Tree, Linear SVM, Logistic Regression, Multilayer Perceptron, Naive Bayes, Nearest Neighbors, Random Forest, SVM]
41. Results: Non-parametric Classifiers Size Effect
[Figure: train and test ROC AUC (axis from 50 to 100) and its standard deviation (axis from 0 to 10) versus the number of generated samples (10^1 to 10^4) for Nearest Neighbors, Decision Tree, and Random Forest]
42. Results: Linear Classifiers Size Effect
[Figure: train and test ROC AUC (axis from 50 to 100) and its standard deviation (axis from 0 to 10) versus the number of generated samples (10^1 to 10^4) for Linear SVM and Logistic Regression]
43. Results: Non-Linear Size Effect
[Figure: train and test ROC AUC (axis from 50 to 100) and its standard deviation (axis from 0 to 10) versus the number of generated samples (10^1 to 10^4) for MLP, Poly SVM, and Naive Bayes]
45. Conclusion
• The generator provides reasonable-looking data.
• ANOVA results replicate findings.
• MLP benefits the most from the augmented dataset.
• The augmented dataset yields scores comparable to those obtained on raw data.
• The proposed method enables deep-learning methods for classification of small datasets.
• More components → more likely to find correlated components → use MVN
• Few components → less likely to find correlated components → use the rejection sampler
47. Polyssifier: http://github.com/alvarouc/polyssifier
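• Bash:
poly data.npy label.npy --name schizophrenia --concurrency 8
• Python:
import numpy as np
from polyssifier import poly, plot
# Load the same arrays that are passed to the command line above
data = np.load('data.npy')
label = np.load('label.npy')
scores, confusions, predictions = poly(data, label, n_folds=8, concurrency=4)
plot(scores)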
48. MLP: http://github.com/alvarouc/mlp
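from mlp import MLP
# cross_val_score lived in sklearn.cross_validation before scikit-learn 0.18
from sklearn.model_selection import cross_val_score
# data and label are the feature matrix and the class labels
clf = MLP(n_hidden=10, n_deep=3, l1_norm=0, drop=0.1, verbose=0)
scores = cross_val_score(clf, data, label, cv=5, n_jobs=1, scoring='roc_auc')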
49. Brain Graphics: http://github.com/alvarouc/brain_utils
import numpy as np
from brain_utils import plot_source
# source, template, mask, th, t, and xyz are supplied by the caller
plot_source(source, template, np.where(mask), th=th, vmin=th, vmax=np.max(t), cmap='hot', xyz=xyz)
[Figure: brain map for the Diagnosis effect with a color scale from 2.0 to 5.2]
50. Funding and Acknowledgements
• This project was funded by grants P20GM103472 and NIH-R01EB005846.
• We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPUs used for this research.
51. Bibliography
[1] Center for Behavioral Health Statistics and Quality. Behavioral Health Trends in the United States: Results from the 2014 National Survey on Drug Use and Health (HHS Publication No. SMA 15-4927, NSDUH Series H-50), 2015. Retrieved from http://www.samhsa.gov/data/.