University of New Mexico
Data Driven Sample Generator Model with Application to
Classification
Supervisor
Dr. Erik Erhardt
Candidate
Alvaro Ulloa
April 15, 2016
Outline
Introduction
Motivation
Thesis statement
Contributions
Materials
Machine Learning methods
Random Variable Samplers
Matrix Factorization
Data Driven Sample Generator
Case Study
Results
Conclusion
2 of 51
Alvaro Ulloa - Data Driven Sample Generator with App to classification
Introduction
• Machine Learning
◦ Automate decision making
◦ Learn from experience
◦ Generalize data properties from a
subset
• Regularization
◦ Weight sparsity: L1, L2
◦ Weight averaging: Dropout
◦ Weight variation: Noise insertion.
• Rely on design and previous
knowledge of the data
• Data size: Big and Small
Introduction
Big data
• Large number of samples relative to the number of features.
• Crowd-sourced.
• Cheap to collect.
• Images, text, video, and sound.
• Generally helps ML methods avoid overfitting.
• Expensive to compute.
Small data
• Small number of samples relative to the number of features.
• Expensive to collect.
• ML methods often overfit it.
• Biomedical data.
• Not necessarily expensive to compute.
Motivation
• Mental Illness
◦ In 2014, there were an estimated 9.8
million adults in the US with severe
mental illness. [1]
• Structural MRI (SMRI)
◦ Large number of voxels (∼50,000)
◦ Small number of samples (∼400)
◦ Small data scenario
Regularization models are needed to alleviate overfitting when
investigating SMRI for mental illness.
Thesis statement
• Augmenting a small dataset artificially may lead to improved
classification scores.
• ML methods may benefit from the induced variability, avoid
overfitting, and improve classification scores.
Contributions
• Data-driven sample generation technique
• Optimized rejection sampler
• Enable deep-learning for classification of SMRI data
Materials
Machine Learning Methods
Materials: Non Parametric Classifiers
Nearest Neighbors
• Search for the k-closest points
and vote
Decision Tree
• Sequence of decision rules
based on each feature
Random Forest
• Several decision trees that vote
[Figure: example decision tree that rates coffee as Good or Bad; branch labels include ≤ 1 year / > 1 year, Tropical / Polar / Mediterranean, and Organic / Non organic.]
Materials: Linear Classifiers
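Logistic Regression
$$\log \frac{p(y \mid x)}{1 - p(y \mid x)} = c + x^T w$$
$$\min_{w,c} \; \|w\|_L + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^T w + c)\right) + 1\right)$$
Linear SVM
Search for a plane $w^T x + c = 0$.
Primal:
$$\min_{c,w,\zeta} \; \|w\|_L + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + c) \ge 1 - \zeta_i, \; \zeta_i \ge 0, \; i = 1, \dots, n$$
Here $\|\cdot\|_L$ is the L1 or L2 norm, and $\phi(x_i) = x_i$ in the linear case.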
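The logistic regression objective above can be evaluated directly. The following is a minimal sketch, assuming labels coded as −1/+1 and NumPy; the function name `logistic_objective` and its arguments are illustrative, not part of the thesis software.

```python
import numpy as np

def logistic_objective(w, c, X, y, C=1.0, norm=2):
    """Return ||w||_L + C * sum_i log(exp(-y_i (x_i^T w + c)) + 1).

    X has one sample per row, y holds labels in {-1, +1},
    and norm selects the L1 (norm=1) or L2 (norm=2) penalty.
    """
    margins = y * (X @ w + c)
    penalty = np.linalg.norm(w, ord=norm)
    # logaddexp(0, -m) equals log(1 + exp(-m)) and avoids overflow
    return penalty + C * np.sum(np.logaddexp(0.0, -margins))

# Two samples, zero weights: every margin is 0, so the loss is 2 * log(2)
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([1.0, -1.0])
value_at_zero = logistic_objective(np.zeros(2), 0.0, X_demo, y_demo)
```

A larger C weights the data-fit term more heavily relative to the penalty, which is the trade-off the parameter grid explores.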
Materials: Non-Linear Classifiers
Naive Bayes
• $p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)}$
• Assumes Gaussian distribution and independence
Polynomial, Radial SVM
• Polynomial: $K(x, x') = (x^T x' + c)^d$
• Radial: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
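The two kernels can be written as plain functions. This is a small sketch assuming NumPy; the default values for c, d, and σ are placeholders chosen for illustration, not values from the study.

```python
import numpy as np

def polynomial_kernel(x, x2, c=1.0, d=3):
    """K(x, x') = (x^T x' + c)^d"""
    return (np.dot(x, x2) + c) ** d

def radial_kernel(x, x2, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    diff = np.asarray(x, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
k_poly = polynomial_kernel(u, v, c=1.0, d=2)   # (11 + 1)^2
k_rbf_same = radial_kernel(u, u)               # identical points give 1
```

The radial kernel decays toward 0 as the points move apart, with σ controlling how quickly.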
Multilayer Perceptron
• Flexible
• Hard to train
• Each layer improves its ability to
fit more complex data
• Highly prone to overfitting
[Figure: multilayer perceptron with an input layer of four inputs, three hidden layers, and an output layer of three outputs.]
Materials
Random Variable Samplers
Materials: Rejection Sampler
e(x): Envelope function, e(x) = αh(x|θ)
α: Scale
h(x|θ): PDF that is easy to sample from
repeat
    Sample y ∼ h(y|θ)
    Sample u ∼ Uniform(0, e(y))
    if u > f(y) then
        Reject y
    else
        Accept y as a sample from f(x)
    end if
until the desired number of samples is accepted
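The loop above translates almost line by line into Python. This is a minimal sketch assuming NumPy; the function and argument names are illustrative, and the example target (Beta(2, 2) with a Uniform(0, 1) proposal and α = 1.5) is chosen only to make the snippet self-contained.

```python
import numpy as np

def rejection_sample(f, sample_h, pdf_h, alpha, n_samples, rng=None):
    """Draw n_samples from density f using the envelope e(x) = alpha * pdf_h(x)."""
    rng = np.random.default_rng() if rng is None else rng
    accepted = []
    while len(accepted) < n_samples:
        y = sample_h(rng)                        # y ~ h
        u = rng.uniform(0.0, alpha * pdf_h(y))   # u ~ Uniform(0, e(y))
        if u <= f(y):                            # keep y when u falls under f
            accepted.append(y)
    return np.array(accepted)

def beta22(x):
    """Density of Beta(2, 2) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

samples = rejection_sample(
    f=beta22,
    sample_h=lambda rng: rng.uniform(0.0, 1.0),
    pdf_h=lambda x: 1.0,
    alpha=1.5,
    n_samples=2000,
    rng=np.random.default_rng(0),
)
```

On average a fraction 1/α of the proposals is accepted, so a tighter envelope wastes fewer draws.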
Materials: Rejection Sampler
[Figure: left, the target density $f(x) = \exp\left(-\frac{(x-1)^2}{2x}\right) \frac{x+1}{12}$ for x from 0 to 16; right, histogram of generated samples overlaid with f(x).]
Materials: Rejection Sampler
[Figure: accepted and rejected (y, u) pairs under two envelopes e(x) around f(x); the second panel, titled "More efficient e(x)", uses a tighter envelope.]
Materials: Optimized Rejection Sampler
[Figure: accepted and rejected (y, u) pairs under a more efficient envelope e(x) around f(x).]
$$\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \int (\alpha h(x \mid \theta) - f(x))\, dx, \quad \text{s.t. } e(x) - f(x) \ge 0, \; \forall x \in \mathbb{R}$$
Since h(·) and f(·) are PDFs, each integrates to 1, so the objective equals α − 1 and the problem reduces to
$$\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha, \quad \text{s.t. } \alpha h(x \mid \theta) - f(x) \ge 0, \; \forall x \in \mathrm{Domain}\{f\}$$
Materials: Optimized Rejection Sampler
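Let $f(x) = \mathrm{Beta}(2, 2) = 6x(1 - x)$, $x \in [0, 1]$, and
$$h(x \mid \theta) = \mathrm{Uniform}(0, \theta) = \begin{cases} 1/\theta, & \text{if } 0 \le x \le \theta \\ 0, & \text{otherwise} \end{cases}$$
$$\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha, \quad \text{s.t. } \alpha h(x \mid \theta) - 6x(1 - x) \ge 0, \; \forall x \in [0, 1]$$
For θ < 1 there is no solution, because h(x|θ) = 0 on (θ, 1) where f(x) > 0. Thus θ ≥ 1 for the constraint to hold. Since 6x(1 − x) attains its maximum of 1.5 at x = 0.5,
$$\frac{\alpha}{\theta} \ge 6x(1 - x) \;\; \forall x \in [0, 1] \;\Rightarrow\; \frac{\alpha}{\theta} \ge 1.5.$$
$$\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha, \quad \text{s.t. } \frac{\alpha}{\theta} \ge 1.5 \text{ and } \theta \ge 1$$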
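Using Lagrange multipliers,
$$\mathcal{L}(\alpha, \theta, \lambda, \gamma) = \alpha - \lambda\left(\frac{\alpha}{\theta} - 1.5\right) - \gamma(\theta - 1),$$
we solve
$$\frac{\partial \mathcal{L}}{\partial \alpha} = 1 - \frac{\lambda}{\theta} = 0$$
$$\frac{\partial \mathcal{L}}{\partial \theta} = \lambda \frac{\alpha}{\theta^2} - \gamma = 0$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = -\left(\frac{\alpha}{\theta} - 1.5\right) = 0$$
$$\frac{\partial \mathcal{L}}{\partial \gamma} = -(\theta - 1) = 0$$
Then the solution is θ = 1 and α = 1.5, which results in the optimal envelope e(x) = 1.5 Uniform(0, 1).
This is correct since the maximum value of Beta(2, 2) is 1.5.
[Figure: f(x) = Beta(2, 2) and the optimal envelope e(x) on [0, 1].]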
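The Beta(2, 2) result can be checked numerically. The sketch below assumes NumPy and a grid search in place of the analytic derivation; `smallest_alpha` is a hypothetical helper that returns the smallest α satisfying αh(x|θ) ≥ f(x) on a grid over [0, θ].

```python
import numpy as np

def beta22(x):
    """Density of Beta(2, 2), zero outside [0, 1]."""
    return np.where((x >= 0.0) & (x <= 1.0), 6.0 * x * (1.0 - x), 0.0)

def smallest_alpha(f, theta, grid_size=10001):
    """Smallest alpha with alpha * Uniform(0, theta).pdf(x) >= f(x) on the grid."""
    x = np.linspace(0.0, theta, grid_size)
    h = 1.0 / theta                  # Uniform(0, theta) density on [0, theta]
    return np.max(f(x)) / h

# theta >= 1 is required so that the envelope covers the support of f
alphas = {theta: smallest_alpha(beta22, theta) for theta in (1.0, 1.5, 2.0)}
best_theta = min(alphas, key=alphas.get)
```

Among the candidate values tried here, θ = 1 gives the smallest α, in agreement with the closed-form solution.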
Materials: Multivariate Normal
• Compute the sample mean and sample covariance matrix
• Generate samples with the same mean and covariance
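The two steps can be sketched with NumPy as follows. The function name `sample_like` and the toy data are assumptions made for illustration.

```python
import numpy as np

def sample_like(data, n_new, rng=None):
    """Draw n_new rows from a multivariate normal fitted to data (rows are samples)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = data.mean(axis=0)               # sample mean
    cov = np.cov(data, rowvar=False)       # sample covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_new)

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
observed = rng.normal(size=(200, 3)) @ mixing   # correlated toy data
generated = sample_like(observed, n_new=5000, rng=rng)
```

The generated rows match the observed mean and covariance up to sampling error, but any non-Gaussian structure in the data is not reproduced.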
Materials
Matrix Factorization
X = AS
Materials: PCA
• Introduced by Hotelling in 1933, still widely used.
• Algebraically: linear combinations of X.
• Geometrically: coordinate system rotation.
• $E[XX^T] = U \Lambda U^T$, $S = \Lambda^{-1/2} U^T X$, and $A = U \Lambda^{1/2}$
[Figure: scatter plots of the true sources S (s1 vs. s2), the mixed sources X (x1 vs. x2), and the PCA-estimated sources Ŝ.]
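The PCA factorization X = AS can be sketched with an eigendecomposition. This example assumes NumPy, a features-by-samples matrix X with centered rows, and a full-rank covariance; `pca_factorize` is an illustrative name.

```python
import numpy as np

def pca_factorize(X):
    """Return A, S with X = A S, S = Lambda^{-1/2} U^T X and A = U Lambda^{1/2}."""
    n = X.shape[1]
    cov = X @ X.T / n                     # sample estimate of E[X X^T]
    eigvals, U = np.linalg.eigh(cov)      # cov = U diag(eigvals) U^T
    S = np.diag(eigvals ** -0.5) @ U.T @ X
    A = U @ np.diag(eigvals ** 0.5)
    return A, S

rng = np.random.default_rng(0)
sources = rng.uniform(-1.0, 1.0, size=(2, 500))
X = np.array([[2.0, 1.0], [1.0, 3.0]]) @ sources   # mixed sources
X = X - X.mean(axis=1, keepdims=True)              # center each row
A, S = pca_factorize(X)
```

The estimated sources are uncorrelated with unit variance, which fixes them only up to a rotation; that remaining ambiguity is what ICA addresses.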
Materials: ICA
• Introduced by Herault et al. in 1983 as an extension of PCA.
• ICA searches for independence.
• Assumes independent sources, with no more than one Gaussian-distributed source.
[Figure: scatter plots of the true sources S, the mixed sources X, the ICA estimate Ŝ, and the PCA estimate Ŝ.]
Materials: Infomax ICA
• Joint entropy: $H(x) = -\int f(x) \log f(x)\, dx$, where f(x) is a joint PDF.
• Mutual information: $I(x) = -H(g(x)) + E\left[\sum_i \log \frac{|g_i'(x_i)|}{f_i(x_i)}\right]$, where $g(x) = \frac{1}{1 + \exp(-x)}$.
• Infomax: $W = \arg\max_W H(g(WX))$, where $W = A^{-1}$.
Proposed method
Data Driven Sample
Generator
Data Driven Sample Generator
• Generate augmented datasets for ML methods to train on.
• ML methods tend to fail on datasets that are rich in features but short on
samples.
• Two assumptions:
◦ The input dataset is reducible, i.e., the reconstruction error from matrix
factorization is minimal.
◦ A group of samples with a common diagnosis shares statistical
properties that are reflected in their loading coefficients (A).
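One way to read the two assumptions is sketched below. It is an illustration under stated assumptions, not the thesis implementation: it assumes NumPy, uses a truncated SVD as the matrix factorization, and models each group's loading coefficients with a multivariate normal. The name `generate_samples` and all parameters are hypothetical.

```python
import numpy as np

def generate_samples(X, labels, n_components, n_new_per_class, rng=None):
    """Factorize X ~ A S, model each class's loadings A, sample, and reconstruct.

    X has one sample per row. Returns the generated data and its labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    mean = X.mean(axis=0)
    # Truncated SVD stands in for the factorization step (assumption)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]   # loadings, one row per sample
    S = Vt[:n_components]                        # components shared by all samples
    X_new, y_new = [], []
    for label in np.unique(labels):
        A_c = A[labels == label]
        mu = A_c.mean(axis=0)
        cov = np.atleast_2d(np.cov(A_c, rowvar=False))
        A_new = rng.multivariate_normal(mu, cov, size=n_new_per_class)
        X_new.append(A_new @ S + mean)           # map new loadings back to feature space
        y_new.append(np.full(n_new_per_class, label))
    return np.vstack(X_new), np.concatenate(y_new)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 30)) + 2.0 * labels[:, None]   # toy data, shifted by class
X_new, y_new = generate_samples(X, labels, n_components=5, n_new_per_class=100, rng=rng)
```

The generated rows can then be used to train a classifier while the original rows are held out for testing, so that scores reflect real data only.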
Data Driven Sample Generator
[Figure: block diagram of the Data Driven Sample Generator.]
Data Driven Sample Generator
[Figure: classification framework.]
Case Study
Case Study: Schizophrenia
Case Study: Dataset
          Patient          Control          Total
Male      121              97               218
Female    77               94               171
Total     198              191              389
Age       39.68 ± 12.12    40.26 ± 15.02
Case Study: ANOVA
Age Grouping
                  Healthy           Patient
Age               Male    Female    Male    Female    Total
Young (16-33)     39      35        37      19        130
Adult (34-43)     27      25        51      25        128
Senior (44-81)    31      34        33      33        131
Total             97      94        121     77        389
Case Study: ANOVA
Full Model:
GMC = µ + Diagnosis + Age + Gender + Diagnosis ∗ Age +
Diagnosis ∗ Gender + Age ∗ Gender + Diagnosis ∗ Age ∗ Gender + ε
Reduced Model:
• Check the significance level of the three-way interaction
(age-gender-diagnosis).
• If the three-way interaction is not significant, conduct a model
comparison test (generalized linear F-test) to assess the reduced
model.
• If the test supports the reduced model, adopt it.
• Repeat with the least significant two-way interaction.
Case Study: Classification Framework
Method                     Parameter                     Values
Nearest Neighbors          Number of neighbors           [1, 5, 10, 20]
Decision Tree              Maximum number of features    'auto'
Random Forest              Number of estimators          [5...20]
Naive Bayes                Kernel                        Gaussian
Logistic Regression        C                             [0.001, 0.1, 1]
Support Vector Machines    Kernel                        [radial, polynomial]
                           C                             [0.01, 0.1, 1]
Linear SVM                 C                             [0.01, 0.1, 1]
                           Penalty                       ['L1', 'L2']
Multilayer Perceptron      Depth                         [3, 4, 5]
                           Number of hidden units        [50, 100, 200]
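A parameter grid like the one above can be searched with scikit-learn. The sketch below is an illustration on synthetic data, covering only two rows of the table; the use of `GridSearchCV`, 5-fold cross-validation, and ROC AUC scoring are assumptions, not a description of the thesis pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

searches = {
    "Nearest Neighbors": GridSearchCV(
        KNeighborsClassifier(),
        {"n_neighbors": [1, 5, 10, 20]},
        scoring="roc_auc",
        cv=5,
    ),
    "Logistic Regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.001, 0.1, 1]},
        scoring="roc_auc",
        cv=5,
    ),
}

# Fit each search and record the best setting and its cross-validated AUC
best_params = {name: search.fit(X, y).best_params_ for name, search in searches.items()}
best_scores = {name: search.best_score_ for name, search in searches.items()}
```

Selecting parameters inside cross-validation keeps the tuning from seeing the held-out fold used for each score.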
Results
Results: Generator sample
[Figure: a real sample shown alongside two candidate images under the question "Which is real?"; all panels share the same axes and color scale.]
Results: ANOVA
[Figure: maps of the Diagnosis, Age, Gender, Gender-Diagnosis, Age-Diagnosis, and Age-Gender-Diagnosis effects, each with its own color scale.]
Results: ANOVA
Three-way ANOVA group means for the main effects on the schizophrenia dataset.
Effect       Brain Region                     Group Means (×10⁻²)
Diagnosis                                     Control    Patient
             Right Superior Temporal Gyrus    57.7       52.9
             Left Superior Temporal Gyrus     60.2       55.4
             Superior Frontal Gyrus           39.3       35.7
Age                                           Young      Adult      Senior
             Left Thalamus                    30.8       33.7       35.9
             Right Thalamus                   34.1       37.5       39.8
             Right Parahippocampal Gyrus      68.3       73.7∗      73.3∗
             Left Parahippocampal Gyrus       66.8       71.3∗      71.7∗
Gender       None
∗ Not statistically different.
Results: ANOVA
Three-way ANOVA group means (×10⁻²) for the effects of interactions on the
schizophrenia dataset.
Gender-Diagnosis: Right Fusiform Gyrus
           Control      Patient
Male       27.7(a)      29.7(b)
Female     29.4(a,b)    28.1(a,b)
Age-Diagnosis: Right Inferior Parietal Lobule
           Young        Adult        Senior
Control    54.5(c)      53.0(b,c)    47.1(a)
Patient    47.2(a)      50.7(a,b)    48.8(a)
Age-Diagnosis: Left Inferior Parietal Lobule
           Young        Adult        Senior
Control    51.9(b,c)    52.2(c)      47.1(a)
Patient    45.4(a)      50.2(a,b)    48.3(a,b)
Age-Gender-Diagnosis: Left Precuneus
Senior Female Patient    Others
43.0                     47.1
Results: Classification
Method                   Raw            ICA           PCA           Augmented
Logistic Regression      72.1 ± 3.5     66.4 ± 7.6    67.5 ± 3.9    71.0 ± 3.0
Multilayer Perceptron    60.2 ± 12.5    67.9 ± 5.2    66.6 ± 3.7    75.0 ± 4.5
SVM (radial, poly)       70.5 ± 5.9     57.0 ± 4.7    64.0 ± 5.5    70.1 ± 4.0
Linear SVM               69.1 ± 6.7     68.2 ± 7.5    67.4 ± 4.3    71.3 ± 3.9
Naive Bayes              60.3 ± 6.0     59.8 ± 8.6    65.2 ± 5.8    58.3 ± 3.7
Decision Tree            55.5 ± 4.9     54.3 ± 5.1    56.0 ± 5.6    55.2 ± 3.3
Random Forest            60.1 ± 3.4     62.3 ± 5.7    65.6 ± 3.9    63.3 ± 2.3
Nearest Neighbors        62.7 ± 3.5     58.6 ± 6.2    65.1 ± 3.8    60.3 ± 3.5
Results: Classification
[Figure: AUC by input type (Raw, ICA, PCA, Generator) in three panels (Non parametric, Linear, Non Linear), with one series per classification method: Decision Tree, Linear SVM, Logistic Regression, Multilayer Perceptron, Naive Bayes, Nearest Neighbors, Random Forest, SVM.]
Results: Non-parametric Classifiers Size Effect
[Figure: train and test ROC AUC, and its standard deviation, versus the number of generated samples (10^1 to 10^4) for Nearest Neighbors, Decision Tree, and Random Forest.]
Results: Linear Classifiers Size Effect
[Figure: train and test ROC AUC, and its standard deviation, versus the number of generated samples (10^1 to 10^4) for Linear SVM and Logistic Regression.]
Results: Non-Linear Size Effect
[Figure: train and test ROC AUC, and its standard deviation, versus the number of generated samples (10^1 to 10^4) for MLP, Poly SVM, and Naive Bayes.]
Conclusion
• The generator provides reasonable-looking data.
• ANOVA results replicate findings.
• MLP benefits the most from the augmented dataset.
• The augmented dataset provides scores comparable to those obtained on
raw data.
• The proposed method enables deep-learning methods for
classification of small datasets.
• More components → more likely to find correlated components →
use MVN
• Few components → less likely to find correlated components →
use the rejection sampler
Software
Polyssifier: http://github.com/alvarouc/polyssifier
• Bash:
    poly data.npy label.npy --name schizophrenia --concurrency 8
• Python:
    from polyssifier import poly, plot
    scores, confusions, predictions = poly(data, label, n_folds=8,
                                           concurrency=4)
    plot(scores)
MLP: http://github.com/alvarouc/mlp
    from mlp import MLP
    # cross_val_score lived in sklearn.cross_validation before scikit-learn 0.18
    from sklearn.model_selection import cross_val_score
    clf = MLP(n_hidden=10, n_deep=3, l1_norm=0, drop=0.1,
              verbose=0)
    scores = cross_val_score(clf, data, label, cv=5, n_jobs=1,
                             scoring='roc_auc')
Brain Graphics: http://github.com/alvarouc/brain_utils
    import numpy as np
    from brain_utils import plot_source
    plot_source(source, template, np.where(mask), th=th, vmin=th,
                vmax=np.max(t), cmap='hot', xyz=xyz)
[Figure: example output for the Diagnosis effect, with a color scale from 2.0 to 5.2.]
Funding and Acknowledgements
• This project was funded by grants P20GM103472 and
NIH-R01EB005846.
• We gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Tesla K40 GPUs used for this research
Bibliography
[1] Center for Behavioral Health Statistics and Quality. Behavioral Health Trends in the United States: Results from the 2014 National Survey on Drug Use and Health (HHS Publication No. SMA 15-4927, NSDUH Series H-50), 2015. Retrieved from http://www.samhsa.gov/data/.

More Related Content

What's hot

Data-Driven Recommender Systems
Data-Driven Recommender SystemsData-Driven Recommender Systems
Data-Driven Recommender Systemsrecsysfr
 
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...Edureka!
 
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...Informatikai Intézet
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation sourcebutest
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavAgile Testing Alliance
 
A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...Andre Freitas
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
効率的反実仮想学習
効率的反実仮想学習効率的反実仮想学習
効率的反実仮想学習Masa Kato
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love BucharestStefan Adam
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개r-kor
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksKevin Lee
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorizationrecsysfr
 
Machine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMachine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMikhail Klassen
 
Erik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better MortgageErik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better MortgageMLconf
 

What's hot (20)

Part1
Part1Part1
Part1
 
data mining
data miningdata mining
data mining
 
Data-Driven Recommender Systems
Data-Driven Recommender SystemsData-Driven Recommender Systems
Data-Driven Recommender Systems
 
CSC446: Pattern Recognition (LN3)
CSC446: Pattern Recognition (LN3)CSC446: Pattern Recognition (LN3)
CSC446: Pattern Recognition (LN3)
 
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...
Naive Bayes Classifier Tutorial | Naive Bayes Classifier Example | Naive Baye...
 
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...
Blanka Láng, László Kovács and László Mohácsi: Linear regression model select...
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
 
A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
効率的反実仮想学習
効率的反実仮想学習効率的反実仮想学習
効率的反実仮想学習
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest
 
ICTIR2016tutorial
ICTIR2016tutorialICTIR2016tutorial
ICTIR2016tutorial
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
 
Machine Learning for Aerospace Training
Machine Learning for Aerospace TrainingMachine Learning for Aerospace Training
Machine Learning for Aerospace Training
 
Machine learning
Machine learningMachine learning
Machine learning
 
Erik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better MortgageErik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better Mortgage
 

Viewers also liked

Còrrer per compromís (pàdel)
Còrrer per compromís (pàdel)Còrrer per compromís (pàdel)
Còrrer per compromís (pàdel)Sociologiainefc
 
FAKE NATTOKINASE
FAKE NATTOKINASEFAKE NATTOKINASE
FAKE NATTOKINASETaishou Me
 
Sociologia, projecte: "Còrrer per compromís"
Sociologia, projecte: "Còrrer per compromís"Sociologia, projecte: "Còrrer per compromís"
Sociologia, projecte: "Còrrer per compromís"Sociologiainefc
 
Beginners guide on php programming
Beginners guide on php programmingBeginners guide on php programming
Beginners guide on php programmingKindle Books
 
Development graphics scott wilson
Development graphics scott wilsonDevelopment graphics scott wilson
Development graphics scott wilsonScottWilson977
 
Cell phone ruining our younger generation
Cell phone ruining our younger generationCell phone ruining our younger generation
Cell phone ruining our younger generationSalman Saleem
 
Constructivism by Jesse Delia - Michie's case analysis
Constructivism by Jesse  Delia - Michie's case analysisConstructivism by Jesse  Delia - Michie's case analysis
Constructivism by Jesse Delia - Michie's case analysisMichie Lorenz Basco
 
UKM dan tantangan yang di hadapi
UKM dan tantangan yang di hadapiUKM dan tantangan yang di hadapi
UKM dan tantangan yang di hadapiDian Puspa Tiara
 
Brain and drugs.
Brain and drugs.Brain and drugs.
Brain and drugs.madercj
 
Corrientes del pensamiento enfermero
Corrientes del pensamiento enfermeroCorrientes del pensamiento enfermero
Corrientes del pensamiento enfermeroCarlos Mejía Huamán
 
El pensamiento enfermero
El pensamiento enfermeroEl pensamiento enfermero
El pensamiento enfermeroEstela Morales
 

Viewers also liked (15)

Trailer - Props
Trailer - PropsTrailer - Props
Trailer - Props
 
Còrrer per compromís (pàdel)
Còrrer per compromís (pàdel)Còrrer per compromís (pàdel)
Còrrer per compromís (pàdel)
 
FAKE NATTOKINASE
FAKE NATTOKINASEFAKE NATTOKINASE
FAKE NATTOKINASE
 
Sociologia, projecte: "Còrrer per compromís"
Sociologia, projecte: "Còrrer per compromís"Sociologia, projecte: "Còrrer per compromís"
Sociologia, projecte: "Còrrer per compromís"
 
Lbd
LbdLbd
Lbd
 
Beginners guide on php programming
Beginners guide on php programmingBeginners guide on php programming
Beginners guide on php programming
 
Development graphics scott wilson
Development graphics scott wilsonDevelopment graphics scott wilson
Development graphics scott wilson
 
Ven
VenVen
Ven
 
Egyptian medical syndicate
Egyptian medical syndicateEgyptian medical syndicate
Egyptian medical syndicate
 
Cell phone ruining our younger generation
Cell phone ruining our younger generationCell phone ruining our younger generation
Cell phone ruining our younger generation
 
Constructivism by Jesse Delia - Michie's case analysis
Constructivism by Jesse  Delia - Michie's case analysisConstructivism by Jesse  Delia - Michie's case analysis
Constructivism by Jesse Delia - Michie's case analysis
 
UKM dan tantangan yang di hadapi
UKM dan tantangan yang di hadapiUKM dan tantangan yang di hadapi
UKM dan tantangan yang di hadapi
 
Brain and drugs.
Brain and drugs.Brain and drugs.
Brain and drugs.
 
Corrientes del pensamiento enfermero
Corrientes del pensamiento enfermeroCorrientes del pensamiento enfermero
Corrientes del pensamiento enfermero
 
El pensamiento enfermero
El pensamiento enfermeroEl pensamiento enfermero
El pensamiento enfermero
 

Similar to main

Business Analytics using R.ppt
Business Analytics using R.pptBusiness Analytics using R.ppt
Business Analytics using R.pptRohit Raj
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeBigMine
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_fariaPaulo Faria
 
Sequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionJeremyHeng10
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfhemangppatel
 
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017
Aaron Roth, Associate Professor, University of Pennsylvania, at MLconf NYC 2017MLconf
 
Distributed Group Analytical Hierarchical Process by Consensus
 Distributed Group Analytical Hierarchical Process by Consensus Distributed Group Analytical Hierarchical Process by Consensus
Distributed Group Analytical Hierarchical Process by ConsensusMiguel Rebollo
 
eMba i qt unit-5_sampling
eMba i qt unit-5_samplingeMba i qt unit-5_sampling
eMba i qt unit-5_samplingRai University
 
Introduction
IntroductionIntroduction
Introductionbutest
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdfHODIT12
 

Similar to main (20)

Business Analytics using R.ppt
Business Analytics using R.pptBusiness Analytics using R.ppt
Business Analytics using R.ppt
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Exact Data Reduction for Big Data by Jieping Ye

  • 1. University of New Mexico Data Driven Sample Generator Model with Application to Classification Supervisor Dr. Erik Erhardt Candidate Alvaro Ulloa April 15, 2016
  • 2. Outline Introduction Motivation Thesis statement Contributions Materials Machine Learning methods Random Variable Samplers Matrix Factorization Data Driven Sample Generator Case Study Results Conclusion 2 of 51 Alvaro Ulloa - Data Driven Sample Generator with App to classification
  • 3. Introduction • Machine Learning ◦ Automate decision making ◦ Learn from experience ◦ Generalize data properties from a subset • Regularization ◦ Weight sparsity: L1, L2 ◦ Weight averaging: Dropout ◦ Weight variation: Noise insertion • Rely on design and previous knowledge of the data • Data size: Big and Small
  • 4. Introduction • Big Data ◦ Large number of samples relative to the number of features ◦ Crowd sourced ◦ Cheap to collect ◦ Images, text, video, and sound ◦ Generally helps ML methods avoid overfitting ◦ Expensive to compute • Small Data ◦ Small number of samples relative to the number of features ◦ Expensive to collect ◦ ML methods often overfit it ◦ Biomedical data ◦ Not necessarily expensive to compute
  • 5. Motivation • Mental Illness ◦ In 2014, there were an estimated 9.8 million adults in the US with severe mental illness [1] • Structural MRI ◦ Large number of voxels (∼50,000) ◦ Small number of samples (∼400) ◦ Small data scenario • Need for regularization models to alleviate overfitting effects when investigating SMRI for mental illness
  • 6. Thesis statement • Augmenting a small dataset artificially may lead to improved classification scores. • ML methods may benefit from the induced variability, avoid overfitting, and improve classification scores.
  • 7. Contributions • Data-driven sample generation technique • Optimized rejection sampler • Enable deep-learning for classification of SMRI data
  • 8. Materials: Machine Learning Methods
  • 9. Materials: Non Parametric Classifiers • Nearest Neighbors: search for the k closest points and vote • Decision Tree: sequence of decision rules based on each feature • Random Forest: several decision trees that vote [Figure: example decision tree rating coffee as Good or Bad, with splits on age (≤ 1 year), region (Tropical, Polar, Mediterranean), and Organic vs. Non organic.]
  • 10. Materials: Linear Classifiers
      Logistic Regression: $\log \frac{p(y|x)}{1 - p(y|x)} = c + x^T w$, fitted by $\min_{w,c} \|w\|_L + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^T w + c)\right) + 1\right)$, where $\|\cdot\|_L$ is the chosen penalty norm (L1 or L2).
      Linear SVM: search for a plane $w^T x + c = 0$. Primal: $\min_{c,w,\zeta} \|w\|_L + C \sum_{i=1}^{n} \zeta_i$ subject to $y_i (w^T \phi(x_i) + c) \ge 1 - \zeta_i$, $\zeta_i \ge 0$, $i = 1, \ldots, n$.
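The logistic regression objective on this slide can be evaluated and minimized directly. The sketch below is illustrative only: it assumes labels in {-1, +1}, swaps in a squared L2 penalty (0.5·||w||²) for the generic norm, and uses plain gradient descent with a hand-picked step size. The names `logistic_objective` and `fit_logistic` and the toy data are not from the slides.

```python
import numpy as np

def logistic_objective(w, c, X, y, C=1.0):
    """0.5*||w||^2 + C * sum_i log(1 + exp(-y_i (x_i^T w + c))), with y_i in {-1, +1}."""
    margins = y * (X @ w + c)
    return 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -margins))

def fit_logistic(X, y, C=1.0, lr=1e-3, n_iter=500):
    """Plain gradient descent on the objective above (assumed solver, not tuned)."""
    w = np.zeros(X.shape[1])
    c = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + c)
        # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), chained through z = y * (x^T w + c)
        coef = -y / (1.0 + np.exp(margins))
        w -= lr * (w + C * (X.T @ coef))
        c -= lr * C * np.sum(coef)
    return w, c

# Toy data: two Gaussian blobs, one per class (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])

w, c = fit_logistic(X, y)
accuracy = np.mean(np.sign(X @ w + c) == y)
```

A larger `C` weights the data-fit term more heavily relative to the penalty, which is the role `C` plays in the grid of values searched in the case study.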
  • 11. Materials: Non-Linear Classifiers
      Naive Bayes • $p(y|x) = \frac{p(y)\,p(x|y)}{p(x)}$ • Assumes Gaussian distribution and independence
      Polynomial, Radial SVM • Polynomial: $K(x, x') = (x^T x' + c)^d$ • Radial: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
      Multilayer Perceptron • Flexible • Hard to train • Each added layer improves its ability to fit more complex data • Highly prone to overfitting
      [Figure: multilayer perceptron with an input layer (four inputs), three hidden layers, and an output layer.]
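The two kernels can be written as small NumPy functions that return the full kernel (Gram) matrix between two sets of points. This is a minimal sketch; the function names and default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def polynomial_kernel(X, Z, c=1.0, d=3):
    """K(x, z) = (x^T z + c)^d for every pair of rows in X and Z."""
    return (X @ Z.T + c) ** d

def rbf_kernel(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for every pair of rows in X and Z."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

points = np.array([[0.0, 0.0], [1.0, 0.0]])
K_poly = polynomial_kernel(points, points, c=1.0, d=2)
K_rbf = rbf_kernel(points, points, sigma=1.0)
```

The radial kernel equals 1 when the two points coincide and decays toward 0 as they move apart, with `sigma` setting how quickly.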
  • 12. Materials: Random Variable Samplers
  • 13. Materials: Rejection Sampler
      e(x): envelope function, e(x) = αh(x|θ)
      α: scale
      h(x|θ): PDF that is easy to sample from
      repeat
        Sample y ∼ h(y)
        Sample u ∼ Uniform(0, e(y))
        if u > f(y) then
          Reject y
        else
          Accept y as a sample from f(x)
        end if
      until the desired number of samples is accepted
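The pseudocode above maps almost line for line onto Python. The sketch below is a minimal version under stated assumptions: the function name `rejection_sample` and its arguments are hypothetical, and the target (a Beta(2, 2) density with a Uniform(0, 1) proposal scaled by α = 1.5) is chosen only as a simple test case.

```python
import numpy as np

def rejection_sample(f, h_sample, h_pdf, alpha, n, rng):
    """Draw n samples from density f using the envelope e(y) = alpha * h_pdf(y).

    Assumes alpha * h_pdf(y) >= f(y) wherever f(y) > 0.
    Returns the samples and the observed acceptance rate.
    """
    samples = []
    n_draws = 0
    while len(samples) < n:
        y = h_sample(rng)                        # y ~ h
        u = rng.uniform(0.0, alpha * h_pdf(y))   # u ~ Uniform(0, e(y))
        n_draws += 1
        if u <= f(y):                            # accept unless u > f(y)
            samples.append(y)
    return np.array(samples), n / n_draws

rng = np.random.default_rng(0)
beta22 = lambda x: 6.0 * x * (1.0 - x)           # Beta(2, 2) density on [0, 1]
samples, rate = rejection_sample(
    f=beta22,
    h_sample=lambda r: r.uniform(0.0, 1.0),
    h_pdf=lambda y: 1.0,
    alpha=1.5,
    n=2000,
    rng=rng,
)
```

Because both f and h integrate to one, the expected acceptance rate is 1/α, so a tighter envelope (smaller α) wastes fewer draws.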
  • 14. Materials: Rejection Sampler [Figure: target density $f(x) = \exp\left(-\frac{(x-1)^2}{2x}\right)\frac{x+1}{12}$ for x between 0 and 16, and a histogram of generated samples overlaid on f(x).]
  • 15. Materials: Rejection Sampler [Figure: accepted and rejected draws (y, u) under two envelope functions e(x) over f(x); the second panel is titled "More efficient e(x)".]
  • 16. Materials: Optimized Rejection Sampler [Figure: accepted and rejected draws under a more efficient envelope e(x) over f(x).]
      $\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \int \left(\alpha h(x|\theta) - f(x)\right) dx$, s.t. $e(x) - f(x) \ge 0$, $\forall x \in \mathbb{R}$
      Since h(·) and f(·) are PDFs, each integrates to one and the integral equals α − 1, so the problem reduces to
      $\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - f(x) \ge 0$, $\forall x \in \mathrm{Domain}\{f\}$
  • 17. Materials: Optimized Rejection Sampler
      Let $f(x) = \mathrm{Beta}(2, 2) = 6x(1 - x)$, $x \in [0, 1]$, and $h(x|\theta) = \mathrm{Uniform}(0, \theta) = 1/\theta$ if $0 \le x \le \theta$, and 0 otherwise.
      $\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha$, s.t. $\alpha h(x|\theta) - 6x(1 - x) \ge 0$, $\forall x \in [0, 1]$
      For θ < 1 there is no solution, so θ ≥ 1 is required for the constraint to hold. Then $\frac{\alpha}{\theta} \ge 6x(1 - x)$ for all x, which implies $\frac{\alpha}{\theta} \ge 1.5$.
      $\hat\theta, \hat\alpha = \arg\min_{\theta,\alpha} \alpha$, s.t. $\frac{\alpha}{\theta} \ge 1.5$ and $\theta \ge 1$
  • 18. Using Lagrange multipliers, $L(\alpha, \theta, \lambda, \gamma) = \alpha - \lambda\left(\frac{\alpha}{\theta} - 1.5\right) - \gamma(\theta - 1)$, we solve
      $\frac{\partial L}{\partial \alpha} = 1 - \frac{\lambda}{\theta} = 0$
      $\frac{\partial L}{\partial \theta} = \lambda\frac{\alpha}{\theta^2} - \gamma = 0$
      $\frac{\partial L}{\partial \lambda} = -\left(\frac{\alpha}{\theta} - 1.5\right) = 0$
      $\frac{\partial L}{\partial \gamma} = -(\theta - 1) = 0$
      Then the solution is θ = 1 and α = 1.5, which results in the optimal e(x) = 1.5 · Uniform(0, 1). This is correct since the maximum value of Beta(2, 2) is 1.5.
      [Figure: f(x) and the optimal envelope e(x) for x between 0 and 1.]
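The closed-form answer can be checked numerically. The sketch below is a brute-force verification, not the optimization procedure used in the thesis: it evaluates the Beta(2, 2) density on a grid and, for each candidate θ ≥ 1, computes the smallest α that keeps the envelope above the target. The grid sizes and variable names are assumptions.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001)
f = 6.0 * x * (1.0 - x)                  # Beta(2, 2) density on [0, 1]

thetas = np.linspace(1.0, 3.0, 201)      # theta < 1 is infeasible, so start at 1
# For a fixed theta the envelope height is alpha / theta, so the smallest
# feasible alpha satisfies alpha / theta = max_x f(x).
alphas = thetas * f.max()

best = np.argmin(alphas)
theta_hat, alpha_hat = thetas[best], alphas[best]
acceptance_rate = 1.0 / alpha_hat        # area under f divided by area under e
```

The minimum lands at θ = 1 and α = 1.5, matching the Lagrangian solution, and the corresponding acceptance rate is 2/3.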
  • 19. Materials: Multivariate Normal • Compute the sample mean and sample covariance matrix • Generate samples with the same mean and covariance
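Both steps take a few lines with NumPy. This is a minimal sketch; the helper names `fit_mvn` and `sample_mvn` and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_mvn(X):
    """Sample mean and sample covariance of the rows of X (n_samples, n_dims)."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def sample_mvn(mean, cov, n, rng):
    """Draw n samples from a multivariate normal with the given mean and covariance."""
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(0)
# Toy "observed" data with correlated columns.
observed = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                                 [0.0, 1.0, 0.3],
                                                 [0.0, 0.0, 1.0]]) + np.array([1.0, -2.0, 0.5])
mean, cov = fit_mvn(observed)
generated = sample_mvn(mean, cov, n=5000, rng=rng)
```

The generated samples match the observed data in their first two moments only; any non-Gaussian structure in the original data is not reproduced.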
  • 20. Materials: Matrix Factorization, $X = AS$
  • 21. Materials: PCA • Introduced by Hotelling in 1933, still widely used. • Algebraically: linear combinations of X. • Geometrically: coordinate system rotation. • $E[XX^T] = U \Lambda U^T$, $S = \Lambda^{-1/2} U^T X$, and $A = U \Lambda^{1/2}$ [Figure: scatter plots of the true sources (S), the mixed sources (X), and the estimated sources (Ŝ).]
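The three PCA formulas can be checked on toy data. The sketch below mixes two uniform sources with an arbitrary mixing matrix (both are assumptions made for illustration), estimates E[XXᵀ] from the centered data, and recovers S and A from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
S_true = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))   # unit-variance sources
A_true = np.array([[1.0, 0.5],
                   [0.3, 1.0]])                               # arbitrary mixing matrix
X = A_true @ S_true
X = X - X.mean(axis=1, keepdims=True)                         # center each row

C = X @ X.T / n                        # sample estimate of E[X X^T]
lam, U = np.linalg.eigh(C)             # C = U diag(lam) U^T
S_hat = np.diag(lam ** -0.5) @ U.T @ X # S = Lambda^{-1/2} U^T X
A_hat = U @ np.diag(lam ** 0.5)        # A = U Lambda^{1/2}

reconstruction_ok = np.allclose(A_hat @ S_hat, X)
whitened_cov = S_hat @ S_hat.T / n
```

The product `A_hat @ S_hat` reproduces X exactly, and the estimated sources are uncorrelated with unit variance. They generally remain a rotation of the true sources, which is the gap ICA addresses.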
  • 22. Materials: ICA • Introduced by Herault et al. in 1983 as an extension of PCA. • ICA searches for independence. • Assumes independent sources, with no more than one Gaussian distributed source. [Figure: scatter plots of the true sources (S), the mixed sources (X), the ICA estimate (Ŝ), and the PCA estimate (Ŝ).]
  • 23. Materials: Infomax ICA • Joint entropy: $H(x) = -\int f(x) \log f(x)\,dx$, where f(x) is a joint PDF. • Mutual information: $I(x) = -H(g(x)) + E\left[\sum_i \log \frac{|g_i'(x_i)|}{f_i(x_i)}\right]$, where $g(x) = \frac{1}{1 + \exp(-x)}$. • Infomax: $W = \arg\max_W H(g(WX))$, where $W = A^{-1}$.
  • 24. Proposed method: Data Driven Sample Generator
  • 25. Data Driven Sample Generator • Generate augmented datasets for ML methods to train on. • ML methods tend to fail on datasets that are rich in features but short of samples. • Two assumptions: ◦ The input dataset is reducible, i.e., the reconstruction error from matrix factorization is minimal. ◦ A group of samples with a common diagnosis shares statistical properties that are reflected in their loading coefficients (A).
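One way to read the two assumptions together is sketched below. The block diagram is not reproduced in this transcript, so this is a guess at the overall flow rather than the author's implementation: factorize X ≈ AS, model the loadings A separately for each class, draw new loadings, and map them back through S. The truncated SVD (PCA) factorization, the multivariate normal sampler, and the function name `generate_samples` are all assumptions; the slides also mention ICA and a rejection sampler as alternatives.

```python
import numpy as np

def generate_samples(X, y, n_components, n_new_per_class, rng):
    """Sketch of a data-driven generator. X: (n_samples, n_features), y: class labels.

    Assumes n_components >= 2 so that the per-class covariance is a matrix.
    """
    mean = X.mean(axis=0)
    # Truncated SVD: X - mean ~= A @ S, with A the loadings and S the components.
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    A = U[:, :n_components] * s[:n_components]
    S = Vt[:n_components]

    X_new, y_new = [], []
    for label in np.unique(y):
        A_class = A[y == label]                  # loadings of one diagnostic group
        mu = A_class.mean(axis=0)
        cov = np.cov(A_class, rowvar=False)
        A_gen = rng.multivariate_normal(mu, cov, size=n_new_per_class)
        X_new.append(A_gen @ S + mean)           # back to feature space
        y_new.append(np.full(n_new_per_class, label))
    return np.vstack(X_new), np.concatenate(y_new)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 100))               # 40 samples, 100 features
y_toy = np.repeat([0, 1], 20)
X_aug, y_aug = generate_samples(X_toy, y_toy, n_components=5,
                                n_new_per_class=30, rng=rng)
```

A classifier would then be trained on the generated samples and evaluated on held-out real samples, so that the reported score reflects real data only.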
  • 26. Data Driven Sample Generator: Block diagram
  • 27. Data Driven Sample Generator: Classification framework
  • 28. Case Study: Schizophrenia
  • 29. Case Study: Dataset
      |        | Patient       | Control       | Total |
      |--------|---------------|---------------|-------|
      | Male   | 121           | 97            | 218   |
      | Female | 77            | 94            | 171   |
      | Age    | 39.68 ± 12.12 | 40.26 ± 15.02 |       |
      | Total  | 198           | 191           | 389   |
  • 30. Case Study: ANOVA Age Grouping
      | Age            | Healthy Male | Healthy Female | Patient Male | Patient Female | Total |
      |----------------|--------------|----------------|--------------|----------------|-------|
      | Young (16-33)  | 39           | 35             | 37           | 19             | 130   |
      | Adult (34-43)  | 27           | 25             | 51           | 25             | 128   |
      | Senior (44-81) | 31           | 34             | 33           | 33             | 131   |
      | Total          | 97           | 94             | 121          | 77             | 389   |
  • 31. Case Study: ANOVA
      Full model: GMC = µ. + Diagnosis + Age + Gender + Diagnosis ∗ Age + Diagnosis ∗ Gender + Age ∗ Gender + Diagnosis ∗ Age ∗ Gender + ε
      Reduced model: • Check the significance level of the three-way interaction (age-gender-diagnosis). • If the three-way interaction is not significant, conduct a model comparison test (generalized linear F-test) to assess the reduced model. • If the test supports the reduction, reduce the model. • Repeat with the least significant two-way interaction.
  • 32. Case Study: Classification Framework
      | Method                  | Parameter                  | Values               |
      |-------------------------|----------------------------|----------------------|
      | Nearest Neighbors       | Number of neighbors        | [1, 5, 10, 20]       |
      | Decision Tree           | Maximum number of features | 'auto'               |
      | Random Forest           | Number of estimators       | [5...20]             |
      | Naive Bayes             | Kernel                     | Gaussian             |
      | Logistic Regression     | C                          | [0.001, 0.1, 1]      |
      | Support Vector Machines | Kernel                     | [radial, polynomial] |
      |                         | C                          | [0.01, 0.1, 1]       |
      | Linear SVM              | C                          | [0.01, 0.1, 1]       |
      |                         | Penalty                    | ['L1', 'L2']         |
      | Multilayer Perceptron   | Depth                      | [3, 4, 5]            |
      |                         | Number of hidden units     | [50, 100, 200]       |
  • 33. Results
  • 34. Results: Generator [Figure: heat maps of a generated sample and a real sample, shown side by side under the prompt "Which is real?"]
  • 36. Results: ANOVA
  • 37. Results: ANOVA. Three-way ANOVA group means (×10⁻²) for the main effects on the schizophrenia dataset.
      Diagnosis:
      | Brain Region                  | Control | Patient |
      |-------------------------------|---------|---------|
      | Right Superior Temporal Gyrus | 57.7    | 52.9    |
      | Left Superior Temporal Gyrus  | 60.2    | 55.4    |
      | Superior Frontal Gyrus        | 39.3    | 35.7    |
      Age:
      | Brain Region                | Young | Adult | Senior |
      |-----------------------------|-------|-------|--------|
      | Left Thalamus               | 30.8  | 33.7  | 35.9   |
      | Right Thalamus              | 34.1  | 37.5  | 39.8   |
      | Right Parahippocampal Gyrus | 68.3  | 73.7∗ | 73.3∗  |
      | Left Parahippocampal Gyrus  | 66.8  | 71.3∗ | 71.7∗  |
      Gender: none.
      ∗ Not statistically different.
  • 38. Results: ANOVA. Three-way ANOVA group means (×10⁻²) for the effects of interactions on the schizophrenia dataset.
      Gender-Diagnosis, Right Fusiform Gyrus:
      |        | Control   | Patient   |
      |--------|-----------|-----------|
      | Male   | 27.7(a)   | 29.7(b)   |
      | Female | 29.4(a,b) | 28.1(a,b) |
      Age-Diagnosis, Right Inferior Parietal Lobule:
      |         | Young   | Adult     | Senior  |
      |---------|---------|-----------|---------|
      | Control | 54.5(c) | 53.0(b,c) | 47.1(a) |
      | Patient | 47.2(a) | 50.7(a,b) | 48.8(a) |
      Age-Diagnosis, Left Inferior Parietal Lobule:
      |         | Young     | Adult     | Senior    |
      |---------|-----------|-----------|-----------|
      | Control | 51.9(b,c) | 52.2(c)   | 47.1(a)   |
      | Patient | 45.4(a)   | 50.2(a,b) | 48.3(a,b) |
      Age-Gender-Diagnosis, Left Precuneus:
      | Senior Female Patient | Others |
      |-----------------------|--------|
      | 43.0                  | 47.1   |
  • 39. Results: Classification
      | Method                | Raw         | ICA        | PCA        | Augmented  |
      |-----------------------|-------------|------------|------------|------------|
      | Logistic Regression   | 72.1 ± 3.5  | 66.4 ± 7.6 | 67.5 ± 3.9 | 71.0 ± 3.0 |
      | Multilayer Perceptron | 60.2 ± 12.5 | 67.9 ± 5.2 | 66.6 ± 3.7 | 75.0 ± 4.5 |
      | SVM (radial, poly)    | 70.5 ± 5.9  | 57.0 ± 4.7 | 64.0 ± 5.5 | 70.1 ± 4.0 |
      | Linear SVM            | 69.1 ± 6.7  | 68.2 ± 7.5 | 67.4 ± 4.3 | 71.3 ± 3.9 |
      | Naive Bayes           | 60.3 ± 6.0  | 59.8 ± 8.6 | 65.2 ± 5.8 | 58.3 ± 3.7 |
      | Decision Tree         | 55.5 ± 4.9  | 54.3 ± 5.1 | 56.0 ± 5.6 | 55.2 ± 3.3 |
      | Random Forest         | 60.1 ± 3.4  | 62.3 ± 5.7 | 65.6 ± 3.9 | 63.3 ± 2.3 |
      | Nearest Neighbors     | 62.7 ± 3.5  | 58.6 ± 6.2 | 65.1 ± 3.8 | 60.3 ± 3.5 |
  • 40. Results: Classification [Figure: AUC by input type (Raw, ICA, PCA, Generator) in three panels: non-parametric, linear, and non-linear classifiers. Legend: Decision Tree, Linear SVM, Logistic Regression, Multilayer Perceptron, Naive Bayes, Nearest Neighbors, Random Forest, SVM.]
  • 41. Results: Non-parametric Classifiers, Size Effect [Figure: train and test ROC AUC and its standard deviation versus the number of generated samples (10¹ to 10⁴) for Nearest Neighbors, Decision Tree, and Random Forest.]
  • 42. Results: Linear Classifiers, Size Effect [Figure: train and test ROC AUC and its standard deviation versus the number of generated samples (10¹ to 10⁴) for Linear SVM and Logistic Regression.]
  • 43. Results: Non-Linear Classifiers, Size Effect [Figure: train and test ROC AUC and its standard deviation versus the number of generated samples (10¹ to 10⁴) for MLP, Poly SVM, and Naive Bayes.]
  • 44. Conclusion
  • 45. Conclusion • The generator provides reasonable-looking data. • ANOVA results replicate findings. • MLP benefits the most from the augmented dataset. • The augmented dataset provides scores comparable to those obtained on raw data. • The proposed method enables deep-learning methods for classification of small datasets. • More components → more likely to find correlated components → use MVN • Few components → less likely to find correlated components → use the rejection sampler
  • 46. Software
  • 47. Polyssifier: http://github.com/alvarouc/polyssifier
      Bash:
        poly data.npy label.npy --name schizophrenia --concurrency 8
      Python:
        from polyssifier import poly, plot
        # data: samples-by-features array, label: class labels
        scores, confusions, predictions = poly(data, label, n_folds=8, concurrency=4)
        plot(scores)
  • 48. MLP: http://github.com/alvarouc/mlp
        from mlp import MLP
        # The slides import from sklearn.cross_validation, which scikit-learn 0.20 removed
        from sklearn.model_selection import cross_val_score
        clf = MLP(n_hidden=10, n_deep=3, l1_norm=0, drop=0.1, verbose=0)
        # 5-fold cross-validated ROC AUC
        scores = cross_val_score(clf, data, label, cv=5, n_jobs=1, scoring='roc_auc')
  • 49. Brain Graphics: http://github.com/alvarouc/brain_utils
        import numpy as np
        from brain_utils import plot_source
        plot_source(source, template, np.where(mask), th=th, vmin=th, vmax=np.max(t), cmap='hot', xyz=xyz)
      [Figure: brain map for the Diagnosis effect, with a color bar from 2.0 to 5.2.]
  • 50. Funding and Acknowledgements • This project was funded by grants P20GM103472 and NIH-R01EB005846. • We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPUs used for this research.
  • 51. Bibliography [1] Center for Behavioral Health Statistics and Quality. Behavioral Health Trends in the United States: Results from the 2014 National Survey on Drug Use and Health (HHS Publication No. SMA 15-4927, NSDUH Series H-50), 2015. Retrieved from http://www.samhsa.gov/data/.