PhD Defense

Seoul National University
Advanced Computing Laboratory
Taehoon Lee
Robust Feature Learning
with Deep Neural Networks

• Achievements
• Preliminary
• Deep neural networks
• Dissertation overview
• Adversarial example handling
• Manifold regularized deep neural networks using adversarial examples
• Class-imbalance handling
• Boosted contrastive divergence
• Spatial dependency handling
• Structured sparsity via parallel fused Lasso
• Conclusion
• Limitations and future work
Outline
2/81

Research Areas
Deep neural networks are able to learn hierarchical representations.
Theory Image
Time series Bioinformatics
Machine Learning
Deep Learning
• Main theories: machine learning, deep learning, statistical learning
• Main applications: computer vision, bioinformatics
• Main skills: parallel computing
3/81

• Byunghan Lee, Taehoon Lee, and Sungroh Yoon, "DNA-Level Splice Junction Prediction using Deep Recurrent Neural Networks," in Proceedings of NIPS Workshop on Machine Learning in Computational Biology, Montreal, Canada, December 2015.
• Seungmyung Lee, Hanjoo Kim, Siqi Tan, Taehoon Lee, Sungroh Yoon, and Rhiju Das, "Automated band annotation for RNA structure probing experiments with numerous capillary electrophoresis profiles," Bioinformatics, vol. 31, no. 17, pp. 2808-2815, September 2015.
• Taehoon Lee and Sungroh Yoon, "Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions," in Proceedings of International Conference on Machine Learning (ICML), Lille, France, July 2015.
• Donghyeon Yu, Joong-Ho Won, Taehoon Lee, Johan Lim, and Sungroh Yoon, "High-dimensional Fused Lasso Regression using Majorization-Minimization and Parallel Processing," Journal of Computational and Graphical Statistics, vol. 24, no. 1, pp. 121-153, March 2015.
• Taehoon Lee, Sungmin Lee, Woo Young Sim, Yu Mi Jung, Sunmi Han, Chanil Chung, Jay Junkeun Chang, Hyeyoung Min, and Sungroh Yoon, "Robust Classification of DNA Damage Patterns in Single Cell Gel Electrophoresis," in Proceedings of 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, July 2013.
• Taehoon Lee, Hyeyoung Min, Seung Jean Kim, and Sungroh Yoon, "Application of maximin correlation analysis to classifying protein environments for function prediction," Biochemical and Biophysical Research Communications, vol. 400, no. 2, pp. 219-224, September 2010.
• Hyeyoung Min, Seunghak Yu, Taehoon Lee, and Sungroh Yoon, "Support vector machine based classification of 3-dimensional protein physicochemical environments for automated function annotation," Archives of Pharmacal Research, vol. 33, no. 9, pp. 1451-1459, September 2010.
• Taehoon Lee, Seung Jean Kim, Eui-Young Chung, and Sungroh Yoon, "K-maximin Clustering: A Maximin Correlation Approach to Partition-Based Clustering," IEICE Electronics Express, vol. 6, no. 17, pp. 1205-1211, September 2009.
• Taehoon Lee, Taesup Moon, Seung Jean Kim, and Sungroh Yoon, "Regularization and Kernelization of the Maximin Correlation Approach" (under review)
• Taehoon Lee, Minsuk Choi, and Sungroh Yoon, "Manifold Regularized Deep Networks using Adversarial Examples" (under review)
• Taehoon Lee, Joong-Ho Won, Johan Lim, and Sungroh Yoon, "Large-scale Fused Lasso on multi-GPU using FFT-Based Split Bregman Method" (under review)
• Taehoon Lee et al., "HiComet: High-Throughput Comet Analysis Tool for Large-Scale DNA Damage Assessment Studies" (in preparation)
Publications
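• Published: 5 SCI-indexed journal papers, 3 conference papers (4 as first author in total)
• Under review: 3 SCI-indexed journal papers, 1 conference paper (all as first author)
• Domestic journals and conferences: 12 papers (6 as first author)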
4/81


• A deep neural network (DNN) learns effective hierarchical representations.
• A DNN automatically learns representations and features from data.
What Do Deep Neural Networks Learn
Image: edge → motif → part → object
Language: word → clause → sentence → story
Speech: sound → phone → phoneme → word
Rule-based systems: input → hand-crafted program → output
Traditional machine learning: input → hand-crafted features → trainable classifier → output
Deep learning: input → trainable features → trainable classifier → output (e.g., "tiger"), with higher levels of abstraction
7/81

3 × 2 + 3 × 5 + 3 × 7 → 3 × (2 + 5 + 7)
• As the number of layers grows, the benefit of factorization increases.
• Factorization is the decomposition of an object into a product of factors.
Why Do Deep Neural Networks Work So Well
[Figure: a shallow network (𝑥 → 𝑦 through 𝑊(1), 𝑊(2)) versus a deep network (𝑥 → 𝑦 through 𝑊(1), 𝑊(2), 𝑊(3), 𝑊(4)); the deep network has more paths for the same number of weight values]
Large datasets, complex models, various priors, and high-end hardware together are enabling deep learning to prosper.
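The "more paths with the same number of weights" point can be checked with a small count. This is only an illustrative sketch: the layer sizes below are made-up examples, and the helper names `count_weights` and `count_paths` are hypothetical, not taken from the dissertation.

```python
from math import prod

def count_weights(layer_sizes):
    """Number of weights in a fully connected feed-forward network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def count_paths(layer_sizes):
    """Number of distinct input-to-output paths (one unit picked per layer)."""
    return prod(layer_sizes)

# Illustrative layer sizes (assumptions): both networks use 16 weights.
shallow = [1, 8, 1]          # one hidden layer of 8 units
deep = [1, 2, 2, 2, 2, 1]    # four hidden layers of 2 units each

print(count_weights(shallow), count_paths(shallow))  # 16 8
print(count_weights(deep), count_paths(deep))        # 16 16
```

With the same weight budget, the deeper layout reuses each weight along more paths, which is the factorization effect the slide describes.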
8/81

History of Artificial Neural Networks
Phases: Early Models, Basic Models, Breakthrough
• Rosenblatt, 1958: Perceptron [R58]
• Minsky and Papert, 1969: “Perceptrons” (Limits of Perceptrons) [M69]
• Fukushima, 1975: Cognitron (Autoencoder) [F75]
• Fukushima, 1980: NeoCognitron (Convolutional NN) [F80]
• Hinton, 1983: Boltzmann machine [H83]
• (mid 1980s): Back-propagation
• Hinton, 1986: RBM, Restricted Boltzmann machine [H86]
• LeCun, 1998: Revisit of CNN [L98]
• Hinton, 2006: Deep Belief Networks [H06]
• Lee, 2009: Convolutional RBM [L09]
• Le, 2012: Training of 1 billion parameters [L12]
http://www.technologyreview.com/featuredstory/513696/deep-learning/
9/81

Deep Learning Techniques
Regularization helps the network avoid over-fitting.
dropout
parameter
sharing
(CNN, RNN)
early stopping
weight decay
sparse
connectivity
exploiting
sparsity
traditional trendy
• Deconv nets
(Zeiler et al., CVPR 2010)
• Normalized initialization
(Glorot et al., AISTATS 2010)
• DropConnect
(Wan et al., ICML 2013)
• Batch normalization
(Ioffe et al., ICML 2015)
• Inception
(Szegedy et al., CVPR 2015)
• Adversarial training
(Goodfellow et al., ICLR 2015)
LeCun et al., Proc. IEEE 1998
Srivastava et al., JMLR 2014
Baidu
10/81

Applications of Deep Learning
Natural Language Understanding
Natural Image Understanding
from Karpathy et al., NIPS 2014.
from Google I/O 2013 Highlights
Speech
Recognition
Image Recognition
Natural
Language
Processing
output sentence
current main applications rising applications
11/81

• RBM is a type of logistic belief network whose structure is a bipartite graph.
• Nodes:
• Input layer:
• Hidden layer:
• Probability of a configuration :
•
•
• Each node is a stochastic binary unit:
•
• can be used as a feature.
Restricted Boltzmann Machines
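The formulas on this slide did not survive extraction. For orientation, the standard textbook form of a binary RBM is given below; the symbols $W$, $b$, $c$ and $\sigma$ are assumptions of this sketch and may differ from the notation used on the original slide.

```latex
% Visible (input) layer v \in \{0,1\}^{n_v}, hidden layer h \in \{0,1\}^{n_h}
E(v, h) = -b^{\top} v - c^{\top} h - v^{\top} W h
\qquad
P(v, h) = \frac{e^{-E(v, h)}}{Z},
\quad
Z = \sum_{v', h'} e^{-E(v', h')}

% Each node is a stochastic binary unit with \sigma(z) = 1 / (1 + e^{-z})
P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i v_i W_{ij}\Big),
\qquad
P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)
```

In this form the hidden activation probabilities $P(h_j = 1 \mid v)$ are the quantities typically used as learned features.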
12/81

• CNN is a type of feed-forward artificial neural network where the individual
neurons respond to overlapping regions in the visual field.
• Key components are convolutional and subsampling layers.
Convolutional Neural Networks
LeCun et al., Proc. IEEE 1998.
C-layer
Convolution
between a kernel
and an image to
extract features.
S-layer
Aggregation of
the statistics of
local features at
various locations.
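A minimal NumPy sketch of the two layer types, assuming a single-channel image, a "valid" sliding window, and mean subsampling. The function names `conv2d_valid` and `subsample_mean` are hypothetical, and the kernel is an arbitrary horizontal-difference filter chosen for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """C-layer: slide the kernel over the image (no padding).
    Like most deep learning code, this is cross-correlation (no kernel flip)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample_mean(feature_map, size=2):
    """S-layer: average non-overlapping size x size blocks."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    fm = feature_map[:h, :w]  # drop rows/columns that do not fill a block
    return fm.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0]])      # illustrative kernel (assumption)
c_map = conv2d_valid(image, edge_kernel)   # shape (6, 5)
s_map = subsample_mean(c_map, size=2)      # shape (3, 2)
```

The explicit loops keep the operation readable; real implementations vectorize or use library convolution routines.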
13/81


• Because deep neural networks learn a large number of parameters, there have been many attempts to obtain reasonable solutions over a wide search space. This dissertation discusses the following three issues for deep learning.
Dissertation Overview
17/81

• First, deep neural networks expose the problem of intrinsic blind spots called
adversarial perturbations.
18/81

• Second, restricted Boltzmann machines trained with standard methods sample minority-class examples poorly on class-imbalanced datasets.
19/81

• Second, restricted Boltzmann machines trained with standard methods sample minority-class examples poorly on class-imbalanced datasets.
• Lastly, although convolutional neural networks are a well-known technique for handling spatial dependency, more sophisticated handling of spatial dependency is needed.
20/81


• Desired behaviors and practical issues of deep learning and manifold learning:
• Deep learning discriminates different classes; however, it may result in
wiggly boundaries vulnerable to adversarial perturbations.
• Manifold learning preserves geodesic distances; however, it may result in
poor embedding.
Motivation
22/81

Szegedy et al., Intriguing Properties of Neural Networks, ICLR 2014.
Goodfellow et al., Explaining and Harnessing Adversarial Examples, ICLR 2015.
• We can generate an adversarial input $x_{adv} = x + \Delta x$.
• We expect the classifier to assign the same class to $x$ and $x_{adv}$ so long as $\|\Delta x\|_\infty < \epsilon$.
• However, a very small perturbation can cause a correctly classified image to be misclassified.
Adversarial Example
adversarial
example
original
example
small
perturbation
Goodfellow, ICLR 2015.
fooling networks
23/81

• Consider the dot product between a weight vector $w$ and an adversarial example $x_{adv}$: $w^T x_{adv} = w^T x + w^T \Delta x$.
• The adversarial perturbation causes the activation to grow by $w^T \Delta x$.
• We can maximize this increase subject to the max-norm constraint on $\Delta x$ by assigning $\Delta x = \varepsilon\,\mathrm{sign}(w)$.
How Can We Fool Neural Networks?
$x_{adv} = x - \varepsilon w$ if $x$ is positive
$x_{adv} = x + \varepsilon w$ if $x$ is negative
$w = [8.28, 10.03]$
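The linear argument can be checked numerically. In this sketch the perturbation is ε·sign(w), so that its max norm is exactly ε, and the activation grows by ε‖w‖₁. The dimension, seed, and ε are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # weight vector (random, for illustration)
x = rng.normal(size=100)   # clean input
eps = 0.1                  # max-norm budget (assumption)

delta = eps * np.sign(w)   # worst-case perturbation under ||delta||_inf <= eps
x_adv = x + delta

growth = w @ x_adv - w @ x                     # equals eps * ||w||_1
random_delta = eps * rng.choice([-1.0, 1.0], size=100)
random_growth = w @ random_delta               # same max norm, random signs

print(growth, eps * np.abs(w).sum(), random_growth)
```

Because the growth scales with ‖w‖₁, it increases with the input dimension even though every individual coordinate changes by at most ε.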
24/81

Nguyen et al., Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, CVPR 2015.
• We can maximize this increase subject to the max-norm constraint on $\Delta x$ by assigning $\Delta x = \varepsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$.
• We can also fool a neural network by using the following evolutionary algorithm.
Deep Neural Networks Can Also Be Fooled
25/81

• Adversarial examples can be explained as a property of high-dimensional
dot products.
• The direction of perturbation, rather than the specific point in space, matters
most. Space is not full of pockets of adversarial examples that finely tile the
reals like the rational numbers.
• Because it is the direction that matters most, adversarial perturbations
generalize across different clean examples.
• Linear models lack the capacity to resist adversarial perturbation; only
structures with a hidden layer (where the universal approximator theorem
applies) should be trained to resist adversarial perturbation.
Important Observations (Goodfellow et al., ICLR 2015)
26/81

• How can we cover adversarial examples?
• Simply train on all the noisy examples (Loosli et al., Large-Scale Kernel Machines 2007: INFINITE MNIST dataset).
• Exponential cost
• Include the adversarial term in the objective function (Goodfellow et al., ICLR 2015).
• $\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x_{adv}, y)$
• Error rate on the 10,000 test examples: 1.14% → 0.77%
• Commonly, people expect that elastic distortion can resist adversarial examples.
Related Work
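A small sketch of the mixed clean/adversarial objective, using logistic regression as a stand-in model so the input gradient has a closed form. The model, the values of `eps` and `alpha`, and all function names are assumptions for illustration, not the setup of the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy J(theta, x, y) for one example with label y in {0, 1}."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    return (sigmoid(w @ x + b) - y) * w

def adversarial_objective(w, b, x, y, eps=0.1, alpha=0.5):
    """alpha * J(x) + (1 - alpha) * J(x_adv), x_adv from the sign of the gradient."""
    x_adv = x + eps * np.sign(grad_x(w, b, x, y))
    return alpha * loss(w, b, x, y) + (1 - alpha) * loss(w, b, x_adv, y)

# Toy parameters and example (assumptions).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.3, -0.2, 0.8])
y = 1

clean = loss(w, b, x, y)
mixed = adversarial_objective(w, b, x, y)
print(clean, mixed)
```

The mixed objective is never smaller than the clean loss here, because the adversarial input is built to increase the loss of the same model.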
27/81

What Is a Manifold
For a closed manifold, we may need to represent it in a higher dimension than its own.
http://www.lib.utexas.edu/maps/world_maps/world_rel_803005AI_2003.jpg
In the real world, many observations lie on a manifold, which is why we learn manifolds. The pictures show a 2-D manifold and a 3-D manifold.
28/81

• The manifold term minimizes the difference between the activations of several nodes for samples of the same class.
• This helps us disentangle the factors of variation.
Manifold Regularization Term
𝒂(1): input representation
𝒂(5): manifold representation
𝒂(6): softmax layer
29/81

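[Figure: input representations $a_x^{(1)}$, $a_y^{(1)}$ and manifold representations $a_x^{(5)}$, $a_y^{(5)}$ of two samples $x$ and $y$]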
30/81

[Figure: sample $x_n$ and perturbed sample $x'_n$, with their manifold representations $a_n^{(5)}$ and $a'^{(5)}_n$]
31/81

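[Figure: sample $x_n$ and perturbed sample $x'_n = x_n + \beta\,\nabla_{x_n} L(\theta; x_n, y_n)$, with their manifold representations $a_n^{(5)}$ and $a'^{(5)}_n$]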
32/81

• The proposed methodology learns both a classifier and a manifold embedding that are robust to adversarial perturbations.
• Forward and backward operations of MRnet:
• The first forward operation is the same as in a standard neural network.
• The following backward operation is the same as standard back-propagation except that an adversarial perturbation is also computed.
Proposed Regularized Networks
33/81

• Three datasets we tested:
• (a) MNIST (LeCun et al., 1998)
• (b, c) The raw data and its normalized version (LCN) of CIFAR-10 (Krizhevsky et al., 2009)
• (d, e) The raw data and its normalized version (ZCA) of SVHN (Netzer et al., 2011)
Experimental Results
34/81

• We chose 𝛽 in the range that did not violate class information.
• (a-c) Distributions of Euclidean distances between training samples on
individual datasets.
• (d-f) Different perturbation levels on individual datasets.
Generation of Adversarial Examples
35/81

MNIST Results
Bar: statistics of 10 runs.
Circle: single run reported in the literature.
• Fully connected models have two hidden layers.
• Convolutional models have more than two
convolutional layers.
• All the results are without data augmentation.
• The proposed model shows the best
performance among the alternatives.
36/81

CIFAR-10 and SVHN Results
37/81

• Data: CIFAR-10 test set.
• (a) Pairwise distance matrix of a(L) without Φ.
• (b) 2-D visualization of the manifold embedding through t-SNE without Φ.
• (c) Query images and top 10 nearest images without Φ.
• (d-f) Pairwise distance matrix, t-SNE plot, and query images with Φ.
Embedding Results
38/81

• We have proposed a novel methodology, unifying deep learning and manifold
learning, called manifold regularized networks (MRnet).
• We tested MRnet and confirmed its improved generalization performance
underpinned by the proposed manifold loss term on deep architectures.
• By exploiting the characteristics of blind spots, the proposed MRnet can be
extended to the discovery of true representations on manifolds in various
learning tasks.
Summary of Topic 1
39/81


• Deep neural networks (DNNs) show human-level performance on many recognition tasks.
• We focus on class-imbalanced prediction.
• Insufficient samples to represent the true distribution of a class.
• Q. How can we learn minor but important features using neural networks?
• We propose a new RBM training method called boosted CD.
• We also devise a regularization term for sparsity of DNA sequences.
Motivation
negative positive
easy to
misclassify
query images
41/81

• Genetic information flows through the gene expression process.
• DNA: a sequence of four types of nucleotides (A,G,T,C).
• Gene: a segment of DNA (the basic unit of heredity).
(Splice) Junction Prediction: Extremely Class-Imbalanced Problem
exon
GT: false boundary
GT: true boundary
ACGTCGACTGCTACGTAGCAGCGA
TACGTACCGATCATCACTATCATC
GAGGTACGATCGATCGATCGATCA
GTCGATCGTCGTTCAGTCAGTCGA
TATCAGTCATATGCACATCTCAGT
DNA
RNA
protein
gene expression
GT (or AG)
16K
76M
true sites
exon
intron
160K
(=0.21% over 76M)
42/81

• Two approaches:
• Machine learning-based:
• ANN (Stormo et al., 1982; Noordewier et al., 1990; Brunak et al., 1991),
• SVM (Degroeve et al., 2005; Huang et al., 2006; Sonnenburg et al., 2007),
• HMM (Reese et al., 1997; Pertea et al., 2001; Baten et al., 2006).
• Sequence alignment-based:
• TopHat (Trapnell et al., 2010), MapSplice (Wang et al., 2010),
RUM (Grant et al., 2011).
Previous Work on Junction Prediction
We want to construct a learning model that can boost prediction performance in a way complementary to alignment-based methods.
We propose a learning model based on (multilayer) RBMs and its training scheme.
43/81

• Training weights to minimize negative log-likelihood of data.
• Run the MCMC chain 𝒗(0), 𝒗(1),… , 𝒗(𝑘) for 𝑘 steps.
• The CD-𝑘 updates after seeing example 𝒗:
Contrastive Divergence (CD) for Training RBMs
[Figure: the model expectation is approximated by a k-step Markov chain 𝒗(0) = 𝒗 → 𝒉(0) → 𝒗(1) → 𝒉(1) → … → 𝒗(𝑘) → 𝒉(𝑘)]
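The update formula on this slide was lost in extraction. As a reference point, here is a minimal sketch of the standard CD-k update for a binary RBM (Hinton, 2002), not the boosted variant proposed later. The parameter names `W`, `b`, `c`, the learning rate, and the toy batch are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k update for a binary RBM on a mini-batch v0.
    W: (n_visible, n_hidden), b: visible bias, c: hidden bias."""
    rng = np.random.default_rng() if rng is None else rng
    ph0 = sigmoid(v0 @ W + c)                  # P(h = 1 | v(0)), data statistics
    vk, phk = v0, ph0
    for _ in range(k):                         # k steps of block Gibbs sampling
        h = (rng.random(phk.shape) < phk).astype(float)
        pv = sigmoid(h @ W.T + b)              # P(v = 1 | h)
        vk = (rng.random(pv.shape) < pv).astype(float)
        phk = sigmoid(vk @ W + c)              # P(h = 1 | v(k)), chain statistics
    n = v0.shape[0]
    W_new = W + lr * (v0.T @ ph0 - vk.T @ phk) / n
    b_new = b + lr * (v0 - vk).mean(axis=0)
    c_new = c + lr * (ph0 - phk).mean(axis=0)
    return W_new, b_new, c_new

# Toy usage with random binary data (illustrative sizes).
rng = np.random.default_rng(0)
v0 = (rng.random((4, 6)) < 0.5).astype(float)
W = 0.01 * rng.normal(size=(6, 3))
b = np.zeros(6)
c = np.zeros(3)
W1, b1, c1 = cd_k_update(v0, W, b, c, k=2, rng=rng)
```

The difference between data statistics and chain statistics stands in for the intractable log-likelihood gradient, which is why the chain length k trades accuracy for cost.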
44/81

• Boosting is a meta-algorithm which converts weak learners to strong ones.
• Most boosting algorithms consist of iteratively learning weak classifiers with
respect to a distribution and adding them to a final strong classifier.
• The main variation between many boosting algorithms:
• The method of weighting training data points and hypotheses.
• AdaBoost, LPBoost, TotalBoost, …
What Boosting Is
from lecture notes @ UC Irvine CS 271 Fall 2007
45/81

• Contrastive divergence training loops over all mini-batches and is known to be stable.
• However, for a class-imbalanced distribution, we need to assign higher weights to rare samples so that the Gibbs chains can jump to unseen examples.
Boosted Contrastive Divergence (1/2)
assign lower
weights to
ordinary samples
assign higher
weights to
rare samples
hardly
observed
regions
46/81

• If we assign the same weight to all the data, the performance of Gibbs sampling degrades in regions that are hardly observed.
• Whenever sampling, we therefore re-weight each observation by the energy of its reconstruction, $E(\boldsymbol{v}_n^{(k)}, \boldsymbol{h}_n^{(k)})$.
Boosted Contrastive Divergence (2/2)
[Figure: relative locations of samples and the corresponding Markov chains generated by PT, by CD, and by the proposed method; hardly observed regions are marked]
47/81

Relationship between Boosting and Importance Sampling
Importance sampling (target distribution 𝑓, proposal distribution 𝑔):
(a) Samples cannot be drawn conveniently from 𝑓.
(b) The importance sampler draws samples from 𝑔.
(c) A sample of 𝑓 is obtained by weighting each draw by 𝑓/𝑔.
Correspondingly, in boosted CD:
1. Samples are drawn from 𝑔.
2. A sample of 𝑓 is obtained by weighting each draw by α.
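The importance-sampling side of the analogy can be sketched in a few lines. The target and proposal here (standard normal and a wider normal) are arbitrary choices for illustration, and `normal_pdf` is a hypothetical helper; this shows plain importance sampling, not the boosted CD weighting itself.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
n = 200_000

# (b) draw from the proposal g = N(0, 2^2) instead of the target f = N(0, 1)
x = rng.normal(loc=0.0, scale=2.0, size=n)

# (c) weight each draw by f / g
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)

# Weighted average estimates E_f[x^2], whose exact value is 1.
estimate = np.mean(weights * x ** 2)
print(estimate)
```

The weights average to about 1, which is a quick sanity check that the ratio 𝑓/𝑔 was formed with properly normalized densities.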
48/81

• Balance equations:
• a set of equations that can always be solved to give the equilibrium
distribution of a Markov chain (when such a distribution exists).
• For a restricted Boltzmann machine (Im et al., ICLR 2015):
• For a restricted Boltzmann machine with boosted CD:
• On the convergence properties of contrastive divergence (Sutskever et al., AISTATS 2010):
• “The CD update is not the gradient of any objective function.”; “The CD update is shown to have at least one fixed point when used with L2 regularization.”
Balance Equations for Restricted Boltzmann Machine
global balance
(or full balance)
local balance
(or detailed balance)
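The RBM-specific equations on this slide did not survive extraction. For reference, the generic forms of the two conditions for a Markov chain with transition kernel $T$ and stationary distribution $\pi$ are written below; the symbols are assumptions of this sketch, and these are the general definitions rather than the slide's RBM instantiation.

```latex
% Global (full) balance: total probability flow into each state x'
% equals its stationary probability
\pi(x') = \sum_{x} \pi(x)\, T(x \to x')

% Local (detailed) balance: flow between every pair of states is symmetric
\pi(x)\, T(x \to x') = \pi(x')\, T(x' \to x)
```

Detailed balance implies global balance (sum both sides over $x$), but not the other way around.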
Boosted contrastive divergence inherited the
properties of contrastive divergence.
49/81

• For biological sequences, 1-hot encoding is widely used (Baldi & Brunak, 2001).
• A, C, G, and T are encoded by 1000, 0100, 0010, and 0001, respectively.
• In encoded binary vectors, 75% of the elements are zero.
• To resolve sparsity of 1-hot encoding vectors, we devise a new regularization
technique that incorporates prior knowledge on the sparsity.
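The encoding and its sparsity are easy to verify. A minimal sketch, assuming the column order A, C, G, T stated above; the function name `one_hot_encode` is hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"   # column order: A=1000, C=0100, G=0010, T=0001

def one_hot_encode(sequence):
    """Encode a DNA string as a flat binary vector of length 4 * len(sequence)."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.uint8)
    for pos, base in enumerate(sequence):
        encoded[pos, index[base]] = 1
    return encoded.reshape(-1)

vec = one_hot_encode("ACGT")
print(vec)             # [1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1]
print(1 - vec.mean())  # 0.75, the fraction of zero elements
```

Exactly one element in every group of four is 1, which is the prior knowledge the sparsity term is meant to exploit.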
Categorical Gradient
sparsity term
reconstruction with and w/o
the sparsity term
derived from
the sparsity term
50/81

Proposed Training Algorithm
categorical gradient
boosted CD
51/81

• To simulate a class-imbalanced situation, we randomly dropped samples with different drop rates for different classes.
Results: Effects of Boosting
Method | Description | Training cost, Noise handling, Class-imbalance handling
CD (Hinton, Neural Comp. 2002) | Standard and widely used | - - -
Persistent CD (Tieleman, ICML 2008) | Use of a single Markov chain | - -
Parallel tempering (Cho et al., IJCNN 2010) | Simultaneous Markov chains generation |
Proposed boosted CD | Reweighting samples | -
52/81

• Data preparation:
• Real human DNA sequences with known boundary information.
• GWH dataset: 2-class (boundary or not).
• UCSC dataset: 3-class (acceptor, donor, or non-boundary).
Experimental Setup for Junction Prediction
Effects of
categorical gradient
Effects of boosting
Effects on the
splicing prediction
CGTAGCAGCGATACGTACCGATCGTCACTATCATCGAGGTACGAGAGATCGATCGGCAACG
true acceptor 1 true donor 1 true acceptor 2 non-canonical true donor
false acceptor 1 false donor 1
53/81

• The proposed method shows the best performance in terms of reconstruction
error for both training and testing.
• Compared to the softmax approach, the proposed regularized RBM succeeds in achieving lower error by slightly sacrificing the probability sum constraint.
Results: Effects of Categorical Gradient
Data: chromosome 19 in
GWH-donor
Sequence Length: 200nt
(800 dimension)
# of iterations: 500
Learning rate: 0.1
L2-decay: 0.001
over-fitted best
54/81

Results: Improved Performance and Robustness
2-class classification performance 3-class classification Runtime
Insensitivity to sequence lengths Robustness to negative samples
55/81

exon intron
• (Important biological finding) non-canonical splicing can arise if:
• Introns contain GCA or NAA sequences at their boundaries.
• Exons include contiguous A’s around the boundaries.
Results: Identification of Non-Canonical Splice Sites
We used 162,951
examples excluding
canonical splice sites.
56/81

Summary of Topic 2
Significant boosts in splicing
prediction performance
Robustness to high-dimensional
class-imbalanced data
New RBM training methods
called boosted CD
New penalty term to handle
sparsity of DNA sequences
The ability to detect subtle
non-canonical splicing signals
57/81


• In this paper, we consider the fused Lasso regression (FLR), an important
special case of the ℓ1-penalized regression for structured sparsity:
• The matrix 𝐷 is the difference matrix on the undirected and unweighted
graph of adjacent variables.
• Adjacency of the variables is determined by the application.
• For graphs with a 2-D grid, the objective function can be written as
• The second penalty function is non-smooth and non-separable.
Fused Lasso Regression
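The two objective functions on this slide were lost in extraction. The commonly used form of fused Lasso regression is written below as a hedged reconstruction: $y$, $X$, $\beta$, $\lambda_1$, $\lambda_2$ and the grid indexing are assumptions of this sketch and may differ in detail from the original slide.

```latex
% General form with difference matrix D
\min_{\beta}\; \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \lVert \beta \rVert_1
  + \lambda_2 \lVert D\beta \rVert_1

% 2-D grid: D takes differences between horizontal and vertical neighbours
\min_{\beta}\; \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \sum_{i,j} \lvert \beta_{i,j} \rvert
  + \lambda_2 \Big( \sum_{i,j} \lvert \beta_{i,j} - \beta_{i,j+1} \rvert
                  + \sum_{i,j} \lvert \beta_{i,j} - \beta_{i+1,j} \rvert \Big)
```

The first penalty is separable across coefficients; the second couples neighbouring coefficients, which is what makes it non-separable.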
59/81

• We want to solve the 2-dimensional fused Lasso regression on multi-GPU.
Overview of Proposed Method
fused Lasso
60/81

approximating due to the ℓ1-norm
fused Lasso
fused Lasso + split Bregman algorithm
61/81

fused Lasso
accelerating for solving a linear system
fused Lasso + split Bregman algorithm + PCGLS
62/81

fused Lasso
accelerating for solving a linear system
fused Lasso + split Bregman algorithm + PCGLS
replacing a linear system solver with FFT
fused Lasso + split Bregman algorithm + PCGLS + FFT
63/81

• Split Bregman algorithm for the ℓ1-norm:
• Because of the ℓ1-norm, the objective function is non-differentiable.
Split Bregman Algorithm for Fused Lasso
introducing an auxiliary variable
approximating
64/81

• The conjugate gradient (CG) method iteratively solves a linear system of equations of the form $Ax = b$ when $A$ is symmetric and positive definite.
PCGLS Algorithm
• For least-squares problems, it is well known that (9) is equivalent to solving the normal equation $A^T A x = A^T b$, i.e. $x = (A^T A)^{-1} A^T b$.
• The CG algorithm for least squares is often referred to as CGLS, and its preconditioned counterpart as PCGLS (in this case the scaling amounts to $A^T A \to M^{-T} A^T A M^{-1}$).
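A minimal sketch of plain (unpreconditioned) CGLS, which applies CG to the normal equation without ever forming $A^T A$ explicitly. The preconditioner $M$ from the slide is omitted here, and the function name, iteration cap, and tolerance are assumptions.

```python
import numpy as np

def cgls(A, b, n_iter=50, tol=1e-10):
    """Solve min_x ||Ax - b||_2 by conjugate gradients on the normal equation."""
    x = np.zeros(A.shape[1])
    r = b - A @ x            # residual in data space
    s = A.T @ r              # residual of the normal equation
    p = s.copy()             # search direction
    gamma = s @ s
    for _ in range(n_iter):
        if np.sqrt(gamma) < tol:
            break
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

# Compare against NumPy's least-squares solver on a random problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
b = rng.normal(size=30)
x_cgls = cgls(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.max(np.abs(x_cgls - x_ref)))
```

Each iteration needs only one product with $A$ and one with $A^T$, which is the part that maps well onto GPU matrix-vector kernels.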
acceleratable
65/81

• In mathematics, Poisson's equation is a partial differential equation of elliptic
type with broad utility in electrostatics, mechanical engineering and theoretical
physics.
• Poisson’s equation is frequently written as
Poisson’s Equation
http://en.wikipedia.org/wiki/Poisson's_equation
http://people.rit.edu/~pnveme/ExplictSolutions2/2Dim/Linear/PoissonDisk/PoissonDisk.html
66/81

• In two-dimensional Cartesian coordinates, it takes the form
Poisson’s Equation in 2-Dimensions
block tri-diagonal system
67/81

• Mathematical background
• Apply the 2D forward FFT to $f$ to obtain $\hat{f}(k)$, where $k$ is the wave number
• Apply the inverse of the Laplace operator to $\hat{f}(k)$ to obtain $\hat{v}(k)$: simple element-wise division in Fourier space
• Apply the 2D inverse FFT to $\hat{v}(k)$ to obtain $v$
Poisson’s Equation using the FFT
$\nabla^2 v = f \;\leftrightarrow\; -(k_x^2 + k_y^2)\,\hat{v} = \hat{f}$
$\hat{v} = -\dfrac{\hat{f}}{k_x^2 + k_y^2}$
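The three FFT steps can be sketched with NumPy, assuming periodic boundary conditions on a square domain of side 2π and a zero-mean solution. These assumptions, the function name, and the test function are illustrative and not taken from the slides.

```python
import numpy as np

def solve_poisson_periodic(f, length=2.0 * np.pi):
    """Solve laplacian(v) = f on a periodic square grid via the FFT."""
    n_y, n_x = f.shape
    kx = 2.0 * np.pi * np.fft.fftfreq(n_x, d=length / n_x)   # angular wave numbers
    ky = 2.0 * np.pi * np.fft.fftfreq(n_y, d=length / n_y)
    k2 = kx[np.newaxis, :] ** 2 + ky[:, np.newaxis] ** 2
    f_hat = np.fft.fft2(f)              # step 1: forward 2D FFT
    k2[0, 0] = 1.0                      # avoid dividing by zero at the mean mode
    v_hat = -f_hat / k2                 # step 2: element-wise division
    v_hat[0, 0] = 0.0                   # pin the free constant: zero-mean solution
    return np.real(np.fft.ifft2(v_hat)) # step 3: inverse 2D FFT

# Check against a known solution: v = sin(x) cos(2y) has laplacian -5 v.
n = 32
grid = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(grid, grid)
v_true = np.sin(X) * np.cos(2.0 * Y)
f = -5.0 * v_true
v = solve_poisson_periodic(f)
print(np.max(np.abs(v - v_true)))
```

The solve costs two FFTs and one element-wise division, which is why replacing an iterative linear-system solver with this step is attractive on GPUs.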
http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/3-CUDA_libraries_+_Matlab.pdf
68/81

• Pseudocode for two iterative methods:
Split Bregman Algorithm for Fused Lasso (1/2)
FFT
69/81

• Multi-GPU operations for matrix-vector
computations
Split Bregman Algorithm for Fused Lasso (2/2)
70/81

• The computation times are measured in CPU time with
• CPU: Intel Xeon E5-4620 (2.2 GHz) and 16 GB RAM
• GPU: NVIDIA GTX Titan (2688 cores, 6 GB GDDR5)
• We set the regularization parameters (𝜆1, 𝜆2) = (1, 1) and stopping criterion is
• We generate 𝑛 samples from a 𝑝-dimensional 𝑁(0, 𝐼_𝑝), and the response variable 𝑦 is generated by 𝑦 = 𝑋𝛽 + 𝜖 with 𝜖 ~ 𝑁(0, 𝐼_𝑛), where 𝛽 = .
Experiments
71/81

• We first considered scenarios with synthetic regression problems where the
coefficients were defined on a square grid:
• For the very large cases, the average speed-up: 409.19 to 433.23
Runtime Comparison for Piecewise Constant Blocks Cases
72/81

• For the other cases (n = 12000–24000), the average speed-up: 26.67–47.47
• Circular Gaussian cases are formulated by:
Runtime Comparison for Circular Gaussian Cases
73/81

• Image-based regression of the behavioral fMRI data.
• Regression coefficients were overlaid and color-coded on the brain map as
described in the text.
Structured Sparsity Regression Example
74/81


• By applying the proposed method to various large-scale datasets extensively,
we have demonstrated successfully the following:
• Feasibility of highly-parallelizable computational algorithms for high-
dimensional structured sparse regression problems,
• Use case of direct-communicating multiple GPUs for speed-up and
scalability,
• Promise of FFT-based preconditioners for parallel solving of a family of
linear systems.
• The fact that the highest speed-up (433x) occurred on the highest-dimensional problems clearly indicates where the merit of the multi-GPU scheme lies.
• Future work: connecting dots to deep neural networks
• Fused Autoencoder, Multi-layer fused Lasso, …
Summary of Topic 3
76/81


1. The MRnet can be applied in a complementary way to generalize neural
networks with traditional techniques such as L2 decay.
2. We propose a novel method for training RBMs for class-imbalanced
prediction. Our proposal includes a deep belief network-based methodology
for computational splice junction prediction.
3. The parallel fused Lasso can be applied for data that have structured
sparsity like images to exploit more prior knowledge than convolutional or
recurrent operations.
Conclusion
This dissertation proposed a set of robust feature learning schemes that can learn meaningful representations underlying large-scale genomic datasets and image datasets using deep networks.
78/81

• Several directions for future work on the proposed methodologies are possible.
• First, we can extend MRnet to extract scaling and translation invariant features
by replacing synthetic of nearest training samples.
• Second, it would also be interesting to alter the objective function of MRnet in order to generalize the whole procedure of MRnet.
• Lastly, the proposed three schemes (manifold loss, boosting, and L1 fusion penalty) can be applied within the framework of recurrent neural networks.
Limitations and Future Work
We need to make the proposed schemes
more universal and general.
79/81
