- 1. Seoul National University Advanced Computing Laboratory Taehoon Lee Robust Feature Learning with Deep Neural Networks
- 2. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 2/81
- 3. Research Areas Deep neural networks are able to learn hierarchical representations. Theory Image Time series Bioinformatics Machine Learning Deep Learning • Main theories: machine learning, deep learning, statistical learning • Main applications: computer vision, bioinformatics • Main skills: parallel computing 3/81
- 4. Publications • Byunghan Lee, Taehoon Lee, and Sungroh Yoon, "DNA-Level Splice Junction Prediction using Deep Recurrent Neural Networks," in Proceedings of NIPS Workshop on Machine Learning in Computational Biology, Montreal, Canada, December 2015. • Seungmyung Lee, Hanjoo Kim, Siqi Tan, Taehoon Lee, Sungroh Yoon, and Rhiju Das, "Automated band annotation for RNA structure probing experiments with numerous capillary electrophoresis profiles," Bioinformatics, vol. 31, no. 17, pp. 2808-2815, September 2015. • Taehoon Lee and Sungroh Yoon, "Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions," in Proceedings of International Conference on Machine Learning (ICML), Lille, France, July 2015. • Donghyeon Yu, Joong-Ho Won, Taehoon Lee, Johan Lim, and Sungroh Yoon, "High-dimensional Fused Lasso Regression using Majorization-Minimization and Parallel Processing," Journal of Computational and Graphical Statistics, vol. 24, no. 1, pp. 121-153, March 2015. • Taehoon Lee, Sungmin Lee, Woo Young Sim, Yu Mi Jung, Sunmi Han, Chanil Chung, Jay Junkeun Chang, Hyeyoung Min, and Sungroh Yoon, "Robust Classification of DNA Damage Patterns in Single Cell Gel Electrophoresis," in Proceedings of 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, July 2013. • Taehoon Lee, Hyeyoung Min, Seung Jean Kim, and Sungroh Yoon, "Application of maximin correlation analysis to classifying protein environments for function prediction," Biochemical and Biophysical Research Communications, vol. 400, no. 2, pp. 219-224, September 2010. • Hyeyoung Min, Seunghak Yu, Taehoon Lee, and Sungroh Yoon, "Support vector machine based classification of 3-dimensional protein physicochemical environments for automated function annotation," Archives of Pharmacal Research, vol. 33, no. 9, pp. 1451-1459, September 2010. • Taehoon Lee, Seung Jean Kim, Eui-Young Chung, and Sungroh Yoon, "K-maximin Clustering: A Maximin Correlation Approach to Partition-Based Clustering," IEICE Electronics Express, vol. 6, no. 17, pp. 1205-1211, September 2009. • Taehoon Lee, Taesup Moon, Seung Jean Kim, and Sungroh Yoon, "Regularization and Kernelization of the Maximin Correlation Approach" (under review) • Taehoon Lee, Minsuk Choi, and Sungroh Yoon, "Manifold Regularized Deep Networks using Adversarial Examples" (under review) • Taehoon Lee, Joong-Ho Won, Johan Lim, and Sungroh Yoon, "Large-scale Fused Lasso on multi-GPU using FFT-Based Split Bregman Method" (under review) • Taehoon Lee et al., "HiComet: High-Throughput Comet Analysis Tool for Large-Scale DNA Damage Assessment Studies" (in preparation) • Published: 5 SCI-indexed journal papers and 3 conference papers (first author on 4 in total) • Under review: 3 SCI-indexed journal papers and 1 conference paper (first author on all) • Domestic journals and conferences: 12 papers (first author on 6) 4/81
- 6. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 6/81
- 7. • A deep neural network (DNN) learns effective hierarchical representations. • A DNN learns representations and features automatically from data. What Do Deep Neural Networks Learn [Figure: levels of abstraction, from low to high. Image: edge → motif → part → object. Language: word → clause → sentence → story. Speech: sound → phone → phoneme → word. Rule-based systems: input → hand-crafted program → output. Traditional machine learning: input → hand-crafted features → trainable classifier → output. Deep learning: input → trainable features → trainable classifier → output (e.g., "tiger").] 7/81
- 8. 3 × 2 + 3 × 5 + 3 × 7 → 3 × (2 + 5 + 7) • Factorization is the decomposition of an object into a product of factors. • As the number of layers grows, the benefit of factorization increases. Why Do Deep Neural Networks Work So Well [Figure: a shallow network (𝑥 → 𝑊(1) → 𝑊(2) → 𝑦) and a deep network (𝑥 → 𝑊(1) → 𝑊(2) → 𝑊(3) → 𝑊(4) → 𝑦); the deep network has more paths for the same number of weight values.] Abundant data, complex models, various priors, and high-end hardware together enable deep learning to prosper. 8/81
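The path-counting argument on this slide can be checked with a small sketch. It assumes fully connected layers, ignores biases, and counts one path for every choice of one unit per layer; the layer sizes and the helper names `count_weights` and `count_paths` are illustrative and do not come from the slides.

```python
from math import prod

def count_weights(layer_sizes):
    """Number of weights in a fully connected network (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def count_paths(layer_sizes):
    """Number of input-to-output paths: one unit is chosen in every layer."""
    return prod(layer_sizes)

shallow = [1, 8, 1]       # two weight matrices, W(1) and W(2)
deep = [1, 2, 3, 2, 1]    # four weight matrices, W(1) to W(4)

for name, sizes in (("shallow", shallow), ("deep", deep)):
    print(name, "weights:", count_weights(sizes), "paths:", count_paths(sizes))
```

With these hypothetical sizes both networks use the same number of weights, yet the deeper one routes the input through more distinct paths, which is the sense in which depth factorizes the computation.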
- 9. History of Artificial Neural Networks • Rosenblatt, 1958: Perceptron [R58] • Minsky and Papert, 1969: “Perceptrons” (limits of perceptrons) [M69] • Fukushima, 1975: Cognitron (autoencoder) [F75] • Fukushima, 1980: Neocognitron (convolutional NN) [F80] • Hinton, 1983: Boltzmann machine [H83] • Mid-1980s: back-propagation • Hinton, 1986: restricted Boltzmann machine (RBM) [H86] • LeCun, 1998: revisit of CNN [L98] • Hinton, 2006: deep belief networks [H06] • Lee, 2009: convolutional RBM [L09] • Le, 2012: training of 1 billion parameters [L12] Phases: early models, basic models, breakthrough. http://www.technologyreview.com/featuredstory/513696/deep-learning/ 9/81
- 10. Deep Learning Techniques Regularization helps the network avoid over-fitting. [Figure labels, on an axis from traditional to trendy: dropout, parameter sharing (CNN, RNN), early stopping, weight decay, sparse connectivity, exploiting sparsity. Sources: LeCun et al., Proc. IEEE 1998; Srivastava et al., JMLR 2014; Baidu.] • Deconv nets (Zeiler et al., CVPR 2010) • Normalized initialization (Glorot et al., AISTATS 2010) • DropConnect (Wan et al., ICML 2013) • Batch normalization (Ioffe et al., ICML 2015) • Inception (Szegedy et al., CVPR 2015) • Adversarial training (Goodfellow et al., ICLR 2015) 10/81
- 11. Applications of Deep Learning Natural Language Understanding Natural Image Understanding from Karpathy et al., NIPS 2014. from Google I/O 2013 Highlights Speech Recognition Image Recognition Natural Language Processing output sentence current main applications rising applications 11/81
- 12. • RBM is a type of logistic belief network whose structure is a bipartite graph. • Nodes: • Input layer: • Hidden layer: • Probability of a configuration : • • • Each node is a stochastic binary unit: • • can be used as a feature. Restricted Boltzmann Machines 12/81
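The formulas on this slide did not survive extraction. As a reference point only, the standard definitions of a binary RBM are sketched below in commonly used notation; the symbols $W$, $\mathbf{b}$, $\mathbf{c}$, and $Z$ are assumptions and may differ from the ones on the original slide.

```latex
\begin{align}
E(\mathbf{v}, \mathbf{h}) &= -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h}, \\
P(\mathbf{v}, \mathbf{h}) &= \frac{1}{Z}\exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr),
\qquad Z = \sum_{\mathbf{v}, \mathbf{h}} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr), \\
P(h_j = 1 \mid \mathbf{v}) &= \sigma\Bigl(c_j + \sum_i W_{ij} v_i\Bigr),
\qquad
P(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(b_i + \sum_j W_{ij} h_j\Bigr),
\end{align}
```

where $\sigma(z) = 1 / (1 + e^{-z})$. Under these definitions the hidden activation probabilities $P(h_j = 1 \mid \mathbf{v})$ are the quantities typically used as features.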
- 13. • CNN is a type of feed-forward artificial neural network where the individual neurons respond to overlapping regions in the visual field. • Key components are convolutional and subsampling layers. Convolutional Neural Networks LeCun et al., Proc. IEEE 1998. C-layer Convolution between a kernel and an image to extract features. S-layer Aggregation of the statistics of local features at various locations. 13/81
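The two layer types named on this slide can be illustrated with a minimal pure-Python sketch. It assumes a "valid" convolution without kernel flipping (the cross-correlation that CNN layers usually compute) for the C-layer and 2×2 max pooling for the S-layer; the image, the kernel, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """C-layer: slide the kernel over the image (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2):
    """S-layer: keep the maximum of each non-overlapping size x size block."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# A 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A 2x2 kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool(features)              # 2x2 summary
print(features)
print(pooled)
```

The feature map responds only where the edge lies, and pooling keeps that response while discarding its exact position within each block.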
- 14. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 14/81
- 20. • Because deep neural networks learn a large number of parameters, there have been many attempts to obtain reasonable solutions over a wide search space. This dissertation discusses the following three issues for deep learning. • First, deep neural networks expose the problem of intrinsic blind spots called adversarial perturbations. • Second, training restricted Boltzmann machines shows limited sampling performance for minority samples in class-imbalanced datasets. • Lastly, although convolutional neural networks are a well-known technique for handling spatial dependency, spatial dependency needs to be handled in more sophisticated ways. Dissertation Overview 20/81
- 21. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 21/81
- 22. • Desired behaviors and practical issues of deep learning and manifold learning: • Deep learning discriminates different classes; however, it may result in wiggly boundaries vulnerable to adversarial perturbations. • Manifold learning preserves geodesic distances; however, it may result in poor embedding. Motivation 22/81
- 23. Szegedy et al., Intriguing Properties of Neural Networks, ICLR 2014. Goodfellow et al., Explaining and Harnessing Adversarial Examples, ICLR 2015. • We can generate an adversarial input $x_{adv} = x + \Delta x$. • We expect the classifier to assign the same class to $x$ and $x_{adv}$ so long as $\|\Delta x\|_\infty < \epsilon$. • However, a very small perturbation can cause a correctly classified image to be misclassified. Adversarial Example [Figure: original example + small perturbation = adversarial example that fools the network (Goodfellow, ICLR 2015).] 23/81
- 24. • Consider the dot product between a weight vector $w$ and an adversarial example $x_{adv}$: $w^{\top} x_{adv} = w^{\top} x + w^{\top} \Delta x$. • The adversarial perturbation causes the activation to grow by $w^{\top} \Delta x$. • We can maximize this increase subject to the max-norm constraint on $\Delta x$ by assigning $\Delta x = \epsilon\,\mathrm{sign}(w)$. How Can We Fool Neural Networks? Example with $w = [8.28, 10.03]$: $x_{adv} = x - \epsilon w$ if $x$ is positive; $x_{adv} = x + \epsilon w$ if $x$ is negative. 24/81
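A minimal sketch of the max-norm-constrained perturbation for a linear unit follows. It scales sign(w) by ε so that the perturbation has max norm ε, and it reuses the weight vector from the slide; the input `x` and the value of `epsilon` are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

w = [8.28, 10.03]     # weight vector shown on the slide
x = [0.5, -0.2]       # hypothetical clean input
epsilon = 0.1         # hypothetical max-norm budget

delta = [epsilon * sign(wi) for wi in w]          # perturbation with max norm epsilon
x_adv = [xi + di for xi, di in zip(x, delta)]

increase = dot(w, x_adv) - dot(w, x)              # growth of the activation
l1_norm = sum(abs(wi) for wi in w)
print(increase, epsilon * l1_norm)
```

The activation grows by ε times the L1 norm of w, so the effect of a fixed per-coordinate budget increases with the dimensionality of the input.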
- 25. Nguyen et al., Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, CVPR 2015. • We can maximize this increase subject to the max-norm constraint on $\Delta x$ by assigning $\Delta x = \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$. • We can also fool a neural network by using the following evolutionary algorithm. Deep Neural Networks Can Also Be Fooled 25/81
- 26. • Adversarial examples can be explained as a property of high-dimensional dot products. • The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers. • Because it is the direction that matters most, adversarial perturbations generalize across different clean examples. • Linear models lack the capacity to resist adversarial perturbation; only structures with a hidden layer (where the universal approximator theorem applies) should be trained to resist adversarial perturbation. Important Observations (Szegedy et al, ICLR 2014) 26/81
- 27. • How can we cover adversarial examples? • Simply train on all the noisy examples (Loosli et al., Large Scale Kernel Machines 2007: INFINITE MNIST dataset). • Exponential cost • Include the adversarial term in the objective function (Goodfellow et al., ICLR 2015). • $\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x_{adv}, y)$ • Error rate on the 10,000 test examples: 1.14% -> 0.77% • Commonly, people expect that elastic distortion can resist adversarial examples. Related Work 27/81
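The adversarial objective above can be sketched for the simplest possible model. The example below assumes a logistic-regression classifier with cross-entropy loss and builds the adversarial input with the fast gradient sign method of Goodfellow et al.; the weights, the input, and the values of `alpha` and `epsilon` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy J(theta, x, y) of a logistic-regression classifier."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of J with respect to the input x, which equals (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def adversarial_objective(w, b, x, y, alpha=0.5, epsilon=0.1):
    """Return alpha*J(x) + (1-alpha)*J(x_adv) together with both terms."""
    grad = input_gradient(w, b, x, y)
    # Fast gradient sign step: move each coordinate by epsilon in the direction
    # that increases the loss.
    x_adv = [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
    clean = loss(w, b, x, y)
    adv = loss(w, b, x_adv, y)
    return alpha * clean + (1 - alpha) * adv, clean, adv

total, clean, adv = adversarial_objective([1.5, -2.0], 0.1, [0.4, 0.3], 1)
print(total, clean, adv)
```

Because the adversarial term is evaluated at the point where the loss is locally worst, minimizing the combined objective penalizes sharp changes of the loss around each training example.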
- 28. What Is a Manifold? A closed manifold may have to be represented in a space of higher dimension than the manifold itself. http://www.lib.utexas.edu/maps/world_maps/world_rel_803005AI_2003.jpg In the real world, many observations lie on a manifold, which is why we learn the manifold. The pictures show a 2-D manifold and a 3-D manifold. 28/81
- 29. • The manifold term minimizes the difference between the activations, at several nodes, of samples from the same class. • This helps disentangle the factors of variation. Manifold Regularization Term 𝒂(1): input representation 𝒂(5): manifold representation 𝒂(6): softmax layer 29/81
- 30. Manifold Regularization Term • The manifold term minimizes the difference between the activations, at several nodes, of samples from the same class. • This helps disentangle the factors of variation. 𝒂(1): input representation, with samples $a_x^{(1)}$ and $a_y^{(1)}$. 𝒂(5): manifold representation, with samples $a_x^{(5)}$ and $a_y^{(5)}$. 30/81
- 31. Manifold Regularization Term • The manifold term minimizes the difference between the activations, at several nodes, of samples from the same class. • This helps disentangle the factors of variation. 𝒂(1): input representation, with inputs $x_n$ and $x'_n$. 𝒂(5): manifold representation, with activations $a_n^{(5)}$ and $a_n'^{(5)}$. 31/81
- 32. Manifold Regularization Term • The manifold term minimizes the difference between the activations, at several nodes, of samples from the same class. • This helps disentangle the factors of variation. 𝒂(1): input representation, with inputs $x_n$ and $x'_n = x_n + \beta \nabla_{x_n} L(\theta; x_n, y_n)$. 𝒂(5): manifold representation, with activations $a_n^{(5)}$ and $a_n'^{(5)}$. 32/81
- 33. • The proposed methodology learns both a classifier and a manifold embedding that is robust to adversarial perturbations. • Forward and backward operations of MRnet: • The first forward operation is the same as in a standard neural network. • The following backward operation is the same as standard back-propagation, except that it also yields an adversarial perturbation. Proposed Regularized Networks 33/81
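The forward and backward passes described above can be sketched on a toy network. Everything below is an assumption made for illustration: a single tanh hidden layer stands in for the manifold representation, the output is a sigmoid with cross-entropy loss, the manifold term is taken to be the squared Euclidean distance between the hidden activations of a sample and of its perturbed copy, and the weights `W`, `u` and the step `beta` are hypothetical. It is not the MRnet implementation.

```python
import math

W = [[0.5, -0.3], [0.2, 0.8]]   # hypothetical hidden-layer weights
u = [1.0, -1.0]                 # hypothetical output weights

def forward(x):
    """Return the hidden activations a and the output probability p."""
    a = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]
    p = 1.0 / (1.0 + math.exp(-sum(ui * ai for ui, ai in zip(u, a))))
    return a, p

def loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(x, y):
    """Back-propagate the loss all the way to the input x."""
    a, p = forward(x)
    d_pre = [(p - y) * ui * (1 - ai * ai) for ui, ai in zip(u, a)]
    return [sum(W[i][j] * d_pre[i] for i in range(len(W)))
            for j in range(len(x))]

def mrnet_terms(x, y, beta=0.1):
    """Clean loss, loss on the perturbed input, and the manifold distance."""
    a, p = forward(x)                                     # first forward pass
    grad = input_gradient(x, y)                           # backward pass to the input
    x_adv = [xi + beta * gi for xi, gi in zip(x, grad)]   # x' = x + beta * grad
    a_adv, p_adv = forward(x_adv)                         # second forward pass
    manifold = sum((ai - bi) ** 2 for ai, bi in zip(a, a_adv))
    return loss(p, y), loss(p_adv, y), manifold

clean, adv, manifold = mrnet_terms([1.0, 0.5], 1)
print(clean, adv, manifold)
```

A training objective in this spirit would add the manifold distance, scaled by a regularization weight, to the classification losses, so that a sample and its adversarial copy are pulled together in the hidden representation.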
- 34. • Three datasets we tested: • (a) MNIST (LeCun et al., 1998) • (b, c) The raw data and its normalized version (LCN) of CIFAR-10 (Krizhevsky et al., 2009) • (d, e) The raw data and its normalized version (ZCA) of SVHN (Netzer et al., 2011) Experimental Results 34/81
- 35. • We chose 𝛽 in the range that did not violate class information. • (a-c) Distributions of Euclidean distances between training samples on individual datasets. • (d-f) Different perturbation levels on individual datasets. Generation of Adversarial Examples 35/81
- 36. MNIST Results Bar: statistics of 10 runs. Circle: single run reported in the literature. • Fully connected models have two hidden layers. • Convolutional models have more than two convolutional layers. • All the results are without data augmentation. • The proposed model shows the best performance among the alternatives. 36/81
- 37. CIFAR-10 and SVHN Results 37/81
- 38. • Data: CIFAR-10 test set. • (a) Pairwise distance matrix of a(L) without Φ. • (b) 2-D visualization of the manifold embedding through t-SNE without Φ. • (c) Query images and top 10 nearest images without Φ. • (d-f) Pairwise distance matrix, t-SNE plot, and query images with Φ. Embedding Results 38/81
- 39. • We have proposed a novel methodology, unifying deep learning and manifold learning, called manifold regularized networks (MRnet). • We tested MRnet and confirmed its improved generalization performance underpinned by the proposed manifold loss term on deep architectures. • By exploiting the characteristics of blind spots, the proposed MRnet can be extended to the discovery of true representations on manifolds in various learning tasks. Summary of Topic 1 39/81
- 40. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 40/81
- 41. • Deep Neural Networks (DNN) show human level performance on many recognition tasks. • We focus on class-imbalanced prediction. • Insufficient samples to represent the true distribution of a class. • Q. How can we learn minor but important features using neural networks? • We propose a new RBM training method called boosted CD. • We also devise a regularization term for sparsity of DNA sequences. Motivation negative positive easy to misclassify query images 41/81
- 42. • Genetic information flows through the gene expression process. • DNA: a sequence of four types of nucleotides (A, G, T, C). • Gene: a segment of DNA (the basic unit of heredity). (Splice) Junction Prediction: Extremely Class-Imbalanced Problem [Figure: gene expression, DNA → RNA → protein; exon and intron boundaries marked by GT (or AG), with true and false GT boundaries in an example DNA sequence (ACGTCGACTGCTACGTAGCAGCGA TACGTACCGATCATCACTATCATC GAGGTACGATCGATCGATCGATCA GTCGATCGTCGTTCAGTCAGTCGA TATCAGTCATATGCACATCTCAGT). Labels: 16K; 76M GT (or AG); true sites: 160K (= 0.21% of 76M).] 42/81
- 43. • Two approaches: • (1) Machine learning-based: • ANN (Stormo et al., 1982; Noordewier et al., 1990; Brunak et al., 1991), • SVM (Degroeve et al., 2005; Huang et al., 2006; Sonnenburg et al., 2007), • HMM (Reese et al., 1997; Pertea et al., 2001; Baten et al., 2006). • (2) Sequence alignment-based: • TopHat (Trapnell et al., 2010), MapSplice (Wang et al., 2010), RUM (Grant et al., 2011). Previous Work on Junction Prediction We want to construct a learning model that can boost prediction performance in a way complementary to alignment-based methods. We propose a learning model based on (multilayer) RBMs and its training scheme. 43/81
- 44. • The weights are trained to minimize the negative log-likelihood of the data. • Run the MCMC chain 𝒗(0), 𝒗(1), …, 𝒗(𝑘) for 𝑘 steps. • The CD-𝑘 update after seeing example 𝒗: Contrastive Divergence (CD) for Training RBMs [Figure: the model expectation is approximated by a 𝑘-step Markov chain, 𝒗(0) = 𝒗 → 𝒉(0) → 𝒗(1) → 𝒉(1) → … → 𝒗(𝑘) → 𝒉(𝑘).] 44/81
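The update formula on this slide was lost in extraction. The sketch below shows the commonly used form of the CD-k weight update for a binary RBM, that is, the difference between the data-driven statistics at step 0 and the chain statistics at step k. The use of hidden probabilities rather than samples in those statistics, and all function names, are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    """Draw independent binary units with the given on-probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_weight_update(v0, W, b, c, k=1, rng=None):
    """Return the (unscaled) CD-k weight update for one example v0, k >= 1."""
    rng = rng or random.Random(0)
    ph0 = hidden_probs(v0, W, c)
    v, h = v0, sample(ph0, rng)
    for _ in range(k):                       # k steps of block Gibbs sampling
        v = sample(visible_probs(h, W, b), rng)
        ph = hidden_probs(v, W, c)
        h = sample(ph, rng)
    # Positive phase (data) minus negative phase (k-step reconstruction).
    return [[v0[i] * ph0[j] - v[i] * ph[j] for j in range(len(c))]
            for i in range(len(b))]

W = [[0.0, 0.0] for _ in range(3)]           # 3 visible units, 2 hidden units
delta_W = cd_k_weight_update([1, 0, 1], W, [0.0] * 3, [0.0] * 2, k=1)
print(delta_W)
```

In training, `delta_W` would be multiplied by a learning rate and averaged over a mini-batch before being added to `W`.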
- 45. • Boosting is a meta-algorithm that converts weak learners into strong ones. • Most boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier. • The main variation between boosting algorithms: • The method of weighting training data points and hypotheses. • AdaBoost, LPBoost, TotalBoost, … What Boosting Is from lecture notes @ UC Irvine CS 271 Fall 2007 45/81
- 46. • Contrastive divergence training loops over all mini-batches and is known to be stable. • However, for a class-imbalanced distribution, we need to assign higher weights to rare samples so that the Gibbs chains can jump to unseen examples. Boosted Contrastive Divergence (1/2) [Figure: assign lower weights to ordinary samples and higher weights to rare samples in hardly observed regions.] 46/81
- 47. • If we assign the same weight to all the data, the performance of Gibbs sampling degrades in the regions that are hardly observed. • Whenever sampling, we therefore re-weight each observation by the energy of its reconstruction $E(v_n^{(k)}, h_n^{(k)})$. Boosted Contrastive Divergence (2/2) [Figure: relative locations of samples and the corresponding Markov chains obtained by CD, by PT, and by the proposed method, with hardly observed regions marked.] 47/81
- 48. Relationship between Boosting and Importance Sampling Importance sampling, with target distribution 𝑓 and proposal distribution 𝑔: (a) Samples cannot be drawn conveniently from 𝑓. (b) The importance sampler draws samples from 𝑔. (c) A sample of 𝑓 is obtained by multiplying by 𝑓/𝑔. Correspondingly, in boosted CD: 1. Samples are drawn from 𝑔. 2. A sample of 𝑓 is obtained by multiplying by α. 48/81
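Steps (a) to (c) of importance sampling can be made concrete with a short sketch. It assumes a target f = N(1, 1) and a wider proposal g = N(0, 2²), both chosen only for illustration, and estimates the mean of f from samples of g weighted by f/g (self-normalized importance sampling).

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def importance_mean(num_samples=20000, seed=0):
    """Estimate the mean of the target f using samples from the proposal g."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_sum = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 2.0)                                  # (b) draw from g
        w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)    # (c) weight f/g
        weighted_sum += w * x
        weight_sum += w
    return weighted_sum / weight_sum

estimate = importance_mean()
print(estimate)   # close to 1.0, the mean of the target
```

The per-sample weight plays the role that α plays in boosted CD: samples that are rare under the proposal but important under the target count for more.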
- 49. • Balance equations: • a set of equations that can always be solved to give the equilibrium distribution of a Markov chain (when such a distribution exists). • For a restricted Boltzmann machine (Im et al., ICLR 2015): • For a restricted Boltzmann machine with boosted CD: • On the convergence properties of contrastive divergence (Sutskever et al., AISTATS 2010): • “The CD update is not the gradient of any objective function.”; “The CD update is shown to have at least one fixed point when used with L2 regularization.” Balance Equations for Restricted Boltzmann Machine global balance (or full balance) local balance (or detailed balance) Boosted contrastive divergence inherits the properties of contrastive divergence. 49/81
- 50. • For biological sequences, 1-hot encoding is widely used (Baldi & Brunak, 2001). • A, C, G, and T are encoded by 1000, 0100, 0010, and 0001, respectively. • In the encoded binary vectors, 75% of the elements are zero. • To handle the sparsity of 1-hot encoded vectors, we devise a new regularization technique that incorporates prior knowledge on the sparsity. Categorical Gradient [Figure: the sparsity term, the gradient derived from the sparsity term, and reconstructions with and without the sparsity term.] 50/81
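The encoding described on this slide is simple enough to write down directly. The sketch below uses the code words given above; the function name `encode` is illustrative.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def encode(sequence):
    """Concatenate the 4-bit code of every nucleotide in the sequence."""
    return [bit for base in sequence for bit in ONE_HOT[base]]

encoded = encode("ACGT")
zero_fraction = encoded.count(0) / len(encoded)
print(encoded, zero_fraction)
```

Each nucleotide contributes exactly one 1 and three 0s, so a sequence of length n becomes a 4n-dimensional binary vector in which three quarters of the entries are zero, whatever the sequence is.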
- 52. • To simulate a class-imbalanced situation, we randomly dropped samples with different drop rates for different classes. Results: Effects of Boosting Description Training cost Noise handling Class-imbalance handling CD (Hinton, Neural Comp. 2002) Standard and widely used - - - Persistent CD (Tieleman, ICML 2008) Use of a single Markov chain - - Parallel tempering (Cho et al., IJCNN 2010) Simultaneous Markov chains generation Proposed boosted CD Reweighting samples - 52/81
- 53. • Data preparation: • Real human DNA sequences with known boundary information. • GWH dataset: 2-class (boundary or not). • UCSC dataset: 3-class (acceptor, donor, or non-boundary). Experimental Setup for Junction Prediction Effects of categorical gradient Effects of boosting Effects on the splicing prediction CGTAGCAGCGATACGTACCGATCGTCACTATCATCGAGGTACGAGAGATCGATCGGCAACG true acceptor 1 true donor 1 true acceptor 2 non-canonical true donor false acceptor 1 false donor 1 53/81
- 54. • The proposed method shows the best performance in terms of reconstruction error for both training and testing. • Compared to the softmax approach, the proposed regularized RBM achieves lower error by slightly sacrificing the probability sum constraint. Results: Effects of Categorical Gradient Data: chromosome 19 in GWH-donor Sequence length: 200 nt (800 dimensions) # of iterations: 500 Learning rate: 0.1 L2-decay: 0.001 over-fitted best 54/81
- 55. Results: Improved Performance and Robustness 2-class classification performance 3-class classification Runtime Insensitivity to sequence lengths Robustness to negative samples 55/81
- 56. exon intron • (Important biological finding) non-canonical splicing can arise if: • Introns contain GCA or NAA sequences at their boundaries. • Exons include contiguous A’s around the boundaries. Results: Identification of Non-Canonical Splice Sites We used 162,951 examples excluding canonical splice sites. 56/81
- 57. Summary of Topic 2 • Significant boosts in splicing prediction performance • Robustness to high-dimensional class-imbalanced data • New RBM training method called boosted CD • New penalty term to handle sparsity of DNA sequences • The ability to detect subtle non-canonical splicing signals 57/81
- 58. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 58/81
- 59. • In this paper, we consider the fused Lasso regression (FLR), an important special case of the ℓ1-penalized regression for structured sparsity: • The matrix 𝐷 is the difference matrix on the undirected and unweighted graph of adjacent variables. • Adjacency of the variables is determined by the application. • For a graph that is a 2-D grid, the objective function can be written as • The second penalty function is non-smooth and non-separable. Fused Lasso Regression 59/81
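The two objective functions on this slide were lost in extraction. As a reference point, the usual way of writing the fused Lasso regression and its 2-D grid special case is sketched below; the response $y$, design matrix $X$, coefficients $\beta$, and regularization parameters $\lambda_1, \lambda_2$ follow the symbols used elsewhere in these slides, but the exact scaling and indexing on the original slide are assumptions.

```latex
\begin{align}
\min_{\beta}\;& \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \lVert \beta \rVert_1
  + \lambda_2 \lVert D\beta \rVert_1, \\
\min_{\beta}\;& \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \sum_{i,j} \lvert \beta_{i,j} \rvert
  + \lambda_2 \sum_{i,j} \bigl( \lvert \beta_{i,j} - \beta_{i,j+1} \rvert
  + \lvert \beta_{i,j} - \beta_{i+1,j} \rvert \bigr).
\end{align}
```

Each row of $D$ has one $+1$ and one $-1$ for a pair of adjacent variables, so $\lVert D\beta \rVert_1$ sums the absolute differences over all edges of the graph. Because every $\beta_{i,j}$ appears in several of these differences, the second penalty cannot be split into independent per-coordinate terms.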
- 63. • We want to solve the 2-dimensional fused Lasso regression on multiple GPUs. Overview of Proposed Method • Step 1: fused Lasso. • Step 2: fused Lasso + split Bregman algorithm (an approximation is needed because of the ℓ1-norm). • Step 3: fused Lasso + split Bregman algorithm + PCGLS (accelerates the solution of a linear system). • Step 4: fused Lasso + split Bregman algorithm + PCGLS + FFT (replaces a linear system solver with the FFT). 63/81
- 64. • Split Bregman algorithm for the ℓ1-norm: • Because of the ℓ1-norm, the objective function is non-differentiable. Split Bregman Algorithm for Fused Lasso introducing an auxiliary variable approximating 64/81
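In split Bregman methods, introducing an auxiliary variable for the ℓ1 term typically turns the non-differentiable part into an element-wise problem with a closed-form solution, the soft-thresholding (shrinkage) operator. The sketch below shows that operator only; whether the slides use exactly this form, and the name `shrink`, are assumptions.

```python
def shrink(value, threshold):
    """Soft-thresholding: the minimizer over d of threshold*|d| + 0.5*(d - value)**2."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

print([shrink(v, 1.0) for v in (3.0, 0.5, -3.0)])
```

Values whose magnitude is below the threshold are set exactly to zero, which is how the ℓ1 penalty produces sparse coefficients and sparse differences between neighbouring coefficients.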
- 65. • The conjugate gradient (CG) method iteratively solves a linear system of equations of the form 𝐴𝑥 = 𝑏 when 𝐴 is symmetric and positive definite. PCGLS Algorithm • For least-squares problems, it is well known that (9) is equivalent to solving the normal equation $x = (A^{\top} A)^{-1} A^{\top} b$. • The CG algorithm for least squares is often referred to as CGLS, and its preconditioned counterpart as PCGLS (in this case the scaling amounts to $A^{\top} A \rightarrow M^{-\top} A^{\top} A M^{-1}$). acceleratable 65/81
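For reference, a minimal sketch of the plain conjugate gradient iteration for a small symmetric positive definite system follows. It is unpreconditioned and uses dense Python lists, so it illustrates the iteration only, not the PCGLS or multi-GPU implementation discussed in the slides; the test system is hypothetical.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual b - A x for the initial x = 0
    p = r[:]                       # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:    # converged: residual norm is small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(solution)   # approximately [1/11, 7/11]
```

CGLS applies the same iteration to the normal equations without forming the product of the transposed matrix with the matrix explicitly; each iteration then needs only matrix-vector products, which is the part that parallelizes well.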
- 66. • In mathematics, Poisson's equation is a partial differential equation of elliptic type with broad utility in electrostatics, mechanical engineering and theoretical physics. • Poisson’s equation is frequently written as Poisson’s Equation http://en.wikipedia.org/wiki/Poisson's_equation http://people.rit.edu/~pnveme/ExplictSolutions2/2Dim/Linear/PoissonDisk/PoissonDisk.html 66/81
- 67. • In two-dimensional Cartesian coordinates, it takes the form Poisson’s Equation in 2-Dimensions block tri-diagonal system 67/81
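The equation on this slide did not survive extraction. The standard 2-D Cartesian form of Poisson's equation and its usual five-point finite-difference discretization are sketched below; the unknown is written as $v$ and the right-hand side as $f$, and the uniform grid spacing $h$ is an assumption.

```latex
\begin{align}
\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2} &= f(x, y), \\
\frac{v_{i+1,j} + v_{i-1,j} + v_{i,j+1} + v_{i,j-1} - 4\,v_{i,j}}{h^2} &= f_{i,j}.
\end{align}
```

Ordering the unknowns $v_{i,j}$ row by row turns the second equation into a linear system whose matrix has tridiagonal blocks on its diagonal and identity blocks next to them, which is the block tri-diagonal structure named on the slide.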
- 68. • Mathematical background • Apply the 2-D forward FFT to $f$ to obtain $\hat{f}(k)$, where $k$ is the wave number • Apply the inverse of the Laplace operator to $\hat{f}(k)$ to obtain $\hat{v}(k)$, a simple element-wise division in Fourier space: $\nabla^2 v = f \leftrightarrow -(k_x^2 + k_y^2)\,\hat{v} = \hat{f}$, so $\hat{v} = -\hat{f} / (k_x^2 + k_y^2)$ • Apply the 2-D inverse FFT to $\hat{v}(k)$ to obtain $v$ Poisson’s Equation using the FFT http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/3-CUDA_libraries_+_Matlab.pdf 68/81
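The three steps above can be tried on a toy problem. The sketch below is a 1-D simplification on a periodic domain of length 2π, so that the wave numbers are integers, and it uses a naive O(n²) discrete Fourier transform from the standard library instead of an FFT library; the function names and the test right-hand side are hypothetical.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive discrete Fourier transform (forward, or inverse with 1/n scaling)."""
    n = len(values)
    direction = 1 if inverse else -1
    out = [sum(values[m] * cmath.exp(direction * 2j * math.pi * m * k / n)
               for m in range(n))
           for k in range(n)]
    return [z / n for z in out] if inverse else out

def solve_poisson_periodic(f_values):
    """Solve v'' = f on a periodic grid over [0, 2*pi), with zero-mean v."""
    n = len(f_values)
    f_hat = dft(f_values)                                # step 1: forward transform
    v_hat = []
    for index in range(n):
        k = index if index < n // 2 else index - n       # signed wave number
        # Step 2: divide by -k^2; the k = 0 mode is set to zero (zero mean).
        v_hat.append(0.0 if k == 0 else -f_hat[index] / (k * k))
    return [z.real for z in dft(v_hat, inverse=True)]    # step 3: inverse transform

n = 16
grid = [2 * math.pi * m / n for m in range(n)]
v = solve_poisson_periodic([math.sin(x) for x in grid])
max_error = max(abs(vi + math.sin(x)) for vi, x in zip(v, grid))
print(max_error)   # v should match -sin(x), the exact solution
```

In two dimensions the same division uses the sum of the squared wave numbers in both directions, and the transforms become 2-D FFTs that run efficiently on GPUs.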
- 69. • Pseudocode for two iterative methods: Split Bregman Algorithm for Fused Lasso (1/2) FFT 69/81
- 70. • Multi-GPU operations for matrix-vector computations Split Bregman Algorithm for Fused Lasso (2/2) 70/81
- 71. • The computation times are measured in CPU time with • CPU: Intel Xeon E5-4620 (2.2 GHz) and 16 GB RAM • GPU: NVIDIA GTX Titan (2688 cores, 6 GB GDDR5) • We set the regularization parameters $(\lambda_1, \lambda_2) = (1, 1)$ and stopping criterion is • We generate $n$ samples from a $p$-dimensional $N(0, I_p)$, and the response variable $y$ is generated by $y = X\beta + \epsilon$ with $\epsilon \sim N(0, I_n)$, where 𝛽 = . Experiments 71/81
- 72. • We first considered scenarios with synthetic regression problems where the coefficients were defined on a square grid: • For the very large cases, the average speed-up: 409.19 to 433.23 Runtime Comparison for Piecewise Constant Blocks Cases 72/81
- 73. • For the other cases (n = 12000–24000), the average speed-up: 26.67–47.47 • Circular Gaussian cases are formulated by: Runtime Comparison for Circular Gaussian Cases 73/81
- 74. • Image-based regression of the behavioral fMRI data. • Regression coefficients were overlaid and color-coded on the brain map as described in the text. Structured Sparsity Regression Example 74/81
- 76. • By applying the proposed method extensively to various large-scale datasets, we have successfully demonstrated the following: • Feasibility of highly parallelizable computational algorithms for high-dimensional structured sparse regression problems, • Use case of directly communicating multiple GPUs for speed-up and scalability, • Promise of FFT-based preconditioners for parallel solving of a family of linear systems. • That the highest (433x) speed-up occurred for the highest-dimensional problems clearly indicates where the merit of the multi-GPU scheme lies. • Future work: connecting the dots to deep neural networks • Fused Autoencoder, Multi-layer fused Lasso, … Summary of Topic 3 76/81
- 77. • Achievements • Preliminary • Deep neural networks • Dissertation overview • Adversarial example handling • Manifold regularized deep neural networks using adversarial examples • Class-imbalance handling • Boosted contrastive divergence • Spatial dependency handling • Structured sparsity via parallel fused Lasso • Conclusion • Limitations and future work Outline 77/81
- 78. 1. The MRnet can be applied in a complementary way to generalize neural networks together with traditional techniques such as L2 decay. 2. We propose a novel method for training RBMs for class-imbalanced prediction. Our proposal includes a deep belief network-based methodology for computational splice junction prediction. 3. The parallel fused Lasso can be applied to data that have structured sparsity, such as images, to exploit more prior knowledge than convolutional or recurrent operations. Conclusion This dissertation proposed a set of robust feature learning schemes that can learn meaningful representations underlying large-scale genomic datasets and image datasets using deep networks. 78/81
- 79. • Several directions for future work on the proposed methodologies are possible. • First, we can extend MRnet to extract scaling- and translation-invariant features by replacing the synthetic examples with nearest training samples. • Second, it would also be interesting to alter the objective function of MRnet in order to generalize the whole procedure of MRnet. • Lastly, the proposed three schemes (manifold loss, boosting, and L1 fusion penalty) can be applied within the framework of recurrent neural networks. Limitations and Future Work We need to make the proposed schemes more universal and general. 79/81