
ICML2015 Slides

Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions
http://proceedings.mlr.press/v37/leeb15.pdf


  1. Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions
     Taehoon Lee and Sungroh Yoon
     Advanced Computing Laboratory, Electrical and Computer Engineering, Seoul National University
  2. Outline
     • Motivation
     • Preliminary
     • Boosted contrastive divergence
     • Categorical restricted Boltzmann machine
     • Experimental results
     • Conclusion
  3. Motivation
     • Deep neural networks (DNNs) show human-level performance on many recognition tasks.
     • We focus on class-imbalanced prediction, where there are insufficient samples to represent the true distribution of a class.
     • [Figure: negative vs. positive samples; query images near the sparse positive class are easy to misclassify.]
     • Q: How can we learn minor but important features using neural networks?
     • We propose a new RBM training method called boosted CD.
     • We also devise a regularization term for the sparsity of DNA sequences.
  4. (Splice) Junction Prediction: An Extremely Class-Imbalanced Problem
     • Genetic information flows through the gene expression process: DNA → RNA → protein.
     • DNA: a sequence of four types of nucleotides (A, G, T, C).
     • Gene: a segment of DNA (the basic unit of heredity).
     • [Figure: exon/intron structure with an example sequence showing true and false GT boundaries; among 76M GT (or AG) candidate sites, only 160K (0.21%) are true splice sites.]
  5. Previous Work on Junction Prediction
     • Two approaches:
       1. Machine learning-based: ANN (Stormo et al., 1982; Noordewier et al., 1990; Brunak et al., 1991); SVM (Degroeve et al., 2005; Huang et al., 2006; Sonnenburg et al., 2007); HMM (Reese et al., 1997; Pertea et al., 2001; Baten et al., 2006).
       2. Sequence alignment-based: TopHat (Trapnell et al., 2010), MapSplice (Wang et al., 2010), RUM (Grant et al., 2011).
     • We want to construct a learning model that can boost prediction performance in a way complementary to alignment-based methods.
     • We propose a learning model based on (multilayer) RBMs, together with its training scheme.
  6. Related Methodologies
     • Training methods for RBMs:
       • CD (Hinton, Neural Comp. 2002): the standard, widely used method; no special noise or class-imbalance handling.
       • Persistent CD (Tieleman, ICML 2008): maintains a single persistent Markov chain; no class-imbalance handling.
       • Parallel tempering (Cho et al., IJCNN 2010): generates simultaneous Markov chains; no class-imbalance handling.
     • RBMs for categorical values: softmax input units (Salakhutdinov et al., ICML 2007).
     • Class-imbalance problems: see the review by Galar et al. (IEEE Trans. SMC, 2012).
  7. Main Contributions
     • A new RBM training method called boosted CD.
     • A new penalty term to handle the sparsity of DNA sequences.
     • Significant boosts in splicing prediction performance.
     • Robustness to high-dimensional class-imbalanced data.
     • The ability to detect subtle non-canonical splicing signals.
  8. Outline
     • Motivation
     • Preliminary
     • Boosted contrastive divergence
     • Categorical restricted Boltzmann machine
     • Experimental results
     • Conclusion
  9. Restricted Boltzmann Machines
     • An RBM is a type of logistic belief network whose structure is a bipartite graph.
     • Nodes: an input (visible) layer 𝒗 and a hidden layer 𝒉.
     • Probability of a configuration (𝒗, 𝒉): p(𝒗, 𝒉) = exp(−E(𝒗, 𝒉))/Z, with energy E(𝒗, 𝒉) = −𝒃⊤𝒗 − 𝒄⊤𝒉 − 𝒗⊤W𝒉.
     • Each node is a stochastic binary unit: p(hⱼ = 1 | 𝒗) = σ(cⱼ + Σᵢ Wᵢⱼvᵢ) and p(vᵢ = 1 | 𝒉) = σ(bᵢ + Σⱼ Wᵢⱼhⱼ).
     • p(𝒉 | 𝒗) can be used as a feature.
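The slide elides the derivation of the stochastic-unit activations; for completeness, here is the standard argument (textbook RBM material, not anything specific to this paper). Because the energy is bilinear and the graph is bipartite, the hidden units are conditionally independent given 𝒗, and each one is sigmoid-activated:

```latex
p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}),
\qquad
p(h_j = 1 \mid \mathbf{v})
  = \frac{e^{\,c_j + \sum_i W_{ij} v_i}}{1 + e^{\,c_j + \sum_i W_{ij} v_i}}
  = \sigma\Big(c_j + \sum_i W_{ij} v_i\Big),
```

with the symmetric expression for p(vᵢ = 1 | 𝒉).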
  10. Contrastive Divergence (CD) for Training RBMs
      • Train the weights to minimize the negative log-likelihood of the data.
      • Run the MCMC chain 𝒗(0), 𝒗(1), …, 𝒗(𝑘) for 𝑘 steps, starting from 𝒗(0) = 𝒗.
      • The CD-𝑘 update after seeing example 𝒗: ΔW ∝ ⟨𝒗𝒉⊤⟩₀ − ⟨𝒗𝒉⊤⟩ₖ, i.e., the intractable model expectation in the log-likelihood gradient is approximated by the 𝑘-step Markov chain.
      • [Figure: alternating Gibbs chain 𝒗(0) → 𝒉(0) → 𝒗(1) → 𝒉(1) → … → 𝒗(𝑘) → 𝒉(𝑘).]
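A minimal NumPy sketch of CD-𝑘 for a binary RBM may make the update concrete. This is our illustration of the standard algorithm, not code from the paper; the array shapes, learning rate, and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(W, b, c, v0, k=1, lr=0.1):
    """One CD-k update for a binary RBM.

    W: (n_visible, n_hidden) weights; b: visible biases; c: hidden biases;
    v0: (batch, n_visible) mini-batch of binary data.
    """
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(c + v0 @ W)
    h = (rng.random(ph0.shape) < ph0).astype(float)

    # Run the alternating Gibbs chain v(0) -> h(0) -> ... -> v(k) -> h(k).
    v = v0
    for _ in range(k):
        pv = sigmoid(b + h @ W.T)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(c + v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)

    # CD-k gradient: data statistics minus k-step reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v.T @ ph) / n
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (ph0 - ph).mean(axis=0)
    return v, h  # the k-step reconstruction and its hidden sample

# Toy usage: 8 visible units, 4 hidden units, a batch of random bits.
W = 0.01 * rng.standard_normal((8, 4))
b, c = np.zeros(8), np.zeros(4)
v_batch = (rng.random((16, 8)) < 0.5).astype(float)
cd_k(W, b, c, v_batch, k=1)
```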
  11. Outline
      • Motivation
      • Preliminary
      • Boosted contrastive divergence
      • Categorical restricted Boltzmann machine
      • Experimental results
      • Conclusion
  12. Overview of Proposed Methodology
      [Figure: pipeline overview, revealed step by step over six build slides.]
  13. What Boosting Is
      • Boosting is a meta-algorithm that converts weak learners into strong ones.
      • Most boosting algorithms iteratively learn weak classifiers with respect to a weighting distribution over the data and add them to a final strong classifier.
      • The main variation among boosting algorithms is the method of weighting training data points and hypotheses: AdaBoost, LPBoost, TotalBoost, … (see the example below).
      (From lecture notes, UC Irvine CS 271, Fall 2007.)
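As a concrete instance of such a weighting scheme (our example; the deck does not spell one out), AdaBoost's round-𝑡 update computes the weighted error of the weak classifier, its vote, and the new data-point weights:

```latex
\epsilon_t = \sum_{i=1}^{N} w_i^{(t)}\,\mathbb{1}\!\left[h_t(x_i) \neq y_i\right],
\qquad
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
w_i^{(t+1)} = \frac{w_i^{(t)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t},
```

so misclassified points gain weight, and the final strong classifier is H(x) = sign(Σₜ αₜ hₜ(x)).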
  14. Boosted Contrastive Divergence (1/2)
      • Contrastive divergence training loops over all mini-batches and is known to be stable.
      • However, under a class-imbalanced distribution, we need to assign higher weights to rare samples so that the Gibbs chains can jump to otherwise unseen examples.
      • [Figure: a data distribution with hardly observed regions; the proposed scheme assigns higher weights to rare samples and lower weights to ordinary samples.]
  15. Boosted Contrastive Divergence (2/2)
      • If we assign the same weight to all of the data, the performance of Gibbs sampling degrades in the regions that are hardly observed.
      • Whenever we sample, we therefore re-weight each observation by the energy of its reconstruction, 𝐸(𝒗ₙ⁽ᵏ⁾, 𝒉ₙ⁽ᵏ⁾); a sketch follows below.
      • [Figure: relative locations of samples and the corresponding Markov chains under plain CD, parallel tempering (PT), and the proposed method.]
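A minimal NumPy sketch of the re-weighting idea, under our assumptions: the weight of sample 𝑛 grows with the energy of its 𝑘-step reconstruction, normalized over the mini-batch. The exponential form and the temperature parameter are our choices for illustration; the paper defines the exact weighting function.

```python
import numpy as np

def rbm_energy(W, b, c, v, h):
    """Per-sample RBM energy E(v, h) = -b^T v - c^T h - v^T W h."""
    return -(v @ b) - (h @ c) - np.einsum("ni,ij,nj->n", v, W, h)

def boosted_weights(W, b, c, vk, hk, temperature=1.0):
    """Weight each sample by the energy of its k-step reconstruction:
    poorly modeled (typically rare) samples have high energy and thus
    receive larger weights. Normalized so the mean weight is 1."""
    e = rbm_energy(W, b, c, vk, hk)
    w = np.exp((e - e.max()) / temperature)  # subtract max for stability
    return w * (len(w) / w.sum())

# The weights then scale each sample's contribution to the CD statistics,
# e.g. W += lr * ((w[:, None] * v0).T @ ph0 - (w[:, None] * vk).T @ phk) / n
```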
  16. Categorical Gradient
      • For biological sequences, 1-hot encoding is widely used (Baldi & Brunak, 2001): A, C, G, and T are encoded as 1000, 0100, 0010, and 0001, respectively.
      • In the encoded binary vectors, 75% of the elements are zero.
      • To resolve the sparsity of 1-hot encoded vectors, we devise a new regularization technique that incorporates prior knowledge of this sparsity (see the sketch below).
      • [Figure: reconstructions with and without the sparsity term, and the update rule derived from the sparsity term.]
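A small sketch of the encoding and of one plausible form of the sparsity penalty. The penalty shown, pushing the four reconstruction probabilities at each base position to sum to one, is our reading of the idea, not necessarily the paper's exact term:

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string 4 bits per base: A->1000, C->0100, G->0010,
    T->0001. Exactly 75% of the resulting entries are zero."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [NUC[ch] for ch in seq]] = 1.0
    return x.ravel()

def sparsity_penalty_grad(pv):
    """Gradient of a categorical sparsity penalty on the visible
    reconstruction probabilities pv (flat, length 4L): penalize each
    4-way group whose probabilities do not sum to 1,
    penalty = sum_g (sum_{i in g} pv_i - 1)^2."""
    groups = pv.reshape(-1, 4)
    residual = groups.sum(axis=1, keepdims=True) - 1.0
    return np.repeat(2.0 * residual, 4, axis=1).ravel()

x = one_hot("GTAAGT")            # a donor-like 6-mer, for illustration
g = sparsity_penalty_grad(x)     # all zeros: each group already sums to 1
```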
  17. Proposed Training Algorithm
      [Figure: the full training algorithm, combining the categorical gradient with boosted CD.]
  18. Outline
      • Motivation
      • Preliminary
      • Boosted contrastive divergence
      • Categorical restricted Boltzmann machine
      • Experimental results
      • Conclusion
  19. Results
      • Data preparation: real human DNA sequences with known boundary information.
        • GWH dataset: 2-class (boundary or not).
        • UCSC dataset: 3-class (acceptor, donor, or non-boundary).
      • Experiments cover the effects of the categorical gradient, the effects of boosting, and the effects on splicing prediction (a sketch of the candidate-window preparation follows below).
      • [Figure: the sequence CGTAGCAGCGATACGTACCGATCGTCACTATCATCGAGGTACGAGAGATCGATCGGCAACG annotated with true acceptors 1 and 2, true donor 1, a non-canonical true donor, and false acceptor/donor sites.]
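For concreteness, a hedged sketch of how candidate donor windows could be extracted: every GT dinucleotide is a candidate boundary, a fixed-length window around it becomes one example, and the label comes from the annotation. The function name, flank size, and toy data are ours; the paper's GWH/UCSC preparation defines the actual protocol.

```python
def donor_candidates(genome, true_donor_positions, flank=100):
    """Yield (window, label) for every GT occurrence; with flank=100 the
    windows are 202nt, close to the 200nt inputs used in the slides."""
    truths = set(true_donor_positions)
    for i in range(flank, len(genome) - flank - 2):
        if genome[i : i + 2] == "GT":
            window = genome[i - flank : i + flank + 2]
            yield window, (i in truths)

# Toy example: one annotated donor at position 8 of a repetitive sequence.
toy = "ACGTCGAGGTACGATC" * 20
pairs = list(donor_candidates(toy, {8}, flank=5))
n_true = sum(label for _, label in pairs)  # 1 true among many GT sites
```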
  20. Results: Effects of Categorical Gradient
      • The proposed method shows the best performance in terms of reconstruction error for both training and testing.
      • Compared to the softmax approach, the proposed regularized RBM achieves a lower error by slightly sacrificing the probability-sum constraint.
      • Settings: chromosome 19 of GWH-donor; sequence length 200nt (800 dimensions); 500 iterations; learning rate 0.1; L2-decay 0.001.
      • [Figure: training/testing reconstruction-error curves, annotated "over-fitted" and "best".]
  21. Results: Effects of Boosting
      • To simulate a class-imbalance situation, we randomly dropped samples, using different drop rates for different classes.
      • Comparison (extending the table from slide 6):
        • CD (Hinton, Neural Comp. 2002): standard and widely used.
        • Persistent CD (Tieleman, ICML 2008): uses a single Markov chain.
        • Parallel tempering (Cho et al., IJCNN 2010): simultaneous Markov chain generation.
        • Proposed boosted CD: reweights samples; handles class imbalance.
  22. Results: Improved Performance and Robustness
      [Figures: 2-class classification performance, 3-class classification, runtime, insensitivity to sequence lengths, and robustness to negative samples.]
  23. Results: Identification of Non-Canonical Splice Sites
      • (Important biological finding) Non-canonical splicing can arise if:
        • introns contain GCA or NAA sequences at their boundaries, or
        • exons include contiguous A's around the boundaries.
      • We used 162,951 examples, excluding canonical splice sites.
      • [Figure: exon/intron boundary motifs.]
  24. Conclusion
      • We proposed a new RBM training method, boosted CD with categorical gradients, which improves conventional CD on class-imbalanced data.
      • Significant boosts in splicing prediction in terms of accuracy and runtime.
      • Increased robustness to high-dimensional class-imbalanced data.
      • The proposed scheme can detect subtle non-canonical splicing signals that often cannot be identified by traditional methods.
      • Future work: additional validation on various class-imbalanced datasets.
  25. Acknowledgements
      • Our lab members
      • Financial support
      • ICML 2015 travel scholarship
      June 2, 2015
  26. Backup: Comparison with Recurrent Neural Networks (RNNs)
      • The proposed DBN showed xx% higher performance in terms of the F1-score.
      • RNNs are appropriate for sequence modeling; however, splicing signals often lie too far from the boundaries, making it hard for an RNN to retain the splicing information.
      (Placeholder slide: figure "to be placed".)
