This document summarizes optimization techniques for deep learning models, including gradient descent, stochastic gradient descent (SGD), and SGD variants such as momentum, Nesterov's accelerated gradient, AdaGrad, RMSProp, and Adam. It gives an overview of how each technique works and compares their performance on image classification tasks using the MNIST and CIFAR-10 datasets. The document concludes by encouraging attendees to try out the different optimization methods in Keras and by pointing to resources for further deep learning topics.
Roadmap
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
• Feature selection - Yan
• Supervised learning (4 sessions)
• Regression models - Yan
• SVM and kernel SVM - Yan
• Tree-based models - Dario
• Bayesian method - Xiaoyang
• Ensemble models - Yan
• Unsupervised learning (3 sessions)
• K-means clustering
• DBSCAN - Cheng
• Mean shift
• Agglomerative clustering – Kunal
• Spectral clustering – Yan
• Dimension reduction for data visualization - Yan
• Deep learning
• Neural network - Yan
• Convolutional neural network – Hengyang Lu
• Recurrent neural networks – Yan
• Hands-on session with deep nets - Yan
Slides posted on: http://www.slideshare.net/xuyangela
More deep learning coming up!
• Optimization in Deep learning (today’s session)
• Behind AlphaGo
• Mastering the game of Go with deep neural networks and tree search
• Attention network
• Application of Deep Learning and showcase
Outline
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• Variants of SGD
• Use “momentum”
• Nesterov’s Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• Root Mean Square Propagation (RMSProp)
• Adaptive Moment Estimation (Adam)
SGD recommendation
• Randomly shuffle training samples
• Monitor training and validation error
• Experiment with learning rates using a small sample of the training set
• Leverage sparsity of training samples
• Vary the learning rate over the course of training (see the Keras sketch below)
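A minimal Keras sketch of these recommendations, assuming the tf.keras API; the model, toy data, and schedule values below are illustrative and not taken from the slides:

# Minimal sketch (assumed tf.keras API): shuffled mini-batches, monitoring of
# training and validation error, and a simple decaying learning-rate schedule.
import numpy as np
from tensorflow import keras

# Toy data standing in for a real training set (e.g., flattened MNIST digits).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = np.random.randint(0, 10, size=1000)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Halve the learning rate every 5 epochs (one common way to vary it).
def schedule(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 5 == 0 else lr

history = model.fit(
    x_train, y_train,
    epochs=20,
    batch_size=32,
    shuffle=True,          # randomly shuffle training samples each epoch
    validation_split=0.1,  # monitor validation error alongside training error
    callbacks=[keras.callbacks.LearningRateScheduler(schedule)],
)

Comparing the loss and val_loss curves in history after a few epochs on a small sample is a quick way to pick a reasonable starting learning rate before a full run.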
Variants of SGD
• Use “momentum” (update rule sketched below)
• Nesterov’s Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• Root Mean Square Propagation (RMSProp)
• Adaptive Moment Estimation (Adam)
Ref: https://moodle2.cs.huji.ac.il/nu15/pluginfile.php/316969/mod_resource/content/1/adam_pres.pdf
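For the first two variants listed above, here is a minimal NumPy sketch of the momentum and Nesterov (NAG) update rules; the gradient function, learning rate, and momentum coefficient are illustrative assumptions rather than values from the slides:

# Minimal NumPy sketch of the classical momentum and Nesterov (NAG) updates.
# grad(w) stands in for the mini-batch gradient; lr and mu are illustrative.
import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity, then move along it.
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    # NAG: evaluate the gradient at the look-ahead point w + mu * v.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# Tiny usage example on the quadratic loss 0.5 * ||w||^2 (gradient is w).
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = nesterov_step(w, v, grad=lambda w: w)
print(w)  # approaches the minimum at 0

AdaGrad, RMSProp, and Adam instead adapt a per-parameter learning rate; the AdaGrad case is sketched in the section below.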
AdaGrad
Adaptive learning rate:
• Weights that receive high gradients will have their effective learning rate reduced
• Weights that receive small or infrequent updates will have their effective learning rate increased (see the sketch below)
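A minimal NumPy sketch of the AdaGrad update that produces this behaviour; the learning rate and epsilon below are illustrative defaults, not values from the slides:

# Minimal NumPy sketch of AdaGrad: each weight's step is scaled by the inverse
# square root of its accumulated squared gradients, so frequently and strongly
# updated weights get a smaller effective learning rate, while rarely updated
# weights keep a relatively larger one. lr and eps are illustrative defaults.
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    g = grad(w)
    cache = cache + g ** 2                   # per-parameter gradient history
    w = w - lr * g / (np.sqrt(cache) + eps)  # per-parameter effective step size
    return w, cache

# Tiny usage example on the quadratic loss 0.5 * ||w||^2 (gradient is w).
w, cache = np.ones(3), np.zeros(3)
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=lambda w: w, lr=0.1)
print(w)  # shrinks toward the minimum at 0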
More deep learning coming up!
• Optimization in Deep learning (today’s session)
• Behind AlphaGo
• Mastering the game of Go with deep neural networks and tree search
• Attention network
• Application of Deep Learning and showcase
• Any proposal?
Thank you
Slides will be posted at: http://www.slideshare.net/xuyangela
Please leave a group review