This document summarizes optimization techniques for deep learning models, including gradient descent, stochastic gradient descent (SGD), and SGD variants such as momentum, Nesterov's accelerated gradient, AdaGrad, RMSProp, and Adam. It gives an overview of how each technique works and compares their performance on image classification tasks using the MNIST and CIFAR-10 datasets. The document concludes by encouraging attendees to try out the different optimization methods in Keras and by pointing to resources for further deep learning topics.
Roadmap
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
• Feature selection - Yan
• Supervised learning (4 sessions)
• Regression models - Yan
• SVM and kernel SVM - Yan
• Tree-based models - Dario
• Bayesian method - Xiaoyang
• Ensemble models - Yan
• Unsupervised learning (3 sessions)
• K-means clustering
• DBSCAN - Cheng
• Mean shift
• Agglomerative clustering – Kunal
• Spectral clustering – Yan
• Dimension reduction for data visualization - Yan
• Deep learning
• Neural network - Yan
• Convolutional neural network – Hengyang Lu
• Recurrent neural networks – Yan
• Hands-on session with deep nets - Yan
Slides posted on: http://www.slideshare.net/xuyangela
More deep learning coming up!
• Optimization in Deep learning (today’s session)
• Behind AlphaGo
• Mastering the game of Go with deep neural networks and tree search
• Attention network
• Application of Deep Learning and showcase
Outline
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• Variants of SGD
• Use “momentum”
• Nesterov’s Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• Root Mean Square Propagation (RMSProp)
• Adaptive Moment Estimation (Adam)
SGD recommendation
• Randomly shuffle training samples
• Monitor training and validation error
• Experiment with learning rates using a small sample of the training set
• Leverage sparsity of training samples
• Vary the learning rate over the course of training (see the Keras sketch below)
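A minimal Keras sketch of these recommendations, assuming the tf.keras API; the model, toy data, and schedule values below are illustrative and not taken from the slides:

# Minimal sketch (assumed tf.keras API): shuffled mini-batches, monitoring of
# training and validation error, and a simple decaying learning-rate schedule.
import numpy as np
from tensorflow import keras

# Toy data standing in for a real training set (e.g., flattened MNIST digits).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = np.random.randint(0, 10, size=1000)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Halve the learning rate every 5 epochs (one common way to vary it).
def schedule(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 5 == 0 else lr

history = model.fit(
    x_train, y_train,
    epochs=20,
    batch_size=32,
    shuffle=True,          # randomly shuffle training samples each epoch
    validation_split=0.1,  # monitor validation error alongside training error
    callbacks=[keras.callbacks.LearningRateScheduler(schedule)],
)

Comparing the loss and val_loss curves in history after a few epochs on a small sample is a quick way to pick a reasonable starting learning rate before a full run.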
Variants of SGD
• Use “momentum” (update rule sketched below)
• Nesterov’s Accelerated Gradient (NAG)
• Adaptive Gradient (AdaGrad)
• Root Mean Square Propagation (RMSProp)
• Adaptive Moment Estimation (Adam)
Ref: https://moodle2.cs.huji.ac.il/nu15/pluginfile.php/316969/mod_resource/content/1/adam_pres.pdf
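For the first two variants listed above, here is a minimal NumPy sketch of the momentum and Nesterov (NAG) update rules; the gradient function, learning rate, and momentum coefficient are illustrative assumptions rather than values from the slides:

# Minimal NumPy sketch of the classical momentum and Nesterov (NAG) updates.
# grad(w) stands in for the mini-batch gradient; lr and mu are illustrative.
import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity, then move along it.
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    # NAG: evaluate the gradient at the look-ahead point w + mu * v.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# Tiny usage example on the quadratic loss 0.5 * ||w||^2 (gradient is w).
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = nesterov_step(w, v, grad=lambda w: w)
print(w)  # approaches the minimum at 0

AdaGrad, RMSProp, and Adam instead adapt a per-parameter learning rate; the AdaGrad case is sketched in the section below.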
AdaGrad
Adaptive learning rate:
• Weights that receive high gradients will have their effective learning rate reduced
• Weights that receive small or infrequent updates will have their effective learning rate increased (see the sketch below)
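A minimal NumPy sketch of the AdaGrad update that produces this behaviour; the learning rate and epsilon below are illustrative defaults, not values from the slides:

# Minimal NumPy sketch of AdaGrad: each weight's step is scaled by the inverse
# square root of its accumulated squared gradients, so frequently and strongly
# updated weights get a smaller effective learning rate, while rarely updated
# weights keep a relatively larger one. lr and eps are illustrative defaults.
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    g = grad(w)
    cache = cache + g ** 2                   # per-parameter gradient history
    w = w - lr * g / (np.sqrt(cache) + eps)  # per-parameter effective step size
    return w, cache

# Tiny usage example on the quadratic loss 0.5 * ||w||^2 (gradient is w).
w, cache = np.ones(3), np.zeros(3)
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=lambda w: w, lr=0.1)
print(w)  # shrinks toward the minimum at 0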
More deep learning coming up!
• Optimization in Deep learning (today’s session)
• Behind AlphaGo
• Mastering the game of Go with deep neural networks and tree search
• Attention network
• Application of Deep Learning and showcase
• Any proposal?
Thank you
Slides will be posted at: http://www.slideshare.net/xuyangela
Please leave a group review