CS571: Gradient Descent

Gradient descent is an optimization algorithm used in supervised learning to minimize a loss function. At each iteration it moves the weights a small step in the direction of the negative gradient of the loss with respect to the weights. Stochastic gradient descent is a variant that updates the weights after each training example rather than after the full batch. The perceptron updates the weights only when a prediction is incorrect; it stops changing once the weights classify every training example correctly, which is guaranteed to happen only when the data are linearly separable.

Gradient Descent
Natural Language Processing
Emory University
Jinho D. Choi

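Supervised Learning

Training data: $(X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each $x_i$ is an input and each $y_i$ is its output.
Prediction: $\hat{y} = f(x)$ predicts the output of $x$.
Output $y = \pm 1$: binomial distribution.

Expected risk (unknown!), where $\ell$ is the loss function and $P(x, y)$ is the joint distribution:
$$E(f) = \int \ell(\hat{y}; y) \cdot P(x, y)$$

Empirical risk (minimize!):
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$$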
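The empirical risk above is just an average of per-instance losses, which can be sketched in a few lines. This is a minimal illustration only: the 0-1 loss, the toy predictor `sign_of_first`, and the four data points are assumptions made for the example and do not come from the slides.

```python
def empirical_risk(predict, loss, data):
    """Average loss of `predict` over the (x, y) pairs in `data`."""
    return sum(loss(predict(x), y) for x, y in data) / len(data)


def zero_one_loss(y_hat, y):
    """Hypothetical loss: 1 for a wrong label, 0 for a correct one."""
    return 0.0 if y_hat == y else 1.0


def sign_of_first(x):
    """Hypothetical predictor: the sign of the first feature."""
    return 1 if x[0] >= 0 else -1


# Toy data with outputs y = +1 or -1; the third instance is misclassified.
data = [([2.0], 1), ([-1.0], -1), ([0.5], -1), ([-3.0], -1)]
risk = empirical_risk(sign_of_first, zero_one_loss, data)
print(risk)  # 0.25
```

The expected risk cannot be computed this way because it requires the unknown joint distribution, which is why training minimizes the empirical version instead.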

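Linear Prediction

Linear function, where $x$ stands for the feature vector $\phi(x)$:
$$\hat{y} = f(x) = w^\top \phi(x) = w^\top x$$

Least squares loss:
$$\ell(\hat{y}; y) = \frac{1}{2} (\hat{y} - y)^2 \qquad \ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2$$

Find a weight vector that minimizes the loss:
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$$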
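To make the linear prediction and the least squares loss concrete, here is a small sketch with plain Python lists standing in for vectors. The weight and feature values are made up for illustration.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_loss(w, x, y):
    """Least squares loss 1/2 * (w^T x - y)^2 for one instance."""
    return 0.5 * (dot(w, x) - y) ** 2


w = [0.5, -1.0, 2.0]   # hypothetical weight vector
x = [1.0, 2.0, 1.0]    # hypothetical feature vector
y_hat = dot(w, x)              # 0.5 - 2.0 + 2.0 = 0.5
loss = squared_loss(w, x, 1.0) # 0.5 * (0.5 - 1.0)^2 = 0.125
print(y_hat, loss)
```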

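Gradient Descent

$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$$

Here $\eta_t$ is the learning rate and $\frac{\partial}{\partial w} \ell$ is the derivative of the loss.

Minimize loss: derivative → 0.
Global optimum? Convex optimization.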
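A one-dimensional sketch shows why the derivative going to zero matters for a convex function. The objective $f(w) = (w - 3)^2$, the fixed learning rate, and the step count are all assumptions chosen for the illustration; the slide itself does not give a concrete function.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat w <- w - eta * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Hypothetical convex objective f(w) = (w - 3)^2 with derivative 2 * (w - 3).
# Its only stationary point, w = 3, is also the global minimum.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1, steps=200)
print(w_star)
```

Each step shrinks the distance to the minimum by a constant factor here, so the iterate settles where the derivative vanishes. For a non-convex loss the same procedure could stop at a local optimum instead.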

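Gradient Descent

How often is the weight vector updated?
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$$

For the least squares loss $\ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2$:
$$\frac{\partial}{\partial w} \ell(w, x; y) = \frac{\partial}{\partial w} \frac{1}{2} (w^\top x - y)^2 = (w^\top x - y) x$$

Substituting this derivative gives the batch update:
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i) x_i$$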
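The batch update derived above can be sketched as follows: one weight update per full pass over the data. The toy data set (generated from $y = 1 + 2x$ with a constant bias feature), the constant learning rate, and the epoch count are assumptions for the example.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_gradient(w, data):
    """Average of (w^T x_i - y_i) * x_i over all n instances."""
    n = len(data)
    grad = [0.0] * len(w)
    for x, y in data:
        err = dot(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return grad


def batch_gd(data, eta, epochs):
    """One update per pass: w <- w - eta * (average gradient)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grad = batch_gradient(w, data)
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical data from y = 1 + 2 * x; the first feature is a constant bias.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
w = batch_gd(data, eta=0.1, epochs=2000)
print(w)  # close to [1.0, 2.0]
```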

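Stochastic Gradient Descent

Batch update:
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} (w_t^\top x_i - y_i) x_i$$

Stochastic update, where the weight vector is updated for every instance:
$$w_{t+1} \leftarrow w_t - \eta_t (w_t^\top x_i - y_i) x_i$$

Example trace over four instances, starting from $w_0 \leftarrow 0$:
$w_0^\top x_1 > 0$: $w_1 \leftarrow w_0 - \eta (w_0^\top x_1 + 1) x_1$
$w_1^\top x_2 < 0$: $w_2 \leftarrow w_1 - \eta (w_1^\top x_2 + 1) x_2$
$w_2^\top x_3 < 0$: $w_3 \leftarrow w_2 - \eta (w_2^\top x_3 - 1) x_3$
$w_3^\top x_4 > 0$: $w_4 \leftarrow w_3 - \eta (w_3^\top x_4 - 1) x_4$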
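For contrast with the batch rule, the per-instance update can be sketched like this. The data set, the constant learning rate, and the fixed cyclic order over instances are assumptions for the example; in practice instances are usually shuffled.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_least_squares(data, eta, epochs):
    """Update after every instance: w <- w - eta * (w^T x - y) * x."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = dot(w, x) - y
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical data from y = 1 + 2 * x; the first feature is a constant bias.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
w = sgd_least_squares(data, eta=0.05, epochs=1000)
print(w)  # close to [1.0, 2.0]
```

With four instances this performs four weight updates per pass, compared with one for the batch rule.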

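Perceptron

Stochastic gradient descent:
$$w_{t+1} \leftarrow w_t - \eta_t \nabla \ell$$

Least squares:
$$\ell(w, x; y) = \frac{1}{2} (w^\top x - y)^2 \qquad \nabla \ell = (w^\top x - y) x$$

Perceptron:
$$\ell(w, x; y) = \max\{0, -w^\top x \cdot y\} \qquad \nabla \ell = \begin{cases} -x \cdot y & w^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$

Substituting the perceptron gradient into the stochastic update:
$$w_{t+1} \leftarrow w_t + \eta_t \begin{cases} x \cdot y & w_t^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$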
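A rough sketch of perceptron training follows. Note one deliberate deviation from the slide: the update test here is `<= 0` rather than the strict `< 0`, because with zero-initialized weights the strict test would never fire. The AND-style data set, the learning rate of 1, and the epoch cap are also assumptions.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(data, eta=1.0, max_epochs=100):
    """Add eta * y * x to w whenever an instance is not classified correctly."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            # Assumption: <= 0 instead of the slide's < 0 (see the note above).
            if dot(w, x) * y <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: training has converged
            break
    return w


# Hypothetical linearly separable data (logical AND); first feature is the bias.
data = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], -1),
        ([1.0, 1.0, 0.0], -1), ([1.0, 1.0, 1.0], 1)]
w = train_perceptron(data)
print(w)
```

On data that is not linearly separable the loop would simply run until `max_epochs`, since no weight vector classifies every instance correctly.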

Averaged Perceptron

The final hyperplane may be overfitted to later instances.
Take the average of all hyperplanes, including ones that are not updated.

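Averaged Perceptron

Target, the average of all weight vectors:
$$w \leftarrow \frac{1}{c} \sum_{t=0}^{c-1} w_t$$
Sparse vector?

Initialization: $c \leftarrow 1$

Update rule, if $w_t^\top x \cdot y < 0$:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$
$$v_{t+1} \leftarrow v_t + \eta_t \cdot c (x \cdot y)$$

For every instance: $c \leftarrow c + 1$

Final weights:
$$w \leftarrow w - \frac{1}{c} \cdot v$$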
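The counter-based rule above avoids summing every $w_t$ explicitly. The sketch below implements it and, purely as a cross-check, also keeps a naive running sum of all weight vectors so the two results can be compared. The data set, learning rate, epoch count, and the `<= 0` update test (used so that zero-initialized weights get updated) are assumptions.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def averaged_perceptron(data, eta=1.0, epochs=10):
    """Return (average via the counter trick, naive average of w_0..w_{c-1})."""
    dim = len(data[0][0])
    w = [0.0] * dim      # current weights w_t
    v = [0.0] * dim      # updates weighted by the counter c
    total = [0.0] * dim  # naive sum of every w_t, kept only for the cross-check
    c = 1                # initialization: c <- 1
    for _ in range(epochs):
        for x, y in data:
            if dot(w, x) * y <= 0:  # assumption: <= 0 rather than < 0
                for j, xj in enumerate(x):
                    w[j] += eta * y * xj
                    v[j] += eta * c * y * xj
            c += 1                  # for every instance, updated or not
            for j in range(dim):
                total[j] += w[j]
    averaged = [wj - vj / c for wj, vj in zip(w, v)]
    naive = [tj / c for tj in total]  # w_0 = 0 adds nothing to the sum
    return averaged, naive


# Hypothetical separable data (logical AND); first feature is the bias.
data = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], -1),
        ([1.0, 1.0, 0.0], -1), ([1.0, 1.0, 1.0], 1)]
averaged, naive = averaged_perceptron(data)
print(averaged, naive)
```

The trick only touches `w` and `v` on the features of a misclassified instance, whereas the naive sum touches every weight at every step.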

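Multinomial Perceptron

Binomial: $y \in \{-1, 1\}$, 5 features (including bias)
$$w = [a \;\; b \;\; c \;\; d \;\; e] \qquad x = [1 \;\; 0 \;\; 0 \;\; 1 \;\; 0] \qquad w^\top x = a + d$$
$$\hat{y} = \begin{cases} 1 & w^\top x \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

Multinomial: $y \in \{0, 1, 2, 3\}$
$$w = [a_0 \; a_1 \; a_2 \; a_3 \;\; b_0 \; b_1 \; b_2 \; b_3 \;\; c_0 \; c_1 \; c_2 \; c_3 \;\; d_0 \; d_1 \; d_2 \; d_3 \;\; e_0 \; e_1 \; e_2 \; e_3]$$
$$w_y^\top x = a_y + d_y \qquad \hat{y} = \arg\max_y w_y^\top x$$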
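The multinomial prediction can be sketched by storing one weight vector per class and taking the highest-scoring class. The numeric weights below are invented for the example; only the shape (4 classes, 5 features, and $x = [1, 0, 0, 1, 0]$) follows the slide.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(W, x):
    """Return the class y whose weight vector w_y maximizes w_y^T x."""
    scores = [dot(w_y, x) for w_y in W]
    return max(range(len(W)), key=lambda y: scores[y])


# One row per class y in {0, 1, 2, 3}; columns are the features a, b, c, d, e.
W = [
    [0.1, 7.0, 0.0, 0.2, 0.0],   # class 0: a0 + d0 is about 0.3
    [0.5, 0.0, 9.0, -0.1, 0.0],  # class 1: a1 + d1 is about 0.4
    [0.2, 0.0, 0.0, 0.9, 0.0],   # class 2: a2 + d2 is about 1.1
    [-0.3, 0.0, 0.0, 0.4, 5.0],  # class 3: a3 + d3 is about 0.1
]
x = [1, 0, 0, 1, 0]  # only features a and d are active
y_hat = predict(W, x)
print(y_hat)  # 2
```

The large weights on the inactive features b, c, and e do not affect the scores, which matches $w_y^\top x = a_y + d_y$.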

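Binomial vs. Multinomial Perceptron

Binomial, if $w_t^\top x \cdot y < 0$:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$

Multinomial, if $y \neq \hat{y}$:
$$w_{y,t+1} \leftarrow w_{y,t} + \eta_t \cdot x$$
$$w_{\hat{y},t+1} \leftarrow w_{\hat{y},t} - \eta_t \cdot x$$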
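The multinomial rule rewards the correct class and penalizes the wrongly predicted one. A minimal training loop under that rule might look like the following; the three-class data set, the learning rate, the epoch cap, and breaking ties toward the lowest class index are assumptions for the example.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(W, x):
    """Highest-scoring class; ties go to the lowest class index."""
    scores = [dot(w_y, x) for w_y in W]
    return max(range(len(W)), key=lambda y: scores[y])


def train_multinomial(data, num_classes, eta=1.0, max_epochs=100):
    """On a mistake: w_y += eta * x and w_yhat -= eta * x."""
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(W, x)
            if y_hat != y:
                mistakes += 1
                for j, xj in enumerate(x):
                    W[y][j] += eta * xj      # pull the true class toward x
                    W[y_hat][j] -= eta * xj  # push the predicted class away
        if mistakes == 0:
            break
    return W


# Hypothetical data: a bias feature followed by one indicator per class.
data = [([1, 1, 0, 0], 0), ([1, 0, 1, 0], 1), ([1, 0, 0, 1], 2)]
W = train_multinomial(data, num_classes=3)
print([predict(W, x) for x, _ in data])  # [0, 1, 2]
```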

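Hinge Loss

Perceptron:
$$\ell(w, x; y) = \max\{0, -w^\top x \cdot y\} \qquad \nabla \ell = \begin{cases} -x \cdot y & w^\top x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$

Hinge loss:
$$\ell(w, x; y) = \max\{0, 1 - w^\top x \cdot y\} \qquad \nabla \ell = \begin{cases} -x \cdot y & w^\top x \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}$$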
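The only difference between the two losses is the threshold on $w^\top x \cdot y$, which a short sketch makes visible. The helper name `subgradient` and its `margin` parameter are hypothetical, introduced here to cover both cases with one function; the numbers are illustrative.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_loss(w, x, y):
    return max(0.0, -dot(w, x) * y)


def hinge_loss(w, x, y):
    return max(0.0, 1.0 - dot(w, x) * y)


def subgradient(w, x, y, margin):
    """-x * y if w^T x * y < margin, else 0 (margin 0: perceptron, 1: hinge)."""
    if dot(w, x) * y < margin:
        return [-xj * y for xj in x]
    return [0.0] * len(x)


# A correctly classified instance with a small score: w^T x * y = 0.5.
w, x, y = [1.0, 0.0], [0.5, 2.0], 1
print(perceptron_loss(w, x, y), subgradient(w, x, y, margin=0))  # 0.0 [0.0, 0.0]
print(hinge_loss(w, x, y), subgradient(w, x, y, margin=1))       # 0.5 [-0.5, -2.0]
```

The perceptron is satisfied as soon as the instance is on the correct side, while the hinge loss keeps pushing until the score reaches 1.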

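Adaptive Gradient Descent

Update condition:
Perceptron: if $w_t^\top x \cdot y < 0$
Hinge loss: if $w_t^\top x \cdot y < 1$

Standard update:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$

Adaptive update, where the product $x \circ x$, the square root, and the division are applied elementwise:
$$g_{t+1} \leftarrow g_t + x \circ x$$
$$w_{t+1} \leftarrow w_t + \frac{\eta}{\rho + \sqrt{g_{t+1}}} \cdot (x \cdot y)$$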
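A single adaptive step can be sketched as below. The function covers only the update itself; deciding whether to update (the perceptron or hinge condition) is left to the caller. The values of `eta` and `rho` and the example instance are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def adagrad_update(w, g, x, y, eta=1.0, rho=1.0):
    """Apply one adaptive step in place: accumulate x*x, then scale the step."""
    for j, xj in enumerate(x):
        g[j] += xj * xj                                    # g <- g + x o x
        w[j] += eta / (rho + math.sqrt(g[j])) * (xj * y)   # per-feature step size
    return w, g


w, g = [0.0, 0.0], [0.0, 0.0]
adagrad_update(w, g, x=[3.0, 4.0], y=1)
first = list(w)                 # about [3 / (1 + 3), 4 / (1 + 4)] = [0.75, 0.8]
adagrad_update(w, g, x=[3.0, 4.0], y=1)
second_step = [b - a for a, b in zip(first, w)]
print(first, second_step)
```

Seeing the same instance a second time produces a smaller step, because the accumulated $g$ has grown. A positive `rho` also keeps the denominator away from zero for features that have never been active.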

Multinomial Perceptron

A binomial distribution requires 1 hyperplane to separate 2 classes. How many for m classes?
A multinomial distribution requires m hyperplanes to separate m classes.
