03. Linear Regression
Jeonghun Yoon
Last time..... the Naive Bayes Classifier

$$\hat{y} = \arg\max_{y} P(x_1, \dots, x_d \mid y)\, P(y) = \arg\max_{y} \prod_{i=1}^{d} P(x_i \mid y)\, P(y)$$

We take the product of the probability of class $y$ and the probability, given class $y$, of each element $x_i$ of the feature vector (in the document example, each word), estimated from the data labeled $y$ in the training set.

ex) To decide whether (I, love, you) is spam or not, we compare
the proportion of spam in the training set multiplied by the probabilities of I, love, and you occurring in documents labeled spam, against
the proportion of ham in the training set multiplied by the probabilities of I, love, and you occurring in documents labeled ham.
Things left unfinished last time...
1. Laplacian Smoothing (see the appendix)
2. MLE / MAP
Bayes' Rule

$$p(\theta \mid \mathbb{x}) = \frac{p(\mathbb{x} \mid \theta)\, p(\theta)}{\int p(\mathbb{x} \mid \theta)\, p(\theta)\, d\theta}$$

posterior $p(\theta \mid \mathbb{x})$, likelihood $p(\mathbb{x} \mid \theta)$, prior $p(\theta)$

Posterior probability: the probability of the parameter, computed after the observations have been made.
Prior probability: the probability of the parameter, set before the observations are made.
Likelihood: the probability that the observations occur, given the value of the parameter.
Maximum Likelihood Estimate

The likelihood is defined as follows: given the parameter $\theta$, it is the probability of obtaining the observed data set $\mathbb{x} = (x_1, \dots, x_n)$,

$$\mathcal{L}(\theta) = p(\mathbb{x} \mid \theta).$$

It is a function of $\theta$; it is not a pdf of $\theta$.
The Maximum Likelihood Estimate is defined as follows: the MLE is the $\theta$ under which the probability of obtaining the observed data set $\mathbb{x} = (x_1, \dots, x_n)$ is largest,

$$\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} p(\mathbb{x} \mid \theta).$$

(Figure: the likelihood values $p(\mathbb{x} \mid \theta_1)$, $p(\mathbb{x} \mid \theta_2)$, $p(\mathbb{x} \mid \theta_3)$ are compared; in the picture $\hat{\theta} = \theta_2$ attains the maximum.)
When we know the likelihood function $p(\mathbb{x} \mid \theta)$ and the prior $p(\theta)$, Bayes' rule lets us evaluate the posterior:

$$p(\theta \mid \mathbb{x}) \propto p(\mathbb{x} \mid \theta)\, p(\theta)$$

Maximum A Posteriori Estimate

$$p(\theta \mid \mathbb{x}) = \frac{p(\mathbb{x} \mid \theta)\, p(\theta)}{\int p(\mathbb{x} \mid \theta)\, p(\theta)\, d\theta}$$

posterior $p(\theta \mid \mathbb{x})$, likelihood $p(\mathbb{x} \mid \theta)$, prior $p(\theta)$
(Figures: the likelihood $p(\mathbb{x} \mid \theta)$ is combined with the prior $p(\theta)$ to give the posterior $p(\theta \mid \mathbb{x}) \propto p(\mathbb{x} \mid \theta)\, p(\theta)$.)

The MAP estimate is the $\theta$ that maximizes the posterior:

$$\hat{\theta} = \arg\max_{\theta} p(\theta \mid \mathbb{x})$$
Regression

Example 1)
I am the CEO of a large shoe company with many branch stores.
I want to open a new branch. In which region should I open it?
It would be a great help if I could estimate the expected profit of each region where I am considering opening a branch!
The data I have are the profits of each branch and the population of the region each branch is in.
The solution: Linear Regression!
With it, once I know the population of a new region, I can estimate the expected profit for that region.
Example 2)
I have just moved to Pittsburgh.
I want to get an apartment at the most reasonable price.
These are the things I consider when buying a house:
square feet, the number of bedrooms, the distance to school, ...
How much would a house with the size and number of bedrooms I want actually cost?
① Given an input $x$ we would like to compute an output $y$.
(When I input the size of the house I want and the number of rooms, compute a predicted house price.)
② For example:
1) Predict height from age (height = $y$, age = $x$)
2) Predict Google`s price from Yahoo`s price (Google's price = $y$, Yahoo's price = $x$)

$$y = \theta_0 + \theta_1 x$$

In other words, if we find the line ($y = \theta_0 + \theta_1 x$) from the existing data (learning, training), then, when a new value $x_{new}$ is given, we can predict the corresponding value of $y$ (prediction)!
Input: size of the house ($x_1$), number of rooms ($x_2$), distance to school ($x_3$), ...
$(x_1, x_2, \dots, x_n)$: feature vector
Output: house price ($y$)

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

Learned (learning) from the training set.
Simple Linear Regression

Given the $i$-th observation $(y_i, x_i)$, the simple regression model is

$$y_i = \theta_0 + \theta_1 x_i + \epsilon_i$$

$\epsilon_i$: the difference (error) between the regression line we want to find and the actually observed $y_i$ at the $i$-th observation point. (The figure marks $\epsilon_3$ as an example.)

We want to find the line that makes the sum of these errors as small as possible, i.e., we want to estimate the $\theta_0$ and $\theta_1$ that do so!

How? The Least Squares Method:

$$\min \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2 = \min \sum_i \epsilon_i^2$$

Here $y_i$ is the actual observed value and $\theta_0 + \theta_1 x_i$ is the value on the regression line $y = \theta_0 + \theta_1 x$ (the ideal value); $y$ is the dependent variable and $x$ is the explanatory (independent) variable.
min ๐‘ฆ๐‘– โˆ’ ๐œƒ0 + ๐œƒ1 ๐‘ฅ๐‘–
2
๐‘–
= min ๐œ–๐‘–
2
๐‘–
์‹ค์ œ ๊ด€์ธก ๊ฐ’ ํšŒ๊ท€ ์ง์„ ์˜ ๊ฐ’(์ด์ƒ์ ์ธ ๊ฐ’)
์œ„์˜ ์‹์„ ์ตœ๋Œ€ํ•œ ๋งŒ์กฑ ์‹œํ‚ค๋Š” ๐œƒ0, ๐œƒ1์„ ์ถ”์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋ฌด์—‡์ผ๊นŒ?
(์ด๋Ÿฌํ•œ ๐œƒ1, ๐œƒ2๋ฅผ ๐œƒ1, ๐œƒ2 ๋ผ๊ณ  ํ•˜์ž.)
- Normal Equation
- Steepest Gradient Descent
ห† ห†
What is the normal equation?

To find a maximum or a minimum, we differentiate the given expression and look for the values that make the derivative 0.

$$\min \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2$$

First, differentiate with respect to $\theta_0$:

$$\frac{\partial}{\partial \theta_0} \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2 = -2 \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right) = 0$$

Next, differentiate with respect to $\theta_1$:

$$\frac{\partial}{\partial \theta_1} \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right)^2 = -2 \sum_i \left( y_i - (\theta_0 + \theta_1 x_i) \right) x_i = 0$$

We only need to find the $\theta_0$, $\theta_1$ that set these two expressions to 0. When we have a system of 2 equations in 2 unknowns like this, we call the system the normal equations.
The normal equation form

Let $\mathbb{x}_i = (1, x_i)^T$, $\Theta = (\theta_0, \theta_1)^T$, $\mathbb{y} = (y_1, y_2, \dots, y_n)^T$, $X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$, and $\mathbb{e} = (\epsilon_1, \dots, \epsilon_n)^T$.

Assume the $n$ observations $(x_i, y_i)$ follow the regression model below:

$$y_1 = \theta_0 + \theta_1 x_1 + \epsilon_1$$
$$y_2 = \theta_0 + \theta_1 x_2 + \epsilon_2$$
$$\vdots$$
$$y_{n-1} = \theta_0 + \theta_1 x_{n-1} + \epsilon_{n-1}$$
$$y_n = \theta_0 + \theta_1 x_n + \epsilon_n$$

In matrix form, $\mathbb{y} = X\Theta + \mathbb{e}$:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
๐œ–๐‘—
2
๐‘›
๐‘—=1
= ๐•– ๐‘‡
๐•– = ๐•ช โˆ’ ๐‘‹ฮ˜ ๐‘‡
(๐•ช โˆ’ ๐‘‹ฮ˜)
= ๐•ช ๐‘‡
๐•ช โˆ’ ฮ˜ ๐‘‡
๐‘‹ ๐‘‡
๐•ช โˆ’ ๐•ช ๐‘‡
๐‘‹ฮ˜ + ฮ˜ ๐‘‡
๐‘‹ ๐‘‡
๐‘‹ฮ˜
= ๐•ช ๐‘‡
๐•ช โˆ’ 2ฮ˜ ๐‘‡
๐‘‹ ๐‘‡
๐•ช + ฮ˜ ๐‘‡
๐‘‹ ๐‘‡
๐‘‹ฮ˜
1 by 1 ํ–‰๋ ฌ์ด๋ฏ€๋กœ
์ „์น˜ํ–‰๋ ฌ์˜ ๊ฐ’์ด ๊ฐ™๋‹ค!
๐œ•(๐•– ๐‘‡
๐•–)
๐œ•ฮ˜
= ๐ŸŽ
๐œ•(๐•– ๐‘‡
๐•–)
๐œ•ฮ˜
= โˆ’2๐‘‹ ๐‘‡
๐•ช + 2๐‘‹ ๐‘‡
๐‘‹ฮ˜ = ๐ŸŽ
๐‘‹ ๐‘‡
๐‘‹๐šฏ = ๐‘‹ ๐‘‡
๐•ช ๐šฏ = ๐‘‹ ๐‘‡
๐‘‹ โˆ’1
๐‘‹ ๐‘‡
๐•ชห†
์ •๊ทœ๋ฐฉ์ •์‹
๐•ช = ๐‘‹ฮ˜ + ๐•– ๐•– = ๐•ช โˆ’ ๐‘‹ฮ˜
Minimize ๐œ–๐‘—
2
๐‘›
๐‘—=1
What is Gradient Descent?

In machine learning, the parameters (in linear regression, $\theta_0$ and $\theta_1$) are usually vectors with tens to hundreds of dimensions. Moreover, there is no guarantee that the objective function (in linear regression, $\sum_i \epsilon_i^2$) is differentiable everywhere.
So there are quite a few situations where the solution cannot be obtained by a single algebraic derivation.
In such cases we use a numerical method that starts from an initial solution and improves it iteratively. (Derivatives are used.)
What is Gradient Descent?

(Flowchart: set an initial solution $\alpha_0$ and $t = 0$; if $\alpha_t$ is satisfactory, stop and return $\hat{\alpha} = \alpha_t$ (Yes); otherwise (No) update $\alpha_{t+1} = U(\alpha_t)$, set $t = t + 1$, and check again.)
What is Gradient Descent?

Gradient Descent
From the current position, find the direction in which the slope descends most steeply,
move a small step in that direction, and take that as the new position.
By repeating this process we move toward the lowest point (i.e., the minimum).

Gradient Ascent
From the current position, find the direction in which the slope ascends most steeply,
move a small step in that direction, and take that as the new position.
By repeating this process we move toward the highest point (i.e., the maximum).
What is Gradient Descent?

Gradient Descent

$$\alpha_{t+1} = \alpha_t - \rho \left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$$

$J$ = the objective function; $\left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$ is the value of the derivative $\frac{\partial J}{\partial \alpha}$ at $\alpha_t$.

(Figure: moving from $\alpha_t$ to $\alpha_{t+1}$ by the step $-\rho \left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$. At $\alpha_t$ the derivative is negative, so adding $\left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$ would move us to the left, i.e., in the direction where the objective increases. We therefore subtract $\left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$, and multiply by a suitable $\rho$ so that we move only a little.)
What is Gradient Descent?

Gradient Descent
$$\alpha_{t+1} = \alpha_t - \rho \left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$$

Gradient Ascent
$$\alpha_{t+1} = \alpha_t + \rho \left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$$

$J$ = the objective function; $\left.\frac{\partial J}{\partial \alpha}\right|_{\alpha_t}$ is the value of the derivative $\frac{\partial J}{\partial \alpha}$ at $\alpha_t$.

Gradient Descent and Gradient Ascent are typical greedy algorithms: without considering the past or the future, they pick the next position that looks most favorable from the current one, so they may end up at a local optimal point.
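A minimal sketch of the update $\alpha_{t+1} = \alpha_t - \rho\, \partial J/\partial \alpha$ on a simple one-dimensional objective (the objective, step size, and starting point are chosen only for illustration):

```python
# Gradient descent on J(a) = (a - 3)^2, whose minimum is at a = 3.
def dJ(a):
    return 2 * (a - 3)     # derivative of the objective

alpha = 0.0                # initial solution
rho = 0.1                  # step size
for t in range(100):
    alpha = alpha - rho * dJ(alpha)   # alpha_{t+1} = alpha_t - rho * dJ(alpha_t)
print(alpha)               # approaches 3
```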
Now apply this to linear regression. With $\mathbb{x}_i = (1, x_i)^T$, $\Theta = (\theta_0, \theta_1)^T$, $\mathbb{y} = (y_1, y_2, \dots, y_n)^T$, $X$ and $\mathbb{e}$ defined as before, take the objective

$$J(\Theta) = \frac{1}{2} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x_i - y_i \right)^2 = \frac{1}{2} \sum_{i=1}^{n} \left( \Theta^T \mathbb{x}_i - y_i \right)^2$$

The updates are

$$\theta_0^{(t+1)} = \theta_0^{(t)} - \alpha \left.\frac{\partial}{\partial \theta_0} J(\Theta)\right|_{\Theta^{(t)}}, \qquad
\theta_1^{(t+1)} = \theta_1^{(t)} - \alpha \left.\frac{\partial}{\partial \theta_1} J(\Theta)\right|_{\Theta^{(t)}}$$

That is, substitute the $t$-th value of $\theta_0$ into the expression obtained by differentiating $J(\Theta)$ with respect to $\theta_0$, and then subtract that value from $\theta_0$. $J(\Theta)$ is the function we differentiate, and it is also the function used as the criterion for stopping gradient descent.
With the same definitions ($\mathbb{x}_i = (1, x_i)^T$, $\Theta = (\theta_0, \theta_1)^T$),

$$J(\Theta) = \frac{1}{2} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x_i - y_i \right)^2 = \frac{1}{2} \sum_{i=1}^{n} \left( \Theta^T \mathbb{x}_i - y_i \right)^2$$

Gradient of $J(\Theta)$:

$$\frac{\partial}{\partial \theta_0} J(\Theta) = \sum_{i=1}^{n} \left( \Theta^T \mathbb{x}_i - y_i \right) \cdot 1, \qquad
\frac{\partial}{\partial \theta_1} J(\Theta) = \sum_{i=1}^{n} \left( \Theta^T \mathbb{x}_i - y_i \right) \cdot x_i$$

$$\nabla J(\Theta) = \left( \frac{\partial}{\partial \theta_0} J(\Theta),\ \frac{\partial}{\partial \theta_1} J(\Theta) \right)^T = \sum_{i=1}^{n} \left( \Theta^T \mathbb{x}_i - y_i \right) \mathbb{x}_i$$
๐•ฉ๐‘– = 1, ๐‘ฅ๐‘–
๐‘‡
, ฮ˜ = ๐œƒ0, ๐œƒ1
๐‘‡
, ๐•ช = ๐‘ฆ1, ๐‘ฆ2, โ€ฆ , ๐‘ฆ๐‘›
๐‘‡
, ๐‘‹ =
1
1
โ€ฆ
๐‘ฅ1
๐‘ฅ2
โ€ฆ
1 ๐‘ฅ ๐‘›
, ๐•– = (๐œ–1, โ€ฆ , ๐œ– ๐‘›) ๋ผ๊ณ  ํ•˜์ž.
๐œƒ0
๐‘ก+1
= ๐œƒ0
๐‘ก
โˆ’ ๐›ผ (ฮ˜ ๐‘‡ ๐•ฉ๐‘– โˆ’ ๐‘ฆ๐‘–)
๐‘›
๐‘–=1
1
๋‹จ, ์ด ๋•Œ์˜ ฮ˜์ž๋ฆฌ์—๋Š”
๐‘ก๋ฒˆ์งธ์— ์–ป์–ด์ง„ ฮ˜๊ฐ’์„ ๋Œ€์ž…ํ•ด์•ผ ํ•œ๋‹ค.
๐œƒ1
๐‘ก+1
= ๐œƒ1
๐‘ก
โˆ’ ๐›ผ ฮ˜ ๐‘‡
๐•ฉ๐‘– โˆ’ ๐‘ฆ๐‘– ๐‘ฅ๐‘–
๐‘›
๐‘–=1
Steepest Descent
Pros: easy to implement, conceptually clean, guaranteed convergence.
Cons: often slow converging.

$$\Theta^{(t+1)} = \Theta^{(t)} - \alpha \sum_{i=1}^{n} \left\{ (\Theta^{(t)})^T \mathbb{x}_i - y_i \right\} \mathbb{x}_i$$

Normal Equations
Pros: a single-shot algorithm! Easiest to implement.
Cons: need to compute the pseudo-inverse $(X^T X)^{-1}$, expensive, numerical issues (e.g., the matrix is singular...), although there are ways to get around this...

$$\hat{\Theta} = (X^T X)^{-1} X^T \mathbb{y}$$
Multivariate Linear Regression

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

Simple linear regression has a single input variable.
Multiple (multivariate) linear regression has two or more input variables.
(Figure: e.g., Google's stock price predicted from Yahoo's stock price and Microsoft's stock price.)
๐’š = ๐œฝ ๐ŸŽ + ๐œฝ ๐Ÿ ๐’™ ๐Ÿ
๐Ÿ + ๐œฝ ๐Ÿ ๐’™ ๐Ÿ
๐Ÿ’ + ๐
์˜ˆ๋ฅผ ๋“ค์–ด, ์•„๋ž˜์™€ ๊ฐ™์€ ์‹์„ ์„ ํ˜•์œผ๋กœ ์ƒ๊ฐํ•˜์—ฌ ํ’€ ์ˆ˜ ์žˆ๋Š”๊ฐ€?
๋ฌผ๋ก , input ๋ณ€์ˆ˜๊ฐ€ polynomial(๋‹คํ•ญ์‹)์˜ ํ˜•ํƒœ์ด์ง€๋งŒ, coefficients ๐œƒ๐‘–๊ฐ€
์„ ํ˜•(linear)์ด๋ฏ€๋กœ ์„ ํ˜• ํšŒ๊ท€ ๋ถ„์„์˜ ํ•ด๋ฒ•์œผ๋กœ ํ’€ ์ˆ˜ ์žˆ๋‹ค.
๐šฏ = ๐‘‹ ๐‘‡ ๐‘‹ โˆ’1 ๐‘‹ ๐‘‡ ๐•ชห†
๐œƒ0, ๐œƒ1, โ€ฆ , ๐œƒ ๐‘›
๐‘‡
General Linear Regression

Multiple regression: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

General regression: $y = \theta_0 + \theta_1 g_1(x_1) + \theta_2 g_2(x_2) + \dots + \theta_n g_n(x_n)$

where $g_j$ can be a function such as $x^j$, $\frac{(x-\mu_j)^2}{2\sigma_j}$, or $\frac{1}{1+\exp(-s_j x)}$.

This, too, can be solved with the linear regression solution method.
๐‘ค ๐‘‡
= (๐‘ค0, ๐‘ค1, โ€ฆ , ๐‘ค ๐‘›)
๐œ™ ๐‘ฅ ๐‘– ๐‘‡
= ๐œ™0 ๐‘ฅ ๐‘–
, ๐œ™1 ๐‘ฅ ๐‘–
, โ€ฆ , ๐œ™ ๐‘› ๐‘ฅ ๐‘–
๐‘ค ๐‘‡
= (๐‘ค0, ๐‘ค1, โ€ฆ , ๐‘ค ๐‘›)
๐œ™ ๐‘ฅ ๐‘– ๐‘‡
= ๐œ™0 ๐‘ฅ ๐‘–
, ๐œ™1 ๐‘ฅ ๐‘–
, โ€ฆ , ๐œ™ ๐‘› ๐‘ฅ ๐‘–
normal equation
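A sketch of that idea: build the design matrix from basis functions $\phi_j$ and reuse the normal equation. The polynomial basis and the data below are chosen only for illustration:

```python
import numpy as np

# Invented 1-D data with a curved trend.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 1.4, 2.2, 3.7, 5.9, 8.6, 12.1])

# Basis functions phi_j(x); here a simple polynomial basis (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

# Same normal equation as before, now applied to the transformed inputs.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w_hat)    # weights (w0, w1, w2) of a model that is linear in its parameters
```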
[ ์ž๋ฃŒ์˜ ๋ถ„์„ ]
โ‘  ๋ชฉ์  : ์ง‘์„ ํŒ”๊ธฐ ์›ํ•จ. ์•Œ๋งž์€ ๊ฐ€๊ฒฉ์„ ์ฐพ๊ธฐ ์›ํ•จ.
โ‘ก ๊ณ ๋ คํ•  ๋ณ€์ˆ˜(feature) : ์ง‘์˜ ํฌ๊ธฐ(in square feet), ์นจ์‹ค์˜ ๊ฐœ์ˆ˜, ์ง‘ ๊ฐ€๊ฒฉ
(์ถœ์ฒ˜ : http://aimotion.blogspot.kr/2011/10/machine-learning-with-python-linear.html)
โ‘ข ์ฃผ์˜์‚ฌํ•ญ : ์ง‘์˜ ํฌ๊ธฐ์™€ ์นจ์‹ค์˜ ๊ฐœ์ˆ˜์˜ ์ฐจ์ด๊ฐ€ ํฌ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ง‘์˜ ํฌ๊ธฐ๊ฐ€ 4000 square feet์ธ๋ฐ,
์นจ์‹ค์˜ ๊ฐœ์ˆ˜๋Š” 3๊ฐœ์ด๋‹ค. ์ฆ‰, ๋ฐ์ดํ„ฐ ์ƒ feature๋“ค ๊ฐ„ ๊ทœ๋ชจ์˜ ์ฐจ์ด๊ฐ€ ํฌ๋‹ค. ์ด๋Ÿด ๊ฒฝ์šฐ,
feature์˜ ๊ฐ’์„ ์ •๊ทœํ™”(normalizing)๋ฅผ ํ•ด์ค€๋‹ค. ๊ทธ๋ž˜์•ผ, Gradient Descent๋ฅผ ์ˆ˜ํ–‰ํ•  ๋•Œ,
๊ฒฐ๊ณผ๊ฐ’์œผ๋กœ ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ดํ•˜๋‹ค.
โ‘ฃ ์ •๊ทœํ™”์˜ ๋ฐฉ๋ฒ•
- feature์˜ mean(ํ‰๊ท )์„ ๊ตฌํ•œ ํ›„, feature๋‚ด์˜ ๋ชจ๋“  data์˜ ๊ฐ’์—์„œ mean์„ ๋นผ์ค€๋‹ค.
- data์—์„œ mean์„ ๋นผ ์ค€ ๊ฐ’์„, ๊ทธ data๊ฐ€ ์†ํ•˜๋Š” standard deviation(ํ‘œ์ค€ ํŽธ์ฐจ)๋กœ ๋‚˜๋ˆ„์–ด ์ค€๋‹ค. (scaling)
์ดํ•ด๊ฐ€ ์•ˆ ๋˜๋ฉด, ์šฐ๋ฆฌ๊ฐ€ ๊ณ ๋“ฑํ•™๊ต ๋•Œ ๋ฐฐ์› ๋˜ ์ •๊ทœ๋ถ„ํฌ๋ฅผ ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋กœ ๋ฐ”๊พธ์–ด์ฃผ๋Š” ๊ฒƒ์„ ๋– ์˜ฌ๋ ค๋ณด์ž.
ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ  ์ค‘ ํ•˜๋‚˜๋Š”, ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๋ถ„ํฌ, ์ฆ‰ ๋น„๊ต๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฑฐ๋‚˜ ์–ด๋ ค์šด ๋‘ ๋ถ„ํฌ๋ฅผ ์‰ฝ๊ฒŒ
๋น„๊ตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฒƒ์ด์—ˆ๋‹ค.
๐‘ =
๐‘‹ โˆ’ ๐œ‡
๐œŽ
If ๐‘‹~(๐œ‡, ๐œŽ) then ๐‘~๐‘(1,0)
1. http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture5-LiR.pdf
2. http://www.cs.cmu.edu/~10701/lecture/RegNew.pdf
3. Regression Analysis (회귀분석), 3rd ed., 박성현.
4. Pattern Recognition (패턴인식), 오일석.
5. Mathematical Statistics (수리통계학), 3rd ed., 전명식.
Laplacian Smoothing

Multinomial random variable $z$: $z$ can take values from 1 to $k$.
As our sample (test set) we have $m$ independent observations $z^{(1)}, \dots, z^{(m)}$.
From these observations we want to estimate $p(z = i)$ for $i = 1, \dots, k$.

The estimate (MLE) is

$$p(z = j) = \frac{\sum_{i=1}^{m} I\{z^{(i)} = j\}}{m}$$

where $I\{\cdot\}$ is the indicator function; we estimate using the frequency within the observed values.
One thing to note is that the quantity we are estimating is the population parameter $p(z = i)$; we merely use the sample (the test set) to estimate it.
For example, if $z^{(i)} \neq 3$ for all $i = 1, \dots, m$, then $p(z = 3) = 0$.
Statistically, this is not a good idea: setting the population parameter we want to estimate to 0 merely because the value does not appear in the sample is a bad idea. (A weakness of MLE.)

To overcome this,
① the numerator must not become 0, and
② the estimates must still sum to 1: $\sum_{j} p(z = j) = 1$ (∵ probabilities must sum to 1).

Therefore, let

$$p(z = j) = \frac{\sum_{i=1}^{m} I\{z^{(i)} = j\} + 1}{m + k}.$$

① holds: even if the value $j$ never appears in the sample, the corresponding estimate is not 0.
② holds: let $n_j$ be the number of data with $z^{(i)} = j$. Then $p(z = 1) = \frac{n_1 + 1}{m + k}, \dots, p(z = k) = \frac{n_k + 1}{m + k}$, and adding up all the estimates gives 1.

This is Laplacian smoothing. Intuitively, it is as if we added the assumption that the values of $z$ from 1 to $k$ can occur uniformly.