This document traces the history and development of deep learning from the perceptron in 1958 to modern deep neural networks. It describes the key milestones: the perceptron in 1958; multilayer perceptrons in the 1980s, which could solve the XOR problem; and Boltzmann machines in the 1980s, which introduced unsupervised learning. Deep learning has gained popularity since 2010 due to increases in data and computational power, and it is now being applied to problems in computer vision, natural language processing, and other domains.
DropConnect differs from DropOut in that, during training, it removes the connections into the hidden units with 50% probability, rather than removing the hidden units themselves (Figure 15) [31].
Figure 15: Description of DropOut and DropConnect [30]
As Figure 16 shows, networks trained with DropOut or DropConnect perform better than otherwise identical networks trained without them [31].
Figure 16: Using the MNIST dataset: a) ability of DropOut and DropConnect to prevent overfitting as the size of the two fully connected layers increases; b) varying the drop rate in a 400-400 network shows near-optimal performance around p = 0.5 [31]
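The contrast is easy to state in code. The following is a minimal NumPy sketch, not the implementation from [30, 31]: DropOut zeroes hidden activations, while DropConnect zeroes individual weights before the nonlinearity. The layer sizes, the ReLU activation, and the rescaling by 1/(1 - p) are illustrative assumptions; at inference time [31] actually uses a Gaussian moment-matching approximation rather than simple rescaling.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p=0.5):
    # DropOut: compute the hidden activations, then zero each unit
    # with probability p; dividing by (1 - p) keeps the expectation unchanged.
    h = np.maximum(0, x @ W)            # ReLU hidden activations, shape (batch, hidden)
    mask = rng.random(h.shape) >= p     # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

def dropconnect_layer(x, W, p=0.5):
    # DropConnect: zero each individual connection (weight) with
    # probability p *before* the nonlinearity is applied.
    mask = rng.random(W.shape) >= p     # keep each weight with probability 1 - p
    return np.maximum(0, x @ (W * mask) / (1.0 - p))

x = rng.standard_normal((4, 8))         # batch of 4 examples with 8 features
W = rng.standard_normal((8, 16))        # weights into 16 hidden units
print(dropout_layer(x, W).shape)        # (4, 16)
print(dropconnect_layer(x, W).shape)    # (4, 16)

Since the hidden units here are ReLUs, zeroing every incoming weight of a unit reproduces DropOut's effect on that unit, which is the sense in which DropConnect generalizes DropOut [31].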
Local minima issue: Recent analysis suggests that converging to a local minimum instead of the global minimum is not the real obstacle. In the high-dimensional, non-convex optimization problems solved when training deep networks, most local minima are expected to have similar loss values, so there is little practical difference between a local minimum and the global minimum, and there is no need to worry about which one training reaches (Figure 17) [24]. Intuitively, with very many dimensions it is hard for a point to be a local minimum along every dimension at once.
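To make this intuition concrete, here is a back-of-the-envelope heuristic (an illustrative simplification, not a derivation from [24]). A critical point of the loss $L$ is a local minimum only if every eigenvalue $\lambda_1, \dots, \lambda_d$ of the Hessian $\nabla^2 L$ is positive. If each eigenvalue were independently positive or negative with equal probability, then

\[
P(\text{local minimum}) \;=\; P(\lambda_1 > 0, \dots, \lambda_d > 0) \;=\; 2^{-d},
\]

which vanishes rapidly as the dimension $d$ grows: in the parameter spaces of deep networks, almost every critical point is a saddle point rather than a trapping local minimum.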
Figure 17: Local minima in high-dimensional, non-convex optimization (loss plotted against the parameters); slide from [24]. The slide notes that local minima are all similar, that there are long plateaus, and that it can take long to break symmetries; optimization is not the real problem when the dataset is large, units do not saturate too much, and a normalization layer is used.
3 Summary
Research on artificial neural networks began with the perceptron in the 1950s and advanced a step in the 1980s, when the error back-propagation algorithm made it possible to train multilayer perceptrons. The field then ran into the problems of gradient vanishing, the shortage of labeled data, and overfitting.
References

[7] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP Volume 15, pages 315–323, 2011.
[8] T. Han-Hsing. [ML, Python] Gradient descent algorithm (revision 2). http://hhtucode.blogspot.kr/2013/04/ml-gradient-descent-algorithm.html.
[9] G. Hinton. Coursera: Neural networks for machine learning. https://class.coursera.org/neuralnets-2012-001.
[10] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[11] G. E. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[14] A. Honkela. Multilayer perceptrons. https://www.hiit.fi/u/ahonkela/dippa/node41.html.
[15] J. Kim. 2014 Pattern Recognition and Machine Learning Summer School. http://prml.yonsei.ac.kr/.
[16] H. Larochelle. Deep learning. http://www.dmi.usherb.ca/~larocheh/projects_deep_learning.html.
[17] Q. V. Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE, 2013.
[18] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning.
Predicting structured data, 2006.
[19] Y. LeCun and F. Huang. Loss functions for discriminative training of energy-based models. In AISTATS, 2005.
[20] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[21] L. Muehlhauser. A crash course in the neuroscience of human motivation. http://lesswrong.com/lw/71x/a_crash_course_in_the_neuroscience_of_human/.
[22] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[23] C. Nvidia. Compute unified device architecture programming guide, 2007.
[24] M. Ranzato. Deep learning for vision: Tricks of the trade. www.cs.toronto.edu/~ranzato.
[25] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. Technical report, DTIC Document, 1985.
[27] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.
[28] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
[29] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66, 2010.
[30] L. Wan. Regularization of neural networks using DropConnect. http://cs.nyu.edu/~wanli/dropc/.
[31] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[32] Wikipedia. Restricted Boltzmann machine. http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.
[33] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, et al. On rectified linear units for speech processing. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3517–3521. IEEE, 2013.
[34] Donga Ilbo. MIT's 10 breakthrough technologies. http://news.donga.com/3/all/20130426/54713529/1.