This document traces the history and development of deep learning from the perceptron in 1958 to modern deep neural networks. It describes the key milestones: the perceptron in 1958; multilayer perceptrons in the 1980s, which could solve the XOR problem; and Boltzmann machines in the 1980s, which introduced unsupervised learning. Deep learning has gained popularity since 2010 due to increases in data and computational power, and it is now being applied to problems in computer vision, natural language processing, and other domains.
DropConnect differs from DropOut in that, during training, it removes the connections into the hidden units with 50% probability, rather than removing the hidden units themselves (Figure 15) [31].
Figure 15: Description of DropOut and DropConnect [30]
As Figure 16 shows, networks trained with DropOut or DropConnect perform better than otherwise identical networks trained without them [31].
Figure 16: Using the MNIST dataset: a) ability of DropOut and DropConnect to prevent overfitting as the size of the two fully connected layers increases; b) varying the drop rate in a 400-400 network shows near-optimal performance around p = 0.5 [31]
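The contrast is easy to state in code. The following is a minimal NumPy sketch, not the implementation from [30, 31]: DropOut zeroes hidden activations, while DropConnect zeroes individual weights before the nonlinearity. The layer sizes, the ReLU activation, and the rescaling by 1/(1 - p) are illustrative assumptions; at inference time [31] actually uses a Gaussian moment-matching approximation rather than simple rescaling.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p=0.5):
    # DropOut: compute the hidden activations, then zero each unit
    # with probability p; dividing by (1 - p) keeps the expectation unchanged.
    h = np.maximum(0, x @ W)            # ReLU hidden activations, shape (batch, hidden)
    mask = rng.random(h.shape) >= p     # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

def dropconnect_layer(x, W, p=0.5):
    # DropConnect: zero each individual connection (weight) with
    # probability p *before* the nonlinearity is applied.
    mask = rng.random(W.shape) >= p     # keep each weight with probability 1 - p
    return np.maximum(0, x @ (W * mask) / (1.0 - p))

x = rng.standard_normal((4, 8))         # batch of 4 examples with 8 features
W = rng.standard_normal((8, 16))        # weights into 16 hidden units
print(dropout_layer(x, W).shape)        # (4, 16)
print(dropconnect_layer(x, W).shape)    # (4, 16)

Since the hidden units here are ReLUs, zeroing every incoming weight of a unit reproduces DropOut's effect on that unit, which is the sense in which DropConnect generalizes DropOut [31].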
Local minima issue: Recent analysis suggests that converging to a local minimum instead of the global minimum is not the real obstacle. In the high-dimensional, non-convex optimization problems solved when training deep networks, most local minima are expected to have similar loss values, so there is little practical difference between a local minimum and the global minimum, and there is no need to worry about which one training reaches (Figure 17) [24]. Intuitively, with very many dimensions it is hard for a point to be a local minimum along every dimension at once.
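To make this intuition concrete, here is a back-of-the-envelope heuristic (an illustrative simplification, not a derivation from [24]). A critical point of the loss $L$ is a local minimum only if every eigenvalue $\lambda_1, \dots, \lambda_d$ of the Hessian $\nabla^2 L$ is positive. If each eigenvalue were independently positive or negative with equal probability, then

\[
P(\text{local minimum}) \;=\; P(\lambda_1 > 0, \dots, \lambda_d > 0) \;=\; 2^{-d},
\]

which vanishes rapidly as the dimension $d$ grows: in the parameter spaces of deep networks, almost every critical point is a saddle point rather than a trapping local minimum.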
Figure 17: Local minima in high-dimensional, non-convex optimization (loss plotted against the parameters); slide from [24]. The slide notes that local minima are all similar, that there are long plateaus, and that it can take long to break symmetries; optimization is not the real problem when the dataset is large, units do not saturate too much, and a normalization layer is used.
3 Summary
Research on artificial neural networks began with the perceptron in the 1950s and advanced a step in the 1980s, when the error back-propagation algorithm made it possible to train multilayer perceptrons. The field then ran into the problems of gradient vanishing, the shortage of labeled data, and overfitting.
References

[7] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP Volume 15, pages 315–323, 2011.
[8] T. Han-Hsing. [ML, Python] Gradient descent algorithm (revision 2). http://hhtucode.blogspot.kr/2013/04/ml-gradient-descent-algorithm.html.
[9] G. Hinton. Coursera: Neural networks for machine learning. https://class.coursera.org/neuralnets-2012-001.
[10] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[11] G. E. Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
[12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[14] A. Honkela. Multilayer perceptrons. https://www.hiit.fi/u/ahonkela/dippa/node41.html.
[15] J. Kim. 2014 Pattern Recognition and Machine Learning Summer School. http://prml.yonsei.ac.kr/.
[16] H. Larochelle. Deep learning. http://www.dmi.usherb.ca/~larocheh/projects_deep_learning.html.
[17] Q. V. Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE, 2013.
[18] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning.
Predicting structured data, 2006.
[19] Y. LeCun and F. Huang. Loss functions for discriminative training of energy-based models. In AISTATS, 2005.
[20] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[21] L. Muehlhauser. A crash course in the neuroscience of human motivation. http://lesswrong.com/lw/71x/a_crash_course_in_the_neuroscience_of_human/.
[22] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[23] C. Nvidia. Compute unified device architecture programming guide, 2007.
[24] M. Ranzato. Deep learning for vision: Tricks of the trade. www.cs.toronto.edu/~ranzato.
[25] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. Technical report, DTIC Document, 1985.
[27] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.
[28] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
[29] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66, 2010.
[30] L. Wan. Regularization of neural networks using DropConnect. http://cs.nyu.edu/~wanli/dropc/.
[31] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[32] Wikipedia. Restricted Boltzmann machine. http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.
[33] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, et al. On rectified linear units for speech processing. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3517–3521. IEEE, 2013.
[34] Donga Ilbo. MIT's 10 breakthrough technologies. http://news.donga.com/3/all/20130426/54713529/1.