Published in:
Science


- 1. Universal Approximation Theorem: why do deep neural networks work? • basic maths for deep learning • Jamie Seol, IDS Lab
- 2. Motivation • Basic logistic regression, or a perceptron, can’t classify data like: [figure omitted] • because the decision boundary will be just a line!
- 3. Motivation • Unless we add extra features like x₃ = x₁², x₄ = x₂² • but this is feature engineering! • we need to do it by hand!! • we don’t want to do this! • this is one of the main reasons why we’re studying deep learning! • we can’t always do this! • what if the input data were a 1000-dimensional vector? • we know that deep learning automatically finds these additional features • but why and how?
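
The feature-engineering step on slide 3 can be sketched in a few lines of Python. The original figure is not available, so the dataset here is an assumption (points labelled by whether they fall inside the unit circle), and the names `label`, `engineered` and `linear_rule` are hypothetical. The point is only that a rule that is linear in (x₁, x₂, x₃, x₄) can separate data that no line in (x₁, x₂) can.

```python
import random


def label(x1, x2):
    """Assumed toy labels: 1 inside the unit circle, 0 outside."""
    return 1 if x1 * x1 + x2 * x2 < 1.0 else 0


def engineered(x1, x2):
    """Hand-made features from the slide: x3 = x1^2, x4 = x2^2."""
    return (x1, x2, x1 * x1, x2 * x2)


def linear_rule(features):
    """A linear decision rule in the engineered space: 1 - x3 - x4 > 0."""
    w = (0.0, 0.0, -1.0, -1.0)
    b = 1.0
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else 0


random.seed(0)
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(1000)]
accuracy = sum(linear_rule(engineered(*p)) == label(*p) for p in points) / len(points)
print(accuracy)
```

The weights were chosen by hand here, which is exactly the manual work the slide wants to avoid.
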
- 4. Toy experiment • We know that an SVM will probably work well on the previous problem • How about an MLP (multilayer perceptron)? • It works, with 100% accuracy • https://github.com/theeluwin/kata/blob/master/tasks/Universal%20Approximation%20Theorem/uat.py
- 5. Toy experiment • The MLP actually learned a non-linear decision boundary without any feature engineering! • [figure panels: data; decision boundary]
- 6. Universal Approximation Theorem • The theorem states that, long story short, an MLP can approximate ANY given (nice) function to any desired accuracy • Formally proved by G. Cybenko in 1989 for sigmoidal activation functions • K. Hornik later showed that the result does not hinge on the sigmoid: a broad class of activation functions (for example, any bounded non-constant one) works too • therefore, theoretically, an MLP can learn ANYTHING • “it’s not working in practice” ← it will, if you have a zillion neurons with a zillion layers, and infinite data • now we can say deep learning is a universal learning algorithm • then why do we need things like CNN and RNN? because we don’t have infinite data
- 7. Universal Approximation Theorem • We’ll prove this theorem • We can’t skip maths forever! • let’s use Times New Roman font
- 8. Analysis 101 • In Euclidean space ℝⁿ, we call the following set an open ball • B(x, ε) = {y ∈ ℝⁿ | d(x, y) < ε} • for given x ∈ ℝⁿ and ε ∈ ℝ⁺ • the function d can be any metric, but usually we use • d(x, y) = l₂(x − y) = ||x − y||₂ • also known as Euclidean distance • We call a set A ⊂ ℝⁿ open if ∀x ∈ A, ∃ε ∈ ℝ⁺ such that B(x, ε) ⊂ A • We generalize this concept by taking the open sets themselves as the primitive notion, which we call a topology
- 9. Analysis 101 • [figure panels: open ball; open set; open ball when d(x, y) = l₁(x − y)]
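
Since the figures for the open-ball slides are missing, a small Python sketch may help: it tests membership in B(x, ε) under the l₂ and l₁ distances. The helper names `l1`, `l2` and `in_open_ball` are illustrative, not from the slides.

```python
def l1(v):
    """l1 norm: sum of absolute coordinates."""
    return sum(abs(c) for c in v)


def l2(v):
    """l2 (Euclidean) norm."""
    return sum(c * c for c in v) ** 0.5


def in_open_ball(y, x, eps, norm=l2):
    """True if y lies in B(x, eps) = {y : norm(y - x) < eps}."""
    diff = [yi - xi for yi, xi in zip(y, x)]
    return norm(diff) < eps


origin = (0.0, 0.0)
point = (0.6, 0.6)
print(in_open_ball(point, origin, 1.0, l2))  # inside the round l2 ball
print(in_open_ball(point, origin, 1.0, l1))  # outside the diamond-shaped l1 ball
```

The same point can be inside one unit ball and outside the other, which is why the shape of an "open ball" depends on the metric.
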
- 10. Topology 101 • For some set X, we say 𝒯 is a topology on X if: • 𝒯 is a family of subsets of X • that is, 𝒯 ⊂ 𝒫(X) • ∅, X ∈ 𝒯 • any union of elements of 𝒯 is an element of 𝒯 • any intersection of finitely many elements of 𝒯 is an element of 𝒯 • Members of 𝒯 are often called open sets • for example, with X = ℝⁿ, we can give a topology 𝒯 by collecting all open sets defined in the previous slides • note that this is a typical non-constructive definition! • we can’t list all members of 𝒯; we just know that they exist
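
On a finite set the topology axioms can be checked mechanically. The following is a minimal sketch; `is_topology` is a hypothetical helper, and it relies on the fact that for a finite family, closure under pairwise unions and intersections already implies closure under all unions and all finite intersections.

```python
from itertools import combinations


def is_topology(X, T):
    """Check the topology axioms for a finite family T of subsets of a finite set X."""
    X = frozenset(X)
    T = {frozenset(s) for s in T}
    # The empty set and the whole space must be open.
    if frozenset() not in T or X not in T:
        return False
    # Every member must be a subset of X.
    if any(not s <= X for s in T):
        return False
    # Pairwise closure suffices because T is finite.
    for a, b in combinations(T, 2):
        if a | b not in T or a & b not in T:
            return False
    return True


X = {1, 2, 3}
print(is_topology(X, [set(), {1}, {1, 2}, {1, 2, 3}]))
print(is_topology(X, [set(), {1}, {2}, {1, 2, 3}]))
```

The second family fails because {1} ∪ {2} = {1, 2} is missing.
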
- 11. Topology 101 • Note that topology is motivated by generalizing the notion of open set! • for example, in analysis, we say a sequence {aₙ} converges to a if ∀ε > 0, ∃N s.t. ∀n > N, |aₙ − a| < ε holds • in topology, we say a sequence {aₙ} converges to a if for any open set O ∈ 𝒯 containing a, ∃N s.t. for n > N, aₙ ∈ O holds • x ∈ X is called a limit point of A (for some given A ⊂ X) if for any open set O ∈ 𝒯 containing x, O contains a point of A other than x, that is, O∩(A∖{x}) ≠ ∅ • we denote by A’ the set of limit points of A • Ā is called the closure of A, defined by Ā = A∪A’ • we say A is dense in X if Ā = X • THIS is the true meaning of dense! • dense dance ~
- 12. Fourier Analysis 101 • Easy example: any irrational number is a limit point of ℚ (why?), so the closure of ℚ is equal to ℝ, or, ℚ is dense in ℝ • One of the most useful applications of the concept of density is the Fourier series/transform: • trigonometric polynomials (partial sums of Fourier series) are dense in L² • this is non-trivial when the support is not compact • how about Lᵖ? • the Fourier transform can be defined on Lᵖ (for 1 ≤ p ≤ 2) • this is also non-trivial • it can be proved by the following steps: • show that the Schwartz class is dense in Lᵖ • extend the domain to Lᵖ by completion
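
The statement "ℚ is dense in ℝ" can be made concrete: for any target and any ε > 0 there is a rational within ε. The sketch below uses `math.pi` as a stand-in for π (a float is itself rational, so this is only an illustration), and `rational_within` is a hypothetical helper.

```python
import math
from fractions import Fraction


def rational_within(x, eps):
    """Return a rational q with |q - x| < eps by truncating x at denominator n."""
    n = 2 * math.ceil(1 / eps)           # 1/n <= eps/2, leaving a safety margin
    return Fraction(math.floor(x * n), n)


for eps in (0.1, 0.001, 1e-6):
    q = rational_within(math.pi, eps)
    print(eps, q, abs(q - math.pi))
```

Shrinking ε produces a sequence of rationals approaching the target, which is the sense in which every open ball around π meets ℚ.
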
- 13. Topology 101 • We say a topological space X is compact if every open cover has a finite subcover • an open cover is just a family of open sets whose union contains X • Actually, talking about compactness requires a lot of time • For Euclidean space, we can think of compact as bounded and closed • this is not trivial: see the Heine-Borel theorem • bounded means that the set can be contained in some big open ball • closed is just the complement of open; A is a closed set when X∖A is open
- 14. Linear Algebra 101 • For a vector space V over a field F, we call ||•||: V → ℝ a norm if • for all a ∈ F and all u, v ∈ V, • ||av|| = |a| ||v|| • ||u + v|| ≤ ||u|| + ||v|| • if ||u|| = 0 then u = 0 • Examples • lₚ-norm: lₚ(x) = (∑ᵢ |xᵢ|ᵖ)^(1/p) • Lₚ-norm: Lₚ(f) = (∫ |f(x)|ᵖ dx)^(1/p) • Note that the set of real-valued functions is a vector space! functions are vectors! • (3f + g) is a function (vector) defined by (3f + g)(x) = 3f(x) + g(x)
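
A short Python sketch of the two ideas on the norm slide: the lₚ-norm of a finite vector, and functions treated as vectors via pointwise addition and scaling. The helper names `lp_norm`, `add` and `scale` are illustrative assumptions, not from the slides.

```python
def lp_norm(x, p):
    """l_p norm of a finite vector: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def add(f, g):
    """Pointwise sum of two functions: (f + g)(x) = f(x) + g(x)."""
    return lambda x: f(x) + g(x)


def scale(a, f):
    """Scalar multiple of a function: (a f)(x) = a * f(x)."""
    return lambda x: a * f(x)


u, v = [3.0, 4.0], [1.0, -2.0]
w = [ui + vi for ui, vi in zip(u, v)]
print(lp_norm(u, 2), lp_norm(u, 1))
print(lp_norm(w, 2) <= lp_norm(u, 2) + lp_norm(v, 2))  # triangle inequality

f = lambda x: x * x
g = lambda x: x + 1
h = add(scale(3, f), g)  # the vector 3f + g
print(h(2))
```
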
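- 15. Linear Algebra 101 • For a vector space V over a field F, we call <•, •>: V × V → F an inner product if • for all a ∈ F and all x, y, z ∈ V, • <x, y> is the complex conjugate of <y, x> • <ax + y, z> = a<x, z> + <y, z> • <x, x> ≥ 0, and <x, x> = 0 only if x = 0 • Note that we can induce a canonical (trivial) norm from an inner product by ||x|| = √<x, x> • Example: inner product of two real-valued functions, <f, g> = ∫ f(x) g(x) dx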
- 16. Linear Algebra 101 • Function space: just a vector space • which means its elements, or vectors, are functions • mostly we’ll deal with continuous and real-valued functions • Typical examples of function spaces are: • spaces of linear functionals (dual spaces) • Schwartz class • Banach spaces, Hilbert spaces • Reproducing Kernel Hilbert Space (RKHS) • this is a really, really, really important concept! mandatory for understanding regularization and transfer learning • we’ll cover it someday… • The domain of the functions (vectors) is often called the support in these slides
- 17. Analysis 101 • A set M is called a metric space if a metric d is given, satisfying • d: M × M → ℝ, and for all x, y, z in M, • d(x, y) ≥ 0 • if d(x, y) = 0 then x = y • d(x, y) = d(y, x) • d(x, z) ≤ d(x, y) + d(y, z) • Obviously, if a vector space is a normed space, then it has a canonical (trivial) metric d induced by d(x, y) = ||x − y|| • Another obvious fact: a metric space has a canonical (trivial) topology induced by its open balls • As a non-example, the Kullback-Leibler divergence is not a metric (it is not symmetric)
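
The remark that the Kullback-Leibler divergence is not a metric is easy to verify numerically: it fails the symmetry axiom d(x, y) = d(y, x). A minimal sketch for discrete distributions follows; the two example distributions are arbitrary choices for illustration.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q))  # roughly 0.37
print(kl(q, p))  # roughly 0.51
```

The two directions give different values, so KL cannot be a metric even though KL(p || p) = 0.
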
- 18. Analysis 101 • A sequence {aₙ} in a metric space M is said to be Cauchy if ∀ε > 0, ∃N s.t. ∀n, m > N, d(aₙ, aₘ) < ε holds • A metric space M is said to be complete if every Cauchy sequence converges • a very, very typical complete space: ℝ • but this is not trivial • for example, a rational sequence {aₙ} satisfying aₙ ∈ [π − 1/n, π + 1/n] is Cauchy, but it never converges in ℚ!
- 19. Analysis 101 • In a function space whose functions have a metric codomain, we say a sequence {fₙ} converges uniformly to f if ∀ε > 0, ∃N s.t. ∀n > N, d(fₙ(x), f(x)) < ε holds for all x in the support • why uniform? because N is determined independently of x • which means N is a function of ε only • if N is determined by both ε and x, then we call it point-wise convergence
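
The difference between uniform and point-wise convergence can be seen numerically. In this sketch (the two example sequences and the finite grid are assumptions chosen for illustration), fₙ(x) = x/n converges uniformly to 0 on [0, 1), while gₙ(x) = xⁿ converges to 0 only point-wise: its true supremum distance to 0 on [0, 1) is 1 for every n, and the grid maximum below only approximates that from underneath.

```python
def sup_distance(f, g, grid):
    """Approximate sup_x |f(x) - g(x)| by a maximum over a finite grid."""
    return max(abs(f(x) - g(x)) for x in grid)


grid = [k / 10000 for k in range(10000)]  # sample points of [0, 1)
zero = lambda x: 0.0

ns = (1, 10, 100)
uniform_gaps = [sup_distance(lambda x, n=n: x / n, zero, grid) for n in ns]
pointwise_gaps = [sup_distance(lambda x, n=n: x ** n, zero, grid) for n in ns]

print(uniform_gaps)    # shrinks like 1/n: one N works for every x
print(pointwise_gaps)  # stays close to 1: no single N works for all x
print(0.5 ** 100)      # yet at any fixed x < 1 the values do go to 0
```
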
- 20. Real Analysis 101 • We say a function f is measurable if f⁻¹(A) is measurable for any open set A • maybe we should stop here… • if someone meets a non-measurable function by chance, it’ll be one of the unluckiest days ever • one old mathematician said, “If someone built an airplane using a non-measurable function, then I’ll not ride the plane” • Anyway, a measure is something that measures the area of a set • for example, µ([a, b]) = b − a holds, as intuition suggests, in ℝ with Lebesgue measure µ • we say some property p holds almost everywhere in A if the measure of {x ∈ A | ¬p(x)} is 0 • note that a probability is the measure of an event!
- 21. All together • Want to show: “the set of MLPs is uniformly dense in measurable normed function space on every compact support” • at last! now we can read this statement • Let’s prove this theorem
- 22. Notations • For some natural number r, let Aʳ be the set of all affine functions from ℝʳ to ℝ, where an affine function has the form A(x) = wᵀx + b • For any measurable function G from ℝ to ℝ, let ∑ʳ(G) be the class of functions {f: ℝʳ → ℝ, f(x) = ∑ⱼ 𝛽ⱼG(Aⱼ(x)), x ∈ ℝʳ, 𝛽ⱼ ∈ ℝ, Aⱼ ∈ Aʳ, finite sum} • note that ∑ʳ(G) is the family of single-hidden-layer feedforward neural networks with one output, having G as the activation function • actually, we’re talking about Borel measurability, not just any notion of measurability • For any measurable function G from ℝ to ℝ, let ∑∏ʳ(G) be the class of functions {f: ℝʳ → ℝ, f(x) = ∑ⱼ 𝛽ⱼ∏ₖ G(Aⱼₖ(x)), x ∈ ℝʳ, 𝛽ⱼ ∈ ℝ, Aⱼₖ ∈ Aʳ, finite sum and finite product} • a somewhat more complex but more general form of MLP
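
The two function classes on the notation slide translate almost literally into code. This is a minimal sketch with hypothetical helper names (`affine`, `sigma_net`, `sigma_pi_net`); inputs are tuples in ℝʳ, and the step activation is just one possible choice of G.

```python
import math


def affine(w, b):
    """A(x) = w^T x + b, an element of A^r."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def sigma_net(betas, affines, G):
    """Member of Sigma^r(G): f(x) = sum_j beta_j * G(A_j(x))."""
    return lambda x: sum(beta * G(A(x)) for beta, A in zip(betas, affines))


def sigma_pi_net(betas, affine_groups, G):
    """Member of SigmaPi^r(G): f(x) = sum_j beta_j * prod_k G(A_jk(x))."""
    return lambda x: sum(
        beta * math.prod(G(A(x)) for A in group)
        for beta, group in zip(betas, affine_groups)
    )


step = lambda t: 1.0 if t >= 0 else 0.0  # positive indicator as activation

first, second = affine([1.0, 0.0], 0.0), affine([0.0, 1.0], 0.0)
f = sigma_net([1.0, -1.0], [first, second], step)    # step(x1) - step(x2)
g = sigma_pi_net([2.0], [[first, second]], step)     # 2 * step(x1) * step(x2)

print(f((1.0, -1.0)), f((1.0, 1.0)))
print(g((1.0, 1.0)), g((1.0, -1.0)))
```

Every member of ∑ʳ(G) is also a member of ∑∏ʳ(G) with one factor per product, which is why the proof can start with the larger class.
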
- 23. Notations • A function 𝛹: ℝ → [0, 1] is a squashing function if • it is non-decreasing • its limit at +∞ is 1 and at −∞ is 0 • examples: positive indicator function, standard sigmoid function • Cʳ = set of continuous functions from ℝʳ to ℝ • if G is continuous, both ∑ʳ(G) and ∑∏ʳ(G) are subsets of Cʳ • Mʳ = set of measurable functions from ℝʳ to ℝ • if G is measurable, both ∑ʳ(G) and ∑∏ʳ(G) are subsets of Mʳ • note that Cʳ is a subset of Mʳ
- 24. Notations • A subset S of Cʳ is said to be uniformly dense on compacta in Cʳ if for every compact subset K ⊂ ℝʳ, S is d_K-dense in Cʳ • d_K(f, g) = sup_{x∈K} |f(x) − g(x)| • that is, uniform convergence should hold on K • Given a measure µ on (ℝʳ, Bʳ), we define the metric • dµ(f, g) = inf{ε > 0: µ{x: |f(x) − g(x)| > ε} < ε} • so dµ(f, g) = 0 if and only if f and g agree almost everywhere • Bʳ is the Borel 𝜎-field of ℝʳ (we omit the detailed explanation) • note that this measure µ describes the input space environment (think of it as the probability of an event)
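
To get a feel for the two metrics, the sketch below evaluates both on a finite sample. This is an approximation under stated assumptions: d_K is replaced by a maximum over grid points of K, and µ is taken to be the empirical (uniform) probability measure on the sample. The bisection in `d_mu` works because the set of admissible ε is upward closed and always contains ε = 2.

```python
def d_K(f, g, K_grid):
    """Approximate sup_{x in K} |f(x) - g(x)| on a finite grid of K."""
    return max(abs(f(x) - g(x)) for x in K_grid)


def d_mu(f, g, sample):
    """inf{eps > 0 : mu{|f - g| > eps} < eps} for the empirical measure on sample."""
    diffs = [abs(f(x) - g(x)) for x in sample]
    n = len(diffs)

    def admissible(eps):
        return sum(d > eps for d in diffs) / n < eps

    lo, hi = 0.0, 2.0  # eps = 0 is never admissible; eps = 2 always is (mu <= 1)
    for _ in range(60):
        mid = (lo + hi) / 2
        if admissible(mid):
            hi = mid
        else:
            lo = mid
    return hi


sample = [k / 100 for k in range(100)]
f = lambda x: 0.0
g = lambda x: 1.0 if x < 0.1 else 0.0  # differs from f on 10% of the sample

print(d_K(f, g, sample))   # far apart in the sup metric
print(d_mu(f, g, sample))  # close in measure
```

Two functions can be far apart uniformly while being close in dµ, which is why density in dµ is the weaker and more widely applicable notion.
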
- 25. Stone-Weierstrass Theorem • A family A of real functions defined on E is an algebra if A is closed under addition, multiplication, and scalar multiplication • A family A separates points on E if for every x, y in E with x ≠ y, ∃f ∈ A s.t. f(x) ≠ f(y) • A family A vanishes at no point of E if for each x in E, ∃f ∈ A s.t. f(x) ≠ 0 • statement: if A is an algebra of continuous real functions on a compact set K that separates points on K and vanishes at no point of K, then A is d_K-dense in C(K), the space of continuous real functions on K • the proof is too long to give here • see the references for the full proof
- 26. Sketch of the proof 1. Apply the Stone-Weierstrass theorem to ∑∏ʳ(G) 2. Extend uniformly dense on compacta to dense in the sense of measure 3. For squashing functions, approximate them by continuous squashing functions 4. For continuous squashing functions, approximate them by Fourier series 5. Remove ∏ by the cosine product-to-sum rule
- 27. Theorem 1 • statement: let G be any continuous non-constant function from ℝ to ℝ, then ∑∏ʳ(G) is uniformly dense on compacta in Cʳ • proof: let K ⊂ ℝʳ be any compact set; then for any such G, ∑∏ʳ(G) is an algebra on K which separates points and vanishes at no point, since we collected affine functions of the form A(x) = wᵀx + b (for x ≠ y we can pick an affine A with G(A(x)) ≠ G(A(y)), and a constant A with G(A(x)) ≠ 0), so we can use the Stone-Weierstrass theorem
- 28. Lemma 1 • statement: if {fₙ} is a sequence of functions in Mʳ that converges uniformly on compacta to the function f, then dµ(fₙ, f) → 0 • by this lemma, extending from Cʳ to Mʳ becomes really easy! • note that our activation functions are not always continuous • but surely measurable, because we don’t want to build any non-measurable MLP; it would be terrifying! • proof: pick a big compact set (so that its measure is almost as large as that of the whole space), find a large N from uniform convergence, then apply Chebyshev’s inequality • fₙ will be close to f on the compact set, and the rest of the support has small measure, so the integral in Chebyshev’s inequality becomes tiny • we’re working on ℝʳ, which is a locally compact metric space with a regular measure, so many problems are relatively simple
- 29. Theorem 2 • statement: let G be any continuous non-constant function from ℝ to ℝ, then ∑∏ʳ(G) is dµ-dense in Mʳ for any finite measure µ • proof: immediately follows from Theorem 1 and Lemma 1 if we know that for any finite measure µ, Cʳ is dµ-dense in Mʳ • this can be proved in a similar way to Lemma 1 • Note that the target function no longer has to be continuous, only measurable • the activation function doesn’t really have to be continuous either: Theorem 3 shows that we can use the positive indicator function as the activation function
- 30. Theorem 3 • statement: let 𝛹 be any squashing function, then ∑∏ʳ(𝛹) is uniformly dense on compacta in Cʳ and dµ-dense in Mʳ for any probability measure µ • proof: show that ∑∏ʳ(𝛹) is uniformly dense on compacta in ∑∏ʳ(F), where F is a continuous squashing function • any continuous squashing function F can be uniformly approximated by some element H of ∑¹(𝛹), in the sense that for every ε > 0 there is an H with sup_{x∈ℝ} |F(x) − H(x)| < ε • if products of the form ∏ₖ F(Aₖ(x)) can be uniformly approximated by members of ∑∏ʳ(𝛹), then we’re done • these proofs can be done by analytic (rather dirty) work • the rest is completed by Lemma 1 and Theorem 2
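
The key approximation step in Theorem 3 can be illustrated with a concrete pair: take 𝛹 to be the positive indicator and F the standard sigmoid (both choices are assumptions for this sketch, not the general proof). A staircase H(x) = (1/Q) ∑ⱼ 𝛹(x − tⱼ) with thresholds tⱼ = F⁻¹(j/Q) is an element of ∑¹(𝛹), and its sup-distance to F is below 1/Q.

```python
import math


def sigmoid(x):
    """Continuous squashing function F."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(y):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(y / (1.0 - y))


def indicator(t):
    """Positive indicator, a discontinuous squashing function Psi."""
    return 1.0 if t >= 0 else 0.0


def staircase(Q):
    """H(x) = (1/Q) * sum_j Psi(x - t_j), with t_j = F^{-1}(j/Q), j = 1..Q-1."""
    thresholds = [logit(j / Q) for j in range(1, Q)]
    return lambda x: sum(indicator(x - t) for t in thresholds) / Q


grid = [k / 100 for k in range(-2000, 2001)]  # sample points of [-20, 20]
errors = {}
for Q in (4, 16, 64):
    H = staircase(Q)
    errors[Q] = max(abs(sigmoid(x) - H(x)) for x in grid)

print(errors)  # each error stays below 1/Q
```

Between consecutive thresholds the sigmoid rises by exactly 1/Q while H is constant, which is where the 1/Q bound comes from.
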
- 31. Theorem 4 • statement: let 𝛹 be any squashing function, then ∑ʳ(𝛹) is uniformly dense on compacta in Cʳ and dµ-dense in Mʳ for any probability measure µ • proof: use the fact that a continuous squashing function can be uniformly approximated by a Fourier series on each compact support, and note that cos A cos B = (cos(A + B) + cos(A − B)) / 2 • wow! that’s really all! • we removed the ∏ part by uniformly approximating 𝛹 by some continuous squashing function (using Theorem 3), which can in turn be uniformly approximated by its Fourier series on each compact set (for more details, see the reference book), together with the cosine product-to-sum rule
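
The reason the product-to-sum rule removes the ∏ can be written out explicitly. The following is a worked sketch of that single step for two factors, using the affine-function notation from the Notations slides; longer products follow by applying it repeatedly.

```latex
\begin{align*}
\cos A \cos B &= \tfrac{1}{2}\bigl[\cos(A + B) + \cos(A - B)\bigr], \\
A_1(x) \pm A_2(x) &= (w_1 \pm w_2)^{\mathsf T} x + (b_1 \pm b_2)
  \quad \text{is again affine}, \\
\cos\bigl(A_1(x)\bigr)\cos\bigl(A_2(x)\bigr)
  &= \tfrac{1}{2}\cos\bigl((A_1 + A_2)(x)\bigr)
   + \tfrac{1}{2}\cos\bigl((A_1 - A_2)(x)\bigr).
\end{align*}
```

So a product of cosines of affine functions is a finite sum of cosines of affine functions, and a network with products collapses to one with sums only.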
- 32. Corollaries • The rest is just some natural corollaries like • it can be extended to general Lᵖ spaces • it can be extended to multi-output MLPs • it can be extended to multi-layer MLPs • hmm? actually, Theorem 4 claims that a network with a single hidden layer is already a universal approximator
- 33. Conclusion • We now know that MLP actually works theoretically! • MLPs are universal approximators • Taylor series and Fourier series are also good universal approximators, but those two require detailed information about the original function (or a computable oracle), while an MLP can be learned using only the training data • me? • Jamie Seol • email - jamie@europa.snu.ac.kr • twitter - @theeluwin • website - theeluwin.kr • blog - blog.theeluwin.kr
- 34. References • Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366. • Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989): 303-314. • Rudin, Walter. Principles of Mathematical Analysis. Vol. 3. New York: McGraw-Hill, 1964. • Rudin, Walter. Real and Complex Analysis. Tata McGraw-Hill Education, 1987. • Munkres, James R. Topology. Prentice Hall, 2000. • Gockenbach, Mark S. Finite-Dimensional Linear Algebra. CRC Press, 2011. • Young, Matt. "The Stone-Weierstrass Theorem." (2006). • Stein, Elias M., and Rami Shakarchi. "Fourier Analysis, Princeton Lectures in Analysis I." (2003).
