# Machine Learning Preliminaries and Math Refresher

1. Machine Learning Preliminaries and Math Refresher. M. Lüthi, T. Vetter. February 18, 2008.
2. Outline: 1. General remarks about learning; 2. Probability Theory and Statistics; 3. Linear spaces.
3. Outline: 1. General remarks about learning; 2. Probability Theory and Statistics; 3. Linear spaces.
4. "The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial." (T. Poggio and C.R. Shelton)
5. Model building in natural sciences. Model building: given a phenomenon, construct a model for it. Example (Heat Conduction). Phenomenon: the spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature. Model: ∂Q/∂t = −k ∮_S ∇T · dS.
6. Learning as Model Building. Example (Learning). Phenomenon: learning (inferring general rules from examples). Model: f* = arg max_{f ∈ H} P(D|f) P(f) / P(D).
7. Learning as Model Building. Example (Learning). Phenomenon: learning (inferring general rules from examples). Model: f* = arg max_{f ∈ H} P(D|f) P(f) / P(D). Neural networks, decision trees, naive Bayes, support vector machines, etc. Models for learning: the models for learning are the learning algorithms.
8. Goals of the first block. Life is short ... We want to cover the essentials of learning. General Setting: a mathematically precise setting of the learning problem, valid for any kind of learning algorithm. Statistical Learning Theory: when does learning work; conditions any algorithm has to satisfy; performance bounds. Kernel Methods: theory of kernels; make linear algorithms non-linear; learning from non-vectorial data.
9. Mathematics needed in the first block. The need for mathematics: as we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms. General Setting: probability theory, statistics, basic optimization theory. Statistical Learning Theory: more probability theory, more statistics. Kernel Methods: linear spaces, linear algebra, basic optimization theory.
10. Mathematics needed in the first block. The need for mathematics: as we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms. General Setting: probability theory, statistics, basic optimization theory. Statistical Learning Theory: more probability theory, more statistics. Kernel Methods: linear spaces, linear algebra, basic optimization theory. A bit of mathematical maturity and an open mind are required. The rest will be explained.
11. "Nothing is more practical than a good theory." (Vladimir N. Vapnik)
12. "Nothing is more practical than a good theory." (Vladimir N. Vapnik) Nothing (in computer science) is more beautiful than learning theory? (M. Lüthi)
13. Outline: 1. General remarks about learning; 2. Probability Theory and Statistics; 3. Linear spaces.
15. Probability theory vs. statistics. Definition (Probability Theory): a branch of mathematics concerned with the analysis of random phenomena. General ⇒ specific. Definition (Statistics): the science of collecting, analyzing, presenting, and interpreting data. Specific ⇒ general. Statistical machine learning is closely related to (inferential) statistics. Many state-of-the-art learning algorithms are based on concepts from probability theory.
16. Probabilities. Definition (Probability Space): a probability space is a triple (Ω, F, P), where Ω is a set of elementary events ω; F is a collection of events (e.g. the power set P(Ω)); P is a measure that satisfies the probability axioms.
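The definition above can be made concrete on a small finite space. The following sketch (our own illustration, not from the slides) builds Ω, F, and P for a fair six-sided die and spot-checks the axioms that the next slide states:

```python
from itertools import chain, combinations

# A finite probability space for a fair six-sided die (illustrative sketch).
omega = {1, 2, 3, 4, 5, 6}

def power_set(s):
    """All subsets of s -- this is the event collection F."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = power_set(omega)

def P(event):
    """Uniform probability measure on omega."""
    return len(event) / len(omega)

# Spot-check the probability axioms on this small space.
assert all(P(A) >= 0 for A in F)      # nonnegativity
assert P(frozenset(omega)) == 1       # P(Omega) = 1
A, B = frozenset({1, 2}), frozenset({5})
assert P(A | B) == P(A) + P(B)        # additivity for disjoint events
```

Taking F to be the full power set works here only because Ω is finite; for uncountable Ω, F must be a suitable sigma-algebra.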
17. Axioms of Probability. 1. For any A ∈ F there exists a number P(A), the probability of A, satisfying P(A) ≥ 0. 2. P(Ω) = 1. 3. Let {A_n, n ≥ 1} be a collection of pairwise disjoint events and let A be their union. Then P(A) = Σ_{n=1}^∞ P(A_n).
18. Independence. Definition (Independence): two events A and B are independent iff the probability of their intersection equals the product of the individual probabilities, i.e. P(A ∩ B) = P(A) · P(B). Definition (Conditional probability): given two events A and B with P(B) > 0, we define the conditional probability of A given B, P(A|B), by P(A|B) = P(A ∩ B) / P(B).
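Both definitions can be verified by exhaustive counting on a small space. A sketch (the two-dice example is our own choice) using exact rational arithmetic:

```python
from fractions import Fraction

# Two fair dice: check independence and conditional probability by counting.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def P(pred):
    """Probability of the event described by predicate pred."""
    hits = sum(1 for w in outcomes if pred(w))
    return Fraction(hits, len(outcomes))

A = lambda w: w[0] == 6          # event A: first die shows 6
B = lambda w: w[1] % 2 == 0      # event B: second die is even

p_a, p_b = P(A), P(B)
p_ab = P(lambda w: A(w) and B(w))

assert p_ab == p_a * p_b         # A and B are independent
assert p_ab / p_b == p_a         # P(A|B) = P(A n B) / P(B) = P(A)
```

Because A concerns only the first die and B only the second, conditioning on B does not change the probability of A, which is exactly what independence means.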
19. Random Variables. A single event is not that interesting. Definition (Random Variable): a random variable X is a function from the probability space to a vector of real numbers, X : Ω → R^n. Random variables are characterized by their distribution function F. Definition (Probability Distribution Function): let X : Ω → R be a random variable. We define F_X(x) = P(X ≤ x), −∞ < x < ∞.
20. Probability density function. Definition (Probability density function): the density function is the function f_X with the property F_X(x) = ∫_{−∞}^{x} f_X(y) dy, −∞ < x < ∞.
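The defining integral relation between f_X and F_X can be checked numerically. A sketch assuming a standard normal density (our choice of example; the quadrature is a plain midpoint rule):

```python
import math

# Standard normal density; recover F_X(x) by numerically integrating f_X.
def f(y):
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

def F(x, lo=-10.0, n=20000):
    """Midpoint-rule approximation of the integral of f from lo to x."""
    h = (x - lo) / n
    return sum(f(lo + (k + 0.5) * h) for k in range(n)) * h

# F(0) = 1/2 by symmetry; F(1) can be compared against math.erf.
assert abs(F(0.0) - 0.5) < 1e-6
assert abs(F(1.0) - 0.5 * (1 + math.erf(1 / math.sqrt(2)))) < 1e-6
```

Truncating the lower limit at −10 is harmless here because the normal density is vanishingly small below that point.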
21. Convergence. Definition (Convergence in Probability): let X_1, X_2, ... be random variables. We say that X_n converges in probability to the random variable X as n → ∞ iff, for all ε > 0, P(|X_n − X| > ε) → 0 as n → ∞. We write X_n →^p X as n → ∞.
22. Weak law of large numbers. Theorem (Bernoulli's Theorem (Weak law of large numbers)): let X_1, ..., X_n be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then P[|(X_1 + ... + X_n)/n − µ| > ε] → 0 as n → ∞. Thus, given enough observations x_i ∼ F_X, the sample mean x̄ = (1/n) Σ_{i=1}^n x_i will approach the true mean µ.
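The theorem can be observed empirically: the probability that the sample mean of n fair-die rolls misses µ = 3.5 by more than ε shrinks as n grows. A simulation sketch (the die, ε, and trial counts are illustrative assumptions):

```python
import random

# Weak law of large numbers: estimate P(|sample mean - mu| > eps) for two n.
random.seed(0)
mu, eps, trials = 3.5, 0.3, 2000   # fair die has mean mu = 3.5

def exceed_freq(n):
    """Fraction of trials where the mean of n die rolls misses mu by > eps."""
    bad = 0
    for _ in range(trials):
        mean = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(mean - mu) > eps:
            bad += 1
    return bad / trials

small, large = exceed_freq(10), exceed_freq(200)
assert large < small   # the exceedance probability decreases as n grows
```

With n = 10 the sample mean still misses µ by more than 0.3 in a large fraction of trials; with n = 200 it almost never does, which is the convergence-in-probability statement in action.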
23. Expectation. Definition (Expectation): let X be a random variable with probability density function f_X, and g : R → R a function. We define the expectation E[g(X)] := ∫_{−∞}^{∞} g(x) f_X(x) dx. Definition (Sample mean): let a sample x = {x_1, x_2, ..., x_n} be given. We define the (sample) mean to be x̄ = (1/n) Σ_{i=1}^n x_i.
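The expectation integral can be approximated by simple quadrature. A sketch for X uniform on [0, 1] with g(x) = x², an illustrative choice for which E[X²] = 1/3 in closed form:

```python
# E[g(X)] as an integral: X uniform on [0, 1], so f(x) = 1 on that interval.
def expectation(g, f, lo, hi, n=100000):
    """Midpoint-rule approximation of the integral of g(x) * f(x) dx."""
    h = (hi - lo) / n
    return sum(g(x) * f(x) for x in
               (lo + (k + 0.5) * h for k in range(n))) * h

e = expectation(lambda x: x ** 2, lambda x: 1.0, 0.0, 1.0)
assert abs(e - 1 / 3) < 1e-6   # E[X^2] = 1/3 for X ~ U(0, 1)
```

The same routine works for any density with bounded support; for densities on all of R the limits would need to be truncated, as in the earlier distribution-function sketch.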
24. Variance. Definition (Variance): let X be a random variable with density function f_X. The variance is given by Var[X] = E[(X − E[X])²] = E[X²] − (E[X])². The square root √Var[X] of the variance is referred to as the standard deviation. Definition (Sample Variance): let the sample x = {x_1, x_2, ..., x_n} with sample mean x̄ be given. We define the sample variance to be s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)².
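The 1/(n−1) factor in the sample variance is a common stumbling block. A sketch checking a direct implementation against Python's statistics module, which uses the same n−1 convention (the data values are arbitrary):

```python
import statistics

# Sample variance with the 1/(n-1) factor, checked against the stdlib.
def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n                              # sample mean x-bar
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(data)
assert abs(s2 - statistics.variance(data)) < 1e-12
```

Dividing by n instead of n−1 would give a biased estimator of Var[X]; `statistics.pvariance` implements that population variant for comparison.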
25. Notation. Assume F has a probability density function: f(x) = dF(x)/dx. Formally, we write f(x) dx = dF(x). Example (Expectation): E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx = ∫_{−∞}^{∞} g(x) dF(x).
26. Outline: 1. General remarks about learning; 2. Probability Theory and Statistics; 3. Linear spaces.
27. Vector Space. A set V together with two binary operations, vector addition + : V × V → V and scalar multiplication · : R × V → V, is called a vector space over R if it satisfies the following axioms: 1. ∀x, y ∈ V: x + y = y + x (commutativity); 2. ∀x, y, z ∈ V: x + (y + z) = (x + y) + z (associativity); 3. ∃0 ∈ V, ∀x ∈ V: 0 + x = x (identity of vector addition); 4. ∀x ∈ V: 1 · x = x (identity of scalar multiplication); 5. ∀x ∈ V, ∃(−x) ∈ V: x + (−x) = 0 (additive inverse element); 6. ∀α ∈ R, ∀x, y ∈ V: α · (x + y) = α · x + α · y (distributivity); 7. ∀α, β ∈ R, ∀x ∈ V: (α + β) · x = α · x + β · x (distributivity); 8. ∀α, β ∈ R, ∀x ∈ V: α · (β · x) = (αβ) · x.
28. Vector Space. More importantly for us, the definition implies: x + y ∈ V for all x, y ∈ V, and αx ∈ V for all α ∈ R and x ∈ V. Subspace criterion: let V be a vector space over R and let W be a subset of V. Then W is a subspace if and only if it satisfies the following three conditions: 1. 0 ∈ W; 2. if x, y ∈ W then x + y ∈ W; 3. if x ∈ W and α ∈ R then αx ∈ W.
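The subspace criterion can be illustrated on a concrete subset of R². Here W = {(x, y) : y = 2x}, a line through the origin; the example and the helper names are our own, and membership is tested via a predicate rather than a general proof:

```python
# Subspace criterion for W = {(x, y) in R^2 : y = 2x} (illustrative sketch).
def in_W(v):
    x, y = v
    return abs(y - 2 * x) < 1e-12   # numerical membership test for the line

def add(u, v):
    return (u[0] + v[0], u[1] + v[1])

def scale(a, v):
    return (a * v[0], a * v[1])

u, v = (1.0, 2.0), (-3.0, -6.0)
assert in_W((0.0, 0.0))      # 1. contains the zero vector
assert in_W(add(u, v))       # 2. closed under vector addition
assert in_W(scale(5.0, u))   # 3. closed under scalar multiplication
```

A line that does not pass through the origin, such as y = 2x + 1, fails condition 1 and is therefore not a subspace, even though it is "flat".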
29. Normed spaces. Definition (Normed vector space): a normed vector space is a pair (V, ‖·‖), where V is a vector space and ‖·‖ the associated norm, satisfying the following properties for all u, v ∈ V: 1. ‖v‖ ≥ 0 (positivity); 2. ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality); 3. ‖αv‖ = |α| ‖v‖ (positive scalability); 4. ‖v‖ = 0 ⇔ v = 0 (positive definiteness).
30. Definition (Inner product space): a real inner product space is a pair (V, ⟨·,·⟩), where V is a real vector space and ⟨·,·⟩ the associated inner product, satisfying the following properties for all u, v, w ∈ V: 1. ⟨u, v⟩ = ⟨v, u⟩ (symmetry); 2. ⟨αu, v⟩ = α⟨u, v⟩ = ⟨u, αv⟩, ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩, and ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩ (bilinearity); 3. ⟨u, u⟩ ≥ 0 (positive definiteness). Definition (Strict inner product space): an inner product space is called strict if ⟨u, u⟩ = 0 ⇔ u = 0.
31. Inner product space. The strict inner product induces a norm: ‖f‖² = ⟨f, f⟩. The norm is used to define distances and angles between elements. Theorem (Cauchy-Schwarz inequality): for all vectors u and v of a real inner product space (V, ⟨·,·⟩), the following inequality holds: |⟨u, v⟩| ≤ ‖u‖ ‖v‖.
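The Cauchy-Schwarz inequality is easy to verify numerically for the standard dot product on R^n, which is a strict inner product. A sketch over random vectors with a fixed seed (the dimension and vector count are arbitrary choices):

```python
import math
import random

# Cauchy-Schwarz in R^n for the standard dot product: |<u,v>| <= ||u|| ||v||.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))   # the norm induced by the inner product

random.seed(1)
for _ in range(100):
    u = [random.uniform(-1, 1) for _ in range(5)]
    v = [random.uniform(-1, 1) for _ in range(5)]
    assert abs(dot(u, v)) <= norm(u) * norm(v) + 1e-12
```

Equality holds exactly when u and v are linearly dependent; the ratio ⟨u, v⟩ / (‖u‖ ‖v‖) is the cosine of the angle the slide alludes to.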
32. If you're not comfortable with any of the presented material, you should take your favourite textbook and read up on it within the next two weeks.