
Machine Learning Preliminaries and Math Refresher




  1. Machine Learning Preliminaries and Math Refresher. M. Lüthi, T. Vetter. February 18, 2008.
  2. Outline: 1. General remarks about learning; 2. Probability Theory and Statistics; 3. Linear spaces.
  3. Outline (current section: General remarks about learning).
  4. "The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial." (T. Poggio and C. R. Shelton)
  5. Model building in natural sciences. Model building: given a phenomenon, construct a model for it. Example (Heat Conduction). Phenomenon: the spontaneous transfer of thermal energy through matter, from a region of higher temperature to a region of lower temperature. Model: ∂Q/∂t = −k ∮_S ∇T · dS.
  6. Learning as Model Building. Example (Learning). Phenomenon: learning (inferring general rules from examples). Model: f* = arg max_{f ∈ H} P(D|f) P(f) / P(D).
  7. Learning as Model Building. Example (Learning). Phenomenon: learning (inferring general rules from examples). Model: f* = arg max_{f ∈ H} P(D|f) P(f) / P(D). Neural networks, decision trees, Naive Bayes, support vector machines, etc. Models for learning: the models for learning are the learning algorithms.
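The MAP selection rule f* = arg max_{f ∈ H} P(D|f) P(f) / P(D) on this slide can be sketched numerically. The following is a minimal illustration (not from the slides): the hypothesis space, priors, and data are invented (two coin-bias hypotheses), and P(D) is dropped from the argmax since it does not depend on f.

```python
# Hypothetical sketch of MAP hypothesis selection over a tiny
# discrete hypothesis space H; all numbers below are invented.
hypotheses = {"fair": 0.5, "biased": 0.8}   # P(heads | f)
priors     = {"fair": 0.7, "biased": 0.3}   # P(f)

data = [1, 1, 0, 1, 1]  # observed coin flips (1 = heads)

def likelihood(p_heads, flips):
    """P(D | f) for i.i.d. coin flips."""
    prob = 1.0
    for x in flips:
        prob *= p_heads if x == 1 else 1.0 - p_heads
    return prob

# P(D) is constant over f, so maximizing P(D|f) P(f) suffices.
f_star = max(hypotheses, key=lambda f: likelihood(hypotheses[f], data) * priors[f])
print(f_star)
```

Here the four-heads-in-five evidence outweighs the lower prior on the biased coin, so the MAP choice is "biased".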
  8. Goals of the first block. Life is short... We want to cover the essentials of learning. General Setting: a mathematically precise setting of the learning problem, valid for any kind of learning algorithm. Statistical Learning Theory: when does learning work; conditions any algorithm has to satisfy; performance bounds. Kernel Methods: theory of kernels; make linear algorithms non-linear; learning from non-vectorial data.
  9. Mathematics needed in the first block. The need for mathematics: as we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms. General Setting: probability theory, statistics, basic optimization theory. Statistical Learning Theory: more probability theory, more statistics. Kernel Methods: linear spaces, linear algebra, basic optimization theory.
  10. Mathematics needed in the first block (continued). The need for mathematics: as we treat the learning problem in a formal setting, the results and methods are necessarily formulated in mathematical terms. General Setting: probability theory, statistics, basic optimization theory. Statistical Learning Theory: more probability theory, more statistics. Kernel Methods: linear spaces, linear algebra, basic optimization theory. A bit of mathematical maturity and an open mind are required. The rest will be explained.
  11. "Nothing is more practical than a good theory." (Vladimir N. Vapnik)
  12. "Nothing is more practical than a good theory." (Vladimir N. Vapnik) "Nothing (in computer science) is more beautiful than learning theory?" (M. Lüthi)
  13. Outline (current section: Probability Theory and Statistics).
  15. Probability theory vs. statistics. Definition (Probability Theory): a branch of mathematics concerned with the analysis of random phenomena; general ⇒ specific. Definition (Statistics): the science of collecting, analyzing, presenting, and interpreting data; specific ⇒ general. Statistical machine learning is closely related to (inferential) statistics. Many state-of-the-art learning algorithms are based on concepts from probability theory.
  16. Probabilities. Definition (Probability Space): a probability space is a triple (Ω, F, P), where Ω is the set of elementary events ω, F is a collection of events (e.g. the power set P(Ω)), and P is a measure that satisfies the probability axioms.
  17. Axioms of probability. 1. For any A ∈ F, there exists a number P(A), the probability of A, satisfying P(A) ≥ 0. 2. P(Ω) = 1. 3. Let {A_n, n ≥ 1} be a collection of pairwise disjoint events, and let A be their union. Then P(A) = Σ_{n=1}^{∞} P(A_n).
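The three axioms can be spot-checked on a small discrete example. A minimal sketch, assuming a fair six-sided die with the uniform measure (an invented example, with axiom 3 checked for a finite disjoint union):

```python
# Spot check of the probability axioms on a fair die.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(omega))  # uniform measure on events A ⊆ Ω

even, odd = {2, 4, 6}, {1, 3, 5}

assert all(P({w}) >= 0 for w in omega)   # axiom 1: P(A) ≥ 0
assert P(omega) == 1                     # axiom 2: P(Ω) = 1
assert P(even | odd) == P(even) + P(odd) # axiom 3 (finite disjoint union)
```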
  18. Independence. Definition (Independence): two events A and B are independent iff the probability of their intersection equals the product of the individual probabilities, i.e. P(A ∩ B) = P(A) · P(B). Definition (Conditional probability): given two events A and B with P(B) > 0, we define the conditional probability of A given B, P(A|B), by the relation P(A|B) = P(A ∩ B) / P(B).
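Both definitions can be illustrated on one example. A sketch assuming two fair dice (an invented setup): the events "first die is even" and "second die is even" are independent, and the conditional probability follows the defining ratio.

```python
# Independence and conditional probability on a two-dice sample space.
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # all 36 outcomes (d1, d2)
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] % 2 == 0}      # first die even
B = {w for w in omega if w[1] % 2 == 0}      # second die even

assert P(A & B) == P(A) * P(B)               # A and B are independent
cond = P(A & B) / P(B)                       # P(A|B) = P(A ∩ B) / P(B)
assert cond == P(A)                          # independence => P(A|B) = P(A)
```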
  19. Random variables. A single event is not that interesting. Definition (Random Variable): a random variable X is a function from the probability space to a vector of real numbers, X: Ω → R^n. Random variables are characterized by their distribution function F. Definition (Probability Distribution Function): let X: Ω → R be a random variable. We define F_X(x) = P(X ≤ x), −∞ < x < ∞.
  20. Probability density function. Definition (Probability density function): the density function is the function f_X with the property F_X(x) = ∫_{−∞}^{x} f_X(y) dy, −∞ < x < ∞.
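The defining integral can be approximated numerically as a sanity check. A sketch assuming the standard normal density as the example (the lower cutoff and grid resolution are arbitrary choices, not from the slides):

```python
# Recovering F_X from f_X by numerical integration (trapezoidal rule),
# for the standard normal density; F(0) should be close to 0.5.
import math

def f(x):
    """Standard normal density f_X."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def F(x, lo=-8.0, n=100_000):
    """Approximate F_X(x) = ∫_{-∞}^{x} f_X(y) dy, truncating at lo."""
    h = (x - lo) / n
    total = 0.5 * (f(lo) + f(x)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h

print(round(F(0.0), 4))
```

By symmetry of the density, F(0) ≈ 0.5.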
  21. Convergence. Definition (Convergence in Probability): let X_1, X_2, ... be random variables. We say that X_n converges in probability to the random variable X as n → ∞ iff, for all ε > 0, P(|X_n − X| > ε) → 0 as n → ∞. We write X_n →^p X as n → ∞.
  22. Weak law of large numbers. Theorem (Bernoulli's Theorem (Weak law of large numbers)): let X_1, ..., X_n be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and standard deviation σ. Then P[|(X_1 + ... + X_n)/n − µ| > ε] → 0 as n → ∞. Thus, given enough observations x_i ∼ F_X, the sample mean x̄ = (1/n) Σ_{i=1}^{n} x_i will approach the true mean µ.
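The theorem is easy to watch in simulation. A sketch assuming i.i.d. uniform draws on [0, 1), so µ = 0.5 (the seed and sample sizes are arbitrary choices):

```python
# Simulation sketch of the weak law of large numbers: the deviation
# of the sample mean from µ = 0.5 shrinks as n grows.
import random

random.seed(0)
mu = 0.5
for n in (10, 1_000, 100_000):
    xs = [random.random() for _ in range(n)]
    print(n, abs(sum(xs) / n - mu))
```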
  23. Expectation. Definition (Expectation): let X be a random variable with probability density function f_X, and g: R → R a function. We define the expectation E[g(X)] := ∫_{−∞}^{∞} g(x) f_X(x) dx. Definition (Sample mean): let a sample x = {x_1, x_2, ..., x_n} be given. We define the (sample) mean to be x̄ = (1/n) Σ_{i=1}^{n} x_i.
  24. Variance. Definition (Variance): let X be a random variable with density function f_X. The variance is given by Var[X] = E[(X − E[X])²] = E[X²] − (E[X])². The square root √Var[X] of the variance is referred to as the standard deviation. Definition (Sample Variance): let the sample x = {x_1, x_2, ..., x_n} with sample mean x̄ be given. We define the sample variance to be s² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)².
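The sample mean and sample variance translate directly into code. A minimal sketch following the two definitions above (the data points are made up for illustration):

```python
# Sample mean x̄ = (1/n) Σ x_i and unbiased sample variance
# s² = (1/(n-1)) Σ (x_i - x̄)², exactly as defined on the slide.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)

mean = sum(xs) / n
s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)

print(mean, s2)
```

Note the n − 1 denominator: dividing by n instead would give a biased estimate of Var[X].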
  25. Notation. Assume F has a probability density function f(x) = dF(x)/dx. Formally, we write f(x) dx = dF(x). Example (Expectation): E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx = ∫_{−∞}^{∞} g(x) dF(x).
  26. Outline (current section: Linear spaces).
  27. Vector Space. A set V together with two binary operations, vector addition +: V × V → V and scalar multiplication ·: R × V → V, is called a vector space over R if it satisfies the following axioms: 1. ∀x, y ∈ V: x + y = y + x (commutativity); 2. ∀x, y, z ∈ V: x + (y + z) = (x + y) + z (associativity); 3. ∃0 ∈ V, ∀x ∈ V: 0 + x = x (identity of vector addition); 4. ∀x ∈ V: 1 · x = x (identity of scalar multiplication); 5. ∀x ∈ V, ∃(−x) ∈ V: x + (−x) = 0 (additive inverse element); 6. ∀α ∈ R, ∀x, y ∈ V: α · (x + y) = α · x + α · y (distributivity); 7. ∀α, β ∈ R, ∀x ∈ V: (α + β) · x = α · x + β · x (distributivity); 8. ∀α, β ∈ R, ∀x ∈ V: α · (β · x) = (αβ) · x.
  28. Vector Space. More importantly for us, the definition implies: x + y ∈ V for all x, y ∈ V, and αx ∈ V for all α ∈ R and x ∈ V. Subspace criterion: let V be a vector space over R, and let W be a subset of V. Then W is a subspace if and only if it satisfies the following three conditions: 1. 0 ∈ W; 2. if x, y ∈ W then x + y ∈ W; 3. if x ∈ W and α ∈ R then αx ∈ W.
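The subspace criterion can be spot-checked numerically on a concrete subset. A sketch assuming W = {(t, 2t) : t ∈ R} ⊂ R² as the example (a line through the origin; checking a few vectors is an illustration, not a proof):

```python
# Spot check of the three subspace conditions for W = {(t, 2t)} ⊆ R².
def in_W(v):
    """Membership test for W: second coordinate is twice the first."""
    return v[1] == 2 * v[0]

assert in_W((0.0, 0.0))                        # 1. zero vector is in W

u, v = (1.0, 2.0), (-3.0, -6.0)
s = (u[0] + v[0], u[1] + v[1])
assert in_W(s)                                 # 2. closed under addition

alpha = 2.5
assert in_W((alpha * u[0], alpha * u[1]))      # 3. closed under scalar mult.
```

A line not through the origin, e.g. {(t, 2t + 1)}, fails condition 1 and so is not a subspace.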
  29. Normed spaces. Definition (Normed vector space): a normed vector space is a pair (V, ‖·‖) where V is a vector space and ‖·‖ the associated norm, satisfying the following properties for all u, v ∈ V: 1. ‖v‖ ≥ 0 (positivity); 2. ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality); 3. ‖αv‖ = |α| ‖v‖ (positive scalability); 4. ‖v‖ = 0 ⇔ v = 0 (positive definiteness).
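The four properties can be spot-checked for a concrete norm. A sketch assuming the Euclidean norm on R² with arbitrarily chosen example vectors (a spot check, not a proof):

```python
# Spot check of the norm axioms for the Euclidean norm on R².
import math

def norm(v):
    """Euclidean norm ‖v‖ = sqrt(Σ v_i²)."""
    return math.sqrt(sum(x * x for x in v))

u, v = (3.0, 4.0), (-1.0, 2.0)
alpha = -2.0

assert norm(v) >= 0                                           # 1. positivity
assert norm((u[0] + v[0], u[1] + v[1])) <= norm(u) + norm(v)  # 2. triangle inequality
assert math.isclose(norm((alpha * v[0], alpha * v[1])),
                    abs(alpha) * norm(v))                     # 3. positive scalability
assert norm((0.0, 0.0)) == 0                                  # 4. ‖0‖ = 0
```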
  30. Definition (Inner product space): a real inner product space is a pair (V, ⟨·, ·⟩), where V is a real vector space and ⟨·, ·⟩ the associated inner product, satisfying the following properties for all u, v, w ∈ V: 1. ⟨u, v⟩ = ⟨v, u⟩ (symmetry); 2. ⟨αu, v⟩ = α⟨u, v⟩, ⟨u, αv⟩ = α⟨u, v⟩, ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩, and ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩ (bilinearity); 3. ⟨u, u⟩ ≥ 0 (positive definiteness). Definition (Strict inner product space): an inner product space is called strict if ⟨u, u⟩ = 0 ⇔ u = 0.
  31. Inner product space. The strict inner product induces a norm, ‖f‖² = ⟨f, f⟩, which is used to define distances and angles between elements. Theorem (Cauchy–Schwarz inequality): for all vectors u and v of a real inner product space (V, ⟨·, ·⟩), the following inequality holds: |⟨u, v⟩| ≤ ‖u‖ ‖v‖.
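The Cauchy–Schwarz inequality can be verified numerically on many random vectors. A sketch assuming the standard dot product on R³ (vector dimension, seed, and sample count are arbitrary; the small tolerance absorbs floating-point rounding):

```python
# Numerical spot check of |<u,v>| <= ‖u‖ ‖v‖ for the dot product on R³.
import math
import random

def dot(u, v):
    """Standard inner product <u, v> = Σ u_i v_i."""
    return sum(a * b for a, b in zip(u, v))

random.seed(1)
for _ in range(1000):
    u = [random.uniform(-1, 1) for _ in range(3)]
    v = [random.uniform(-1, 1) for _ in range(3)]
    assert abs(dot(u, v)) <= math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12
print("ok")
```

Equality holds exactly when u and v are linearly dependent, which is also why the induced "angle" arccos(⟨u, v⟩ / (‖u‖ ‖v‖)) is well defined.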
  32. If you're not comfortable with any of the presented material, you should take your favourite textbook and read up on it within the next two weeks.
