Probability & Information Theory
2017. 09. 25
Presented by Choi SeongJae
Random Variables
 Denote the random variable itself with a lower case letter
 𝒙₁, 𝒙₂, …
 Just a description of the states that are possible.
 May be Discrete or Continuous
Probability Distribution
 Is a description of how likely a random variable or set of random variables is
to take on each of its possible states
 Discrete Variables & Probability Mass Functions(PMF)
 Continuous Variables & Probability Density Functions(PDF)
Probability Distribution > PMF
 Denote PMF with a capital 𝑃
 Properties
 The domain of 𝑃 must be the set of all possible states of x
 ∀𝑥 ∈ 𝑥, 0 ≤ 𝑃(𝑥) ≤ 1
 Σ_{𝑥∈𝑥} 𝑃(𝑥) = 1
 Uniform distribution on 𝑋 with 𝒌 different states
 𝑃(𝑋 = 𝑥_i) = 1/𝑘
 (slide table: e.g. a fair six-sided die, where 𝑃(𝑋 = 𝑥) = 1/6 for every state)
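A minimal numpy sketch of the PMF properties above, using a fair six-sided die as an assumed example (the original slide table was not captured cleanly):

```python
import numpy as np

k = 6                               # number of states (a fair die)
pmf = np.full(k, 1.0 / k)           # P(X = x_i) = 1/k for every state

assert np.all((pmf >= 0) & (pmf <= 1))   # 0 <= P(x) <= 1
assert np.isclose(pmf.sum(), 1.0)        # sum over x of P(x) = 1
print(pmf)                               # [0.1667 0.1667 ... 0.1667]
```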
Probability Distribution > PDF
 Probability Density Function
 Properties
 The domain of 𝑝 must be the set of all possible states of x
 ∀𝑥 ∈ 𝑥, 𝑝(𝑥) ≥ 0 (we do not require 𝑝(𝑥) ≤ 1)
 ∫ 𝑝(𝑥) 𝑑𝑥 = 1
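A quick numerical check of these properties (my own illustration): a narrow Gaussian density exceeds 1 pointwise, yet still integrates to 1.

```python
import numpy as np

mu, sigma = 0.0, 0.1                # a narrow Gaussian, chosen only for illustration
x = np.linspace(-1.0, 1.0, 100_001)
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(p.max())                      # ~3.99 -> a density value may exceed 1
print(np.trapz(p, x))               # ~1.0  -> but the density integrates to 1
```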
Marginal Probability(주변확률)
 Used when we only know the joint probability 𝑝(𝑥, 𝑦) and want to obtain 𝑝(𝑥)
 Discrete variable
 ∀𝑥 ∈ 𝑥, 𝑃(𝑋 = 𝑥) = Σ_𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
 Continuous variable
 𝑝(𝑥) = ∫ 𝑝(𝑥, 𝑦) 𝑑𝑦
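A small sketch of marginalization over a discrete joint table (the table values are made up for illustration):

```python
import numpy as np

# Hypothetical joint P(X, Y): rows index x, columns index y.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

P_x = P_xy.sum(axis=1)   # P(X = x) = sum over y of P(X = x, Y = y)
P_y = P_xy.sum(axis=0)   # P(Y = y) = sum over x of P(X = x, Y = y)
print(P_x, P_y)          # [0.3 0.7] [0.4 0.6]
```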
Conditional Probability(조건부 확률)
 The probability of 𝑌 = 𝑦 given 𝑋 = 𝑥, written 𝑃(𝑌 = 𝑦 | 𝑋 = 𝑥)
 𝑃(𝑌 = 𝑦 | 𝑋 = 𝑥) = 𝑃(𝑌 = 𝑦, 𝑋 = 𝑥) / 𝑃(𝑋 = 𝑥)
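Continuing the same made-up joint table, a conditional distribution is simply the joint divided by the marginal of the conditioning variable:

```python
import numpy as np

# Hypothetical joint P(X, Y): rows index x, columns index y.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

P_x = P_xy.sum(axis=1)              # P(X = x)
P_y_given_x = P_xy / P_x[:, None]   # P(Y = y | X = x) = P(Y = y, X = x) / P(X = x)
print(P_y_given_x)                  # each row sums to 1
```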
Chain Rule of Conditional Probabilities
 Joint probability distribution may be decomposed into conditional
distributions
 𝑷(𝒙_𝟏, 𝒙_𝟐, …, 𝒙_𝒏) = 𝑷(𝒙_𝟏) ∏_{𝒊=𝟐}^{𝒏} 𝑷(𝒙_𝒊 | 𝒙_𝟏, …, 𝒙_{𝒊−𝟏})
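A short numpy check of the chain rule on a random joint over three binary variables (my own sketch): factoring the joint into conditionals and multiplying them back recovers the original table.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint distribution P(x1, x2, x3) over three binary variables.
P = rng.random((2, 2, 2))
P /= P.sum()

# Chain rule: P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x1, x2)
P_x1 = P.sum(axis=(1, 2))                          # P(x1)
P_x2_given_x1 = P.sum(axis=2) / P_x1[:, None]      # P(x2 | x1)
P_x3_given_x12 = P / P.sum(axis=2, keepdims=True)  # P(x3 | x1, x2)

reconstructed = (P_x1[:, None, None]
                 * P_x2_given_x1[:, :, None]
                 * P_x3_given_x12)
assert np.allclose(reconstructed, P)
```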
Independence and Conditional Independence
 𝑥 and 𝑦 are independent if their joint probability can be written as a product of the two probabilities
 ∀𝑥 ∈ 𝑥, 𝑦 ∈ 𝑦, 𝑝(𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑝(𝑋 = 𝑥) 𝑝(𝑌 = 𝑦)
 𝑥 and 𝑦 are conditionally independent given 𝑧 if the condition below is satisfied
 ∀𝑥 ∈ 𝑥, 𝑦 ∈ 𝑦, 𝑧 ∈ 𝑧, 𝑝(𝑋 = 𝑥, 𝑌 = 𝑦 | 𝑍 = 𝑧) = 𝑝(𝑋 = 𝑥 | 𝑍 = 𝑧) 𝑝(𝑌 = 𝑦 | 𝑍 = 𝑧)
 Notation: 𝑥 ⊥ 𝑦 (independence), 𝑥 ⊥ 𝑦 | 𝑧 (conditional independence)
Expectation(기댓값), Variance(분산) and Covariance(공분산)
 Expectation
 𝔼_{𝑥∼𝑃}[𝑓(𝑥)] = Σ_𝑥 𝑃(𝑥) 𝑓(𝑥)  (discrete)
 𝔼_{𝑥∼𝑝}[𝑓(𝑥)] = ∫ 𝑝(𝑥) 𝑓(𝑥) 𝑑𝑥  (continuous)
 Mean value
 Variance
 𝑉𝑎𝑟(𝑓(𝑥)) = 𝔼[(𝑓(𝑥) − 𝔼[𝑓(𝑥)])²]
 Covariance
 𝐶𝑜𝑣(𝑓(𝑥), 𝑔(𝑦)) = 𝔼[(𝑓(𝑥) − 𝔼[𝑓(𝑥)])(𝑔(𝑦) − 𝔼[𝑔(𝑦)])]
 If there is a linear relationship between 𝑥 and 𝑦, 𝐶𝑜𝑣 takes a positive or negative value
 If no linear relationship exists, it is 0
 Even when 𝐶𝑜𝑣 is 0, a relationship other than a linear one may still exist (see the sketch below)
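A numpy sketch of these three quantities, including the last point: with x symmetric around 0 and y = x², the variables are clearly dependent, yet their covariance is approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)   # samples of x ~ N(0, 1)
y = x ** 2                                  # nonlinear dependence on x

print(x.mean())            # E[x]      ~ 0
print(x.var())             # Var(x)    ~ 1
print(np.cov(x, y)[0, 1])  # Cov(x, y) ~ 0 despite the strong dependence
```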
Common Probability Distributions
 Bernoulli Distribution(베르누이 분포)
 The distribution over a single binary random variable
 Multinoulli Distribution
 Generalized Bernoulli distribution
 Categorical distribution over a single discrete variable with 𝒌 different states
 Gaussian Distribution(=Normal Distribution)
Bernoulli Distribution
 𝑃(𝑥 = 1) = 𝜙
 𝑃(𝑥 = 0) = 1 − 𝜙
 𝑃(𝑋 = 𝑥) = 𝜙^𝑥 (1 − 𝜙)^{1−𝑥}
 𝔼[𝑥] = 𝜙
 𝑉𝑎𝑟(𝑥) = 𝜙(1 − 𝜙)
 Binary Entropy
 𝐻(𝑥) = −𝔼[log 𝑝(𝑥)]
 𝐻(𝑥) = (𝜙 − 1) log(1 − 𝜙) − 𝜙 log 𝜙
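A quick check of the Bernoulli mean, variance, and binary entropy formulas against Monte-Carlo samples (𝜙 = 0.3 is an arbitrary choice; entropy is in nats, using the natural log):

```python
import numpy as np

phi = 0.3
samples = np.random.default_rng(0).binomial(1, phi, size=1_000_000)

print(samples.mean(), phi)              # E[x]   = phi
print(samples.var(), phi * (1 - phi))   # Var(x) = phi (1 - phi)

# Binary entropy H(x) = -(1 - phi) log(1 - phi) - phi log(phi)
H = -(1 - phi) * np.log(1 - phi) - phi * np.log(phi)
print(H)                                # ~0.611 nats
```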
Gaussian Distribution(정규분포)
 Normal Distribution
 𝒩(𝑥; 𝜇, 𝜎²) = (1 / √(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)² / (2𝜎²))
 Normal distribution is a good default choice for two major reasons.
 Many distributions we wish to model are truly close to being normal distributions.
 Central limit theorem
 Out of all distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers
 i.e., it inserts the fewest prior assumptions, which matches the real world well
• Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He, ICCV 2015
• Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot, Yoshua Bengio, AISTATS 2010
Gaussian Distribution > Central Limit Theorem(중심극한정리)
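A small simulation of the central limit theorem (my own illustration, using numpy): means of samples drawn from a decidedly non-normal distribution (exponential) concentrate around the population mean, with spread shrinking as 1/√n, and their histogram looks increasingly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, repeats = 100, 100_000

# Exponential(1) is skewed, yet the distribution of its sample means is close to normal.
sample_means = rng.exponential(1.0, size=(repeats, n)).mean(axis=1)

print(sample_means.mean())   # ~1.0  (the population mean)
print(sample_means.std())    # ~0.1  (population std / sqrt(n) = 1 / 10)
```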
Exponential and Laplace Distributions
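The body of this slide was not captured; for reference, the standard densities this topic covers are:

 Exponential: 𝑝(𝑥; 𝜆) = 𝜆 𝟏_{𝑥≥0} exp(−𝜆𝑥)
 Laplace: Laplace(𝑥; 𝜇, 𝛾) = (1 / 2𝛾) exp(−|𝑥 − 𝜇| / 𝛾)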
Mixtures of Distributions
 Probability distributions formed by combining other, simpler probability distributions
 𝑃(𝑥) = Σ_𝑖 𝑃(𝑐 = 𝑖) 𝑃(𝑥 | 𝑐 = 𝑖)
 Gaussian Mixture Model(GMM)
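A minimal sketch of sampling from a two-component Gaussian mixture (the mixing weights and component parameters are made up): first pick the component c, then sample x given c.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.3, 0.7])   # P(c = i), hypothetical mixing weights
means = np.array([-2.0, 3.0])    # component means
stds = np.array([0.5, 1.0])      # component standard deviations

c = rng.choice(2, size=10_000, p=weights)   # sample the component
x = rng.normal(means[c], stds[c])           # sample x | c

print(np.bincount(c) / c.size)   # ~[0.3, 0.7]
print(x.mean())                  # ~ 0.3*(-2) + 0.7*3 = 1.5
```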
Useful Function: Logistic Sigmoid
 Commonly used to produce the 𝜙 parameter of a Bernoulli distribution
 𝜎(𝑥) = 1 / (1 + exp(−𝑥))
 Output lies in (0, 1); saturates for large |𝑥|
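A short sketch of the logistic sigmoid and its saturation at the two ends of its (0, 1) range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# ~[4.5e-05, 0.5, 0.99995]: saturates near 0 and 1 for large |x|
```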
Bayes’ Rule
 𝑃(𝑥 | 𝑦) = 𝑃(𝑥) 𝑃(𝑦 | 𝑥) / 𝑃(𝑦)
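A tiny numeric instance of Bayes' rule (all numbers are hypothetical): given a prior P(x) and likelihood P(y | x), the posterior P(x | y) follows from the formula above, with P(y) obtained by marginalization.

```python
import numpy as np

P_x = np.array([0.01, 0.99])             # prior over x (hypothetical)
P_y_given_x = np.array([[0.90, 0.10],    # P(y | x): rows index x, columns index y
                        [0.05, 0.95]])

P_y = P_x @ P_y_given_x                              # P(y) = sum_x P(x) P(y | x)
P_x_given_y = (P_x[:, None] * P_y_given_x) / P_y     # Bayes' rule
print(P_x_given_y[:, 0])                             # posterior P(x | y = 0)
```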
Information Theory
Information Theory
 Mathematics that revolves around quantifying how much information is
present in a signal
 “The sun rose this morning” -> uninformative
 “There was a solar eclipse this morning” -> informative
 Intuitions
 Likely events should have low information content
 Less likely events should have higher information content
 𝐼(𝑥) = − log 𝑃(𝑥)
Shannon Entropy
 Can quantify the amount of uncertainty in an entire probability distribution
using the Shannon entropy
 𝐻(𝑥) = 𝔼_{𝑥∼𝑃}[𝐼(𝑥)] = −𝔼_{𝑥∼𝑃}[log 𝑃(𝑥)]
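A small sketch computing self-information and Shannon entropy for a discrete distribution (in nats, using the natural log; the distribution is made up):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])    # hypothetical distribution

I = -np.log(P)                     # self-information of each outcome
H = np.sum(P * I)                  # Shannon entropy H(x) = E[I(x)] = -E[log P(x)]

print(I)   # rarer outcomes carry more information
print(H)   # ~1.04 nats
```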
Kullback-Leibler(KL) Divergence
 To measure how different two distributions are
 𝐷_KL(𝑃 ‖ 𝑄) = 𝔼_{𝑥∼𝑃}[log(𝑃(𝑥) / 𝑄(𝑥))] = 𝔼_{𝑥∼𝑃}[log 𝑃(𝑥) − log 𝑄(𝑥)]
 Properties
 Non-negative
 KL divergence is asymmetric
 𝐷_KL(𝑃 ‖ 𝑄) ≠ 𝐷_KL(𝑄 ‖ 𝑃)
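A short check of these two properties on two made-up discrete distributions: both directions of the KL divergence are non-negative, and they differ.

```python
import numpy as np

P = np.array([0.8, 0.1, 0.1])
Q = np.array([0.4, 0.3, 0.3])

def kl(p, q):
    # D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]
    return np.sum(p * (np.log(p) - np.log(q)))

print(kl(P, Q), kl(Q, P))   # both >= 0, and not equal -> asymmetric
```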
Structured Probabilistic Models
 Using a single function to describe the entire joint probability distribution can be very inefficient
 Can split a probability distribution into many factors
 Directed Model
 Undirected Model
Directed Model
 Use graphs with directed edges
 Represent factorization into conditional probability distributions
 𝑝(𝑎, 𝑏, 𝑐, 𝑑, 𝑒) = 𝑝(𝑎) 𝑝(𝑏 | 𝑎) 𝑝(𝑐 | 𝑎, 𝑏) 𝑝(𝑑 | 𝑏) 𝑝(𝑒 | 𝑐)
Undirected Model
 Use graphs with undirected edges
 Represent factorizations into a set of functions
 𝑝(𝑎, 𝑏, 𝑐, 𝑑, 𝑒) = (1 / 𝑍) 𝜙₁(𝑎, 𝑏, 𝑐) 𝜙₂(𝑏, 𝑑) 𝜙₃(𝑐, 𝑒)
 Normalizing constant Z
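A minimal numpy sketch of this undirected factorization, with hypothetical non-negative factors over binary variables: the product of factors is unnormalized, and dividing by Z (the sum over all states) turns it into a proper joint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative factors over binary a, b, c, d, e.
phi1 = rng.random((2, 2, 2))   # phi_1(a, b, c)
phi2 = rng.random((2, 2))      # phi_2(b, d)
phi3 = rng.random((2, 2))      # phi_3(c, e)

# Unnormalized joint: product of factors; then normalize by Z.
unnorm = np.einsum('abc,bd,ce->abcde', phi1, phi2, phi3)
Z = unnorm.sum()
p = unnorm / Z
assert np.isclose(p.sum(), 1.0)
```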

Editor's Notes

  • #14 Moment-generating function? Central limit theorem: generally used in hypothesis testing. When the population distribution is unknown, draw samples of size N and compute the sample mean; if this sampling is repeated a very large number of times, the distribution of the sample means is a normal distribution, and the mean of the sample means equals the population mean. From what I found, the reason this matters in deep learning is that it lets us assume all input features are on the same scale; otherwise the weights can swing back and forth as if the learning rate were too high.
  • #15 Central limit theorem: generally used in hypothesis testing. When the population distribution is unknown, draw samples of size N and compute the sample mean; if this sampling is repeated a very large number of times, the distribution of the sample means is a normal distribution.
  • #16 The Laplace distribution is the double exponential distribution.
  • #17 When building a probability model for some data, the Gaussian distribution is the usual choice, because it reflects the real world best. Even so, a single normal distribution cannot represent all data; this is where the Gaussian Mixture Model is used. It is similar to K-means.
  • #22 As mentioned earlier, events that almost always happen or never happen carry little information -> deterministic. Events that may or may not happen carry more information.