An Introduction to Information Theory
Carlton Downey
November 12, 2013
INTRODUCTION
Today's recitation will be an introduction to Information Theory
Information theory studies the quantification of:
Information
Compression
Transmission
Error Correction
Gambling
Founded by Claude Shannon in 1948 with his classic paper "A Mathematical Theory of Communication"
It is an area of mathematics which I think is particularly elegant
OUTLINE
Motivation
Information
Entropy
Marginal Entropy
Joint Entropy
Conditional Entropy
Mutual Information
Compressing Information
Prefix Codes
KL Divergence
MOTIVATION: CASINO
You're at a casino
You can bet on coins, dice, or roulette
Coins = 2 possible outcomes. Pays 2:1
Dice = 6 possible outcomes. Pays 6:1
Roulette = 36 possible outcomes. Pays 36:1
Suppose you can predict the outcome of a single coin toss/dice roll/roulette spin.
Which would you choose?
Roulette. But why? These are all fair games.
Answer: Roulette provides us with the most Information
MOTIVATION: COIN TOSS
Consider two coins:
Fair coin C_F with P(H) = 0.5, P(T) = 0.5
Bent coin C_B with P(H) = 0.99, P(T) = 0.01
Suppose we flip both coins, and they both land heads
Intuitively we are more "surprised" or "informed" by the first outcome.
We know C_B is almost certain to land heads, so the knowledge that it lands heads provides us with very little information.
MOTIVATION: COMPRESSION
Suppose we observe a sequence of events:
Coin tosses
Words in a language
Notes in a song
etc.
We want to record the sequence of events in the smallest possible space.
In other words, we want the shortest representation which preserves all information.
Another way to think about this: how much information does the sequence of events actually contain?
MOTIVATION: COMPRESSION
To be concrete, consider the problem of recording coin tosses in unary.
T, T, T, T, H
Approach 1:
H: 0   T: 00
00, 00, 00, 00, 0
We used 9 characters
MOTIVATION: COMPRESSION
To be concrete, consider the problem of recording coin tosses in unary.
T, T, T, T, H
Approach 2:
H: 00   T: 0
0, 0, 0, 0, 00
We used 6 characters
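To make the character counts concrete, here is a minimal Python sketch (purely illustrative; the dictionaries and variable names are my own, not from the slides) that encodes the sequence under both schemes and counts the characters used:

```python
# Count the characters used by the two unary-style encodings of T, T, T, T, H.
tosses = ["T", "T", "T", "T", "H"]

approach_1 = {"H": "0", "T": "00"}
approach_2 = {"H": "00", "T": "0"}

for name, scheme in [("Approach 1", approach_1), ("Approach 2", approach_2)]:
    words = [scheme[t] for t in tosses]
    print(name, ", ".join(words), "->", sum(len(w) for w in words), "characters")

# Approach 1 uses 9 characters, Approach 2 uses 6: giving the common
# outcome (T) the shorter code word pays off.
```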
MOTIVATION: COMPRESSION
Frequently occurring events should have short encodings
We see this in English with words such as "a", "the", "and", etc.
We want to maximise the information-per-character:
Seeing common events provides little information
Seeing uncommon events provides a lot of information
INFORMATION
Let X be a random variable with distribution p(X).
We want to quantify the information provided by each
possible outcome.
Specifically we want a function which maps the
probability of an event p(x) to the information I(x)
Our metric I(x) should have the following properties:
I(xi) ≥ 0 ∀i
I(x1) > I(x2) if p(x1) < p(x2)
I(x1, x2) = I(x1) + I(x2) for independent events x1, x2
INFORMATION
I(x) = f(p(x))
We want f() such that I(x1, x2) = I(x1) + I(x2)
For independent events we know p(x1, x2) = p(x1)p(x2)
The only such function (up to the choice of base) is log(): log(ab) = log(a) + log(b)
Hence we define:
I(x) = log(1 / p(x))
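As a minimal sketch of this definition (using base-2 logarithms, so information is measured in bits; the helper name `information` is mine, not from the slides):

```python
import math

def information(p: float) -> float:
    """Information, in bits, of an event with probability p: I = log2(1/p)."""
    return math.log2(1.0 / p)

# The coins used on the next few slides:
print(information(0.5))   # 1.0 bit    (fair coin, either outcome)
print(information(0.25))  # 2.0 bits   (bent coin, heads)
print(information(0.75))  # ~0.415 bits (bent coin, tails; the slides round to 0.42)
print(information(0.01))  # ~6.64 bits (really bent coin, heads; the slides show 6.65)
```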
INFORMATION: COIN
Fair Coin:
h t
0.5 0.5
I(h) = log(1/0.5) = log(2) = 1
I(t) = log(1/0.5) = log(2) = 1
INFORMATION: COIN
Bent Coin:
h t
0.25 0.75
I(h) = log(1/0.25) = log(4) = 2
I(t) = log(1/0.75) = log(1.33) = 0.42
INFORMATION: COIN
Really Bent Coin:
h t
0.01 0.99
I(h) = log(1/0.01) = log(100) = 6.65
I(t) = log(1/0.99) = log(1.01) = 0.01
INFORMATION: TWO EVENTS
Question: How much information do we get from observing two independent events?
I(x1, x2) = log(1 / p(x1, x2))
          = log(1 / (p(x1) p(x2)))
          = log((1/p(x1)) · (1/p(x2)))
          = log(1/p(x1)) + log(1/p(x2))
          = I(x1) + I(x2)
Answer: Information sums!
INFORMATION IS ADDITIVE
I(k fair coin tosses) = log(1 / (1/2)^k) = k bits
So:
A random word from a 100,000 word vocabulary: I(word) = log(100,000) = 16.61 bits
A 1000 word document from the same source: I(document) = 16,610 bits
A 640 × 480 pixel, 16-greyscale picture: I(picture) = 307,200 × log(16) = 1,228,800 bits
A picture is worth (a lot more than) 1000 words!
In reality this is a gross overestimate
INFORMATION: TWO COINS
Bent Coin:
x h t
p(x) 0.25 0.75
I(x) 2 0.42
I(hh) = I(h) + I(h) = 4
I(ht) = I(h) + I(t) = 2.42
I(th) = I(t) + I(h) = 2.42
I(tt) = I(t) + I(t) = 0.84
INFORMATION: TWO COINS
Bent Coin Twice:
hh ht th tt
0.0625 0.1875 0.1875 0.5625
I(hh) = log(1/0.0625) = log(16) = 4
I(ht) = log(1/0.1875) = log(5.33) = 2.42
I(th) = log(1/0.1875) = log(5.33) = 2.42
I(tt) = log(1/0.5625) = log(1.78) = 0.84
ENTROPY
Suppose we have a sequence of observations of a random
variable X.
A natural question to ask is: what is the average amount of information per observation?
This quantity is called the Entropy and is denoted H(X)
H(X) = E[I(X)] = E[log(1 / p(X))]
ENTROPY
Information is associated with an event - heads, tails, etc.
Entropy is associated with a distribution over events - p(x).
ENTROPY: COIN
Fair Coin:
x h t
p(x) 0.5 0.5
I(x) 1 1
H(X) = E[I(X)]
     = Σ_i p(x_i) I(x_i)
     = p(h)I(h) + p(t)I(t)
     = 0.5 × 1 + 0.5 × 1
     = 1
ENTROPY: COIN
Bent Coin:
x h t
p(x) 0.25 0.75
I(x) 2 0.42
H(X) = E[I(X)]
     = Σ_i p(x_i) I(x_i)
     = p(h)I(h) + p(t)I(t)
     = 0.25 × 2 + 0.75 × 0.42
     ≈ 0.81
ENTROPY: COIN
Very Bent Coin:
x h t
p(x) 0.01 0.99
I(x) 6.65 0.01
H(X) = E[I(X)]
     = Σ_i p(x_i) I(x_i)
     = p(h)I(h) + p(t)I(t)
     = 0.01 × 6.65 + 0.99 × 0.01
     = 0.08
ENTROPY: ALL COINS
(Figure: the entropy of a coin plotted as a function of P(H), peaking at 1 bit when P(H) = 0.5 and falling to 0 as the coin becomes deterministic.)
H(p) = p log(1/p) + (1 − p) log(1/(1 − p))
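The curve described above is the binary entropy function; a short sketch (function name mine) that reproduces the three coin entropies computed on the previous slides:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

print(round(binary_entropy(0.5), 2))   # 1.0  (fair coin)
print(round(binary_entropy(0.25), 2))  # 0.81 (bent coin)
print(round(binary_entropy(0.01), 2))  # 0.08 (very bent coin)
# The curve peaks at 1 bit when p = 0.5 and falls to 0 as the coin
# becomes deterministic.
```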
ALTERNATIVE EXPLANATIONS OF ENTROPY
H(S) = Σ_i p_i log(1/p_i)
Average amount of information provided per event
Average amount of surprise when observing an event
Uncertainty an observer has before seeing the event
Average number of bits needed to communicate each event
THE ENTROPY OF ENGLISH
27 characters (A-Z, space)
100,000 words (average 5.5 characters each, 6.5 including the following space)
Assuming independence between successive characters:
Uniform character distribution: log(27) = 4.75 bits/character
True character distribution: 4.03 bits/character
Assuming independence between successive words:
Uniform word distribution: log(100,000)/6.5 = 2.55 bits/character
True word distribution: 9.45/6.5 = 1.45 bits/character
The true entropy of English is much lower
TYPES OF ENTROPY
There are 3 Types of Entropy
Marginal Entropy
Joint Entropy
Conditional Entropy
We will now define these quantities, and study how they
are related.
MARGINAL ENTROPY
A single random variable X has a Marginal Distribution
p(X)
This distribution has an associated Marginal Entropy
H(X) = Σ_i p(x_i) log(1/p(x_i))
Marginal entropy is the average information provided by
observing a variable X
JOINT ENTROPY
Two or more random variables X, Y have a Joint
Distribution
p(X, Y)
This distribution has an associated Joint Entropy
H(X, Y) = Σ_i Σ_j p(x_i, y_j) log(1/p(x_i, y_j))
Joint entropy is the average total information provided by observing two variables X, Y
CONDITIONAL ENTROPY
Two random variables X, Y also have two Conditional Distributions
p(X|Y) and p(Y|X)
These distributions have associated Conditional Entropies
H(X|Y) = Σ_j p(y_j) H(X|y_j)
       = Σ_j p(y_j) Σ_i p(x_i|y_j) log(1/p(x_i|y_j))
       = Σ_i Σ_j p(x_i, y_j) log(1/p(x_i|y_j))
Conditional entropy is the average additional information provided by observing X, given we already observed Y
TYPES OF ENTROPY: SUMMARY
Entropy: Average information gained by observing a
single variable
Joint Entropy: Average total information gained by
observing two or more variables
Conditional Entropy: Average additional information
gained by observing a new variable
ENTROPY RELATIONSHIPS
(Diagram relating H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y).)
RELATIONSHIP: H(X, Y) = H(X) + H(Y|X)
H(X, Y) = Σ_{i,j} p(x_i, y_j) log(1 / p(x_i, y_j))
        = Σ_{i,j} p(x_i, y_j) log(1 / (p(y_j|x_i) p(x_i)))
        = Σ_{i,j} p(x_i, y_j) [log(1/p(x_i)) + log(1/p(y_j|x_i))]
        = Σ_i p(x_i) log(1/p(x_i)) + Σ_{i,j} p(x_i, y_j) log(1/p(y_j|x_i))
        = H(X) + H(Y|X)
RELATIONSHIP: H(X, Y) ≤ H(X) + H(Y)
We know H(X, Y) = H(X) + H(Y|X)
Therefore we need only show H(Y|X) ≤ H(Y)
This makes sense: knowing X can only decrease the additional information provided by Y.
Proof? Possible homework =)
MUTUAL INFORMATION
The Mutual Information I(X; Y) is defined as:
I(X; Y) = H(X) − H(X|Y)
The mutual information is the amount of information
shared by X and Y.
It is a measure of how much X tells us about Y, and vice
versa.
If X and Y are independent then I(X; Y) = 0, because X
tells us nothing about Y and vice versa.
If X = Y then I(X; Y) = H(X) = H(Y). X tells us everything
about Y and vice versa.
EXAMPLE
Marginal Distribution:
X sun rain
P(X) 0.6 0.4
Y hot cold
P(Y) 0.6 0.4
Conditional Distribution:
Y hot cold
P(Y|X = sun) 0.8 0.2
Y hot cold
P(Y|X = rain) 0.3 0.7
Joint Distribution:
hot cold
sun 0.48 0.12
rain 0.12 0.28
EXAMPLE: MARGINAL ENTROPY
Marginal Distribution:
X sun rain
P(X) 0.6 0.4
H(X) = Σ_i p(x_i) log(1/p(x_i))
     = 0.6 log(1/0.6) + 0.4 log(1/0.4)
     = 0.97
EXAMPLE: JOINT ENTROPY
Joint Distribution:
hot cold
sun 0.48 0.12
rain 0.12 0.28
H(X, Y) = Σ_{i,j} p(x_i, y_j) log(1 / p(x_i, y_j))
        = 0.48 log(1/0.48) + 2 × 0.12 log(1/0.12) + 0.28 log(1/0.28)
        = 1.76
EXAMPLE: CONDITIONAL ENTROPY
Joint Distribution:
hot cold
sun 0.48 0.12
rain 0.12 0.28
Conditional Distribution:
Y hot cold
P(Y|X = sun) 0.8 0.2
Y hot cold
P(Y|X = rain) 0.3 0.7
H(Y|X) = Σ_{i,j} p(x_i, y_j) log(1 / p(y_j|x_i))
       = 0.48 log(1/0.8) + 0.12 log(1/0.2) + 0.12 log(1/0.3) + 0.28 log(1/0.7)
       = 0.79
EXAMPLE: SUMMARY
Results:
H(X) = H(Y) = 0.97
H(X, Y) = 1.76
H(Y|X) = 0.79
I(X; Y) = H(Y) − H(Y|X) = 0.18
Note that H(X, Y) = H(X) + H(Y|X) as required.
Interpreting the Results:
I(X; Y) > 0, therefore X tells us something about Y and vice
versa
H(Y|X) > 0, therefore X doesn’t tell us everything about Y
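As a check on these numbers, a short sketch (helper names mine) that recomputes every quantity directly from the joint table, using the chain rule H(Y|X) = H(X, Y) − H(X):

```python
import math

# Joint distribution p(x, y) from the example.
joint = {
    ("sun", "hot"): 0.48, ("sun", "cold"): 0.12,
    ("rain", "hot"): 0.12, ("rain", "cold"): 0.28,
}

def entropy(dist):
    """Entropy, in bits, of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

# Marginals obtained by summing the joint table.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(joint)
H_Y_given_X = H_XY - H_X        # chain rule
I_XY = H_Y - H_Y_given_X        # mutual information

print(round(H_X, 2), round(H_Y, 2))  # 0.97 0.97
print(round(H_XY, 2))                # 1.76
print(round(H_Y_given_X, 2))         # 0.79
print(round(I_XY, 2))                # 0.19 (the slides' 0.18 subtracts the rounded values)
```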
MOTIVATION RECAP
Gambling: Coins vs. Dice vs. Roulette
Prediction: Bent Coin vs. Fair Coin
Compression: How to best record a sequence of events
PREFIX CODES
Compression maps events to code words
We already saw an example when we mapped coin tosses
to unary numbers
We want a mapping which generates short encodings
One good way of doing this is prefix codes
PREFIX CODES
An encoding where no code word is a prefix of any other code word.
Example:
Event a b c d
Code Word 0 10 110 111
Previously we reserved 0 as a separator
If we use a prefix code we do not need a separator symbol
101000110111110111 = bbaacdcd
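A small sketch of a decoder for this particular prefix code, showing that the bit string splits unambiguously without any separator symbol:

```python
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {word: event for event, word in code.items()}

def decode(bits: str) -> str:
    """Greedily match code words; with a prefix code the first match is the only possible one."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode_table:
            out.append(decode_table[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a code word")
    return "".join(out)

print(decode("101000110111110111"))  # bbaacdcd
```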
DISTRIBUTION AS PREFIX CODES
Every probability distribution can be thought of as
specifying an encoding via the Information I(X)
Map each event xi to a word of length I(xi)
Table: Fair Coin
X h t
P(X) 0.5 0.5
I(X) 1 1
code(X) 1 0
DISTRIBUTION AS PREFIX CODES
Table: Fair 4-Sided Dice
X 1 2 3 4
P(X) 0.25 0.25 0.25 0.25
I(X) 2 2 2 2
code(X) 11 10 01 00
DISTRIBUTION AS PREFIX CODES
Table: Bent 4-Sided Dice
X 1 2 3 4
P(X) 0.5 0.25 0.125 0.125
I(X) 1 2 3 3
code(X) 0 10 110 111
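For the bent 4-sided die the probabilities are all powers of 1/2, so I(x) is a whole number of bits and the prefix code above matches it exactly; a brief Python check that the expected code length equals the entropy:

```python
import math

p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}    # bent 4-sided die
code = {1: "0", 2: "10", 3: "110", 4: "111"}  # prefix code from the table

for x in p:
    # Each code word has length I(x) = log2(1/p(x)), an integer here.
    assert len(code[x]) == math.log2(1.0 / p[x])

entropy = sum(q * math.log2(1.0 / q) for q in p.values())
expected_length = sum(p[x] * len(code[x]) for x in p)
print(entropy, expected_length)  # both 1.75 bits per symbol
```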
DISTRIBUTION AS PREFIX CODES
Prefix codes built from the distribution are optimal
Information is contained in the smallest possible number of characters
Information-per-character is maximized
Encoding is not always this obvious, e.g. how do we encode a bent coin?
Question: If I use a different (suboptimal) encoding, how many extra characters do I need?
KL DIVERGENCE
The expected number of additional bits required to encode samples from p using the code built for q, rather than the code built for p:
D_KL(p||q) = Σ_i p(x_i) len(code_q(x_i)) − Σ_i p(x_i) len(code_p(x_i))
           = Σ_i p(x_i) I_q(x_i) − Σ_i p(x_i) I_p(x_i)
           = Σ_i p(x_i) log(1/q(x_i)) − Σ_i p(x_i) log(1/p(x_i))
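A minimal sketch of this quantity (helper name mine), using the equivalent form Σ_i p(x_i) log(p(x_i)/q(x_i)) obtained by combining the two sums above, and measuring the cost of encoding the bent coin with the fair coin's code:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q), in bits, for distributions given as {outcome: probability}."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"h": 0.25, "t": 0.75}  # true distribution (bent coin)
q = {"h": 0.5, "t": 0.5}    # distribution whose code we actually use (fair coin)

print(round(kl_divergence(p, q), 2))  # 0.19 extra bits per toss
print(kl_divergence(p, p))            # 0.0 -- no penalty for using the right code
```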
KL DIVERGENCE
The KL Divergence is a measure of the "dissimilarity" of two distributions
If p and q are similar, then D_KL(p||q) will be small.
Common events in p will be common events in q
This means they will still have short code words
If p and q are dissimilar, then D_KL(p||q) will be large.
Common events in p may be uncommon events in q
This means commonly occurring events might be given long code words
SUMMARY
Motivation
Information
Entropy
Marginal Entropy
Joint Entropy
Conditional Entropy
Mutual Information
Compressing Information
Prefix Codes
KL Divergence
