Note to other teachers and users of these slides. Andrew would be delighted if
you found this source material useful in giving your own lectures. Feel free to
use these slides verbatim, or to modify them to fit your own needs. PowerPoint
originals are available. If you make use of a significant portion of these
slides in your own lecture, please include this message, or the following link
to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully received.




                Information Gain
                                    Andrew W. Moore
                                         Professor
                                School of Computer Science
                                Carnegie Mellon University
                                           www.cs.cmu.edu/~awm
                                             awm@cs.cmu.edu
                                               412-268-7599


              Copyright © 2001, 2003, Andrew W. Moore




                                             Bits
 You are watching a set of independent random samples of X
 You see that X has four possible values
P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4


 So you might see: BAACBADCDADDDA…
 You transmit data over a binary serial link. You can encode
 each reading with two bits (e.g. A = 00, B = 01, C = 10, D =
 11)

 0100001001001110110011111100…
 Copyright © 2001, 2003, Andrew W. Moore                                           Information Gain: Slide 2
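A quick, illustrative check of the fixed-length code above (not part of the original slides): a few lines of Python reproduce the transmitted bit string.

# Fixed 2-bit code from the slide: A = 00, B = 01, C = 10, D = 11.
code = {"A": "00", "B": "01", "C": "10", "D": "11"}
stream = "BAACBADCDADDDA"

print("".join(code[s] for s in stream))                      # 0100001001001110110011111100
print(2 * len(stream), "bits for", len(stream), "symbols")   # 2 bits per symbol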




Fewer Bits
 Someone tells you that the probabilities are not equal


P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8



 It’s possible…

       …to invent a coding for your transmission that only uses
       1.75 bits on average per symbol. How?


 Copyright © 2001, 2003, Andrew W. Moore             Information Gain: Slide 3




                                       Fewer Bits
 Someone tells you that the probabilities are not equal

P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

 It’s possible…
       …to invent a coding for your transmission that only uses
       1.75 bits on average per symbol. How?
                                           A   0
                                           B   10
                                           C   110
                                           D   111



       (This is just one of several ways)
 Copyright © 2001, 2003, Andrew W. Moore             Information Gain: Slide 4
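A few lines of Python (an illustrative check, not from the slides) confirm that this variable-length code averages 1.75 bits per symbol under the stated probabilities, and that 1.75 bits is also the entropy of the source.

from math import log2

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_bits = sum(p * len(code[s]) for s, p in probs.items())
entropy  = -sum(p * log2(p) for p in probs.values())

print(avg_bits)   # 1.75 bits per symbol on average
print(entropy)    # 1.75 -- this code exactly meets the entropy lower bound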




Fewer Bits
Suppose there are three equally likely values…

              P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3

    Here’s a naïve coding, costing 2 bits per symbol

                                           A    00
                                           B    01
                                           C    10


    Can you think of a coding that would need only 1.6 bits
    per symbol on average?

    In theory, it can in fact be done with 1.58496 bits per
    symbol.
 Copyright © 2001, 2003, Andrew W. Moore                         Information Gain: Slide 5
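The 1.58496 figure is just log2(3). One standard way to approach it in practice (a block-coding trick, not spelled out on the slide) is to encode several symbols at a time, e.g. five ternary symbols in eight bits:

from math import log2

print(log2(3))   # 1.5849625... bits per symbol is the theoretical limit

# Block-coding sketch: 3**5 = 243 possible 5-symbol blocks fit into 8 bits
# (2**8 = 256 codewords), giving 8/5 = 1.6 bits per symbol on average.
print(3 ** 5, "<=", 2 ** 8, "->", 8 / 5, "bits per symbol")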




                                  General Case
Suppose X can have one of m values… V1, V2,               … Vm

P(X=V1) = p1                     P(X=V2) = p2        ….      P(X=Vm) = pm
What’s the smallest possible number of bits, on average, per
 symbol, needed to transmit a stream of symbols drawn from
 X’s distribution? It’s
      H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm
           = −Σj pj log2 pj        (sum over j = 1 … m)


H(X) = The entropy of X
• “High Entropy” means X is from a uniform (boring) distribution
• “Low Entropy” means X is from varied (peaks and valleys) distribution
 Copyright © 2001, 2003, Andrew W. Moore                         Information Gain: Slide 6
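The formula is easy to check numerically. A minimal sketch (an illustrative helper, not from the slides) reproduces the figures used so far:

from math import log2

def entropy(probs):
    """H(X) = -sum_j p_j * log2(p_j); zero-probability values contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0      (four equally likely values)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75     (the skewed example)
print(entropy([1/3, 1/3, 1/3]))             # 1.58496  (three equally likely values)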




                                  General Case
Suppose X can have one of m values… V1, V2,               … Vm

P(X=V1) = p1                     P(X=V2) = p2        ….      P(X=Vm) = pm

What’s the smallest possible number of bits, on average, per
 symbol, needed to transmit a stream of symbols drawn from
 X’s distribution? It’s

      H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm
           = −Σj pj log2 pj        (sum over j = 1 … m)

   (High entropy: a histogram of the frequency distribution of values of X
    would be flat.  Low entropy: a histogram of the frequency distribution
    of values of X would have many lows and one or two highs.)

H(X) = The entropy of X
• “High Entropy” means X is from a uniform (boring) distribution
• “Low Entropy” means X is from varied (peaks and valleys) distribution
 Copyright © 2001, 2003, Andrew W. Moore                         Information Gain: Slide 7




                                  General Case
Suppose X can have one of m values… V1, V2,               … Vm

P(X=V1) = p1                     P(X=V2) = p2        ….      P(X=Vm) = pm

What’s the smallest possible number of bits, on average, per
 symbol, needed to transmit a stream of symbols drawn from
 X’s distribution? It’s

      H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm
           = −Σj pj log2 pj        (sum over j = 1 … m)

   (High entropy: a histogram of the frequency distribution of values of X
    would be flat, and so the values sampled from it would be all over the
    place.  Low entropy: a histogram of the frequency distribution of values
    of X would have many lows and one or two highs, and so the values sampled
    from it would be more predictable.)

H(X) = The entropy of X
• “High Entropy” means X is from a uniform (boring) distribution
• “Low Entropy” means X is from varied (peaks and valleys) distribution
 Copyright © 2001, 2003, Andrew W. Moore                                Information Gain: Slide 8




Entropy in a nut-shell




   Low Entropy                                 High Entropy




Copyright © 2001, 2003, Andrew W. Moore                                  Information Gain: Slide 9




                    Entropy in a nut-shell




   Low Entropy                                 High Entropy
                                                              ..the values (locations of
                          ..the values (locations             soup) unpredictable...
                          of soup) sampled                    almost uniformly sampled
                          entirely from within                throughout our dining room
                          the soup bowl
Copyright © 2001, 2003, Andrew W. Moore                                 Information Gain: Slide 10




Specific Conditional Entropy H(Y|X=v)
Suppose I’m trying to predict output Y and I have input X
X = College Major                     Let’s assume this reflects the true
                                      probabilities
Y = Likes “Gladiator”
                                      E.g., from this data we estimate
      X         Y
                                           • P(LikeG = Yes) = 0.5
   Math            Yes
   History         No                      • P(Major = Math & LikeG = No) = 0.25
   CS              Yes                     • P(Major = Math) = 0.5
   Math            No
                                           • P(LikeG = Yes | Major = History) = 0
   Math            No
                                      Note:
   CS              Yes
                                           • H(X) = 1.5
   History         No
                                           •H(Y) = 1
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                                Information Gain: Slide 11
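The estimates on this slide can be reproduced directly from the eight records. A small Python sketch, assuming the table is stored as (Major, LikesGladiator) pairs:

from math import log2
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    """Empirical entropy of a list of symbols."""
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

majors = [x for x, _ in rows]
likes  = [y for _, y in rows]

print(likes.count("Yes") / len(rows))   # P(LikeG = Yes) = 0.5
print(entropy(majors))                  # H(X) = 1.5
print(entropy(likes))                   # H(Y) = 1.0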




     Specific Conditional Entropy H(Y|X=v)
                                      Definition of Specific Conditional
X = College Major
                                      Entropy:
Y = Likes “Gladiator”
                                      H(Y |X=v) = The entropy of Y
                                      among only those records in which
        X               Y
                                      X has value v
   Math            Yes
   History         No
   CS              Yes
   Math            No
   Math            No
   CS              Yes
   History         No
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                                Information Gain: Slide 12




Specific Conditional Entropy H(Y|X=v)
                                      Definition of Specific Conditional
X = College Major
                                      Entropy:
Y = Likes “Gladiator”
                                      H(Y |X=v) = The entropy of Y
                                      among only those records in which
        X               Y
                                      X has value v
   Math            Yes
                                      Example:
   History         No
   CS              Yes
                                      • H(Y|X=Math) = 1
   Math            No
                                      • H(Y|X=History) = 0
   Math            No
                                      • H(Y|X=CS) = 0
   CS              Yes
   History         No
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                           Information Gain: Slide 13
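The same calculation as a Python sketch (same assumed record layout as before): filter the records to those with X = v, then take the entropy of the Y column.

from math import log2
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def specific_cond_entropy(rows, v):
    """H(Y | X = v): entropy of Y over only the records where X has value v."""
    return entropy([y for x, y in rows if x == v])

for v in ("Math", "History", "CS"):
    print(v, specific_cond_entropy(rows, v))   # Math 1.0, History 0.0, CS 0.0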




            Conditional Entropy H(Y|X)
X = College Major                     Definition of Conditional
                                      Entropy:
Y = Likes “Gladiator”

                                      H(Y |X) = The average specific
        X               Y
                                      conditional entropy of Y
   Math            Yes
                                      = if you choose a record at random what
   History         No
                                      will be the conditional entropy of Y,
   CS              Yes
                                      conditioned on that row’s value of X
   Math            No
   Math            No                 = Expected number of bits to transmit Y if
   CS              Yes                both sides will know the value of X
   History         No
                                      = Σj Prob(X=vj) H(Y | X = vj)
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                           Information Gain: Slide 14




Conditional Entropy
                                     Definition of Conditional Entropy:
X = College Major
Y = Likes “Gladiator”
                                           H(Y|X) = The average conditional
                                           entropy of Y
                                           = ΣjProb(X=vj) H(Y | X = vj)
        X               Y
                                           Example:
   Math            Yes
   History         No                         vj      Prob(X=vj)    H(Y | X = vj)
   CS              Yes
                                           Math       0.5           1
   Math            No
                                           History    0.25          0
   Math            No
                                           CS         0.25          0
   CS              Yes
   History         No
                                     H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                                Information Gain: Slide 15
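The weighted average in this table translates directly into code. A sketch under the same assumptions as the earlier snippets:

from math import log2
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cond_entropy(rows):
    """H(Y|X) = sum_j P(X = v_j) * H(Y | X = v_j)."""
    n = len(rows)
    return sum(count / n * entropy([y for x, y in rows if x == v])
               for v, count in Counter(x for x, _ in rows).items())

print(cond_entropy(rows))   # 0.5*1 + 0.25*0 + 0.25*0 = 0.5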




                            Information Gain
                                     Definition of Information Gain:
X = College Major
Y = Likes “Gladiator”
                                            IG(Y|X) = I must transmit Y.
                                           How many bits on average
                                           would it save me if both ends of
                                           the line knew X?
        X               Y
                                           IG(Y|X) = H(Y) - H(Y | X)
   Math            Yes
   History         No
                                           Example:
   CS              Yes
   Math            No                        • H(Y) = 1
   Math            No
                                             • H(Y|X) = 0.5
   CS              Yes
                                             • Thus IG(Y|X) = 1 – 0.5 = 0.5
   History         No
   Math            Yes
 Copyright © 2001, 2003, Andrew W. Moore                                Information Gain: Slide 16
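Putting the pieces together, information gain is just the drop in entropy. A sketch under the same assumptions:

from math import log2
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cond_entropy(rows):
    n = len(rows)
    return sum(count / n * entropy([y for x, y in rows if x == v])
               for v, count in Counter(x for x, _ in rows).items())

def info_gain(rows):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy([y for _, y in rows]) - cond_entropy(rows)

print(info_gain(rows))   # 1.0 - 0.5 = 0.5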




Information Gain Example




Copyright © 2001, 2003, Andrew W. Moore      Information Gain: Slide 17




                           Another example




Copyright © 2001, 2003, Andrew W. Moore      Information Gain: Slide 18




                        Relative Information Gain
X = College Major
Y = Likes “Gladiator”

                                      Definition of Relative Information Gain:
        X               Y
   Math            Yes                RIG(Y|X) = I must transmit Y. What
   History         No                 fraction of the bits on average would
   CS              Yes                it save me if both ends of the line
   Math            No                 knew X?
   Math            No
   CS              Yes                RIG(Y|X) = (H(Y) - H(Y | X)) / H(Y)
   History         No
   Math            Yes                Example:
                                        • H(Y|X) = 0.5
                                        • H(Y) = 1
                                        • Thus RIG(Y|X) = (1 – 0.5)/1 = 0.5
 Copyright © 2001, 2003, Andrew W. Moore                          Information Gain: Slide 19
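Relative information gain just normalizes the same quantity by H(Y). A compact sketch, again assuming the (Major, LikesGladiator) record layout:

from math import log2
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

h_y   = entropy([y for _, y in rows])                                 # H(Y)   = 1.0
h_y_x = sum(c / len(rows) * entropy([y for x, y in rows if x == v])   # H(Y|X) = 0.5
            for v, c in Counter(x for x, _ in rows).items())

print((h_y - h_y_x) / h_y)   # RIG(Y|X) = (1.0 - 0.5) / 1.0 = 0.5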




    What is Information Gain used for?
 Suppose you are trying to predict whether someone
 is going to live past 80 years. From historical data you
 might find…
       •IG(LongLife | HairColor) = 0.01
       •IG(LongLife | Smoker) = 0.2
       •IG(LongLife | Gender) = 0.25
       •IG(LongLife | LastDigitOfSSN) = 0.00001
 IG tells you how interesting a 2-d contingency table is
 going to be.
 Copyright © 2001, 2003, Andrew W. Moore                          Information Gain: Slide 20
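In practice such gains are compared across candidate attributes, and the attribute with the largest IG is the most useful single predictor (this is how decision-tree learners choose their splits). A tiny sketch ranking the slide's illustrative numbers:

# The slide's illustrative gains for predicting LongLife, ranked most
# informative first.
gains = {"HairColor": 0.01, "Smoker": 0.2, "Gender": 0.25, "LastDigitOfSSN": 0.00001}

for attr, ig in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{attr:>15}  IG = {ig}")
# Gender and Smoker promise interesting contingency tables; LastDigitOfSSN does not.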



