Fundamentals of Information Theory and Coding Design (Discrete Mathematics and Its Applications)
Series Editor: Kenneth H. Rosen, Ph.D., AT&T Laboratories, Middletown, New Jersey
DISCRETE MATHEMATICS AND ITS APPLICATIONS
Series Editor KENNETH H. ROSEN

FUNDAMENTALS of INFORMATION THEORY and CODING DESIGN

Roberto Togneri
Christopher J.S. deSilva

CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton London New York Washington, D.C.
Includes bibliographical references and index.
Information theory. Coding theory. I. deSilva, Christopher J. S. II. Title. III. CRC Press series on discrete mathematics and its applications.

This edition published in the Taylor & Francis e-Library, 2006.
Preface

What is information? How do we quantify or measure the amount of information that is present in a file of data, or a string of text? How do we encode the information so that it can be stored efficiently, or transmitted reliably?

The main concepts and principles of information theory were developed by Claude E. Shannon in the 1940s. Yet only now, thanks to the emergence of the information age and digital communication, are the ideas of information theory being looked at again in a new light. Because of information theory and the results arising from coding theory we now know how to quantify information, how we can efficiently encode it and how reliably we can transmit it.

This book introduces the main concepts behind how we model information sources and channels, how we code sources for efficient storage and transmission, and the fundamentals of coding theory and applications to state-of-the-art error correcting and error detecting codes.

This textbook has been written for upper level undergraduate students and graduate students in mathematics, engineering and computer science. Most of the material presented in this text was developed over many years at The University of Western Australia in the unit Information Theory and Coding 314, which was a core unit for students majoring in Communications and Electrical and Electronic Engineering, and was a unit offered to students enrolled in the Master of Engineering by Coursework and Dissertation in the Intelligent Information Processing Systems course.

The number of books on the market dealing with information theory and coding has been on the rise over the past five years. However, very few, if any, of these books have been able to cover the fundamentals of the theory without losing the reader in complex mathematical abstractions. Fewer still are able to provide the important theoretical framework when discussing the algorithms and implementation details of modern coding systems. This book does not abandon the theoretical foundations of information and coding theory, and it presents working algorithms and implementations which can be used to fabricate and design real systems. The main emphasis is on the underlying concepts that govern information theory and the necessary mathematical background that describes modern coding systems. One of the strengths of the book is the many worked examples that appear throughout, which allow the reader to immediately understand the concept being explained, or the algorithm being described. These are backed up by fairly comprehensive exercise sets at the end of each chapter (including exercises identified by an * which are more advanced or challenging).
The material in the book has been selected for completeness and to present a balanced coverage. There is discussion of cascading of information channels and additivity of information which is rarely found in modern texts. Arithmetic coding is fully explained with worked examples for both encoding and decoding. The connection between coding of extensions and Markov modelling is clearly established (this is usually not apparent in other textbooks). Three complete chapters are devoted to block codes for error detection and correction. A large part of these chapters deals with an exposition of the concepts from abstract algebra that underpin the design of these codes. We decided that this material should form part of the main text (rather than be relegated to an appendix) to emphasise the importance of understanding the mathematics of these and other advanced coding strategies.

Chapter 1 introduces the concepts of entropy and information sources and explains how information sources are modelled. In Chapter 2 this analysis is extended to information channels where the concept of mutual information is introduced and channel capacity is discussed. Chapter 3 covers source coding for efficient storage and transmission with an introduction to the theory and main concepts, a discussion of Shannon's Noiseless Coding Theorem and details of the Huffman and arithmetic coding algorithms. Chapter 4 provides the basic principles behind the various compression algorithms including run-length coding and dictionary coders. Chapter 5 introduces the fundamental principles of channel coding, the importance of the Hamming distance in the analysis and design of codes and a statement of what Shannon's Fundamental Coding Theorem tells us we can do with channel codes. Chapter 6 introduces the algebraic concepts of groups, rings, fields and linear spaces over the binary field and introduces binary block codes. Chapter 7 provides the details of the theory of rings of polynomials and cyclic codes and describes how to analyse and design various linear cyclic codes including Hamming codes, Cyclic Redundancy Codes and Reed-Muller codes. Chapter 8 deals with burst-correcting codes and describes the design of Fire codes, BCH codes and Reed-Solomon codes. Chapter 9 completes the discussion on channel coding by describing the convolutional encoder, decoding of convolutional codes, trellis modulation and Turbo codes.

This book can be used as a textbook for a one semester undergraduate course in information theory and source coding (all of Chapters 1 to 4), a one semester graduate course in coding theory (all of Chapters 5 to 9) or as part of a one semester undergraduate course in communications systems covering information theory and coding (selected material from Chapters 1, 2, 3, 5, 6 and 7).

We would like to thank Sean Davey and Nishith Arora for their help with the LaTeX formatting of the manuscript. We would also like to thank Ken Rosen for his review of our draft manuscript and his many helpful suggestions and Sunil Nair from CRC Press for encouraging us to write this book in the first place!

Our examples on arithmetic coding were greatly facilitated by the use of the conversion calculator (which is one of the few that can handle fractions!) made available by www.math.com.

The manuscript was written in LaTeX and we are indebted to the open source software community for developing such a powerful text processing environment. We are especially grateful to the developers of LyX (www.lyx.org) for making writing the document that much more enjoyable and to the makers of xfig (www.xfig.org) for providing such an easy-to-use drawing package.

Roberto Togneri
Chris deSilva
Chapter 1
Entropy and Information

1.1 Structure

Structure is a concept of which we all have an intuitive understanding. However, it is not easy to articulate that understanding and give a precise definition of what structure is. We might try to explain structure in terms of such things as regularity, predictability, symmetry and permanence. We might also try to describe what structure is not, using terms such as featureless, random, chaotic, transient and aleatory.

Part of the problem of trying to define structure is that there are many different kinds of behaviour and phenomena which might be described as structured, and finding a definition that covers all of them is very difficult.

Consider the distribution of the stars in the night sky. Overall, it would appear that this distribution is random, without any structure. Yet people have found patterns in the stars and imposed a structure on the distribution by naming constellations.

Again, consider what would happen if you took the pixels on the screen of your computer when it was showing a complicated and colourful scene and strung them out in a single row. The distribution of colours in this single row of pixels would appear to be quite arbitrary, yet the complicated pattern of the two-dimensional array of pixels would still be there.

These two examples illustrate the point that we must distinguish between the presence of structure and our perception of structure. In the case of the constellations, the structure is imposed by our brains. In the case of the picture on our computer screen, we can only see the pattern if the pixels are arranged in a certain way.

Structure relates to the way in which things are put together, the way in which the parts make up the whole. Yet there is a difference between the structure of, say, a bridge and that of a piece of music. The parts of the Golden Gate Bridge or the Sydney Harbour Bridge are solid and fixed in relation to one another. Seeing one part of the bridge gives you a good idea of what the rest of it looks like.

The structure of pieces of music is quite different. The notes of a melody can be arranged according to the whim or the genius of the composer. Having heard part of the melody you cannot be sure of what the next note is going to be, let alone any other part of the melody. In fact, pieces of music often have a complicated, multi-layered structure, which is not obvious to the casual listener.

In this book, we are going to be concerned with things that have structure. The kinds of structure we will be concerned with will be like the structure of pieces of music. They will not be fixed and obvious.

1.2 Structure in Randomness

Structure may be present in phenomena that appear to be random. When it is present, it makes the phenomena more predictable. Nevertheless, the fact that randomness is present means that we have to talk about the phenomena in terms of probabilities.

Let us consider a very simple example of how structure can make a random phenomenon more predictable. Suppose we have a fair die. The probability of any face coming up when the die is thrown is 1/6. In this case, it is not possible to predict which face will come up more than one-sixth of the time, on average.

On the other hand, if we have a die that has been biased, this introduces some structure into the situation. Suppose that the biasing has the effect of making the probability of the face with six spots coming up 55/100, the probability of the face with one spot coming up 5/100 and the probability of any other face coming up 1/10. Then the prediction that the face with six spots will come up will be right more than half the time, on average.

Another example of structure in randomness that facilitates prediction arises from phenomena that are correlated. If we have information about one of the phenomena, we can make predictions about the other. For example, we know that the IQ of identical twins is highly correlated. In general, we cannot make any reliable prediction about the IQ of one of a pair of twins. But if we know the IQ of one twin, we can make a reliable prediction of the IQ of the other.

In order to talk about structure in randomness in quantitative terms, we need to use probability theory.

1.3 First Concepts of Probability Theory

To describe a phenomenon in terms of probability theory, we need to define a set of outcomes, which is called the sample space. For the present, we will restrict consideration to sample spaces which are finite sets.
DEFINITION 1.1 Probability Distribution
A probability distribution on a sample space $S = \{s_1, s_2, \ldots, s_N\}$ is a function $P$ that assigns a probability to each outcome in the sample space. $P$ is a map from $S$ to the unit interval, $P : S \to [0, 1]$, which must satisfy $\sum_{i=1}^{N} P(s_i) = 1$.

DEFINITION 1.2 Events
Events are subsets of the sample space.

We can extend a probability distribution $P$ from $S$ to the set of all subsets of $S$, which we denote by $\mathcal{P}(S)$, by setting $P(E) = \sum_{s \in E} P(s)$ for any $E \in \mathcal{P}(S)$. Note that $P(\emptyset) = 0$.

An event whose probability is $0$ is impossible and an event whose probability is $1$ is certain to occur.

If $E$ and $F$ are events and $E \cap F = \emptyset$ then $P(E \cup F) = P(E) + P(F)$.

DEFINITION 1.3 Expected Value
If $S = \{s_1, s_2, \ldots, s_N\}$ is a sample space with probability distribution $P$, and $f : S \to V$ is a function from the sample space to a vector space $V$, the expected value of $f$ is $\sum_{i=1}^{N} P(s_i) f(s_i)$.

NOTE We will often have equations that involve summation over the elements of a finite set. In the equations above, the set has been $S = \{s_1, s_2, \ldots, s_N\}$ and the summation has been denoted by $\sum_{i=1}^{N}$. In other places in the text we will denote such summations simply by $\sum_{s \in S}$.

1.4 Surprise and Entropy

In everyday life, events can surprise us. Usually, the more unlikely or unexpected an event is, the more surprising it is. We can quantify this idea using a probability distribution.

DEFINITION 1.4 Surprise
If $E$ is an event in a sample space $S$, we define the surprise of $E$ to be $s(E) = -\log(P(E)) = \log(1/P(E))$.

Events for which $P(E) = 1$, which are certain to occur, have zero surprise, as we would expect, and events that are impossible, that is, for which $P(E) = 0$, have infinite surprise.
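The definitions above translate directly into a few lines of code. The following is a minimal sketch, assuming a sample space stored as a Python dictionary from outcomes to probabilities and logarithms taken to base 2; the function names `event_probability`, `expected_value` and `surprise` are illustrative choices, not notation from the text.

```python
import math

def event_probability(dist, event):
    """P(E): sum of the probabilities of the outcomes in the event E."""
    return sum(dist[s] for s in event)

def expected_value(dist, f):
    """Expected value of f: sum of P(s) * f(s) over the sample space."""
    return sum(p * f(s) for s, p in dist.items())

def surprise(p):
    """Surprise -log2(p), in bits, of an event with probability p."""
    if p == 0:
        return math.inf  # impossible events have infinite surprise
    if p == 1:
        return 0.0       # certain events have zero surprise
    return -math.log2(p)

# Fair die: six outcomes, each with probability 1/6.
fair_die = {face: 1 / 6 for face in range(1, 7)}
even = {2, 4, 6}  # the event "an even number of spots"

p_even = event_probability(fair_die, even)          # close to 0.5
mean_spots = expected_value(fair_die, lambda s: s)  # close to 3.5
s_even = surprise(0.5)                              # 1 bit
```

Here the expected value is taken of the identity function on the outcomes, so the vector space of Definition 1.3 is simply the real numbers.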
Defining the surprise as the negative logarithm of the probability not only gives us the appropriate limiting values as the probability tends to $0$ or $1$, it also makes surprise additive. If several independent events occur in succession, the total surprise they generate is the sum of their individual surprises.

DEFINITION 1.5 Entropy
We can restrict the surprise to the sample space and consider it to be a function from the sample space to the real numbers. The expected value of the surprise is the entropy of the probability distribution.

If the sample space is $S = \{s_1, s_2, \ldots, s_N\}$, with probability distribution $P$, the entropy of the probability distribution is given by

$$H(P) = -\sum_{i=1}^{N} P(s_i) \log(P(s_i)) \tag{1.1}$$

The concept of entropy was introduced into thermodynamics in the nineteenth century. It was considered to be a measure of the extent to which a system was disordered. The tendency of systems to become more disordered over time is described by the Second Law of Thermodynamics, which states that the entropy of a system cannot spontaneously decrease. In the 1940s, Shannon introduced the concept into communications theory and founded the subject of information theory. It was then realised that entropy is a property of any stochastic system and the concept is now used widely in many fields. Today, information theory is still principally concerned with communications systems, but there are widespread applications in statistics, information processing and computing.

Let us consider some examples of probability distributions and see how the entropy is related to predictability. First, let us note the form of the function $s(p) = -p \log(p)$ where $0 \le p \le 1$ and $\log$ denotes the logarithm to base 2. (The actual base does not matter, but we shall be using base 2 throughout the rest of this book, so we may as well start here.) The graph of this function is shown in Figure 1.1.

Note that $-p \log(p)$ approaches $0$ as $p$ tends to $0$ and also as $p$ tends to $1$. This means that outcomes that are almost certain to occur and outcomes that are unlikely to occur both contribute little to the entropy. Outcomes whose probability is close to $0.4$ make a comparatively large contribution to the entropy.

EXAMPLE 1.1
$S = \{s_1, s_2\}$ with $P(s_1) = 0.5 = P(s_2)$. The entropy is

$$H(P) = -(0.5)(-1) - (0.5)(-1) = 1$$

In this case, $s_1$ and $s_2$ are equally likely to occur and the situation is as unpredictable as it can be.
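Equation (1.1) is short enough to evaluate directly. The sketch below is one possible way to do so in Python, assuming base-2 logarithms and a distribution given as a plain list of probabilities; the function name `entropy` is an illustrative choice. Terms with zero probability are skipped, which matches the limiting behaviour of $-p \log(p)$ as $p$ tends to $0$.

```python
import math

def entropy(probs):
    """Entropy in bits: sum of -p * log2(p), skipping zero-probability terms."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]                             # Example 1.1
fair_die = [1 / 6] * 6                             # fair die of Section 1.2
biased_die = [0.05, 0.10, 0.10, 0.10, 0.10, 0.55]  # biased die of Section 1.2

h_coin = entropy(fair_coin)
h_fair = entropy(fair_die)
h_biased = entropy(biased_die)
```

Comparing `h_fair` with `h_biased` shows the effect described in Section 1.2: the structure introduced by biasing the die makes it more predictable, and its entropy is lower than that of the fair die.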
[FIGURE 1.1: The graph of $-p \log(p)$. Horizontal axis: $p$ from 0 to 1; vertical axis: $-p \log(p)$ from 0 to 0.6.]

EXAMPLE 1.2
$S = \{s_1, s_2\}$ with $P(s_1) = 0.96875$, and $P(s_2) = 0.03125$. The entropy is

$$H(P) = -(0.96875)(-0.0458) - (0.03125)(-5) \approx 0.2006$$

In this case, the situation is more predictable, with $s_1$ more than thirty times more likely to occur than $s_2$. The entropy is close to zero.

EXAMPLE 1.3
$S = \{s_1, s_2\}$ with $P(s_1) = 1.0$, and $P(s_2) = 0.0$. Using the convention that $0 \log(0) = 0$, the entropy is $0$. The situation is entirely predictable, as $s_1$ always occurs.

EXAMPLE 1.4
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_i) = 1/6$ for $i = 1, 2, \ldots, 6$. The entropy is $2.585$ and the situation is as unpredictable as it can be.

EXAMPLE 1.5
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_1) = 0.995$, $P(s_i) = 0.001$ for $i = 2, 3, \ldots, 6$. The entropy is $0.057$ and the situation is fairly predictable as $s_1$ will occur far more frequently than any other outcome.

EXAMPLE 1.6
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_1) = 0.498 = P(s_2)$, $P(s_i) = 0.001$ for $i = 3, 4, 5, 6$. The entropy is $1.042$ and the situation is about as predictable as in Example 1.1 above, with outcomes $s_1$ and $s_2$ equally likely to occur and the others very unlikely to occur.

Roughly speaking, a system whose entropy is $h$ is about as unpredictable as a system with $2^h$ equally likely outcomes.

1.5 Units of Entropy

The units in which entropy is measured depend on the base of the logarithms used to calculate it. If we use logarithms to the base 2, then the unit is the bit. If we use natural logarithms (base $e$), the entropy is measured in natural units, sometimes referred to as nits (also called nats). Converting between the different units is simple.

PROPOSITION 1.1
If $H_e$ is the entropy of a probability distribution measured using natural logarithms, and $H_r$ is the entropy of the same probability distribution measured using logarithms to the base $r$, then

$$H_r = \frac{H_e}{\ln(r)} \tag{1.2}$$

PROOF Let the sample space be $S = \{s_1, s_2, \ldots, s_N\}$, with probability distribution $P$. For any positive number $x$,

$$\ln(x) = \ln(r) \log_r(x) \tag{1.3}$$

It follows that

$$H_r(P) = -\sum_{i=1}^{N} P(s_i) \log_r(P(s_i)) = -\sum_{i=1}^{N} P(s_i) \frac{\ln(P(s_i))}{\ln(r)} = \frac{-\sum_{i=1}^{N} P(s_i) \ln(P(s_i))}{\ln(r)} = \frac{H_e(P)}{\ln(r)} \tag{1.4}$$

1.6 The Minimum and Maximum Values of Entropy

If we have a sample space $S$ with $N$ elements, and probability distribution $P$ on $S$, it is convenient to denote the probability of $s_i \in S$ by $p_i$. We can construct a vector in $\mathbb{R}^N$ consisting of the probabilities:

$$\mathbf{p} = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_N \end{bmatrix}$$

Because the probabilities are non-negative and have to add up to unity, the set of all probability distributions forms a simplex in $\mathbb{R}^N$, namely

$$K = \left\{ \mathbf{p} \in \mathbb{R}^N : p_i \ge 0, \ \sum_{i=1}^{N} p_i = 1 \right\}$$

We can consider the entropy to be a function defined on this simplex. Since it is a continuous and concave function, its minimum value will occur at the vertices of this simplex, at points where all except one of the probabilities are zero. If $\mathbf{p}_v$ is a vertex, then the entropy there will be

$$H(\mathbf{p}_v) = -\left[ (N - 1) \cdot 0 \log(0) + 1 \cdot \log(1) \right]$$

The logarithm of zero is not defined, but the limit of $x \log(x)$ as $x$ tends to $0$ exists and is equal to zero. If we take the limiting values, we see that at any vertex, $H(\mathbf{p}_v) = 0$, as $\log(1) = 0$. This is the minimum value of the entropy function.

The entropy function has a maximum value at an interior point of the simplex. To find it we can use Lagrange multipliers.

THEOREM 1.1
If we have a sample space with $N$ elements, the maximum value of the entropy function is $\log(N)$.
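PROOF We want to find the maximum value of

$$H(\mathbf{p}) = -\sum_{i=1}^{N} p_i \log(p_i) \tag{1.5}$$

subject to the constraint

$$\sum_{i=1}^{N} p_i = 1 \tag{1.6}$$

We introduce the Lagrange multiplier $\lambda$, and put

$$F(\mathbf{p}) = H(\mathbf{p}) + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right) \tag{1.7}$$

To find the maximum value we have to solve

$$\frac{\partial F}{\partial p_i} = 0 \tag{1.8}$$

for $i = 1, 2, \ldots, N$ and

$$\sum_{i=1}^{N} p_i = 1 \tag{1.9}$$

Taking $\log$ to be the natural logarithm for the purposes of differentiation (by Proposition 1.1, changing the base only rescales $H$ by a constant factor and does not move the maximum), we have

$$\frac{\partial F}{\partial p_i} = -\log(p_i) - 1 + \lambda \tag{1.10}$$

so

$$p_i = e^{\lambda - 1} \tag{1.11}$$

for each $i$. The remaining condition gives

$$N e^{\lambda - 1} = 1 \tag{1.12}$$

which can be solved for $\lambda$, or can be used directly to give

$$p_i = \frac{1}{N} \tag{1.13}$$

for all $i$. Using these values for the $p_i$, we get

$$H(\mathbf{p}) = -N \cdot \frac{1}{N} \log(1/N) = \log(N) \tag{1.14}$$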
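The bounds on the entropy can also be checked numerically. The following sketch samples random probability distributions on $N = 6$ outcomes and compares the largest entropy found with $\log_2(N)$. It is a sanity check under stated assumptions rather than a proof: the sampling scheme, the seed and the helper name `random_distribution` are arbitrary illustrative choices.

```python
import math
import random

def entropy(probs):
    """Entropy in bits, with zero-probability terms contributing 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """Normalise n positive random weights so that they sum to 1."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

N = 6
rng = random.Random(0)  # fixed seed so that the run is repeatable
bound = math.log2(N)    # maximum value given by Theorem 1.1

largest = max(entropy(random_distribution(N, rng)) for _ in range(1000))
h_uniform = entropy([1 / N] * N)             # attains the bound
h_vertex = entropy([1.0] + [0.0] * (N - 1))  # a vertex of the simplex
```

None of the sampled distributions exceeds the bound, the uniform distribution attains it, and a vertex of the simplex gives the minimum value of zero.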
1.7 A Useful Inequality

LEMMA 1.1
If $p_1, p_2, \ldots, p_N$ and $q_1, q_2, \ldots, q_N$ are all non-negative numbers that satisfy the conditions $\sum_{i=1}^{N} p_i = 1$ and $\sum_{i=1}^{N} q_i = 1$, then

$$-\sum_{i=1}^{N} p_i \log(p_i) \le -\sum_{i=1}^{N} p_i \log(q_i) \tag{1.15}$$

with equality if and only if $p_i = q_i$ for all $i$.

PROOF We prove the result for the natural logarithm; the result for any other base follows immediately from the identity

$$\ln(x) = \ln(r) \log_r(x) \tag{1.16}$$

It is a standard result about the logarithm function that

$$\ln x \le x - 1 \tag{1.17}$$

for $x > 0$, with equality if and only if $x = 1$. Substituting $x = q_i / p_i$, we get

$$\ln(q_i / p_i) \le q_i / p_i - 1 \tag{1.18}$$

with equality if and only if $p_i = q_i$. This holds for all $i = 1, 2, \ldots, N$, so if we multiply by $p_i$ and sum over the $i$, we get

$$\sum_{i=1}^{N} p_i \ln(q_i / p_i) \le \sum_{i=1}^{N} (q_i - p_i) = \sum_{i=1}^{N} q_i - \sum_{i=1}^{N} p_i = 1 - 1 = 0 \tag{1.19}$$

with equality if and only if $p_i = q_i$ for all $i$. So

$$\sum_{i=1}^{N} p_i \ln(q_i) - \sum_{i=1}^{N} p_i \ln(p_i) \le 0 \tag{1.20}$$

which is the required result.

The inequality can also be written in the form

$$\sum_{i=1}^{N} p_i \log(q_i / p_i) \le 0 \tag{1.21}$$

with equality if and only if $p_i = q_i$ for all $i$.

Note that putting $q_i = 1/N$ for all $i$ in this inequality gives us an alternative proof that the maximum value of the entropy function is $\log(N)$.
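The alternative proof mentioned in the note takes only a few lines. The following is a sketch of the steps, written in LaTeX and using the convention that terms with $p_i = 0$ contribute zero:

```latex
% Lemma 1.1 in the form (1.21), with q_i = 1/N for every i
\begin{align*}
0 &\ge \sum_{i=1}^{N} p_i \log\!\left(\frac{1/N}{p_i}\right) \\
  &= -\sum_{i=1}^{N} p_i \log(p_i) - \log(N) \sum_{i=1}^{N} p_i \\
  &= H(\mathbf{p}) - \log(N)
\end{align*}
% Hence H(p) <= log(N), with equality if and only if p_i = 1/N for all i.
```

The last step uses the fact that the $p_i$ sum to 1, and the equality condition is inherited directly from the lemma.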
1.8 Joint Probability Distribution Functions

There are many situations in which it is useful to consider sample spaces that are the Cartesian product of two or more sets.

DEFINITION 1.6 Cartesian Product
Let $S = \{s_1, s_2, \ldots, s_M\}$ and $T = \{t_1, t_2, \ldots, t_N\}$ be two sets. The Cartesian product of $S$ and $T$ is the set $S \times T = \{(s_i, t_j) : 1 \le i \le M, \ 1 \le j \le N\}$.

The extension to the Cartesian product of more than two sets is immediate.

DEFINITION 1.7 Joint Probability Distribution
A joint probability distribution is a probability distribution on the Cartesian product of a number of sets.

If we have $S$ and $T$ as above, then a joint probability distribution function assigns a probability to each pair $(s_i, t_j)$. We can denote this probability by $p_{ij}$. Since these values form a probability distribution, we have

$$0 \le p_{ij} \le 1 \tag{1.22}$$

for $1 \le i \le M$, $1 \le j \le N$, and

$$\sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij} = 1 \tag{1.23}$$

If $P$ is the joint probability distribution function on $S \times T$, the definition of entropy becomes

$$H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i, t_j)) = -\sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij} \log(p_{ij}) \tag{1.24}$$

If we want to emphasise the spaces $S$ and $T$, we will denote the entropy of the joint probability distribution on $S \times T$ by $H(P_{S \times T})$ or simply by $H(S, T)$. This is known as the joint entropy of $S$ and $T$.

If there are probability distributions $P_S$ and $P_T$ on $S$ and $T$, respectively, and these are independent, the joint probability distribution on $S \times T$ is given by

$$p_{ij} = P_S(s_i) P_T(t_j) \tag{1.25}$$

for $1 \le i \le M$, $1 \le j \le N$. If there are correlations between the $s_i$ and $t_j$, then this formula does not apply.

DEFINITION 1.8 Marginal Distribution
If $P$ is a joint probability distribution function on $S \times T$, the marginal distribution on $S$ is $P_S : S \to [0, 1]$ given by

$$P_S(s_i) = \sum_{j=1}^{N} P(s_i, t_j) \tag{1.26}$$

for $1 \le i \le M$ and the marginal distribution on $T$ is $P_T : T \to [0, 1]$ given by

$$P_T(t_j) = \sum_{i=1}^{M} P(s_i, t_j) \tag{1.27}$$

for $1 \le j \le N$.

There is a simple relationship between the entropy of the joint probability distribution function and that of the marginal distribution functions.

THEOREM 1.2
If $P$ is a joint probability distribution function on $S \times T$, and $P_S$ and $P_T$ are the marginal distributions on $S$ and $T$, respectively, then

$$H(P) \le H(P_S) + H(P_T) \tag{1.28}$$

with equality if and only if the marginal distributions are independent.

PROOF

$$H(P_S) = -\sum_{i=1}^{M} P_S(s_i) \log(P_S(s_i)) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_S(s_i)) \tag{1.29}$$

and similarly

$$H(P_T) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_T(t_j)) \tag{1.30}$$

So

$$H(P_S) + H(P_T) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \left[ \log(P_S(s_i)) + \log(P_T(t_j)) \right] = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_S(s_i) P_T(t_j)) \tag{1.31}$$

Also,

$$H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i, t_j)) \tag{1.32}$$

Since

$$\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) = 1 \tag{1.33}$$

and

$$\sum_{i=1}^{M} \sum_{j=1}^{N} P_S(s_i) P_T(t_j) = \sum_{i=1}^{M} P_S(s_i) \sum_{j=1}^{N} P_T(t_j) = 1 \tag{1.34}$$

we can use the inequality of Lemma 1.1 to conclude that

$$H(P) \le H(P_S) + H(P_T) \tag{1.35}$$

with equality if and only if $P(s_i, t_j) = P_S(s_i) P_T(t_j)$ for all $i$ and $j$, that is, if the two marginal distributions are independent.

1.9 Conditional Probability and Bayes' Theorem

DEFINITION 1.9 Conditional Probability
If $S$ is a sample space with a probability distribution function $P$, and $E$ and $F$ are events in $S$ with $P(F) > 0$, the conditional probability of $E$ given $F$ is

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)} \tag{1.36}$$

It is obvious that

$$P(E \mid F) P(F) = P(E \cap F) = P(F \mid E) P(E) \tag{1.37}$$

Almost as obvious is one form of Bayes' Theorem:

THEOREM 1.3
If $S$ is a sample space with a probability distribution function $P$, and $E$ and $F$ are events in $S$, then

$$P(E \mid F) = \frac{P(F \mid E) P(E)}{P(F)} \tag{1.38}$$
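The step from (1.37) to (1.38) is a single division. As a brief sketch in LaTeX, writing the two events as $E$ and $F$ (names assumed here for illustration) and assuming $P(F) > 0$:

```latex
% From (1.37), assuming P(F) > 0 so that division is allowed
\begin{align*}
P(E \mid F)\, P(F) &= P(F \mid E)\, P(E) \\
P(E \mid F) &= \frac{P(F \mid E)\, P(E)}{P(F)}
\end{align*}
```

Both sides of the first line are equal to the probability that the two events occur together, which is why the roles of the events can be exchanged.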
Bayes' Theorem is important because it enables us to derive probabilities of hypotheses from observations, as in the following example.

EXAMPLE 1.7
We have two jars, A and B. Jar A contains 8 green balls and 2 red balls. Jar B contains 3 green balls and 7 red balls. One jar is selected at random and a ball is drawn from it.

We have probabilities as follows. The set of jars forms one sample space, $S = \{A, B\}$, with

$$P(A) = 0.5 = P(B)$$

as one jar is as likely to be chosen as the other.

The set of colours forms another sample space, $T = \{G, R\}$. The probability of drawing a green ball is

$$P(G) = 11/20 = 0.55$$

as 11 of the 20 balls in the jars are green. Similarly,

$$P(R) = 9/20 = 0.45$$

We have a joint probability distribution over the colours of the balls and the jars with the probability of selecting Jar A and drawing a green ball being given by

$$P(G, A) = 0.4$$

Similarly, we have the probability of selecting Jar A and drawing a red ball

$$P(R, A) = 0.1$$

the probability of selecting Jar B and drawing a green ball

$$P(G, B) = 0.15$$

and the probability of selecting Jar B and drawing a red ball

$$P(R, B) = 0.35$$

We have the conditional probabilities: given that Jar A was selected, the probability of drawing a green ball is

$$P(G \mid A) = 0.8$$

and the probability of drawing a red ball is

$$P(R \mid A) = 0.2$$

Given that Jar B was selected, the corresponding probabilities are:

$$P(G \mid B) = 0.3$$

and

$$P(R \mid B) = 0.7$$

We can now use Bayes' Theorem to work out the probability of having drawn from either jar, given the colour of the ball that was drawn. If a green ball was drawn, the probability that it was drawn from Jar A is

$$P(A \mid G) = \frac{P(G \mid A) P(A)}{P(G)} = \frac{0.8 \times 0.5}{0.55} = 0.73$$

while the probability that it was drawn from Jar B is

$$P(B \mid G) = \frac{P(G \mid B) P(B)}{P(G)} = \frac{0.3 \times 0.5}{0.55} = 0.27$$

If a red ball was drawn, the probability that it was drawn from Jar A is

$$P(A \mid R) = \frac{P(R \mid A) P(A)}{P(R)} = \frac{0.2 \times 0.5}{0.45} = 0.22$$

while the probability that it was drawn from Jar B is

$$P(B \mid R) = \frac{P(R \mid B) P(B)}{P(R)} = \frac{0.7 \times 0.5}{0.45} = 0.78$$

(In this case, we could have derived these conditional probabilities from the joint probability distribution, but we chose not to do so to illustrate how Bayes' Theorem allows us to go from the conditional probabilities of the colours given the jar selected to the conditional probabilities of the jars selected given the colours drawn.)

1.10 Conditional Probability Distributions and Conditional Entropy

In this section, we have a joint probability distribution $P$ on a Cartesian product $S \times T$, where $S = \{s_1, s_2, \ldots, s_M\}$ and $T = \{t_1, t_2, \ldots, t_N\}$, with marginal distributions $P_S$ and $P_T$.

DEFINITION 1.10 Conditional Probability of $s_i$ given $t_j$
For $s_i \in S$ and $t_j \in T$, the conditional probability of $s_i$ given $t_j$ is

$$P(s_i \mid t_j) = \frac{P(s_i, t_j)}{P_T(t_j)} = \frac{P(s_i, t_j)}{\sum_{i=1}^{M} P(s_i, t_j)} \tag{1.39}$$
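Definition 1.10 gives the route that Example 1.7 chose not to take: obtaining the conditional probabilities directly from the joint distribution. The sketch below does this for the jar example, assuming the joint probabilities are stored in a dictionary keyed by (colour, jar); the helper names `marginal` and `conditional_given` are illustrative.

```python
# Joint probabilities from Example 1.7, keyed by (colour, jar).
joint = {
    ("G", "A"): 0.40, ("R", "A"): 0.10,
    ("G", "B"): 0.15, ("R", "B"): 0.35,
}

def marginal(joint, axis):
    """Sum the joint probabilities over the other coordinate."""
    totals = {}
    for pair, p in joint.items():
        totals[pair[axis]] = totals.get(pair[axis], 0.0) + p
    return totals

def conditional_given(joint, axis):
    """Divide each joint probability by the marginal of coordinate `axis`."""
    given = marginal(joint, axis)
    return {pair: p / given[pair[axis]] for pair, p in joint.items()}

p_colour = marginal(joint, 0)                   # P(G) and P(R)
jar_given_colour = conditional_given(joint, 0)  # P(jar | colour)
colour_given_jar = conditional_given(joint, 1)  # P(colour | jar)
```

Rounded to two decimal places, the entries of `jar_given_colour` agree with the values obtained from Bayes' Theorem in Example 1.7.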
DEFINITION 1.11 Conditional Probability Distribution given $t_j$
For a fixed $t_j$, the conditional probabilities $P(s_i \mid t_j)$ sum to 1 over $i$, so they form a probability distribution on $S$, the conditional probability distribution given $t_j$. We will denote this by $P_{S \mid t_j}$.

DEFINITION 1.12 Conditional Entropy given $t_j$
The conditional entropy given $t_j$ is the entropy of the conditional probability distribution on $S$ given $t_j$. It will be denoted $H(P_{S \mid t_j})$.

$$H(P_{S \mid t_j}) = -\sum_{i=1}^{M} P(s_i \mid t_j) \log(P(s_i \mid t_j)) \tag{1.40}$$

DEFINITION 1.13 Conditional Probability Distribution on $S$ given $T$
The conditional probability distribution on $S$ given $T$ is the weighted average of the conditional probability distributions given $t_j$ for all $j$. It will be denoted $P_{S \mid T}$.

$$P_{S \mid T}(s_i) = \sum_{j=1}^{N} P_T(t_j) P_{S \mid t_j}(s_i) \tag{1.41}$$

DEFINITION 1.14 Conditional Entropy given $T$
The conditional entropy given $T$ is the weighted average of the conditional entropies on $S$ given $t_j$ for all $t_j \in T$. It will be denoted $H(P_{S \mid T})$.

$$H(P_{S \mid T}) = -\sum_{j=1}^{N} P_T(t_j) \sum_{i=1}^{M} P(s_i \mid t_j) \log(P(s_i \mid t_j)) \tag{1.42}$$

Since $P_T(t_j) P(s_i \mid t_j) = P(s_i, t_j)$, we can re-write this as

$$H(P_{S \mid T}) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i \mid t_j)) \tag{1.43}$$

We now prove two simple results about the conditional entropies.

THEOREM 1.4
$H(P) = H(P_T) + H(P_{S \mid T}) = H(P_S) + H(P_{T \mid S})$
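Before turning to the proof, the identity can be checked numerically on a small joint distribution. The sketch below reuses the jar probabilities of Example 1.7, with $S$ the set of jars and $T$ the set of colours, and assumes base-2 logarithms; the helper names and the (jar, colour) ordering of the dictionary keys are illustrative choices.

```python
import math

# Joint distribution on S x T, keyed by (s, t): s is the jar, t the colour.
joint = {
    ("A", "G"): 0.40, ("A", "R"): 0.10,
    ("B", "G"): 0.15, ("B", "R"): 0.35,
}

def entropy(probs):
    """Entropy in bits, with zero-probability terms contributing 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum the joint probabilities over the other coordinate."""
    totals = {}
    for pair, p in joint.items():
        totals[pair[axis]] = totals.get(pair[axis], 0.0) + p
    return totals

p_s = marginal(joint, 0)  # marginal distribution on S
p_t = marginal(joint, 1)  # marginal distribution on T

h_joint = entropy(joint.values())
h_s = entropy(p_s.values())
h_t = entropy(p_t.values())

# Equation (1.43): conditional entropy of S given T, using
# P(s | t) = P(s, t) / P_T(t).
h_s_given_t = sum(
    -p * math.log2(p / p_t[t]) for (s, t), p in joint.items() if p > 0
)
```

Up to floating-point rounding, `h_joint` equals `h_t + h_s_given_t`, and it does not exceed `h_s + h_t`, as Theorem 1.2 requires.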
16 Fundamentals of Information Theory and Coding DesignPROOFÀ´Èµ Å½Æ½È´× Ø µ ÐÓ ´È´× Ø µµ Å½Æ½È´× Ø µ ÐÓ ´ÈÌ´Ø µÈ´× Ø µµ Å½Æ½È´× Ø µ ÐÓ ´ÈÌ´Ø µµ Å½Æ½È´× Ø µ ÐÓ ´È´× Ø µµ Æ½ÈÌ´Ø µ ÐÓ ´ÈÌ´Ø µµ Å½Æ½È´× Ø µ ÐÓ ´È´× Ø µµÀ´ÈÌµ · À´ÈË Ìµ (1.44)The proof of the other equality is similar.THEOREM 1.5À´ÈË Ìµ À´ÈËµ with equality if and only if ÈË and ÈÌ are independent.PROOF From the previous theorem, À´Èµ À´ÈÌµ · À´ÈË ÌµFrom Theorem 1.2, À´Èµ À´ÈËµ · À´ÈÌµ with equality if and only if ÈË andÈÌ are independent.So À´ÈÌµ · À´ÈË Ìµ À´ÈËµ · À´ÈÌµ.Subtracting À´ÈÌµ from both sides we get À´ÈË Ìµ À´ÈËµ, with equality if andonly if ÈË and ÈÌ are independent.This result is obviously symmetric in Ë and Ì; so we also have À´ÈÌ Ëµ À´ÈÌµwith equality if and only if ÈË and ÈÌ are independent. We can sum up this resultby saying the conditioning reduces entropy or conditioning reduces uncertainty.1.11 Information SourcesMost of this book will be concerned with random sequences. Depending on thecontext, such sequences may be called time series, (discrete) stochastic processes or
Entropy and Information 17signals. The ﬁrst term is used by statisticians, the second by mathematicians and thethird by engineers. This may reﬂect differences in the way these people approach thesubject: statisticians are primarily interested in describing such sequences in termsof probability theory, mathematicians are interested in the behaviour of such seriesand the ways in which they may be generated and engineers are interested in ways ofusing such sequences and processing them to extract useful information from them.A device or situation that produces such a sequence is called an information source.The elements of the sequence are usually drawn from a ﬁnite set, which may bereferred to as the alphabet. The source can be considered to be emitting an elementof the alphabet at each instant of a sequence of instants in time. The elements of thealphabet are referred to as symbols.EXAMPLE 1.8Tossing a coin repeatedly and recording the outcomes as heads (H) or tails (T) givesus a random sequence whose alphabet is À Ì .EXAMPLE 1.9Throwing a die repeatedly and recording the number of spots on the uppermost facegives us a random sequence whose alphabet is ½ ¾ ¿ .EXAMPLE 1.10Computers and telecommunications equipment generate sequences of bits which arerandom sequences whose alphabet is ¼ ½ .EXAMPLE 1.11A text in the English language is a random sequence whose alphabet is the set con-sisting of the letters of the alphabet, the digits and the punctuation marks. While wenormally consider text to be meaningful rather than random, it is only possible topredict which letter will come next in the sequence in probabilistic terms, in general.The last example above illustrates the point that a random sequence may not appearto be random at ﬁrst sight. The difference between the earlier examples and the ﬁnalexample is that in the English language there are correlations between each letter inthe sequence and those that precede it. 
In contrast, there are no such correlations in the cases of tossing a coin or throwing a die repeatedly. We will consider both kinds of information sources below.
An obvious question that is raised by the term "information source" is: What is the "information" that the source produces? A second question, perhaps less obvious, is: How can we measure the information produced by an information source?

An information source generates a sequence of symbols which has a certain degree of unpredictability. The more unpredictable the sequence is, the more information is conveyed by each symbol. The information source may impose structure on the sequence of symbols. This structure will increase the predictability of the sequence and reduce the information carried by each symbol.

The random behaviour of the sequence may be described by probability distributions over the alphabet. If the elements of the sequence are uncorrelated, a simple probability distribution over the alphabet may suffice. In other cases, conditional probability distributions may be required.

We have already seen that entropy is a measure of unpredictability. For an information source, the information content of the sequence that it generates is measured by the entropy per symbol. We can compute this if we make assumptions about the kinds of structure that the information source imposes upon its output sequences.

To describe an information source completely, we need to specify both the alphabet and the probability distribution that governs the generation of sequences. The entropy of the information source $S$ with alphabet $A$ and probability distribution $P$ will be denoted by $H(S)$ in the following sections, even though it is actually the entropy of $P$. Later on, we will wish to concentrate on the alphabet and will use $H(A)$ to denote the entropy of the information source, on the assumption that the alphabet will have a probability distribution associated with it.

1.12 Memoryless Information Sources

For a memoryless information source, there are no correlations between the outputs of the source at different times.
For each instant at which an output is emitted, there is a probability distribution over the alphabet that describes the probability of each symbol being emitted at that instant. If all the probability distributions are the same, the source is said to be stationary. If we know these probability distributions, we can calculate the information content of the sequence.

EXAMPLE 1.12

Tossing a fair coin gives us an example of a stationary memoryless information source. At any instant, the probability distribution is given by $P(H) = 0.5$, $P(T) = 0.5$. This probability distribution has an entropy of 1 bit; so the information content is 1 bit/symbol.
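The calculation in Example 1.12 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `entropy_bits` is our own, and the second distribution (probabilities 2/3 and 1/3) is an arbitrary biased choice included only for contrast with the fair coin.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a finite probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: 0 log 0 is taken to be 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]       # the stationary source of Example 1.12
biased = [2 / 3, 1 / 3]      # hypothetical biased two-symbol source

print(entropy_bits(fair_coin))          # 1.0
print(round(entropy_bits(biased), 3))   # about 0.918
```

For a stationary memoryless source this single number is the information content per symbol; the biased source is more predictable and so carries less than 1 bit/symbol.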
EXAMPLE 1.13

As an example of a non-stationary memoryless information source, suppose we have a fair coin and a die with H painted on four faces and T painted on two faces. Tossing the coin and throwing the die in alternation will create a memoryless information source with alphabet $\{H, T\}$. Every time the coin is tossed, the probability distribution of the outcomes is $P(H) = 0.5$, $P(T) = 0.5$, and every time the die is thrown, the probability distribution is $P(H) = 0.667$, $P(T) = 0.333$.

The probability distribution of the outcomes of tossing the coin has an entropy of 1 bit. The probability distribution of the outcomes of throwing the die has an entropy of 0.918 bits. The information content of the sequence is the average entropy per symbol, which is 0.959 bits/symbol.

Memoryless information sources are relatively simple. More realistic information sources have memory, which is the property that the emission of a symbol at any instant depends on one or more of the symbols that were generated before it.

1.13 Markov Sources and n-gram Models

Markov sources and n-gram models are descriptions of a class of information sources with memory.

DEFINITION 1.15 Markov Source A Markov source consists of an alphabet $A$, a set of states $\Sigma$, a set of transitions between states, a set of labels for the transitions and two sets of probabilities. The first set of probabilities is the initial probability distribution on the set of states, which determines the probabilities of sequences starting with each symbol in the alphabet. The second set of probabilities is a set of transition probabilities. For each pair of states, $\sigma_i$ and $\sigma_j$, the probability of a transition from $\sigma_i$ to $\sigma_j$ is $P(j|i)$. (Note that these probabilities are fixed and do not depend on time, so that there is an implicit assumption of stationarity.) The labels on the transitions are symbols from the alphabet.

To generate a sequence, a state is selected on the basis of the initial probability distribution.
A transition from this state to another state (or to the same state) is selected on the basis of the transition probabilities, and the label of this transition is output. This process is repeated to generate the sequence of output symbols.

It is convenient to represent Markov models diagrammatically in the form of a graph, with the states represented by vertices and the transitions by edges, as in the following example.
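The generation procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the function `generate`, the dictionary layout and the two-state source with states `s1` and `s2` are all hypothetical and are not taken from the text.

```python
import random

def generate(initial, transitions, length, rng):
    """Emit `length` symbols from a Markov source.

    initial:     {state: probability}, the initial distribution over states
    transitions: {state: [(next_state, label, probability), ...]}
    rng:         a random.Random instance, so that runs can be repeated
    """
    states = list(initial)
    # Select the starting state from the initial distribution.
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    output = []
    for _ in range(length):
        options = transitions[state]
        # Select a transition and output its label.
        next_state, label, _ = rng.choices(
            options, weights=[p for _, _, p in options])[0]
        output.append(label)
        state = next_state
    return "".join(output)

# Hypothetical two-state source over the alphabet {0, 1}.
INITIAL = {"s1": 0.5, "s2": 0.5}
TRANSITIONS = {
    "s1": [("s1", "0", 0.9), ("s2", "1", 0.1)],
    "s2": [("s1", "0", 0.5), ("s2", "1", 0.5)],
}

print(generate(INITIAL, TRANSITIONS, 20, random.Random(0)))
```

Passing the random number generator explicitly keeps the sketch reproducible; the output is a string of 20 symbols from the alphabet.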
EXAMPLE 1.15

The following probabilities give us a 3-gram model on the language $\{0, 1\}$.

\[
\begin{array}{ll}
P(000) = 0.32 & P(001) = 0.08 \\
P(010) = 0.15 & P(011) = 0.15 \\
P(100) = 0.16 & P(101) = 0.04 \\
P(110) = 0.06 & P(111) = 0.04
\end{array}
\]

To describe the relationship between n-gram models and Markov sources, we need to look at special cases of Markov sources.

DEFINITION 1.16 $m$th-order Markov Source A Markov source whose states are sequences of $m$ symbols from the alphabet is called an $m$th-order Markov source.

When we have an $m$th-order Markov model, the transition probabilities are usually given in terms of the probabilities of single symbols being emitted when the source is in a given state. For example, in a second-order Markov model on $\{0, 1\}$, the transition probability from 01 to 10, which would be represented by $P(10|01)$, would be represented instead by the probability of emission of 0 when in the state 01, that is $P(0|01)$. Obviously, some transitions are impossible. For example, it is not possible to go from the state 01 to the state 00, as the state following 01 must have 1 as its first symbol.

We can construct an $m$th-order Markov model from an $(m+1)$-gram model and an $m$-gram model. The $m$-gram model gives us the probabilities of strings of length $m$, such as $P(s_1 s_2 \ldots s_m)$. To find the emission probability of $s$ from this state, we set

\[
P(s|s_1 s_2 \ldots s_m) = \frac{P(s_1 s_2 \ldots s_m s)}{P(s_1 s_2 \ldots s_m)} \tag{1.45}
\]

where the probability $P(s_1 s_2 \ldots s_m s)$ is given by the $(m+1)$-gram model.

EXAMPLE 1.16

In Example 1.15 we had a 3-gram model on the language $\{0, 1\}$ given by

\[
\begin{array}{ll}
P(000) = 0.32 & P(001) = 0.08 \\
P(010) = 0.15 & P(011) = 0.15 \\
P(100) = 0.16 & P(101) = 0.04 \\
P(110) = 0.06 & P(111) = 0.04
\end{array}
\]
[Figure 1.3: state diagram with states 00, 01, 10 and 11; the edges are labelled P(0|00)=0.8, P(1|00)=0.2, P(0|01)=0.5, P(1|01)=0.5, P(0|10)=0.8, P(1|10)=0.2, P(0|11)=0.6, P(1|11)=0.4.]

FIGURE 1.3
Diagrammatic representation of a Markov source equivalent to a 3-gram model.

If a 2-gram model for the same source is given by $P(00) = 0.4$, $P(01) = 0.3$, $P(10) = 0.2$ and $P(11) = 0.1$, then we can construct a second-order Markov source as follows:

\[
\begin{aligned}
P(0|00) &= P(000)/P(00) = 0.32/0.4 = 0.8 \\
P(1|00) &= P(001)/P(00) = 0.08/0.4 = 0.2 \\
P(0|01) &= P(010)/P(01) = 0.15/0.3 = 0.5 \\
P(1|01) &= P(011)/P(01) = 0.15/0.3 = 0.5 \\
P(0|10) &= P(100)/P(10) = 0.16/0.2 = 0.8 \\
P(1|10) &= P(101)/P(10) = 0.04/0.2 = 0.2 \\
P(0|11) &= P(110)/P(11) = 0.06/0.1 = 0.6 \\
P(1|11) &= P(111)/P(11) = 0.04/0.1 = 0.4
\end{aligned}
\]

Figure 1.3 shows this Markov source.

To describe the behaviour of a Markov source mathematically, we use the transition matrix of probabilities. If the set of states is

\[
\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_N\}
\]
the transition matrix is the $N \times N$ matrix

\[
\Pi = \begin{bmatrix}
P(1|1) & P(1|2) & \cdots & P(1|N) \\
P(2|1) & P(2|2) & \cdots & P(2|N) \\
\vdots & \vdots & \ddots & \vdots \\
P(N|1) & P(N|2) & \cdots & P(N|N)
\end{bmatrix} \tag{1.46}
\]

The probability of the source being in a given state varies over time. Let $w_i^t$ be the probability of the source being in state $\sigma_i$ at time $t$, and set

\[
W^t = \begin{bmatrix} w_1^t \\ w_2^t \\ \vdots \\ w_N^t \end{bmatrix} \tag{1.47}
\]

Then $W^0$ is the initial probability distribution and

\[
W^{t+1} = \Pi W^t \tag{1.48}
\]

and so, by induction,

\[
W^t = \Pi^t W^0 \tag{1.49}
\]

Because they all represent probability distributions, each of the columns of $\Pi$ must add up to 1, and all the $w_i^t$ must add up to 1 for each $t$.

1.14 Stationary Distributions

The vectors $W^t$ describe how the behaviour of the source changes over time. The asymptotic (long-term) behaviour of sources is of interest in some cases.

EXAMPLE 1.17

Consider a source with transition matrix

\[
\Pi = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix}
\]

Suppose the initial probability distribution is

\[
W^0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}
\]
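Then

\[
W^1 = \Pi W^0 = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}
\]

Similarly,

\[
W^2 = \Pi W^1 = \begin{bmatrix} 0.36 \\ 0.64 \end{bmatrix}, \quad
W^3 = \Pi W^2 = \begin{bmatrix} 0.344 \\ 0.656 \end{bmatrix}, \quad
W^4 = \Pi W^3 = \begin{bmatrix} 0.3376 \\ 0.6624 \end{bmatrix}
\]

and so on.

Suppose instead that

\[
W_s^0 = \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix}
\]

Then

\[
W_s^1 = \Pi W_s^0 = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix} \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix} = \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix} = W_s^0
\]

so that $W_s^t = W_s^0$ for all $t > 0$. This distribution will persist for all time.

In the example above, the initial distribution $W_s^0$ has the property that

\[
\Pi W_s^0 = W_s^0 \tag{1.50}
\]

and persists for all time.

DEFINITION 1.17 Stationary Distribution A probability distribution $W$ over the states of a Markov source with transition matrix $\Pi$ that satisfies the equation $\Pi W = W$ is a stationary distribution.

As shown in the example, if $W^0$ is a stationary distribution, it persists for all time, $W^t = W^0$ for all $t$. The defining equation shows that a stationary distribution $W$ must be an eigenvector of $\Pi$ with eigenvalue 1. To find a stationary distribution for $\Pi$, we must solve the equation $\Pi W = W$ together with the condition that $\sum_i w_i = 1$.

EXAMPLE 1.18

Suppose

\[
\Pi = \begin{bmatrix} 0.25 & 0.50 & 0.00 \\ 0.50 & 0.00 & 0.25 \\ 0.25 & 0.50 & 0.75 \end{bmatrix}
\]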
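Then the equation $\Pi W = W$ gives

\[
\begin{aligned}
0.25 w_1 + 0.50 w_2 + 0.00 w_3 &= w_1 \\
0.50 w_1 + 0.00 w_2 + 0.25 w_3 &= w_2 \\
0.25 w_1 + 0.50 w_2 + 0.75 w_3 &= w_3
\end{aligned}
\]

The first equation gives us

\[
0.75 w_1 - 0.50 w_2 = 0
\]

and the other two give

\[
\begin{aligned}
0.50 w_1 - 1.00 w_2 + 0.25 w_3 &= 0 \\
0.25 w_1 + 0.50 w_2 - 0.25 w_3 &= 0
\end{aligned}
\]

from which we get

\[
-2.00 w_2 + 0.75 w_3 = 0
\]

So

\[
w_1 = \frac{2}{3} w_2
\]

and

\[
w_3 = \frac{8}{3} w_2
\]

Substituting these values in

\[
w_1 + w_2 + w_3 = 1
\]

we get

\[
\frac{2}{3} w_2 + w_2 + \frac{8}{3} w_2 = 1
\]

which gives us

\[
w_2 = \frac{3}{13}, \quad w_1 = \frac{2}{13}, \quad w_3 = \frac{8}{13}
\]

So the stationary distribution is

\[
W = \begin{bmatrix} 2/13 \\ 3/13 \\ 8/13 \end{bmatrix}
\]

In the examples above, the source has a unique stationary distribution. This is not always the case.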
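The result of Example 1.18 can be checked numerically by iterating $W^{t+1} = \Pi W^t$ until the vector stops changing. The sketch below is an illustration only: the helper names `step` and `stationary`, the uniform starting vector and the fixed iteration count are our own choices, and the matrix entries are those of Examples 1.17 and 1.18.

```python
def step(matrix, w):
    """One update W^{t+1} = Pi W^t, where matrix[i][j] = P(i|j)."""
    n = len(w)
    return [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]

def stationary(matrix, iterations=200):
    """Approximate a stationary distribution, starting from the uniform one."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w = step(matrix, w)
    return w

PI_2 = [[0.6, 0.2],
        [0.4, 0.8]]            # Example 1.17
PI_3 = [[0.25, 0.50, 0.00],
        [0.50, 0.00, 0.25],
        [0.25, 0.50, 0.75]]    # Example 1.18

print([round(x, 4) for x in stationary(PI_2)])  # close to [1/3, 2/3]
print([round(x, 4) for x in stationary(PI_3)])  # close to [2/13, 3/13, 8/13]
```

Iteration of this kind is only a heuristic. It converges for these two sources, but it need not converge for a source that cycles between states, and when the stationary distribution is not unique the limit depends on the starting vector.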
EXAMPLE 1.19

Consider the source with four states and probability transition matrix

\[
\Pi = \begin{bmatrix}
1.0 & 0.0 & 0.5 & 0.0 \\
0.0 & 0.0 & 0.5 & 0.0 \\
0.0 & 0.5 & 0.0 & 0.0 \\
0.0 & 0.5 & 0.0 & 1.0
\end{bmatrix}
\]

The diagrammatic representation of this source is shown in Figure 1.4.

[Figure 1.4: state diagram with four states; the edges are labelled P(1|1)=1.0, P(3|2)=0.5, P(4|2)=0.5, P(1|3)=0.5, P(2|3)=0.5, P(4|4)=1.0.]

FIGURE 1.4
A source with two stationary distributions.

For this source, any distribution with $w_2 = 0.0$, $w_3 = 0.0$ and $w_1 + w_4 = 1.0$ satisfies the equation $\Pi W = W$. However, inspection of the transition matrix shows that once the source enters either the first state or the fourth state, it cannot leave it. The only stationary distributions that can occur are $w_1 = 1.0$, $w_2 = 0.0$, $w_3 = 0.0$, $w_4 = 0.0$ or $w_1 = 0.0$, $w_2 = 0.0$, $w_3 = 0.0$, $w_4 = 1.0$.

Some Markov sources have the property that every sequence generated by the source has the same statistical properties. That is, the various frequencies of occurrence of
symbols, pairs of symbols, and so on, obtained from any sequence generated by the source will, as the length of the sequence increases, approach some definite limit which is independent of the particular sequence. Sources that have this property are called ergodic sources.

The source of Example 1.19 is not an ergodic source. The sequences generated by that source fall into two classes, one of which is generated by sequences of states that end in the first state, the other of which is generated by sequences that end in the fourth state. The fact that there are two distinct stationary distributions shows that the source is not ergodic.

1.15 The Entropy of Markov Sources

There are various ways of defining the entropy of an information source. The following is a simple approach which applies to a restricted class of Markov sources.

DEFINITION 1.18 Entropy of the $i$th State of a Markov Source The entropy of the $i$th state of a Markov source is the entropy of the probability distribution on the set of transitions from that state.

If we denote the probability distribution on the set of transitions from the $i$th state by $P_i$, then the entropy of the $i$th state is given by

\[
H(P_i) = -\sum_{j=1}^{N} P_i(j)\log(P_i(j)) \tag{1.51}
\]

DEFINITION 1.19 Unifilar Markov Source A unifilar Markov source is one with the property that the labels on the transitions from any given state are all distinct.

We need this property in order to be able to define the entropy of a Markov source. We assume that the source has a stationary distribution.
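Definitions 1.18 and 1.19 can be illustrated with a short sketch. The data layout and the function names `is_unifilar` and `state_entropy` are assumptions made for this illustration; the transition probabilities are those of the second-order source constructed in Example 1.16.

```python
import math

# SOURCE[state] = list of (label, next_state, probability) for each transition.
SOURCE = {
    "00": [("0", "00", 0.8), ("1", "01", 0.2)],
    "01": [("0", "10", 0.5), ("1", "11", 0.5)],
    "10": [("0", "00", 0.8), ("1", "01", 0.2)],
    "11": [("0", "10", 0.6), ("1", "11", 0.4)],
}

def is_unifilar(source):
    """True if the labels on the transitions out of every state are distinct."""
    return all(
        len({label for label, _, _ in out}) == len(out)
        for out in source.values()
    )

def state_entropy(source, state):
    """Entropy, in bits, of the transition distribution out of `state`."""
    # Transitions with probability 0 contribute nothing (0 log 0 = 0).
    return -sum(p * math.log2(p) for _, _, p in source[state] if p > 0)

print(is_unifilar(SOURCE))
for s in SOURCE:
    print(s, round(state_entropy(SOURCE, s), 5))
```

Because each state has at most one outgoing transition per label, the emitted symbol identifies the transition taken, which is what allows the state entropies to be combined into an entropy for the source.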
DEFINITION 1.20 Entropy of a Unifilar Markov Source The entropy of a unifilar Markov source $M$, whose stationary distribution is given by $w_1, w_2, \ldots, w_N$, and whose transition probabilities are $P(j|i)$ for $1 \le i \le N$, $1 \le j \le N$, is

\[
H(M) = \sum_{i=1}^{N} w_i H(P_i) = -\sum_{i=1}^{N}\sum_{j=1}^{N} w_i P(j|i)\log(P(j|i)) \tag{1.52}
\]

It can be shown that this definition is consistent with more general definitions of the entropy of an information source.

EXAMPLE 1.20

For the Markov source of Example 1.14, there are three states, $\sigma_1$, $\sigma_2$ and $\sigma_3$. The probability distribution on the set of transitions from $\sigma_i$ is $P_i$ for $i = 1, 2, 3$.

$P_1$ is given by

\[
P_1(1) = P(1|1) = 0.0, \quad P_1(2) = P(2|1) = 1.0, \quad P_1(3) = P(3|1) = 0.0
\]

Its entropy is

\[
H(P_1) = -(0.0)\log(0.0) - (1.0)(0.0) - (0.0)\log(0.0) = 0.0
\]

using the usual convention that $0.0\log(0.0) = 0.0$.

$P_2$ is given by

\[
P_2(1) = P(1|2) = 0.0, \quad P_2(2) = P(2|2) = 0.0, \quad P_2(3) = P(3|2) = 1.0
\]

Its entropy is

\[
H(P_2) = -(0.0)\log(0.0) - (0.0)\log(0.0) - (1.0)(0.0) = 0.0
\]

$P_3$ is given by

\[
P_3(1) = P(1|3) = 0.6, \quad P_3(2) = P(2|3) = 0.4, \quad P_3(3) = P(3|3) = 0.0
\]

Its entropy is

\[
H(P_3) = -(0.6)(-0.73697) - (0.4)(-1.32193) - (0.0)\log(0.0) = 0.97095
\]

The stationary distribution of the source is given by

\[
w_1 = \frac{3}{13}, \quad w_2 = \frac{5}{13}, \quad w_3 = \frac{5}{13}
\]
The entropy of the source is

\[
H(M) = (0.0)(3/13) + (0.0)(5/13) + (0.97095)(5/13) = 0.37344
\]

EXAMPLE 1.21

For the source of Example 1.16, the states are 00, 01, 10, 11.

$P_{00}$ is given by $P_{00}(0) = P(0|00) = 0.8$ and $P_{00}(1) = P(1|00) = 0.2$. Its entropy is

\[
H(P_{00}) = -(0.8)(-0.32193) - (0.2)(-2.32193) = 0.72193
\]

$P_{01}$ is given by $P_{01}(0) = P(0|01) = 0.5$ and $P_{01}(1) = P(1|01) = 0.5$. Its entropy is

\[
H(P_{01}) = -(0.5)(-1.0) - (0.5)(-1.0) = 1.0
\]

$P_{10}$ is given by $P_{10}(0) = P(0|10) = 0.8$ and $P_{10}(1) = P(1|10) = 0.2$. Its entropy is

\[
H(P_{10}) = -(0.8)(-0.32193) - (0.2)(-2.32193) = 0.72193
\]

$P_{11}$ is given by $P_{11}(0) = P(0|11) = 0.6$ and $P_{11}(1) = P(1|11) = 0.4$. Its entropy is

\[
H(P_{11}) = -(0.6)(-0.73697) - (0.4)(-1.32193) = 0.97095
\]

The stationary distribution of the source, obtained by solving $\Pi W = W$ for these transition probabilities, is given by

\[
w_1 = \frac{24}{41}, \quad w_2 = \frac{6}{41}, \quad w_3 = \frac{6}{41}, \quad w_4 = \frac{5}{41}
\]

The entropy of the source is

\[
H(M) = \frac{24(0.72193) + 6(1.00000) + 6(0.72193) + 5(0.97095)}{41} = 0.79299
\]

1.16 Sequences of Symbols

It is possible to estimate the entropy of a Markov source using information about the probabilities of occurrence of sequences of symbols. The following results apply
to ergodic Markov sources and are stated without proof. In a sense, they justify the use of the conditional probabilities of emission of symbols instead of transition probabilities between states in $m$th-order Markov models.

THEOREM 1.6

Given any $\epsilon > 0$ and any $\delta > 0$, we can find a positive integer $N_0$ such that all sequences of length $N \ge N_0$ fall into two classes: a set of sequences whose total probability is less than $\epsilon$; and the remainder, for which the following inequality holds:

\[
\left| \frac{\log(1/p)}{N} - H \right| < \delta \tag{1.53}
\]

where $p$ is the probability of the sequence and $H$ is the entropy of the source.

PROOF See , Appendix 3.

THEOREM 1.7

Let $M$ be a Markov source with alphabet $A = \{a_1, a_2, \ldots, a_n\}$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^N$, let $P(s)$ be the probability of the sequence $s$ being emitted by the source. Define

\[
G_N = -\frac{1}{N}\sum_{s \in A^N} P(s)\log(P(s)) \tag{1.54}
\]

which is the entropy per symbol of the sequences of $N$ symbols. Then $G_N$ is a monotonic decreasing function of $N$ and

\[
\lim_{N \to \infty} G_N = H \tag{1.55}
\]

PROOF See , Appendix 3.

THEOREM 1.8

Let $M$ be a Markov source with alphabet $A = \{a_1, a_2, \ldots, a_n\}$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^{N-1}$, let $P(s a_j)$ be the probability of the source emitting the sequence $s$ followed by the symbol $a_j$, and let $P(a_j|s)$ be the conditional probability of the symbol $a_j$ being emitted immediately after the sequence $s$. Define

\[
F_N = -\sum_{s \in A^{N-1}}\sum_{j=1}^{n} P(s a_j)\log(P(a_j|s)) \tag{1.56}
\]
which is the conditional entropy of the next symbol when the $(N-1)$ preceding symbols are known. Then $F_N$ is a monotonic decreasing function of $N$ and

\[
\lim_{N \to \infty} F_N = H \tag{1.57}
\]

PROOF See , Appendix 3.

THEOREM 1.9

If $G_N$ and $F_N$ are defined as in the previous theorems, then

\[
F_N = N G_N - (N-1) G_{N-1} \tag{1.58}
\]

\[
G_N = \frac{1}{N}\sum_{I=1}^{N} F_I \tag{1.59}
\]

and

\[
F_N \le G_N \tag{1.60}
\]

PROOF See , Appendix 3.

These results show that a series of approximations to the entropy of a source can be obtained by considering only the statistical behaviour of sequences of symbols of increasing length. The sequence of estimates $F_N$ is a better approximation than the sequence $G_N$. If the dependencies in a source extend over no more than $N$ symbols, so that the conditional probability of the next symbol knowing the preceding $(N-1)$ symbols is the same as the conditional probability of the next symbol knowing the preceding $N' > N$ symbols, then $F_N = H$.

1.17 The Adjoint Source of a Markov Source

It is possible to approximate the behaviour of a Markov source by a memoryless source.

DEFINITION 1.21 Adjoint Source of a Markov Source The adjoint source of a Markov source is the memoryless source with the same alphabet which emits symbols independently of each other with the same probabilities as the Markov source.
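As a sketch of Definition 1.21, the single-symbol probabilities of an adjoint source can be computed by weighting the emission probabilities of each state by the stationary probability of that state. The code below uses the second-order source of Example 1.16; the names `EMIT`, `W`, `is_stationary` and `adjoint` are our own, and the stationary distribution `W` is a value we obtained by solving $\Pi W = W$ for that source, which the code verifies exactly rather than assumes.

```python
from fractions import Fraction as F
import math

STATES = ["00", "01", "10", "11"]
# EMIT[state][symbol] = P(symbol | state), from Example 1.16.
EMIT = {
    "00": {"0": F(8, 10), "1": F(2, 10)},
    "01": {"0": F(5, 10), "1": F(5, 10)},
    "10": {"0": F(8, 10), "1": F(2, 10)},
    "11": {"0": F(6, 10), "1": F(4, 10)},
}
# Candidate stationary distribution (our own calculation, checked below).
W = {"00": F(24, 41), "01": F(6, 41), "10": F(6, 41), "11": F(5, 41)}

def is_stationary(w):
    """Check that one transition step leaves the distribution unchanged."""
    after = {s: F(0) for s in STATES}
    for s in STATES:
        for symbol, p in EMIT[s].items():
            # Second-order source: drop the oldest symbol, append the new one.
            after[s[1:] + symbol] += w[s] * p
    return after == w

def adjoint(w):
    """Single-symbol probabilities of the adjoint (memoryless) source."""
    return {a: sum(w[s] * EMIT[s][a] for s in STATES) for a in ("0", "1")}

def entropy_bits(ps):
    return -sum(float(p) * math.log2(float(p)) for p in ps if p > 0)

h_markov = sum(float(W[s]) * entropy_bits(EMIT[s].values()) for s in STATES)
h_adjoint = entropy_bits(adjoint(W).values())
print(is_stationary(W), adjoint(W), round(h_markov, 5), round(h_adjoint, 5))
```

Exact fractions are used so that the stationarity check is an equality rather than a tolerance test. The adjoint source discards the dependence between symbols, so its entropy comes out larger than that of the Markov source.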
If we have an $m$th-order Markov source $M$ with alphabet $A = \{a_1, a_2, \ldots, a_q\}$, the probabilities of emission of the symbols are

\[
P(a_i) = \sum_{j} P(a_i|a_{j_1} a_{j_2} \ldots a_{j_m}) P(a_{j_1} a_{j_2} \ldots a_{j_m}) \tag{1.61}
\]

where $a_{j_1} a_{j_2} \ldots a_{j_m}$ represents a sequence of $m$ symbols from the alphabet of the source, $P(a_{j_1} a_{j_2} \ldots a_{j_m})$ is the probability of this sequence in the stationary distribution of the Markov source and the summation over $j$ indicates that all such sequences are included in the summation. The adjoint source of this Markov source, denoted $\bar{M}$, is the memoryless source that emits these symbols with the same probabilities.

EXAMPLE 1.22

For the 3-gram model of Example 1.15, we have transition probabilities

\[
\begin{array}{ll}
P(0|00) = 0.8 & P(1|00) = 0.2 \\
P(0|01) = 0.5 & P(1|01) = 0.5 \\
P(0|10) = 0.8 & P(1|10) = 0.2 \\
P(0|11) = 0.6 & P(1|11) = 0.4
\end{array}
\]

which give us the transition matrix

\[
\Pi = \begin{bmatrix}
0.8 & 0.0 & 0.8 & 0.0 \\
0.2 & 0.0 & 0.2 & 0.0 \\
0.0 & 0.5 & 0.0 & 0.6 \\
0.0 & 0.5 & 0.0 & 0.4
\end{bmatrix}
\]

We need to find the stationary distribution of the source. The equation $\Pi W = W$ gives

\[
\begin{aligned}
0.8 w_1 + 0.0 w_2 + 0.8 w_3 + 0.0 w_4 &= w_1 \\
0.2 w_1 + 0.0 w_2 + 0.2 w_3 + 0.0 w_4 &= w_2 \\
0.0 w_1 + 0.5 w_2 + 0.0 w_3 + 0.6 w_4 &= w_3 \\
0.0 w_1 + 0.5 w_2 + 0.0 w_3 + 0.4 w_4 &= w_4
\end{aligned}
\]

Solving these equations together with the constraint

\[
w_1 + w_2 + w_3 + w_4 = 1.0
\]

we get the stationary distribution

\[
P(00) = \frac{24}{41}, \quad P(01) = \frac{6}{41}, \quad P(10) = \frac{6}{41}, \quad P(11) = \frac{5}{41}
\]
The probabilities for the adjoint source of the 3-gram model are

\[
P(0) = P(0|00)P(00) + P(0|01)P(01) + P(0|10)P(10) + P(0|11)P(11) = \frac{30}{41}
\]

and

\[
P(1) = P(1|00)P(00) + P(1|01)P(01) + P(1|10)P(10) + P(1|11)P(11) = \frac{11}{41}
\]

Although the probabilities of emission of single symbols are the same for both the Markov source and its adjoint source, the probabilities of emission of sequences of symbols may not be the same. For example, once the Markov source is in the state 00, the probability that the next symbol is 0, completing the sequence 000, is $P(0|00) = 0.8$, while the adjoint source emits every 0 with probability $P(0) = 30/41$ and so emits 000 with probability $P(0)^3 \approx 0.392$ (by the assumption of independence).

Going from a Markov source to its adjoint reduces the number of constraints on the output sequence and hence increases the entropy. This is formalised by the following theorem.

THEOREM 1.10

If $\bar{M}$ is the adjoint of the Markov source $M$, their entropies are related by

\[
H(\bar{M}) \ge H(M) \tag{1.62}
\]

PROOF If $M$ is an $m$th-order source with alphabet $\{a_1, a_2, \ldots, a_q\}$, we will denote the states, which are $m$-tuples of the $a_i$, by $\alpha_I$, where $1 \le I \le q^m$. We assume that $M$ has a stationary distribution.

The probabilities of emission of the symbols are

\[
P(a_i) = \sum_{I} w_I P(a_i|\alpha_I) \tag{1.63}
\]

where the summation is over all states and $w_I$ is the probability of state $\alpha_I$ in the stationary distribution of the source.

The entropy of the adjoint is

\[
H(\bar{M}) = -\sum_{i=1}^{q} P(a_i)\log(P(a_i))
\]
\[
\begin{aligned}
&= -\sum_{i=1}^{q}\sum_{I} w_I P(a_i|\alpha_I)\log(P(a_i)) \\
&= -\sum_{I} w_I \sum_{i=1}^{q} P(a_i|\alpha_I)\log(P(a_i))
\end{aligned}
\tag{1.64}
\]

The entropy of the $I$th state of $M$ is

\[
H(P_I) = -\sum_{i=1}^{q} P(a_i|\alpha_I)\log(P(a_i|\alpha_I)) \tag{1.65}
\]

and the entropy of $M$ is

\[
\begin{aligned}
H(M) &= -\sum_{I}\sum_{i=1}^{q} w_I P(a_i|\alpha_I)\log(P(a_i|\alpha_I)) \\
     &= -\sum_{I} w_I \sum_{i=1}^{q} P(a_i|\alpha_I)\log(P(a_i|\alpha_I))
\end{aligned}
\tag{1.66}
\]

If we apply the inequality of Lemma 1.1 to each summation over $i$, the result follows.

1.18 Extensions of Sources

In situations where codes of various types are being developed, it is often useful to consider sequences of symbols emitted by a source.

DEFINITION 1.22 Extension of a Stationary Memoryless Source The $n$th extension of a stationary memoryless source $S$ is the stationary memoryless source whose alphabet consists of all sequences of $n$ symbols from the alphabet of $S$, with the emission probabilities of the sequences being the same as the probabilities of occurrence of the sequences in the output of $S$.

The $n$th extension of $S$ will be denoted by $S^n$. Because the emission of successive symbols by $S$ is statistically independent, the emission probabilities in $S^n$ can be computed by multiplying the appropriate emission probabilities in $S$.

EXAMPLE 1.23

Consider the memoryless source $S$ with alphabet $\{0, 1\}$ and emission probabilities $P(0) = 0.2$, $P(1) = 0.8$.
The second extension of $S$ has alphabet $\{00, 01, 10, 11\}$ with emission probabilities

\[
\begin{aligned}
P(00) &= P(0)P(0) = (0.2)(0.2) = 0.04 \\
P(01) &= P(0)P(1) = (0.2)(0.8) = 0.16 \\
P(10) &= P(1)P(0) = (0.8)(0.2) = 0.16 \\
P(11) &= P(1)P(1) = (0.8)(0.8) = 0.64
\end{aligned}
\]

The third extension of $S$ has alphabet $\{000, 001, 010, 011, 100, 101, 110, 111\}$ with emission probabilities

\[
\begin{aligned}
P(000) &= P(0)P(0)P(0) = (0.2)(0.2)(0.2) = 0.008 \\
P(001) &= P(0)P(0)P(1) = (0.2)(0.2)(0.8) = 0.032 \\
P(010) &= P(0)P(1)P(0) = (0.2)(0.8)(0.2) = 0.032 \\
P(011) &= P(0)P(1)P(1) = (0.2)(0.8)(0.8) = 0.128 \\
P(100) &= P(1)P(0)P(0) = (0.8)(0.2)(0.2) = 0.032 \\
P(101) &= P(1)P(0)P(1) = (0.8)(0.2)(0.8) = 0.128 \\
P(110) &= P(1)P(1)P(0) = (0.8)(0.8)(0.2) = 0.128 \\
P(111) &= P(1)P(1)P(1) = (0.8)(0.8)(0.8) = 0.512
\end{aligned}
\]

There is a simple relationship between the entropy of a stationary memoryless source and the entropies of its extensions.

THEOREM 1.11

If $S^n$ is the $n$th extension of the stationary memoryless source $S$, their entropies are related by

\[
H(S^n) = nH(S) \tag{1.67}
\]

PROOF If the alphabet of $S$ is $\{a_1, a_2, \ldots, a_q\}$, and the emission probabilities of the symbols are $P(a_i)$ for $i = 1, 2, \ldots, q$, the entropy of $S$ is

\[
H(S) = -\sum_{i=1}^{q} P(a_i)\log(P(a_i)) \tag{1.68}
\]

The alphabet of $S^n$ consists of all sequences $a_{i_1} a_{i_2} \ldots a_{i_n}$, where $i_j \in \{1, 2, \ldots, q\}$. The emission probability of $a_{i_1} a_{i_2} \ldots a_{i_n}$ is

\[
P(a_{i_1} a_{i_2} \ldots a_{i_n}) = P(a_{i_1})P(a_{i_2}) \cdots P(a_{i_n}) \tag{1.69}
\]
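The entropy of $S^n$ is

\[
\begin{aligned}
H(S^n) &= -\sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n})\log(P(a_{i_1} a_{i_2} \ldots a_{i_n})) \\
       &= -\sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n})\sum_{j=1}^{n}\log(P(a_{i_j}))
\end{aligned}
\tag{1.70}
\]

We can interchange the order of summation to get

\[
H(S^n) = -\sum_{j=1}^{n}\sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n})\log(P(a_{i_j})) \tag{1.71}
\]

Breaking $P(a_{i_1} a_{i_2} \ldots a_{i_n})$ into the product of probabilities, and rearranging, we get

\[
H(S^n) = -\sum_{j=1}^{n}\left(\sum_{i_1=1}^{q} P(a_{i_1})\right) \cdots \left(\sum_{i_j=1}^{q} P(a_{i_j})\log(P(a_{i_j}))\right) \cdots \left(\sum_{i_n=1}^{q} P(a_{i_n})\right) \tag{1.72}
\]

Since

\[
\sum_{i_k=1}^{q} P(a_{i_k}) = 1 \tag{1.73}
\]

for $k \ne j$, we are left with

\[
H(S^n) = -\sum_{j=1}^{n}\sum_{i_j=1}^{q} P(a_{i_j})\log(P(a_{i_j})) = \sum_{j=1}^{n} H(S) = nH(S) \tag{1.74}
\]

We also have extensions of Markov sources.

DEFINITION 1.23 Extension of an $m$th-order Markov Source Let $m$ and $n$ be positive integers, and let $p$ be the smallest integer that is greater than or equal to $m/n$. The $n$th extension of the $m$th-order Markov source $M$ is the $p$th-order Markov source whose alphabet consists of all sequences of $n$ symbols from the alphabet of $M$ and for which the transition probabilities between states are equal to the probabilities of the corresponding $n$-fold transitions of the $m$th-order source. We will use $M^n$ to denote the $n$th extension of $M$.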
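For the simplest case of Definition 1.23, with $m = 1$ and $n = 2$ (so $p = 1$), the transition probabilities of the extension are products of two single-symbol transition probabilities. The sketch below is an illustration under stated assumptions: the function name `second_extension` and the dictionary layout are our own, and the numerical values in `P1` are assumed for the purpose of the example.

```python
from itertools import product

# First-order source on {0, 1}: P1[previous][next] = P(next | previous).
# Illustrative values; each row sums to 1.
P1 = {"0": {"0": 0.3, "1": 0.7}, "1": {"0": 0.4, "1": 0.6}}

def second_extension(p1):
    """Return ext[prev_block][block] = P(block | prev_block) for n = 2, m = 1."""
    symbols = sorted(p1)
    blocks = ["".join(b) for b in product(symbols, repeat=2)]
    ext = {}
    for prev in blocks:
        last = prev[-1]  # a first-order source remembers only the last symbol
        ext[prev] = {
            # Two-fold transition: emit blk[0] after `last`, then blk[1] after blk[0].
            blk: p1[last][blk[0]] * p1[blk[0]][blk[1]]
            for blk in blocks
        }
    return ext

EXT = second_extension(P1)
for prev, row in EXT.items():
    print(prev, {blk: round(p, 2) for blk, p in row.items()})
```

Because the underlying source is first-order, the rows for the previous blocks 00 and 10 coincide, as do the rows for 01 and 11: only the last symbol of the previous block matters.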
Entropy and Information 37EXAMPLE 1.24Let Å be the ﬁrst-order Markov source with alphabet ¼ ½ and transition probabil-itiesÈ´¼ ¼µ ¼ ¿ È´½ ¼µ ¼ È´¼ ½µ ¼ È´½ ½µ ¼The second extension of Å has Ñ ½ and Ò ¾, so Ô ½. It is a ﬁrst-ordersource with alphabet ¼¼ ¼½ ½¼ ½½ . We can calculate the transition probabilitiesas follows.È´¼¼ ¼¼µ È´¼ ¼µÈ´¼ ¼µ ´¼ ¿µ´¼ ¿µ ¼ ¼È´¼½ ¼¼µ È´¼ ¼µÈ´½ ¼µ ´¼ ¿µ´¼ µ ¼ ¾½È´½¼ ¼¼µ È´½ ¼µÈ´¼ ½µ ´¼ µ´¼ µ ¼ ¾È´½½ ¼¼µ È´½ ¼µÈ´½ ½µ ´¼ µ´¼ µ ¼ ¾È´¼¼ ½½µ È´¼ ½µÈ´¼ ¼µ ´¼ µ´¼ ¿µ ¼ ½¾È´¼½ ½½µ È´¼ ½µÈ´½ ¼µ ´¼ µ´¼ µ ¼ ¾È´½¼ ½½µ È´½ ½µÈ´¼ ½µ ´¼ µ´¼ µ ¼ ¾È´½½ ½½µ È´½ ½µÈ´½ ½µ ´¼ µ´¼ µ ¼ ¿EXAMPLE 1.25Consider the second order Markov source with alphabet ¼ ½ and transition proba-bilitiesÈ´¼ ¼¼µ ¼ È´½ ¼¼µ ¼ ¾È´¼ ¼½µ ¼ È´½ ¼½µ ¼È´¼ ½¼µ ¼ ¾ È´½ ½¼µ ¼È´¼ ½½µ ¼ È´½ ½½µ ¼The transition probabilities of the second extension are
Entropy and Information 39È´¼ ¼¼¼¼µ ¼ È´½ ¼¼¼¼µ ¼ ½È´¼ ¼¼¼½µ ¼ È´½ ¼¼¼½µ ¼ ¾È´¼ ¼¼½¼µ ¼ È´½ ¼¼½¼µ ¼ ¿È´¼ ¼¼½½µ ¼ È´½ ¼¼½½µ ¼È´¼ ¼½¼¼µ ¼ È´½ ¼½¼¼µ ¼È´¼ ¼½¼½µ ¼ È´½ ¼½¼½µ ¼È´¼ ¼½½¼µ ¼ ¿ È´½ ¼½½¼µ ¼È´¼ ¼½½½µ ¼ ¾ È´½ ¼½½½µ ¼È´¼ ½¼¼¼µ ¼ ½ È´½ ½¼¼¼µ ¼È´¼ ½¼¼½µ ¼ ¾ È´½ ½¼¼½µ ¼È´¼ ½¼½¼µ ¼ ¿ È´½ ½¼½¼µ ¼È´¼ ½¼½½µ ¼ È´½ ½¼½½µ ¼È´¼ ½½¼¼µ ¼ È´½ ½½¼¼µ ¼È´¼ ½½¼½µ ¼ È´½ ½½¼½µ ¼È´¼ ½½½¼µ ¼ È´½ ½½½¼µ ¼ ¿È´¼ ½½½½µ ¼ È´½ ½½½½µ ¼ ¾We can use these probabilities to compute the transition probabilities of the secondextension, for example,È´¼½ ½¼½¼µ È´¼ ½¼½¼µÈ´½ ¼½¼¼µ ´¼ ¿µ´¼ µ ¼ ½If we denote the states of the second extension by «½ ¼¼, «¾ ¼½, «¿ ½¼ and« ½½, we have the transition probabilities of a second-order Markov source:
40 Fundamentals of Information Theory and Coding DesignÈ´«½ «½«½µ ¼ ½ È´«¾ «½«½µ ¼ ¼È´«¿ «½«½µ ¼ ¼ È´« «½«½µ ¼ ¼¾È´«½ «½«¾µ ¼ È´«¾ «½«¾µ ¼ ¾È´«¿ «½«¾µ ¼ ½¾ È´« «½«¾µ ¼ ¼È´«½ «½«¿µ ¼ ¿ È´«¾ «½«¿µ ¼ ¿È´«¿ «½«¿µ ¼ ½¾ È´« «½«¿µ ¼ ½È´«½ «½« µ ¼ ½ È´«¾ «½« µ ¼ ¾È´«¿ «½« µ ¼ ¼ È´« «½« µ ¼ ¿¾È´«½ «¾«½µ ¼ ¼ È´«¾ «¾«½µ ¼È´«¿ «¾«½µ ¼ ½¼ È´« «¾«½µ ¼ ¼È´«½ «¾«¾µ ¼ ½¾ È´«¾ «¾«¾µ ¼ ¾È´«¿ «¾«¾µ ¼ ¾ È´« «¾«¾µ ¼ ¿È´«½ «¾«¿µ ¼ ½ È´«¾ «¾«¿µ ¼ ½È´«¿ «¾«¿µ ¼ ¾ È´« «¾«¿µ ¼ ¾È´«½ «¾« µ ¼ ½ È´«¾ «¾« µ ¼ ¼È´«¿ «¾« µ ¼ È´« «¾« µ ¼ ½È´«½ «¿«½µ ¼ ¼ È´«¾ «¿«½µ ¼ ¼½È´«¿ «¿«½µ ¼ ¾ È´« «¿«½µ ¼ ½È´«½ «¿«¾µ ¼ ½ È´«¾ «¿«¾µ ¼ ¼È´«¿ «¿«¾µ ¼ È´« «¿«¾µ ¼ ¿¾È´«½ «¿«¿µ ¼ ½ È´«¾ «¿«¿µ ¼ ½È´«¿ «¿«¿µ ¼ ¾ È´« «¿«¿µ ¼ ¾È´«½ «¿« µ ¼ ½¾ È´«¾ «¿« µ ¼ ¾È´«¿ «¿« µ ¼ ½¾ È´« «¿« µ ¼È´«½ « «½µ ¼ ¼ È´«¾ « «½µ ¼È´«¿ « «½µ ¼ ½¼ È´« « «½µ ¼ ¼È´«½ « «¾µ ¼ ½ È´«¾ « «¾µ ¼ ¾È´«¿ « «¾µ ¼ ½ È´« « «¾µ ¼ ¾È´«½ « «¿µ ¼ ¿ È´«¾ « «¿µ ¼ ¿È´«¿ « «¿µ ¼ ½ È´« « «¿µ ¼ ½¾È´«½ « « µ ¼ È´«¾ « « µ ¼ ¾È´«¿ « « µ ¼ ½ È´« « « µ ¼ ¼It is convenient to represent elements of the alphabet of ÅÒ by single symbols aswe have done in the examples above. If the alphabet of Å is ½ ¾ Õ , thenwe will use « for a generic element of the alphabet of ÅÒ, and « ½ ¾ Ò will standfor the sequence ½ ¾ Ò. For further abbreviation, we will let Á stand for½ ¾ Ò and use «Á to denote « ½ ¾ Ò, and so on.The statistics of the extension ÅÒ are given by the conditional probabilities È´«Â «Á½ «Á¾ «ÁÔµ.In terms of the alphabet of Å, we haveÈ´«Â «Á½ «Á¾ «ÁÔµ È´ ½ Ò ½½ ½Ò ÔÒµ (1.75)
This can also be written as

\[
\begin{aligned}
P(\alpha_J|\alpha_{I_1}\alpha_{I_2}\ldots\alpha_{I_p}) = {} & P(a_{j_1}|a_{i_{11}} \ldots a_{i_{1n}} \ldots a_{i_{pn}}) \\
& \times P(a_{j_2}|a_{i_{12}} \ldots a_{i_{pn}} a_{j_1}) \\
& \cdots \\
& \times P(a_{j_n}|a_{i_{1n}} \ldots a_{i_{pn}} a_{j_1} \ldots a_{j_{n-1}})
\end{aligned}
\tag{1.76}
\]

We can use this relationship to prove the following result.

THEOREM 1.12

If $M^n$ is the $n$th extension of the Markov source $M$, their entropies are related by

\[
H(M^n) = nH(M) \tag{1.77}
\]

The proof is similar to the proof of the corresponding result for memoryless sources. (Exercise 17 deals with a special case.)

Note that if $m \le n$, then

\[
H(M^m) = mH(M) \le nH(M) = H(M^n) \tag{1.78}
\]

Since an extension of a Markov source is a $p$th-order Markov source, we can consider its adjoint source. If $M$ is an $m$th-order Markov source, $M^n$ is an $n$th extension of $M$, and $\overline{M^n}$ is the adjoint source of the extension, then we can combine the results of Theorem 1.10 and Theorem 1.12 to get

\[
H(\overline{M^n}) \ge H(M^n) = nH(M) \tag{1.79}
\]

1.19 Infinite Sample Spaces

The concept of entropy carries over to infinite sample spaces, but there are a number of technical issues that have to be considered.

If the sample space is countable, the entropy has to be defined in terms of the limit of a series, as in the following example.

EXAMPLE 1.27

Suppose the sample space is the set of natural numbers, $\mathbb{N} = \{0, 1, 2, \ldots\}$ and the probability distribution is given by

\[
P(n) = 2^{-n-1} \tag{1.80}
\]
for $n \in \mathbb{N}$. The entropy of this distribution is

\[
H(P) = -\sum_{n=0}^{\infty} P(n)\log(P(n)) = -\sum_{n=0}^{\infty} 2^{-n-1}(-n-1) = \sum_{n=0}^{\infty} \frac{n+1}{2^{n+1}} \tag{1.81}
\]

This infinite sum converges to 2, so $H(P) = 2$.

If the sample space is a continuum, in particular, the real line, the summations become integrals. Instead of the probability distribution function, we use the probability density function $f$, which has the property that

\[
P(I) = \int_{I} f(x)\,dx \tag{1.82}
\]

where $I$ is a closed interval and $f$ is the probability density function.

The mean and variance of the probability density function are defined by

\[
\mu = \int_{-\infty}^{\infty} x f(x)\,dx \tag{1.83}
\]

and

\[
\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx \tag{1.84}
\]

The obvious generalisation of the definition of entropy for a probability density function $f$ defined on the real line is

\[
H(f) = -\int_{-\infty}^{\infty} f(x)\log(f(x))\,dx \tag{1.85}
\]

provided this integral exists. This definition was proposed by Shannon, but has been the subject of debate because it is not invariant with respect to change of scale or change of co-ordinates in general. It is sometimes known as the differential entropy.

If we accept this definition of the entropy of a continuous distribution, it is easy to compute the entropy of a Gaussian distribution.

THEOREM 1.13

The entropy of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is $\ln(\sqrt{2\pi e}\,\sigma)$ in natural units.
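Before turning to the proof, the statement of Theorem 1.13 can be sanity-checked numerically. The sketch below approximates the integral (1.85) with a midpoint rule and compares it with the closed form; the function names, the integration range of twelve standard deviations either side of the mean, the step count and the values of `mu` and `sigma` are all arbitrary choices made for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the Gaussian distribution with mean mu and variance sigma^2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def differential_entropy_nats(pdf, lo, hi, steps=20000):
    """Midpoint-rule estimate of -integral of pdf(x) ln(pdf(x)) dx over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        fx = pdf(lo + (k + 0.5) * h)
        if fx > 0.0:  # skip points where the density underflows to zero
            total -= fx * math.log(fx) * h
    return total

mu, sigma = 1.0, 2.0  # arbitrary illustrative values
numeric = differential_entropy_nats(
    lambda x: gaussian_pdf(x, mu, sigma), mu - 12 * sigma, mu + 12 * sigma)
closed_form = math.log(math.sqrt(2 * math.pi * math.e) * sigma)
print(round(numeric, 6), round(closed_form, 6))
```

The two values agree to within the accuracy of the numerical integration, and changing `mu` leaves both unchanged, which reflects the fact that the entropy depends only on the variance.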
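PROOF The density function of the Gaussian distribution is

\[
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/2\sigma^2} \tag{1.86}
\]

Since this is a probability density function, we have

\[
\int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{1.87}
\]

Taking the natural logarithm of $f$, we get

\[
\ln(f(x)) = -\ln(\sqrt{2\pi}\,\sigma) - \frac{(x-\mu)^2}{2\sigma^2} \tag{1.88}
\]

By definition,

\[
\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx \tag{1.89}
\]

this will be used below.

We now calculate the entropy:

\[
\begin{aligned}
H(f) &= -\int_{-\infty}^{\infty} f(x)\ln(f(x))\,dx \\
     &= \int_{-\infty}^{\infty} f(x)\left[\ln(\sqrt{2\pi}\,\sigma) + \frac{(x-\mu)^2}{2\sigma^2}\right]dx \\
     &= \int_{-\infty}^{\infty} f(x)\ln(\sqrt{2\pi}\,\sigma)\,dx + \int_{-\infty}^{\infty} f(x)\frac{(x-\mu)^2}{2\sigma^2}\,dx \\
     &= \ln(\sqrt{2\pi}\,\sigma)\int_{-\infty}^{\infty} f(x)\,dx + \frac{1}{2\sigma^2}\int_{-\infty}^{\infty} f(x)(x-\mu)^2\,dx \\
     &= \ln(\sqrt{2\pi}\,\sigma) + \frac{\sigma^2}{2\sigma^2} \\
     &= \ln(\sqrt{2\pi}\,\sigma) + \frac{1}{2} \\
     &= \ln(\sqrt{2\pi}\,\sigma) + \ln(\sqrt{e}) \\
     &= \ln(\sqrt{2\pi e}\,\sigma)
\end{aligned}
\tag{1.90}
\]

If the probability density function is defined over the whole real line, it is not possible to find a specific probability density function whose entropy is greater than the entropy of all other probability density functions defined on the real line. However, if we restrict consideration to probability density functions with a given variance, it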
44 Fundamentals of Information Theory and Coding Designcan be shown that the Gaussian distribution has the maximum entropy of all thesedistributions. (Exercises 20 and 21 outline the proof of this result.)We have used À´ µ to denote the entropy of the probability distribution whose prob-ability density function is . If is a random variable whose probability densityfunction is , we will denote its entropy by either À´ µ or À´ µ.1.20 Exercises1. Let Ë ×½ ×¾ ×¿ be a sample space with probability distribution È givenby È´×½µ ¼ ¾, È´×¾µ ¼ ¿, È´×¿µ ¼ . Let be a function deﬁned on Ëby ´×½µ , ´×¾µ ¾, ´×¿µ ½. What is the expected value of ?2. Let Ë ×½ ×¾ be a sample space with probability distribution È given byÈ´×½µ ¼ , È´×¾µ ¼ ¿. Let be the function from Ë to Ê¾ given by´×½µ¾ ¼¿ ¼and´×¾µ¼ ¼What is the expected value of ?3. Suppose that a fair die is tossed. What is the expected number of spots on theuppermost face of the die when it comes to rest? Will this number of spotsever be seen when the die is tossed?4. Let Ë ×½ ×¾ ×¿ × be a sample space with probability distribution Ègiven by È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ .There are sixteen possible events that can be formed from the elements of Ë.Compute the probability and surprise of these events.5. Let Ë Ë½ ×½ ×Æ be a sample space for some Æ. Compute the en-tropy of each of the following probability distributions on Ë:(a) Æ ¿, È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ¾ ;(b) Æ , È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ ;(c) Æ , È´×½µ ¼ , È´×¾µ ¼ ½¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ ,È´× µ ¼ ½¾ ;(d) Æ , È´×½µ ¼ ¾ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ¾ , È´× µ ¼ ½¾ ,È´× µ ¼ ½¾ ;
Entropy and Information 45(e) Æ , È´×½µ ¼ ¼ ¾ , È´×¾µ ¼ ½¾ , È´×¿µ ¼ ¾ , È´× µ ¼ ,È´× µ ¼ ¼ ¾ .6. Convert the following entropy values from bits to natural units: (a) À´Èµ¼ ; (b) À´Èµ ½ ¼; (c) À´Èµ ½ ; (d) À´Èµ ¾ ¼; (e) À´Èµ ¾ ; (f)À´Èµ ¿ ¼.7. Convert the following entropy values from natural units to bits: (a) À´Èµ¼ ; (b) À´Èµ ½ ¼; (c) À´Èµ ½ ; (d) À´Èµ ¾ ¼; (e) À´Èµ ¾ ; (f)À´Èµ ¿ ¼.8. Let Ë ×½ ×¾ and Ì Ø½ Ø¾ Ø¿ . Let È be a joint probability distributionfunction on Ë ¢ Ì, given by È´×½ Ø½µ ¼ , È´×½ Ø¾µ ¼ ¾ , È´×½ Ø¿µ¼ ½¾ , È´×¾ Ø½µ ¼ ¼ ¾ , È´×¾ Ø¾µ ¼ ¼¿½¾ , È´×¾ Ø¿µ ¼ ¼¿½¾ . Com-pute the marginal distributions ÈË and ÈÌ and the entropies À´Èµ, À´ÈËµand À´ÈÌ µ. Also compute the conditional probability distributions ÈË Ì andÈÌ Ë and their entropies À´ÈË Ì µ and À´ÈÌ Ëµ.9. Draw a diagram to represent the Markov source with alphabet ¼ ½ and setof states ¦ ½ ¾ ¿ with the following ﬁve transitions:(a) ½ ¾, with label 1 and È´¾ ½µ ¼ ;(b) ½ ¿, with label 0 and È´¿ ½µ ¼ ;(c) ¾ ½, with label 0 and È´½ ¾µ ¼ ;(d) ¾ ¿, with label 1 and È´¿ ¾µ ¼ ¾;(e) ¿ ½, with label 1 and È´½ ¿µ ½ ¼.Write down the transition matrix for this source. Is it possible for this source togenerate an output sequence that includes the subsequence ¼¼¼? Is it possiblefor this source to generate an output sequence that includes the subsequence½½½?10. A 2-gram model on the language is given by the probabilitiesÈ´ µ ¼ ¼¼¼ È´ µ ¼ ¾¼¼ È´ µ ¼ ½¿¿È´ µ ¼ ½¿¿ È´ µ ¼ ¼¼¼ È´ µ ¼ ¾¼¼È´ µ ¼ ¾¼¼ È´ µ ¼ ½¿¿ È´ µ ¼ ¼¼¼The probabilities of the individual symbols areÈ´ µ ½ ¿ È´ µ ½ ¿ È´ µ ½ ¿Construct a (ﬁrst-order) Markov source from these models. Draw a diagramto represent the source and write down its transition matrix.
11. A 4-gram model on the language $\{0, 1\}$ is given by

    $P(1010) = 0.5$   $P(0101) = 0.5$

with all other probabilities $0$. All the 3-gram probabilities are also $0$, except for

    $P(010) = 0.5$   $P(101) = 0.5$

Construct a third-order Markov source from these models. Draw a diagram to represent the source and write down its transition matrix.

12. Consider a Markov source with transition matrix
$$\Pi = \begin{bmatrix} 0.3 & 0.[\ldots] \\ 0.[\ldots] & 0.2 \end{bmatrix}$$
and initial probability distribution
$$W^0 = \begin{bmatrix} 0.[\ldots] \\ 0.[\ldots] \end{bmatrix}.$$
Compute $W^t$ for $t = 1, 2, 3, [\ldots]$.

13. Find the stationary distribution of the Markov source whose transition matrix is
$$\Pi = \begin{bmatrix} 0.1 & 0.[\ldots] \\ 0.[\ldots] & 0.2 \end{bmatrix}.$$

*14. Prove that a Markov source with two states always has a stationary distribution, provided that none of the transition probabilities are $0$.

15. Find the stationary distribution of the Markov source whose transition matrix is
$$\Pi = \begin{bmatrix} 0.[\ldots] & 0.3 & 0.2 \\ 0.[\ldots] & 0.2 & 0.[\ldots] \\ 0.1 & 0.[\ldots] & 0.2 \end{bmatrix}.$$

16. Compute the entropy of the Markov source in Exercise 9 above.

*17. Prove that if $M^n$ is the $n$th extension of the first-order Markov source $M$, their entropies are related by $H(M^n) = nH(M)$.

18. Let $S = \{1, 2, 3, \ldots\}$ be the set of positive integers and let $P$ be the probability distribution given by $P(i) = 2^{-i}$. Let $f$ be the function on $S$ defined by $f(i) = (-1)^{i+1}$. What is the expected value of $f$?

19. Let $u$ be the probability density function of the uniform distribution over the closed and bounded interval $[a, b]$, so that $u(x) = (b - a)^{-1}$ if $a \le x \le b$ and $u(x) = 0$ otherwise. Compute the mean, variance and entropy (in natural units) of $u$.

The next two exercises prove the result that the Gaussian distribution has the maximum entropy of all distributions with a given variance.
20. Use the inequality $\ln(x) \le x - 1$ to show that if $f$ and $g$ are probability density functions defined on the real line then
$$\int_{-\infty}^{\infty} f(x) \ln(f(x))\, dx \ge \int_{-\infty}^{\infty} f(x) \ln(g(x))\, dx$$
provided both these integrals exist.

21. Show that if $f$ is a probability density function with mean $\mu$ and variance $\sigma^2$ and $g$ is the probability density function of a Gaussian distribution with the same mean and variance, then
$$-\int_{-\infty}^{\infty} f(x) \ln(g(x))\, dx = \ln\left(\sqrt{2\pi e}\,\sigma\right).$$
Conclude that of all the probability density functions on the real line with variance $\sigma^2$, the Gaussian has the greatest entropy.

22. Compute the entropy of the probability density function of a uniform distribution on a closed interval whose variance is $\sigma^2$ and confirm that it is less than the entropy of a Gaussian distribution with variance $\sigma^2$. (Use the results of Exercise 19.)

The next four exercises are concerned with the Kullback-Leibler Divergence.

*23. Let $P$ and $Q$ be probability distributions on the sample space $S = \{s_1, s_2, \ldots, s_N\}$ with $P(s_i) = p_i$ and $Q(s_i) = q_i$ for $1 \le i \le N$. The Kullback-Leibler Divergence of $P$ and $Q$ is defined by
$$D(P\|Q) = \sum_{i=1}^{N} p_i \log(p_i / q_i).$$
Compute $D(P\|Q)$ and $D(Q\|P)$, when $N = 4$, $p_i = 0.25$ for $i = 1, 2, 3, 4$, and $q_1 = 0.125$, $q_2 = 0.25$, $q_3 = 0.125$, $q_4 = 0.5$. Is $D(P\|Q) = D(Q\|P)$?

*24. If $P$ and $Q$ are probability distributions on $S$, a finite sample space, show that $D(P\|Q) \ge 0$ and $D(P\|Q) = 0$ if and only if $P = Q$.

*25. If $P$, $Q$, and $R$ are probability distributions on $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) = p_i$, $Q(s_i) = q_i$ and $R(s_i) = r_i$ for $1 \le i \le N$, show that
$$D(P\|R) = D(P\|Q) + D(Q\|R)$$
if and only if
$$\sum_{i=1}^{N} (p_i - q_i) \log(q_i / r_i) = 0.$$
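The discrete Kullback-Leibler Divergence defined in Exercise 23 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `kl_divergence`, the default logarithm base 2, the treatment of zero probabilities, and the example distributions are all illustrative choices, not taken from the text.

```python
import math

def kl_divergence(p, q, base=2.0):
    """Sum of p_i * log(p_i / q_i) over a finite sample space.

    Assumptions (not from the text): terms with p_i = 0 contribute 0,
    and the result is infinite if some q_i = 0 while p_i > 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same sample space")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i, base)
    return total

# Hypothetical distributions on a two-element sample space.
p = [0.5, 0.5]
q = [0.25, 0.75]

d_pq = kl_divergence(p, q)   # divergence of p from q
d_qp = kl_divergence(q, p)   # arguments swapped
d_pp = kl_divergence(p, p)   # a distribution compared with itself
```

Comparing `d_pq` with `d_qp` for a pair of distributions is a quick numerical check of whether the divergence is symmetric in its arguments.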
*26. If the sample space is the real line, it is easier to define the Kullback-Leibler Divergence in terms of probability density functions. If $p$ and $q$ are probability density functions on $\mathbb{R}$, with $p(x) > 0$ and $q(x) > 0$ for all $x \in \mathbb{R}$, then we define
$$D(p\|q) = \int_{-\infty}^{\infty} p(x) \ln(p(x) / q(x))\, dx.$$
Compute $D(p\|q)$ when $p$ is the probability density function of a Gaussian distribution with mean $\mu_1$ and variance $\sigma^2$ and $q$ is the probability density function of a Gaussian distribution with mean $\mu_2$ and variance $\sigma^2$.

The next two exercises deal with the topic of Maximum Entropy Estimation.

*27. We have shown in Section 1.6 that the entropy of the uniform distribution on a sample space with $N$ elements is $\log(N)$ and that this is the maximum value of the entropy for any distribution defined on that sample space. In many situations, we need to find a probability distribution that satisfies certain constraints and has the maximum entropy of all the distributions that satisfy those constraints. One common type of constraint is to require that the mean of the distribution should have a certain value. Such a constrained maximisation can be carried out using Lagrange multipliers. Find the probability distribution on $S = \{1, 2, 3, 4\}$ that has maximum entropy subject to the condition that the mean of the distribution is $2$. To do this you have to find $p_1$, $p_2$, $p_3$ and $p_4$ that maximise the entropy
$$H(P) = \sum_{i=1}^{4} p_i \log(1 / p_i)$$
subject to the two constraints
$$\sum_{i=1}^{4} p_i = 1$$
(so that the $p_i$ form a probability distribution) and
$$\sum_{i=1}^{4} i\, p_i = 2.$$

*28. Find the probability distribution on $S = \{1, 2, 3, [\ldots]\}$ that has maximum entropy subject to the conditions that the mean of the distribution is $2$ and the second moment of the distribution is $[\ldots]2$. In this case the constraints are
$$\sum_{i \in S} p_i = 1, \qquad \sum_{i \in S} i\, p_i = 2$$
and
$$\sum_{i \in S} i^2 p_i = [\ldots]2.$$

The next two exercises deal with the topic of Mutual Information.

*29. If $P$ is a joint probability distribution on $S \times T$, the Mutual Information of $P$, denoted $I_M(P)$, is the Kullback-Leibler Divergence between $P$ and the product of the marginals $Q$, given by
$$Q(s_i, t_j) = P_S(s_i) P_T(t_j).$$
Compute the Mutual Information of $P$ when $S = \{s_1, s_2\}$, $T = \{t_1, t_2, t_3\}$, and $P$ is given by $P(s_1, t_1) = 0.5$, $P(s_1, t_2) = 0.25$, $P(s_1, t_3) = 0.125$, $P(s_2, t_1) = 0.0625$, $P(s_2, t_2) = 0.03125$, $P(s_2, t_3) = 0.03125$.

*30. Show that if $S = \{s_1, s_2, \ldots, s_M\}$ and $T = \{t_1, t_2, \ldots, t_N\}$, an alternative expression for the Mutual Information of $P$, a joint probability distribution on $S \times T$, is given by
$$I_M(P) = \sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log\left(P(s_i | t_j) / P_S(s_i)\right).$$
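The definition of Mutual Information in Exercise 29, as the Kullback-Leibler Divergence between a joint distribution and the product of its marginals, can be sketched in Python as follows. This is a minimal sketch, assuming the joint distribution is stored as a nested list with `joint[i][j]` standing for $P(s_i, t_j)$ and logarithms taken to base 2; the function names and the example tables are hypothetical. The second function evaluates the alternative expression of Exercise 30, so the two can be compared numerically.

```python
import math

def marginals(joint):
    """Row sums give P_S, column sums give P_T."""
    p_s = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    return p_s, p_t

def mutual_information(joint):
    """KL divergence (in bits) between the joint and the product of marginals."""
    p_s, p_t = marginals(joint)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0:  # zero cells contribute nothing
                total += p_ij * math.log2(p_ij / (p_s[i] * p_t[j]))
    return total

def mutual_information_alt(joint):
    """Alternative form: sum of P(s,t) * log(P(s|t) / P_S(s))."""
    p_s, p_t = marginals(joint)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0:
                p_s_given_t = p_ij / p_t[j]  # conditional probability P(s_i | t_j)
                total += p_ij * math.log2(p_s_given_t / p_s[i])
    return total

# Hypothetical joint distributions on a 2 x 2 product space.
independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
correlated = [[0.5, 0.0], [0.0, 0.5]]        # s determines t exactly
mixed = [[0.4, 0.1], [0.2, 0.3]]

i_independent = mutual_information(independent)
i_correlated = mutual_information(correlated)
i_mixed = mutual_information(mixed)
i_mixed_alt = mutual_information_alt(mixed)
```

Agreement between `i_mixed` and `i_mixed_alt` on one table is only a numerical sanity check; it does not replace the proof requested in Exercise 30.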