Discrete Mathematics and Its Applications
Series Editor: Kenneth H. Rosen, Ph.D., AT&T Laboratories, Middletown, New Jersey
Fundamentals of Information Theory and Coding Design
Roberto Togneri
Christopher J.S. deSilva
Discrete Mathematics and Its Applications
Series Editor: Kenneth H. Rosen
Chapman & Hall/CRC
A CRC Press Company
Boca Raton London New York Washington, D.C.
Includes bibliographical references and index.
Information theory. Coding theory. I. deSilva, Christopher J. S. II. Title. III. CRC Press series on discrete mathematics and its applications.
This edition published in the Taylor & Francis e-Library, 2006.
Preface

What is information? How do we quantify or measure the amount of information that is present in a file of data, or a string of text? How do we encode the information so that it can be stored efficiently, or transmitted reliably?

The main concepts and principles of information theory were developed by Claude E. Shannon in the 1940s. Yet only now, and thanks to the emergence of the information age and digital communication, are the ideas of information theory being looked at again in a new light. Because of information theory and the results arising from coding theory we now know how to quantify information, how we can efficiently encode it and how reliably we can transmit it.

This book introduces the main concepts behind how we model information sources and channels, how we code sources for efficient storage and transmission, and the fundamentals of coding theory and applications to state-of-the-art error correcting and error detecting codes.

This textbook has been written for upper level undergraduate students and graduate students in mathematics, engineering and computer science. Most of the material presented in this text was developed over many years at The University of Western Australia in the unit Information Theory and Coding 314, which was a core unit for students majoring in Communications and Electrical and Electronic Engineering, and was a unit offered to students enrolled in the Master of Engineering by Coursework and Dissertation in the Intelligent Information Processing Systems course.

The number of books on the market dealing with information theory and coding has been on the rise over the past five years. However, very few, if any, of these books have been able to cover the fundamentals of the theory without losing the reader in the complex mathematical abstractions. Fewer still are able to provide the important theoretical framework when discussing the algorithms and implementation details of modern coding systems. This book does not abandon the theoretical foundations of information and coding theory, and it presents working algorithms and implementations which can be used to fabricate and design real systems. The main emphasis is on the underlying concepts that govern information theory and the necessary mathematical background that describes modern coding systems. One of the strengths of the book is the many worked examples that appear throughout, which allow the reader to immediately understand the concept being explained, or the algorithm being described. These are backed up by fairly comprehensive exercise sets at the end of each chapter (including exercises identified by an * which are more advanced or challenging).
The material in the book has been selected for completeness and to present a balanced coverage. There is discussion of cascading of information channels and additivity of information which is rarely found in modern texts. Arithmetic coding is fully explained, with worked examples for both encoding and decoding. The connection between coding of extensions and Markov modelling is clearly established (this is usually not apparent in other textbooks). Three complete chapters are devoted to block codes for error detection and correction. A large part of these chapters deals with an exposition of the concepts from abstract algebra that underpin the design of these codes. We decided that this material should form part of the main text (rather than be relegated to an appendix) to emphasise the importance of understanding the mathematics of these and other advanced coding strategies.

Chapter 1 introduces the concepts of entropy and information sources and explains how information sources are modelled. In Chapter 2 this analysis is extended to information channels where the concept of mutual information is introduced and channel capacity is discussed. Chapter 3 covers source coding for efficient storage and transmission with an introduction to the theory and main concepts, a discussion of Shannon's Noiseless Coding Theorem and details of the Huffman and arithmetic coding algorithms. Chapter 4 provides the basic principles behind the various compression algorithms including run-length coding and dictionary coders. Chapter 5 introduces the fundamental principles of channel coding, the importance of the Hamming distance in the analysis and design of codes and a statement of what Shannon's Fundamental Coding Theorem tells us we can do with channel codes. Chapter 6 introduces the algebraic concepts of groups, rings, fields and linear spaces over the binary field and introduces binary block codes. Chapter 7 provides the details of the theory of rings of polynomials and cyclic codes and describes how to analyse and design various linear cyclic codes including Hamming codes, Cyclic Redundancy Codes and Reed-Muller codes. Chapter 8 deals with burst-correcting codes and describes the design of Fire codes, BCH codes and Reed-Solomon codes. Chapter 9 completes the discussion on channel coding by describing the convolutional encoder, decoding of convolutional codes, trellis modulation and Turbo codes.

This book can be used as a textbook for a one semester undergraduate course in information theory and source coding (all of Chapters 1 to 4), a one semester graduate course in coding theory (all of Chapters 5 to 9) or as part of a one semester undergraduate course in communications systems covering information theory and coding (selected material from Chapters 1, 2, 3, 5, 6 and 7).

We would like to thank Sean Davey and Nishith Arora for their help with the LaTeX formatting of the manuscript. We would also like to thank Ken Rosen for his review of our draft manuscript and his many helpful suggestions and Sunil Nair from CRC Press for encouraging us to write this book in the first place!

Our examples on arithmetic coding were greatly facilitated by the use of the conversion calculator (which is one of the few that can handle fractions!) made available by www.math.com.
The manuscript was written in LaTeX and we are indebted to the open source software community for developing such a powerful text processing environment. We are especially grateful to the developers of LyX (www.lyx.org) for making writing the document that much more enjoyable and to the makers of xfig (www.xfig.org) for providing such an easy-to-use drawing package.

Roberto Togneri
Chris deSilva
Chapter 1
Entropy and Information

1.1 Structure

Structure is a concept of which we all have an intuitive understanding. However, it is not easy to articulate that understanding and give a precise definition of what structure is. We might try to explain structure in terms of such things as regularity, predictability, symmetry and permanence. We might also try to describe what structure is not, using terms such as featureless, random, chaotic, transient and aleatory.

Part of the problem of trying to define structure is that there are many different kinds of behaviour and phenomena which might be described as structured, and finding a definition that covers all of them is very difficult.

Consider the distribution of the stars in the night sky. Overall, it would appear that this distribution is random, without any structure. Yet people have found patterns in the stars and imposed a structure on the distribution by naming constellations.

Again, consider what would happen if you took the pixels on the screen of your computer when it was showing a complicated and colourful scene and strung them out in a single row. The distribution of colours in this single row of pixels would appear to be quite arbitrary, yet the complicated pattern of the two-dimensional array of pixels would still be there.

These two examples illustrate the point that we must distinguish between the presence of structure and our perception of structure. In the case of the constellations, the structure is imposed by our brains. In the case of the picture on our computer screen, we can only see the pattern if the pixels are arranged in a certain way.

Structure relates to the way in which things are put together, the way in which the parts make up the whole. Yet there is a difference between the structure of, say, a bridge and that of a piece of music. The parts of the Golden Gate Bridge or the Sydney Harbour Bridge are solid and fixed in relation to one another. Seeing one part of the bridge gives you a good idea of what the rest of it looks like.

The structure of pieces of music is quite different. The notes of a melody can be arranged according to the whim or the genius of the composer. Having heard part of the melody you cannot be sure of what the next note is going to be, let alone any other part of the melody. In fact, pieces of music often have a complicated, multi-layered structure, which is not obvious to the casual listener.

In this book, we are going to be concerned with things that have structure. The kinds of structure we will be concerned with will be like the structure of pieces of music. They will not be fixed and obvious.

1.2 Structure in Randomness

Structure may be present in phenomena that appear to be random. When it is present, it makes the phenomena more predictable. Nevertheless, the fact that randomness is present means that we have to talk about the phenomena in terms of probabilities.

Let us consider a very simple example of how structure can make a random phenomenon more predictable. Suppose we have a fair die. The probability of any face coming up when the die is thrown is 1/6. In this case, it is not possible to predict which face will come up more than one-sixth of the time, on average.

On the other hand, if we have a die that has been biased, this introduces some structure into the situation. Suppose that the biasing has the effect of making the probability of the face with six spots coming up 55/100, the probability of the face with one spot coming up 5/100 and the probability of any other face coming up 1/10. Then the prediction that the face with six spots will come up will be right more than half the time, on average.

Another example of structure in randomness that facilitates prediction arises from phenomena that are correlated. If we have information about one of the phenomena, we can make predictions about the other. For example, we know that the IQ of identical twins is highly correlated. In general, we cannot make any reliable prediction about the IQ of one of a pair of twins.
But if we know the IQ of one twin, we can make a reliable prediction of the IQ of the other.

In order to talk about structure in randomness in quantitative terms, we need to use probability theory.

1.3 First Concepts of Probability Theory

To describe a phenomenon in terms of probability theory, we need to define a set of outcomes, which is called the sample space. For the present, we will restrict consideration to sample spaces which are finite sets.
DEFINITION 1.1 Probability Distribution
A probability distribution on a sample space $S = \{s_1, s_2, \ldots, s_N\}$ is a function $P$ that assigns a probability to each outcome in the sample space. $P$ is a map from $S$ to the unit interval, $P : S \to [0, 1]$, which must satisfy $\sum_{i=1}^{N} P(s_i) = 1$.

DEFINITION 1.2 Events
Events are subsets of the sample space.

We can extend a probability distribution $P$ from $S$ to the set of all subsets of $S$, which we denote by $\mathcal{P}(S)$, by setting $P(E) = \sum_{s \in E} P(s)$ for any $E \in \mathcal{P}(S)$. Note that $P(\emptyset) = 0$.

An event whose probability is $0$ is impossible and an event whose probability is $1$ is certain to occur.

If $E$ and $F$ are events and $E \cap F = \emptyset$ then $P(E \cup F) = P(E) + P(F)$.

DEFINITION 1.3 Expected Value
If $S = \{s_1, s_2, \ldots, s_N\}$ is a sample space with probability distribution $P$, and $f : S \to V$ is a function from the sample space to a vector space $V$, the expected value of $f$ is $\sum_{i=1}^{N} P(s_i) f(s_i)$.

NOTE We will often have equations that involve summation over the elements of a finite set. In the equations above, the set has been $S = \{s_1, s_2, \ldots, s_N\}$ and the summation has been denoted by $\sum_{i=1}^{N}$. In other places in the text we will denote such summations simply by $\sum_{s \in S}$.

1.4 Surprise and Entropy

In everyday life, events can surprise us. Usually, the more unlikely or unexpected an event is, the more surprising it is. We can quantify this idea using a probability distribution.

DEFINITION 1.4 Surprise
If $E$ is an event in a sample space $S$, we define the surprise of $E$ to be $s(E) = -\log(P(E)) = \log(1/P(E))$.

Events for which $P(E) = 1$, which are certain to occur, have zero surprise, as we would expect, and events that are impossible, that is, for which $P(E) = 0$, have infinite surprise.
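The definitions above translate almost directly into code. The following is a minimal sketch, assuming a finite sample space stored as a Python dictionary from outcomes to probabilities and logarithms to base 2; the names `event_probability`, `expected_value` and `surprise` are illustrative choices, not notation from the text.

```python
import math

def event_probability(dist, event):
    """P(E): sum of the outcome probabilities over the outcomes in the event."""
    return sum(dist[s] for s in event)

def expected_value(dist, f):
    """Expected value of f: sum of P(s) * f(s) over the sample space."""
    return sum(p * f(s) for s, p in dist.items())

def surprise(p):
    """Surprise -log2(p) of an event with probability p (infinite if p == 0)."""
    if p == 0:
        return math.inf
    return -math.log2(p)

# A fair die as a probability distribution on {1, ..., 6}.
fair_die = {face: 1 / 6 for face in range(1, 7)}
even = {2, 4, 6}

p_even = event_probability(fair_die, even)      # probability of an even face
mean_face = expected_value(fair_die, lambda s: s)
print(p_even, surprise(p_even), mean_face)
```

A certain event gives `surprise(1.0) == 0` and an impossible one gives `math.inf`, matching the limiting values described in Definition 1.4.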
Defining the surprise as the negative logarithm of the probability not only gives us the appropriate limiting values as the probability tends to $0$ or $1$, it also makes surprise additive. If several independent events occur in succession, the total surprise they generate is the sum of their individual surprises.

DEFINITION 1.5 Entropy
We can restrict the surprise to the sample space and consider it to be a function from the sample space to the real numbers. The expected value of the surprise is the entropy of the probability distribution.

If the sample space is $S = \{s_1, s_2, \ldots, s_N\}$, with probability distribution $P$, the entropy of the probability distribution is given by
\[ H(P) = -\sum_{i=1}^{N} P(s_i) \log(P(s_i)) \tag{1.1} \]

The concept of entropy was introduced into thermodynamics in the nineteenth century. It was considered to be a measure of the extent to which a system was disordered. The tendency of systems to become more disordered over time is described by the Second Law of Thermodynamics, which states that the entropy of a system cannot spontaneously decrease. In the 1940s, Shannon introduced the concept into communications theory and founded the subject of information theory. It was then realised that entropy is a property of any stochastic system and the concept is now used widely in many fields. Today, information theory is still principally concerned with communications systems, but there are widespread applications in statistics, information processing and computing.

Let us consider some examples of probability distributions and see how the entropy is related to predictability. First, let us note the form of the function $s(p) = -p \log(p)$ where $0 \le p \le 1$ and $\log$ denotes the logarithm to base 2. (The actual base does not matter, but we shall be using base 2 throughout the rest of this book, so we may as well start here.) The graph of this function is shown in Figure 1.1.

Note that $-p \log(p)$ approaches $0$ as $p$ tends to $0$ and also as $p$ tends to $1$. This means that outcomes that are almost certain to occur and outcomes that are unlikely to occur both contribute little to the entropy. Outcomes whose probability is close to $0.4$ make a comparatively large contribution to the entropy (the function is largest at $p = 1/e \approx 0.37$).

EXAMPLE 1.1
$S = \{s_1, s_2\}$ with $P(s_1) = 0.5 = P(s_2)$. The entropy is
\[ H(P) = -(0.5)(-1) - (0.5)(-1) = 1 \]
In this case, $s_1$ and $s_2$ are equally likely to occur and the situation is as unpredictable as it can be.
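Equation (1.1) is short enough to compute directly. The sketch below is one possible implementation, assuming the distribution is given as a sequence of probabilities and using the convention that terms with zero probability contribute nothing; the function name `entropy` and its `base` parameter are illustrative.

```python
import math

def entropy(probs, base=2):
    """Entropy -sum p*log(p) of a distribution, skipping zero-probability terms."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# The two-outcome distribution of Example 1.1.
fair_coin = [0.5, 0.5]
print(entropy(fair_coin))

# Six equally likely outcomes: the entropy equals log2(6).
uniform_six = [1 / 6] * 6
print(entropy(uniform_six), math.log2(6))
```

Skipping the zero terms implements the limiting value of $p \log(p)$ as $p$ tends to $0$, so a distribution such as `[1.0, 0.0]` has entropy zero.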
[FIGURE 1.1 The graph of $-p \log(p)$ against $p$, for $0 \le p \le 1$.]

EXAMPLE 1.2
$S = \{s_1, s_2\}$ with $P(s_1) = 0.96875$, and $P(s_2) = 0.03125$. The entropy is
\[ H(P) = -(0.96875)(-0.0458) - (0.03125)(-5) = 0.2006 \]
In this case, the situation is more predictable, with $s_1$ more than thirty times more likely to occur than $s_2$. The entropy is close to zero.

EXAMPLE 1.3
$S = \{s_1, s_2\}$ with $P(s_1) = 1.0$, and $P(s_2) = 0.0$. Using the convention that $0 \log(0) = 0$, the entropy is $0$. The situation is entirely predictable, as $s_1$ always occurs.

EXAMPLE 1.4
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_i) = 1/6$ for $i = 1, 2, \ldots, 6$. The entropy is $2.585$ and the situation is as unpredictable as it can be.

EXAMPLE 1.5
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_1) = 0.995$, $P(s_i) = 0.001$ for $i = 2, 3, \ldots, 6$. The entropy is $0.057$ and the situation is fairly predictable as $s_1$ will occur far more frequently than any other outcome.

EXAMPLE 1.6
$S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, with $P(s_1) = 0.498 = P(s_2)$, $P(s_i) = 0.001$ for $i = 3, 4, 5, 6$. The entropy is $1.042$ and the situation is about as predictable as in Example 1.1 above, with outcomes $s_1$ and $s_2$ equally likely to occur and the others very unlikely to occur.

Roughly speaking, a system whose entropy is $H$ is about as unpredictable as a system with $2^H$ equally likely outcomes.

1.5 Units of Entropy

The units in which entropy is measured depend on the base of the logarithms used to calculate it. If we use logarithms to the base 2, then the unit is the bit. If we use natural logarithms (base $e$), the entropy is measured in natural units, sometimes referred to as nits. Converting between the different units is simple.

PROPOSITION 1.1
If $H_e$ is the entropy of a probability distribution measured using natural logarithms, and $H_r$ is the entropy of the same probability distribution measured using logarithms to the base $r$, then
\[ H_r = \frac{H_e}{\ln(r)} \tag{1.2} \]

PROOF Let the sample space be $S = \{s_1, s_2, \ldots, s_N\}$, with probability distribution $P$. For any positive number $x$,
\[ \ln(x) = \ln(r) \log_r(x) \tag{1.3} \]
It follows that
\[ \begin{aligned} H_r(P) &= -\sum_{i=1}^{N} P(s_i) \log_r(P(s_i)) \\ &= -\sum_{i=1}^{N} P(s_i) \frac{\ln(P(s_i))}{\ln(r)} \\ &= \frac{-\sum_{i=1}^{N} P(s_i) \ln(P(s_i))}{\ln(r)} \\ &= \frac{H_e(P)}{\ln(r)} \end{aligned} \tag{1.4} \]

1.6 The Minimum and Maximum Values of Entropy

If we have a sample space $S$ with $N$ elements, and probability distribution $P$ on $S$, it is convenient to denote the probability of $s_i \in S$ by $p_i$. We can construct a vector in $\mathbb{R}^N$ consisting of the probabilities:
\[ \mathbf{p} = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_N \end{bmatrix} \]
Because the probabilities are non-negative and have to add up to unity, the set of all probability distributions forms a simplex in $\mathbb{R}^N$, namely
\[ K = \left\{ \mathbf{p} \in \mathbb{R}^N : p_i \ge 0, \ \sum_{i=1}^{N} p_i = 1 \right\} \]
We can consider the entropy to be a function defined on this simplex. Since it is a continuous and concave function, its minimum value will occur at the vertices of this simplex, at points where all except one of the probabilities are zero. If $\mathbf{p}_v$ is a vertex, then the entropy there will be
\[ H(\mathbf{p}_v) = -\left[ (N - 1) \, 0 \log(0) + 1 \log(1) \right] \]
The logarithm of zero is not defined, but the limit of $x \log(x)$ as $x$ tends to $0$ exists and is equal to zero. If we take the limiting values, we see that at any vertex, $H(\mathbf{p}_v) = 0$, as $\log(1) = 0$. This is the minimum value of the entropy function.

The entropy function has a maximum value at an interior point of the simplex. To find it we can use Lagrange multipliers.

THEOREM 1.1
If we have a sample space with $N$ elements, the maximum value of the entropy function is $\log(N)$.
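PROOF We want to find the maximum value of
\[ H(\mathbf{p}) = -\sum_{i=1}^{N} p_i \log(p_i) \tag{1.5} \]
subject to the constraint
\[ \sum_{i=1}^{N} p_i = 1 \tag{1.6} \]
(In this proof we take $\log$ to be the natural logarithm; by Proposition 1.1 a change of base only rescales the entropy and does not move the maximum.) We introduce the Lagrange multiplier $\lambda$, and put
\[ F(\mathbf{p}) = H(\mathbf{p}) + \lambda \left( \sum_{i=1}^{N} p_i - 1 \right) \tag{1.7} \]
To find the maximum value we have to solve
\[ \frac{\partial F}{\partial p_i} = 0 \tag{1.8} \]
for $i = 1, 2, \ldots, N$ and
\[ \sum_{i=1}^{N} p_i = 1 \tag{1.9} \]
We have
\[ \frac{\partial F}{\partial p_i} = -\log(p_i) - 1 + \lambda \tag{1.10} \]
so
\[ p_i = e^{\lambda - 1} \tag{1.11} \]
for each $i$. The remaining condition gives
\[ N e^{\lambda - 1} = 1 \tag{1.12} \]
which can be solved for $\lambda$, or can be used directly to give
\[ p_i = \frac{1}{N} \tag{1.13} \]
for all $i$. Using these values for the $p_i$, we get
\[ H(\mathbf{p}) = -N \, \frac{1}{N} \log(1/N) = \log(N) \tag{1.14} \]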
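The result of Theorem 1.1 can be illustrated numerically. The sketch below is not a proof; it assumes base-2 logarithms, draws random points of the simplex by normalising uniform random weights (an illustrative sampling choice, with the hypothetical helper `random_distribution`), and compares their entropies with $\log(N)$.

```python
import math
import random

def entropy(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """A random point of the simplex: normalised non-negative weights."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 6
upper = math.log2(n)            # the bound of Theorem 1.1

largest = max(entropy(random_distribution(n, rng)) for _ in range(1000))
uniform = entropy([1 / n] * n)  # the maximising distribution p_i = 1/N

print(largest <= upper, uniform, upper)
```

No sampled distribution exceeds the bound, and the uniform distribution attains it, as equations (1.13) and (1.14) predict.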
1.7 A Useful Inequality

LEMMA 1.1
If $p_1, p_2, \ldots, p_N$ and $q_1, q_2, \ldots, q_N$ are all non-negative numbers that satisfy the conditions $\sum_{n=1}^{N} p_n = 1$ and $\sum_{n=1}^{N} q_n = 1$, then
\[ -\sum_{i=1}^{N} p_i \log(p_i) \le -\sum_{i=1}^{N} p_i \log(q_i) \tag{1.15} \]
with equality if and only if $p_i = q_i$ for all $i$.

PROOF We prove the result for the natural logarithm; the result for any other base follows immediately from the identity
\[ \ln(x) = \ln(r) \log_r(x) \tag{1.16} \]
It is a standard result about the logarithm function that
\[ \ln x \le x - 1 \tag{1.17} \]
for $x > 0$, with equality if and only if $x = 1$. Substituting $x = q_i / p_i$, we get
\[ \ln(q_i / p_i) \le q_i / p_i - 1 \tag{1.18} \]
with equality if and only if $p_i = q_i$. This holds for all $i = 1, 2, \ldots, N$, so if we multiply by $p_i$ and sum over the $i$, we get
\[ \sum_{i=1}^{N} p_i \ln(q_i / p_i) \le \sum_{i=1}^{N} (q_i - p_i) = \sum_{i=1}^{N} q_i - \sum_{i=1}^{N} p_i = 1 - 1 = 0 \tag{1.19} \]
with equality if and only if $p_i = q_i$ for all $i$. So
\[ \sum_{i=1}^{N} p_i \ln(q_i) - \sum_{i=1}^{N} p_i \ln(p_i) \le 0 \tag{1.20} \]
which is the required result.

The inequality can also be written in the form
\[ \sum_{i=1}^{N} p_i \log(q_i / p_i) \le 0 \tag{1.21} \]
with equality if and only if $p_i = q_i$ for all $i$.

Note that putting $q_i = 1/N$ for all $i$ in this inequality gives us an alternative proof that the maximum value of the entropy function is $\log(N)$.
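Lemma 1.1 and the closing remark can be checked on a concrete pair of distributions. In this sketch $p$ is the biased die of Section 1.2 and $q$ is the uniform distribution on six outcomes; the name `cross_entropy` for the right-hand side of (1.15) is an assumption of the sketch and is not used in the text.

```python
import math

def entropy(p):
    """Left-hand side of (1.15): -sum p_i log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Right-hand side of (1.15): -sum p_i log2(q_i)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue             # 0 * log(q_i) contributes nothing
        if qi == 0:
            return math.inf      # p_i > 0 but q_i = 0: the sum diverges
        total -= pi * math.log2(qi)
    return total

# Biased die of Section 1.2: faces 1..6 with probabilities 5/100, 1/10, ..., 55/100.
p = [0.05, 0.1, 0.1, 0.1, 0.1, 0.55]
q = [1 / 6] * 6

print(entropy(p) <= cross_entropy(p, q))   # the inequality (1.15)
print(cross_entropy(p, q), math.log2(6))   # with q uniform the bound is log2(N)
```

With $q_i = 1/N$ the right-hand side reduces to $\log(N)$ for every $p$, which is the alternative proof of Theorem 1.1 mentioned above.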
1.8 Joint Probability Distribution Functions

There are many situations in which it is useful to consider sample spaces that are the Cartesian product of two or more sets.

DEFINITION 1.6 Cartesian Product
Let $S = \{s_1, s_2, \ldots, s_M\}$ and $T = \{t_1, t_2, \ldots, t_N\}$ be two sets. The Cartesian product of $S$ and $T$ is the set $S \times T = \{(s_i, t_j) : 1 \le i \le M, \ 1 \le j \le N\}$.

The extension to the Cartesian product of more than two sets is immediate.

DEFINITION 1.7 Joint Probability Distribution
A joint probability distribution is a probability distribution on the Cartesian product of a number of sets.

If we have $S$ and $T$ as above, then a joint probability distribution function assigns a probability to each pair $(s_i, t_j)$. We can denote this probability by $p_{ij}$. Since these values form a probability distribution, we have
\[ 0 \le p_{ij} \le 1 \tag{1.22} \]
for $1 \le i \le M$, $1 \le j \le N$, and
\[ \sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij} = 1 \tag{1.23} \]
If $P$ is the joint probability distribution function on $S \times T$, the definition of entropy becomes
\[ H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i, t_j)) = -\sum_{i=1}^{M} \sum_{j=1}^{N} p_{ij} \log(p_{ij}) \tag{1.24} \]
If we want to emphasise the spaces $S$ and $T$, we will denote the entropy of the joint probability distribution on $S \times T$ by $H(P_{S \times T})$ or simply by $H(S, T)$. This is known as the joint entropy of $S$ and $T$.

If there are probability distributions $P_S$ and $P_T$ on $S$ and $T$, respectively, and these are independent, the joint probability distribution on $S \times T$ is given by
\[ p_{ij} = P_S(s_i) P_T(t_j) \tag{1.25} \]
for $1 \le i \le M$, $1 \le j \le N$. If there are correlations between the $s_i$ and $t_j$, then this formula does not apply.

DEFINITION 1.8 Marginal Distribution
If $P$ is a joint probability distribution function on $S \times T$, the marginal distribution on $S$ is $P_S : S \to [0, 1]$ given by
\[ P_S(s_i) = \sum_{j=1}^{N} P(s_i, t_j) \tag{1.26} \]
for $1 \le i \le M$ and the marginal distribution on $T$ is $P_T : T \to [0, 1]$ given by
\[ P_T(t_j) = \sum_{i=1}^{M} P(s_i, t_j) \tag{1.27} \]
for $1 \le j \le N$.

There is a simple relationship between the entropy of the joint probability distribution function and that of the marginal distribution functions.

THEOREM 1.2
If $P$ is a joint probability distribution function on $S \times T$, and $P_S$ and $P_T$ are the marginal distributions on $S$ and $T$, respectively, then
\[ H(P) \le H(P_S) + H(P_T) \tag{1.28} \]
with equality if and only if the marginal distributions are independent.

PROOF
\[ H(P_S) = -\sum_{i=1}^{M} P_S(s_i) \log(P_S(s_i)) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_S(s_i)) \tag{1.29} \]
and similarly
\[ H(P_T) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_T(t_j)) \tag{1.30} \]
So
\[ \begin{aligned} H(P_S) + H(P_T) &= -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \left[ \log(P_S(s_i)) + \log(P_T(t_j)) \right] \\ &= -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_S(s_i) P_T(t_j)) \end{aligned} \tag{1.31} \]
Also,
\[ H(P) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i, t_j)) \tag{1.32} \]
Since
\[ \sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) = 1 \tag{1.33} \]
and
\[ \sum_{i=1}^{M} \sum_{j=1}^{N} P_S(s_i) P_T(t_j) = \sum_{i=1}^{M} P_S(s_i) \sum_{j=1}^{N} P_T(t_j) = 1 \tag{1.34} \]
we can use the inequality of Lemma 1.1 to conclude that
\[ H(P) \le H(P_S) + H(P_T) \tag{1.35} \]
with equality if and only if $P(s_i, t_j) = P_S(s_i) P_T(t_j)$ for all $i$ and $j$, that is, if the two marginal distributions are independent.

1.9 Conditional Probability and Bayes' Theorem

DEFINITION 1.9 Conditional Probability
If $S$ is a sample space with a probability distribution function $P$, and $E$ and $F$ are events in $S$, the conditional probability of $E$ given $F$ is
\[ P(E \mid F) = \frac{P(E \cap F)}{P(F)} \tag{1.36} \]
It is obvious that
\[ P(E \mid F) P(F) = P(E \cap F) = P(F \mid E) P(E) \tag{1.37} \]
Almost as obvious is one form of Bayes' Theorem:

THEOREM 1.3
If $S$ is a sample space with a probability distribution function $P$, and $E$ and $F$ are events in $S$, then
\[ P(E \mid F) = \frac{P(F \mid E) P(E)}{P(F)} \tag{1.38} \]
Bayes' Theorem is important because it enables us to derive probabilities of hypotheses from observations, as in the following example.

EXAMPLE 1.7
We have two jars, A and B. Jar A contains 8 green balls and 2 red balls. Jar B contains 3 green balls and 7 red balls. One jar is selected at random and a ball is drawn from it.

We have probabilities as follows. The set of jars forms one sample space, $S = \{A, B\}$, with
\[ P(\{A\}) = 0.5 = P(\{B\}) \]
as one jar is as likely to be chosen as the other.

The set of colours forms another sample space, $T = \{G, R\}$. The probability of drawing a green ball is
\[ P(\{G\}) = 11/20 = 0.55 \]
as 11 of the 20 balls in the jars are green. Similarly,
\[ P(\{R\}) = 9/20 = 0.45 \]
We have a joint probability distribution over the colours of the balls and the jars with the probability of selecting Jar A and drawing a green ball being given by
\[ P(\{(G, A)\}) = 0.4 \]
Similarly, we have the probability of selecting Jar A and drawing a red ball
\[ P(\{(R, A)\}) = 0.1 \]
the probability of selecting Jar B and drawing a green ball
\[ P(\{(G, B)\}) = 0.15 \]
and the probability of selecting Jar B and drawing a red ball
\[ P(\{(R, B)\}) = 0.35 \]
We have the conditional probabilities: given that Jar A was selected, the probability of drawing a green ball is
\[ P(\{G\} \mid \{A\}) = 0.8 \]
and the probability of drawing a red ball is
\[ P(\{R\} \mid \{A\}) = 0.2 \]
Given that Jar B was selected, the corresponding probabilities are:
\[ P(\{G\} \mid \{B\}) = 0.3 \]
and
\[ P(\{R\} \mid \{B\}) = 0.7 \]
We can now use Bayes' Theorem to work out the probability of having drawn from either jar, given the colour of the ball that was drawn. If a green ball was drawn, the probability that it was drawn from Jar A is
\[ P(\{A\} \mid \{G\}) = \frac{P(\{G\} \mid \{A\}) P(\{A\})}{P(\{G\})} = 0.8 \times 0.5 / 0.55 = 0.73 \]
while the probability that it was drawn from Jar B is
\[ P(\{B\} \mid \{G\}) = \frac{P(\{G\} \mid \{B\}) P(\{B\})}{P(\{G\})} = 0.3 \times 0.5 / 0.55 = 0.27 \]
If a red ball was drawn, the probability that it was drawn from Jar A is
\[ P(\{A\} \mid \{R\}) = \frac{P(\{R\} \mid \{A\}) P(\{A\})}{P(\{R\})} = 0.2 \times 0.5 / 0.45 = 0.22 \]
while the probability that it was drawn from Jar B is
\[ P(\{B\} \mid \{R\}) = \frac{P(\{R\} \mid \{B\}) P(\{B\})}{P(\{R\})} = 0.7 \times 0.5 / 0.45 = 0.78 \]
(In this case, we could have derived these conditional probabilities from the joint probability distribution, but we chose not to do so to illustrate how Bayes' Theorem allows us to go from the conditional probabilities of the colours given the jar selected to the conditional probabilities of the jars selected given the colours drawn.)

1.10 Conditional Probability Distributions and Conditional Entropy

In this section, we have a joint probability distribution $P$ on a Cartesian product $S \times T$, where $S = \{s_1, s_2, \ldots, s_M\}$ and $T = \{t_1, t_2, \ldots, t_N\}$, with marginal distributions $P_S$ and $P_T$.

DEFINITION 1.10 Conditional Probability of $s_i$ given $t_j$
For $s_i \in S$ and $t_j \in T$, the conditional probability of $s_i$ given $t_j$ is
\[ P(s_i \mid t_j) = \frac{P(s_i, t_j)}{P_T(t_j)} = \frac{P(s_i, t_j)}{\sum_{k=1}^{M} P(s_k, t_j)} \tag{1.39} \]
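Definition 1.10 can be applied directly to the jars of Example 1.7, taking $S$ to be the set of jars and $T$ the set of colours. The sketch below stores the joint distribution as a dictionary keyed by (jar, colour) pairs; this representation and the helper name `marginal` are assumptions of the sketch.

```python
# Joint distribution of Example 1.7, keyed by (jar, colour).
joint = {
    ("A", "G"): 0.4,
    ("A", "R"): 0.1,
    ("B", "G"): 0.15,
    ("B", "R"): 0.35,
}

def marginal(joint, axis):
    """Marginal distribution on coordinate `axis` (0 = jar, 1 = colour)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

p_colour = marginal(joint, 1)   # P_T in the notation of this section

# Equation (1.39): P(jar | colour) = P(jar, colour) / P_T(colour).
jar_given_colour = {
    (jar, colour): p / p_colour[colour]
    for (jar, colour), p in joint.items()
}

for (jar, colour), p in sorted(jar_given_colour.items()):
    print(f"P({jar} | {colour}) = {p:.2f}")
```

The values agree with those obtained from Bayes' Theorem in Example 1.7, which is the alternative route mentioned in the parenthetical remark at the end of that example.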
DEFINITION 1.11 Conditional Probability Distribution given $t_j$
For a fixed $t_j$, the conditional probabilities $P(s_i \mid t_j)$ sum to 1 over $i$, so they form a probability distribution on $S$, the conditional probability distribution given $t_j$. We will denote this by $P_{S \mid t_j}$.

DEFINITION 1.12 Conditional Entropy given $t_j$
The conditional entropy given $t_j$ is the entropy of the conditional probability distribution on $S$ given $t_j$. It will be denoted $H(P_{S \mid t_j})$.
\[ H(P_{S \mid t_j}) = -\sum_{i=1}^{M} P(s_i \mid t_j) \log(P(s_i \mid t_j)) \tag{1.40} \]

DEFINITION 1.13 Conditional Probability Distribution on $S$ given $T$
The conditional probability distribution on $S$ given $T$ is the weighted average of the conditional probability distributions given $t_j$ for all $j$. It will be denoted $P_{S \mid T}$.
\[ P_{S \mid T}(s_i) = \sum_{j=1}^{N} P_T(t_j) P_{S \mid t_j}(s_i) \tag{1.41} \]

DEFINITION 1.14 Conditional Entropy given $T$
The conditional entropy given $T$ is the weighted average of the conditional entropies on $S$ given $t_j$ for all $t_j \in T$. It will be denoted $H(P_{S \mid T})$.
\[ H(P_{S \mid T}) = -\sum_{j=1}^{N} P_T(t_j) \sum_{i=1}^{M} P(s_i \mid t_j) \log(P(s_i \mid t_j)) \tag{1.42} \]
Since $P_T(t_j) P(s_i \mid t_j) = P(s_i, t_j)$, we can re-write this as
\[ H(P_{S \mid T}) = -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i \mid t_j)) \tag{1.43} \]
We now prove two simple results about the conditional entropies.

THEOREM 1.4
\[ H(P) = H(P_T) + H(P_{S \mid T}) = H(P_S) + H(P_{T \mid S}) \]
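Theorem 1.4 can be checked numerically on a small joint distribution. The sketch below uses the jar and colour distribution of Example 1.7, base-2 logarithms, and equation (1.43) for the conditional entropy; the function names and the dictionary representation are illustrative assumptions.

```python
import math

# Joint distribution of Example 1.7, keyed by (jar, colour), i.e. (s_i, t_j).
joint = {("A", "G"): 0.4, ("A", "R"): 0.1, ("B", "G"): 0.15, ("B", "R"): 0.35}

def entropy(probs):
    """Entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal distribution on coordinate `axis` of the pairs."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy(joint, given_axis):
    """Equation (1.43): -sum P(s,t) log2 P(other | given)."""
    given = marginal(joint, given_axis)
    total = 0.0
    for pair, p in joint.items():
        if p > 0:
            total -= p * math.log2(p / given[pair[given_axis]])
    return total

h_joint = entropy(joint.values())
h_s = entropy(marginal(joint, 0).values())
h_t = entropy(marginal(joint, 1).values())
h_s_given_t = conditional_entropy(joint, 1)
h_t_given_s = conditional_entropy(joint, 0)

print(h_joint, h_t + h_s_given_t, h_s + h_t_given_s)
```

The three printed values agree up to floating-point rounding, which is what the theorem asserts for this particular distribution.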
PROOF
\[ \begin{aligned} H(P) &= -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i, t_j)) \\ &= -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_T(t_j) P(s_i \mid t_j)) \\ &= -\sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P_T(t_j)) - \sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i \mid t_j)) \\ &= -\sum_{j=1}^{N} P_T(t_j) \log(P_T(t_j)) - \sum_{i=1}^{M} \sum_{j=1}^{N} P(s_i, t_j) \log(P(s_i \mid t_j)) \\ &= H(P_T) + H(P_{S \mid T}) \end{aligned} \tag{1.44} \]
The proof of the other equality is similar.

THEOREM 1.5
$H(P_{S \mid T}) \le H(P_S)$ with equality if and only if $P_S$ and $P_T$ are independent.

PROOF From the previous theorem, $H(P) = H(P_T) + H(P_{S \mid T})$.
From Theorem 1.2, $H(P) \le H(P_S) + H(P_T)$ with equality if and only if $P_S$ and $P_T$ are independent.
So $H(P_T) + H(P_{S \mid T}) \le H(P_S) + H(P_T)$.
Subtracting $H(P_T)$ from both sides we get $H(P_{S \mid T}) \le H(P_S)$, with equality if and only if $P_S$ and $P_T$ are independent.

This result is obviously symmetric in $S$ and $T$; so we also have $H(P_{T \mid S}) \le H(P_T)$ with equality if and only if $P_S$ and $P_T$ are independent. We can sum up this result by saying that conditioning reduces entropy or conditioning reduces uncertainty.

1.11 Information Sources

Most of this book will be concerned with random sequences. Depending on the context, such sequences may be called time series, (discrete) stochastic processes or signals. The first term is used by statisticians, the second by mathematicians and the third by engineers. This may reflect differences in the way these people approach the subject: statisticians are primarily interested in describing such sequences in terms of probability theory, mathematicians are interested in the behaviour of such series and the ways in which they may be generated and engineers are interested in ways of using such sequences and processing them to extract useful information from them.

A device or situation that produces such a sequence is called an information source. The elements of the sequence are usually drawn from a finite set, which may be referred to as the alphabet. The source can be considered to be emitting an element of the alphabet at each instant of a sequence of instants in time. The elements of the alphabet are referred to as symbols.

EXAMPLE 1.8
Tossing a coin repeatedly and recording the outcomes as heads (H) or tails (T) gives us a random sequence whose alphabet is $\{H, T\}$.

EXAMPLE 1.9
Throwing a die repeatedly and recording the number of spots on the uppermost face gives us a random sequence whose alphabet is $\{1, 2, 3, 4, 5, 6\}$.

EXAMPLE 1.10
Computers and telecommunications equipment generate sequences of bits which are random sequences whose alphabet is $\{0, 1\}$.

EXAMPLE 1.11
A text in the English language is a random sequence whose alphabet is the set consisting of the letters of the alphabet, the digits and the punctuation marks. While we normally consider text to be meaningful rather than random, it is only possible to predict which letter will come next in the sequence in probabilistic terms, in general.

The last example above illustrates the point that a random sequence may not appear to be random at first sight. The difference between the earlier examples and the final example is that in the English language there are correlations between each letter in the sequence and those that precede it. In contrast, there are no such correlations in the cases of tossing a coin or throwing a die repeatedly. We will consider both kinds of information sources below.
An obvious question that is raised by the term "information source" is: What is the "information" that the source produces? A second question, perhaps less obvious, is: How can we measure the information produced by an information source?

An information source generates a sequence of symbols which has a certain degree of unpredictability. The more unpredictable the sequence is, the more information is conveyed by each symbol. The information source may impose structure on the sequence of symbols. This structure will increase the predictability of the sequence and reduce the information carried by each symbol.

The random behaviour of the sequence may be described by probability distributions over the alphabet. If the elements of the sequence are uncorrelated, a simple probability distribution over the alphabet may suffice. In other cases, conditional probability distributions may be required.

We have already seen that entropy is a measure of unpredictability. For an information source, the information content of the sequence that it generates is measured by the entropy per symbol. We can compute this if we make assumptions about the kinds of structure that the information source imposes upon its output sequences.

To describe an information source completely, we need to specify both the alphabet and the probability distribution that governs the generation of sequences. The entropy of the information source $S$ with alphabet $A$ and probability distribution $P$ will be denoted by $H(S)$ in the following sections, even though it is actually the entropy of $P$. Later on, we will wish to concentrate on the alphabet and will use $H(A)$ to denote the entropy of the information source, on the assumption that the alphabet will have a probability distribution associated with it.

1.12 Memoryless Information Sources

For a memoryless information source, there are no correlations between the outputs of the source at different times.
For each instant at which an output is emitted, there is a probability distribution over the alphabet that describes the probability of each symbol being emitted at that instant. If all the probability distributions are the same, the source is said to be stationary. If we know these probability distributions, we can calculate the information content of the sequence.

EXAMPLE 1.12
Tossing a fair coin gives us an example of a stationary memoryless information source. At any instant, the probability distribution is given by $P(H) = 0.5$, $P(T) = 0.5$. This probability distribution has an entropy of 1 bit; so the information content is 1 bit/symbol.
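The calculation in Example 1.12 can be sketched in a few lines of Python. This is only an illustration: the function name `entropy` and the dictionary representation of the distribution are assumptions made for this sketch, not notation from the text.

```python
import math

def entropy(probabilities):
    """Entropy in bits of a discrete distribution, taking 0 * log(0) as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Stationary memoryless source: the same distribution applies at every instant,
# so the information content per symbol is the entropy of that distribution.
fair_coin = {"H": 0.5, "T": 0.5}
bits_per_symbol = entropy(fair_coin.values())
print(bits_per_symbol)
```

For a non-stationary memoryless source the same function could be applied to the distribution at each instant and the results averaged.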
EXAMPLE 1.13
As an example of a non-stationary memoryless information source, suppose we have a fair coin and a die with H painted on four faces and T painted on two faces. Tossing the coin and throwing the die in alternation will create a memoryless information source with alphabet $\{H, T\}$. Every time the coin is tossed, the probability distribution of the outcomes is $P(H) = 0.5$, $P(T) = 0.5$, and every time the die is thrown, the probability distribution is $P(H) = 0.667$, $P(T) = 0.333$.

The probability distribution of the outcomes of tossing the coin has an entropy of 1 bit. The probability distribution of the outcomes of throwing the die has an entropy of 0.918 bits. The information content of the sequence is the average entropy per symbol, which is 0.959 bits/symbol.

Memoryless information sources are relatively simple. More realistic information sources have memory, which is the property that the emission of a symbol at any instant depends on one or more of the symbols that were generated before it.

1.13 Markov Sources and n-gram Models

Markov sources and n-gram models are descriptions of a class of information sources with memory.

DEFINITION 1.15 Markov Source A Markov source consists of an alphabet $A$, a set of states $\Sigma$, a set of transitions between states, a set of labels for the transitions and two sets of probabilities. The first set of probabilities is the initial probability distribution on the set of states, which determines the probabilities of sequences starting with each symbol in the alphabet. The second set of probabilities is a set of transition probabilities. For each pair of states, $\sigma_i$ and $\sigma_j$, the probability of a transition from $\sigma_i$ to $\sigma_j$ is $P(j|i)$. (Note that these probabilities are fixed and do not depend on time, so that there is an implicit assumption of stationarity.) The labels on the transitions are symbols from the alphabet.

To generate a sequence, a state is selected on the basis of the initial probability distribution.
A transition from this state to another state (or to the same state) is selected on the basis of the transition probabilities, and the label of this transition is output. This process is repeated to generate the sequence of output symbols.

It is convenient to represent Markov models diagrammatically in the form of a graph, with the states represented by vertices and the transitions by edges, as in the following example.
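The generation procedure described above can be sketched as follows. The two-state source, its state names `A` and `B`, and the function name `generate` are hypothetical and chosen only for this illustration; they do not come from the text.

```python
import random

def generate(initial, transitions, length, rng):
    """Generate `length` output symbols from a Markov source.

    initial:     dict mapping state -> initial probability
    transitions: dict mapping state -> list of (next_state, label, probability)
    rng:         a random.Random instance
    """
    states = list(initial)
    # Select the starting state from the initial probability distribution.
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    output = []
    for _ in range(length):
        options = transitions[state]
        weights = [probability for _, _, probability in options]
        next_state, label, _ = rng.choices(options, weights=weights)[0]
        output.append(label)  # the label of the chosen transition is output
        state = next_state
    return "".join(output)

# Hypothetical two-state source over the alphabet {0, 1}.
initial = {"A": 0.5, "B": 0.5}
transitions = {
    "A": [("A", "0", 0.9), ("B", "1", 0.1)],
    "B": [("A", "0", 0.5), ("B", "1", 0.5)],
}
sequence = generate(initial, transitions, 20, random.Random(0))
print(sequence)
```

Passing the random number generator explicitly keeps the sketch reproducible; a different seed gives a different sequence from the same source.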
EXAMPLE 1.15
The following probabilities give us a 3-gram model on the language $\{0, 1\}$.

$$\begin{array}{ll}
P(000) = 0.32 & P(001) = 0.08 \\
P(010) = 0.15 & P(011) = 0.15 \\
P(100) = 0.16 & P(101) = 0.04 \\
P(110) = 0.06 & P(111) = 0.04
\end{array}$$

To describe the relationship between n-gram models and Markov sources, we need to look at special cases of Markov sources.

DEFINITION 1.16 $m$th-order Markov Source A Markov source whose states are sequences of $m$ symbols from the alphabet is called an $m$th-order Markov source.

When we have an $m$th-order Markov model, the transition probabilities are usually given in terms of the probabilities of single symbols being emitted when the source is in a given state. For example, in a second-order Markov model on $\{0, 1\}$, the transition probability from $01$ to $10$, which would be represented by $P(10|01)$, would be represented instead by the probability of emission of $0$ when in the state $01$, that is $P(0|01)$. Obviously, some transitions are impossible. For example, it is not possible to go from the state $01$ to the state $00$, as the state following $01$ must have $1$ as its first symbol.

We can construct an $m$th-order Markov model from an $(m+1)$-gram model and an $m$-gram model. The $m$-gram model gives us the probabilities of strings of length $m$, such as $P(s_1 s_2 \ldots s_m)$. To find the emission probability of $s$ from this state, we set

$$P(s \mid s_1 s_2 \ldots s_m) = \frac{P(s_1 s_2 \ldots s_m s)}{P(s_1 s_2 \ldots s_m)} \tag{1.45}$$

where the probability $P(s_1 s_2 \ldots s_m s)$ is given by the $(m+1)$-gram model.

EXAMPLE 1.16
In the previous example 1.15 we had a 3-gram model on the language $\{0, 1\}$ given by

$$\begin{array}{ll}
P(000) = 0.32 & P(001) = 0.08 \\
P(010) = 0.15 & P(011) = 0.15 \\
P(100) = 0.16 & P(101) = 0.04 \\
P(110) = 0.06 & P(111) = 0.04
\end{array}$$
If a 2-gram model for the same source is given by $P(00) = 0.4$, $P(01) = 0.3$, $P(10) = 0.2$ and $P(11) = 0.1$, then we can construct a second-order Markov source as follows:

$$\begin{aligned}
P(0|00) &= P(000)/P(00) = 0.32/0.4 = 0.8 \\
P(1|00) &= P(001)/P(00) = 0.08/0.4 = 0.2 \\
P(0|01) &= P(010)/P(01) = 0.15/0.3 = 0.5 \\
P(1|01) &= P(011)/P(01) = 0.15/0.3 = 0.5 \\
P(0|10) &= P(100)/P(10) = 0.16/0.2 = 0.8 \\
P(1|10) &= P(101)/P(10) = 0.04/0.2 = 0.2 \\
P(0|11) &= P(110)/P(11) = 0.06/0.1 = 0.6 \\
P(1|11) &= P(111)/P(11) = 0.04/0.1 = 0.4
\end{aligned}$$

Figure 1.3 shows this Markov source.

FIGURE 1.3
Diagrammatic representation of a Markov source equivalent to a 3-gram model. (The diagram has states 00, 01, 10 and 11, with transitions labelled P(0|00)=0.8, P(1|00)=0.2, P(0|01)=0.5, P(1|01)=0.5, P(0|10)=0.8, P(1|10)=0.2, P(0|11)=0.6 and P(1|11)=0.4.)

To describe the behaviour of a Markov source mathematically, we use the transition matrix of probabilities. If the set of states is

$$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_N\}$$
the transition matrix is the $N \times N$ matrix

$$\Pi = \begin{bmatrix}
P(1|1) & P(1|2) & \cdots & P(1|N) \\
P(2|1) & P(2|2) & \cdots & P(2|N) \\
\vdots & \vdots & \ddots & \vdots \\
P(N|1) & P(N|2) & \cdots & P(N|N)
\end{bmatrix} \tag{1.46}$$

The probability of the source being in a given state varies over time. Let $w_i^t$ be the probability of the source being in state $\sigma_i$ at time $t$, and set

$$W^t = \begin{bmatrix} w_1^t \\ w_2^t \\ \vdots \\ w_N^t \end{bmatrix} \tag{1.47}$$

Then $W^0$ is the initial probability distribution and

$$W^{t+1} = \Pi W^t \tag{1.48}$$

and so, by induction,

$$W^t = \Pi^t W^0 \tag{1.49}$$

Because they all represent probability distributions, each of the columns of $\Pi$ must add up to $1$, and all the $w_i^t$ must add up to $1$ for each $t$.

1.14 Stationary Distributions

The vectors $W^t$ describe how the behaviour of the source changes over time. The asymptotic (long-term) behaviour of sources is of interest in some cases.

EXAMPLE 1.17
Consider a source with transition matrix

$$\Pi = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix}$$

Suppose the initial probability distribution is

$$W^0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$$
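Then

$$W^1 = \Pi W^0 = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}$$

Similarly,

$$W^2 = \Pi W^1 = \begin{bmatrix} 0.36 \\ 0.64 \end{bmatrix}, \qquad
W^3 = \Pi W^2 = \begin{bmatrix} 0.344 \\ 0.656 \end{bmatrix}, \qquad
W^4 = \Pi W^3 = \begin{bmatrix} 0.3376 \\ 0.6624 \end{bmatrix}$$

and so on.

Suppose instead that

$$W_s^0 = \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix}$$

Then

$$W_s^1 = \Pi W_s^0 = \begin{bmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{bmatrix} \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix} = \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix} = W_s^0$$

so that

$$W_s^t = W_s^0$$

for all $t \geq 0$. This distribution will persist for all time.

In the example above, the initial distribution $W_s^0$ has the property that

$$\Pi W_s^0 = W_s^0 \tag{1.50}$$

and persists for all time.

DEFINITION 1.17 Stationary Distribution A probability distribution $W$ over the states of a Markov source with transition matrix $\Pi$ that satisfies the equation $\Pi W = W$ is a stationary distribution.

As shown in the example, if $W^0$ is a stationary distribution, it persists for all time, $W^t = W^0$ for all $t$. The defining equation shows that a stationary distribution $W$ must be an eigenvector of $\Pi$ with eigenvalue $1$. To find a stationary distribution for $\Pi$, we must solve the equation $\Pi W = W$ together with the condition that $\sum_i w_i = 1$.

EXAMPLE 1.18
Suppose

$$\Pi = \begin{bmatrix} 0.25 & 0.50 & 0.00 \\ 0.50 & 0.00 & 0.25 \\ 0.25 & 0.50 & 0.75 \end{bmatrix}$$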
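Then the equation $\Pi W = W$ gives

$$\begin{aligned}
0.25 w_1 + 0.50 w_2 + 0.00 w_3 &= w_1 \\
0.50 w_1 + 0.00 w_2 + 0.25 w_3 &= w_2 \\
0.25 w_1 + 0.50 w_2 + 0.75 w_3 &= w_3
\end{aligned}$$

The first equation gives us

$$0.75 w_1 - 0.50 w_2 = 0$$

and the other two give

$$\begin{aligned}
0.50 w_1 - 1.00 w_2 + 0.25 w_3 &= 0 \\
0.25 w_1 + 0.50 w_2 - 0.25 w_3 &= 0
\end{aligned}$$

from which we get

$$-2.00 w_2 + 0.75 w_3 = 0$$

So

$$w_1 = \tfrac{2}{3} w_2$$

and

$$w_3 = \tfrac{8}{3} w_2$$

Substituting these values in

$$w_1 + w_2 + w_3 = 1$$

we get

$$\tfrac{2}{3} w_2 + w_2 + \tfrac{8}{3} w_2 = 1$$

which gives us

$$w_2 = \frac{3}{13}, \qquad w_1 = \frac{2}{13}, \qquad w_3 = \frac{8}{13}$$

So the stationary distribution is

$$W = \begin{bmatrix} 2/13 \\ 3/13 \\ 8/13 \end{bmatrix}$$

In the examples above, the source has a unique stationary distribution. This is not always the case.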
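As a numerical cross-check of Example 1.18, the stationary distribution can also be approached by applying $\Pi$ repeatedly to a starting distribution, as in Example 1.17. The sketch below is only an illustration: the function names are assumptions of this sketch, and the iteration is not guaranteed to settle for every source (when the stationary distribution is not unique, the limit depends on the starting distribution).

```python
def apply(matrix, w):
    """Return matrix * w for a column-stochastic matrix (each column sums to 1)."""
    n = len(w)
    return [sum(matrix[j][i] * w[i] for i in range(n)) for j in range(n)]

def stationary_by_iteration(matrix, steps=200):
    """Iterate W <- Pi W, starting from the uniform distribution."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(steps):
        w = apply(matrix, w)
    return w

# Transition matrix of Example 1.18; entry [j][i] is P(j|i).
PI = [
    [0.25, 0.50, 0.00],
    [0.50, 0.00, 0.25],
    [0.25, 0.50, 0.75],
]
w = stationary_by_iteration(PI)
print(w)  # compare with (2/13, 3/13, 8/13)
```

Solving the linear equations, as in the example, gives exact fractions; the iteration only gives a floating-point approximation, but it is a quick way to check the algebra.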
EXAMPLE 1.19
Consider the source with four states and probability transition matrix

$$\Pi = \begin{bmatrix}
1.0 & 0.0 & 0.5 & 0.0 \\
0.0 & 0.0 & 0.5 & 0.0 \\
0.0 & 0.5 & 0.0 & 0.0 \\
0.0 & 0.5 & 0.0 & 1.0
\end{bmatrix}$$

The diagrammatic representation of this source is shown in Figure 1.4.

FIGURE 1.4
A source with two stationary distributions. (The diagram has four states, with transitions labelled $P(1|1) = 1.0$, $P(3|2) = 0.5$, $P(4|2) = 0.5$, $P(1|3) = 0.5$, $P(2|3) = 0.5$ and $P(4|4) = 1.0$.)

For this source, any distribution with $w_2 = 0.0$, $w_3 = 0.0$ and $w_1 + w_4 = 1.0$ satisfies the equation $\Pi W = W$. However, inspection of the transition matrix shows that once the source enters either the first state or the fourth state, it cannot leave it. The only stationary distributions that can occur are $w_1 = 1.0$, $w_2 = 0.0$, $w_3 = 0.0$, $w_4 = 0.0$ or $w_1 = 0.0$, $w_2 = 0.0$, $w_3 = 0.0$, $w_4 = 1.0$.

Some Markov sources have the property that every sequence generated by the source has the same statistical properties. That is, the various frequencies of occurrence of symbols, pairs of symbols, and so on, obtained from any sequence generated by the source will, as the length of the sequence increases, approach some definite limit which is independent of the particular sequence. Sources that have this property are called ergodic sources.

The source of Example 1.19 is not an ergodic source. The sequences generated by that source fall into two classes, one of which is generated by sequences of states that end in the first state, the other of which is generated by sequences that end in the fourth state. The fact that there are two distinct stationary distributions shows that the source is not ergodic.

1.15 The Entropy of Markov Sources

There are various ways of defining the entropy of an information source. The following is a simple approach which applies to a restricted class of Markov sources.

DEFINITION 1.18 Entropy of the $i$th State of a Markov Source The entropy of the $i$th state of a Markov source is the entropy of the probability distribution on the set of transitions from that state.

If we denote the probability distribution on the set of transitions from the $i$th state by $P_i$, then the entropy of the $i$th state is given by

$$H(P_i) = -\sum_{j=1}^{N} P_i(j) \log(P_i(j)) \tag{1.51}$$

DEFINITION 1.19 Unifilar Markov Source A unifilar Markov source is one with the property that the labels on the transitions from any given state are all distinct.

We need this property in order to be able to define the entropy of a Markov source. We assume that the source has a stationary distribution.
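Definitions 1.18 and 1.19 can be illustrated with a short sketch. The data layout (a dictionary from each state to its outgoing transitions) and the function names are assumptions made for this sketch; the transition probabilities are those shown in Figure 1.3.

```python
import math

def state_entropy(outgoing):
    """Entropy in bits of the distribution over transitions leaving one state."""
    return -sum(p * math.log2(p) for _, _, p in outgoing if p > 0)

def is_unifilar(transitions):
    """True if, for every state, the labels on its outgoing transitions are distinct."""
    for outgoing in transitions.values():
        labels = [label for _, label, p in outgoing if p > 0]
        if len(labels) != len(set(labels)):
            return False
    return True

# Second-order source of Figure 1.3: state -> [(next_state, label, probability)].
source = {
    "00": [("00", "0", 0.8), ("01", "1", 0.2)],
    "01": [("10", "0", 0.5), ("11", "1", 0.5)],
    "10": [("00", "0", 0.8), ("01", "1", 0.2)],
    "11": [("10", "0", 0.6), ("11", "1", 0.4)],
}
print(is_unifilar(source))
for state, outgoing in source.items():
    print(state, round(state_entropy(outgoing), 5))
```

Any $m$th-order source described by emission probabilities is unifilar in this sense, because each emitted symbol leads to a different next state.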
DEFINITION 1.20 Entropy of a Unifilar Markov Source The entropy of a unifilar Markov source $M$, whose stationary distribution is given by $w_1, w_2, \ldots, w_N$, and whose transition probabilities are $P(j|i)$ for $1 \leq i \leq N$, $1 \leq j \leq N$, is

$$H(M) = \sum_{i=1}^{N} w_i H(P_i) = -\sum_{i=1}^{N} \sum_{j=1}^{N} w_i P(j|i) \log(P(j|i)) \tag{1.52}$$

It can be shown that this definition is consistent with more general definitions of the entropy of an information source.

EXAMPLE 1.20
For the Markov source of Example 1.14, there are three states, $\sigma_1$, $\sigma_2$ and $\sigma_3$. The probability distribution on the set of transitions from $\sigma_i$ is $P_i$ for $i = 1, 2, 3$.

$P_1$ is given by

$$P_1(1) = P(1|1) = 0.0, \qquad P_1(2) = P(2|1) = 1.0, \qquad P_1(3) = P(3|1) = 0.0$$

Its entropy is

$$H(P_1) = -(0.0)\log(0.0) - (1.0)(0.0) - (0.0)\log(0.0) = 0.0$$

using the usual convention that $0.0 \log(0.0) = 0.0$.

$P_2$ is given by

$$P_2(1) = P(1|2) = 0.0, \qquad P_2(2) = P(2|2) = 0.0, \qquad P_2(3) = P(3|2) = 1.0$$

Its entropy is

$$H(P_2) = -(0.0)\log(0.0) - (0.0)\log(0.0) - (1.0)(0.0) = 0.0$$

$P_3$ is given by

$$P_3(1) = P(1|3) = 0.6, \qquad P_3(2) = P(2|3) = 0.4, \qquad P_3(3) = P(3|3) = 0.0$$

Its entropy is

$$H(P_3) = -(0.6)(-0.73697) - (0.4)(-1.32193) - (0.0)\log(0.0) = 0.97095$$

The stationary distribution of the source is given by

$$w_1 = \frac{3}{13}, \qquad w_2 = \frac{5}{13}, \qquad w_3 = \frac{5}{13}$$
The entropy of the source is

$$H(M) = (0.0)(3/13) + (0.0)(5/13) + (0.97095)(5/13) = 0.37344$$

EXAMPLE 1.21
For the source of Example 1.16, the states are $00$, $01$, $10$, $11$.

$P_{00}$ is given by $P_{00}(0) = P(0|00) = 0.8$, and $P_{00}(1) = P(1|00) = 0.2$. Its entropy is

$$H(P_{00}) = -(0.8)(-0.32193) - (0.2)(-2.32193) = 0.72193$$

$P_{01}$ is given by $P_{01}(0) = P(0|01) = 0.5$, and $P_{01}(1) = P(1|01) = 0.5$. Its entropy is

$$H(P_{01}) = -(0.5)(-1.0) - (0.5)(-1.0) = 1.0$$

$P_{10}$ is given by $P_{10}(0) = P(0|10) = 0.8$, and $P_{10}(1) = P(1|10) = 0.2$. Its entropy is

$$H(P_{10}) = -(0.8)(-0.32193) - (0.2)(-2.32193) = 0.72193$$

$P_{11}$ is given by $P_{11}(0) = P(0|11) = 0.6$, and $P_{11}(1) = P(1|11) = 0.4$. Its entropy is

$$H(P_{11}) = -(0.6)(-0.73697) - (0.4)(-1.32193) = 0.97095$$

The stationary distribution of the source, obtained by solving $\Pi W = W$ for these transition probabilities, is given by

$$w_1 = \frac{24}{41}, \qquad w_2 = \frac{6}{41}, \qquad w_3 = \frac{6}{41}, \qquad w_4 = \frac{5}{41}$$

The entropy of the source is

$$H(M) = \frac{24(0.72193) + 6(1.00000) + 6(0.72193) + 5(0.97095)}{41} = 0.79299$$

1.16 Sequences of Symbols

It is possible to estimate the entropy of a Markov source using information about the probabilities of occurrence of sequences of symbols. The following results apply
to ergodic Markov sources and are stated without proof. In a sense, they justify the use of the conditional probabilities of emission of symbols instead of transition probabilities between states in $m$th-order Markov models.

THEOREM 1.6
Given any $\epsilon > 0$ and any $\delta > 0$, we can find a positive integer $N_0$ such that all sequences of length $N \geq N_0$ fall into two classes: a set of sequences whose total probability is less than $\epsilon$; and the remainder, for which the following inequality holds:

$$\left| \frac{\log(1/p)}{N} - H \right| < \delta \tag{1.53}$$

where $p$ is the probability of the sequence and $H$ is the entropy of the source.

PROOF See , Appendix 3.

THEOREM 1.7
Let $M$ be a Markov source with alphabet $A = \{a_1, a_2, \ldots, a_n\}$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^N$, let $P(s)$ be the probability of the sequence $s$ being emitted by the source. Define

$$G_N = -\frac{1}{N} \sum_{s \in A^N} P(s) \log(P(s)) \tag{1.54}$$

which is the entropy per symbol of the sequences of $N$ symbols. Then $G_N$ is a monotonic decreasing function of $N$ and

$$\lim_{N \to \infty} G_N = H \tag{1.55}$$

PROOF See , Appendix 3.

THEOREM 1.8
Let $M$ be a Markov source with alphabet $A = \{a_1, a_2, \ldots, a_n\}$, and entropy $H$. Let $A^N$ denote the set of all sequences of symbols from $A$ of length $N$. For $s \in A^{N-1}$, let $P(s a_j)$ be the probability of the source emitting the sequence $s$ followed by the symbol $a_j$, and let $P(a_j | s)$ be the conditional probability of the symbol $a_j$ being emitted immediately after the sequence $s$. Define

$$F_N = -\sum_{s \in A^{N-1}} \sum_{j=1}^{n} P(s a_j) \log(P(a_j | s)) \tag{1.56}$$
which is the conditional entropy of the next symbol when the $(N-1)$ preceding symbols are known. Then $F_N$ is a monotonic decreasing function of $N$ and

$$\lim_{N \to \infty} F_N = H \tag{1.57}$$

PROOF See , Appendix 3.

THEOREM 1.9
If $G_N$ and $F_N$ are defined as in the previous theorems, then

$$F_N = N G_N - (N-1) G_{N-1} \tag{1.58}$$

$$G_N = \frac{1}{N} \sum_{I=1}^{N} F_I \tag{1.59}$$

and

$$F_N \leq G_N \tag{1.60}$$

PROOF See , Appendix 3.

These results show that a series of approximations to the entropy of a source can be obtained by considering only the statistical behaviour of sequences of symbols of increasing length. The sequence of estimates $F_N$ is a better approximation than the sequence $G_N$. If the dependencies in a source extend over no more than $N$ symbols, so that the conditional probability of the next symbol knowing the preceding $(N-1)$ symbols is the same as the conditional probability of the next symbol knowing the preceding $N' > N$ symbols, then $F_N = H$.

1.17 The Adjoint Source of a Markov Source

It is possible to approximate the behaviour of a Markov source by a memoryless source.

DEFINITION 1.21 Adjoint Source of a Markov Source The adjoint source of a Markov source is the memoryless source with the same alphabet which emits symbols independently of each other with the same probabilities as the Markov source.
If we have an $m$th-order Markov source $M$ with alphabet $\{a_1, a_2, \ldots, a_q\}$, the probabilities of emission of the symbols are

$$P(a_i) = \sum_{j_1, \ldots, j_m} P(a_i \mid a_{j_1} a_{j_2} \ldots a_{j_m}) P(a_{j_1} a_{j_2} \ldots a_{j_m}) \tag{1.61}$$

where $a_{j_1} a_{j_2} \ldots a_{j_m}$ represents a sequence of $m$ symbols from the alphabet of the source, $P(a_{j_1} a_{j_2} \ldots a_{j_m})$ is the probability of this sequence in the stationary distribution of the Markov source and the summation over $j_1, \ldots, j_m$ indicates that all such sequences are included in the summation. The adjoint source of this Markov source, denoted $\bar{M}$, is the memoryless source that emits these symbols with the same probabilities.

EXAMPLE 1.22
For the 3-gram model of Example 1.15, we have transition probabilities

$$\begin{array}{ll}
P(0|00) = 0.8 & P(1|00) = 0.2 \\
P(0|01) = 0.5 & P(1|01) = 0.5 \\
P(0|10) = 0.8 & P(1|10) = 0.2 \\
P(0|11) = 0.6 & P(1|11) = 0.4
\end{array}$$

which give us the transition matrix

$$\Pi = \begin{bmatrix}
0.8 & 0.0 & 0.8 & 0.0 \\
0.2 & 0.0 & 0.2 & 0.0 \\
0.0 & 0.5 & 0.0 & 0.6 \\
0.0 & 0.5 & 0.0 & 0.4
\end{bmatrix}$$

We need to find the stationary distribution of the source. The equation $\Pi W = W$ gives

$$\begin{aligned}
0.8 w_1 + 0.0 w_2 + 0.8 w_3 + 0.0 w_4 &= w_1 \\
0.2 w_1 + 0.0 w_2 + 0.2 w_3 + 0.0 w_4 &= w_2 \\
0.0 w_1 + 0.5 w_2 + 0.0 w_3 + 0.6 w_4 &= w_3 \\
0.0 w_1 + 0.5 w_2 + 0.0 w_3 + 0.4 w_4 &= w_4
\end{aligned}$$

Solving these equations together with the constraint

$$w_1 + w_2 + w_3 + w_4 = 1.0$$

we get the stationary distribution

$$P(00) = \frac{24}{41}, \qquad P(01) = \frac{6}{41}, \qquad P(10) = \frac{6}{41}, \qquad P(11) = \frac{5}{41}$$
The probabilities for the adjoint source of the 3-gram model are

$$P(0) = P(0|00)P(00) + P(0|01)P(01) + P(0|10)P(10) + P(0|11)P(11) = \frac{30}{41}$$

and

$$P(1) = P(1|00)P(00) + P(1|01)P(01) + P(1|10)P(10) + P(1|11)P(11) = \frac{11}{41}$$

Although the probabilities of emission of single symbols are the same for both the Markov source and its adjoint source, the probabilities of emission of sequences of symbols may not be the same. For example the probability of emission of the sequence $000$ by the Markov source is $P(0|00) = 0.8$, while for the adjoint source it is $P(0)^3 \approx 0.3918$ (by the assumption of independence).

Going from a Markov source to its adjoint reduces the number of constraints on the output sequence and hence increases the entropy. This is formalised by the following theorem.

THEOREM 1.10
If $\bar{M}$ is the adjoint of the Markov source $M$, their entropies are related by

$$H(\bar{M}) \geq H(M) \tag{1.62}$$

PROOF If $M$ is an $m$th-order source with alphabet $\{a_1, a_2, \ldots, a_q\}$, we will denote the states, which are $m$-tuples of the $a_i$, by $\alpha_I$, where $1 \leq I \leq q^m$. We assume that $M$ has a stationary distribution.

The probabilities of emission of the symbols are

$$P(a_i) = \sum_{I} w_I P(a_i | \alpha_I) \tag{1.63}$$

where the summation is over all states and $w_I$ is the probability of state $\alpha_I$ in the stationary distribution of the source.

The entropy of the adjoint is

$$\begin{aligned}
H(\bar{M}) &= -\sum_{i=1}^{q} P(a_i) \log(P(a_i)) \\
&= -\sum_{i=1}^{q} \sum_{I} w_I P(a_i | \alpha_I) \log(P(a_i)) \\
&= -\sum_{I} w_I \sum_{i=1}^{q} P(a_i | \alpha_I) \log(P(a_i))
\end{aligned} \tag{1.64}$$

The entropy of the $I$th state of $M$ is

$$H(P_I) = -\sum_{i=1}^{q} P(a_i | \alpha_I) \log(P(a_i | \alpha_I)) \tag{1.65}$$

and the entropy of $M$ is

$$\begin{aligned}
H(M) &= -\sum_{I} \sum_{i=1}^{q} w_I P(a_i | \alpha_I) \log(P(a_i | \alpha_I)) \\
&= -\sum_{I} w_I \sum_{i=1}^{q} P(a_i | \alpha_I) \log(P(a_i | \alpha_I))
\end{aligned} \tag{1.66}$$

If we apply the inequality of Lemma 1.1 to each summation over $i$, the result follows.

1.18 Extensions of Sources

In situations where codes of various types are being developed, it is often useful to consider sequences of symbols emitted by a source.

DEFINITION 1.22 Extension of a Stationary Memoryless Source The $n$th extension of a stationary memoryless source $S$ is the stationary memoryless source whose alphabet consists of all sequences of $n$ symbols from the alphabet of $S$, with the emission probabilities of the sequences being the same as the probabilities of occurrence of the sequences in the output of $S$.

The $n$th extension of $S$ will be denoted by $S^n$. Because the emission of successive symbols by $S$ is statistically independent, the emission probabilities in $S^n$ can be computed by multiplying the appropriate emission probabilities in $S$.

EXAMPLE 1.23
Consider the memoryless source $S$ with alphabet $\{0, 1\}$ and emission probabilities $P(0) = 0.2$, $P(1) = 0.8$.
The second extension of $S$ has alphabet $\{00, 01, 10, 11\}$ with emission probabilities

$$\begin{aligned}
P(00) &= P(0)P(0) = (0.2)(0.2) = 0.04 \\
P(01) &= P(0)P(1) = (0.2)(0.8) = 0.16 \\
P(10) &= P(1)P(0) = (0.8)(0.2) = 0.16 \\
P(11) &= P(1)P(1) = (0.8)(0.8) = 0.64
\end{aligned}$$

The third extension of $S$ has alphabet $\{000, 001, 010, 011, 100, 101, 110, 111\}$ with emission probabilities

$$\begin{aligned}
P(000) &= P(0)P(0)P(0) = (0.2)(0.2)(0.2) = 0.008 \\
P(001) &= P(0)P(0)P(1) = (0.2)(0.2)(0.8) = 0.032 \\
P(010) &= P(0)P(1)P(0) = (0.2)(0.8)(0.2) = 0.032 \\
P(011) &= P(0)P(1)P(1) = (0.2)(0.8)(0.8) = 0.128 \\
P(100) &= P(1)P(0)P(0) = (0.8)(0.2)(0.2) = 0.032 \\
P(101) &= P(1)P(0)P(1) = (0.8)(0.2)(0.8) = 0.128 \\
P(110) &= P(1)P(1)P(0) = (0.8)(0.8)(0.2) = 0.128 \\
P(111) &= P(1)P(1)P(1) = (0.8)(0.8)(0.8) = 0.512
\end{aligned}$$

There is a simple relationship between the entropy of a stationary memoryless source and the entropies of its extensions.

THEOREM 1.11
If $S^n$ is the $n$th extension of the stationary memoryless source $S$, their entropies are related by

$$H(S^n) = nH(S) \tag{1.67}$$

PROOF If the alphabet of $S$ is $\{a_1, a_2, \ldots, a_q\}$, and the emission probabilities of the symbols are $P(a_i)$ for $i = 1, 2, \ldots, q$, the entropy of $S$ is

$$H(S) = -\sum_{i=1}^{q} P(a_i) \log(P(a_i)) \tag{1.68}$$

The alphabet of $S^n$ consists of all sequences $a_{i_1} a_{i_2} \ldots a_{i_n}$, where $i_j \in \{1, 2, \ldots, q\}$. The emission probability of $a_{i_1} a_{i_2} \ldots a_{i_n}$ is

$$P(a_{i_1} a_{i_2} \ldots a_{i_n}) = P(a_{i_1}) P(a_{i_2}) \cdots P(a_{i_n}) \tag{1.69}$$
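The entropy of $S^n$ is

$$\begin{aligned}
H(S^n) &= -\sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n}) \log(P(a_{i_1} a_{i_2} \ldots a_{i_n})) \\
&= -\sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n}) \sum_{k=1}^{n} \log(P(a_{i_k}))
\end{aligned} \tag{1.70}$$

We can interchange the order of summation to get

$$H(S^n) = -\sum_{k=1}^{n} \sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} P(a_{i_1} a_{i_2} \ldots a_{i_n}) \log(P(a_{i_k})) \tag{1.71}$$

Breaking $P(a_{i_1} a_{i_2} \ldots a_{i_n})$ into the product of probabilities, and rearranging, we get

$$H(S^n) = -\sum_{k=1}^{n} \left[ \sum_{i_1=1}^{q} P(a_{i_1}) \right] \cdots \left[ \sum_{i_k=1}^{q} P(a_{i_k}) \log(P(a_{i_k})) \right] \cdots \left[ \sum_{i_n=1}^{q} P(a_{i_n}) \right] \tag{1.72}$$

Since

$$\sum_{i_j=1}^{q} P(a_{i_j}) = 1 \tag{1.73}$$

for $j \neq k$, we are left with

$$H(S^n) = -\sum_{k=1}^{n} \sum_{i_k=1}^{q} P(a_{i_k}) \log(P(a_{i_k})) = \sum_{k=1}^{n} H(S) = nH(S) \tag{1.74}$$

We also have extensions of Markov sources.

DEFINITION 1.23 Extension of an $m$th-order Markov Source Let $m$ and $n$ be positive integers, and let $p$ be the smallest integer that is greater than or equal to $m/n$. The $n$th extension of the $m$th-order Markov source $M$ is the $p$th-order Markov source whose alphabet consists of all sequences of $n$ symbols from the alphabet of $M$ and for which the transition probabilities between states are equal to the probabilities of the corresponding $n$-fold transitions of the $m$th-order source.

We will use $M^n$ to denote the $n$th extension of $M$.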
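For the simplest case of Definition 1.23, a first-order source ($m = 1$), the $n$-fold transition probabilities can be sketched as below. The function name, the nested-dictionary layout and the numerical values are assumptions chosen for this sketch. Because the source is first-order, the probability of a block depends only on the last symbol of the preceding block.

```python
from itertools import product

def extension_transitions(p, n):
    """n-fold transition probabilities of a first-order Markov source.

    p[a][b] is the probability of emitting b when the previous symbol was a.
    Returns ext[a][block]: the probability of emitting the n-symbol block
    when the previous symbol was a.
    """
    symbols = sorted(p)
    ext = {}
    for a in symbols:
        ext[a] = {}
        for block in product(symbols, repeat=n):
            probability, previous = 1.0, a
            for b in block:
                probability *= p[previous][b]  # one single-symbol transition
                previous = b
            ext[a]["".join(block)] = probability
    return ext

# Illustrative first-order source on {0, 1} (values chosen for this sketch).
p = {"0": {"0": 0.3, "1": 0.7}, "1": {"0": 0.4, "1": 0.6}}
second = extension_transitions(p, 2)
print(second["0"])
```

Each row of the result sums to 1, as it must for a set of transition probabilities out of one state.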
Entropy and Information 37EXAMPLE 1.24Let Å be the ﬁrst-order Markov source with alphabet ¼ ½ and transition probabil-itiesÈ´¼ ¼µ ¼ ¿ È´½ ¼µ ¼ È´¼ ½µ ¼ È´½ ½µ ¼The second extension of Å has Ñ ½ and Ò ¾, so Ô ½. It is a ﬁrst-ordersource with alphabet ¼¼ ¼½ ½¼ ½½ . We can calculate the transition probabilitiesas follows.È´¼¼ ¼¼µ È´¼ ¼µÈ´¼ ¼µ ´¼ ¿µ´¼ ¿µ ¼ ¼È´¼½ ¼¼µ È´¼ ¼µÈ´½ ¼µ ´¼ ¿µ´¼ µ ¼ ¾½È´½¼ ¼¼µ È´½ ¼µÈ´¼ ½µ ´¼ µ´¼ µ ¼ ¾È´½½ ¼¼µ È´½ ¼µÈ´½ ½µ ´¼ µ´¼ µ ¼ ¾È´¼¼ ½½µ È´¼ ½µÈ´¼ ¼µ ´¼ µ´¼ ¿µ ¼ ½¾È´¼½ ½½µ È´¼ ½µÈ´½ ¼µ ´¼ µ´¼ µ ¼ ¾È´½¼ ½½µ È´½ ½µÈ´¼ ½µ ´¼ µ´¼ µ ¼ ¾È´½½ ½½µ È´½ ½µÈ´½ ½µ ´¼ µ´¼ µ ¼ ¿EXAMPLE 1.25Consider the second order Markov source with alphabet ¼ ½ and transition proba-bilitiesÈ´¼ ¼¼µ ¼ È´½ ¼¼µ ¼ ¾È´¼ ¼½µ ¼ È´½ ¼½µ ¼È´¼ ½¼µ ¼ ¾ È´½ ½¼µ ¼È´¼ ½½µ ¼ È´½ ½½µ ¼The transition probabilities of the second extension are
Entropy and Information 39È´¼ ¼¼¼¼µ ¼ È´½ ¼¼¼¼µ ¼ ½È´¼ ¼¼¼½µ ¼ È´½ ¼¼¼½µ ¼ ¾È´¼ ¼¼½¼µ ¼ È´½ ¼¼½¼µ ¼ ¿È´¼ ¼¼½½µ ¼ È´½ ¼¼½½µ ¼È´¼ ¼½¼¼µ ¼ È´½ ¼½¼¼µ ¼È´¼ ¼½¼½µ ¼ È´½ ¼½¼½µ ¼È´¼ ¼½½¼µ ¼ ¿ È´½ ¼½½¼µ ¼È´¼ ¼½½½µ ¼ ¾ È´½ ¼½½½µ ¼È´¼ ½¼¼¼µ ¼ ½ È´½ ½¼¼¼µ ¼È´¼ ½¼¼½µ ¼ ¾ È´½ ½¼¼½µ ¼È´¼ ½¼½¼µ ¼ ¿ È´½ ½¼½¼µ ¼È´¼ ½¼½½µ ¼ È´½ ½¼½½µ ¼È´¼ ½½¼¼µ ¼ È´½ ½½¼¼µ ¼È´¼ ½½¼½µ ¼ È´½ ½½¼½µ ¼È´¼ ½½½¼µ ¼ È´½ ½½½¼µ ¼ ¿È´¼ ½½½½µ ¼ È´½ ½½½½µ ¼ ¾We can use these probabilities to compute the transition probabilities of the secondextension, for example,È´¼½ ½¼½¼µ È´¼ ½¼½¼µÈ´½ ¼½¼¼µ ´¼ ¿µ´¼ µ ¼ ½If we denote the states of the second extension by «½ ¼¼, «¾ ¼½, «¿ ½¼ and« ½½, we have the transition probabilities of a second-order Markov source:
40 Fundamentals of Information Theory and Coding DesignÈ´«½ «½«½µ ¼ ½ È´«¾ «½«½µ ¼ ¼È´«¿ «½«½µ ¼ ¼ È´« «½«½µ ¼ ¼¾È´«½ «½«¾µ ¼ È´«¾ «½«¾µ ¼ ¾È´«¿ «½«¾µ ¼ ½¾ È´« «½«¾µ ¼ ¼È´«½ «½«¿µ ¼ ¿ È´«¾ «½«¿µ ¼ ¿È´«¿ «½«¿µ ¼ ½¾ È´« «½«¿µ ¼ ½È´«½ «½« µ ¼ ½ È´«¾ «½« µ ¼ ¾È´«¿ «½« µ ¼ ¼ È´« «½« µ ¼ ¿¾È´«½ «¾«½µ ¼ ¼ È´«¾ «¾«½µ ¼È´«¿ «¾«½µ ¼ ½¼ È´« «¾«½µ ¼ ¼È´«½ «¾«¾µ ¼ ½¾ È´«¾ «¾«¾µ ¼ ¾È´«¿ «¾«¾µ ¼ ¾ È´« «¾«¾µ ¼ ¿È´«½ «¾«¿µ ¼ ½ È´«¾ «¾«¿µ ¼ ½È´«¿ «¾«¿µ ¼ ¾ È´« «¾«¿µ ¼ ¾È´«½ «¾« µ ¼ ½ È´«¾ «¾« µ ¼ ¼È´«¿ «¾« µ ¼ È´« «¾« µ ¼ ½È´«½ «¿«½µ ¼ ¼ È´«¾ «¿«½µ ¼ ¼½È´«¿ «¿«½µ ¼ ¾ È´« «¿«½µ ¼ ½È´«½ «¿«¾µ ¼ ½ È´«¾ «¿«¾µ ¼ ¼È´«¿ «¿«¾µ ¼ È´« «¿«¾µ ¼ ¿¾È´«½ «¿«¿µ ¼ ½ È´«¾ «¿«¿µ ¼ ½È´«¿ «¿«¿µ ¼ ¾ È´« «¿«¿µ ¼ ¾È´«½ «¿« µ ¼ ½¾ È´«¾ «¿« µ ¼ ¾È´«¿ «¿« µ ¼ ½¾ È´« «¿« µ ¼È´«½ « «½µ ¼ ¼ È´«¾ « «½µ ¼È´«¿ « «½µ ¼ ½¼ È´« « «½µ ¼ ¼È´«½ « «¾µ ¼ ½ È´«¾ « «¾µ ¼ ¾È´«¿ « «¾µ ¼ ½ È´« « «¾µ ¼ ¾È´«½ « «¿µ ¼ ¿ È´«¾ « «¿µ ¼ ¿È´«¿ « «¿µ ¼ ½ È´« « «¿µ ¼ ½¾È´«½ « « µ ¼ È´«¾ « « µ ¼ ¾È´«¿ « « µ ¼ ½ È´« « « µ ¼ ¼It is convenient to represent elements of the alphabet of ÅÒ by single symbols aswe have done in the examples above. If the alphabet of Å is ½ ¾ Õ , thenwe will use « for a generic element of the alphabet of ÅÒ, and « ½ ¾ Ò will standfor the sequence ½ ¾ Ò. For further abbreviation, we will let Á stand for½ ¾ Ò and use «Á to denote « ½ ¾ Ò, and so on.The statistics of the extension ÅÒ are given by the conditional probabilities È´«Â «Á½ «Á¾ «ÁÔµ.In terms of the alphabet of Å, we haveÈ´«Â «Á½ «Á¾ «ÁÔµ È´ ½ Ò ½½ ½Ò ÔÒµ (1.75)
This can also be written as

$$P(\alpha_J \mid \alpha_{I_1} \alpha_{I_2} \ldots \alpha_{I_p}) = P(a_{j_1} \mid a_{i_{11}} \ldots a_{i_{1n}} \ldots a_{i_{pn}}) \, P(a_{j_2} \mid a_{i_{12}} \ldots a_{i_{pn}} a_{j_1}) \cdots P(a_{j_n} \mid a_{i_{1n}} \ldots a_{i_{pn}} a_{j_1} \ldots a_{j_{n-1}}) \tag{1.76}$$

We can use this relationship to prove the following result.

THEOREM 1.12
If $M^n$ is the $n$th extension of the Markov source $M$ their entropies are related by

$$H(M^n) = nH(M) \tag{1.77}$$

The proof is similar to the proof of the corresponding result for memoryless sources. (Exercise 17 deals with a special case.)

Note that if $m \leq n$, then

$$H(M^m) = mH(M) \leq nH(M) = H(M^n) \tag{1.78}$$

Since an extension of a Markov source is a $p$th-order Markov source, we can consider its adjoint source. If $M$ is an $m$th-order Markov source, $M^n$ is an $n$th extension of $M$, and $\overline{M^n}$ the adjoint source of the extension, then we can combine the results of Theorem 1.10 and Theorem 1.12 to get

$$H(\overline{M^n}) \geq H(M^n) = nH(M) \tag{1.79}$$

1.19 Infinite Sample Spaces

The concept of entropy carries over to infinite sample spaces, but there are a number of technical issues that have to be considered.

If the sample space is countable, the entropy has to be defined in terms of the limit of a series, as in the following example.

EXAMPLE 1.27
Suppose the sample space is the set of natural numbers, $\mathbb{N} = \{0, 1, 2, \ldots\}$ and the probability distribution is given by

$$P(n) = 2^{-n-1} \tag{1.80}$$
for $n \in \mathbb{N}$. The entropy of this distribution is

$$H(P) = -\sum_{n=0}^{\infty} P(n) \log(P(n)) = -\sum_{n=0}^{\infty} 2^{-n-1}(-n-1) = \sum_{n=0}^{\infty} \frac{n+1}{2^{n+1}} \tag{1.81}$$

This infinite sum converges to $2$, so $H(P) = 2$.

If the sample space is a continuum, in particular, the real line, the summations become integrals. Instead of the probability distribution function, we use the probability density function $f$, which has the property that

$$P([a, b]) = \int_a^b f(x)\,dx \tag{1.82}$$

where $[a, b]$ is a closed interval and $f$ is the probability density function.

The mean and variance of the probability density function are defined by

$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \tag{1.83}$$

and

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \tag{1.84}$$

The obvious generalisation of the definition of entropy for a probability density function $f$ defined on the real line is

$$H(f) = -\int_{-\infty}^{\infty} f(x) \log(f(x))\,dx \tag{1.85}$$

provided this integral exists. This definition was proposed by Shannon, but has been the subject of debate because it is not invariant with respect to change of scale or change of co-ordinates in general. It is sometimes known as the differential entropy.

If we accept this definition of the entropy of a continuous distribution, it is easy to compute the entropy of a Gaussian distribution.

THEOREM 1.13
The entropy of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is $\ln(\sqrt{2\pi e}\,\sigma)$ in natural units.
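PROOF The density function of the Gaussian distribution is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / 2\sigma^2} \tag{1.86}$$

Since this is a probability density function, we have

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{1.87}$$

Taking the natural logarithm of $f$, we get

$$\ln(f(x)) = -\ln(\sqrt{2\pi}\,\sigma) - \frac{(x-\mu)^2}{2\sigma^2} \tag{1.88}$$

By definition,

$$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx \tag{1.89}$$

this will be used below.

We now calculate the entropy:

$$\begin{aligned}
H(f) &= -\int_{-\infty}^{\infty} f(x) \ln(f(x))\,dx \\
&= -\int_{-\infty}^{\infty} f(x) \left[ -\ln(\sqrt{2\pi}\,\sigma) - \frac{(x-\mu)^2}{2\sigma^2} \right] dx \\
&= \int_{-\infty}^{\infty} f(x) \ln(\sqrt{2\pi}\,\sigma)\,dx + \int_{-\infty}^{\infty} f(x) \frac{(x-\mu)^2}{2\sigma^2}\,dx \\
&= \ln(\sqrt{2\pi}\,\sigma) \int_{-\infty}^{\infty} f(x)\,dx + \frac{1}{2\sigma^2} \int_{-\infty}^{\infty} f(x)(x-\mu)^2\,dx \\
&= \ln(\sqrt{2\pi}\,\sigma) + \frac{\sigma^2}{2\sigma^2} \\
&= \ln(\sqrt{2\pi}\,\sigma) + \frac{1}{2} \\
&= \ln(\sqrt{2\pi}\,\sigma) + \ln(\sqrt{e}) \\
&= \ln(\sqrt{2\pi e}\,\sigma)
\end{aligned} \tag{1.90}$$

If the probability density function is defined over the whole real line, it is not possible to find a specific probability density function whose entropy is greater than the entropy of all other probability density functions defined on the real line. However, if we restrict consideration to probability density functions with a given variance, it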
44 Fundamentals of Information Theory and Coding Designcan be shown that the Gaussian distribution has the maximum entropy of all thesedistributions. (Exercises 20 and 21 outline the proof of this result.)We have used À´ µ to denote the entropy of the probability distribution whose prob-ability density function is . If is a random variable whose probability densityfunction is , we will denote its entropy by either À´ µ or À´ µ.1.20 Exercises1. Let Ë ×½ ×¾ ×¿ be a sample space with probability distribution È givenby È´×½µ ¼ ¾, È´×¾µ ¼ ¿, È´×¿µ ¼ . Let be a function deﬁned on Ëby ´×½µ , ´×¾µ ¾, ´×¿µ ½. What is the expected value of ?2. Let Ë ×½ ×¾ be a sample space with probability distribution È given byÈ´×½µ ¼ , È´×¾µ ¼ ¿. Let be the function from Ë to Ê¾ given by´×½µ¾ ¼¿ ¼and´×¾µ¼ ¼What is the expected value of ?3. Suppose that a fair die is tossed. What is the expected number of spots on theuppermost face of the die when it comes to rest? Will this number of spotsever be seen when the die is tossed?4. Let Ë ×½ ×¾ ×¿ × be a sample space with probability distribution Ègiven by È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ .There are sixteen possible events that can be formed from the elements of Ë.Compute the probability and surprise of these events.5. Let Ë Ë½ ×½ ×Æ be a sample space for some Æ. Compute the en-tropy of each of the following probability distributions on Ë:(a) Æ ¿, È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ¾ ;(b) Æ , È´×½µ ¼ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ ;(c) Æ , È´×½µ ¼ , È´×¾µ ¼ ½¾ , È´×¿µ ¼ ½¾ , È´× µ ¼ ½¾ ,È´× µ ¼ ½¾ ;(d) Æ , È´×½µ ¼ ¾ , È´×¾µ ¼ ¾ , È´×¿µ ¼ ¾ , È´× µ ¼ ½¾ ,È´× µ ¼ ½¾ ;
Entropy and Information 45(e) Æ , È´×½µ ¼ ¼ ¾ , È´×¾µ ¼ ½¾ , È´×¿µ ¼ ¾ , È´× µ ¼ ,È´× µ ¼ ¼ ¾ .6. Convert the following entropy values from bits to natural units: (a) À´Èµ¼ ; (b) À´Èµ ½ ¼; (c) À´Èµ ½ ; (d) À´Èµ ¾ ¼; (e) À´Èµ ¾ ; (f)À´Èµ ¿ ¼.7. Convert the following entropy values from natural units to bits: (a) À´Èµ¼ ; (b) À´Èµ ½ ¼; (c) À´Èµ ½ ; (d) À´Èµ ¾ ¼; (e) À´Èµ ¾ ; (f)À´Èµ ¿ ¼.8. Let Ë ×½ ×¾ and Ì Ø½ Ø¾ Ø¿ . Let È be a joint probability distributionfunction on Ë ¢ Ì, given by È´×½ Ø½µ ¼ , È´×½ Ø¾µ ¼ ¾ , È´×½ Ø¿µ¼ ½¾ , È´×¾ Ø½µ ¼ ¼ ¾ , È´×¾ Ø¾µ ¼ ¼¿½¾ , È´×¾ Ø¿µ ¼ ¼¿½¾ . Com-pute the marginal distributions ÈË and ÈÌ and the entropies À´Èµ, À´ÈËµand À´ÈÌ µ. Also compute the conditional probability distributions ÈË Ì andÈÌ Ë and their entropies À´ÈË Ì µ and À´ÈÌ Ëµ.9. Draw a diagram to represent the Markov source with alphabet ¼ ½ and setof states ¦ ½ ¾ ¿ with the following ﬁve transitions:(a) ½ ¾, with label 1 and È´¾ ½µ ¼ ;(b) ½ ¿, with label 0 and È´¿ ½µ ¼ ;(c) ¾ ½, with label 0 and È´½ ¾µ ¼ ;(d) ¾ ¿, with label 1 and È´¿ ¾µ ¼ ¾;(e) ¿ ½, with label 1 and È´½ ¿µ ½ ¼.Write down the transition matrix for this source. Is it possible for this source togenerate an output sequence that includes the subsequence ¼¼¼? Is it possiblefor this source to generate an output sequence that includes the subsequence½½½?10. A 2-gram model on the language is given by the probabilitiesÈ´ µ ¼ ¼¼¼ È´ µ ¼ ¾¼¼ È´ µ ¼ ½¿¿È´ µ ¼ ½¿¿ È´ µ ¼ ¼¼¼ È´ µ ¼ ¾¼¼È´ µ ¼ ¾¼¼ È´ µ ¼ ½¿¿ È´ µ ¼ ¼¼¼The probabilities of the individual symbols areÈ´ µ ½ ¿ È´ µ ½ ¿ È´ µ ½ ¿Construct a (ﬁrst-order) Markov source from these models. Draw a diagramto represent the source and write down its transition matrix.
46 Fundamentals of Information Theory and Coding Design11. A 4-gram model on the language ¼ ½ is given byÈ ´½¼½¼µ ¼ ¼ È ´¼½¼½µ ¼ ¼with all other probabilities ¼. All the 3-gram probabilities are also ¼, exceptforÈ ´¼½¼µ ¼ ¼ È ´½¼½µ ¼ ¼Construct a third-order Markov source from these models. Draw a diagram torepresent the source and write down its transition matrix.12. Consider a Markov source with transition matrix¥¼ ¿ ¼¼ ¼ ¾and initial probability distributionÏ¼ ¼¼Compute ÏØfor Ø ½ ¾ ¿ .13. Find the stationary distribution of the Markov source whose transition matrixis¥¼ ½ ¼¼ ¼ ¾*14. Prove that a Markov source with two states always has a stationary distribution,provided that none of the transition probabilities are ¼.15. Find the stationary distribution of the Markov source whose transition matrixis¥¾¼ ¼ ¿ ¼ ¾¼ ¼ ¾ ¼¼ ½ ¼ ¼ ¾¿16. Compute the entropy of the Markov source in Exercise 9 above.*17. Prove that if ÅÒis the Òth extension of the ﬁrst-order Markov source Å , theirentropies are related by À ´ÅÒµ ÒÀ ´Å µ.18. Let Ë ½ ¾ ¿ be the set of positive integers and let È be the probabilitydistribution given by È ´ µ ¾ . Let be the function on Ë deﬁned by´ µ ´ ½µ ·½. What is the expected value of ?19. Let Ù be the probability density function of the uniform distribution over theclosed and bounded interval , so that Ù´Üµ ´ µ ½ if Üand Ù´Üµ ¼ otherwise. Compute the mean, variance and entropy (in naturalunits) of Ù.The next two exercises prove the result that the Gaussian distribution has themaximum entropy of all distributions with a given variance.
20. Use the identity $\ln(x) \le x - 1$ to show that if $f$ and $g$ are probability density functions defined on the real line then
\[ -\int_{-\infty}^{\infty} f(x) \ln(f(x))\,dx \;\le\; -\int_{-\infty}^{\infty} f(x) \ln(g(x))\,dx \]
provided both these integrals exist.

21. Show that if $f$ is a probability density function with mean $\mu$ and variance $\sigma^2$ and $g$ is the probability density function of a Gaussian distribution with the same mean and variance, then
\[ -\int_{-\infty}^{\infty} f(x) \ln(g(x))\,dx = \ln\left(\sqrt{2\pi e}\,\sigma\right) \]
Conclude that of all the probability density functions on the real line with variance $\sigma^2$, the Gaussian has the greatest entropy.

22. Compute the entropy of the probability density function of a uniform distribution on a closed interval whose variance is $\sigma^2$ and confirm that it is less than the entropy of a Gaussian distribution with variance $\sigma^2$. (Use the results of Exercise 19.)

The next four exercises are concerned with the Kullback-Leibler Divergence.

*23. Let $P$ and $Q$ be probability distributions on the sample space $S = \{s_1, s_2, \ldots, s_N\}$ with $P(s_i) = p_i$ and $Q(s_i) = q_i$ for $1 \le i \le N$. The Kullback-Leibler Divergence of $P$ and $Q$ is defined by
\[ D(P, Q) = \sum_{i=1}^{N} p_i \log(p_i / q_i) \]
Compute $D(P, Q)$ and $D(Q, P)$, when $N = 4$, $p_i = 0.25$ for $i = 1, 2, 3, 4$, and $q_1 = 0.125$, $q_2 = 0.25$, $q_3 = 0.125$, $q_4 = 0.5$. Is $D(P, Q) = D(Q, P)$?

*24. If $P$ and $Q$ are probability distributions on $S$, a finite sample space, show that $D(P, Q) \ge 0$ and $D(P, Q) = 0$ if and only if $P = Q$.

*25. If $P$, $Q$, and $R$ are probability distributions on $S = \{s_1, s_2, \ldots, s_N\}$, with $P(s_i) = p_i$, $Q(s_i) = q_i$ and $R(s_i) = r_i$ for $1 \le i \le N$, show that
\[ D(P, R) = D(P, Q) + D(Q, R) \]
if and only if
\[ \sum_{i=1}^{N} (p_i - q_i) \log(q_i / r_i) = 0 \]
48 Fundamentals of Information Theory and Coding Design*26. If the sample space is the real line, it is easier to deﬁne the Kullback-LeiblerDivergence in terms of probability density functions. If Ô and Õ are probabilitydensity functions on Ê, with Ô´Üµ ¼ and Õ´Üµ ¼ for all Ü ¾ Ê, then wedeﬁne´Ô Õµ½ ½Ô´Üµ ÐÒ´Ô´Üµ Õ´Üµµ ÜCompute ´Ô Õµ when Ô is the probability density function of a Gaussiandistribution with mean ½ and variance ¾ and Õ is the probability densityfunction of a Gaussian distribution with mean ¾ and variance ¾.The next two exercises deal with the topic of Maximum Entropy Estimation.*27. We have shown in Section 1.6 that the entropy of the uniform distribution on asample space with Æ elements is ÐÓ ´Æµ and this is the maximum value of theentropy for any distribution deﬁned on that sample space. In many situations,we need to ﬁnd a probability distribution that satisﬁes certain constraints andhas the maximum entropy of all the distributions that satisfy those constraints.One type of constraint that is common is to require that the mean of the dis-tribution should have a certain value. This can be done using Lagrange multi-pliers. Find the probability distribution on Ë ½ ¾ ¿ that has maximumentropy subject to the condition that the mean of the distribution is ¾. To dothis you have to ﬁnd Ô½, Ô¾, Ô¿ and Ô that maximise the entropyÀ´Èµ½Ô ÐÓ ´½ Ô µsubject to the two constraints½Ô ½(so that the Ô form a probability distribution) and½Ô ¾*28. Find the probability distribution on Ë ½ ¾ ¿ that has maximum entropysubject to the conditions that the mean of the distribution is ¾ and the secondmoment of the distribution of the distribution is ¾. In this case the constraintsare½Ô ½½Ô ¾
Entropy and Information 49and½¾Ô ¾The next two exercises deal with the topic of Mutual Information.*29. If È is a joint probability distribution on Ë ¢ Ì , the Mutual Information ofÈ , denoted ÁÅ´È µ, is the Kullback-Leibler Divergence between È and theproduct of the marginals É, given byÉ´× Ø µ ÈË´× µÈÌ´Ø µCompute the Mutual Information of È when Ë ×½ ×¾ , Ì Ø½ Ø¾ Ø¿ ,and È is given by È ´×½ Ø½µ ¼ , È ´×½ Ø¾µ ¼ ¾ , È ´×½ Ø¿µ ¼ ½¾ ,È ´×¾ Ø½µ ¼ ¼ ¾ , È ´×¾ Ø¾µ ¼ ¼¿½¾ , È ´×¾ Ø¿µ ¼ ¼¿½¾ .*30. Show that if Ë ×½ ×¾ ×Å and Ì Ø½ Ø¾ ØÆ , an alternativeexpression for the Mutual Information of È , a joint probability distribution onË ¢Ì , is given byÁÅ´È µÅ½Æ½È ´× Ø µÐÓ ´È ´× Ø µ ÈË´× µµ1.21 References R. Ash, Information Theory, John Wiley Sons, New York, 1965. T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley Sons, New York, 1991. D. S. Jones, Elementary Information Theory, Oxford University Press, Oxford,1979. H. S. Leff and A. F. Rex, Eds., Maxwell’s Demon: Entropy, Information, Com-puting, Adam Hilger, Bristol, 1990. R. D. Rosenkrantz, Ed., E. T. Jaynes: Papers on Probability, Statistics and Sta-tistical Physics, D. Reidel, Dordrecht, 1983. C. E. Shannon and W. Weaver, The Mathematical Theory Of Communication,The University of Illinois Press, Urbana, IL, 1949.
Chapter 2
Information Channels

2.1 What Are Information Channels?

Information does not only have to be used or stored, it also has to be transmitted. In communication systems a transmitter converts, or encodes, a message to a form suitable for transmission through a communications medium, be it a fibre optic channel, satellite link or radio signal through space. The receiver then detects the transmitted signal and decodes it back to the original message. The encoding and decoding operations will be discussed in the following chapters. Although we will be dealing with digital data, physical transmission and storage has to deal with analog signals and media. Thus digital data has to be modulated in some way prior to transmission and storage, and detected by the receiver to reproduce the original digital sequence. Discussion of modern analog and digital communication systems is beyond the scope of this book and interested readers should refer to the standard textbooks in the area.

As we will see, two forms of coding will be necessary. Source coding is required to efficiently store and transmit the information generated by a general-purpose source (e.g., ASCII text, images, audio, etc.) on the storage or transmission medium in use (e.g., computer disk or digital data networks, both requiring conversion to binary data). Channel coding is required to mitigate the effects of noise that may corrupt data during storage and transmission through physical systems. What is noise? Noise can be defined as any unwanted signal or effect in addition to the desired signal.
As such there can be many causes of noise: interference from other signals, thermal and spurious effects generated by the electronic device being used to store or transmit the signal, environmental disturbances, etc.

In Chapter 1 we were concerned with defining the information content of a source. In this chapter we will be concerned with the equally important problem of defining or measuring the information carrying capacity of a channel. Intuitively a channel can carry no more information than is being pushed through the channel itself, that is, the entropy of the source or message being transmitted through the channel. But as we will see the presence of noise reduces the information carrying capacity of the channel and leads us to define another important quantity (alongside entropy) called mutual information. As we view the channel from an information theoretic
point of view, not surprisingly, the concept of mutual information has uses beyond the information carrying capacity of a channel (as does entropy beyond measuring the information content of a source).

The following assumptions will apply in our modelling of an information channel:

Stationary: The statistical nature of the channel and noise do not change with time.

Memoryless: The behaviour of the channel and the effect of the noise at time t will not depend on the behaviour of the channel or the effect of noise at any previous time.

We now formally define a mathematical structure for an information channel.

DEFINITION 2.1 Information Channel
An information channel is a triple $(A, B, \mathbf{P})$, where $A$ is the input alphabet, $B$ is the output alphabet and $\mathbf{P}$ is the set of channel probabilities. $A = \{a_i : i = 1, 2, \ldots, r\}$ is a discrete set of $r = |A|$ symbols (where $|A|$ is the size of the input alphabet), and $B = \{b_j : j = 1, 2, \ldots, s\}$ is a discrete set of $s = |B|$ symbols. The transmission behaviour of the channel is described by the probabilities in $\mathbf{P} = \{P(b_j|a_i) : i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, s\}$, where $P(b_j|a_i)$ is the probability that the output symbol $b_j$ will be received if the input symbol $a_i$ is transmitted.

NOTE The input alphabet represents the symbols transmitted into the channel and the output alphabet represents the symbols received from the channel. The definition of the channel implies that the input and output symbols may be different. In reality, one would expect that the received symbols are the same as those transmitted. However the effect of noise may introduce “new” symbols and thus we use different input and output alphabets to cater for such cases. For more general applications the channel models the matching of the input symbols to prescribed output symbols or classes which are usually different.
In statistical applications the input and output symbols arise from two random variables and the channel models the joint relationship between the variables.

The conditional probabilities that describe an information channel can be represented conveniently using a matrix representation:
\[
\mathbf{P} = \begin{bmatrix}
P(b_1|a_1) & P(b_2|a_1) & \cdots & P(b_s|a_1) \\
P(b_1|a_2) & P(b_2|a_2) & \cdots & P(b_s|a_2) \\
\vdots & \vdots & \ddots & \vdots \\
P(b_1|a_r) & P(b_2|a_r) & \cdots & P(b_s|a_r)
\end{bmatrix} \tag{2.1}
\]
where $\mathbf{P}$ is the channel matrix and for notational convenience we may sometimes rewrite this as:
\[
\mathbf{P} = \begin{bmatrix}
P_{11} & P_{12} & \cdots & P_{1s} \\
P_{21} & P_{22} & \cdots & P_{2s} \\
\vdots & \vdots & \ddots & \vdots \\
P_{r1} & P_{r2} & \cdots & P_{rs}
\end{bmatrix} \tag{2.2}
\]
where we have defined $P_{ij} \equiv P(b_j|a_i)$.

A graphical representation of an information channel is given in Figure 2.1.

FIGURE 2.1
Graphical representation of an information channel. [Input symbols $a_1, a_2, \ldots, a_r$ are joined to output symbols $b_1, b_2, \ldots, b_s$ by arrows labelled with the channel probabilities $P(b_j|a_i)$.]

The channel matrix exhibits the following properties and structure:

• Each row of $\mathbf{P}$ contains the probabilities of all possible outputs from the same input to the channel

• Each column of $\mathbf{P}$ contains the probabilities of all possible inputs to a particular output from the channel

• If we transmit the symbol $a_i$ we must receive an output symbol with probability 1, that is:
\[ \sum_{j=1}^{s} P(b_j|a_i) = 1 \quad \text{for } i = 1, 2, \ldots, r \tag{2.3} \]
that is, the probability terms in each row must sum to 1.

EXAMPLE 2.1
Consider a binary source and channel with input alphabet $\{0, 1\}$ and output alphabet $\{0, 1\}$.
Noiseless: If the channel is noiseless there will be no error in transmission. The channel matrix is given by
\[ \mathbf{P} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \]
and the channel is:

[Channel diagram: input 0 of $A$ goes to output 0 of $B$ and input 1 goes to output 1, each with probability 1.0.]

A typical input-output sequence from this channel could be:

input: 0 1 1 0 0 1 0 1 1 0
output: 0 1 1 0 0 1 0 1 1 0

Noisy: Say the channel is noisy and introduces a bit inversion 1% of the time. Then the channel matrix is given by
\[ \mathbf{P} = \begin{bmatrix} 0.99 & 0.01 \\ 0.01 & 0.99 \end{bmatrix} \]
and the channel is:

[Channel diagram: each input of $A$ goes to the matching output of $B$ with probability 0.99 and to the opposite output with probability 0.01.]

A typical input-output sequence from this channel could be:

input: 0 1 1 0 0 1 0 1 1 0
output: 0 1 1 0 0 1 1 1 1 0

2.2 BSC and BEC Channels

In digital communication systems the input to the channel will be the binary digits $\{0, 1\}$ and this set will be the input alphabet and, ideally, also be the output alphabet. Furthermore, the effect of noise will not depend on the transmission pattern, that is,
the channel is assumed memoryless. There are two possible scenarios for the effect of noise.

Ideally if there is no noise a transmitted 0 is detected by the receiver as a 0, and a transmitted 1 is detected by the receiver as a 1. However in the presence of noise the receiver may produce a different result.

The most common effect of noise is to force the detector to detect the wrong bit (bit inversion), that is, a 0 is detected as a 1, and a 1 is detected as a 0. In this case the information channel that arises is called a binary symmetric channel or BSC where $P(b = 1|a = 0) = P(b = 0|a = 1) = q$ is the probability of error (also called bit error probability, bit error rate (BER), or “crossover” probability) and the output alphabet is also the set of binary digits $\{0, 1\}$. The parameter $q$ fully defines the behaviour of the channel. The BSC is an important channel for digital communication systems as noise present in physical transmission media (fibre optic cable, copper wire, etc.) typically causes bit inversion errors in the receiver.

A BSC has channel matrix
\[ \mathbf{P} = \begin{bmatrix} p & q \\ q & p \end{bmatrix} \]
where $p = 1 - q$ and is depicted in Figure 2.2.

FIGURE 2.2
Binary symmetric channel. [Each input of $A$ goes to the matching output of $B$ with probability $p$ and to the opposite output with probability $q$.]

Another effect that noise (or more usually, loss of signal) may have is to prevent the receiver from deciding whether the symbol was a 0 or a 1. In this case the output alphabet includes an additional symbol, $?$, called the “erasure” symbol that denotes a bit that was not able to be detected. Thus for binary input $\{0, 1\}$, the output alphabet consists of the three symbols, $\{0, ?, 1\}$. This information channel is called a binary erasure channel or BEC where $P(b = ?|a = 0) = P(b = ?|a = 1) = q$ is the probability of error (also called the “erasure” probability). Strictly speaking a BEC does not model the effect of bit inversion; thus a transmitted bit is either received correctly (with probability $p = 1 - q$) or is received as an “erasure” (with probability $q$).
A BEC is becoming an increasingly important model for wirelessmobile and satellite communication channels, which suffer mainly from dropoutsand loss of signal leading to the receiver failing to detect any signal.
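The difference between the two channel models is easy to see by pushing the same bit sequence through each of them. The following is a minimal simulation sketch, not part of the original text: the function names `bsc` and `bec`, the seed and the error probability 0.1 are illustrative assumptions, and each symbol is corrupted independently, which is the memoryless assumption stated earlier.

```python
import random


def bsc(bits, q, rng):
    """Binary symmetric channel: invert each bit independently with probability q."""
    return [bit ^ 1 if rng.random() < q else bit for bit in bits]


def bec(bits, q, rng):
    """Binary erasure channel: replace each bit by '?' independently with probability q."""
    return ['?' if rng.random() < q else bit for bit in bits]


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed so that runs are repeatable
    sent = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    print("input:", sent)
    print("BSC  :", bsc(sent, 0.1, rng))  # errors appear as inverted bits
    print("BEC  :", bec(sent, 0.1, rng))  # errors appear as '?' erasures
```

A BSC output never tells the receiver which bits are wrong, whereas a BEC output marks every damaged position explicitly.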
A BEC has channel matrix
\[ \mathbf{P} = \begin{bmatrix} p & q & 0 \\ 0 & q & p \end{bmatrix} \]
and is depicted in Figure 2.3.

FIGURE 2.3
Binary erasure channel. [Each input of $A$ goes to the matching output of $B$ with probability $p$ and to the erasure symbol $?$ with probability $q$.]

2.3 Mutual Information

To fully specify the behaviour of an information channel it is necessary to specify the characteristics of the input as well as the channel matrix. We will assume that the input characteristics are described by a probability distribution over the input alphabet, with $P(a_i)$ denoting the probability of symbol $a_i$ being input to the channel. Then a channel will be fully specified if the input source probabilities, $P(A) = \{P(a_1), P(a_2), \ldots, P(a_r)\}$, and channel probabilities, $\mathbf{P} = \{P(b_j|a_i)\}_{j=1,\,i=1}^{s,\,r}$, are given.

If a channel is fully specified then the output probabilities, $P(B) = \{P(b_1), P(b_2), \ldots, P(b_s)\}$, can be calculated by:
\[ P(b_j) = \sum_{i=1}^{r} P(b_j|a_i) P(a_i) \tag{2.4} \]

The probabilities $P(b_j|a_i)$ are termed the forward probabilities where forward indicates that the direction of channel use is with input symbol $a_i$ being transmitted and output symbol $b_j$ being received (i.e., $a_i$ then $b_j$, or $b_j$ given $a_i$). We can similarly define the backward probabilities as $P(a_i|b_j)$ indicating the channel is running backwards: output symbol $b_j$ occurs first followed by input symbol $a_i$ (i.e., $b_j$ then $a_i$, or $a_i$ given $b_j$). The backward probabilities can be calculated by application of Bayes’ Theorem as follows:
\[ P(a_i|b_j) = \frac{P(a_i, b_j)}{P(b_j)} = \frac{P(b_j|a_i) P(a_i)}{P(b_j)} = \frac{P(b_j|a_i) P(a_i)}{\sum_{k=1}^{r} P(b_j|a_k) P(a_k)} \tag{2.5} \]
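where $P(a_i, b_j)$ is the joint probability of $a_i$ and $b_j$.

EXAMPLE 2.2
Consider the binary information channel fully specified by:
\[ \mathbf{P} = \begin{bmatrix} 3/4 & 1/4 \\ 1/8 & 7/8 \end{bmatrix}, \qquad P(a = 0) = 2/3, \quad P(a = 1) = 1/3 \tag{2.6} \]
which is usually represented diagrammatically as:

[Channel diagram: inputs 0 and 1 of $A$, with probabilities 2/3 and 1/3, joined to outputs 0 and 1 of $B$; transitions 0 → 0 with probability 3/4, 0 → 1 with 1/4, 1 → 0 with 1/8 and 1 → 1 with 7/8.]

The output probabilities are calculated as follows:
\[ P(b = 0) = P(b = 0|a = 0) P(a = 0) + P(b = 0|a = 1) P(a = 1) = \frac{3}{4} \times \frac{2}{3} + \frac{1}{8} \times \frac{1}{3} = \frac{13}{24} \]
\[ P(b = 1) = 1 - P(b = 0) = \frac{11}{24} \]
and the backward probabilities by:
\[ P(a = 0|b = 0) = \frac{P(b = 0|a = 0) P(a = 0)}{P(b = 0)} = \frac{\frac{3}{4} \cdot \frac{2}{3}}{\frac{13}{24}} = \frac{12}{13} \]
\[ P(a = 1|b = 0) = 1 - P(a = 0|b = 0) = \frac{1}{13} \]
\[ P(a = 1|b = 1) = \frac{P(b = 1|a = 1) P(a = 1)}{P(b = 1)} = \frac{\frac{7}{8} \cdot \frac{1}{3}}{\frac{11}{24}} = \frac{7}{11} \]
\[ P(a = 0|b = 1) = 1 - P(a = 1|b = 1) = \frac{4}{11} \]

Conceptually we can characterise the probabilities as a priori if they provide the probability assignment before the channel is used (without any knowledge), and as a posteriori if the probability assignment is provided after the channel is used (given knowledge of the channel response). Specifically: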
$P(b_j)$: a priori probability of output symbol $b_j$ if we do not know which input symbol was sent

$P(b_j|a_i)$: a posteriori probability of output symbol $b_j$ if we know that input symbol $a_i$ was sent

$P(a_i)$: a priori probability of input symbol $a_i$ if we do not know which output symbol was received

$P(a_i|b_j)$: a posteriori probability of input symbol $a_i$ if we know that output symbol $b_j$ was received

We can similarly refer to the a priori entropy of A:
\[ H(A) = \sum_{a \in A} P(a) \log \frac{1}{P(a)} \tag{2.7} \]
as the average uncertainty we have about the input before the channel output is observed and the a posteriori entropy of A given $b_j$:
\[ H(A|b_j) = \sum_{a \in A} P(a|b_j) \log \frac{1}{P(a|b_j)} \tag{2.8} \]
as the average uncertainty we have about the input after the channel output $b_j$ is observed.

How does our average uncertainty about the input change after observing the output of the channel? Intuitively, we expect our uncertainty to be reduced as the channel output provides us with knowledge and knowledge reduces uncertainty. However, as we will see in the following example the output can sometimes increase our uncertainty (i.e., be more of a hindrance than a help!).

EXAMPLE 2.3
Consider the binary information channel from Example 2.2:

[Channel diagram: inputs 0 and 1 of $A$, with probabilities 2/3 and 1/3, joined to outputs 0 and 1 of $B$; transitions 0 → 0 with probability 3/4, 0 → 1 with 1/4, 1 → 0 with 1/8 and 1 → 1 with 7/8.]
What is our uncertainty of the input that is transmitted through the channel before we observe an output from the channel? This is provided by the entropy based on the given a priori input probabilities, $P(a = 0) = \frac{2}{3}$ and $P(a = 1) = \frac{1}{3}$, yielding the a priori entropy of A, $H(A) = 0.918$.

What is our uncertainty of the input that is transmitted through the channel after we observe an output from the channel? Say we observe an output of $b = 0$, then the a posteriori input probabilities, $P(a = 0|b = 0) = \frac{12}{13}$ and $P(a = 1|b = 0) = \frac{1}{13}$, yield an a posteriori entropy of A, $H(A|b = 0) = 0.391$. Thus we reduce our uncertainty once we observe an output of $b = 0$. But what if we observe an output of $b = 1$? Then the a posteriori input probabilities become $P(a = 1|b = 1) = \frac{7}{11}$ and $P(a = 0|b = 1) = \frac{4}{11}$ and the a posteriori entropy of A is $H(A|b = 1) = 0.946$. Our uncertainty is in fact increased! The reason is that our high expectation of an input of 0, from the a priori probability $P(a = 0) = \frac{2}{3}$, is negated by receiving an output of 1. Thus $P(a = 0|b = 1) = \frac{4}{11}$ is closer to equi-probable than $P(a = 0) = \frac{2}{3}$ and this increases the a posteriori entropy.

Notwithstanding the fact that $H(A|b = 1) > H(A)$ even though $H(A|b = 0) < H(A)$, we can show that if we average across all possible outputs the channel will indeed reduce our uncertainty, that is:
\[ H(A|B) = \sum_{b \in B} P(b) H(A|b) = P(b = 0) H(A|b = 0) + P(b = 1) H(A|b = 1) = 0.645 \]
and thus $H(A|B) < H(A)$.

The average of the a posteriori entropies of A, $H(A|B)$, calculated in Example 2.3 is sometimes referred to as the equivocation of A with respect to B where equivocation is used to refer to the fact that $H(A|B)$ measures the amount of uncertainty or equivocation we have about the input $A$ when observing the output $B$. Together with the a priori entropy of $A$, $H(A)$, we can now establish a measure of how well a channel transmits information from the input to the output.
To derive this quantity consider the following interpretations:

$H(A)$: average uncertainty (or surprise) of the input to the channel before observing the channel output;

$H(A|B)$: average uncertainty (or equivocation) of the input to the channel after observing the channel output;

$H(A) - H(A|B)$: reduction in the average uncertainty of the input to the channel provided or resolved by the channel.
DEFINITION 2.2 Mutual Information
For input alphabet $A$ and output alphabet $B$ the term
\[ I(A;B) = H(A) - H(A|B) \tag{2.9} \]
is the mutual information between $A$ and $B$.

The mutual information, $I(A;B)$, indicates the information about $A$, $H(A)$, that is provided by the channel minus the degradation from the equivocation or uncertainty, $H(A|B)$. The $H(A|B)$ can be construed as a measure of the “noise” in the channel since the noise directly contributes to the amount of uncertainty we have about the channel input, $A$, given the channel output $B$.

Consider the following cases:

noisefree: If $H(A|B) = 0$ this implies $I(A;B) = H(A)$ which means the channel is able to provide all the information there is about the input, i.e., $H(A)$. This is the best the channel will ever be able to do.

noisy: If $H(A|B) > 0$ but $H(A|B) < H(A)$ then the channel is noisy and the input information, $H(A)$, is reduced by the noise, $H(A|B)$, so that the channel is only able to provide $I(A;B) = H(A) - H(A|B)$ amount of information about the input.

ambiguous: If $H(A|B) = H(A)$ the amount of noise totally masks the contribution of the channel and the channel provides $I(A;B) = 0$ information about the input. In other words the channel is useless and is no better than if the channel was not there at all and the outputs were produced independently of the inputs!

An alternative expression to Equation 2.9 for the mutual information can be derived as follows:
\begin{align*}
I(A;B) &= H(A) - H(A|B) \\
&= \sum_{a \in A} P(a) \log \frac{1}{P(a)} - \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{1}{P(a|b)} \\
&= \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{1}{P(a)} - \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{1}{P(a|b)} \\
&= \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a|b)}{P(a)} \tag{2.10}
\end{align*}

Using Equation 2.5 the mutual information can be expressed more compactly:

RESULT 2.1 Alternative expressions for Mutual Information
\[ I(A;B) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a, b)}{P(a) P(b)} = \sum_{a \in A} \sum_{b \in B} P(b|a) P(a) \log \frac{P(b|a)}{P(b)} \tag{2.11} \]
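As a numerical cross-check of Equations 2.4, 2.5 and 2.11, the sketch below recomputes the output probabilities, the backward probabilities and the mutual information for the channel of Example 2.2. The function names are illustrative assumptions, logarithms are taken to base 2 so the result is in bits, and every output symbol is assumed to have non-zero probability so that the Bayes division is defined.

```python
from fractions import Fraction as F
from math import log2


def output_probs(p_in, channel):
    """Equation 2.4: P(b_j) = sum over i of P(b_j|a_i) P(a_i)."""
    return [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
            for j in range(len(channel[0]))]


def backward_probs(p_in, channel):
    """Equation 2.5 (Bayes' Theorem): entry [j][i] holds P(a_i|b_j)."""
    p_out = output_probs(p_in, channel)
    return [[channel[i][j] * p_in[i] / p_out[j] for i in range(len(p_in))]
            for j in range(len(p_out))]


def mutual_information(p_in, channel):
    """Equation 2.11 in bits; terms with zero joint probability contribute nothing."""
    p_out = output_probs(p_in, channel)
    total = 0.0
    for i, pa in enumerate(p_in):
        for j, pb in enumerate(p_out):
            joint = pa * channel[i][j]  # P(a_i, b_j) = P(b_j|a_i) P(a_i)
            if joint > 0:
                total += float(joint) * log2(float(joint / (pa * pb)))
    return total


# Channel of Example 2.2: rows are inputs a = 0, 1; columns are outputs b = 0, 1.
P_IN = [F(2, 3), F(1, 3)]
CHANNEL = [[F(3, 4), F(1, 4)],
           [F(1, 8), F(7, 8)]]

if __name__ == "__main__":
    print(output_probs(P_IN, CHANNEL))
    print(backward_probs(P_IN, CHANNEL))
    print(round(mutual_information(P_IN, CHANNEL), 3))
```

Exact fractions are used for the probabilities so that the intermediate values can be compared directly with the hand calculation; only the logarithm is evaluated in floating point.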
2.3.1 Importance of Mutual Information

The mutual information has been defined in the context of measuring the information carrying capacity of communication channels. However the concept of mutual information has had far-reaching effects on solving difficult estimation and data analysis problems in biomedical applications, image processing and signal processing. In these applications the key in using mutual information is that it provides a measure of the independence between two random variables or distributions.

In image processing and speech recognition the use of the maximum mutual information or MMI between the observed data and available models has yielded powerful strategies for training the models based on the data in a discriminative fashion. In signal processing for communication systems the idea of minimising the mutual information between the vector components for separating mutually interfering signals has led to the creation of a new area of research for signal separation based on the idea of independent component analysis or ICA.

2.3.2 Properties of the Mutual Information

From Example 2.3 we saw that for specific values of the output alphabet, $B$, either $H(A|b_j) > H(A)$ or $H(A|b_j) < H(A)$ but when we averaged over the output alphabet then $H(A|B) < H(A)$. This implies that $I(A;B) > 0$ for this case. But what can we say about $I(A;B)$ for other cases? We restate the following result from Chapter 1:
\[ \sum_{i=1}^{N} p_i \log \frac{q_i}{p_i} \le 0 \tag{2.12} \]
for two sources of size $N$ with symbol probabilities, $p_i$ and $q_i$, respectively, and equality only if $p_i = q_i$ for $i = 1, 2, \ldots, N$. Let $p_i \equiv P(a, b)$ and $q_i \equiv P(a) P(b)$ and $N = |A| \times |B|$. Then from Equations 2.11 and 2.12 we get:
\[ -I(A;B) = \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{P(a) P(b)}{P(a, b)} = \sum_{i=1}^{N} p_i \log \frac{q_i}{p_i} \le 0 \tag{2.13} \]
That is:

RESULT 2.2
The mutual information is a non-negative quantity:
\[ I(A;B) \ge 0 \tag{2.14} \]
with $I(A;B) = 0$ if and only if $P(a, b) = P(a) P(b)$ for all $a$ and $b$, i.e., the input and output alphabets are statistically independent.
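Result 2.2 can be illustrated numerically by computing the mutual information directly from a joint probability table. The sketch below is an illustration only: the function name and the two tables are assumptions chosen for the purpose. The first table factorises as $P(a)P(b)$ and the second is the joint distribution implied by Example 2.2.

```python
from math import log2


def mutual_information_from_joint(joint):
    """I(A;B) in bits from a table with joint[i][j] = P(a_i, b_j)."""
    p_a = [sum(row) for row in joint]        # marginal P(a_i): row sums
    p_b = [sum(col) for col in zip(*joint)]  # marginal P(b_j): column sums
    return sum(p * log2(p / (p_a[i] * p_b[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)


# Independent case: every entry equals P(a_i) * P(b_j) with
# P(A) = {0.5, 0.5} and P(B) = {0.25, 0.75}.
independent = [[0.5 * 0.25, 0.5 * 0.75],
               [0.5 * 0.25, 0.5 * 0.75]]

# Dependent case: P(a_i, b_j) = P(b_j|a_i) P(a_i) for the channel of Example 2.2.
dependent = [[1 / 2, 1 / 6],
             [1 / 24, 7 / 24]]

if __name__ == "__main__":
    print(mutual_information_from_joint(independent))  # zero: A and B independent
    print(mutual_information_from_joint(dependent))    # strictly positive
```

The independent table gives zero because every logarithm in the sum is $\log 1$, which is the equality case of Equation 2.12.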
The expression for $I(A;B)$ provided by Equation 2.11 is symmetric in the variables $a$ and $b$. Thus by exchanging $A$ and $B$ we get the following result:
\[ I(A;B) = I(B;A) \tag{2.15} \]
or
\[ H(A) - H(A|B) = H(B) - H(B|A) \tag{2.16} \]

NOTE The information the channel provides about $A$ upon observing $B$ is the same as the information the channel provides about $B$ upon noting $A$ was sent.

The term $H(B|A)$ is sometimes referred to as the equivocation of B with respect to A and measures the uncertainty or equivocation we have about the output $B$ given the input $A$.

RESULT 2.3
From the fact that $I(A;B) = I(B;A) \ge 0$ it can be stated that, in general and on average, uncertainty is decreased when we know something, that is:
\[ H(A|B) \le H(A) \quad \text{and} \quad H(B|A) \le H(B) \tag{2.17} \]

Other expressions and relationships between the entropies, $H(A)$, $H(B)$, the equivocation, $H(A|B)$, $H(B|A)$, the joint entropy, $H(A, B)$, and the mutual information, $I(A;B) = I(B;A)$, can be derived. All these relations can be summarised by the Venn diagram of Figure 2.4.

FIGURE 2.4
Relationship between all entropy and mutual information expressions. [Venn diagram: two overlapping regions $H(A)$ and $H(B)$ whose union is $H(A, B)$; the overlap is $I(A;B)$ and the non-overlapping parts are $H(A|B)$ and $H(B|A)$.]

From Figure 2.4 additional expressions involving the
joint entropy can be derived:
\begin{align*}
H(A, B) &= H(A) + H(B) - I(A;B) \\
&= H(A) + H(B|A) \\
&= H(B) + H(A|B) \tag{2.18}
\end{align*}
The relation $H(A, B) = H(A) + H(B) - I(A;B)$ can be stated conceptually as:

“The total uncertainty in both $A$ and $B$, $H(A, B)$, is the sum of the uncertainties in $A$ and $B$, $H(A) + H(B)$, minus the information provided by the channel, $I(A;B)$”

whereas the relation $H(A, B) = H(A) + H(B|A)$ becomes:

“The total uncertainty in both $A$ and $B$, $H(A, B)$, is the sum of the uncertainty in $A$ plus the remaining uncertainty in $B$ after we are given $A$.”

REMARK 2.1 Given the input probabilities, $P(A)$, and channel matrix, $\mathbf{P}$, only three quantities need to be calculated directly from the individual probabilities to completely determine the Venn diagram of Figure 2.4 and the remaining three can be derived from the existing entropy expressions.

EXAMPLE 2.4
From Example 2.3 we calculated:
\[ H(A) = 0.918 \]
\[ H(A|B) = 0.645 \]
and from Example 2.2 we can calculate:
\[ H(B) = \sum_{b \in B} P(b) \log \frac{1}{P(b)} = 0.995 \]
The quantities $H(A)$, $H(B)$ and $H(A|B)$ completely determine the Venn diagram and the remaining quantities can be derived by:
\[ I(A;B) = H(A) - H(A|B) = 0.273 \]
\[ H(B|A) = H(B) - I(A;B) = 0.722 \]
\[ H(A, B) = H(A) + H(B) - I(A;B) = 1.640 \]
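The bookkeeping of Remark 2.1 can be sketched in a few lines of code. In this sketch, which is an illustration rather than the procedure of Example 2.4, the three directly computed quantities are $H(A)$, $H(B)$ and $H(B|A)$ (the row entropies of the channel matrix make $H(B|A)$ the most convenient choice), and the remaining three follow from Equations 2.16 and 2.18. The function names and the dictionary layout are assumptions.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of a list of probabilities; zero terms are skipped."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def venn_quantities(p_in, channel):
    """All six quantities of the Venn diagram from P(A) and the channel matrix."""
    p_out = [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
             for j in range(len(channel[0]))]
    # Three quantities computed directly from the probabilities.
    h_a = entropy(p_in)
    h_b = entropy(p_out)
    h_b_given_a = sum(pa * entropy(row) for pa, row in zip(p_in, channel))
    # Three quantities derived from the entropy relations.
    i_ab = h_b - h_b_given_a
    return {
        "H(A)": h_a,
        "H(B)": h_b,
        "H(B|A)": h_b_given_a,
        "I(A;B)": i_ab,
        "H(A|B)": h_a - i_ab,
        "H(A,B)": h_a + h_b - i_ab,
    }


if __name__ == "__main__":
    # Channel of Example 2.2.
    for name, value in venn_quantities([2 / 3, 1 / 3],
                                       [[3 / 4, 1 / 4], [1 / 8, 7 / 8]]).items():
        print(f"{name} = {value:.3f}")
```

Starting from a different triple of directly computed quantities leads to the same six values, which is the point of the remark.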
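The mutual information of a BSC can be expressed algebraically by defining:
\[ \omega = P(a = 0) \;\Rightarrow\; \text{probability that a 0 is transmitted} \]
\[ q = P(b = 1|a = 0) = P(b = 0|a = 1) \;\Rightarrow\; \text{the error probability} \]
and hence:
\[ \bar{\omega} = (1 - \omega) \;\Rightarrow\; \text{probability that a 1 is transmitted} \]
\[ p = (1 - q) \;\Rightarrow\; \text{probability of no error} \]
Then the output probabilities are given by:
\[ P(b = 0) = \omega p + \bar{\omega} q \]
\[ P(b = 1) = \bar{\omega} p + \omega q \]
A sketch of the BSC showing both input and output probabilities is given in Figure 2.5.

FIGURE 2.5
Mutual information of a BSC. [Sketch of the BSC with input probabilities $\omega$, $\bar{\omega}$, transition probabilities $p$, $q$ and output probabilities $\omega p + \bar{\omega} q$, $\bar{\omega} p + \omega q$.]

The expression for $H(B)$ is:
\[ H(B) = (\omega p + \bar{\omega} q) \log \frac{1}{(\omega p + \bar{\omega} q)} + (\bar{\omega} p + \omega q) \log \frac{1}{(\bar{\omega} p + \omega q)} \tag{2.19} \]
The simplified expression for $H(B|A)$ is:
\begin{align*}
H(B|A) &= \sum_{a \in A} P(a) \sum_{b \in B} P(b|a) \log \frac{1}{P(b|a)} \\
&= \omega \left( q \log \frac{1}{q} + p \log \frac{1}{p} \right) + \bar{\omega} \left( q \log \frac{1}{q} + p \log \frac{1}{p} \right) \\
&= q \log \frac{1}{q} + p \log \frac{1}{p} \tag{2.20}
\end{align*}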
and the mutual information is: $I(A;B) = H(B) - H(B|A)$.

EXAMPLE 2.5
What is the mutual information of a BSC for the following cases?

1. $p = q = 0.5$: The channel operates in an ambiguous manner since the errors are as likely as no errors, the output symbols are equally likely, $H(B) = 1$ no matter what is happening at the input (since $\omega + \bar{\omega} = 1$), the equivocation is $H(B|A) = 1$ and the mutual information is $I(A;B) = 0$.

2. $p = 1.0$: The channel is noisefree, $H(B) = H(A) = \omega \log \frac{1}{\omega} + \bar{\omega} \log \frac{1}{\bar{\omega}}$, $H(B|A) = H(A|B) = 0$ and the mutual information is $I(A;B) = H(A) = H(B)$.

3. $\omega = 0.5$: The source exhibits maximum entropy with $H(A) = 1$, the output entropy also exhibits maximum entropy with $H(B) = 1$ (since $p + q = 1$) and the mutual information is given by $I(A;B) = 1 - \left( q \log \frac{1}{q} + p \log \frac{1}{p} \right)$.

4. $\omega = 1.0$: The source contains no information since $H(A) = 0$, the output entropy, $H(B)$, is the same as the channel uncertainty, $H(B|A)$, and the mutual information is $I(A;B) = 0$ since there is no information to transmit.

The mutual information of a BEC can be similarly expressed algebraically. Adopting the same notation used for the BSC (with $\omega = P(a = 0)$, $\bar{\omega} = 1 - \omega$ and $p = 1 - q$) the output probabilities of the BEC are given by:
\[ P(b = 0) = \omega p \]
\[ P(b = 1) = \bar{\omega} p \]
\[ P(b = ?) = \omega q + \bar{\omega} q = q \]
A sketch of the BEC showing both input and output probabilities is given in Figure 2.6.

The expression for $H(B)$ is:
\[ H(B) = \omega p \log \frac{1}{\omega p} + \bar{\omega} p \log \frac{1}{\bar{\omega} p} + q \log \frac{1}{q} \tag{2.21} \]
The simplified expression for $H(B|A)$ reduces to the same expression as Equation 2.20:
\[ H(B|A) = q \log \frac{1}{q} + p \log \frac{1}{p} \tag{2.22} \]
and the mutual information is: $I(A;B) = H(B) - H(B|A)$.
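The closed-form expressions for the two channels can be evaluated directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the argument `w` stands for the probability that a 0 is transmitted, `q` is the error (or erasure) probability, and logarithms are taken to base 2.

```python
from math import log2


def h(*probs):
    """Entropy in bits of the given probabilities; zero terms are skipped."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def bsc_mutual_information(w, q):
    """I(A;B) = H(B) - H(B|A) for a BSC, using Equations 2.19 and 2.20."""
    p, w_bar = 1 - q, 1 - w
    h_b = h(w * p + w_bar * q, w_bar * p + w * q)  # Equation 2.19
    h_b_given_a = h(q, p)                          # Equation 2.20
    return h_b - h_b_given_a


def bec_mutual_information(w, q):
    """I(A;B) = H(B) - H(B|A) for a BEC, using Equations 2.21 and 2.22."""
    p, w_bar = 1 - q, 1 - w
    h_b = h(w * p, w_bar * p, q)                   # Equation 2.21
    h_b_given_a = h(q, p)                          # Equation 2.22
    return h_b - h_b_given_a


if __name__ == "__main__":
    print(bsc_mutual_information(0.3, 0.5))   # ambiguous channel
    print(bsc_mutual_information(0.3, 0.0))   # noisefree channel
    print(bsc_mutual_information(0.5, 0.1))   # equiprobable input
    print(bec_mutual_information(0.5, 0.25))  # BEC with equiprobable input
```

Evaluating `bsc_mutual_information` at the parameter values of Example 2.5 reproduces its four cases. For the BEC, expanding Equation 2.21 shows that the difference $H(B) - H(B|A)$ equals $p$ times the input entropy, which the last call can be used to confirm.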
FIGURE 2.6
Mutual information of a BEC. [Sketch of the BEC with input probabilities $\omega$, $\bar{\omega}$, transition probabilities $p$, $q$ and output probabilities $\omega p$, $q$, $\bar{\omega} p$.]

2.4 Noiseless and Deterministic Channels

We have already discussed the BSC and BEC structures as important channel models for describing modern digital communication systems. We have also loosely referred to channels as being noisefree, noisy and ambiguous. We now formally define noiseless channels as those channels that are not subject to the effects of noise (i.e., no uncertainty of the input that was transmitted given the output that was received). We also define a dual class of channels called deterministic channels for which we can determine the output that will be received given the input that was transmitted.

2.4.1 Noiseless Channels

A noiseless channel will have either the same or possibly more output symbols than input symbols and is such that there is no noise, ambiguity, or uncertainty of which input caused the output.

DEFINITION 2.3 Noiseless Channel
A channel in which there are at least as many output symbols as input symbols, but in which each of the output symbols can be produced by the occurrence only of a particular one of the input symbols is called a noiseless channel. The channel matrix of a noiseless channel has the property that there is one, and only one, non-zero element in each column.

EXAMPLE 2.6
The following channel with 6 outputs and 3 inputs is noiseless because we know, with certainty 1, which input symbol, $a_i \in \{a_1, a_2, a_3\}$, was transmitted given knowledge of the received output symbol, $b_j \in \{b_1, b_2, \ldots, b_6\}$.
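[Channel diagram: $a_1$ goes to $b_1$ with probability 1/3 and to $b_2$ with probability 2/3; $a_2$ goes to $b_3$, $b_4$ and $b_5$ with probabilities 1/4, 1/2 and 1/4; $a_3$ goes to $b_6$ with probability 1.]

The corresponding channel matrix is given by:
\[
\mathbf{P} = \begin{bmatrix}
1/3 & 2/3 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/4 & 1/2 & 1/4 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]
and we note that there is only one non-zero element in each column.

What can we say about the mutual information through a noiseless channel? Let $b_j$ be the received output. Then, from the definition of a noiseless channel, we know which input, say $a^*$, was transmitted and we know this with certainty 1. That is, $P(a^*|b_j) = 1$ for $a^*$ and hence $P(a_i|b_j) = 0$ for all other $a_i$. The equivocation $H(A|B)$ becomes:
\begin{align*}
H(A|B) &= \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{1}{P(a|b)} \\
&= \sum_{b \in B} P(b) \sum_{a \in A} P(a|b) \log \frac{1}{P(a|b)} \\
&= 0
\end{align*}
since:
\[ P(a|b) = 0 \;\text{ then }\; 0 \log \frac{1}{0} = 0 \]
\[ P(a|b) = 1 \;\text{ then }\; 1 \log \frac{1}{1} = 0 \]
and hence $\sum_{a \in A} P(a|b) \log \frac{1}{P(a|b)} = 0$.

The mutual information is then given by the following result.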
RESULT 2.4
The mutual information for noiseless channels is given by:
\[ I(A;B) = H(A) \tag{2.23} \]
That is, the amount of information provided by the channel is the same as the information sent through the channel.

2.4.2 Deterministic Channels

A deterministic channel will have either the same or possibly more input symbols than output symbols and is such that we can determine which output symbol will be received when a particular input symbol is transmitted.

DEFINITION 2.4 Deterministic Channel
A channel in which there are at least as many input symbols as output symbols, but in which each of the input symbols is capable of producing only one of the output symbols is called a deterministic channel. The channel matrix of a deterministic channel has the property that there is one, and only one, non-zero element in each row, and since the entries along each row must sum to 1, that non-zero element is equal to 1.

EXAMPLE 2.7
The following channel with 3 outputs and 6 inputs is deterministic because we know, with certainty 1, which output symbol, $b_j \in \{b_1, b_2, b_3\}$, will be received given knowledge of the transmitted input symbol, $a_i \in \{a_1, a_2, \ldots, a_6\}$.

[Channel diagram: $a_1$, $a_2$ and $a_3$ go to $b_1$, $a_4$ and $a_5$ go to $b_2$, and $a_6$ goes to $b_3$, each with probability 1.]
The corresponding channel matrix is given by:
\[
\mathbf{P} = \begin{bmatrix}
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\]
and we note that there is only one nonzero element in each row and that the element is 1.

What is the mutual information of a deterministic channel? Let $a_i$ be the transmitted input symbol. Then, from the definition of a deterministic channel, we know that the received output will be, say, $b^*$ and we know this with certainty 1. That is, $P(b^*|a_i) = 1$ for $b^*$ and hence $P(b_j|a_i) = 0$ for all other $b_j$. The equivocation $H(B|A)$ then becomes:
\begin{align*}
H(B|A) &= \sum_{a \in A} \sum_{b \in B} P(a, b) \log \frac{1}{P(b|a)} \\
&= \sum_{a \in A} P(a) \sum_{b \in B} P(b|a) \log \frac{1}{P(b|a)} \\
&= 0
\end{align*}
since $\sum_{b \in B} P(b|a) \log \frac{1}{P(b|a)} = 0$.

The mutual information is given by the following result.

RESULT 2.5
The mutual information for deterministic channels is given by:
\[ I(A;B) = H(B) \tag{2.24} \]
That is, the amount of information provided by the channel is the same as the information produced by the channel output.

2.5 Cascaded Channels

In most typical cases of information transmission and storage the data are passed through a cascade of different channels rather than through just the one channel.
One example of where this happens is in modern data communication systems where data can be sent through different physical transmission media links (e.g., copper wire, optical fibre) and wireless media links (e.g., satellite, radio) from transmitter to receiver. Each of these links can be modelled as an independent channel and the complete transmission path as a cascade of such channels. What happens to the information when passed through a cascade of channels as compared to a single channel only?

Intuitively we would expect additive loss of information arising from the cumulative effects of uncertainty (or equivocation) from each channel in the cascade. Only for the case of a noiseless channel would we expect no additional loss of information in passing data through that channel. We verify these and other results when, without loss of generality, we derive the mutual information of a cascade of two channels and compare this with the mutual information through the first channel.

Consider a pair of channels in cascade as shown in Figure 2.7.

[Figure 2.7: Cascade of two channels. Channel AB maps the r-symbol alphabet A to the s-symbol alphabet B; channel BC maps B to the t-symbol alphabet C.]

In Figure 2.7 the output of channel $AB$ is connected to the input of channel $BC$. Thus the symbol alphabet $B$ of size $s$ is both the output from channel $AB$ and the input to channel $BC$. Say the input symbol $a_i$ is transmitted through channel $AB$ and this produces $b_j$ as the output from channel $AB$. Then $b_j$ forms the input to channel $BC$ which, in turn, produces $c_k$ as the output from channel $BC$. The output $c_k$ depends solely on $b_j$, not on $a_i$. Thus we can define a cascade of channels as occurring when the following condition holds:
$$P(c_k|b_j, a_i) = P(c_k|b_j) \tag{2.25}$$
Similarly we can also state that the following will also be true for a cascade of channels:
$$P(a_i|b_j, c_k) = P(a_i|b_j) \tag{2.26}$$
The problem of interest is comparing the mutual information through channel $AB$ only, $I(A;B)$, with the mutual information through the cascade of channel $AB$ with
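channel $BC$, that is, $I(A;C)$. To this end we first show that $H(A|C) - H(A|B) \ge 0$ as follows:
$$H(A|C) - H(A|B) = \sum_{a \in A}\sum_{c \in C} P(a,c)\log\frac{1}{P(a|c)} - \sum_{a \in A}\sum_{b \in B} P(a,b)\log\frac{1}{P(a|b)} = \sum_{a \in A}\sum_{b \in B}\sum_{c \in C} P(a,b,c)\log\frac{P(a|b)}{P(a|c)} \tag{2.27}$$
Equation 2.26 gives $P(a|b) = P(a|b,c)$, noting that $P(a,b,c) = P(a|b,c)P(b,c)$, and given that $\ln\frac{1}{x} \ge 1 - x$ with equality when $x = 1$ we can now state:
$$\begin{aligned} H(A|C) - H(A|B) &\ge \frac{1}{\ln 2}\sum_{a \in A}\sum_{b \in B}\sum_{c \in C} P(b,c)P(a|b,c)\left(1 - \frac{P(a|c)}{P(a|b,c)}\right) \\ &= \frac{1}{\ln 2}\sum_{b \in B}\sum_{c \in C} P(b,c)\left(\sum_{a \in A} P(a|b,c) - \sum_{a \in A} P(a|c)\right) \\ &= \frac{1}{\ln 2}\sum_{b \in B}\sum_{c \in C} P(b,c)\,(1 - 1) = 0 \end{aligned} \tag{2.28}$$
with equality when $P(a|b) = P(a|c)$ such that $P(b,c) \ne 0$.

Since $I(A;B) = H(A) - H(A|B)$ and $I(A;C) = H(A) - H(A|C)$ we have the following result.

RESULT 2.6
For the cascade of channel $AB$ with channel $BC$ it is true that:
$$I(A;B) \ge I(A;C) \quad \text{with equality iff } P(a|b) = P(a|c) \text{ when } P(b,c) \ne 0 \tag{2.29}$$
That is, channels tend to leak information and the amount of information out of a cascade can be no greater (and is usually less) than the information from the first channel.

CLAIM 2.1
If channel $BC$ is noiseless then $I(A;B) = I(A;C)$.

PROOF For noiseless channel $BC$ if $P(c|b) \ne 0$ then this implies that $P(b|c) = 1$. For the cascade of channel $AB$ with $BC$ the condition for $I(A;B) = I(A;C)$ when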
$P(b,c) \ne 0$ is $P(a|b) = P(a|c)$. From Bayes' Theorem we can show that:
$$P(a|c) = \sum_{b \in B} P(a|b,c)P(b|c) \tag{2.30}$$
For a cascade we know that $P(a|b,c) = P(a|b)$ and since $P(b|c) = 1$ when $P(b,c) \ne 0$, and $P(b|c) = 0$ otherwise, for noiseless channel $BC$ this gives the required result that:
$$P(a|c) = P(a|b) \tag{2.31}$$
and hence $I(A;B) = I(A;C)$.

The converse of Claim 2.1, however, is not true since, surprisingly, there may exist particular channel combinations and input distributions which give rise to $I(A;B) = I(A;C)$ even if channel $BC$ is not noiseless. The following example illustrates this point.

EXAMPLE 2.8
Consider the cascade of channel $AB$:

[Figure: channel AB, with transition probabilities 1/2, 1/2 and 1/3, 1/3, 1/3.]

with channel $BC$:

[Figure: channel BC, with transition probabilities 3/4, 1/4, 1, 1/4, 3/4.]

which produces the cascaded channel $AC$:

[Figure: cascaded channel AC, formed by connecting the outputs of channel AB to the inputs of channel BC.]
The corresponding channel matrices for $AB$ and $BC$ are:
$$P_{AB} = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} \qquad P_{BC} = \begin{bmatrix} 3/4 & 0 & 1/4 \\ 0 & 1 & 0 \\ 1/4 & 0 & 3/4 \end{bmatrix}$$
Obviously channel $BC$ is not noiseless. Nevertheless it is true that:
$$I(A;B) = I(A;C)$$
by virtue of the fact that the channel matrix for the cascaded channel $AC$:
$$P_{AC} = P_{AB}P_{BC} = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}\begin{bmatrix} 3/4 & 0 & 1/4 \\ 0 & 1 & 0 \\ 1/4 & 0 & 3/4 \end{bmatrix} = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}$$
is identical to the channel matrix for channel $AB$!

2.6 Additivity of Mutual Information

In the previous section it was shown that when information channels are cascaded there is a tendency for information loss, unless the channels are noiseless. Of particular interest to communication engineers is the problem of how to reduce information loss, especially when confronted with transmission through a noisy channel. The practical outcome of this is the development of channel codes for reliable transmission (channel codes are discussed in Chapters 5 to 9). From a purely information theoretic point of view channel coding represents a form of additivity of mutual information. Additivity is achieved when we consider the average information provided by the channel about a single input symbol upon observing a succession of output symbols. Such a multiplicity of outputs can occur spatially or temporally. Spatial multiplicity occurs when the same input is transmitted simultaneously through more than one channel, thereby producing as many outputs as there are channels. Temporal multiplicity occurs when the same input is repeatedly transmitted through the same channel, thereby producing as many outputs as repeat transmissions. Both cases minimise information loss by exploiting redundancy, either extra channel space or extra time to repeat the same input.
We now develop an expression for the mutual information for the special case of two channel outputs (original plus repeat) and show that there is a gain of mutual information over using the single original channel output.

Consider the channel system shown in Figure 2.8 where $A$ is the input alphabet of size $r = |A|$, $B$ is the original output alphabet of size $s = |B|$ and $C$ is the repeat output alphabet of size $t = |C|$. The output $B$ can be construed as the output from
the first channel (if there are two physical channels) or first transmission (if there is one physical channel) and output $C$ can be construed as the output of the second channel or second (repeat) transmission.

[Figure 2.8: Channel system with two outputs. The r-symbol input alphabet A feeds a channel with the s-symbol output alphabet B and the t-symbol output alphabet C.]

To develop an expression for the mutual information we extend our notion of a priori and a posteriori probabilities and entropies as discussed in Section 2.3 to include the contribution of a second output as follows:

$P(a)$: a priori probability of input symbol $a$ if we do not know which output symbol was received

$P(a|b)$: a posteriori probability of input symbol $a$ if we know that output symbol $b$ was received

$P(a|b,c)$: a posteriori probability of input symbol $a$ if we know that both output symbols, $b$ and $c$, were received.

Thus we can define the following a posteriori entropy:
$$H(A|b,c) = \sum_{a \in A} P(a|b,c)\log\frac{1}{P(a|b,c)} \tag{2.32}$$
the equivocation of $A$ with respect to $B$ and $C$ by:
$$H(A|B,C) = \sum_{b \in B}\sum_{c \in C} P(b,c)H(A|b,c) \tag{2.33}$$
and the amount of information channels $AB$ and $AC$ provide about $A$, that is, the mutual information of $A$ and $(B,C)$, by:
$$I(A;B,C) = H(A) - H(A|B,C) \tag{2.34}$$
What can we say about $I(A;B) = H(A) - H(A|B)$, the amount of information channel $AB$ provides about $A$, in comparison to $I(A;B,C)$, the amount of information both channels $AB$ and $AC$ provide about $A$? Expand Equation 2.34 as follows:
$$I(A;B,C) = H(A) - H(A|B) + H(A|B) - H(A|B,C) = I(A;B) + I(A;C|B) \tag{2.35}$$
where $I(A;C|B) = H(A|B) - H(A|B,C)$ is the amount of information channel $AC$ additionally provides about $A$ after using channel $AB$. It can be shown that:
$$I(A;C|B) \ge 0 \quad \text{with equality iff } H(A|B) = H(A|B,C) \tag{2.36}$$
Thus we have the following result.

RESULT 2.7
For a channel with input $A$ and dual outputs $B$ and $C$ it is true that:
$$I(A;B,C) \ge I(A;B) \quad \text{with equality iff } H(A|B) = H(A|B,C) \tag{2.37}$$
That is, dual use of a channel provides at least as much information as, and usually more than, the single use of a channel.

EXAMPLE 2.9
Consider a BSC where the input symbol, $a$, is transmitted through the channel twice. Let $b$ represent the output from the original transmission and let $c$ be the output from the repeat transmission. Since the same channel is used:
$$P_{AB} = P_{AC} = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$
For simplicity we assume that $P(a=0) = P(a=1) = 0.5$. From Example 2.5 we have that for $P(a=0) = 0.5$:
$$I(A;B) = 1 - \left(q\log\frac{1}{q} + p\log\frac{1}{p}\right)$$
What is the expression for $I(A;B,C)$ and how does it compare to $I(A;B)$? A direct extension of Equation 2.11 yields the following expression for $I(A;B,C)$:
$$I(A;B,C) = \sum_{a \in A}\sum_{b \in B}\sum_{c \in C} P(a,b,c)\log\frac{P(a,b,c)}{P(a)P(b,c)} \tag{2.38}$$
from which we can state that:
$$P(a) = 0.5$$
$$P(b,c) = \sum_{a \in A} P(a,b,c)$$
$$P(a,b,c) = P(a)P(b|a)P(c|a,b) = \tfrac{1}{2}P(b|a)P(c|a)$$
where we note that $P(c|a,b) = P(c|a)$ since the repeat output does not depend on the original output. We list the contribution of each term in the expression of Equation 2.38 as shown by Table 2.1.

Table 2.1 Individual terms of the $I(A;B,C)$ expression

a    bc    P(a,b,c)             P(b,c)                     Type
0    00    $\frac{1}{2}p^2$     $\frac{1}{2}(p^2+q^2)$     X
1    11    $\frac{1}{2}p^2$     $\frac{1}{2}(p^2+q^2)$     X
0    01    $\frac{1}{2}pq$      $pq$                       Z
1    01    $\frac{1}{2}pq$      $pq$                       Z
0    10    $\frac{1}{2}pq$      $pq$                       Z
1    10    $\frac{1}{2}pq$      $pq$                       Z
0    11    $\frac{1}{2}q^2$     $\frac{1}{2}(p^2+q^2)$     Y
1    00    $\frac{1}{2}q^2$     $\frac{1}{2}(p^2+q^2)$     Y

The Type column assigns a class label to each term. The type X terms represent the case of no error in both the original and repeat outputs and these positively reinforce the information provided by the dual use of the channel. Collecting the type X terms we get:
$$2 \cdot \tfrac{1}{2}p^2\log\frac{\frac{1}{2}p^2}{\frac{1}{2}\cdot\frac{1}{2}(p^2+q^2)} = p^2\log\frac{2p^2}{p^2+q^2}$$
The type Y terms represent the case of complete error in both the original and repeat outputs and these negatively reinforce the information provided by the dual use of the channel. Collecting the type Y terms we get:
$$2 \cdot \tfrac{1}{2}q^2\log\frac{\frac{1}{2}q^2}{\frac{1}{2}\cdot\frac{1}{2}(p^2+q^2)} = q^2\log\frac{2q^2}{p^2+q^2}$$
The type Z terms, however, indicate contradictory original and repeat outputs which effectively negate any information that the dual use of the channel may provide. Collecting the type Z terms we see that these make no contribution to the mutual information:
$$4 \cdot \tfrac{1}{2}pq\log\frac{\frac{1}{2}pq}{\frac{1}{2}pq} = 0$$
Putting it all together we get the final expression for $I(A;B,C)$:
$$I(A;B,C) = p^2\log\frac{2p^2}{p^2+q^2} + q^2\log\frac{2q^2}{p^2+q^2}$$
By plotting $I(A;B)$ and $I(A;B,C)$ for different values of $q$ we see from Figure 2.9 that $I(A;B,C) \ge I(A;B)$.

[Figure 2.9: Comparison of I(A;B) with I(A;B,C), plotted against q (error).]

It is interesting to note the conditions for equality, $I(A;B,C) = I(A;B)$. This happens when:
• $q = 0$, there is 100% no error, so $I(A;B,C) = I(A;B) = 1$
• $q = 1$, there is 100% bit reversal, so $I(A;B,C) = I(A;B) = 1$
• $q = 0.5$, there is total ambiguity, so $I(A;B,C) = I(A;B) = 0$
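The comparison in Example 2.9 can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates the triple sum of Equation 2.38 directly for a BSC used twice and compares it with the closed-form expression above. The function names `repeated_bsc_information` and `closed_form` are illustrative assumptions, and base-2 logarithms (bits) are assumed throughout.

```python
import math
from itertools import product


def repeated_bsc_information(q, p_zero=0.5):
    """Return (I(A;B), I(A;B,C)) in bits for a BSC with error probability q
    when the same input symbol is transmitted twice (outputs B and C)."""
    p = 1.0 - q
    prior = [p_zero, 1.0 - p_zero]   # P(a)
    trans = [[p, q], [q, p]]         # trans[a][b] = P(b|a)

    # Single use: sum over (a, b) of P(a,b) log2[P(a,b) / (P(a) P(b))].
    i_ab = 0.0
    for b in (0, 1):
        p_b = sum(prior[a] * trans[a][b] for a in (0, 1))
        for a in (0, 1):
            p_ab = prior[a] * trans[a][b]
            if p_ab > 0:
                i_ab += p_ab * math.log2(p_ab / (prior[a] * p_b))

    # Dual use: Equation 2.38 with P(c|a,b) = P(c|a).
    i_abc = 0.0
    for b, c in product((0, 1), repeat=2):
        p_bc = sum(prior[a] * trans[a][b] * trans[a][c] for a in (0, 1))
        for a in (0, 1):
            p_abc = prior[a] * trans[a][b] * trans[a][c]
            if p_abc > 0:
                i_abc += p_abc * math.log2(p_abc / (prior[a] * p_bc))
    return i_ab, i_abc


def closed_form(q):
    """Closed-form I(A;B,C) of Example 2.9 (equiprobable inputs, 0 < q < 1)."""
    p = 1.0 - q
    s = p * p + q * q
    return p * p * math.log2(2 * p * p / s) + q * q * math.log2(2 * q * q / s)


if __name__ == "__main__":
    for q in (0.0, 0.1, 0.25, 0.5, 1.0):
        i_ab, i_abc = repeated_bsc_information(q)
        print(f"q={q:.2f}  I(A;B)={i_ab:.4f}  I(A;B,C)={i_abc:.4f}")
```

Terms with zero joint probability are skipped, following the usual convention that $0 \log 0 = 0$, so the end points $q = 0$ and $q = 1$ can be evaluated without special cases.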
2.7 Channel Capacity: Maximum Mutual Information

Consider an information channel with input alphabet $A$, output alphabet $B$ and channel matrix $P$ with conditional channel probabilities $P(b_j|a_i)$. The mutual information:
$$I(A;B) = \sum_{a \in A}\sum_{b \in B} P(a,b)\log\frac{P(a,b)}{P(a)P(b)} \tag{2.39}$$
indicates, if we now assume the logarithm is base 2, the amount of information the channel is able to carry in bits per symbol transmitted. The maximum amount of information carrying capacity for the channel is $H(A)$, the amount of information that is being transmitted through the channel. But this is reduced by $H(A|B)$, which is an indication of the amount of "noise" present in the channel.

The expression for mutual information depends not only on the channel probabilities, $P(b|a)$, which uniquely identify a channel, but also on how the channel is used, that is, on the input or source probability assignment, $P(a)$. As such $I(A;B)$ cannot be used to provide a unique and comparative measure of the information carrying capacity of a channel since it depends on how the channel is used. One solution is to ensure that the same probability assignment (or input distribution) is used in calculating the mutual information for different channels. The question then is: which probability assignment should be used? Obviously we can't use an input distribution with $H(A) = 0$ since that means $I(A;B) = 0$ for whatever channel we use! Intuitively an input distribution with maximum information content (i.e., maximum $H(A)$) makes sense. Although this allows us to compare different channels the comparison will not be fair since another input distribution may permit certain channels to exhibit a higher value of mutual information.

The usual solution is to allow the input distribution to vary for each channel and to determine the input distribution that produces the maximum mutual information for that channel.
That is, we attempt to calculate the maximum amount of information a channel can carry in any single use (or source assignment) of that channel, and we refer to this measure as the capacity of the channel.

DEFINITION 2.5 Channel Capacity The maximum average mutual information, $I(A;B)$, in any single use of a channel defines the channel capacity. Mathematically, the channel capacity, $C$, is defined as:
$$C = \max_{P(a)} I(A;B) \tag{2.40}$$
that is, the maximum mutual information over all possible input probability assignments, $P(a)$.

The channel capacity has the following properties:
1. $C \ge 0$ since $I(A;B) \ge 0$
2. $C \le \min\{\log|A|, \log|B|\}$ since $C = \max I(A;B)$ and $\max I(A;B) \le \max H(A) = \log|A|$ when considering the expression $I(A;B) = H(A) - H(A|B)$, and $\max I(A;B) \le \max H(B) = \log|B|$ when considering the expression $I(A;B) = H(B) - H(B|A)$

The calculation of $C$ involves maximisation of $I(A;B)$ over the $r = |A|$ input probabilities, $\{P(a_i) : i = 1, 2, \ldots, r\}$, subject to the two constraints:
1. $P(a_i) \ge 0$
2. $\sum_{i=1}^{r} P(a_i) = 1$

This constrained maximisation of a non-linear function is not a trivial task. Methods that can be used include:
• standard constrained maximisation techniques like the method of Lagrangian multipliers
• gradient search algorithms
• derivation for special cases (e.g., weakly symmetric channels)
• the iterative algorithms developed by Arimoto and Blahut

2.7.1 Channel Capacity of a BSC

For a BSC, $I(A;B) = H(B) - H(B|A)$ where $H(B)$ is given by Equation 2.19 and $H(B|A)$ is given by Equation 2.20. We want to examine how the mutual information varies with different uses of the same channel. For the same channel the channel probabilities, $p$ and $q$, remain constant. However, for different uses the input probability, $w = P(a=0)$, varies from 0 to 1. From Figure 2.10 the maximum mutual information occurs at $P(a=0) = \frac{1}{2}$.

For $P(a=0) = \frac{1}{2}$ the mutual information expression simplifies to:
$$I(A;B) = 1 - \left(p\log\frac{1}{p} + q\log\frac{1}{q}\right) \tag{2.41}$$
Since this represents the maximum possible mutual information we can then state:
$$C_{BSC} = 1 - \left(p\log\frac{1}{p} + q\log\frac{1}{q}\right) \tag{2.42}$$
How does the channel capacity vary for different error probabilities, $q$? From Figure 2.11 the following observations can be made:
• When $q = 0$ or $1$, that is, no error or 100% bit inversion, the BSC will provide its maximum capacity of 1 bit
• When $q = 0.5$ the BSC is totally ambiguous or useless and has a capacity of 0 bits

[Figure 2.10: Variation of I(A;B) with different source assignments for a typical BSC; horizontal axis w, vertical axis I(A;B).]

2.7.2 Channel Capacity of a BEC

For a BEC, $I(A;B) = H(A) - H(A|B)$ where $H(A)$ is given by Equation 2.21 and $H(A|B)$ is given by Equation 2.22. With $w = P(a=0)$ and $\bar{w} = 1 - w$, the expression can be further simplified to:
$$I(A;B) = p\left(w\log\frac{1}{w} + \bar{w}\log\frac{1}{\bar{w}}\right) = (1-q)H(A) \tag{2.43}$$
from which we can immediately see that:
$$C_{BEC} = \max_{P(a)} I(A;B) = \max_{P(a)} (1-q)H(A) = 1 - q \tag{2.44}$$
since we know that for the binary input alphabet, $A$, $\max_{P(a)} H(A) = \log|A| = \log 2 = 1$ occurs when $P(a=0) = \frac{1}{2}$. The following observations can be made:
• when $q = 0$, that is, no erasures, the BEC will provide its maximum capacity of 1 bit
• when $q = 1$, the BEC will be producing only erasures and have a capacity of 0 bits

[Figure 2.11: Variation of channel capacity of a BSC with q; horizontal axis q (error), vertical axis C.]

2.7.3 Channel Capacity of Weakly Symmetric Channels

The calculation of the channel capacity, $C$, is generally quite involved since it represents a problem in constrained optimisation of a non-linear function. However there is a special class of channels where the derivation of the channel capacity can be stated explicitly. Both symmetric and weakly symmetric channels are examples of this special class. We now provide a formal definition of symmetric and weakly symmetric channels.

DEFINITION 2.6 (Weakly) Symmetric Channels A channel is said to be symmetric if the rows and columns of the channel matrix are permutations of each other. A channel is said to be weakly symmetric if the rows of the channel matrix are permutations of each other and the column sums, $\sum_{a \in A} P(b|a)$, are equal.

Obviously a symmetric channel will also have equal column sums since columns are a permutation of each other.
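Definition 2.6 translates directly into a mechanical check on a channel matrix. The sketch below is illustrative only: the function names are assumptions, exact `Fraction` arithmetic is used so that comparisons of column sums are not affected by rounding, and the second test matrix is a hypothetical example chosen for this sketch rather than one taken from the text.

```python
from fractions import Fraction as F


def rows_are_permutations(matrix):
    """True if every row is a permutation of the first row."""
    reference = sorted(matrix[0])
    return all(sorted(row) == reference for row in matrix)


def columns_are_permutations(matrix):
    """True if every column is a permutation of the first column."""
    columns = list(zip(*matrix))
    reference = sorted(columns[0])
    return all(sorted(col) == reference for col in columns)


def is_symmetric(matrix):
    """Symmetric in the sense of Definition 2.6."""
    return rows_are_permutations(matrix) and columns_are_permutations(matrix)


def is_weakly_symmetric(matrix):
    """Rows are permutations of each other and all column sums are equal."""
    column_sums = [sum(col) for col in zip(*matrix)]
    return rows_are_permutations(matrix) and len(set(column_sums)) == 1


if __name__ == "__main__":
    # BSC with p = 9/10, q = 1/10 (values chosen only for illustration).
    bsc = [[F(9, 10), F(1, 10)],
           [F(1, 10), F(9, 10)]]
    # Hypothetical 2-input, 3-output channel: column sums are all 2/3.
    weak = [[F(1, 3), F(1, 6), F(1, 2)],
            [F(1, 3), F(1, 2), F(1, 6)]]
    print(is_symmetric(bsc), is_weakly_symmetric(bsc))
    print(is_symmetric(weak), is_weakly_symmetric(weak))
```

With floating-point entries the equality test on column sums would need a tolerance; exact fractions avoid that design question in a short sketch.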
Consider the weakly symmetric channel $AB$ with mutual information:
$$\begin{aligned} I(A;B) &= H(B) - H(B|A) \\ &= H(B) - \sum_{a \in A} P(a)\sum_{b \in B} P(b|a)\log\frac{1}{P(b|a)} \\ &\le \log|B| - \sum_{a \in A} P(a)H(B|a) \end{aligned} \tag{2.45}$$
where $H(B|a) = \sum_{b \in B} P(b|a)\log\frac{1}{P(b|a)}$ and the summation involves terms in the $a$th row of the channel matrix. Since each row is a permutation of, say, the first row, then $H(B|a_1) = H(B|a_2) = \cdots = H(B|a_r) = H(B|a)$ where $r = |A|$. This means that $\sum_{a \in A} P(a)H(B|a) = H(B|a)\sum_{a \in A} P(a) = H(B|a)$ since $\sum_{a \in A} P(a) = 1$. Since $I(A;B)$ is bounded above, the upper bound of $\log|B| - H(B|a)$ would be the channel capacity if it can be achieved by an appropriate input distribution. Let $P(a) = \frac{1}{r}$; then $P(b) = \sum_{a \in A} P(b|a)P(a) = \frac{1}{r}\sum_{a \in A} P(b|a) = \frac{c_b}{r}$, where $c_b$ is the sum of column $b$. Since the channel is weakly symmetric the column sums, $c_b$, are all equal, $c_b = c$, and we have that $P(b) = \frac{c}{r}$, that is, the output probabilities are equal. Since we know that maximum entropy is achieved with equal symbol probabilities it follows that $H(B) = \log|B|$ when $P(a) = \frac{1}{r}$. We have established that:

THEOREM 2.1
For a symmetric or weakly symmetric channel $AB$, the channel capacity can be stated explicitly as:
$$C = \log|B| - H(B|a) \tag{2.46}$$
where $H(B|a) = \sum_{b \in B} P(b|a)\log\frac{1}{P(b|a)}$ can be calculated for any row $a$. Channel capacity is achieved when the inputs are uniformly distributed, that is, $P(a) = \frac{1}{|A|}$.

EXAMPLE 2.10
Consider the BSC channel matrix:
$$P = \begin{bmatrix} p & q \\ q & p \end{bmatrix}$$
The BSC is a symmetric channel and we can use Theorem 2.1 to explicitly derive the channel capacity as follows:
$$\log|B| = \log 2 = 1$$
$$H(B|a) = p\log\frac{1}{p} + q\log\frac{1}{q}$$
$$\Rightarrow C = 1 - \left(p\log\frac{1}{p} + q\log\frac{1}{q}\right)$$
when $P(a) = \frac{1}{2}$, which is the same expression that was derived in Section 2.7.1.

Now consider a channel with the following channel matrix:
È ¼ ¼ ¿ ¼¼ ¿ ¼ ¼
Obviously the channel is not symmetric, but since the second row is a permutation of the first row and since the column sums are equal then the channel is weakly symmetric and the channel capacity can be explicitly stated as:
ÐÓ ÐÓ ¿ ½ ¼À´ µ ¼ ÐÓ ½¼ ·¼ ¿ÐÓ ½¼ ¿ ·¼ ÐÓ ½¼ ½½ ¼ ½ ¼ ¼¿ ½
when $P(a) = \frac{1}{2}$.

2.8 Continuous Channels and Gaussian Channels

We extend our analysis of information channels to the case of continuous valued input and output alphabets and to the most important class of continuous channel, the Gaussian channel. In digital communication systems noise analysis at the most basic level requires consideration of continuous valued random variables rather than discrete quantities. Thus the Gaussian channel represents the most fundamental form of all types of communication channel systems and is used to provide meaningful insights and theoretical results on the information carrying capacity of channels. The BSC and BEC models, on the other hand, can be considered as high-level descriptions of the practical implementations and operations observed in most digital communication systems.

When considering the continuous case our discrete-valued symbols and discrete probability assignments are replaced by continuous-valued random variables, $X$, with associated probability density functions, $f_X(x)$.
DEFINITION 2.7 Mutual Information of two random variables Let $X$ and $Y$ be two random variables with joint density $f_{XY}(x,y)$ and marginal densities $f_X(x)$ and $f_Y(y)$. Then the mutual information between $X$ and $Y$ is defined as:
$$I(X;Y) = \int\!\!\int f_{XY}(x,y)\log\frac{f_{XY}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy = H(Y) - H(Y|X) \tag{2.47}$$
where $H(Y)$ is the differential entropy:
$$H(Y) = -\int f_Y(y)\log f_Y(y)\,dy \tag{2.48}$$
and $H(Y|X)$ is the conditional differential entropy:
$$H(Y|X) = -\int\!\!\int f_{XY}(x,y)\log f_{Y|X}(y|x)\,dy\,dx \tag{2.49}$$

The mutual information, $I(X;Y)$, provides a measure of the amount of information that can be carried by the continuous channel $XY$. Let us now consider the Gaussian channel shown in Figure 2.12 where $N_i \sim N(0, \sigma_N^2)$ is a zero-mean Gaussian random variable with variance $\sigma_N^2$. Let $X_i$ be the value of the input to the channel at time $i$, and let $Y_i$ be the value of the output from the channel at time $i$.

[Figure 2.12: The Gaussian channel; the noise N_i is added to the input X_i to give the output Y_i.]

The output of the channel is given by:
$$Y_i = X_i + N_i \tag{2.50}$$
That is, the channel output is perturbed by additive white Gaussian noise (AWGN).

We assume that the channel is band-limited to $W$ Hz. Two immediate implications of this are that:
1. Assume the AWGN has a power spectral density of $N_o/2$. Then the noise variance, or average power, is band-limited and given by $\sigma_N^2 = E[N_i^2] = \int_{-W}^{W}\frac{N_o}{2}\,df = N_oW$
2. To faithfully reproduce any signals transmitted through the channel the signals must be transmitted at a rate not exceeding the Nyquist rate of $2W$ samples per second

We further assume that the channel is power-limited to $P$. That is:
$$E[X_i^2] = \sigma_X^2 \le P \tag{2.51}$$
This band-limited, power-limited Gaussian channel just described is not only of theoretical importance in the field of information theory but of practical importance to communication engineers since it provides a fundamental model for many modern communication channels, including wireless radio, satellite and fibre optic links.

2.9 Information Capacity Theorem

In Section 2.7 we defined the channel capacity as the maximum of the mutual information over all possible input distributions. Of importance to communication engineers is the channel capacity of a band-limited, power-limited Gaussian channel. This is given by the following maximisation problem:
$$C = \max_{f_X(x):\,\sigma_X^2 \le P} I(X;Y) = \max_{f_X(x):\,\sigma_X^2 \le P}\{H(Y) - H(Y|X)\} \tag{2.52}$$
We now provide the details of deriving an important and well-known closed-form expression for $C$ for Gaussian channels. The result is the Information Capacity Theorem which gives the capacity of a Gaussian communication channel in terms of the two main parameters that confront communication engineers when designing such systems: the signal-to-noise ratio and the available bandwidth of the system.

CLAIM 2.2
If $Y = X + N$ and $X$ is uncorrelated with $N$ then:
$$H(Y|X) = H(N) \tag{2.53}$$
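PROOF We note from Bayes' Theorem that $f_{XY}(x,y) = f_{Y|X}(y|x)f_X(x)$. Also since $Y = X + N$ and since $X$ and $N$ are uncorrelated we have that $f_{Y|X}(y|x) = f_N(y - x)$. Using these in the expression for $H(Y|X)$ gives:
$$\begin{aligned} H(Y|X) &= -\int\!\!\int f_{XY}(x,y)\log f_{Y|X}(y|x)\,dy\,dx \\ &= -\int f_X(x)\left[\int f_{Y|X}(y|x)\log f_{Y|X}(y|x)\,dy\right]dx \\ &= \int f_X(x)\left[-\int f_N(n)\log f_N(n)\,dn\right]dx \\ &= H(N)\int f_X(x)\,dx = H(N) \end{aligned} \tag{2.54}$$

From Chapter 1 we stated that for a Gaussian random variable the differential entropy attains the maximum value of $\log\left(\sigma\sqrt{2\pi e}\right)$; so for the Gaussian random variable, $N$, we know that:
$$H(N) = \log\left(\sigma_N\sqrt{2\pi e}\right) = \frac{1}{2}\log\left(2\pi e\,\sigma_N^2\right) \tag{2.55}$$
For random variable $Y = X + N$ where $X$ and $N$ are uncorrelated we have that:
$$H(Y) \le \frac{1}{2}\log\left(2\pi e\,\sigma_Y^2\right) = \frac{1}{2}\log\left[2\pi e\left(\sigma_X^2 + \sigma_N^2\right)\right] \tag{2.56}$$
with the maximum achieved when $Y$ is a Gaussian random variable. Thus:
$$\begin{aligned} I(X;Y) &= H(Y) - H(Y|X) = H(Y) - H(N) \\ &\le \frac{1}{2}\log\left[2\pi e\left(\sigma_X^2 + \sigma_N^2\right)\right] - \frac{1}{2}\log\left(2\pi e\,\sigma_N^2\right) \\ &= \frac{1}{2}\log\frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} = \frac{1}{2}\log\left(1 + \frac{\sigma_X^2}{\sigma_N^2}\right) \end{aligned} \tag{2.57}$$
If $X$ is chosen to be a Gaussian random variable with $\sigma_X^2 = P$ then $Y$ will also be a Gaussian random variable and the maximum mutual information or channel capacity will be achieved:
$$C = \frac{1}{2}\log\left(1 + \frac{P}{\sigma_N^2}\right) \text{ bits per transmission} \tag{2.58}$$
Since the channel is also band-limited to $W$ Hz then there can be no more than $2W$ symbols transmitted per second and $\sigma_N^2 = N_oW$. This provides the final form of the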
channel capacity, stated as Shannon's most famous Information Capacity Theorem, which is also known as the Shannon-Hartley Law in recognition of the early work by Hartley.

RESULT 2.8 Information Capacity Theorem
The information capacity of a continuous channel of bandwidth $W$ Hz, perturbed by AWGN of power spectral density $N_o/2$ and bandlimited also to $W$ Hz, is given by:
$$C = W\log\left(1 + \frac{P}{N_oW}\right) \text{ bits per second} \tag{2.59}$$
where $P$ is the average transmitted power and $P/N_oW$ is the signal-to-noise ratio or SNR.

Equation 2.59 provides the theoretical capacity or upper bound on the bits per second that can be transmitted for error-free transmission through a channel for a given transmitted power, $P$, and channel bandwidth, $W$, in the presence of AWGN with power spectral density, $N_o/2$. Thus the information capacity theorem defines a fundamental limit that confronts communication engineers on the rate for error-free transmission through a power-limited, band-limited Gaussian channel.

EXAMPLE 2.11
What is the minimum signal-to-noise ratio that is needed to support a 56k modem?

A 56k modem requires a channel capacity of 56,000 bits per second. We assume a telephone bandwidth of $W = 3500$ Hz. From Equation 2.59 we have:
$$56000 = 3500\log_2(1 + SNR) \Rightarrow SNR = 2^{16} - 1 = 65535$$
or:
$$SNR = 10\log_{10}(65535) \approx 48.16 \text{ dB}$$
Thus a SNR of at least 48 dB is required to support running a 56k modem. In real telephone channels other factors such as crosstalk, co-channel interference, and echoes also need to be taken into account.
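The calculation in Example 2.11 follows from inverting Equation 2.59. The short sketch below shows both directions under stated assumptions: the logarithm is base 2, the SNR is a power ratio, and the 3500 Hz bandwidth used in the demonstration is an assumed value consistent with the 48 dB conclusion of the example. The function names are illustrative.

```python
import math


def capacity_bps(bandwidth_hz, snr):
    """Equation 2.59: C = W log2(1 + SNR), with SNR = P / (N_o W) as a ratio."""
    return bandwidth_hz * math.log2(1.0 + snr)


def min_snr_for_rate(rate_bps, bandwidth_hz):
    """Smallest SNR (as a ratio) whose capacity reaches rate_bps."""
    return 2.0 ** (rate_bps / bandwidth_hz) - 1.0


def to_decibels(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


if __name__ == "__main__":
    rate = 56_000        # bits per second required by a 56k modem
    bandwidth = 3_500    # Hz, assumed telephone bandwidth for this sketch
    snr = min_snr_for_rate(rate, bandwidth)
    print(f"SNR = {snr:.0f} ({to_decibels(snr):.2f} dB)")
    print(f"check: capacity at that SNR = {capacity_bps(bandwidth, snr):.1f} bit/s")
```

Changing `bandwidth` shows how strongly the required SNR depends on the available bandwidth, since the rate-to-bandwidth ratio appears in the exponent.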
2.10 Rate Distortion Theory

Consider a discrete, memoryless source with alphabet $X = \{x_i : i = 1, 2, \ldots, q\}$ and associated probabilities $\{P(x_i) : i = 1, 2, \ldots, q\}$. Source coding is usually required to efficiently transmit or store messages from the source in the appropriate representation for the storage or communication medium. For example, with digital communication channels and storage, source symbols need to be encoded as a binary representation. Furthermore, source symbols transmitted through a communication channel may be output in a different symbol representation. Thus the source symbols will appear as symbols from a representation or code word alphabet $Y = \{y_j : j = 1, 2, \ldots, r\}$. If there are no losses or distortions in the coding or transmission there is perfect representation and $x_i$ can be recovered fully from its representation $y_j$. But in the following situations:
• lossiness in the source coding where the code alphabet and permitted code words do not allow exact representation of the source symbols and the decoding is subject to errors,
• insufficient redundancy in the channel code such that the rate of information is greater than the channel capacity,
the representation is not perfect and there are unavoidable errors or distortions in the representation of the source symbol $x_i$ by the representation symbol $y_j$.

Rate distortion theory, first developed by Shannon, deals with the minimum mutual information (equivalent to the information or code rate) that the channel must possess, for the given source symbol probability distributions, to ensure that the average distortion is guaranteed not to exceed a specified threshold $D$. To derive this value we first define what we mean by an information or code rate (this is for the general case; in Chapter 5 we define the code rate for the specific case that applies with channel codes).

DEFINITION 2.8 Code Rate (General Case) Assume one of $M$ possible source messages is represented as a code word of length $n$.
Let $H(M)$ be the average number of bits transmitted with each source message, then the code rate $R$ is defined as:
$$R = \frac{H(M)}{n} \tag{2.60}$$
For the case of equally likely source messages, $H(M) = \log M$. For the general case, $H(M)$ is the entropy of the source message.
EXAMPLE 2.12
Consider a binary source with symbols $\{x_1 = 0, x_2 = 1\}$ that are detected at the receiver as the ternary representation alphabet with symbols $\{y_1 = 0, y_2 = ?, y_3 = 1\}$. This representation covers the following three cases:
• $x_1 = 0$ is detected as $y_1 = 0$ and $x_2 = 1$ is detected as $y_3 = 1$, that is, there is no error in representation.
• $x_1 = 0$ is detected as $y_2 = ?$ and $x_2 = 1$ is detected as $y_2 = ?$, that is, the receiver fails to detect the symbol and there is an erasure.
• $x_1 = 0$ is detected as $y_3 = 1$ and $x_2 = 1$ is detected as $y_1 = 0$, that is, the receiver detects the symbol incorrectly and there is a bit inversion.

The corresponding channel is shown in Figure 2.13 where $P_{ij} = P(y_j|x_i)$ are the transition probabilities and we are given that $P(x_1) = P(x_2) = 0.5$.

[Figure 2.13: Graphical representation of information channel; inputs x_1 = 0 and x_2 = 1, outputs y_1 = 0, y_2 = ? and y_3 = 1, with transition probabilities P_11, P_12, P_13, P_21, P_22, P_23.]

We define a distortion measure for this channel as follows:
• $d(x_1, y_1) = d(x_2, y_3) = 0$ to indicate there is no distortion (or distortion of 0) when there is no error in the representation.
• $d(x_1, y_2) = d(x_2, y_2) = 1$ to indicate there is a distortion or cost of 1 when there is an erasure.
• $d(x_1, y_3) = d(x_2, y_1) = 3$ to indicate there is a distortion of 3 where there is a bit inversion.
[Figure 2.14: Graphical representation of distortion measure; the branches from x_1 to y_1, y_2, y_3 carry distortions 0, 1, 3 and the branches from x_2 carry distortions 3, 1, 0.]

Note that a higher distortion of 3 is attributed to bit inversions compared to a lower distortion of 1 when there is an erasure. The set of distortion measures is represented graphically in Figure 2.14.

Assume the following conditional probability assignment for the channel:
$$P = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.2 & 0.7 \end{bmatrix} \tag{2.66}$$
We calculate the representation alphabet probabilities: $P(y_1) = 0.4$, $P(y_2) = 0.2$, $P(y_3) = 0.4$, and the mutual information (from Equation 2.63) is $I(X;Y) = 0.365$. The average distortion is then:
$$\bar{d} = \sum_{i=1}^{2}\sum_{j=1}^{3} P(x_i)P(y_j|x_i)\,d(x_i, y_j) = 2 \cdot \tfrac{1}{2}\left[(0.7)(0) + (0.2)(1) + (0.1)(3)\right] = 0.5$$
Consider the derivation of $R(0.5)$, that is, the rate distortion function for $D = 0.5$. The conditional probability assignment of Equation 2.66 is $D$-admissible and provides an information rate of $I(X;Y) = 0.365$. Is there another probability assignment with $\bar{d} \le 0.5$ and information rate $I(X;Y) < 0.365$? Intuitively one expects the minimum information rate to occur at the maximum allowable distortion of $\bar{d} = 0.5$. To investigate this we consider the derivation of the corresponding BSC and BEC equivalents of Figure 2.13 such that $\bar{d} = 0.5$. The following conditional probability assignments:
$$P_{BSC} = \begin{bmatrix} 5/6 & 0 & 1/6 \\ 1/6 & 0 & 5/6 \end{bmatrix} \qquad P_{BEC} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \end{bmatrix}$$
both yield the same average distortion of 0.5. For the BSC equivalent the mutual information is calculated as $I_{BSC}(X;Y) = 0.350$ and for the BEC equivalent we have $I_{BEC}(X;Y) = 0.5$. Thus the BSC equivalent is the channel with lowest information rate for the same level of average distortion. To find $R(0.5)$ we need to consider all $D$-admissible conditional probability assignments and find the one providing the minimum value for the mutual information.

EXAMPLE 2.13
An important example of rate distortion analysis is for the case of analog to digital conversion when the analog source signal has to be represented as a digital source signal. An important component of this conversion is the quantisation of the continuous-valued analog signal to a discrete-valued digital sample.

Consider source symbols generated from a discrete-time, memoryless Gaussian source with zero mean and variance $\sigma^2$. Let $x$ denote the value of the source symbol or sample generated by this source. Although the source symbol is discrete in time it is continuous-valued and a discrete-valued (quantised) representation of $x$ is needed for storage and transmission through digital media. Let $y$ be a symbol from the discrete-valued representation alphabet $Y$ that is used to represent the continuous-valued $x$. For example $Y$ can be the set of non-negative integers and $y = \mathrm{round}(x)$ is the value of $x$ rounded to the nearest integer. It should be noted that, strictly speaking, a finite representation is needed since digital data are stored in a finite number of bits. This is usually achieved in practice by assuming a limited dynamic range for $x$ and limiting the representation alphabet $Y$ to a finite set of non-negative integers.
Thus we refer to $y$ as the quantised version of $x$ and the mapping of $x$ to $y$ as the quantisation operation.

The most intuitive distortion measure between $x$ and $y$ is a measure of the error in representing $x$ by $y$ and the most widely used choice is the squared error distortion:
$$d(x, y) = (x - y)^2 \tag{2.67}$$
It can be shown with the appropriate derivation (see [4, 5]) that the rate distortion function for the quantisation of a Gaussian source with variance $\sigma^2$ with a squared error distortion is given by:
$$R(D) = \begin{cases} \frac{1}{2}\log\frac{\sigma^2}{D} & 0 \le D \le \sigma^2 \\ 0 & D > \sigma^2 \end{cases} \tag{2.68}$$
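Equation 2.68 is simple to evaluate numerically. The following is a minimal sketch, assuming base-2 logarithms so that the rate is in bits per sample; the function name is illustrative, and $D = 0$ is rejected rather than returning infinity because the logarithm is unbounded there.

```python
import math


def gaussian_rate_distortion(variance, distortion):
    """R(D) of Equation 2.68 for a Gaussian source with squared error distortion.

    Returns the rate in bits per sample (base-2 logarithm assumed).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    if distortion <= 0:
        raise ValueError("R(D) is unbounded as D approaches 0")
    if distortion >= variance:
        return 0.0  # D >= sigma^2: no information needs to be conveyed
    return 0.5 * math.log2(variance / distortion)


if __name__ == "__main__":
    sigma2 = 1.0
    for d in (0.01, 0.1, 0.25, 0.5, 1.0, 2.0):
        print(f"D={d:<5} R(D)={gaussian_rate_distortion(sigma2, d):.4f} bits")
```

Each halving of the rate budget by one bit corresponds to a fourfold increase in the tolerated distortion, which is visible directly in the printed values.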
2.10.1 Properties of $R(D)$

By considering the minimum and maximum permissible values for $R(D)$ and the corresponding distortion threshold, $D$, a more intuitive understanding of the behaviour of the rate distortion function, $R(D)$, is possible. Obviously the minimum value of $R(D)$ is 0, implying that there is no minimum rate of information, but what does this mean and what does the corresponding distortion threshold imply? Intuitively the maximum value of $R(D)$ should be $H(X)$ and this should happen when $D = 0$ (perfect reconstruction), but is this really the case? We answer these questions by stating and proving the following result.

RESULT 2.9
The rate distortion function is a monotonically decreasing function of $D$, limited in range by:
$$0 \le R(D) \le H(X) \tag{2.69}$$
where $R(D_{min}) = H(X)$ indicates that the upper bound occurs for the minimum possible value of the distortion, $D_{min}$, and $R(D_{max}) = 0$ indicates that there is a maximum permissible value $D_{max}$ such that $R(D) = 0$ for $D \ge D_{max}$.

A typical plot of $R(D)$ as a function of $D$ for the case when $D_{min} = 0$ is shown in Figure 2.15.

[Figure 2.15: Sketch of typical function R(D); R(D) falls from H(X) at D = 0 to 0 at D = D_max.]

We can prove that $R(D_{min}) = H(X)$ by considering $D_{min}$, the minimum permissible value for the distortion $D$. Since $\bar{d} \le D$ this is equivalent to finding the smallest possible $\bar{d}$. That is, we want to find the conditional probability assignment that
minimises:

$$\bar{d} = \sum_{i=1}^{q} \sum_{j=1}^{r} P(x_i) P(y_j|x_i)\, d(x_i, y_j) \qquad (2.70)$$

The minimum occurs by considering, for each $x_i$, the value of $y_j$ that minimises $d(x_i, y_j)$ and setting $P(y_j|x_i) = 1$ for these values, with all other conditional probabilities set to zero. For each $x_i$ define $d(x_i) = \min_j d(x_i, y_j)$ as the minimum distortion for $x_i$ and $y_{j(x_i)}$, where $j(x_i) = \arg\min_j d(x_i, y_j)$, as the representation symbol that yields the minimum distortion. We set $P(y_{j(x_i)}|x_i) = 1$ with all other conditional probabilities $P(y_j|x_i)$ for $y_j \ne y_{j(x_i)}$ being set to zero. Hence:

$$D_{min} = \sum_{i=1}^{q} P(x_i)\, d(x_i) \qquad (2.71)$$

In typical applications $y_{j(x_i)}$ is the representation alphabet symbol that uniquely identifies the source symbol $x_i$ and in such cases the above conditional probability assignment implies that $H(A|B) = 0$ (perfect reconstruction since there is no uncertainty) and hence $I(A;B) = H(A)$. Thus we have that $R(D_{min}) = H(A)$. Furthermore we typically assign a distortion of 0 for the pair $(x_i, y_{j(x_i)})$ and this means that $D_{min} = 0$; thus $R(0) = H(A)$.

Derivation of the maximum permissible $D_{max}$ relies on the observation that this condition occurs when $R(D_{max}) = 0$. That is, $I(A;B) = 0$ and thus $P(y_j|x_i) = P(y_j)$: the representation symbols are independent of the source symbols and there is no information conveyed by the channel. The average distortion measure for a channel with $I(A;B) = 0$ is given by:

$$\bar{d} = \sum_{j=1}^{r} P(y_j) \left( \sum_{i=1}^{q} P(x_i)\, d(x_i, y_j) \right) \qquad (2.72)$$

Since we are looking for $D_{max}$ such that $R(D) = 0$ for $D \ge D_{max}$, $D_{max}$ is given by the minimum value of Equation 2.72 over all possible probability assignments for $\{P(y_j)\}$. Define $j^* = \arg\min_j \sum_{i=1}^{q} P(x_i)\, d(x_i, y_j)$. The minimum of Equation 2.72 occurs when $P(y_{j^*}) = 1$ for that $y_{j^*}$ such that the expression $\sum_{i=1}^{q} P(x_i)\, d(x_i, y_j)$ is smallest, and $P(y_j) = 0$ for all other $y_j \ne y_{j^*}$. This gives:

$$D_{max} = \min_{j = 1, \ldots, r} \left( \sum_{i=1}^{q} P(x_i)\, d(x_i, y_j) \right) \qquad (2.73)$$

Equation 2.73 implies that if we are happy to tolerate an average distortion that is as much as $D_{max}$ then there is a choice of representation $y_j$ that is independent of $x_i$ such that $\bar{d} = D_{max}$.
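Equations 2.71 and 2.73 are simple enough to evaluate directly. The sketch below is one possible way of doing so; the function name is hypothetical, and the source probabilities and distortion matrix are illustrative values assumed here (a binary equiprobable source with three representation symbols), not a prescription from the text.

```python
def distortion_limits(p_x, d):
    """Return (D_min, D_max) for source probabilities p_x and distortion matrix d.

    d[i][j] is the distortion incurred when source symbol x_i is
    represented by y_j (illustrative layout, assumed for this sketch).
    """
    # Equation 2.71: every x_i is mapped to its least-distortion y_j
    d_min = sum(p * min(row) for p, row in zip(p_x, d))
    # Equation 2.73: the single best y_j is used regardless of the source symbol
    n_outputs = len(d[0])
    d_max = min(sum(p * row[j] for p, row in zip(p_x, d))
                for j in range(n_outputs))
    return d_min, d_max

p_x = [0.5, 0.5]          # assumed equiprobable binary source
d = [[0, 1, 3],           # assumed distortion values for x_1
     [3, 1, 0]]           # assumed distortion values for x_2
print(distortion_limits(p_x, d))
```

For these assumed values the middle representation symbol gives the smallest expected distortion, so it is the one selected by the minimisation in Equation 2.73.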
EXAMPLE 2.14
Consider the channel from Example 2.12. The following conditional probability assignment:

$$P = \begin{bmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}$$

obviously implies that $I(A;B) = 0$. The resulting average distortion is:

$$\bar{d} = \frac{1}{2} \cdot \frac{1}{3} \left( 0 + 1 + 3 + 3 + 1 + 0 \right) = 1\tfrac{1}{3}$$

However this may not be $D_{max}$ since there may be other conditional probability assignments for which $I(A;B) = 0$ that provide a lower distortion. Indeed using Equation 2.73 gives:

$$D_{max} = \frac{1}{2} \min_{j = 1, 2, 3} \left[ d(x_1, y_j) + d(x_2, y_j) \right] = \frac{1}{2} \min \{ (0 + 3), (1 + 1), (3 + 0) \} = 1$$

Thus $D_{max} = 1$ and this occurs with the following conditional probability assignment:

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

Interestingly this represents the condition where, no matter what input is transmitted through the channel, the output will always be $y_2$ with an average cost of 1. Thus if we can tolerate an average distortion of 1 or more we might as well not bother!

EXAMPLE 2.15
Consider the rate distortion function given by Equation 2.68 in Example 2.13 for the quantisation of a Gaussian source with squared error distortion. Since the source $x$ is continuous-valued and $y$ is discrete-valued, for $D_{min} = 0$ the rate distortion function must be infinite since no amount of information will ever be able to reconstruct $x$ from $y$ with no errors, and the quantisation process, no matter what discrete-valued representation is used, will always involve a loss of information. That is, from Equation 2.68:

$$R(D) \to \infty \quad \text{as} \quad D \to 0$$

It should be noted that this does not contradict Result 2.9 since that result implicitly assumed that both $x$ and $y$ were discrete-valued.
For the case of zero rate distortion Equation 2.68 gives:

$$R(D) = 0 \quad \text{for} \quad D \ge \sigma^2$$

This result can be confirmed intuitively by observing that if no information is provided (i.e., the receiver or decoder does not have access to $y$) the best estimate for $x$ is its mean value, from which the average squared error will, of course, be the variance, $\sigma^2$.

2.11 Exercises

1. A binary channel correctly transmits a 0 (as a 0) twice as many times as transmitting it incorrectly (as a 1) and correctly transmits a 1 (as a 1) three times more often than transmitting it incorrectly (as a 0). The input to the channel can be assumed equiprobable.
(a) What is the channel matrix $P$? Sketch the channel.
(b) Calculate the output probabilities, $P(b)$.
(c) Calculate the backward channel probabilities, $P(a|b)$.

2. Secret agent 101 communicates with her source of information by phone, unfortunately in a foreign language over a noisy connection. Agent 101 asks questions requiring only yes and no answers from the source. Due to the noise and language barrier, Agent 101 hears and interprets the answer correctly only 75% of the time, she fails to understand the answer 10% of the time and she misinterprets the answer 15% of the time. Before asking the question, Agent 101 expects the answer yes 80% of the time.
(a) Sketch the communication channel that exists between Agent 101 and her source.
(b) Before hearing the answer to the question what is Agent 101's average uncertainty about the answer?
(c) Agent 101 interprets the answer over the phone as no. What is her average uncertainty about the answer? Is she more uncertain or less uncertain about the answer given her interpretation of what she heard? Explain!
(d) Agent 101 now interprets the answer over the phone as yes. What is her average uncertainty about the answer? Is she more uncertain or less uncertain about the answer given her interpretation of what she heard? Explain and compare this with the previous case.
Information Channels 973. Calculate the equivocation of A with respect to B, À´ µ, for the com-munication channel of Qu. 2. Now calculate the mutual information of thechannel as Á´ µ À´ µ À´ µ. What can you say about the chan-nel given your answers for Qu. 2(c) and 2(d)? Now verify that Á´ µÀ´ µ À´ µ by calculating À´ µ and À´ µ.*4. A friend of yours has just seen your exam results and has telephoned to tellyou whether you have passed or failed. Alas the telephone connection is sobad that if your friend says “pass” you mistake that for “fail” 3 out of 10 timesand if your friend says “fail” you mistake that for “pass” 1 out of 10 times.Before talking to your friend you were 60% conﬁdent that you had passed theexam. How conﬁdent are you of having passed the exam if you heard yourfriend say you have passed?5. Which is better “erasure” or “crossover”?(a) Consider a ﬁbre-optic communication channel with crossover probabil-ity of ½¼ ¾ and a wireless mobile channel with erasure probability of½¼ ¾. Calculate the mutual information assuming equiprobable inputsfor both types of communication channels. Which system provides moreinformation for the same bit error rate?*(b) Let us examine the problem another way. Consider a ﬁbre-optic com-munication channel with crossover probability of Õ Ë and a wirelessmobile channel with erasure probability of Õ . Assume equiproba-ble inputs and express the mutual information Á´ µ for the ﬁbre opticchannel as a function of Õ Ë and for the wireless mobile channel as afunction of Õ . Calculate Õ Ë and Õ for the same mutual infor-mation of ¼ . Which channel can get away with a higher bit error rateand still provide the same amount of mutual information?6. Given À´ µ ¼ ¿, À´ µ ¼ ¾ and À´ µ ¼ ¿ ﬁnd À´ µ,À´ µ and Á´ µ.7. Consider the following binary channel:0101A B2/33/54/73/7Calculate Á´ µ À´ µ À´ µ À´ µ À´ µ Ò À´ µ as painlesslyas possible.
98 Fundamentals of Information Theory and Coding Design8. Here are some quick questions:(a) Prove that if a BSC is shorted (i.e., all the outputs are grounded to 0) thenthe channel provides no information about the input.(b) Consider the statement: “Surely if you know the information of thesource, À´ µ, and the information provided by the channel, Á´ µ,you will know the information of the output, À´ µ?” Prove whetherthis statement is true or not. If not, under what conditions, if any, wouldit be true?(c) Conceptually state (in plain English) what À´ µ À´ µ Á´ µmeans.9. Sketch a sample channel with Öinputs and ×outputs, ﬁnd the expression for themutual information, the channel capacity and the input probability distributionto achieve capacity, for the following cases:(a) a noiseless, non-deterministic channel(b) a deterministic, non-noiseless channel(c) a noiseless, deterministic channel*10. It was established that the minimum value of Á´ µ is 0, that is Á´ µ ¼.What is the maximum value of Á´ µ?11. Show that the mutual information expression for a BEC can be simpliﬁed to:Á´ µ Ô ÐÓ½· ÐÓ½12. Consider the following errors-and-erasure channel:1−p−q1−p−q0 0qqp1 1?pBAFind all the values of Ôand Õ for which the above channel is:
Information Channels 99(a) totally ambiguous (i.e., Á´ µ ¼),(b) noiseless,(c) deterministic.13. Channel has channel matrix:È¾½ ¼¾¿½¿¼ ½¿and is connected to channel with matrix:È½ ¿ ¼¼ ¼ ½The ternary input to the channel system has the following statistics: È ´ ½µ½ , È ´ ¾µ ¿ , and È ´ ¿µ ½ ¾. The output of the channel is andthe output of channel is .(a) Calculate À´ µ, À´ µ and À´ µ.(b) Calculate Á´ µ, Á´ µ and Á´ µ.(c) What can you say about channel ? Explain!14. The input to a BEC is repeated as shown below:0101Appqq?BC (from repeat transmission)(from original transmission)Given equiprobable binary inputs derive the expression for Á´ µ andshow that Á´ µ Á´ µ for:(a) Ô Õ ½ ¾(b) Ô ½ ¿(c) Ô ¾ ¿
100 Fundamentals of Information Theory and Coding Design15. Agent 01 contacts two of his sources by email for some straight “yes” and“no” answers. The ﬁrst source he contacts is known to be unreliable and togive wrong answers about 30% of the time. Hence Agent 01 contacts hissecond source to ask for the same information. Unfortunately, the secondsource insists on using a non-standard email encoding and Agent 01 ﬁnds thatonly 60% of the answers are intelligible.(a) What is the average uncertainty Agent 01 has about the input given theanswers he receives from his ﬁrst source? Hence what is the mutualinformation of the ﬁrst source?(b) What is the average uncertainty Agent 01 has about the input given theanswers he receives from both the ﬁrst and second source? Hence whatis the mutual information from both sources?*16. In order to improve utilisation of a BSC, special input and output electron-ics are designed so that the input to the BSC is sent twice and the output isinterpreted as follows:¯ if two 0’s are received the channel outputs a single 0¯ if two 1’s are received the channel outputs a single 1¯ if either 01 or 10 is received the channel outputs a single ?Derive the expression for the mutual information through this augmented BSCassuming equiprobable inputs and compare this with the mutual informationof a standard BSC. Either analytically or numerically show whether this aug-mented BSC is superior to the standard BSC. What price is paid for this “im-proved” performance?17. The mutual information between two random variables, and , is deﬁnedby Equation 2.47. Explain how Á´ µ provides a measure of the amount ofindependence between and .*18. A digital transmission channel consists of a terrestrial ﬁbre-optic link with ameasured cross-over (bit error) probability of 0.1 followed by a satellite linkwith an erasure probability of 0.2. No prior statistics regarding the sourceof information being transmitted through the channel are available. 
What isthe average amount of information that can be resolved by the channel? Inorder to improve the reliability of the channel it is proposed that the same databeing transmitted through the channel also be transmitted through a cheaperand less reliable terrestrial/marine copper channel with cross-over probabilityof 0.3. What is the average amount of information that can now be transmittedthrough the combined channel system? Compare with your previous answer.What is the cost of this improvement?19. Consider the following errors-and-erasure channel:
Information Channels 1011−p−q1−p−q0 0qqp1 1?pBAUnder what conditions will the channel be weakly symmetric? What is theexpression for the channel capacity when the channel is weakly symmetric?*20. Consider the following errors-and-erasure channel:0 01 1?0.7BA0.60.20.30.10.1Express Á´ µ as a function of È ´ ¼µ. Hence derive the channelcapacity by ﬁnding the value of that maximises Á´ µ. You can do thisnumerically, graphically or analytically.*21. Consider the following channel:
102 Fundamentals of Information Theory and Coding Design0 021A1/21/2B1Let È´ ¼µ ½ and È´ ½µ ¾; thus È´ ¾µ ½ ½ ¾.Calculate the mutual information for the following cases:(a) ½ · ¾ ½(b) ½½¿ , ¾½¿Now derive an algebraic expression for the mutual information as a functionof ½and ¾ and graphically, numerically or otherwise try to ﬁnd the conditionfor channel capacity. Explain your ﬁndings!22. A proposed monochrome television picture standard consists of ¿ ¢½¼ pixelsper frame, each occupying one of 16 grayscale levels with equal probability.Calculate the minimum bandwidth required to support the transmission of 40frames per second when the signal-to-noise ratio is 25 dB. Given that for efﬁ-cient spectrum usage the bandwidth should not exceed 10 MHz what do youthink happened to this standard?23. You are asked to consider the design of a cable modem utilising a broadbandcommunications network. One of the requirements is the ability to supportfull duplex 10 Mbps data communications over a standard television channelbandwidth of 5.5 MHz. What is the minimum signal-to-noise ratio that isrequired to support this facility?24. Consider a BSC with channel matrix:È ËÔ ÕÕ Ôwith input ½ ¼ ¾ ½ and output ½ ¼ ¾ ½ . Deﬁne thefollowing single-letter distortion measure:´ µ¼½Assuming equiprobable inputs derive the expression for the average distortion,. Hence what is the expression for the rate-distortion function Ê´ µ as afunction of ?
25. For Qu. 24, what is $D_{min}$ and $R(D_{min})$?

26. For Qu. 24, what is $D_{max}$ and what is a value of $q$ that yields $D_{max}$?

*27. Consider the following errors-and-erasure channel:

[Figure: errors-and-erasure channel from input A (symbols 0, 1) to output B (symbols 0, ?, 1), with transition probabilities labelled $1-p-q$, $q$ and $p$.]

with input $A = \{a_1 = 0, a_2 = 1\}$ and output $B = \{b_1 = 0, b_2 = 1, b_3 = ?\}$. Define the following single-letter distortion measure:

$$d(a_i, b_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } j = 3 \\ 2 & \text{otherwise} \end{cases}$$

Assuming equiprobable inputs derive the expression for the average distortion, $\bar{d}$. State the constrained minimisation problem that you need to solve to derive the rate-distortion function, $R(D)$. Can you solve it?

28. For Qu. 27 what is $D_{max}$ and what are values of $p$ and $q$ that yield $D_{max}$?

2.12 References

[1] S. Arimoto, An algorithm for calculating the capacity of an arbitrary discrete memoryless channel, IEEE Trans. Inform. Theory, IT-18, 14-20, 1972.
[2] R. Blahut, Computation of channel capacity and rate distortion functions, IEEE Trans. Inform. Theory, IT-18, 460-473, 1972.
[3] J.-F. Cardoso, Blind signal separation: statistical principles, Proceedings of the IEEE, 86(10), 2009-2025, 1998.
[4] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.
[5] R.G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, New York, 1968.
[6] R.V.L. Hartley, Transmission of information, Bell System Tech. J., 7, 535-563, 1928.
[7] S. Haykin, Communication Systems, John Wiley & Sons, New York, 4th ed., 2001.
[8] J. Jeong, J.C. Gore, and B.S. Peterson, Mutual information analysis of the EEG in patients with Alzheimer's disease, Clinical Neurophysiology, 112(5), 827-835, 2001.
[9] Y. Normandin, R. Cardin, and R. De Mori, High-performance connected digit recognition using maximum mutual information, IEEE Trans. Speech and Audio Processing, 2(2), 299-311, 1994.
[10] C.E. Shannon, A mathematical theory of communication, Bell System Tech. J., 27, 379-423, 623-656, 1948.
[11] C.E. Shannon, Coding theorems for a discrete source with a fidelity criterion, IRE Nat. Conv. Record, Part 4, 142-163, 1959.
[12] P. Viola and W.M. Wells, Alignment by maximization of mutual information, Int. J. of Comput. Vision, 24(2), 137-154, 1997.
Chapter 3
Source Coding

3.1 Introduction

An important problem in digital communications and computer storage is the efficient transmission and storage of information. Furthermore transmission and storage of digital data require the information to be represented in a digital or binary form. Thus there is a need to perform source coding, the process of encoding the information source message to a binary code word, and lossless and faithful decoding from the binary code word to the original information source message. The goal of source coding for digital systems is two-fold:

1. Uniquely map any arbitrary source message to a binary code and back again
2. Efficiently map a source message to the most compact binary code and back again

This is shown in Figure 3.1 for a source transmitting arbitrary text through a binary (digital) communication channel. The source encoder efficiently maps the text to a binary representation and the source decoder performs the reverse operation. It should be noted that the channel is assumed noiseless. In the presence of noise, channel codes are also needed and these are introduced in Chapter 5.

[Figure: Source -> Source Encoder -> binary channel -> Source Decoder -> Receiver. The source text abccde is encoded as 010010 and decoded back to abccde.]
FIGURE 3.1
Noiseless communication system.

We first present the notation and terminology for the general source coding problem and establish the fundamental results and properties of codes for their practical use
and implementation. We then show how entropy is the benchmark used to determine the efficiency of a particular coding strategy and then proceed to detail the algorithms of several important binary coding strategies.

Consider a mapping from the source alphabet, $S$, of size $q$ to the code alphabet, $X$, of size $r$ and define the source coding process as follows:

DEFINITION 3.1 Source Coding
Let the information source be described by the source alphabet of size $q$, $S = \{s_i : i = 1, 2, \ldots, q\}$, and define $X = \{x_j : j = 1, 2, \ldots, r\}$ as the code alphabet of size $r$. A source message of length $n$ is an $n$-length string of symbols from the source alphabet, that is, $s^n = s_{i1} s_{i2} \ldots s_{in}$. A code word, $X(\cdot)$, is a finite-length string of, say, $l$ symbols from the code alphabet, that is, $X(\cdot) = x_{j1} x_{j2} x_{j3} \ldots x_{jl}$. Encoding is the mapping from a source symbol, $s_i$, to the code word, $X(s_i)$, and decoding is the reverse process of mapping a code word, $X(s_i)$, to the source symbol, $s_i$. A code table or simply source code completely describes the source code by listing the code word encodings of all the source symbols, $\{s_i \to X(s_i) : i = 1, 2, \ldots, q\}$.

DEFINITION 3.2 nth Extension of a Code
The $n$th extension of a code maps the source messages of length $n$, $s^n$, which are the symbols from $S^n$, the $n$th extension of $S$, to the corresponding sequence of code words, $X(s^n) = X(s_{i1}) X(s_{i2}) \ldots X(s_{in})$, from the code table for source $S$.

A source code is identified as:
- a non-singular code if all the code words are distinct.
- a block code of length $n$ if the code words are all of fixed length $n$.

All practical codes must be uniquely decodable.

DEFINITION 3.3 Unique Decodability
A code is said to be uniquely decodable if, and only if, the $n$th extension of the code is non-singular for every finite value of $n$. Informally, a code is said to be uniquely decodable if there are no instances of non-unique (i.e., ambiguous) source decodings for any and all possible code messages.

NOTE The following observations can be made:
1.
A block code of length $n$ which is non-singular is uniquely decodable
2. It is a non-trivial exercise to prove whether a code is uniquely decodable in general
3. A code is proven to be NOT uniquely decodable if there is at least one instance of a non-unique decoding

The code alphabet of most interest is the binary code, $X = \{0, 1\}$, of size $r = 2$ since this represents the basic unit of storage and transmission in computers and digital communication systems. The ternary code, $X = \{0, 1, 2\}$, of size $r = 3$ is also important in digital communication systems utilising a tri-level form of line coding. In general we can speak of $r$-ary codes to refer to code alphabets of arbitrary size $r$.

EXAMPLE 3.1
Consider the source alphabet, $S = \{s_1, s_2, s_3, s_4\}$, and binary code, $X = \{0, 1\}$. The following are three possible binary source codes.

Source   Code A   Code B   Code C
$s_1$    0        0        00
$s_2$    11       11       01
$s_3$    00       00       10
$s_4$    11       010      11

We note that:
- Code A is NOT non-singular since $X(s_2) = X(s_4) = 11$.
- Code B is non-singular but it is NOT uniquely decodable since the code sequence 00 can be decoded as either $s_3$ or $s_1 s_1$, that is $X(s_3) = X(s_1 s_1) = 00$ where $X(s_1 s_1) = X(s_1) X(s_1) = 00$ is from the second extension of the code.
- Code C is a non-singular block code of length 2 and is thus uniquely decodable.

3.2 Instantaneous Codes

Although usable codes have to be at least uniquely decodable, there is a subclass of uniquely decodable codes, called instantaneous codes, that exhibit extremely useful properties for the design and analysis of such codes and the practical implementation of the encoding and decoding processes.
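Since proving unique decodability by hand is non-trivial, it may help to see how it can be tested mechanically. The sketch below uses the Sardinas-Patterson test, a standard procedure that is not described in the text; the function names are hypothetical. It repeatedly forms "dangling suffixes" and declares the code not uniquely decodable as soon as a dangling suffix is itself a code word.

```python
def dangling_suffixes(a_set, b_set):
    """Non-empty strings w such that a + w == b for some a in a_set, b in b_set."""
    return {b[len(a):] for a in a_set for b in b_set
            if len(b) > len(a) and b.startswith(a)}

def is_non_singular(code_words):
    """A code is non-singular if all of its code words are distinct."""
    return len(set(code_words)) == len(code_words)

def is_uniquely_decodable(code_words):
    """Sardinas-Patterson test (assumes non-empty code words)."""
    if not is_non_singular(code_words):
        return False
    code = set(code_words)
    current = dangling_suffixes(code, code)
    seen = set()
    while current:
        if current & code:            # a dangling suffix is a code word: ambiguity
            return False
        frozen = frozenset(current)
        if frozen in seen:            # suffix sets repeat: no ambiguity can arise
            return True
        seen.add(frozen)
        current = (dangling_suffixes(code, current)
                   | dangling_suffixes(current, code))
    return True

# The three codes of Example 3.1
print(is_uniquely_decodable(["0", "11", "00", "11"]))    # Code A
print(is_uniquely_decodable(["0", "11", "00", "010"]))   # Code B
print(is_uniquely_decodable(["00", "01", "10", "11"]))   # Code C
```

Only Code C passes the test, in agreement with the observations of Example 3.1.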
DEFINITION 3.4 Instantaneous Codes
A uniquely decodable code is said to be instantaneous if it is possible to decode each message in an encoded string without reference to succeeding code symbols.

A code that is uniquely decodable but not instantaneous requires the past and current code symbols to be buffered at the receiver in order to uniquely decode the source message. An instantaneous code allows the receiver to decode the current code word to the correct source message immediately upon receiving the last code symbol of that code word (i.e., the decoding is performed "on the fly").

EXAMPLE 3.2
Consider the following three binary codes for the source $S = \{s_1, s_2, s_3, s_4\}$:

Source   Code A   Code B   Code C
$s_1$    0        0        0
$s_2$    10       01       01
$s_3$    110      011      011
$s_4$    1110     0111     111

Codes A, B and C are uniquely decodable (since no instance of a non-unique decoding can be found) but only code A is instantaneous. Why? Consider the following encoded string from code A:

    0 | 10 | 1110 | 110 | 0 | 10 | 0 | 0 | 10  =>  $s_1 s_2 s_4 s_3 s_1 s_2 s_1 s_1 s_2$

The separators mark the code symbols belonging to each of the encoded source symbols. It is apparent that the "0" code symbol acts as a code word terminator or separator. Hence the moment a "0" is received it represents the last code symbol of the current code word and the receiver can immediately decode the word to the correct source symbol; thus code A is instantaneous.

Now consider the following encoded string from code B:

    011 | 01 | 0111 | 0 | 0  =>  $s_3 s_2 s_4 s_1 s_1$

Here too the "0" acts as a code word separator but, unlike code A, the "0" now represents the first code symbol of the next code word. This means the current code word cannot be decoded until the first symbol of the next code word, the symbol "0," is read. For example assume we have received the string 011.
We cannot decode this to the symbol $s_3$ until we see the "0" of the next code word since if the next character we read is in fact a "1" then that means the current code word (0111) is $s_4$, not $s_3$. Although we can verify that the code is uniquely decodable (since the "0" acts as a separator) the code is not instantaneous.
Now consider the following encoded string from code C:

    011111111

Surprisingly we cannot yet decode this sequence! In fact until we receive a "0" or an EOF (end-of-file or code string termination) the sequence cannot be decoded since we do not know if the first code word is 0, 01 or 011. Furthermore once we are in a position to decode the sequence the decoding process is itself not at all as straightforward as was the case for code A and code B. Nevertheless this code is uniquely decodable since the "0" still acts as a separator.

From Example 3.2 one can see that instantaneous codes are so named because decoding is a very fast and easy process (i.e., instantaneous!). However the practicality of instantaneous codes also lies in the prefix condition which allows such codes to be efficiently analysed and designed.

DEFINITION 3.5 Prefix of a Code
Let $X(s_i) = x_{j1} x_{j2} \ldots x_{jn}$ be a code word of length $n$. A sub-string of code characters from $X(s_i)$, $x_{j1} x_{j2} \ldots x_{jm}$, where $m \le n$, is called a prefix of the code word $X(s_i)$.

DEFINITION 3.6 Prefix Condition
A necessary and sufficient condition for a code to be instantaneous is that no complete code word of the code be a prefix of some other code word.

Since the prefix condition is both necessary and sufficient, instantaneous codes are also referred to as prefix codes.

NOTE Since an instantaneous code is uniquely decodable the prefix condition is a sufficient condition for uniquely decodable codes. However it is not a necessary condition since a code can be uniquely decodable without being instantaneous.

EXAMPLE 3.3
Consider the codes from Example 3.2. We can now immediately state whether the codes are instantaneous or not:

Code A is instantaneous since no code word is a prefix of any other code word (code A obeys the prefix condition). Since it is instantaneous it is also uniquely decodable.

Code B is not instantaneous since $s_1 = 0$ is a prefix of $s_2 = 01$, $s_2 = 01$ is a prefix of $s_3 = 011$, etc.
However it is uniquely decodable since the "0" acts as a separator.
Code C is not instantaneous since $s_1 = 0$ is a prefix of $s_2 = 01$, etc. However it is uniquely decodable since the "0" can be used as a separator.

Figure 3.2 graphically depicts the universe of all codes and how the different classes of codes we have described are subsets of one another. That is, the class of all instantaneous codes is a subset of the class of all uniquely decodable codes which in turn is a sub-class of all non-singular codes.

[Figure: nested regions, from outermost to innermost: All Codes, Non-singular Codes, Uniquely Decodable Codes, Instantaneous (Prefix) Codes.]
FIGURE 3.2
Classes of codes.

3.2.1 Construction of Instantaneous Codes

The prefix condition not only makes it easy to determine whether a given code is instantaneous or not but it can also be used to systematically design an instantaneous code with specified lengths for the individual code words. The problem can be stated as follows. For a source with $q$ symbols it is required to design the $q$ individual code words with specified code word lengths of $l_1, l_2, \ldots, l_q$ such that the code is instantaneous (i.e., it satisfies the prefix condition). To design the code the code
word lengths are sorted in order of increasing length. The code words are derived in sequence such that at each step the current code word does not contain any of the other code words as a prefix. A systematic way to do this is to enumerate or count through the code alphabet.

EXAMPLE 3.4
Problem 1: Design an instantaneous binary code with lengths of 3, 2, 3, 2, 2.

Solution 1: The lengths are re-ordered in increasing order as 2, 2, 2, 3, 3. For the first three code words of length 2 a count from 00 to 10 is used:

    00
    01
    10

For the next two code words of length 3, we count to 11, form 110, and then start counting from the rightmost symbol to produce the complete code:

    00
    01
    10
    110
    111

Problem 2: Design an instantaneous ternary code with lengths 2, 3, 1, 1, 2.

Solution 2: The code word lengths are re-ordered as 1, 1, 2, 2, 3 and noting that a ternary code has symbols $\{0, 1, 2\}$ we systematically design the code as follows:

    0
    1
    20
    21
    220

Problem 3: Design an instantaneous binary code with lengths 2, 3, 2, 2, 2.

Solution 3: The code word lengths are re-ordered as 2, 2, 2, 2, 3 and we immediately see that an instantaneous code cannot be designed with these lengths since:

    00
    01
    10
    11

exhausts all binary strings of length 2, so any code word of length 3 would necessarily have one of these four code words as a prefix.
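The counting procedure of Example 3.4 can be sketched in a few lines of code. The following is one possible realisation, not necessarily the only one: the function names are hypothetical, code symbols are assumed to be the digits 0 to $r - 1$ (so $r \le 10$), and the code words are returned in order of increasing length. The function returns `None` when the count overflows, that is, when no instantaneous code with the requested lengths exists.

```python
def to_base(value, base, width):
    """Write value as a string of exactly `width` digits in the given base."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, base)
        digits.append(str(digit))
    return "".join(reversed(digits))

def build_instantaneous_code(lengths, r=2):
    """Count through the code alphabet to build a prefix code, or return None."""
    code_words = []
    value, prev_len = 0, None
    for length in sorted(lengths):
        if prev_len is not None:
            # step past the previous code word, then pad with zeros on the right
            value = (value + 1) * r ** (length - prev_len)
        if value >= r ** length:       # ran out of words of this length
            return None
        code_words.append(to_base(value, r, length))
        prev_len = length
    return code_words

print(build_instantaneous_code([3, 2, 3, 2, 2]))        # Problem 1
print(build_instantaneous_code([2, 3, 1, 1, 2], r=3))   # Problem 2
print(build_instantaneous_code([2, 3, 2, 2, 2]))        # Problem 3
```

Stepping past the previous code word before extending it guarantees that no earlier code word is a prefix of a later one, which is the property the manual counting relies on.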
3.2.2 Decoding Instantaneous Codes

Since an instantaneous code has the property that the source symbol or message can be immediately decoded upon reception of the last code character in the current code word, the decoding process can be fully described by a decoding tree or state machine which can be easily implemented in logic.

EXAMPLE 3.5
Figure 3.3 is the decoding tree that corresponds to the binary code:

Source   Code
$s_1$    00
$s_2$    01
$s_3$    10
$s_4$    110
$s_5$    111

[Figure: binary decoding tree rooted at Start, with branches labelled 0 and 1 and leaves marked "Read $s_1$" to "Read $s_5$".]
FIGURE 3.3
Decoding tree for an instantaneous code.
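A decoding tree of this kind maps naturally onto nested dictionaries. The sketch below is a minimal illustration under stated assumptions: the function names and the string labels for source symbols are hypothetical, and the code table is assumed to satisfy the prefix condition (the builder does not check this).

```python
def build_decoding_tree(code_table):
    """Internal nodes are dicts keyed by code character; leaves are source symbols."""
    root = {}
    for symbol, word in code_table.items():
        node = root
        for ch in word[:-1]:
            node = node.setdefault(ch, {})
        node[word[-1]] = symbol        # last code character leads to a leaf
    return root

def decode(bits, root):
    """Walk the tree one code character at a time, emitting a symbol at each leaf."""
    decoded, node = [], root
    for ch in bits:
        node = node[ch]
        if not isinstance(node, dict): # leaf reached: emit and return to the root
            decoded.append(node)
            node = root
    return decoded

# Code table of Example 3.5
code_table = {"s1": "00", "s2": "01", "s3": "10", "s4": "110", "s5": "111"}
tree = build_decoding_tree(code_table)
print(decode("0011010111", tree))
```

The decoder keeps only its current position in the tree, so no received code characters need to be buffered.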
The receiver simply jumps to the next node of the tree in response to the current code character and when a leaf node of the tree is reached the receiver produces the corresponding source symbol, $s_i$, and immediately returns to the root of the tree. Thus there is no need for the receiver to buffer any of the received code characters in order to uniquely decode the sequence.

3.2.3 Properties of Instantaneous Codes

Property 1 It is easy to prove whether a code is instantaneous by inspecting whether the code satisfies the prefix condition.

Property 2 The prefix condition permits a systematic design of instantaneous codes based on the specified code word lengths.

Property 3 Decoding based on a decoding tree is fast and requires no memory storage.

Property 4 Instantaneous codes are uniquely decodable codes and where the length of a code word is the main consideration in the design and selection of codes there is no advantage in ever considering the general class of uniquely decodable codes which are not instantaneous.

Property 4 arises because of McMillan's Theorem which will be discussed in the next section.

3.2.4 Sensitivity to Bit Errors

The instantaneous codes we have been discussing are variable-length codes since the code words can be of any length. Variable-length codes should be contrasted with block codes of length $n$ that restrict code words to be of the same length $n$. As we will see variable-length codes provide greater flexibility for sources to be encoded more efficiently. However a serious drawback to the use of variable-length codes is their sensitivity to bit or code symbol errors in the code sequence. If a single bit error causes the decoder to interpret a shorter or longer code word than is the case then this will create a synchronisation error between the first bit of the code word generated by the encoder and the root of the decoding tree.
Subsequent code words will be incorrectly decoded, including the possibility of insertion errors, until the synchronisation is re-established (which may take a very long time). With block codes a single bit error will only affect the current block code and the effect will not propagate to subsequent code words. In fact block codes only suffer this problem initially when the transmission is established and the decoder has to "lock" onto the first bit of the code and where there are unmanaged timing discrepancies or clock skew between the transmitter and receiver.
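The difference in behaviour can be seen with a small experiment. The sketch below is only an illustration: the two four-symbol code tables, the message and the helper names are assumptions chosen for the demonstration, and the decoder relies on both codes satisfying the prefix condition.

```python
def encode(message, code_table):
    return "".join(code_table[sym] for sym in message)

def decode_prefix(bits, code_table):
    """Decode a prefix code by matching the buffered bits against the code words."""
    inverse = {word: sym for sym, word in code_table.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    return decoded

def flip(bits, position):
    """Invert the bit at the given 1-based position."""
    i = position - 1
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]

# Assumed illustrative codes: one variable-length, one block code of length 2
variable = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}
block = {"s1": "00", "s2": "01", "s3": "10", "s4": "11"}
message = ["s1", "s3", "s2", "s4", "s2", "s1"]

for name, table in (("variable-length", variable), ("block", block)):
    corrupted = flip(encode(message, table), 2)   # single error in the 2nd bit
    print(name, decode_prefix(corrupted, table))
```

With the variable-length code the single bit error changes the number of decoded symbols, whereas with the block code it alters exactly one symbol and leaves the rest of the message aligned.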
EXAMPLE 3.6
Consider the following variable-length and block instantaneous binary codes for the source $S = \{s_1, s_2, s_3, s_4\}$.

Source   Variable-length code   Block code
$s_1$    0                      00
$s_2$    10                     01
$s_3$    110                    10
$s_4$    111                    11

Consider the following source sequence:

$$s_1^1 \; s_3^2 \; s_2^3 \; s_4^4 \; s_2^5 \; s_1^6$$

where $s_i^t$ indicates that symbol $s_i$ is being transmitted at time $t$. The corresponding variable-length code sequence is:

    011010111100

Assume the code sequence is now transmitted through an information channel which introduces a bit error in the 2nd bit. The source decoder sees:

    001010111100

and generates the decoded message sequence:

$$s_1^1 \; s_1^2 \; s_2^3 \; s_2^4 \; s_4^5 \; s_2^6 \; s_1^7$$

The single bit error in the encoded sequence produces both a substitution error (2nd symbol $s_3^2$ is substituted by $s_1^2$) and an insertion error ($s_2^3$ is inserted after $s_1^2$ and subsequent source symbols appear as $s_i^{t+1}$) in the decoded message sequence, and 7 rather than 6 source symbols are produced.

Now assume there is a bit error in the 6th bit. The source decoder sees:

    011011111100

and generates the decoded message sequence:

$$s_1^1 \; s_3^2 \; s_4^3 \; s_4^4 \; s_1^5 \; s_1^6$$

The single bit error now causes two isolated substitution errors (3rd symbol $s_2^3$ substituted by $s_4^3$ and 5th symbol $s_2^5$ substituted by $s_1^5$). Now consider the corresponding block code sequence:
$$s_1^1 \; s_3^2 \; s_2^3 \; s_4^4 \; s_2^5 \; s_1^6 \;\Rightarrow\; 001001110100$$

Taking the same two error positions, a bit error in the 2nd bit causes a single substitution error (1st symbol $s_1^1$ substituted by $s_2^1$) and a bit error in the 6th bit also causes a single substitution error (3rd symbol $s_2^3$ substituted by $s_1^3$). In fact any bit errors will only affect the current code word and will not have any effect on subsequent code words, and all errors will be substitution errors.

From Example 3.6 it is apparent that variable-length codes are very sensitive to bit errors, with errors propagating to subsequent code words and symbols being inserted as well as being substituted. If the main goal of the source coder is to map the source symbols to the code alphabet then a block coding scheme should be used. For example, the ASCII code is a 7-bit (or 8-bit in the case of extended ASCII) block code mapping letters of the English alphabet and keyboard characters to a binary block code of length 7 or 8. Furthermore, a channel coder (see Chapter 5) with an appropriate selection of an error-correcting or error-detecting code (see Chapters 6 to 9) is mandatory to ensure the final code sequence is close to error-free and must always be present when using variable-length codes.

The loss of synchronisation arising from code symbol errors with variable-length codes is a specific example of the more general problem of word synchronisation between the source and receiver which is discussed by Golomb et al.

3.3 The Kraft Inequality and McMillan's Theorem

3.3.1 The Kraft Inequality

If the individual code word lengths are specified there is no guarantee that an instantaneous code can be designed with those lengths (see Example 3.4). The Kraft Inequality theorem provides a limitation on the code word lengths for the design of instantaneous codes.
Although not of real use in practice (the coding strategies we will later discuss will guarantee codes that are instantaneous) the Kraft Inequality is a precursor to the more important McMillan's Theorem which states that where code word lengths are the only consideration, an instantaneous code will always be as good as any uniquely decodable code which is not necessarily instantaneous.
THEOREM 3.1 Kraft Inequality
A necessary and sufficient condition for the existence of an instantaneous code with alphabet size $r$ and $q$ code words with individual code word lengths of $l_1, l_2, \ldots, l_q$ is that the following inequality be satisfied:

$$K = \sum_{i=1}^{q} r^{-l_i} \le 1 \tag{3.1}$$

Conversely, given a set of code word lengths that satisfy this inequality, then there exists an instantaneous code with these word lengths.

The proof of the Kraft Inequality is interesting in that it is based on a formal description of how instantaneous codes are constructed (see Section 3.2.1) given the code word lengths, where the Kraft Inequality needs to be satisfied for the code to be successfully constructed.

PROOF Let $s_i$ be the $i$th source message or symbol and $X(s_i)$ the corresponding code word of length $l_i$. The proof requires the code word lengths to be arranged in ascending order, that is, we assume that the code word lengths are arranged such that $l_1 \le l_2 \le l_3 \le \cdots \le l_q$. The number of possible code words for $X(s_i)$ is $r^{l_i}$. To ensure the code is instantaneous we need to consider the number of permissible code words for $X(s_i)$ such that the prefix condition is satisfied.

Consider the shortest code word, $X(s_1)$. Then the number of permissible code words for $X(s_1)$ is simply $r^{l_1}$. Next consider code word $X(s_2)$. The number of possible $X(s_2)$ with $X(s_1)$ as a prefix is given by the expression $r^{l_2 - l_1}$ (since with $X(s_1)$ as the prefix the first $l_1$ symbols are fixed and one can choose the remaining $l_2 - l_1$ symbols arbitrarily). To ensure the prefix condition is satisfied the number of permissible code words for $X(s_2)$ is the number of possible code words, $r^{l_2}$, less those code words which have $X(s_1)$ as a prefix, $r^{l_2 - l_1}$. That is, the number of permissible code words for $X(s_2)$ is $r^{l_2} - r^{l_2 - l_1}$. Similarly for $X(s_3)$ the number of permissible code words is the number of possible code words, $r^{l_3}$, less those code words which have $X(s_2)$ as a prefix, $r^{l_3 - l_2}$, and less those code words which have $X(s_1)$ as a prefix, $r^{l_3 - l_1}$, that is $r^{l_3} - r^{l_3 - l_2} - r^{l_3 - l_1}$.
For code word $X(s_i)$, the expression for the number of permissible code words is:

$$r^{l_i} - r^{l_i - l_{i-1}} - r^{l_i - l_{i-2}} - \cdots - r^{l_i - l_1} \tag{3.2}$$

To be able to construct the code we want to ensure that there is at least one permissible code word for all source messages, $i = 1, 2, \ldots, q$. That is, we require the following inequalities to be simultaneously satisfied:

$$\begin{aligned}
r^{l_1} &\ge 1 \\
r^{l_2} - r^{l_2 - l_1} &\ge 1 \\
&\;\;\vdots \\
r^{l_q} - r^{l_q - l_{q-1}} - \cdots - r^{l_q - l_1} &\ge 1
\end{aligned} \tag{3.3}$$

By multiplying the $i$th inequality by $r^{-l_i}$ and rearranging we get:

$$\begin{aligned}
r^{-l_1} &\le 1 \\
r^{-l_2} + r^{-l_1} &\le 1 \\
&\;\;\vdots \\
r^{-l_q} + r^{-l_{q-1}} + \cdots + r^{-l_1} &\le 1
\end{aligned} \tag{3.4}$$

We note that if the last inequality holds then all the preceding inequalities will also hold. Thus the following inequality must hold to ensure we can design an instantaneous code:

$$r^{-l_q} + r^{-l_{q-1}} + \cdots + r^{-l_1} \le 1 \;\Rightarrow\; \sum_{i=1}^{q} r^{-l_i} \le 1 \tag{3.5}$$

which is the Kraft Inequality.

The Kraft Inequality of Equation 3.1 only indicates whether an instantaneous code can be designed from the given code word lengths. It does not provide any indication of what the actual code is, nor whether a code we have designed which satisfies Equation 3.1 is instantaneous, but it does tell us that an instantaneous code with the given code word lengths can be found. Only by checking the prefix condition of the given code can we determine whether the code is instantaneous. However, if a code we have designed does not satisfy Equation 3.1 then we know that the code is not instantaneous and that it will not satisfy the prefix condition. Furthermore, we will not be able to find a code with the given code word lengths that is instantaneous. A less apparent property of the Kraft Inequality is that the minimum code word lengths for the given alphabet size and number of code words that can be used for designing an instantaneous code are obtained by making $K$ as close as possible to 1. Intuitively, shorter overall code word lengths yield more efficient codes. This is examined in the next section.

EXAMPLE 3.7
Consider the following binary ($r = 2$) codes:

Source   Code A   Code B   Code C
$s_1$    0        0        0
$s_2$    100      100      10
$s_3$    110      110      110
$s_4$    111      11       11
Code A satisfies the prefix condition and is hence instantaneous. Calculating the Kraft Inequality yields:

$$K = \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-3} + 2^{-3} + 2^{-3} = \frac{7}{8} \le 1$$

which is as expected.

Code B does not satisfy the prefix condition since $s_4$ is a prefix of $s_3$; hence the code is not instantaneous. Calculating the Kraft Inequality yields:

$$K = \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-3} + 2^{-3} + 2^{-2} = 1 \le 1$$

which implies that an instantaneous code is possible with the given code word lengths. Thus a different code can be derived with the same code word lengths as code B which does satisfy the prefix condition. One example of an instantaneous code with the same code word lengths is:

0
110
111
10

Code C does not satisfy the prefix condition since $s_4$ is a prefix of $s_3$; hence the code is not instantaneous. Calculating the Kraft Inequality yields:

$$K = \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-2} = \frac{9}{8} > 1$$

which implies that an instantaneous code cannot be designed with these code word lengths, and hence we shouldn't even try.

3.3.2 McMillan's Theorem

As we will discuss in the next section, the use of shorter code word lengths creates more efficient codes. Since the class of uniquely decodable codes is larger than the class of instantaneous codes, one would expect greater efficiencies to be achieved considering the class of all uniquely decodable codes rather than the more restrictive class of instantaneous codes. However, instantaneous codes are preferred over uniquely decodable codes given that instantaneous codes are easier to analyse, systematic to design and can be decoded using a decoding tree (state machine) structure. McMillan's Theorem assures us that we do not lose out if we only consider the class of instantaneous codes.
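The two checks used in Example 3.7, the Kraft sum $K$ and the prefix condition, can be reproduced with a short sketch. The function names `kraft_sum` and `is_prefix_free` are illustrative choices, not names from the text, and exact fractions are used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction


def kraft_sum(lengths, r=2):
    """Return K = sum of r**(-l) over the code word lengths, as an exact fraction."""
    return sum(Fraction(1, r ** l) for l in lengths)


def is_prefix_free(code_words):
    """True when no code word is a prefix of a different code word."""
    return not any(
        a != b and b.startswith(a) for a in code_words for b in code_words
    )


# Codes A, B and C of Example 3.7 (binary, r = 2).
codes = {
    "A": ["0", "100", "110", "111"],
    "B": ["0", "100", "110", "11"],
    "C": ["0", "10", "110", "11"],
}

for name, words in codes.items():
    k = kraft_sum([len(w) for w in words])
    print(f"Code {name}: K = {k}, K <= 1: {k <= 1}, "
          f"prefix condition holds: {is_prefix_free(words)}")
```

Code B illustrates why both checks are needed: its lengths satisfy the Kraft Inequality, yet the code itself fails the prefix condition.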
THEOREM 3.2 McMillan's Theorem
The code word lengths of any uniquely decodable code must satisfy the Kraft Inequality:

$$K = \sum_{i=1}^{q} r^{-l_i} \le 1 \tag{3.6}$$

Conversely, given a set of code word lengths that satisfy this inequality, then there exists a uniquely decodable code (Definition 3.3) with these code word lengths.

The proof of McMillan's Theorem is presented as it is instructive to see the way it uses the formal definition of unique decodability to prove that the inequality must be satisfied. The proof presented here is adapted from the literature.

PROOF Assume a uniquely decodable code and consider the quantity:

$$\left[ \sum_{i=1}^{q} r^{-l_i} \right]^n = \left( r^{-l_1} + r^{-l_2} + \cdots + r^{-l_q} \right)^n \tag{3.7}$$

When written out, the quantity will consist of the $q^n$ terms arising from the $n$th extension of the code, each of the form:

$$r^{-l_{i_1} - l_{i_2} - \cdots - l_{i_n}} = r^{-k} \tag{3.8}$$

where we define $k = l_{i_1} + l_{i_2} + \cdots + l_{i_n}$.

Then $l_{i_j}$ is the length of the $j$th code word in a sequence from the $n$th extension of the code and $k$ is the length of the sequence of code words in the $n$th extension of the code. Let $l_m = \max\{l_i : i = 1, 2, \ldots, q\}$ be the maximum code word length over the $q$ code words. The minimum code word length is, of course, a length of 1. Then $k$ can assume any value from $n$ to $n l_m$. Let $N_k$ be the number of terms of the form $r^{-k}$. Then:

$$\left[ \sum_{i=1}^{q} r^{-l_i} \right]^n = \sum_{k=n}^{n l_m} N_k r^{-k} \tag{3.9}$$

Thus $N_k$ represents the number of code word sequences in the $n$th extension of the code with a length of $k$. If the code is uniquely decodable then the $n$th extension of the code must be non-singular. That is, $N_k$ must be no greater than $r^k$, the number of distinct sequences of length $k$. Thus for any value of $n$ we must have:

$$\left[ \sum_{i=1}^{q} r^{-l_i} \right]^n \le \sum_{k=n}^{n l_m} r^{k} r^{-k} = \sum_{k=n}^{n l_m} 1 = n l_m - n + 1 \le n l_m \tag{3.10}$$
or:

$$\sum_{i=1}^{q} r^{-l_i} \le (n l_m)^{1/n} \tag{3.11}$$

For $n \ge 1$ and $l_m \ge 1$ we have that $(n l_m)^{1/n} \ge 1$ and $\inf_n \{(n l_m)^{1/n}\} = \lim_{n \to \infty} \{(n l_m)^{1/n}\} = 1$. Since the above inequality has to hold for all values of $n$, this will be true only if:

$$\sum_{i=1}^{q} r^{-l_i} \le 1 \tag{3.12}$$

The implication of McMillan's Theorem is that for every non-instantaneous uniquely decodable code that we derive, an instantaneous code with the same code word lengths can always be found since both codes satisfy the same Kraft Inequality. Thus we can restrict ourselves to the class of instantaneous codes since we will not gain any efficiencies based on code word lengths by considering the larger class of uniquely decodable codes.

EXAMPLE 3.8
It is required to design a uniquely decodable ternary ($r = 3$) code with code word lengths 1, 1, 2, 2, 3, 3. Since a uniquely decodable code satisfies the Kraft Inequality by McMillan's Theorem we check whether this is the case. Calculating the Kraft Inequality:

$$K = \sum_{i=1}^{6} 3^{-l_i} = 2(3^{-1}) + 2(3^{-2}) + 2(3^{-3}) = \frac{26}{27} \approx 0.963 \le 1$$

shows that a uniquely decodable code with these lengths can be found.

However, the same Kraft Inequality is satisfied for instantaneous codes, and since instantaneous codes can be systematically constructed following the procedure described in Section 3.2.1 the following instantaneous code, which is uniquely decodable, is designed:

0
1
20
21
220
221
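One way to build such a code mechanically is sketched below. It is an assumption of this sketch, not a transcription of the Section 3.2.1 procedure: code words are assigned in order of increasing length by counting upwards in base $r$, which guarantees the prefix condition whenever the Kraft Inequality holds. The names `to_base` and `build_instantaneous_code` are hypothetical, and the digit formatting assumes $r \le 10$.

```python
from fractions import Fraction


def to_base(value, r, width):
    """Write a non-negative integer as exactly `width` base-r digits (r <= 10)."""
    digits = []
    for _ in range(width):
        value, d = divmod(value, r)
        digits.append(str(d))
    return "".join(reversed(digits))


def build_instantaneous_code(lengths, r=2):
    """Return one instantaneous r-ary code with the requested code word lengths."""
    if sum(Fraction(1, r ** l) for l in lengths) > 1:
        raise ValueError("lengths violate the Kraft Inequality")
    # Process the code words from shortest to longest, as in the Kraft proof.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        # Extend the running counter to the new length, then use it as the code word.
        value *= r ** (lengths[i] - prev_len)
        code[i] = to_base(value, r, lengths[i])
        value += 1  # the next code word must not have this one as a prefix
        prev_len = lengths[i]
    return code


print(build_instantaneous_code([1, 1, 2, 2, 3, 3], r=3))
print(build_instantaneous_code([1, 3, 3, 2], r=2))
```

For the lengths of Example 3.8 this sketch happens to reproduce the code listed above; other instantaneous codes with the same lengths are equally valid.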
3.4 Average Length and Compact Codes

3.4.1 Average Length

When considering a collection of possible $r$-ary codes for the same source a choice needs to be made between the different codes by comparing the performance based on a criterion of interest. For storage and communication purposes the main criterion of interest is the average length of a code, where codes with smaller average length are preferred. To calculate the average length the source symbol (or message) probabilities are needed.

DEFINITION 3.7 Average Length Define $\{P_i : i = 1, 2, \ldots, q\}$ as the individual source symbol probabilities for a source with $q$ possible symbols. Define $\{l_i : i = 1, 2, \ldots, q\}$ as the lengths of the corresponding code words for a given source coding. Then the average length of the code, $L$, is given by:

$$L = \sum_{i=1}^{q} P_i l_i \tag{3.13}$$

Consider all possible $r$-ary codes for the same source, $S$. The number of source symbols, $q$, and source probabilities, $\{P_i : i = 1, 2, \ldots, q\}$, are constant, but the code word lengths $\{l_i : i = 1, 2, \ldots, q\}$ vary with each code. The best code or compact code will be the one with the smallest average length.

DEFINITION 3.8 Compact Codes Consider a uniquely decodable code that maps the symbols for a source $S$ to code words from an $r$-ary code alphabet. The code will be a compact code if its average length is less than or equal to the average length of all other uniquely decodable codes for the same source and code alphabet.

EXAMPLE 3.9
Consider the following two binary codes for the same source. Which code is better?

Source   $P_i$   Code A   Code B
$s_1$    0.5     00       1
$s_2$    0.1     01       000
$s_3$    0.2     10       001
$s_4$    0.2     11       01

The average length of code A is obviously $L_A = 2$ bits per symbol. The average length of code B is $L_B = (0.5)1 + (0.1)3 + (0.2)3 + (0.2)2 = 1.8$ bits per symbol.
Code B is better than code A since $L_B < L_A$. But is there another code, call it code C, which has $L_C < L_B$? Or is code B the compact code for this source? And how small can the average length get?

The fundamental problem when coding information sources and the goal of source coding for data compression is to find compact codes. This requires the individual code word lengths to be made as small as possible, as long as we still end up with a uniquely decodable code (which, by McMillan's Theorem, can be taken to be instantaneous). Intuitively the average length will be reduced when the shorter code words are assigned to the most probable symbols and the longer code words to the least probable symbols. This concept is shown by Example 3.9 where code B assigns the shortest code word (of length 1) to the most probable symbol (with probability 0.5). Formally, the problem of searching for compact codes is a problem in constrained optimisation. We state the problem as follows.

REMARK 3.1 Given $\{P_i : i = 1, 2, \ldots, q\}$ for a source with $q$ symbols the compact $r$-ary code is given by the set of integer-valued code word lengths $\{l_i : i = 1, 2, \ldots, q\}$ that minimise:

$$L = \sum_{i=1}^{q} P_i l_i \tag{3.14}$$

such that the Kraft Inequality constraint is satisfied:

$$\sum_{i=1}^{q} r^{-l_i} \le 1 \tag{3.15}$$

3.4.2 Lower Bound on Average Length

From Chapter 1 the information content of a source is given by its entropy.
When using logarithms to base 2 the entropy is measured in units of "(information) bits per source symbol." Similarly when using binary source codes the average length is also measured in units of "(code) bits per source symbol." Since the entropy provides a measure of the intrinsic information content of a source it is perhaps not surprising that, for there to be no losses in the coding process, the average length must be at least the value of the entropy (no loss of information in the coding representation) and usually more to compensate for the inefficiencies arising from the coding process. The following theorem establishes this lower bound on the average length of any possible coding of the source based on the entropy of the source.
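Ahead of the formal statement, the claim can be checked numerically for the source and the two codes of Example 3.9. The sketch below is only an illustration; the helper names `entropy` and `average_length` are assumptions introduced here.

```python
import math


def entropy(probs, r=2):
    """Entropy of a zero-memory source in r-ary units per source symbol."""
    return sum(p * math.log(1 / p, r) for p in probs if p > 0)


def average_length(probs, code_words):
    """Average code word length L = sum of P_i * l_i."""
    return sum(p * len(w) for p, w in zip(probs, code_words))


probs = [0.5, 0.1, 0.2, 0.2]           # source of Example 3.9
code_a = ["00", "01", "10", "11"]
code_b = ["1", "000", "001", "01"]

h = entropy(probs)
for name, code in (("A", code_a), ("B", code_b)):
    avg = average_length(probs, code)
    print(f"code {name}: L = {avg:.3f}, H(S) = {h:.3f}, L >= H(S): {avg >= h}")
```

Both average lengths sit above the entropy of about 1.761 bits per symbol, with code B the closer of the two.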
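THEOREM 3.3
Every instantaneous $r$-ary code of the source, $S = \{s_1, s_2, \ldots, s_q\}$, will have an average length, $L$, which is at least the entropy, $H_r(S)$, of the source, that is:

$$L \ge H_r(S) \tag{3.16}$$

with equality when $P_i = r^{-l_i}$ for $i = 1, 2, \ldots, q$, where $P_i$ is the probability of source symbol $s_i$, $H_r(S) = \frac{H(S)}{\log_2 r}$ is the entropy of the source $S$ using logarithms to the base $r$ and $H(S)$ is the entropy of the source $S$ using logarithms to the base 2.

PROOF Consider the difference between the entropy $H_r(S)$ and the average length $L$ and simplify the expression as follows:

$$\begin{aligned}
H_r(S) - L &= \sum_{i=1}^{q} P_i \log_r \frac{1}{P_i} - \sum_{i=1}^{q} P_i l_i \\
&= \sum_{i=1}^{q} P_i \log_r \frac{1}{P_i} - \sum_{i=1}^{q} P_i \log_r r^{l_i} \\
&= \frac{1}{\ln r} \sum_{i=1}^{q} P_i \ln \frac{1}{P_i r^{l_i}}
\end{aligned} \tag{3.17}$$

Using the inequality $\ln x \le x - 1$ (with equality when $x = 1$) where $x = \frac{1}{P_i r^{l_i}}$ gives:

$$\begin{aligned}
H_r(S) - L &\le \frac{1}{\ln r} \sum_{i=1}^{q} P_i \left( \frac{1}{P_i r^{l_i}} - 1 \right) \\
&= \frac{1}{\ln r} \left( \sum_{i=1}^{q} \frac{1}{r^{l_i}} - \sum_{i=1}^{q} P_i \right) \\
&= \frac{1}{\ln r} \left( \sum_{i=1}^{q} r^{-l_i} - 1 \right)
\end{aligned} \tag{3.18}$$

Since the code is instantaneous the code word lengths will obey the Kraft Inequality so that $\sum_{i=1}^{q} r^{-l_i} \le 1$ and hence:

$$H_r(S) - L \le 0 \;\Rightarrow\; L \ge H_r(S)$$

with equality when $\frac{1}{P_i r^{l_i}} = 1$, or $P_i = r^{-l_i}$ for $i = 1, 2, \ldots, q$.

The definition of entropy in Chapter 1 now makes practical sense. The entropy is a measure of the intrinsic amount of average information (in $r$-ary units) and the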
average length of the code must be at least equal to the entropy of the source to ensure no loss in coding (lossless coding). Thus the smallest average length, $L$, for any code we design will be $L = H_r(S)$. However there is no guarantee that a code achieving this bound can be found for a particular source and code alphabet. Indeed, unless $P_i = r^{-l_i}$ for $i = 1, 2, \ldots, q$ we will always have that $L > H_r(S)$, but the closer $L$ is to $H_r(S)$ the better or more efficient the code.

DEFINITION 3.9 Code Efficiency The efficiency of the code is given by:

$$\eta = \frac{H_r(S)}{L} \times 100\% \tag{3.19}$$

where if $L = H_r(S)$ the code is 100% efficient.

EXAMPLE 3.10
The entropy for the source in Example 3.9 is:

$$H(S) = 0.5 \log_2 \frac{1}{0.5} + 0.1 \log_2 \frac{1}{0.1} + 2(0.2) \log_2 \frac{1}{0.2} = 1.761 \text{ bits per symbol}$$

Thus for this source we must have $L \ge 1.761$ no matter what instantaneous binary code we design. From Example 3.9 we have:

- $L_A = 2$ and code A has an efficiency of $\frac{1.761}{2} = 88\%$.
- $L_B = 1.8$ and code B has an efficiency of $\frac{1.761}{1.8} = 98\%$.

Since code B is already at 98% efficiency we can probably state with some confidence that code B is a compact code for source $S$. And even if it were not, a compact code would provide no more than a 2% improvement in coding efficiency.

Ideally compact codes should exhibit 100% efficiency. This requires that $L = H_r(S)$, and from the condition for equality, $P_i = r^{-l_i}$, it implies $l_i = \log_r \frac{1}{P_i}$ for $i = 1, 2, \ldots, q$. The problem lies in the fact that the $l_i$ have to be integer values. If this is the case then we have a special source.

DEFINITION 3.10 Special Source A source with symbol probabilities $\{P_i : i = 1, 2, \ldots, q\}$ such that $\left\{ \log_r \frac{1}{P_i} : i = 1, 2, \ldots, q \right\}$ are integers is a special source for $r$-ary codes since an instantaneous code with code word lengths $l_i = \log_r \frac{1}{P_i}$ for a code alphabet of size $r$ can be designed which is 100% efficient with $L = H_r(S)$.

EXAMPLE 3.11
Consider the following 4-symbol source:
Source A   $P_i$
$s_1$      0.125
$s_2$      0.25
$s_3$      0.5
$s_4$      0.125

We note that the symbol probabilities are of the form $P_i = \left(\frac{1}{2}\right)^{l_i}$ with $l_1 = 3$, $l_2 = 2$, $l_3 = 1$, and $l_4 = 3$. Thus a 100% efficient compact binary code can be designed with code word lengths of 3, 2, 1, 3 with $L = H(S) = 1.75$. For example:

Source A   Code A
$s_1$      110
$s_2$      10
$s_3$      0
$s_4$      111

Now consider the following 9-symbol source:

Source B   $P_i$
$s_1$      1/9
$s_2$      1/9
$s_3$      1/3
$s_4$      1/27
$s_5$      1/27
$s_6$      1/9
$s_7$      1/9
$s_8$      1/27
$s_9$      1/9

We note that the symbol probabilities are of the form $P_i = \left(\frac{1}{3}\right)^{l_i}$ with $l_i = \{2, 2, 1, 3, 3, 2, 2, 3, 2\}$ for $i = 1, 2, \ldots, 9$. Thus a 100% efficient ternary code can be designed with code word lengths of 2, 2, 1, 3, 3, 2, 2, 3, 2 with $L = H_3(S) = 16/9 \approx 1.78$.

3.5 Shannon's Noiseless Coding Theorem

3.5.1 Shannon's Theorem for Zero-Memory Sources

Equation 3.16 states that the average length of any instantaneous code is bounded below by the entropy of the source. In this section we show that the average length, $L$, of a compact code is also bounded above by the entropy plus 1 unit of information (1 bit in the case of a binary code). The lower bound was important for establishing how the efficiency of a source code is measured. Of practical importance is the existence of both a lower and upper bound which, as we will see, leads to the important
Shannon's Noiseless Coding Theorem. The theorem states that the coding efficiency (which will always be less than 100% for compact codes that are not special) can be improved by coding the extensions of the source (message blocks generated by the source) rather than just the source symbols themselves.

THEOREM 3.4
Let $L$ be the average length of a compact $r$-ary code for the source $S$. Then:

$$H_r(S) \le L < H_r(S) + 1 \tag{3.20}$$

The proof of the theorem as presented makes mention of a possible coding strategy that one may adopt in an attempt to systematically design compact codes. Although such a strategy, in fact, designs codes which are not compact, it establishes the theorem which can then be extended to the case of compact codes.

PROOF Let $S = \{s_1, s_2, \ldots, s_q\}$ and consider the sub-optimal coding scheme where the individual code word lengths are chosen according to:

$$\log_r \frac{1}{P_i} \le l_i < \log_r \frac{1}{P_i} + 1 \tag{3.21}$$

We can justify that this is a reasonable coding scheme by noting from Theorem 3.3 that a 100% efficient code with $H_r(S) = L$ is possible if we select $l_i = \log_r \frac{1}{P_i}$, and where $\log_r \frac{1}{P_i}$ does not equal an integer we "round up" to provide integer length assignments, yielding the coding scheme just proposed. These lengths satisfy the Kraft Inequality, since $l_i \ge \log_r \frac{1}{P_i}$ implies $r^{-l_i} \le P_i$ and hence $\sum_{i=1}^{q} r^{-l_i} \le \sum_{i=1}^{q} P_i = 1$, so an instantaneous code with these lengths exists. Multiplying Equation 3.21 by $P_i$ and summing over $i$ gives:

$$\sum_{i=1}^{q} P_i \log_r \frac{1}{P_i} \le \sum_{i=1}^{q} P_i l_i < \sum_{i=1}^{q} P_i \log_r \frac{1}{P_i} + \sum_{i=1}^{q} P_i \tag{3.22}$$

which yields, using the definitions for entropy and average length:

$$H_r(S) \le L_s < H_r(S) + 1 \tag{3.23}$$

where $L_s$ is the average length using this sub-optimal coding scheme. Consider a compact code for the same source $S$ with average length $L$. Then by definition $L \le L_s$ and from Theorem 3.3 we have $H_r(S) \le L$. Thus we must also have:

$$H_r(S) \le L < H_r(S) + 1 \tag{3.24}$$
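The length assignment of Equation 3.21 and the resulting bound of Equation 3.23 can be illustrated with a minimal sketch, here applied to the source of Example 3.9. The function names are hypothetical. As an implementation choice, the rounded-up logarithm is found by searching for the smallest integer $l$ with $r^{-l} \le P_i$ rather than by calling a floating-point logarithm, which could round an exact integer the wrong way.

```python
import math


def shannon_length(p, r=2):
    """Smallest integer l with r**(-l) <= p, i.e. log_r(1/p) rounded up."""
    l = 0
    while p * r ** l < 1:
        l += 1
    return l


def entropy(probs, r=2):
    """Entropy of a zero-memory source in r-ary units per source symbol."""
    return sum(p * math.log(1 / p, r) for p in probs if p > 0)


probs = [0.5, 0.1, 0.2, 0.2]                      # source of Example 3.9
lengths = [shannon_length(p) for p in probs]      # lengths chosen per Equation 3.21
avg = sum(p * l for p, l in zip(probs, lengths))  # L_s of Equation 3.23
kraft = sum(2 ** -l for l in lengths)             # must not exceed 1
h = entropy(probs)

print("lengths:", lengths)
print(f"H(S) = {h:.3f}, L_s = {avg:.3f}, Kraft sum = {kraft}")
print("H(S) <= L_s < H(S) + 1:", h <= avg < h + 1)
```

For this source the scheme gives an average length above the 1.8 bits per symbol of code B in Example 3.9, which is consistent with the scheme being sub-optimal.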
NOTE One coding scheme resulting in code word lengths given by Equation 3.21 is the Shannon code. Due to the sub-optimal nature of this coding scheme it will not be elaborated upon further.

Equation 3.20 indicates the average length of a compact code will be no more than 1 unit away from the average length of a 100% efficient code. However, we now show that by applying Equation 3.20 to extensions of the source we can improve the efficiency of the code.

Let $S^n = \{\sigma_1^n, \sigma_2^n, \ldots, \sigma_{q^n}^n\}$ be the $n$th extension of the source $S = \{s_1, s_2, \ldots, s_q\}$ where $\sigma_i^n = s_{i_1} s_{i_2} \cdots s_{i_n}$. Consider the compact code for $S^n$ with average length $L_n$. Then from Theorem 3.4 we have that:

$$H_r(S^n) \le L_n < H_r(S^n) + 1 \tag{3.25}$$

The $H_r(S^n)$ can be considered the joint entropy of $n$ independent sources since:

$$P(\sigma_i^n) = P(s_{i_1}) P(s_{i_2}) \cdots P(s_{i_n}) \tag{3.26}$$

and from Section 1.18 we have the result that $H_r(S^n) = \sum_{i=1}^{n} H_r(S) = n H_r(S)$ which, when substituted into Equation 3.25 and dividing by $n$, yields:

$$H_r(S) \le \frac{L_n}{n} < H_r(S) + \frac{1}{n} \tag{3.27}$$

NOTE The term $\frac{L_n}{n}$ is the average length of the code words per symbol $s_i$ when coding the $n$th extension $S^n$, which should not be confused with $L$, which is the average length of the code words per symbol $s_i$ when coding the source $S$. However $\frac{L_n}{n}$ can be used as the average length of a coding scheme for the source $S$ (based on coding the $n$th extension of the source).

From Equation 3.27 the average length of a compact code for $S$ based on coding the $n$th extension of $S$ is no more than $\frac{1}{n}$ units away from the average length of a 100% efficient code and this overhead decreases with larger extensions. Intuitively we expect that coding the $n$th extension of the source will provide more efficient codes, approaching 100% efficiency as $n \to \infty$. We formally state these results in the following theorem.
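Ahead of the formal statement, the bound of Equation 3.27 can be illustrated numerically. The sketch below is an illustration under stated assumptions: it codes the $n$th extension of a binary source with probabilities 0.8 and 0.2 using the sub-optimal lengths of Equation 3.21 rather than a compact code, so it demonstrates the bound and not the best achievable $L_n/n$. A compact code can only do as well or better. The helper names are hypothetical.

```python
import math
from itertools import product


def shannon_length(p, r=2):
    """Smallest integer l with r**(-l) <= p, i.e. log_r(1/p) rounded up."""
    l = 0
    while p * r ** l < 1:
        l += 1
    return l


def entropy(probs, r=2):
    """Entropy of a zero-memory source in r-ary units per source symbol."""
    return sum(p * math.log(1 / p, r) for p in probs if p > 0)


def per_symbol_length(probs, n, r=2):
    """Average length per source symbol when the nth extension is coded
    with the lengths of Equation 3.21 (zero-memory source assumed)."""
    total = 0.0
    for block in product(probs, repeat=n):
        p = math.prod(block)        # Equation 3.26: product of symbol probabilities
        total += p * shannon_length(p, r)
    return total / n


probs = [0.8, 0.2]
h = entropy(probs)
for n in range(1, 6):
    value = per_symbol_length(probs, n)
    print(f"n = {n}: L_n/n = {value:.4f}, "
          f"within [H, H + 1/n): {h <= value < h + 1 / n}")
```

The per-symbol length need not decrease monotonically with $n$ for this sub-optimal scheme; what the bound guarantees is that it stays within $1/n$ of the entropy.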
THEOREM 3.5 Shannon's Noiseless Coding Theorem
Let $L$ be the average length of a compact code for the source $S$, then:

$$H_r(S) \le L < H_r(S) + 1$$

Now let $L_n$ be the average length of a compact code for the $n$th extension of $S$, then:

$$H_r(S) \le \frac{L_n}{n} < H_r(S) + \frac{1}{n}$$

and thus:

$$\lim_{n \to \infty} \frac{L_n}{n} = H_r(S)$$

that is, in the limit the code becomes 100% efficient.

NOTE Since we can't get something for nothing there is a price to be paid in improving coding efficiency by taking extensions. The price is the increased cost of encoding and decoding a code with $q^n$ code words, which represents an exponential increase in complexity.

EXAMPLE 3.12
The benefit of taking extensions to improve efficiency is apparent when the source and code alphabets are the same. Consider the case of a binary coding scheme for a binary source. Without extensions we would have the trivial result:

Source   $P_i$   Compact Code
$s_1$    0.8     0
$s_2$    0.2     1

The efficiency of this compact code (which is no code at all!) is $\frac{H(S)}{L}$ where $L = 1$ bits per symbol and $H(S) = 0.722$, yielding a compact code which is only 72% efficient. Consider taking the second extension of the source where $\sigma_i^2 = s_j s_k$ and $P_i = P(\sigma_i^2) = P(s_j) P(s_k)$ for $j = 1, 2$ and $k = 1, 2$:

Source      $P_i$   Compact Code
$s_1 s_1$   0.64    0
$s_1 s_2$   0.16    10
$s_2 s_1$   0.16    110
$s_2 s_2$   0.04    111

The compact code for the second extension is as shown in the above table. The average length of this code is $L_2 = (0.64)1 + (0.16)2 + (0.16)3 + (0.04)3 = 1.56$
bits per symbol pair and $\frac{L_2}{2} = 0.78$ bits per symbol, yielding a compact code for the second extension which is $\frac{H(S)}{L_2 / 2} \approx 92.6\%$ efficient, a significant improvement over 72% efficient.

3.5.2 Shannon's Theorem for Markov Sources

The statement of Shannon's Noiseless Coding Theorem was proven for zero-memory sources. We now show that a similar statement applies to the general case of Markov sources. Consider an $m$th order Markov source, $S$, and the corresponding entropy, $H_r(S)$, based on the $m$th order Markov model of the source. Define $H_r(\bar{S})$ as the entropy of the equivalent zero-memory model of the source (where $\bar{S}$ is the adjoint source). Since Equation 3.20 applies to zero-memory sources it also applies to the adjoint source, hence:

$$H_r(\bar{S}) \le L < H_r(\bar{S}) + 1 \tag{3.28}$$

and also:

$$H_r(\overline{S^n}) \le L_n < H_r(\overline{S^n}) + 1 \tag{3.29}$$

where $\overline{S^n}$ is the adjoint of the $n$th extension of the source, $S^n$ (not to be confused with $\bar{S}^n$ which is the $n$th extension of the adjoint source, $\bar{S}$).

We know from our results from Chapter 1 that $H_r(\overline{S^n}) \ge H_r(S^n)$, or equivalently that $H_r(\overline{S^n}) = H_r(S^n) + \epsilon$ (for $\epsilon \ge 0$). We also have that $H_r(S^n) = n H_r(S)$. Substituting these into Equation 3.29 gives:

$$n H_r(S) + \epsilon \le L_n < n H_r(S) + \epsilon + 1 \tag{3.30}$$

and

$$H_r(S) + \frac{\epsilon}{n} \le \frac{L_n}{n} < H_r(S) + \frac{\epsilon}{n} + \frac{1}{n} \tag{3.31}$$

and hence we can see that

$$\lim_{n \to \infty} \frac{L_n}{n} = H_r(S) \tag{3.32}$$

That is, the lower bound on $\frac{L_n}{n}$ is the entropy of the $m$th order Markov model $H_r(S)$.

Consider a sequence of symbols from an unknown source that we are coding. The entropy of the source depends on the model we use for the source. From Chapter 1 we had that, in general, $H_r(S_m) \le H_r(S_{m-1})$ where $S_m$ is used to indicate a Markov source of order $m$. Thus higher order models generally yield lower values for the entropy. And the lower the entropy is, then from Equation 3.32, the smaller the
average length we expect as we take extensions to the source. This realisation does not affect how we design the compact codes for extensions to the source, but affects how we calculate the corresponding $n$th extension symbol probabilities (which are based on the underlying model of the source) and the resulting code efficiency.

RESULT 3.1
When coding the $n$th extension of a source, the probabilities should be derived based on the "best" or "true" model of the source, and the efficiency of the code should be based on the same model of the source.

EXAMPLE 3.13
Consider the following typical sequence from an unknown binary source:

0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1

If we assume a zero-memory model of the source then we can see that $P(0) = 0.5$, $P(1) = 0.5$ and thus $H(S) = 1$. This implies that the original binary source is 100% efficient.

However the "true" model is the first order Markov model from Figure 3.4.

[Figure 3.4: state diagram with the two states 0 and 1 and transition probabilities 0.9, 0.9, 0.1 and 0.1]
FIGURE 3.4 True model of source.

The "true" entropy is calculated to be $H(S) = 0.469$ and the original binary source is, in fact, only 47% efficient! To improve the efficiency we find a binary code for the second extension. Noting that $P(s_i s_j) = P(s_j | s_i) P(s_i)$ this gives the second extension symbol probabilities:
Source      $P_i$   Compact Code
$s_1 s_1$   0.05    110
$s_1 s_2$   0.45    0
$s_2 s_1$   0.45    10
$s_2 s_2$   0.05    111

The compact code is shown in the above table. The average length of this code is $L_2 = 1.65$ bits per symbol pair or $\frac{L_2}{2} = 0.825$ bits per symbol, which improves the coding efficiency to $\frac{H(S)}{L_2 / 2} \approx 57\%$.

3.5.3 Code Efficiency and Channel Capacity

Consider the communication system shown in Figure 3.5 for the case of a noiseless channel.

[Figure 3.5: block diagram of a $q$-ary source (source entropy $H(S)$) feeding a source encoder, whose $r$-ary code (code entropy $H(X)$) is the input to a noiseless channel, followed by a decoder and the receiver]
FIGURE 3.5 Noiseless communication system.

The source entropy $H(S)$ is interpreted as the average number of bits transmitted per source symbol. Each source symbol is encoded to a code word of $L$ code symbols on average. Thus the ratio $\frac{H(S)}{L}$ represents the average number of bits transmitted per code symbol. By definition this gives the entropy of the encoded sequence or code entropy, $H(X)$. That is:

$$H(X) = \frac{H(S)}{L} \tag{3.33}$$
Since $H_r(S) = \frac{H(S)}{\log_2 r}$ and $H_r(S) \le L$ then we have:

$$H(X) \le \log_2 r \tag{3.34}$$

Thus if the code is 100% efficient then $H(X) = \log_2 r$ which represents the case of maximum entropy for a source with $r$ symbols. Consider the $r$-input noiseless channel. From Chapter 2 we established that the mutual information of a noiseless channel is $I(X; Y) = H(X)$, where $X$ and $Y$ are the channel input and output, respectively, and hence for a noiseless channel $\max I(X; Y) = \max H(X) = \log_2 r$.

RESULT 3.2
We make the following observations:
(a) A 100% efficient source coding implies maximum code entropy and hence equiprobable code symbols.
(b) A 100% efficient source coding implies maximum or "best" use of the channel.

This result has important implications when we consider the inclusion of a channel code with the source code. As we will see in Chapter 5 one of the key assumptions is that the (binary) inputs to the channel coder are equally likely, and with 100% efficient source coding this will be the case.

EXAMPLE 3.14
Consider the binary source sequence:

0 0 0 1 1 0 0 1 0 0

Assuming a zero-memory model of the source gives $P(0) = 0.7$, $P(1) = 0.3$ and $H(S) = 0.88$. We find the compact binary code of the second extension of the source is as shown in the following table:

$s_i s_j$   $P(s_i s_j)$   Compact Code
00          0.49           0
01          0.21           10
10          0.21           110
11          0.09           111

The average length of the code is $\frac{L_2}{2} = 0.905$ and $\frac{H(S)}{L_2 / 2} = 0.972$. The code entropy is $H(X) = 0.972$. Consider the corresponding code sequence given the above compact code and initial source sequence:

0 10 110 10 0
We assume a zero-memory model and this gives $P(0) = \frac{5}{9}$, $P(1) = \frac{4}{9}$ and $H(X) = 0.99$. Although contrived, this still shows that the code sequence exhibits a higher entropy than the source.

3.6 Fano Coding

We now describe a coding scheme for the design of instantaneous codes called the Fano coding scheme, or Shannon-Fano code in reference to the fact that the scheme was independently published by Shannon and Fano, that can yield close to optimal codes in most cases. Although mainly of historical interest because of its lack of optimality, Fano coding can be considered the precursor to optimal Huffman coding discussed in the next section as it is based on the same principle but implemented less rigidly.

Fano coding is predicated on the idea that equiprobable symbols should lead to code words of equal length. We describe Fano's method for the design of $r$-ary instantaneous codes. Consider the source $S$ with $q$ symbols: $\{s_i : i = 1, 2, \ldots, q\}$ and associated symbol probabilities $\{P(s_i) : i = 1, 2, \ldots, q\}$. Let the symbols be numbered so that $P(s_1) \ge P(s_2) \ge \cdots \ge P(s_q)$. In Fano's method the $q$ symbols are divided or split into $r$ equiprobable or close to equiprobable groups of symbols. Each symbol in the $i$th group is then assigned the code symbol $x_i$ and this is done for all $i = 1, 2, \ldots, r$ groups. Each of the $r$ groups is then further split into $r$ sub-groups (or into fewer than $r$ sub-groups of one symbol if there are fewer than $r$ symbols in the group) with each symbol in the $j$th sub-group having the code symbol $x_j$ appended to it for all $j = 1, 2, \ldots, r$ sub-groups.
The process is repeated until the groups can be split no further.

For the important case of binary codes, Fano coding splits each group into two equiprobable sub-groups, appending a 0 to one group and a 1 to the other group. Groups are successively split until no more groups can be split (i.e., each group has only one symbol in it).

EXAMPLE 3.15
Consider the design of a binary code using the Fano coding scheme for an 8-symbol source with the following ordered probability assignments:

$$P(s_1) = 1/2$$
$$P(s_2) = P(s_3) = 1/8$$
$$P(s_4) = P(s_5) = P(s_6) = 1/16$$
$$P(s_7) = P(s_8) = 1/32$$
The regrouping and code symbol assignment in each step of the Fano coding scheme are shown in Figure 3.6. The dashed lines represent the splitting of each set of symbols into two equiprobable groups. For each division a 0 is appended to each symbol in one group and a 1 is appended to each symbol in the other group. The number in front of each dashed line indicates the level of grouping. Thus the level 1 grouping shown in Figure 3.6(a) splits the original source into two groups, with one group of probability 1/2 containing $\{s_1\}$ being assigned a 0 and the other group of probability 1/2 containing $\{s_2, s_3, s_4, s_5, s_6, s_7, s_8\}$ being assigned a 1. Each of these groups is split into the level 2 groups as shown in Figure 3.6(b). This continues for each level of grouping until the level 5 grouping of Figure 3.6(e) is reached, when there are no more groups left to split. Figure 3.6(e) is the binary Fano code for this source.

[Figure 3.6, panels (a) to (e): the source probabilities 1/2, 1/8, 1/8, 1/16, 1/16, 1/16, 1/32, 1/32 with the code symbols assigned after each level of grouping; the final panel (e) gives code words of lengths 1, 3, 3, 4, 4, 4, 5, 5]
FIGURE 3.6 Binary Fano coding.

The average length of the Fano code is $L = 2.3125$ bits and since $H(S) = 2.3125$ the Fano code provides the compact code (with efficiency of 100%). This result can also be verified by noting that the source probabilities are such that this is a special source for binary codes.
In Example 3.15 the symbols are successively divided into equiprobable groups of 1/2, 1/4, 1/8, 1/16 and 1/32. In the majority of cases equal groupings are not possible, so closest-to-equal groupings are used, and since there may be many such "quasi" equiprobable groupings, different Fano codes will result, and not all may yield compact codes.

EXAMPLE 3.16
Consider the design of a binary code using the Fano coding scheme for the 6-symbol source with the following probability assignment:

$$P(s_1) = 4/9$$
$$P(s_2) = P(s_3) = P(s_4) = P(s_5) = P(s_6) = 1/9$$

There are two possible Fano codes depending upon how we perform the level 1 grouping. One grouping creates a group of probability 4/9 containing $\{s_1\}$ and the other group of probability 5/9 containing $\{s_2, s_3, s_4, s_5, s_6\}$ as shown by Figure 3.7(a). The alternative grouping creates a group of probability 5/9 containing $\{s_1, s_2\}$ and the other group of probability 4/9 containing $\{s_3, s_4, s_5, s_6\}$ as shown by Figure 3.7(b). Both groupings are equally valid.

[Figure 3.7, panels (a) and (b): the two Fano codes. Grouping $\{s_1\}$ against $\{s_2, \ldots, s_6\}$ leads to the code words 0, 1000, 1001, 101, 110, 111; grouping $\{s_1, s_2\}$ against $\{s_3, \ldots, s_6\}$ leads to the code words 00, 01, 100, 101, 110, 111]
FIGURE 3.7 Different binary Fano codes.

The average lengths of the two Fano codes shown in Figure 3.7 are $L_{(a)} = 21/9$ and $L_{(b)} = 22/9$, with the code of average length 21/9, obtained by placing $s_1$ in a group of its own, being the compact code. Thus a sub-optimal code may result depending on how the symbols are grouped.

Example 3.16 demonstrates that Fano codes are not necessarily compact and different Fano codes may yield different average lengths. Thus Fano codes are of limited use since such codes cannot be guaranteed to be compact.
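The binary case of Fano's method, and its sensitivity to how ties between equally good groupings are resolved, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fano_code` and the `prefer_later_split` flag are hypothetical, groups are always contiguous runs of the probability-ordered symbols, and each split is placed where the two group probabilities are closest to equal.

```python
from fractions import Fraction


def fano_code(probs, prefer_later_split=False):
    """Binary Fano code for probabilities given in decreasing order.

    Ties between equally balanced splits go to the earliest split point,
    or to the latest one when prefer_later_split is True.
    """
    codes = [""] * len(probs)

    def split(lo, hi):
        # Symbols lo .. hi-1 form the current group.
        if hi - lo < 2:
            return
        total = sum(probs[lo:hi])
        best_cut, best_diff = None, None
        running = 0
        for cut in range(lo + 1, hi):
            running += probs[cut - 1]
            diff = abs(total - 2 * running)   # |left group - right group|
            better = best_diff is None or diff < best_diff
            tie = best_diff is not None and diff == best_diff
            if better or (tie and prefer_later_split):
                best_cut, best_diff = cut, diff
        for i in range(lo, best_cut):
            codes[i] += "0"
        for i in range(best_cut, hi):
            codes[i] += "1"
        split(lo, best_cut)
        split(best_cut, hi)

    split(0, len(probs))
    return codes


def average_length(probs, codes):
    return sum(p * len(c) for p, c in zip(probs, codes))


# Example 3.15: a special source, so the splits are exactly equiprobable.
ex15 = [Fraction(1, 2)] + [Fraction(1, 8)] * 2 + [Fraction(1, 16)] * 3 + [Fraction(1, 32)] * 2
print(fano_code(ex15), average_length(ex15, fano_code(ex15)))

# Example 3.16: the level 1 tie can be resolved in two ways.
ex16 = [Fraction(4, 9)] + [Fraction(1, 9)] * 5
for later in (False, True):
    codes = fano_code(ex16, prefer_later_split=later)
    print(codes, average_length(ex16, codes))
```

The exact code words depend on the tie-breaking rule and may differ from those drawn in the figures, but the average lengths of 21/9 and 22/9 for Example 3.16 are reproduced by the two settings of the flag.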
3.7 Huffman Coding

3.7.1 Huffman Codes

We now describe an important class of instantaneous codes known as Huffman codes which are attributed to the pioneering work of Huffman. Huffman codes arise when using the Huffman algorithm for the design of instantaneous codes given the source symbols, $s_i$, and corresponding source probabilities, $P(s_i)$. The Huffman algorithm attempts to assign each symbol a code word of length proportional to the amount of information conveyed by that symbol. Huffman codes are important because of the following result.

RESULT 3.3
Huffman codes are compact codes. That is, the Huffman algorithm produces a code with an average length, $L$, which is the smallest possible to achieve for the given number of source symbols, code alphabet and source statistics.

We now proceed to outline the basic principles of the Huffman algorithm and then detail the steps and provide examples for specific cases.

The Huffman algorithm operates by first successively reducing a source with $q$ symbols to a source with $r$ symbols, where $r$ is the size of the code alphabet.

DEFINITION 3.11 Reduced Source Consider the source $S$ with $q$ symbols: $\{s_i : i = 1, 2, \ldots, q\}$ and associated symbol probabilities $\{P(s_i) : i = 1, 2, \ldots, q\}$. Let the symbols be renumbered so that $P(s_1) \ge P(s_2) \ge \cdots \ge P(s_q)$. By combining the last $r$ symbols of $S$, $\{s_{q-r+1}, s_{q-r+2}, \ldots, s_q\}$, into one symbol, $s'_{q-r+1}$, with probability $P(s'_{q-r+1}) = \sum_{i=1}^{r} P(s_{q-r+i})$, we obtain a new source, termed a reduced source of $S$, containing only $q - r + 1$ symbols, $\{s_1, s_2, \ldots, s_{q-r}, s'_{q-r+1}\}$. Call this reduced source $S_1$. Successive reduced sources $S_2, S_3, \ldots$ can be formed by a similar process of renumbering and combining until we are left with a source with only $r$ symbols.

It should be noted that we will only be able to reduce a source to exactly $r$ symbols if the original source has $q = r + \alpha(r - 1)$ symbols where $\alpha$ is a non-negative integer. For a binary ($r = 2$) code this will hold for any value of $q \ge 2$.
For non-binary codesif « Õ Ö´Ö ½µ is not an integer value then “dummy” symbols with zero probabilityare appended to create a source with Õ Ö · « ´Ö ½µ symbols, where « is thesmallest integer greater than or equal to «.The trivial Ö-ary compact code for the reduced source with Ö symbols is then usedto design the compact code for the preceding reduced source as described by thefollowing result.
RESULT 3.4
Assume that we have a compact code for the reduced source $S_j$. Designate the last $r$ symbols from $S_{j-1}$, $\{s_{q-r+1}, s_{q-r+2}, \ldots, s_q\}$, as the symbols which were combined to form the combined symbol, $\bar{s}_{q-r+1}$, of $S_j$. We assign to each symbol of $S_{j-1}$, except the last $r$ symbols, the code word used by the corresponding symbol of $S_j$. The code words for the last $r$ symbols of $S_{j-1}$ are formed by appending $\{0, 1, \ldots, r-1\}$ to the code word of $\bar{s}_{q-r+1}$ of $S_j$ to form $r$ new code words.

The Huffman algorithm then operates by back-tracking through the sequence of reduced sources, $S_j, S_{j-1}, \ldots$, designing the compact code for each source, until the compact code for the original source $S$ is designed.

3.7.2 Binary Huffman Coding Algorithm

For the design of binary Huffman codes the Huffman coding algorithm is as follows:

1. Re-order the source symbols in decreasing order of symbol probability.
2. Successively reduce the source $S$ to $S_1$, then $S_2$ and so on, by combining the last two symbols of $S_j$ into a combined symbol and re-ordering the new set of symbol probabilities for $S_{j+1}$ in decreasing order. For each source keep track of the position of the combined symbol, $\bar{s}_{q-1}$. Terminate the source reduction when a two symbol source is produced. For a source with $q$ symbols the reduced source with two symbols will be $S_{q-2}$.
3. Assign a compact code for the final reduced source. For a two symbol source the trivial code is $\{0, 1\}$.
4. Backtrack to the original source $S$, assigning a compact code for the $j$th reduced source by the method described in Result 3.4. The compact code assigned to $S$ is the binary Huffman code.

The operation of the binary Huffman coding algorithm is best shown by the following example.

EXAMPLE 3.17
Consider a 5 symbol source with the following probability assignments:

$P(s_1) = 0.2 \quad P(s_2) = 0.4 \quad P(s_3) = 0.1 \quad P(s_4) = 0.1 \quad P(s_5) = 0.2$

Re-ordering in decreasing order of symbol probability produces $\{s_2, s_1, s_5, s_3, s_4\}$. The re-ordered source $S$ is then reduced to the source $S_3$ with only two symbols as shown in Figure 3.8, where the arrow-heads point to the combined symbol created in $S_j$ by the combination of the last two symbols from $S_{j-1}$. Starting with the trivial compact code of $\{0, 1\}$ for $S_3$ and working back to $S$, a compact code is designed for each reduced source $S_j$ following the procedure described by Result 3.4 and shown in Figure 3.8. In each $S_j$ the code word for the last two symbols is produced by taking the code word of the symbol pointed to by the arrow-head and appending a 0 and 1 to form two new code words. The Huffman coding algorithm of Figure 3.8 can be depicted graphically as the binary Huffman coding tree of Figure 3.9. The Huffman coding tree is of similar form to the decoding tree for instantaneous codes shown in Figure 3.3 and can thus be used for decoding Huffman codes. The Huffman code itself is the bit sequence generated by the path from the root to the corresponding leaf node.

[FIGURE 3.8 Binary Huffman coding table.]

The binary Huffman code is:

Symbol   P(s_i)   Huffman code
$s_1$    0.2      01
$s_2$    0.4      1
$s_3$    0.1      0010
$s_4$    0.1      0011
$s_5$    0.2      000

The average length of the code is $L = 2.2$ bits/symbol and the efficiency of the Huffman code is $\eta = H(S)/L = 2.122/2.2 \approx 96.5\%$. Since the Huffman code is a compact code, any other instantaneous code we may design will have an average length, $L_O$, such that $L_O \ge L$.

From Figure 3.8 in Example 3.17 different reduced sources $S_1$ and $S_2$ are possible by inserting the combined symbol $\bar{s}_{q-1}$ in a different position when re-ordering.
This may or may not result in a compact code with different individual code word lengths. Either way the compact code will have the same average length. However the average length variance:

$\sigma^2 = \sum_{i=1}^{q} P_i (l_i - L)^2 \qquad (3.35)$

may be different.

EXAMPLE 3.18
The average length variance for the Huffman code of Example 3.17 is $\sigma^2 = 1.36$. Consider a different Huffman code tree for the same source shown by Figures 3.10 and 3.11.

[FIGURE 3.10 Different binary Huffman code table.]

The binary Huffman code is now:

Symbol   P(s_i)   Huffman code
$s_1$    0.2      10
$s_2$    0.4      00
$s_3$    0.1      010
$s_4$    0.1      011
$s_5$    0.2      11

Although this different Huffman code for the same source possesses different individual code word lengths, the average length is still $L = 2.2$ bits/symbol; however, the average length variance is now $\sigma^2 = 0.16$.
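The effect of where the combined symbol is placed can be reproduced with a short sketch. The code below is an illustrative priority-queue formulation, not the tabular procedure of Figure 3.8: the function names and the `combined_high` flag are assumptions introduced here, and ties between equal probabilities are broken so that a combined symbol is treated either as the highest or as the lowest entry among its equals. Exact fractions are used so that equal probabilities compare as equal.

```python
import heapq
from fractions import Fraction


def huffman_lengths(probs, combined_high=True):
    """Binary Huffman code word lengths for a list of symbol probabilities.

    combined_high=True places a combined symbol above equal-probability
    entries (it is merged as late as possible); False places it below them.
    """
    rank = 1 if combined_high else -1
    # Heap entries: (probability, rank, unique tie-break, member symbol indices).
    heap = [(p, 0, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    step = 0
    while len(heap) > 1:
        p1, _, _, m1 = heapq.heappop(heap)  # the two least probable entries
        p2, _, _, m2 = heapq.heappop(heap)
        for i in m1 + m2:
            lengths[i] += 1  # every merge adds one bit to each member
        step += 1
        heapq.heappush(heap, (p1 + p2, rank, rank * step, m1 + m2))
    return lengths


def average_length(probs, lengths):
    return sum(p * l for p, l in zip(probs, lengths))


def length_variance(probs, lengths):
    avg = average_length(probs, lengths)
    return sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))


# Source of Examples 3.17 and 3.18: P(s1..s5) = 0.2, 0.4, 0.1, 0.1, 0.2
probs = [Fraction(n, 10) for n in (2, 4, 1, 1, 2)]
for high in (False, True):
    lengths = huffman_lengths(probs, combined_high=high)
    print(high, lengths,
          float(average_length(probs, lengths)),
          float(length_variance(probs, lengths)))
```

With the combined symbol placed low the sketch produces lengths 1, 2, 3, 4, 4 (variance 1.36), and with it placed high it produces lengths 2, 2, 2, 3, 3 (variance 0.16), both with an average length of 2.2. Which of two equally probable symbols receives which length depends on the tie-break, so individual code words may differ from those in the examples.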
The Huffman code from Example 3.18 has a smaller average length variance than the code produced in Example 3.17. Codes with smaller average length variance are preferable since they produce a more constant code bit rate. The following result can be used to ensure the Huffman coding tree produces a compact code with the smallest average length variance.

RESULT 3.5
If the combined symbol, $\bar{s}_{q-1}$, is placed at the highest available position for that probability assignment when the reduced source is re-ordered then the resulting compact code will possess the smallest average length variance.

3.7.3 Software Implementation of Binary Huffman Coding

The binary Huffman coding algorithm can be implemented in software as a sequence of merge and sort operations on a binary tree. The Huffman coding algorithm is a greedy algorithm that builds the Huffman decoding tree (e.g., Figures 3.9 and 3.11) by initially assigning each symbol as the root node of a single-node tree. The decoding tree is then constructed by successively merging the last two nodes, labeling the edges of the left-right child pairs of the merge operation with a 0 and 1, respectively, and sorting the remaining nodes until only one root node is left. The code word for $s_i$ is then the sequence of labels on the edges connecting the root to the leaf node of $s_i$. The details of the algorithm including a proof of the optimality of Huffman codes can be found in .

3.7.4 $r$-ary Huffman Codes

For the design of general $r$-ary Huffman codes the Huffman coding algorithm is as follows:

1. Calculate $\alpha = (q - r)/(r - 1)$. If $\alpha$ is a non-integer value then append "dummy" symbols to the source with zero probability until there are $q = r + \lceil\alpha\rceil (r - 1)$ symbols.
2. Re-order the source symbols in decreasing order of symbol probability.
3. Successively reduce the source $S$ to $S_1$, then $S_2$ and so on, by combining the last $r$ symbols of $S_j$ into a combined symbol and re-ordering the new set of symbol probabilities for $S_{j+1}$ in decreasing order. For each source keep track of the position of the combined symbol, $\bar{s}_{q-r+1}$. Terminate the source reduction when a source with exactly $r$ symbols is produced. For a source with $q$ symbols the reduced source with $r$ symbols will be $S_{\lceil\alpha\rceil}$.
4. Assign a compact $r$-ary code for the final reduced source. For a source with $r$ symbols the trivial code is $\{0, 1, \ldots, r-1\}$.
5. Backtrack to the original source $S$, assigning a compact code for the $j$th reduced source by the method described in Result 3.4. The compact code assigned to $S$, minus the code words assigned to any "dummy" symbols, is the $r$-ary Huffman code.

The operation of the $r$-ary Huffman code is demonstrated in the following example.

EXAMPLE 3.19
We want to design a compact quaternary ($r = 4$) code for a source with originally 11 ($q = 11$) symbols and the following probability assignments (re-ordered in decreasing order of probability for convenience):

$P(s_1) = 0.16 \quad P(s_2) = 0.14 \quad P(s_3) = 0.13 \quad P(s_4) = 0.12 \quad P(s_5) = 0.10 \quad P(s_6) = 0.10$
$P(s_7) = P(s_8) = 0.06 \quad P(s_9) = 0.05 \quad P(s_{10}) = P(s_{11}) = 0.04$

Now $\alpha = 2.33$ and since $\alpha$ is not an integer value we need to append "dummy" symbols so that we have a source with $q = r + \lceil\alpha\rceil (r - 1) = 13$ symbols, where $\lceil\alpha\rceil = 3$. Thus we append symbols $\{s_{12}, s_{13}\}$ with $P(s_{12}) = P(s_{13}) = 0.00$.

The source $S$ is then reduced to the source $S_3$ with only $r = 4$ symbols as shown in Figure 3.12, where the arrow-heads point to the combined symbol created in $S_j$ by the combination of the last $r$ symbols from $S_{j-1}$. Starting with the trivial compact code of $\{0, 1, 2, 3\}$ for $S_3$ and working back to $S$, a compact code is designed for each reduced source $S_j$ following the procedure described by Result 3.4. The code word for the last $r$ symbols is produced by taking the code word of the symbol pointed to by the arrow-head and appending $\{0, 1, 2, 3\}$ to form $r$ new code words.

The $r$-ary Huffman code is (ignoring the "dummy" symbols $s_{12}$ and $s_{13}$):

Symbol     P(s_i)   Huffman code
$s_1$      0.16     2
$s_2$      0.14     3
$s_3$      0.13     00
$s_4$      0.12     01
$s_5$      0.10     02
$s_6$      0.10     03
$s_7$      0.06     11
$s_8$      0.06     12
$s_9$      0.05     13
$s_{10}$   0.04     100
$s_{11}$   0.04     101

The average length of the code is $L = 1.78$ quaternary units per symbol and the Huffman code has an efficiency of $\eta = H(S)/L = 1.654/1.78 \approx 93\%$.
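The padding rule and the $r$-way reduction can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tabular method of Figure 3.12: `padded_size` and `rary_huffman_lengths` are names introduced here, a priority queue stands in for the repeated re-ordering, and only the code word lengths are computed, not the code words themselves.

```python
import heapq
from fractions import Fraction


def padded_size(q, r):
    """Number of symbols after appending dummies: q' = r + ceil(alpha) * (r - 1)."""
    alpha_ceil = max(0, -(-(q - r) // (r - 1)))  # integer ceil of (q - r) / (r - 1)
    return r + alpha_ceil * (r - 1)


def rary_huffman_lengths(probs, r):
    """Code word lengths of an r-ary Huffman code for the given probabilities."""
    q = len(probs)
    # Heap entries: (probability, unique tie-break, indices of real symbols).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    # Zero-probability dummy symbols own no real symbol index.
    heap += [(Fraction(0), q + d, []) for d in range(padded_size(q, r) - q)]
    heapq.heapify(heap)
    lengths = [0] * q
    seq = len(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(r)]  # the r least probable
        members = [i for _, _, m in group for i in m]
        for i in members:
            lengths[i] += 1  # one more code symbol for every member
        heapq.heappush(heap, (sum(p for p, _, _ in group), seq, members))
        seq += 1
    return lengths


# Example 3.19: quaternary code for 11 symbols
probs = [Fraction(n, 100) for n in (16, 14, 13, 12, 10, 10, 6, 6, 5, 4, 4)]
lengths = rary_huffman_lengths(probs, r=4)
print(padded_size(11, 4), lengths,
      float(sum(p * l for p, l in zip(probs, lengths))))
```

For the source of Example 3.19 the padded size is 13 and the lengths are 1, 1, seven 2s and two 3s, which matches the code table above and gives the average length of 1.78 quaternary units per symbol.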
3.8 Arithmetic Coding

The Huffman coding method described in Section 3.7 is guaranteed to be optimal in the sense that it will generate a compact code for the given source alphabet and associated probabilities. However, a compact code may be anything but optimal if it is not very efficient. Only for special sources, where $l_i = \log_r \frac{1}{P_i} = -\log_r P_i$ is an integer for $i = 1, 2, \ldots, q$, will the compact code also be 100% efficient. Inefficiencies are introduced when $l_i$ is a non-integer and it is required to "round up" to the nearest integer value. The solution to this, as a consequence of Shannon's Noiseless Coding Theorem (see Section 3.5), is to consider the $n$th extension of the source for $n$ large enough and build the Huffman code for all possible blocks of length $n$. Although this does yield optimal codes, the implementation of this approach can easily become unwieldy or unduly restrictive. Problems include:

• The size of the Huffman code table is $q^n$, representing an exponential increase in memory and computational requirements.
• The code table needs to be transmitted to the receiver.
• The source statistics are assumed stationary. If there are changes, an adaptive scheme is required which re-estimates the probabilities and recalculates the Huffman code table.
• Encoding and decoding is performed on a per block basis; the code is not produced until a block of $n$ symbols is received. For large $n$ this may require the last segment to be padded with dummy data.

One alternative to using Huffman coding on increasingly larger extensions of the source is to directly code the source message to a code sequence using arithmetic coding, which we will describe here.
Rather than deriving and transmitting a code table of size $q^n$, segmenting the incoming source message into consecutive blocks of length $n$, and encoding each block by consulting the table, arithmetic coding directly processes the incoming source message and produces the code "on-the-fly." Thus there is no need to keep a code table, making arithmetic coding potentially much more computationally efficient than Huffman coding. The basic concept of arithmetic coding can be traced back to the 1960's; however, it wasn't until the late 1970's and mid 1980's that the method started receiving much more attention, with the paper by Witten et al. often regarded as one of the seminal papers in the area.

Arithmetic coding, like Huffman coding, does require reliable source statistics to be available. Consider the $N$-length source message $s_{t_1}, s_{t_2}, \ldots, s_{t_N}$ where $\{s_i : i = 1, 2, \ldots, q\}$ are the source symbols and $s_{t_i}$ indicates that the $i$th character in the message is the source symbol with index $t_i$. Arithmetic coding assumes that the probabilities,
$P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$, for $i = 1, 2, \ldots, N$, can be calculated. The origin of these probabilities depends on the underlying source model that is being used (e.g., zero-memory source or $m$th order Markov model); how such models are arrived at and how the probabilities are estimated is not discussed here. Rather we assume that the required probabilities are available and concentrate on the encoding and decoding process.

The goal of arithmetic coding is to assign a unique interval along the unit number line or "probability line" $[0, 1)$ of length equal to the probability of the given source message, $P(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$, with its position on the number line given by the cumulative probability of the given source message, $Cum(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$. However there is no need for direct calculation of either $P(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$ or $Cum(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$. The basic operation of arithmetic coding is to produce this unique interval by starting with the interval $[0, 1)$ and iteratively subdividing it by $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$ for $i = 1, 2, \ldots, N$.

The interval subdivision operation of arithmetic coding proceeds as follows. Consider the first letter of the message, $s_{t_1}$. The individual symbols, $s_i$, are each assigned the interval $[l_i, h_i)$ where $h_i = Cum(s_i) = \sum_{j=1}^{i} P(s_j)$ and $l_i = Cum(s_i) - P(s_i) = \sum_{j=1}^{i-1} P(s_j)$. That is, the length of each interval is $h_i - l_i = P(s_i)$ and the end of the interval is given by $Cum(s_i)$. The interval corresponding to the symbol $s_{t_1}$, that is $[l_{t_1}, h_{t_1})$, is then selected. Next we consider the second letter of the message, $s_{t_2}$. The individual symbols, $s_i$, are now assigned the interval $[l_i, h_i)$ where:

$h_i = l_{t_1} + Cum(s_i | s_{t_1}) \cdot P(s_{t_1}) = l_{t_1} + \left( \sum_{j=1}^{i} P(s_j | s_{t_1}) \right) \cdot R \qquad (3.36)$

$l_i = l_{t_1} + \{ Cum(s_i | s_{t_1}) - P(s_i | s_{t_1}) \} \cdot P(s_{t_1}) = l_{t_1} + \left( \sum_{j=1}^{i-1} P(s_j | s_{t_1}) \right) \cdot R \qquad (3.37)$

and $R = h_{t_1} - l_{t_1} = P(s_{t_1})$. That is, the length of each interval is $h_i - l_i = P(s_i | s_{t_1}) \cdot P(s_{t_1})$. The interval corresponding to the symbol $s_{t_2}$, that is $[l_{t_2}, h_{t_2})$, is then selected. The length of the interval corresponding to the message seen so far, $s_{t_1}, s_{t_2}$, is $h_{t_2} - l_{t_2} = P(s_{t_2} | s_{t_1}) \cdot P(s_{t_1}) = P(s_{t_1}, s_{t_2})$.
This interval subdivision operation is depicted graphically in Figure 3.14.

We continue in this way until the last letter of the message, $s_{t_N}$, is processed and the final interval of length $h_{t_N} - l_{t_N} = P(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$ is produced. Any number that falls within that interval can be chosen and transmitted to identify the interval (and hence the original message) to the receiver. Arithmetic codes are typically binary codes. Thus the binary representation of the number is considered. Optimal coding is assured by selecting a number that requires the minimum number of bits to be transmitted to the receiver (by only transmitting the significant bits and ignoring
the trailing 0's). Since the interval occurs with probability $P(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$, approximately $-\log_2 P(s_{t_1}, s_{t_2}, \ldots, s_{t_N})$ bits are needed, and where this becomes an equality for all message sequences a 100% efficient coding is possible. A simple demonstration of arithmetic coding is given in Example 3.20.

[FIGURE 3.14 Interval subdividing operation of arithmetic coding.]

EXAMPLE 3.20
Consider the message $s_2, s_2, s_1$ originating from the 3-symbol source with the following individual and cumulative probabilities:

Symbol   P(s_i)   Cum(s_i)
$s_1$    0.2      0.2
$s_2$    0.5      0.7
$s_3$    0.3      1.0

We further assume that the source is zero-memory, thus $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}) = P(s_{t_i})$. As shown by Figure 3.15, initially the probability line $[0, 1)$ is divided into three consecutive, adjoining intervals $[0, 0.2)$, $[0.2, 0.7)$ and $[0.7, 1.0)$ corresponding to $s_1$, $s_2$ and $s_3$ and the length ratios $0.2 : 0.5 : 0.3$, respectively. The length ratios and interval lengths both correspond to $P(s_i)$. When the first letter of the message, $s_2$, is received the interval $[0.2, 0.7)$ is selected. The interval $[0.2, 0.7)$ is further divided into three consecutive, adjoining subintervals of length ratios $0.2 : 0.5 : 0.3$, that is $[0.2, 0.3)$, $[0.3, 0.55)$ and $[0.55, 0.7)$. The length ratios correspond to $P(s_i | s_2) = P(s_i)$ and the subinterval lengths are equal to $P(s_i | s_2) P(s_2) = P(s_i) P(s_2) = P(s_2, s_i)$. When the second letter of the message, $s_2$, is received the subinterval $[0.3, 0.55)$ is selected. The interval $[0.3, 0.55)$ is then further subdivided into the three subintervals $[0.3, 0.35)$, $[0.35, 0.475)$ and $[0.475, 0.55)$ with length ratios $0.2 : 0.5 : 0.3$ and when
the third and last letter of the message, $s_1$, is received the corresponding and final interval $[0.3, 0.35)$ of length $P(s_2) P(s_2) P(s_1) = P(s_2, s_2, s_1) = 0.05$ is selected.

The interval selection process can be summarised by the following table:

Next letter   Interval
$s_2$         [0.2, 0.7)
$s_2$         [0.3, 0.55)
$s_1$         [0.3, 0.35)

In binary the final interval is [0.01001100..., 0.01011001...). We need to select a number from within the interval that can be represented with the least number of significant bits. The number 0.0101 falls within the interval and only requires the 4 bits 0101 to be transmitted. We note that we need to transmit $-\log_2 P(s_2, s_2, s_1) = -\log_2 (0.05) = 4.32$ bits of information and we have been able to do this with 4 bits.

The above description of arithmetic coding is still incomplete. For example, how does one select the number that falls within the final interval so that it can be transmitted in the least number of bits? Let $[low, high)$ denote the final interval. One scheme described in  which works quite well in the majority of cases is to carry out the binary expansion of $low$ and $high$ until they differ. Since $low < high$, at the first place they differ there will be a 0 in the expansion for $low$ and a 1 in the expansion for $high$. That is:

$low = 0.a_1 a_2 \ldots a_{t-1} 0 \ldots$
$high = 0.a_1 a_2 \ldots a_{t-1} 1 \ldots$

The number $0.a_1 a_2 \ldots a_{t-1} 1$ falls within the interval and requires fewer bits than any other number within the same interval; so it is selected and transmitted as the $t$-bit code sequence $a_1 a_2 \ldots a_{t-1} 1$.

3.8.1 Encoding and Decoding Algorithms

We have described how the message is encoded and transmitted but have not yet discussed how the received code is decoded back to the original message. Figure 3.16 lists an algorithm for encoding a message using arithmetic coding to the binary integer code, $value$, and Figure 3.17 lists an algorithm for decoding the received binary integer, $value$, back to the original message.

NOTE In both the encoding and decoding algorithms mention is made of the EOF or end-of-file. Consider Example 3.20.
The code sequence 0101 corresponds to the binary fraction 0.0101 and decimal fraction 0.3125. The number 0.3125 not only falls within the final interval for the message $s_2, s_2, s_1$ but also for the longer
length messages $s_2, s_2, s_1, s_2$ and $s_2, s_2, s_1, s_2, s_1$ and so on. Thus there is a need for the decoder to know when to stop decoding. Possible solutions to this problem include:

• Defining a special or extra symbol called EOF which is placed at the end of each message.
• Transmitting the length of the message, $N$, along with the code.
• Encoding and decoding messages in fixed-sized blocks and only transmitting the size of the block to the decoder at the beginning of the code sequence.

[FIGURE 3.15 Arithmetic coding for Example 3.20.]

Definitions
1. Let $S = \{s_i : i = 1, 2, \ldots, q\}$ denote the $q$ distinct source symbols or letters of the alphabet.
2. Let $s_{t_1} s_{t_2} \ldots s_{t_N}$ denote the $N$-length message that is to be encoded, where $s_{t_i}$ indicates that the $i$th letter in the message is the symbol with index $t_i$.
3. Assume $P(s_j | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$ for $j = 1, 2, \ldots, q$ and $i = 1, 2, \ldots, N$ are available or can be estimated from the underlying source model.

Initialisation
• $low = 0.0$
• $high = 1.0$
• $i = 1$

Iteration
While (input $\ne$ EOF) do
1. get next letter of message, $s_{t_i}$
2. $Range = high - low$
3. $l = \sum_{j=1}^{t_i - 1} P(s_j | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$
4. $h = \sum_{j=1}^{t_i} P(s_j | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}) = l + P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$
5. $high = low + Range \times h$
6. $low = low + Range \times l$
7. $i = i + 1$
done

Termination
Let $low = 0.a_1 a_2 \ldots a_{t-1} 0 \ldots$ and $high = 0.a_1 a_2 \ldots a_{t-1} 1 \ldots$. Then transmit the code as the binary integer $value = a_1 a_2 \ldots a_{t-1} 1$.

FIGURE 3.16 Encoding algorithm for arithmetic codes.

Definitions
1. Let $S = \{s_i : i = 1, 2, \ldots, q\}$ denote the $q$ distinct source symbols or letters of the alphabet.
2. Let $value$ denote the binary integer that is to be decoded to the original message $s_{t_1} s_{t_2} \ldots s_{t_N}$, where $s_{t_i}$ indicates that the $i$th letter in the message is the symbol with index $t_i$.
3. Assume $P(s_j | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$ for $j = 1, 2, \ldots, q$ and $i = 1, 2, \ldots, N$ are available or can be estimated from the underlying source model.

Initialisation
• $low = 0.0$
• $high = 1.0$
• $i = 1$
• get $value$ and store $value$ as the binary fraction $value = 0.value$

Iteration
Repeat
1. Start at $j = 0$ and repeat:
   (a) $j = j + 1$
   (b) $l = \sum_{k=1}^{j-1} P(s_k | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$
   (c) $h = \sum_{k=1}^{j} P(s_k | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$
   until $l \le \frac{value - low}{high - low} < h$
2. output $s_{t_i}$ as the symbol $s_j$
3. $Range = high - low$
4. $high = low + Range \times h$
5. $low = low + Range \times l$
6. $i = i + 1$
until EOF is reached

FIGURE 3.17 Decoding algorithm for arithmetic codes.

A demonstration of the encoding and decoding algorithms depicted in Figures 3.16 and 3.17 is provided by Examples 3.21 and 3.22.

EXAMPLE 3.21
Consider a zero-memory source with the following source alphabet and probabilities:

Symbol   P(s_i)   Cum(s_i)   [l, h)
$s_1$    6/15     6/15       [0.0, 6/15)
$s_2$    2/15     8/15       [6/15, 8/15)
$s_3$    2/15     10/15      [8/15, 10/15)
$s_4$    4/15     14/15      [10/15, 14/15)
$s_5$    1/15     15/15      [14/15, 1.0)

Suppose the message $s_2, s_1, s_3, s_4, s_3, s_1, s_2, s_5$ is generated by the source, where $s_5$ is the EOF. Applying the encoding algorithm of Figure 3.16, where $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}) = P(s_{t_i})$, yields the following:

Next letter   Range         [l, h)           [low, high)
Init          -             -                [0.0, 1.0)
$s_2$         1.0           [6/15, 8/15)     [0.4, 0.533333333)
$s_1$         0.133333333   [0.0, 6/15)      [0.4, 0.453333333)
$s_3$         0.053333333   [8/15, 10/15)    [0.428444444, 0.435555556)
$s_4$         0.007111111   [10/15, 14/15)   [0.433185185, 0.435081481)
$s_3$         0.001896296   [8/15, 10/15)    [0.434196543, 0.434449383)
$s_1$         0.000252840   [0.0, 6/15)      [0.434196543, 0.434297679)
$s_2$         0.000101136   [6/15, 8/15)     [0.434236998, 0.434250482)
$s_5$         0.000013485   [14/15, 1.0)     [0.434249583, 0.434250482)

The binary representation of $low = 0.434249583$ and $high = 0.434250482$ is:

$low = 0.01101111001010101111\ldots$
$high = 0.01101111001010110000\ldots$
and hence the integer code that uniquely identifies the interval that is transmitted is the 16-bit $value = 0110111100101011$.

EXAMPLE 3.22
Consider decoding the received code $value = 0110111100101011$ produced by the encoding process in Example 3.21. Applying the decoding algorithm of Figure 3.17, given the source alphabet and probabilities from Example 3.21, to the stored value of $value = 0.0110111100101011_2 = 0.434249878_{10}$ yields the following:

[low, high)                  Range         (value - low)/(high - low)   s_{t_i} : [l, h)
[0.0, 1.0)                   1.0           0.434249878                  $s_2$ : [6/15, 8/15)
[0.4, 0.533333333)           0.133333333   0.256874149                  $s_1$ : [0, 6/15)
[0.4, 0.453333333)           0.053333333   0.642185373                  $s_3$ : [8/15, 10/15)
[0.428444444, 0.435555556)   0.007111111   0.816389094                  $s_4$ : [10/15, 14/15)
[0.433185185, 0.435081481)   0.001896296   0.561459102                  $s_3$ : [8/15, 10/15)
[0.434196543, 0.434449383)   0.000252840   0.210943262                  $s_1$ : [0, 6/15)
[0.434196543, 0.434297679)   0.000101136   0.527358154                  $s_2$ : [6/15, 8/15)
[0.434236998, 0.434250482)   0.000013485   0.955186157                  $s_5$ : [14/15, 1)

and the decoded message is $s_2, s_1, s_3, s_4, s_3, s_1, s_2, s_5$, where $s_5$ is the EOF used to terminate the decoding process.

3.8.2 Encoding and Decoding with Scaling

The encoding and decoding algorithms depicted in Figures 3.16 and 3.17 iteratively reduce the length of the interval with each successive letter from the message. Thus with longer messages, smaller intervals are produced. If the interval range becomes too small, then underflow is a very real problem. In Example 3.21 the interval range is already down to about 9.0e-07 after only the 8th letter in the message, and in the binary representation of the final $[low, high)$ interval the first (most significant) 15 bits are identical. With modern 64-bit CPUs the $low$ and $high$ will become indistinguishable when the first 64 bits are identical. The practical implementation of the encoding and decoding algorithms requires an added scaling operation.
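Before scaling is introduced, the unscaled algorithms of Figures 3.16 and 3.17 can be sketched as below for a zero-memory source. Several things here are assumptions made for illustration: the function names, the string labels "s1" to "s5" for the symbols, and the use of exact rational arithmetic (Python's `fractions.Fraction`), which sidesteps underflow entirely and is not how a practical coder would be built. The termination rule is the binary-expansion scheme described in Section 3.8; the sketch raises an error in the uncommon case where that rule lands on the excluded upper end of the interval.

```python
from fractions import Fraction


def intervals(probs):
    """Assign each symbol its slice [l, h) of [0, 1) from cumulative probabilities."""
    table, cum = {}, Fraction(0)
    for sym, p in probs.items():
        table[sym] = (cum, cum + p)
        cum += p
    return table


def shortest_code(low, high):
    """Expand low and high in binary until they differ; return a_1...a_{t-1}1."""
    bits = []
    while True:
        low, high = 2 * low, 2 * high
        a, b = int(low), int(high)  # next binary digit of each bound
        low, high = low - a, high - b
        if a != b:
            return "".join(bits) + "1"
        bits.append(str(a))


def encode(message, probs):
    table = intervals(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        l, h = table[sym]
        rng = high - low
        high = low + rng * h  # uses the old low, as in Figure 3.16
        low = low + rng * l
    code = shortest_code(low, high)
    value = Fraction(int(code, 2), 2 ** len(code))
    if not low <= value < high:
        raise ValueError("termination rule left the final interval")
    return code, (low, high)


def decode(code, probs, eof):
    table = intervals(probs)
    value = Fraction(int(code, 2), 2 ** len(code))  # binary fraction 0.code
    low, high = Fraction(0), Fraction(1)
    message = []
    while True:
        rng = high - low
        target = (value - low) / rng
        sym = next(s for s, (l, h) in table.items() if l <= target < h)
        l, h = table[sym]
        high = low + rng * h
        low = low + rng * l
        message.append(sym)
        if sym == eof:  # stop at the end-of-file symbol
            return message


# Example 3.21, with the fifth symbol acting as EOF
probs = {"s1": Fraction(6, 15), "s2": Fraction(2, 15), "s3": Fraction(2, 15),
         "s4": Fraction(4, 15), "s5": Fraction(1, 15)}
message = ["s2", "s1", "s3", "s4", "s3", "s1", "s2", "s5"]
code, _ = encode(message, probs)
print(code, decode(code, probs, eof="s5") == message)
```

Run on Example 3.21 the sketch reproduces the 16-bit code 0110111100101011 and decodes it back to the original message; run on the message of Example 3.20 it returns the 4-bit code 0101.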
The scaling operation rescales the current $[low, high)$ based on the pseudo-algorithm depicted in Figure 3.18.

The scaling operation not only solves the underflow problem but permits transmission of the coded value before the encoding operation is complete. Thus a long message block can be encoded without having to wait until the end of the message before transmitting the coded value. Similarly, the decoder can begin decoding the most significant bits of the coded value (which are transmitted first by the encoder when rescaling the intervals) before all the bits have been received. Examples 3.23 and 3.24 illustrate the use of the scaling algorithm of Figure 3.18 for the encoding and decoding of arithmetic codes.

Rescale
If $low = 0.a_1 a_2 \ldots a_{t-1} 0 l_1 l_2 \ldots$ and $high = 0.a_1 a_2 \ldots a_{t-1} 1 h_1 h_2 \ldots$ then rescale to:

$low = 0.0 l_1 l_2 \ldots$
$high = 0.1 h_1 h_2 \ldots$

If encoding then $value = value + a_1 a_2 \ldots a_{t-1}$ (initially $value$ is the NULL string), where $+$ is the string concatenation operator.
If decoding then $value = (value \times 2^{t-1}) \bmod 1$, where $\bmod\ 1$ returns the fractional component of the product.

FIGURE 3.18 Scaling algorithm to avoid underflow.

EXAMPLE 3.23
The encoding from Example 3.21 is now repeated with the additional rescaling operation of Figure 3.18 to prevent underflow from occurring. The encoding with scaling is shown in the following table, where $[low, high)_r$ indicates that the $low$ and $high$ are expressed in base $r$:

Next letter   Range     [l, h)           [low, high)_10        [low, high)_2                 Output
Init          -         -                [0.0, 1.0)            [0.0, 1.0)
$s_2$         1.0       [6/15, 8/15)     [.4, .533333)         [.01100110, .10001000)
$s_1$         .133333   [0, 6/15)        [.4, .453333)         [.01100110, .01110100)
(rescale)                                [.2, .626667)         [.00110..., .10100...)        011
$s_3$         .426667   [8/15, 10/15)    [.427556, .484444)    [.01101101, .01111100)
(rescale)                                [.420444, .875556)    [.01101..., .11100...)        011
$s_4$         .455111   [10/15, 14/15)   [.723852, .845215)    [.10111001, .11011000)
(rescale)                                [.447704, .690430)    [.0111001..., .1011000...)    1
$s_3$         .242726   [8/15, 10/15)    [.577158, .609521)    [.10010011, .10011100)
(rescale)                                [.234520, .752335)    [.0011..., .1100...)          1001
$s_1$         .517815   [0, 6/15)        [.234520, .441646)    [.00111100, .01110001)
(rescale)                                [.469040, .883292)    [.0111100..., .1110001...)    0
$s_2$         .414252   [6/15, 8/15)     [.634741, .689975)    [.10100010, .10110000)
(rescale)                                [.077928, .519797)    [.00010..., .10000...)        101
$s_5$         .441869   [14/15, 1)       [.490339, .519797)    [.01111101, .10000101)
(terminate)                                                                                  1

When the last or EOF letter of the message is read a final 1 is transmitted. Concatenating the intermediate bits that were output yields the final $value = 0110111100101011$
which is the same $value$ as in Example 3.21.

EXAMPLE 3.24
The decoding from Example 3.22 is repeated with the additional rescaling operation of Figure 3.18. The decoding processing with scaling is shown in the following table:

(value - low)/(high - low)   s_{t_i}   [low, high)_10         [low, high)_2                 Range     value
-                            -         [0.0, 1.0)             [0.0, 1.0)                    1.0       .434250
.434250                      $s_2$     [.400000, .533333)     [.01100110, .10001000)        .133333   .434250
.256874                      $s_1$     [.4, .453333)          [.01100110, .01110100)
(rescale)                              [.2, .626667)          [.001100..., .10100...)       .426667   .473999
.642185                      $s_3$     [.427556, .484444)     [.01101101, .01111100)
(rescale)                              [.420444, .875556)     [.01101..., .11100...)        .455111   .791992
.816389                      $s_4$     [.723852, .845215)     [.10111001, .11011000)
(rescale)                              [.447704, .690430)     [.0111001..., .1011000...)    .242726   .583984
.561459                      $s_3$     [.577158, .609521)     [.10010011, .10011100)
(rescale)                              [.234520, .752335)     [.0011..., .1100...)          .517815   .343750
.210944                      $s_1$     [.234520, .441646)     [.00111100, .01110001)
(rescale)                              [.469040, .883292)     [.0111100..., .1110001...)    .414252   .687500
.527360                      $s_2$     [.634741, .689975)     [.10100010, .10110000)
(rescale)                              [.077928, .519797)     [.00010..., .10000...)        .441870   .500000
.955197                      $s_5$

NOTE The scaling operation not only expands the interval but repositions it around 0.5. That is, the rescaled interval has $low < 0.5$ and $high \ge 0.5$. Thus, rescaling is needed whenever $low$ and $high$ are both less than or both greater than 0.5. One pathological condition that may arise and defeat the scaling operation is when $low$ is just below 0.5 and $high$ is just above 0.5. Consider:

$low = 0.0111111111111$
$high = 0.1000000000001$

The interval cannot be rescaled, and the interval $Range = high - low$ tends to 0 as $low$ and $high$ both approach 0.5; once the CPU bit-size is exceeded, underflow occurs.

3.8.3 Is Arithmetic Coding Better Than Huffman Coding?

Shannon's noiseless coding theorem states that if one designs a compact binary code, using Huffman coding, for the $n$th extension of a source then the average length of the code, $L_n / n$, is no greater than $H(S) + \frac{1}{n}$. A similar theorem outlined in  states
that for the arithmetic coding scheme just described the average length is also no greater than $H(S) + \frac{1}{n}$ when encoding a message of length $n$. Thus Huffman coding and arithmetic coding exhibit similar performance in theory. However, the arithmetic code encoding and decoding algorithms with scaling can be implemented for messages of any length $n$. Huffman coding for the $n$th extension of a source, however, becomes computationally prohibitive with increasing $n$ since the computational complexity is of the order $q^n$.

3.9 Higher-order Modelling

Shannon's Noiseless Coding Theorem discussed in Section 3.5 states that code efficiency can be improved by coding the $n$th extension of the source. Huffman coding is then used to derive the compact code for $S^n$, the $n$th extension of the source. The arithmetic codes from Section 3.8 directly code the $n$-length message sequence to the code word. In both cases source statistics are needed to provide the required conditional and joint probabilities, which we obtain with the appropriate model. But what model do we use? Result 3.1 alluded to the requirement that we use the "true" model for the source, except that in practice the order of the "true" model is not known, it may be too high to be computationally feasible or, indeed, the "true" model is not a Markov model of any order! We now formally define what we mean by the different orders of modelling power.

DEFINITION 3.12 $k$th-order Modelling
In $k$th-order modelling the joint probabilities $P(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$ of all possible $(k+1)$-length message sequences are given. Furthermore:

• Since

$P(s_{t_1}, s_{t_2}, \ldots, s_{t_m}) = \sum_{t_{m+1}} \sum_{t_{m+2}} \cdots \sum_{t_{k+1}} P(s_{t_1}, \ldots, s_{t_m}, s_{t_{m+1}}, \ldots, s_{t_{k+1}}) \qquad (3.38)$

then the joint probabilities for the $m$th-order model, where $m < k + 1$, are also given.

• Since

$P(s_{t_{m+1}} | s_{t_1}, s_{t_2}, \ldots, s_{t_m}) = \frac{P(s_{t_1}, s_{t_2}, \ldots, s_{t_m}, s_{t_{m+1}})}{P(s_{t_1}, s_{t_2}, \ldots, s_{t_m})} \qquad (3.39)$

then the stationary state and conditional probabilities for the $m$th order Markov model, where $m < k + 1$, are also provided.

We need to consider what effect the model order has on the Huffman and arithmetic
coding schemes we have discussed. In Chapter 1 it was shown that higher-order Markov models yield lower values of entropy, and since the lower bound on the average length of a code is the entropy, higher-order modelling would be expected to produce more efficient codes (see Result 3.1). Thus it is not only important to consider the implementation of the encoder and decoder for both the Huffman and arithmetic algorithms but also how the source statistics are calculated.

Ideally, for encoding an $n$-length message or $n$th extension of a source the highest-order model of order $k = (n - 1)$ should be used. Thus coding of longer messages and source extensions will indirectly benefit from higher orders of modelling power which, from Shannon's Noiseless Coding Theorem and Result 3.1, will produce more efficient codes as the source entropy decreases. Intuitively, longer messages and extensions are able to better exploit any redundant or repetitive patterns in the data.

3.9.1 Higher-order Huffman Coding

With higher-order Huffman coding we consider the case of deriving the Huffman code for the $n$th extension of the source, $S^n$, using a $k$th order model of the source, $S$.

• If $n \le k + 1$ then the joint probabilities $P(s_{t_1}, s_{t_2}, \ldots, s_{t_n})$ of the $(n-1)$th order (Markov) model can be used.
• If $n > k + 1$ then the product of the joint $k$th order probabilities can be used.

In practice the $(k+1)$-gram model probabilities $P(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$ are directly estimated from the frequency counts:

$P(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}}) = \frac{n(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})}{\sum n(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})} \qquad (3.40)$

where $n(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$ is the number of times the $(k+1)$-length message, $s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}}$, is seen and $\sum n(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$ is the sum of all $(k+1)$-length messages that have been seen so far.

When building the Huffman coding tree only the relative order of the $n$th extension probabilities is important. Thus a fixed scaling of the probabilities can be applied without affecting the Huffman coding algorithm.
Thus practical implementations use the model counts, $n(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$, directly instead of the model probabilities, $P(s_{t_1}, s_{t_2}, \ldots, s_{t_{k+1}})$.
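That the counts can stand in for the probabilities is easy to check numerically. The sketch below is illustrative only: `huffman_lengths` is a name introduced here, it uses a priority queue with a fixed tie-break rather than the tabular procedure, and the counts are those of the 2nd order model used in Example 3.25 (equivalently, the 3rd extension probabilities of Figure 3.20 multiplied by 100). It builds the code word lengths once from the integer counts and once from the normalised probabilities.

```python
import heapq
from fractions import Fraction


def huffman_lengths(weights):
    """Binary Huffman code word lengths for a dict {block: weight}.

    Weights may be raw counts or probabilities; only their order matters.
    """
    # Heap entries: (weight, unique tie-break, member blocks).
    heap = [(w, i, [b]) for i, (b, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    seq = len(heap)
    while len(heap) > 1:
        w1, _, m1 = heapq.heappop(heap)  # the two lightest nodes
        w2, _, m2 = heapq.heappop(heap)
        for b in m1 + m2:
            lengths[b] += 1
        heapq.heappush(heap, (w1 + w2, seq, m1 + m2))
        seq += 1
    return lengths


counts = {"000": 47, "001": 12, "010": 7, "011": 8,
          "100": 12, "101": 3, "110": 7, "111": 4}
total = sum(counts.values())
probs = {b: Fraction(n, total) for b, n in counts.items()}

from_counts = huffman_lengths(counts)
from_probs = huffman_lengths(probs)
print(from_counts == from_probs, from_counts)
print(sum(counts[b] * from_counts[b] for b in counts) / total)  # bits per 3-symbol block
```

Both runs give the same lengths (1, 3, 4, 4, 3, 5, 4, 5 for blocks 000 to 111), which agree with the code word lengths tabulated in Example 3.25 and correspond to 2.42 bits per 3-symbol block.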
EXAMPLE 3.25
After observing a typical 100 length binary message the following 2nd order model counts are produced:

$n(000) = 47$
$n(001) = 12$
$n(010) = 7$
$n(011) = 8$
$n(100) = 12$
$n(101) = 3$
$n(110) = 7$
$n(111) = 4$

The binary Huffman code for the 3rd extension of the source is to be designed based on the above 2nd order model. The binary Huffman code tree is shown in Figure 3.19, where the counts are directly used for construction of the tree.

The Huffman code is given in the following table:

3rd extension   Huffman code
000             1
001             010
010             0010
011             0000
100             011
101             00011
110             0011
111             00010

3.9.2 Higher-order Arithmetic Coding

When performing arithmetic coding on a message of length $n$ we require knowledge of the conditional probabilities $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})$ for $i = 1, 2, \ldots, n$ as shown by Figure 3.16. Assuming the $k$th order model is given then:

• If $n \le k + 1$ then $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}) = \frac{P(s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}, s_{t_i})}{P(s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}})}$ for $i = 1, 2, \ldots, n$, where the joint probabilities of the $(n-1)$th order model can be used.
• If $n > k + 1$ then for $i = k+2, k+3, \ldots, n$ we have $P(s_{t_i} | s_{t_1}, s_{t_2}, \ldots, s_{t_{i-1}}) = P(s_{t_i} | s_{t_{i-k}}, s_{t_{i-k+1}}, \ldots, s_{t_{i-1}}) = \frac{P(s_{t_{i-k}}, s_{t_{i-k+1}}, \ldots, s_{t_{i-1}}, s_{t_i})}{P(s_{t_{i-k}}, s_{t_{i-k+1}}, \ldots, s_{t_{i-1}})}$.
[FIGURE 3.19: 3rd order Huffman coding tree.]
The $m$th order probabilities are derived from the respective frequency counts of the data seen thus far using Equation 3.40. Unlike Huffman coding, arithmetic coding requires the absolute probabilities, so counts of both $c(s_{i_1} s_{i_2} \ldots s_{i_{m+1}})$ and $\sum c(s_{i_1} s_{i_2} \ldots s_{i_{m+1}})$ are needed.

EXAMPLE 3.26

Consider performing arithmetic coding on the message 01011 based on the 2nd order model counts from Example 3.25. From the counts the 3rd extension, 2nd extension and single binary probabilities can be calculated as shown in Figure 3.20.

P(000) = 0.47    P(001) = 0.12    P(010) = 0.07    P(011) = 0.08
P(100) = 0.12    P(101) = 0.03    P(110) = 0.07    P(111) = 0.04
P(00) = 0.59     P(01) = 0.15     P(10) = 0.15     P(11) = 0.11
P(0) = 0.74      P(1) = 0.26

FIGURE 3.20 Joint probabilities up to 3rd order.

Applying the encoding algorithm of Figure 3.16, where $P(s_2 \mid s_1) = \dfrac{P(s_1 s_2)}{P(s_1)}$ and $P(s_i \mid s_{i-2} s_{i-1}) = \dfrac{P(s_{i-2} s_{i-1} s_i)}{P(s_{i-2} s_{i-1})}$, yields:

Symbol/context    Range    Conditional probability              Sub-interval     [low, high)
-                 -        -                                    -                [0.0, 1.0)
0/-               1.0      P(0) = 0.74                          [0.0, 0.74)      [0.0, 0.74)
1/0               0.74     P(1|0) = P(01)/P(0) = 0.203          [0.797, 1.0)     [0.590, 0.740)
0/01              0.15     P(0|01) = P(010)/P(01) = 0.467       [0.0, 0.467)     [0.590, 0.660)
1/10              0.070    P(1|10) = P(101)/P(10) = 0.200       [0.800, 1.0)     [0.646, 0.660)
1/01              0.014    P(1|01) = P(011)/P(01) = 0.533       [0.467, 1.0)     [0.6525, 0.6600)

The corresponding interval sub-division of the probability line [0.0, 1.0) is shown in Figure 3.21.

The final interval for the message 01011 is [0.6525, 0.6600) which in binary is [0.101001110000, 0.101010001111) and hence the code word that is transmitted is 10101. It should be noted that for this example the same number of bits is used for both the code word and the message.
[FIGURE 3.21: Subdivision of probability line.]
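The interval narrowing of Example 3.26 can be reproduced with a short Python sketch. Several things here are assumptions rather than part of the text: the function names, the use of exact fractions in place of rounded decimals, the convention that symbol 0 takes the lower part of each interval (as the sub-intervals in the example suggest), and the rule in `shortest_codeword`, which simply picks the shortest binary fraction lying inside the final interval. The counts are those assumed for Example 3.25.

```python
from fractions import Fraction
from math import ceil

# Assumed 2nd order model counts from Example 3.25.
COUNTS = {"000": 47, "001": 12, "010": 7, "011": 8,
          "100": 12, "101": 3, "110": 7, "111": 4}
TOTAL = sum(COUNTS.values())


def joint(prefix):
    """Joint probability of `prefix`, summed over blocks that start with it."""
    return Fraction(sum(c for b, c in COUNTS.items() if b.startswith(prefix)), TOTAL)


def encode(message, m=2):
    """Return the final [low, high) interval for a binary message."""
    low, width = Fraction(0), Fraction(1)
    for i, symbol in enumerate(message):
        context = message[max(0, i - m):i]
        p0 = joint(context + "0") / joint(context)  # P(0 | context)
        if symbol == "0":          # 0 keeps the lower part of the interval
            width *= p0
        else:                      # 1 keeps the upper part
            low += width * p0
            width *= 1 - p0
    return low, low + width


def shortest_codeword(low, high):
    """Shortest bit string b such that the binary fraction 0.b lies in [low, high)."""
    k = 1
    while True:
        n = ceil(low * 2 ** k)     # smallest multiple of 2**-k that is >= low
        if Fraction(n, 2 ** k) < high:
            return format(n, "0{}b".format(k))
        k += 1


low, high = encode("01011")
print(float(low), float(high), shortest_codeword(low, high))
```

Working with `Fraction` avoids the rounding that appears in the hand calculation, so the lower end of the final interval comes out as 2447/3750 (about 0.65253) rather than the rounded 0.6525.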
3.10 Exercises

1. For the following binary codes, determine:
   (a) Whether the code is uniquely decodable. If not, exhibit two source messages with the same code.
   (b) Whether the code is instantaneous. If not, can you design an instantaneous code with the same lengths of code words?

         code A   code B   code C   code D   code E   code F   code G   code H
   s1    000      0        0        0        0        0        01       1010
   s2    001      01       10       10       10       100      011      001
   s3    010      011      110      110      1100     101      10       101
   s4    011      0111     1110     1110     1101     110      1000     0001
   s5    100      01111    11110    1011     1110     111      1100     1101
   s6    101      011111   111110   1101     1111     001      0111     1011

2. Consider a block code based on an r-symbol code alphabet and designed for a source with q possible messages. Derive the expression for the lower bound on the block code word length.

3. We are going to devise a code for the decimal digits 0-9 using a binary code. Analysis shows that we should use the shortest codes for 0 and 1. If we code:
      digit 0 to code 00
      digit 1 to code 11
   find the minimum code word length for the remaining digits 2-9 assuming we want them to be the same code word length.

4. For the following code word lengths:

   Code   Lengths
   A      2 2 2 4 4 4
   B      1 1 2 3 3
   C      1 1 2 2 2 2

   (a) Can an instantaneous binary code be formed? If so, give an example of such a code.
   (b) Can an instantaneous ternary code be formed? If so, give an example of such a code.
   (c) If neither a binary nor ternary code can be formed find the smallest number of code symbols that will allow a code to be formed. Give an example of such a code.

*5. Find all possible combinations of the individual code word lengths $l_1, l_2, l_3$ when coding the messages $s_1, s_2, s_3$ which result in uniquely decodable binary codes with words not more than 3 bits long (i.e., $l_i \le 3$).

6. A 6-symbol source has the following statistics and suggested binary and ternary codes:

   Symbol   P     Code A   Code B
   s1       0.3   0        00
   s2       0.2   10       01
   s3       0.1   1110     02
   s4       0.1   1111     10
   s5       0.2   1100     11
   s6       0.1   1101     12

   (a) What is the efficiency of binary code A?
   (b) What is the efficiency of ternary code B?
   (c) Can you design a more efficient binary code and, if so, what is the efficiency of your code?
   (d) Can you design a more efficient ternary code and, if so, what is the efficiency of your code?
   (e) Which is the most efficient code: binary or ternary?

7. Consider the following information source:

   Symbol   s1    s2    s3     s4     s5    s6
   P        0.1   0.1   0.45   0.05   0.2   0.1

   (a) What is $H(S)$?
   (b) Derive a compact binary code using Huffman coding. What is the average length of your code?
   (c) Can you find another compact code with different individual code word lengths? If so, what is the average length of this code?
   (d) What is the efficiency of the code?
   (e) What coding strategy can be used to improve the efficiency?

8. Consider the following source:
   Symbol   s1    s2    s3    s4    s5
   P        0.1   0.2   0.2   0.4   0.1

   Design a compact binary code for the source with minimum variance between code word lengths. What is the efficiency of your code? Now design the compact binary code with maximum variance and compare.

9. Derive the compact ternary code using Huffman coding for the following source and compute the efficiency:

   Symbol   s1     s2    s3     s4    s5     s6     s7     s8
   P        0.07   0.4   0.05   0.2   0.08   0.05   0.12   0.03

10. Consider the following binary source:

   Symbol   s1    s2
   P        0.1   0.9

   (a) Find $H(S)$.
   (b) What is the compact (indeed trivial!) binary code for this source? Compare the average length $L_1$ with $H(S)$. What is the efficiency?
   (c) Repeat (b) for $S^n$ (the $n$th extension of $S$) for $n = 2, 3$. Find $L_n / n$ and compare this with $H(S)$. What is happening?

11. A binary source has the following 2nd extension probabilities:

   Symbol   P
   00       0.75
   01       0.1
   10       0.1
   11       0.05

   (a) Design the source encoder by devising a compact code for the source.
   (b) What is the efficiency of the code you designed in (a)?
   (c) What is the entropy of the source encoder output (the code entropy)? Compare your answer with the source entropy, $H(S)$. Will this always be the case?

*12. Suppose a long binary message contains half as many 1's as 0's. Find a binary Huffman coding strategy which uses:
   (a) at most 0.942 bits per symbol.
   (b) at most 0.913 bits per symbol.

13. What is the efficiency of a binary source $S$ in which 1 has a probability of 0.85? Find an extension of $S$ with efficiency at least 95%.

*14. A binary source emits 0 with probability 3/8. Find a ternary coding scheme that is at least 90% efficient.

15. A source emits symbol $x$ a third of the time and symbol $y$ the rest of the time. Devise a ternary coding scheme which is:
   (a) at most 0.65 ternary units per symbol
   (b) at most 0.60 ternary units per symbol
   (c) at most 0.55 ternary units per symbol
   What is the efficiency of the code you devised for the above cases?

16. You are required to design a ternary source encoder that is at least 97% efficient. Observations of the ternary source over a certain time interval reveal that symbol $s_1$ occurred 15 times, $s_2$ occurred 6 times and $s_3$ occurred 9 times. Design the coding scheme. What is the efficiency of your code?

17. A binary message has been modeled as a 1st order Markov source:

   [State diagram of the Markov source: states 0 and 1, with transition probabilities 0.1, 0.3, 0.7 and 0.9.]

   (a) Derive a binary compact code for the 2nd extension of the source. What is the efficiency of your code?
   (b) Repeat (a) for the 3rd extension of the source.

18. A binary source emits an equal number of 0's and 1's but emits the same symbol as the previous symbol twice as often as emitting a different symbol than the previous symbol. Derive a binary encoding which is at least 95% efficient.

19. The output of an unknown binary information source transmitting at 100 bps (bits per second) produces a different output (from the previous output) three times as often as producing the same output. Devise a binary source coding strategy for each of the following cases, so that the channel bit rate is:
   (a) at most 95 bps.
   (b) at most 80 bps.

*20. Analysis of a binary information source reveals that the source is three times more likely to emit the same bit if the last two bits were the same; otherwise it is equally likely to produce a 0 or a 1.
   (a) Design a coding scheme which is at least 85% efficient.
   (b) Design a coding scheme which is at least 90% efficient.

*21. Consider the source of Qu. 10:

   Symbol   s1    s2
   P        0.1   0.9

   (a) Perform arithmetic coding on the sequence $s_2 s_1 s_2$ to produce the coded output. Compare your answer with the Huffman code of the 3rd extension of the source from 10(c).
   (b) Repeat (a) for the sequence $s_1 s_1 s_1$.
   (c) Repeat (a) for the sequence $s_1 s_2 s_1$.
   (d) From the above cases, is arithmetic coding of a 3 letter message comparable to, superior to or inferior to Huffman coding of the corresponding 3rd extension of the source?

   NOTE Do not use any scaling. To convert from base 10 to base 2 (and vice versa) you will need a number base converter that can deal with non-integer values. One such tool available from the Internet is
   http://www.math.com/students/converters/source/base.htm.

22. Perform arithmetic decoding of your coded output values from Qu. 21(a), (b), (c) and confirm that the original 3 letter messages are produced.

23. Repeat Qu. 21 but this time use scaling.

24. Repeat Qu. 22 but this time use scaling.

*25. Huffman codes are guaranteed to be instantaneous codes. Can the same statement be made about arithmetic codes?

26. Consider Example 3.26:
   (a) Perform arithmetic coding of the sequence 011 and compare with the corresponding Huffman code from Example 3.25.
   (b) Repeat (a) for the sequence 000.
   (c) Repeat (a) for the sequence 001.
   (d) From the above cases, is arithmetic coding comparable to, superior to or inferior to Huffman coding?

27. Consider Example 3.26:
   (a) Perform arithmetic decoding of the coded binary integer value 10101 assuming the original message was 3 bits long.
   (b) Repeat (a) but for the coded binary integer value of 1.

*28. After observing a typical sample of a binary source the following 2nd order model counts are produced:

   $c(000) = 20$    $c(001) = 10$    $c(010) = ?$     $c(011) = 1?$
   $c(100) = 1?$    $c(101) = 10$    $c(110) = 20$    $c(111) = ?$

   (a) Derive the binary Huffman code table for the 3rd extension of the source such that there is minimum variance in the individual code word lengths. What is the average length of your code?
   (b) Perform arithmetic coding on the following 3 letter messages and compare with the corresponding Huffman code:
      i. 110
      ii. 111
      iii. 100

3.11 References

ASCII, retrieved January 25, 2002 from http://www.webopedia.com/TERM/A/ASCII.html
N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963.
T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1990.
R.M. Fano, Transmission of Information, MIT Press, Cambridge, MA, 1961.
S.W. Golomb, R.E. Peile, and R.A. Scholtz, Basic Concepts in Information Theory and Coding, Plenum Press, New York, 1994.
D. Hankerson, G.A. Harris, and P.D. Johnson, Jr., Introduction to Information Theory and Data Compression, CRC Press, Boca Raton, FL, 1998.
D.A. Huffman, A method for the construction of minimum redundancy codes, Proc. IRE, 40, 1098-1101, 1952.
L.G. Kraft, A device for quantizing, grouping and coding amplitude modulated pulses, M.S. thesis, Dept. of E.E., MIT, Cambridge, MA, 1949.
J.C.A. van der Lubbe, Information Theory, Cambridge University Press, London, 1997.
B. McMillan, Two inequalities implied by unique decipherability, IRE Trans. Inform. Theory, IT-2, 115-116, 1956.
C.E. Shannon, A mathematical theory of communication, Bell System Tech. J., 27, 379-423, 623-656, 1948.
I.H. Witten, R.M. Neal, and J.G. Cleary, Arithmetic coding for compression, Communications of the ACM, 30(6), 520-540, 1987.
I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, Morgan Kaufmann Publishers, San Francisco, 2nd ed., 1999.
Chapter 4
Data Compression

4.1 Introduction

We have seen that entropy or information content is a measure of predictability or redundancy. In situations where there is redundancy in a body of information, it should be possible to adopt some form of coding which exploits the redundancy in order to reduce the space which the information occupies. This is the idea underlying approaches to data compression.

In the previous chapter, we looked at encoding methods that assume that successive characters in a sequence are independent or that the sequence was generated by a Markov source. In this chapter, we extend these ideas to develop data compression techniques that use the additional redundancy that arises when there are relationships or correlations between neighbouring characters in the sequence. This additional redundancy makes it possible to achieve greater compression, though at the cost of extra computational effort.

EXAMPLE 4.1

Suppose we have a message that consists of only four letters, say A, B, C, D. To measure the information content of such a message, it is convenient to code the letters as binary digits, and count the total number of digits to give an estimate of the information content in bits.

If there is no redundancy in the sequence, so that the letters occur at random, then we must use two binary digits to code each letter, say 00 for A, 01 for B, 10 for C and 11 for D. A sequence such as

BADDACDCADCCABBC

will be coded as

01001111001011100011101000010110
The information content will be two bits per letter.

Suppose instead that the occurrence of letters is not completely random, but is constrained in some way, for example, by the rule that A is followed by B or C with equal probability (but never by D), B is followed by C or D with equal probability, C is followed by D or A with equal probability and D is followed by A or B with equal probability. We can encode a sequence that satisfies these rules by using two binary digits to encode the first letter as above, and then using one binary digit to encode each successive letter. If the preceding letter is A, we can use 0 to encode the following letter if it is B and 1 to encode it if it is C, and so on. In this case, a sequence such as

BDBCACDBCDBDBDACDA

will be coded as

0111011010011110100

and the information content will be $(n+1)$ bits if there are $n$ letters in the sequence. If we had used the coding that we used for the completely random sequence, it would be coded as

011101100010110110110111011100101100

which is approximately twice as long.

The redundancy in the second sequence has enabled us to reduce the number of binary digits used to represent it by half.

The example above illustrates how structure can be used to remove redundancy and code sequences efficiently. In the following sections we will look at various kinds of structure in sequence and image data and the ways in which they can be used for data compression.

4.2 Basic Concepts of Data Compression

We have seen in the previous chapter that simple coding schemes can be devised based on the frequency distribution of characters in an alphabet. Frequently occurring characters are assigned short codes and infrequently occurring characters are
assigned long codes so that the average number of code symbols per character is minimised.

Repetition is a major source of redundancy. In cases where single characters occur many times in succession, some form of run-length coding can be an effective data compression technique. If there are strings of characters or some other patterns that occur frequently, some form of dictionary coding may be effective.

In some cases, it may be possible to create a statistical model for the data and use the predictions of the model for compression purposes. Techniques that use a single model to compress all kinds of data are known as static compression techniques. Such techniques are likely to be inefficient, because of the mismatch between the model and the actual statistics of the data. As an alternative, semi-adaptive compression techniques construct a model for each set of data that is encoded and store or transmit the model along with the compressed data. The overhead involved in storing or transmitting the model can be significant. For this reason, semi-adaptive techniques are rarely used. The best compression algorithms use adaptive compression techniques, in which the model is built up by both the compressor and decompressor in the course of the encoding or decoding process, using that part of the data that has already been processed.

When sequences of characters representing text or similar material have to be compressed, it is important that the original material is reproduced exactly after decompression. This is an example of lossless compression, where no information is lost in the coding and decoding. When images are compressed, it may be permissible for the decompressed image not to have exactly the same pixel values as the original image, provided the difference is not perceptible to the eye. In this case, some form of lossy compression may be acceptable.
This involves a loss of information between the coding and decoding processes.

4.3 Run-length Coding

Run-length coding is a simple and effective means of compressing data in which it is frequently the case that the same character occurs many times in succession. This may be true of some types of image data, but it is not generally true for text, where it is rare for a letter of the alphabet to occur more than twice in succession.

To compress a sequence, one simply replaces a repeated character with one instance of the character followed by a count of the number of times it occurs. For example, a sequence made up of eight runs of repeated characters could be replaced by eight pairs, each consisting of a character and its single-digit count,
reducing the number of characters from 24 to 16. To decompress the sequence, each combination of a character and a count is replaced by the appropriate number of characters.

Protocols need to be established to distinguish between the characters and the counts in the compressed data. While the basic idea of run-length coding is very simple, complex protocols can be developed for particular purposes. The standard for facsimile transmission developed by the International Telegraph and Telephone Consultative Committee (CCITT) (now the International Telecommunication Union) involves such protocols.

4.4 The CCITT Standard for Facsimile Transmission

Facsimile machines have revolutionised the way in which people do business. Sending faxes now accounts for a major part of the traffic on telephone lines. Part of the reason why facsimile machines gained wide acceptance quickly may be that they integrated into the existing telecommunications infrastructure very easily. However, a major factor in their gaining acceptance was the adoption of the world-wide standard proposed by the CCITT, which made it possible for every facsimile machine to communicate with every other facsimile machine, as they all complied with the standard.

The CCITT Group 3 standard is used to send faxes over analogue telephone lines. It specifies two run-length coding operations that are used to compress the data that represents the image that is being transmitted. As the standard was intended primarily for the transmission of pages of text, it only deals with binary images, where a single bit is used to specify whether a pixel is black or white.

The Group 3 standard is designed to transmit A4-sized pages (210 mm by 297 mm). The image is broken into horizontal scan lines which are 215 mm long and contain 1,728 pixels.
The scan lines are either 0.26 mm apart for images transmitted at the standard resolution or 0.13 mm apart for images transmitted at high resolution. This means that the information content of a page is about 2 million bits at standard resolution (roughly 1,140 lines of 1,728 pixels) and about 4 million bits at high resolution.

The images are coded one line of pixels at a time, with a special end-of-line character being used to ensure that the lines are not confused. The standard specifies two run-length coding algorithms, a one-dimensional algorithm that codes a single line of pixels in isolation, and a two-dimensional one that codes a line of pixels in terms of the differences between it and the preceding line of pixels.

The one-dimensional algorithm is used to code the first line of each page. It treats the line of pixels as a succession of runs of white and black pixels in alternation,
and codes the number of pixels in each run into a number of bits using a modified Huffman coding technique. It assumes that the first run of pixels is white; if this is not the case, it transmits the code for zero. Separate Huffman codes are used for the numbers of black and white pixels, as these have different statistics.

The Huffman codes were developed from a set of test documents. They provide representations for run lengths from 0 to 63, and for multiples of 64 up to 1728. Run lengths up to 63 are represented by a single code. Run lengths of 64 or more are represented by concatenating the code for the greatest multiple of 64 that does not exceed the run length with the code for the remaining number of pixels in the run. The following example illustrates the coding process.

EXAMPLE 4.2

Table 4.1 shows the Huffman codes for run lengths from 0 to 8.

Run length   Run of white pixels   Run of black pixels
0            00110101              0000110111
1            000111                010
2            0111                  11
3            1000                  10
4            1011                  011
5            1100                  0011
6            1110                  0010
7            1111                  00011
8            10011                 000101

Table 4.1 Huffman codes of the pixel run lengths.

Using them, a line of pixels that starts

WWWWBBBBBBBWWWBBBBBB

where W denotes a white pixel and B denotes a black pixel would be encoded as

1011 00011 1000 0010

where the spaces have been introduced to show where each code word ends. The number of bits has been reduced from 20 to 17; this is not a great reduction, but substantial compression can be achieved when the run lengths are over 64.

The two-dimensional code describes the differences between the runs of pixels in the line being encoded and the runs in the preceding line. Where a run of pixels in the current line ends within three pixels of the end of a run of pixels of the same colour in the preceding line, this is encoded. Otherwise, the number of pixels in the current run is coded using the one-dimensional code. There is also provision for ignoring
runs in the previous line. Details of the codes used in the two-dimensional algorithm can be found in the standard.

If there is a transmission error, continued use of the two-dimensional coding algorithm can cause it to propagate down the page. For this reason, the one-dimensional algorithm is used to code every second line in standard resolution transmissions and every fourth line in high resolution transmissions, with the intervening lines being encoded using the two-dimensional code. This limits the extent to which errors can propagate.

The decoding process simply generates runs of pixels with the appropriate run lengths to re-create the lines of pixels.

There is also a Group 4 standard that is intended for sending faxes over digital telephone lines. It makes provision for the compression of greyscale and colour images as well as binary images.

4.5 Block-sorting Compression

The Block-sorting Compression technique is designed for compressing texts where the letters that precede each letter form a context for that letter and letters often appear in the same context. It was introduced by Burrows and Wheeler and has been implemented in the bzip2 compression program.

The technique works by breaking the text up into blocks and rearranging the letters in each block so that the resulting sequence can be compressed efficiently using run-length coding or some other simple technique. For decompression, the rearranged sequence is recovered and the original sequence is restored. What makes the technique work is the way in which the original sequence can be restored with a minimum of effort.

The technique is best discussed by way of an example. Suppose we have a block of fifteen letters, say, the_fat_cat_sat. We list the fifteen cyclic permutations of the block,
separating the last letter of the sequence from the rest, to get:

the_fat_cat_sa   t
he_fat_cat_sat   t
e_fat_cat_satt   h
_fat_cat_satth   e
fat_cat_satthe   _
at_cat_satthe_   f
t_cat_satthe_f   a
_cat_satthe_fa   t
cat_satthe_fat   _
at_satthe_fat_   c
t_satthe_fat_c   a
_satthe_fat_ca   t
satthe_fat_cat   _
atthe_fat_cat_   s
tthe_fat_cat_s   a

where the underscore character has been used to represent the spaces between the words. The strings in the left hand column are the contexts of the letters in the right hand column. We now sort these permutations in alphabetical order of the contexts:

at_cat_satthe_   f
atthe_fat_cat_   s
at_satthe_fat_   c
_satthe_fat_ca   t
_cat_satthe_fa   t
the_fat_cat_sa   t
t_satthe_fat_c   a
fat_cat_satthe   _
t_cat_satthe_f   a
_fat_cat_satth   e
tthe_fat_cat_s   a
satthe_fat_cat   _
cat_satthe_fat   _
he_fat_cat_sat   t
e_fat_cat_satt   h

where the ordering is from right to left and the underscore character precedes the alphabetic characters in the ordering.

The sequence of letters in the right hand column is the required rearrangement. The sequence the_fat_cat_sat has become fscttta_aea__th. In this case, three of the t's and two of the underscore characters appear together in the rearranged sequence. In larger blocks, there will be more long runs of the same character. The rearranged sequence can be compressed with a run-length coder or a move-to-front coder.
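The rearrangement described above can be sketched in Python as follows. This is a minimal illustration rather than the way bzip2 or any other program implements it: the function names `block_sort` and `run_lengths` are hypothetical, the sort simply compares whole contexts (which is slow for large blocks), and the ordering of characters is the ASCII one, in which the underscore happens to precede the lower-case letters.

```python
from itertools import groupby


def block_sort(block):
    """Return (rearranged block, row holding the first letter of the original)."""
    n = len(block)

    def context_key(i):
        # The n - 1 letters that cyclically precede block[i], nearest first,
        # so that contexts are compared from right to left.
        return [block[(i - k) % n] for k in range(1, n)]

    order = sorted(range(n), key=context_key)
    rearranged = "".join(block[i] for i in order)
    return rearranged, order.index(0)


def run_lengths(sequence):
    """Simple run-length coding: a list of (character, count) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(sequence)]


rearranged, start = block_sort("the_fat_cat_sat")
print(rearranged, start)        # fscttta_aea__th 13
print(run_lengths(rearranged))
```

The returned row index 13 (counting from 0) is the second last row, which is the starting point needed later to restore the original order.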
(A move-to-front coder is a simple way of generating a code based on a probability distribution. A frequency distribution of the letters is constructed by passing through the sequence and counting the number of times each letter occurs. After each letter is counted, the order of letters in the frequency distribution is changed by moving the current letter to the top of the list. In the end, the list is in approximate order of decreasing frequency, with the more frequently occurring letters appearing near the top of the list. The letters are then encoded using short codes for letters at the top of the list and longer codes for the less frequently occurring letters lower down the list.)

To restore the sequence to the original order, we need to know where the sequence starts and which letter follows which. In the example above, the first letter is the second last letter of the rearranged sequence. The order in which letters follow each other can be found by aligning the last letter in the left hand column of the table above with the letter in the right hand column. This matching can be accomplished very simply by sorting the rearranged sequence into alphabetical order. If we do this, we get:

_   f
_   s
_   c
a   t
a   t
a   t
c   a
e   _
f   a
h   e
s   a
t   _
t   _
t   t
t   h

To reconstruct the original sequence, we start with the letter in the right hand column of the second-last row of the table, knowing that this is the first letter of the original sequence. This is the last t in the right hand column. To find the letter that follows it, we look for the last t in the left hand column. The letter in the right hand column of that row is the following letter, h. There is only one h in the first column, and it is followed by e. We can trace our way through the whole sequence in this way and construct the original text.

In this coding technique, we use the information about structure that is contained in contexts to eliminate redundancy and achieve compression.
This is a very powerfulapproach to compression, and has been used in a number of other ways, as we shallsee in the next sections.
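The restoration procedure for block sorting described above can be sketched in Python as well. The function name `block_unsort` is hypothetical, and the sketch assumes that the decoder is given the rearranged sequence together with the row that holds the first letter of the original text (row 13, counting from 0, for the example block). It relies on the fact that equal letters appear in the same relative order in the sorted column and in the rearranged column, so a stable sort is enough to pair them up.

```python
def block_unsort(rearranged, start):
    """Rebuild the original block from the rearranged letters and the start row."""
    n = len(rearranged)
    # Stable sort of the rows by letter: sorted_rows[r] is the row of the
    # rearranged column whose letter ends up in row r of the sorted column.
    sorted_rows = sorted(range(n), key=lambda j: rearranged[j])
    # next_row[j] is the row whose left hand letter is the same occurrence as
    # rearranged[j]; its right hand letter is the one that follows in the text.
    next_row = [0] * n
    for r, j in enumerate(sorted_rows):
        next_row[j] = r
    letters = []
    row = start
    for _ in range(n):
        letters.append(rearranged[row])
        row = next_row[row]
    return "".join(letters)


print(block_unsort("fscttta_aea__th", 13))   # the_fat_cat_sat
```

Only one sort of the rearranged letters is needed, which is what keeps the cost of restoring the original sequence low.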
4.6 Dictionary Coding

Texts such as books, newspaper articles and computer programs usually consist of words and numbers separated by punctuation and spaces. An obvious way of compressing these is to make a list of all the words and numbers appearing in the text, and then convert the text to a sequence of indices which point to the entries in the list. (The list of words is referred to as the dictionary. This is perhaps a different usage of the word from the common one, where a dictionary lists both words and their meanings. Lexicon might be a better name, but the methods that are based on this idea are generally referred to as dictionary coding methods.) This can be a simple and powerful means of data compression, but it has some drawbacks.

First, it is suitable primarily for text and not for images or other data. It would probably not work well on purely numerical data. Second, both the compressor and the decompressor need to have access to the dictionary. This can be achieved if the dictionary is stored or transmitted along with the compressed text, but this could require large amounts of overhead, especially for small texts. Alternatively, a large dictionary could be constructed and held by both the compressor and decompressor and used for all texts. This is likely to be inefficient, as any given text will use only a small proportion of the words in the dictionary.

It is therefore better to construct a dictionary for each text. The techniques that we describe in this section enable both the compressor and decompressor to do this in a way that does not require any overhead for the transmission or storage of the dictionary. The trick is to use the part of the text that has already been compressed or decompressed to construct the dictionary.

This trick was proposed by Ziv and Lempel.
It has been used in a number of algorithms which have been implemented in the compression software that is commonly available today.

The first algorithm was proposed by Ziv and Lempel in 1977, and will be referred to as LZ77. In this algorithm, the compressed text consists of a sequence of triples, with each triple consisting of two numbers, $m$ and $n$, and a character, $c$. The triple, $(m, n, c)$, is interpreted as an instruction indicating that the next $n+1$ characters match the $n$ characters located $m$ characters back plus the next character, $c$. For example, the triple $(23, 12, q)$ is interpreted to mean: the next 12 characters match the 12 characters found 23 characters back in the text, copy them from there and then append a "q" to the text.

The compressor scans the text, trying to match the next characters to a sequence of characters earlier in the text. If it succeeds, it creates the appropriate triple. If it fails, it creates a triple in which the numbers are zero, indicating this failure. The decompressor reverses this process, appending sequences of characters from the section of
the text that has already been decompressed and following them with the characters in the triples.

EXAMPLE 4.3

The text the fat cat sat on the mat. can be used to illustrate the LZ77 algorithm. It is too short for the algorithm to compress it effectively, but the repetitions in the text make it good for the purposes of illustration.

We will use the underscore character to represent the space character, so the text is

the_fat_cat_sat_on_the_mat.

When we start the coding, there is no earlier text to refer to, so we have to code the first nine characters by themselves. The compressed text begins:

(0, 0, t) (0, 0, h) (0, 0, e) (0, 0, _) (0, 0, f) (0, 0, a) (0, 0, t) (0, 0, _) (0, 0, c)

At this point we can now refer back to the sequence at_, so we can code the next eight characters in two triples:

(4, 3, s) (4, 3, o)

The next two characters have to be coded individually:

(0, 0, n) (0, 0, _)

For the last eight characters, we can make two references back into the text:

(19, 4, m) (11, 2, .)

The references in this example are shown diagrammatically in Figure 4.1.

Implementations of the LZ77 algorithm usually set limits on the amount of text that is searched for matching sequences and on the number of matching characters copied at each step. There are also minor variations that have been used to improve its efficiency. For example, the gzip program uses a hash table to locate possible starting points of matching sequences.

The algorithm proposed by Ziv and Lempel in 1978, which will be referred to as LZ78, builds a set of sequences that is used as a dictionary as the compression and decompression progress. The compressed text consists of references to this dictionary instead of references to positions in the text. The sequences can be stored in a tree structure, which makes it easy to carry out the matching. In this algorithm, the compressed text consists of a sequence of pairs, with each pair $(i, c)$ interpreted as
the $i$th item in the dictionary followed by the character $c$. The compression process adds an entry to the dictionary at each stage. The decompression process builds the same dictionary in a similar way to the compression algorithm using the original text that is being generated from the coded sequence. The success of this operation relies on the fact that references to the dictionary are made after that part of the dictionary has been built.

[FIGURE 4.1: Diagrammatic representation of references in the LZ77 algorithm.]

EXAMPLE 4.4

To compress the text

the_fat_cat_sat_on_the_mat.

using LZ78, both the compressor and decompressor begin with a dictionary consisting of the empty sequence. This is item number zero in the dictionary.

The compressed text consists of pairs consisting of a dictionary reference and a character. If there is no matching item in the dictionary, item 0 is referenced and the next character given. Thus the first six characters will be coded as

(0, t) (0, h) (0, e) (0, _) (0, f) (0, a)

and the following items will be added to the dictionary:

Item 1: t
Item 2: h
Item 3: e
Item 4: _
Item 5: f
Item 6: a
The next character is already in the dictionary, so it is coded together with the character that follows it as

(1,_)

and

Item 7: t_

is added to the dictionary.

The next character has not occurred before, so it is coded as

(0,c)

and

Item 8: c

is added to the dictionary.

The next two characters are coded as

(6,t)

and

Item 9: at

is added to the dictionary.

The steps above and the rest of the coding are summarised in Table 4.2.

EXAMPLE 4.5
The compression capabilities of LZ78 can be better appreciated from the following sequence of text

abababababababababab.

where the coding is summarised in Table 4.3. It should be noted that at each stage the longest sequence in the dictionary that matches the text at the current position is selected. Thus it is evident from the table that the LZ78 algorithm builds successively longer string patterns as more of the sequence is processed, achieving a corresponding improvement in compression.
Item sequence   Code    Item number   Current sequence
t               (0,t)    1            t
h               (0,h)    2            th
e               (0,e)    3            the
_               (0,_)    4            the_
f               (0,f)    5            the_f
a               (0,a)    6            the_fa
t_              (1,_)    7            the_fat_
c               (0,c)    8            the_fat_c
at              (6,t)    9            the_fat_cat
_s              (4,s)   10            the_fat_cat_s
at_             (9,_)   11            the_fat_cat_sat_
o               (0,o)   12            the_fat_cat_sat_o
n               (0,n)   13            the_fat_cat_sat_on
_t              (4,t)   14            the_fat_cat_sat_on_t
he              (2,e)   15            the_fat_cat_sat_on_the
_m              (4,m)   16            the_fat_cat_sat_on_the_m
at.             (9,.)   17            the_fat_cat_sat_on_the_mat.

Table 4.2 Example of coding using LZ78.

Item sequence   Code    Item number   Current sequence
a               (0,a)    1            a
b               (0,b)    2            ab
ab              (1,b)    3            abab
aba             (3,a)    4            abababa
ba              (2,a)    5            ababababa
bab             (5,b)    6            abababababab
abab            (4,b)    7            abababababababab
abab.           (7,.)    8            abababababababababab.

Table 4.3 Another example of coding using LZ78.
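The codings in Tables 4.2 and 4.3 can be reproduced with a short program. The following is a minimal sketch in Python, not a reference implementation: the function names lz78_encode and lz78_decode are chosen here for illustration, the dictionary is held in an ordinary Python dict rather than a tree, and a final pair with an empty character is emitted if the text ends in the middle of a match (a convention assumed here, which the examples above never need).

```python
def lz78_encode(text):
    """Return the list of (item number, character) pairs for text."""
    dictionary = {"": 0}            # item 0 is the empty sequence
    pairs = []
    current = ""                    # longest match found so far
    for ch in text:
        if current + ch in dictionary:
            current += ch           # keep extending the match
        else:
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)   # new item
            current = ""
    if current:                     # text ended inside a match (assumed convention)
        pairs.append((dictionary[current], ""))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text, growing the same dictionary as the encoder."""
    items = [""]                    # item 0 is the empty sequence
    pieces = []
    for index, ch in pairs:
        entry = items[index] + ch
        items.append(entry)
        pieces.append(entry)
    return "".join(pieces)


if __name__ == "__main__":
    for sample in ("the_fat_cat_sat_on_the_mat.", "abababababababababab."):
        coded = lz78_encode(sample)
        print(coded)
        print(lz78_decode(coded) == sample)
```

The decoder never needs the dictionary to be transmitted: each pair refers only to items that earlier pairs have already created, which is the property the text relies on.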
Characters   Code   Item number   Item sequence   Current sequence
t            116    128           th              t
h            104    129           he              th
e            101    130           e_              the
_             32    131           _f              the_
f            102    132           fa              the_f
a             97    133           at              the_fa
t            116    134           t_              the_fat
_             32    135           _c              the_fat_
c             99    136           ca              the_fat_c
at           133    137           at_             the_fat_cat
_             32    138           _s              the_fat_cat_
s            115    139           sa              the_fat_cat_s
at_          137    140           at_o            the_fat_cat_sat_
o            111    141           on              the_fat_cat_sat_o
n            110    142           n_              the_fat_cat_sat_on
_             32    143           _t              the_fat_cat_sat_on_
th           128    144           the             the_fat_cat_sat_on_th
e_           130    145           e_m             the_fat_cat_sat_on_the_
m            109    146           ma              the_fat_cat_sat_on_the_m
at           133    147           at.             the_fat_cat_sat_on_the_mat
.             46                                  the_fat_cat_sat_on_the_mat.

Table 4.4 Example of coding using LZW.

Welch [7] proposed a variation of LZ78 in which the compressor and decompressor begin with the set of characters in the ASCII alphabet and extend this to sequences of two or more characters. This means that no characters are transmitted, only their codes. The algorithm codes the longest sequence of characters from the current position in the text that already exists in the dictionary and forms a new dictionary entry comprising the just coded sequence plus the next character. This algorithm is known as LZW. It is the algorithm implemented in the UNIX utility compress.

EXAMPLE 4.6
To compress the text

the_fat_cat_sat_on_the_mat.

the compressor and decompressor begin with a dictionary consisting of the 128 ASCII characters (codes 0 to 127), so the first new item is number 128. The characters t, h, e and space have ASCII values of 116, 104, 101 and 32, respectively; so these are the first four codes. Table 4.4 summarises the coding process, showing the items that are added to the dictionary at each stage.
In Example 4.6, the LZW algorithm outputs more code words (21) than the corresponding LZ78 coding (17 pairs). However, it should be noted that for LZW the code words are transmitted as 8-bit numbers, whereas LZ78 will require more than 8 bits to transmit both the character and the corresponding index into the dictionary.

EXAMPLE 4.7
Compression of the text:

abababababababababab.

from Example 4.5 is summarised in Table 4.5. The coded sequence 97, 98, 128, 130, 129, 132, 131, 131, 46 is transmitted in 9 × 8 = 72 bits whereas the original sequence takes up 21 × 8 = 168 bits. This represents a reduction in size of more than 50%.

Characters   Code   Item number   Item sequence   Current sequence
a             97    128           ab              a
b             98    129           ba              ab
ab           128    130           aba             abab
aba          130    131           abab            abababa
ba           129    132           bab             ababababa
bab          132    133           baba            abababababab
abab         131    134           ababa           abababababababab
abab         131    135           abab.           abababababababababab
.             46                                  abababababababababab.

Table 4.5 Another example of coding using LZW.

4.7 Statistical Compression

Statistical Compression or Model-based Coding techniques use probabilistic models which predict the next character in the data on the basis of probability distributions that are constructed adaptively from the data. The compressor builds up its models as it compresses the data, while the decompressor builds up identical models as it decompresses the data. Figure 4.2 is a block diagram of a statistical compression system. The system consists of a compressor and a decompressor, each with its own predictor.
FIGURE 4.2
Block diagram of a statistical compression system. (Data enters the compressor, which is driven by the predictions of its predictor; the compressed data passes to the decompressor, which is driven by its own predictor and outputs the data.)

As each character enters the compressor, its predictor generates a probability distribution which is used by the compressor to encode the character, using Huffman coding or arithmetic coding. The predictor then updates its models.

When the encoded character enters the decompressor, its predictor generates the same probability distribution as was used to encode the character, which the decompressor uses to decode it. The decoded character is then used by the decompressor's predictor to update its models.

As long as the predictors start off with the same models and update them in the same way as each character is processed, the models used by the compressor and decompressor will remain synchronised.

Any predictor that makes predictions in the form of a probability distribution over the set of possible characters can be used in a statistical compressor. For example, n-gram models can be used.

4.8 Prediction by Partial Matching

Prediction by Partial Matching (PPM) is the most commonly used statistical compression technique. It uses a collection of n-gram models to provide the required probability distributions.

A PPM compressor or decompressor builds up n-gram models for values of $n$ from 1 to some maximum, usually 3 or 4. These n-gram models consist of frequency counts for each of the n-grams that have appeared in the text so far. The n-gram models
are used to estimate the probability of the next character given the previous $n - 1$ characters.

The reason why it is necessary to have a sequence of n-gram models instead of just one is that the sequence of characters comprising the previous $n - 1$ characters followed by the next character may not have appeared before in the text. When this occurs, the n-gram model cannot be used to estimate the probability of the next character. This problem is usually dealt with by implementing an escape mechanism, where an attempt is made to estimate the probability using the n-gram models with successively smaller contexts until one is found for which the probability is non-zero.

In the encoding process, the model with the largest value of $n$ is searched for contexts that match the current context. If the n-gram that is made up of the current context followed by the next character in the text has a frequency of zero, a special character, called the escape character, is encoded and the contexts of the next largest n-gram model are examined. This process continues until a non-zero frequency is found. When this happens, the corresponding probability distribution is used to encode the next character, using arithmetic coding. The frequency counts of all the n-gram models maintained by the compressor are then updated.

In the decoding process, the decoder selects the appropriate n-gram model on the basis of the number of escape characters that precede the coded character. The probability distribution derived from that n-gram model is used to decode the character, and the frequency counts of all the n-gram models maintained by the decompressor are updated.

There are a number of variations on the basic PPM scheme. Most of them differ in the formulae that they use to convert the frequency counts in the n-gram models to probabilities.
In particular, the probability that should be assigned to the escape character may be computed in a number of different ways.

PPM compression has been shown to achieve compression ratios as good as or better than the dictionary-based methods and block-sorting compression. It has also been shown that given a dictionary encoder, it is possible to construct a statistical encoder that achieves the same compression.

4.9 Image Coding

The compression techniques that have been discussed above were designed for the compression of text and similar data, where there is a sequence of characters which makes the data essentially one-dimensional. Images, however, are inherently two-dimensional and there are two-dimensional redundancies that can be used as the basis of compression algorithms.
Nevertheless, the one-dimensional techniques have been shown to work quite effectively on image data. A two-dimensional array of pixel values can be converted to a one-dimensional array by concatenating successive rows of pixels. Text compression methods can be applied to this one-dimensional array, even though the statistics of the pixel values will be very different from the statistics of characters in texts. Compression programs such as compress and zip have been found to reduce image files to about half their uncompressed size.

The Graphics Interchange Format (GIF) has been in use since 1987 as a standard format for image files. It incorporates a compression scheme to reduce the size of the image file. The GIF uses a palette of 256 colours to describe the pixel values of each image. The palette represents a selection of colours from a large colour space and is tailored to each image. Each pixel in the image is given an 8-bit value that specifies one of the colours in the palette. The image is converted to a one-dimensional array of pixel values and the LZW algorithm is applied to this array.

The GIF is being supplanted by the Portable Network Graphics (PNG) format. This uses the gzip compression scheme instead of LZW. It also has a number of other improvements, including preprocessing filters that are applied to each row of pixels before compression and extended colour palettes.

Images also come in various colour formats. The simplest are the binary images, where each pixel is black or white, and each pixel value can be specified by a single bit. Greyscale images use integer values to specify a range of shades of grey from black to white. The most common practice is to use eight-bit integers to specify greyscale values, except in medical imaging where twelve bits or sixteen bits may be used to achieve the desired resolution in grey levels. Colour images usually require three numbers to specify pixel values.
A common format is to use eight-bit integers to specify red, green and blue values, but other representations are also used. Multispectral images, generated by systems that collect visible, infrared or ultraviolet light, may generate several integer values for each pixel.

The different colour formats have their own statistical characteristics. Binary images usually consist of runs of white pixels alternating with runs of black pixels. Run-length encoding is an effective means of compressing such images, and the CCITT Group 3 standard for facsimile transmissions described above is an example of run-length coding applied to binary images.

Greyscale images are less likely to have runs of pixels with the same greyscale values. However, there may be similarities between successive rows of the image which will make dictionary-based methods or statistical compression effective. Colour images and multispectral images can be compressed as three or more separate greyscale images. There may be correlations between the component images in these cases, but it is difficult to use the resulting redundancy to improve compression performance.

One-dimensional compression techniques cannot make use of the redundancies that arise from the two-dimensional structure of images if they are applied to single rows or columns of pixels. It is possible, however, to implement dictionary-based compression or PPM compression using two-dimensional contexts. Figure 4.3 shows a
set of contexts consisting of four pixels, three pixels, two pixels or one pixel that could be used in a PPM compressor. The pixels marked "C" represent the context pixels used to predict the centre pixel (marked "P"). The positions of the pixels that are used to predict the centre pixel are such that these pixels would be encoded and decoded before the pixel in the centre. The full contexts would not be available for pixels at the edge of the image; so the escape mechanism would have to be invoked.

FIGURE 4.3
Two-dimensional contexts of a pixel for PPM compression.

Another way in which to take advantage of the two-dimensional structure of images is to use pyramid coding. This generates a set of approximations to the image by aggregating adjacent pixel values. A common practice is to divide the image into 2 × 2 subimages so that each image is half as wide and half as high as the one from which it is derived. The approximation process is repeated to form a sequence of images which may end with an image consisting of a single pixel.

Reversing the process creates a sequence of images that show increasing detail. This is a useful feature for applications where images are downloaded from a remote source. The sequence of images can be displayed to give an indication of the progress of the download operation.

Generating the sequence of images actually increases the number of pixel values required to specify the image. Compression techniques can be used to reduce the total size of the images to less than the size of the original image.
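The pyramid construction described above can be sketched in a few lines of Python. Several choices here are assumptions made for illustration only: the image is a square list of lists of non-negative integers whose side is a power of two, each 2 × 2 subimage is aggregated by the median of its four values with halves rounded up, each level is stored as differences from the coarser level derived from it, and the function names are hypothetical.

```python
def median_of_four(values):
    """Median of four non-negative integers, halves rounded up."""
    s = sorted(values)
    return (s[1] + s[2] + 1) // 2


def build_pyramid(image):
    """Return [image, half-size image, ..., single-pixel image]."""
    levels = [image]
    while len(levels[-1]) > 1:
        fine = levels[-1]
        half = len(fine) // 2
        levels.append([
            [median_of_four([fine[2 * r][2 * c], fine[2 * r][2 * c + 1],
                             fine[2 * r + 1][2 * c], fine[2 * r + 1][2 * c + 1]])
             for c in range(half)]
            for r in range(half)
        ])
    return levels


def pyramid_differences(levels):
    """Replace each level by its differences from the coarser level."""
    diffs = []
    for fine, coarse in zip(levels, levels[1:]):
        n = len(fine)
        diffs.append([[fine[r][c] - coarse[r // 2][c // 2] for c in range(n)]
                      for r in range(n)])
    diffs.append(levels[-1])        # the single pixel is kept as it is
    return diffs


def reconstruct(diffs):
    """Start from the single pixel and add the differences level by level."""
    image = diffs[-1]
    for d in reversed(diffs[:-1]):
        n = len(d)
        image = [[d[r][c] + image[r // 2][c // 2] for c in range(n)]
                 for r in range(n)]
    return image


if __name__ == "__main__":
    image = [[10, 12, 20, 22],
             [14, 16, 24, 26],
             [30, 32, 40, 42],
             [34, 36, 44, 46]]
    levels = build_pyramid(image)
    print(levels[1:])
    print(reconstruct(pyramid_differences(levels)) == image)
```

Displaying the intermediate results of reconstruct, coarsest first, gives the progressive sequence of images of increasing detail mentioned above.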
EXAMPLE 4.8
Figure 4.4 illustrates the construction of a 4-level image pyramid, starting with an 8-pixel × 8-pixel array. Each image is divided into an array of 2-pixel × 2-pixel subimages. The pixel values in each of these subimages are used to compute the pixel value of the corresponding image at the next level down. There are a number of ways of doing this; in Figure 4.4 the median of the four pixel values (rounded to the nearest integer) is taken.

The image pyramid contains a total of 85 pixel values (64 + 16 + 4 + 1), in comparison to the 64 pixel values in the original image. To make it possible to achieve some compression, we can replace the pixel values at each level with the differences between the pixel values and the value of the corresponding pixel in the level below.

Figure 4.5 shows the differences. It is possible to recreate the original image by starting with the single-pixel image and adding and subtracting the differences at each level. Taking differences can reduce the range of pixel values in the images. Comparing the maximum and minimum pixel values of the original 8 × 8 image in Figure 4.4 with the maximum and minimum differences in the corresponding difference image in Figure 4.5, and doing the same for the 4 × 4 image, the range of the differences is about half the range of the pixel values in both cases; so it should be possible to code the differences using one less bit per difference than it would take to code the pixel values.

Full use is made of the two-dimensional structure of images in transform coding methods. A two-dimensional transform is used to concentrate the information content of an image into particular parts of the transformed image, which can then be coded efficiently.
The most common transform used for this purpose is the Discrete Cosine Transform (DCT).

To use the DCT for transform coding, the image is broken up into square subimages, usually 8 pixels by 8 pixels or 16 pixels by 16 pixels. The pixel values in these arrays are then subjected to the two-dimensional DCT, which produces a square array of transform coefficients. Even though the pixel values are integers, the transform coefficients are real numbers.

The DCT acts to concentrate the information contained in the array of pixels into a corner of the transform array, usually at the top left corner. The largest transform coefficients appear in this corner, and the magnitude of the coefficients diminishes diagonally across the array so that the transform coefficients near the bottom right corner are close to zero.

If the transform coefficients are scanned into a one-dimensional array using a zig-zag pattern like that shown in Figure 4.6, the coefficients that are close to zero come at the end of the array. They can be thresholded and discarded. The remaining coefficients can then be coded using Huffman coding or arithmetic coding.
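The transform, scan and threshold steps can be illustrated with a small pure-Python sketch. It is not the JPEG algorithm: the orthonormal DCT-II normalisation, the particular zig-zag order (assumed here to start by moving right along the top row, as in JPEG, since Figure 4.6 is not reproduced), the discarding of only the trailing small coefficients, and all function names are assumptions made for illustration.

```python
import math


def _alpha(k, n):
    """Scale factor that makes the transform orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def _basis(k, x, n):
    """Cosine basis function of frequency k sampled at position x."""
    return math.cos((2 * x + 1) * k * math.pi / (2 * n))


def dct_2d(block):
    """Two-dimensional DCT of a square block of pixel values."""
    n = len(block)
    return [[_alpha(u, n) * _alpha(v, n) *
             sum(block[x][y] * _basis(u, x, n) * _basis(v, y, n)
                 for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


def idct_2d(coeffs):
    """Inverse of dct_2d."""
    n = len(coeffs)
    return [[sum(_alpha(u, n) * _alpha(v, n) * coeffs[u][v] *
                 _basis(u, x, n) * _basis(v, y, n)
                 for u in range(n) for v in range(n))
             for y in range(n)] for x in range(n)]


def zigzag_order(n):
    """(row, column) pairs visited along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        diagonal = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:
            diagonal.reverse()      # even diagonals run from bottom-left upwards
        order.extend(diagonal)
    return order


def compress_block(block, threshold):
    """Scan the DCT coefficients and drop the trailing small ones."""
    coeffs = dct_2d(block)
    scanned = [coeffs[r][c] for r, c in zigzag_order(len(block))]
    while scanned and abs(scanned[-1]) < threshold:
        scanned.pop()
    return scanned


def restore_block(scanned, n):
    """Put the kept coefficients back, zero the rest, invert the DCT."""
    coeffs = [[0.0] * n for _ in range(n)]
    for value, (r, c) in zip(scanned, zigzag_order(n)):
        coeffs[r][c] = value
    return [[int(round(p)) for p in row] for row in idct_2d(coeffs)]


if __name__ == "__main__":
    flat = [[100] * 8 for _ in range(8)]
    kept = compress_block(flat, 0.5)
    print(len(kept), restore_block(kept, 8) == flat)
```

For a block of constant value only the top-left coefficient survives the threshold, which is the extreme case of the concentration of information described above; blocks with more detail keep more coefficients.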
To reverse the compression process, the coefficients are restored to an array, the remaining coefficients are set to zero and the inverse DCT is applied. This results in a lossy reconstruction, the amount of information lost depending on the number of DCT coefficients that were set to zero. Differences in the way neighbouring subimages are treated can also result in the reconstructed image exhibiting discontinuities across the boundaries of the subimages.

The use of the DCT in transform coding can result in very high compression ratios, at the cost of some loss of fidelity. The JPEG standard [5] for still image compression (named after its creators, the Joint Photographic Experts Group) uses DCT-based transform coding. The original JPEG standard is being superseded by the JPEG 2000 standard, which adds many features.

For compressing sequences of images, such as those generated by video cameras, a further level of redundancy can be exploited because each image is very similar to the one that precedes it, except when there is an abrupt change of scene. This is the two-dimensional analogue of the process in the CCITT Group 3 standard where lines are coded in terms of the differences between them and the preceding line. The MPEG standard [6] for video compression (devised by the Moving Picture Experts Group) uses this to compress video sequences. The MPEG standard also includes algorithms for coding of digitized audio data.

4.10 Exercises

1. The sequence is generated by a process that follows the following rules: (1) may be followed by , , or ; (2) may be followed by or ; (3) is always followed by ; (4) may be followed by , , or ; (5) may be followed by or ; and (6) the sequence must begin with or . Devise an encoding scheme that uses these rules to encode the sequence in as few bits as possible. How many bits are required to encode the sequence? How many bits would be required if the occurrence of the letters in the sequence was random?

2. The following table describes the behaviour of a Markov source with four states 1, 2, 3, 4 and alphabet {A, B, C, D}.

Initial state   Final state   Probability of transition   Output symbol
1               1             0.25                        A
1               2             0.25                        B
1               3             0.25                        C
1               4             0.25                        D
2               1             0.50                        A
2               4             0.50                        B
3               2             0.50                        C
3               4             0.50                        D
4               3             0.50                        A
4               4             0.50                        B

Given that the source always begins in state 1, devise an efficient way of coding the output of the source using the alphabet {0, 1}.

3. Use the one-dimensional run-length coding algorithm of the CCITT Group 3 standard and the Huffman codes in Table 4.1 to code the following sequences of pixels

WWWWW WWWWWWW WWWWWWWWWWWW WWWWW

where W denotes a white pixel and B denotes a black pixel. How much compression is achieved by the coding?

4. Apply the block-sorting process to the string hickory_dickory_dock. (The underscore characters denote spaces between the words.)

5. The string bbs aaclaakpee abh is the result of block-sorting a string of sixteen letters and three spaces (denoted by the underscore character, _). Given that the first letter is the last b in the sequence and that the underscore character precedes the letters in the sorting order, restore the sequence to the original order.

6. Use the LZ77 algorithm to compress and decompress the string
she sells seashells on the seashore.

7. Use the LZ78 algorithm to compress and decompress the string
she sells seashells on the seashore.

8. Use the LZW algorithm to compress and decompress the string
she sells seashells on the seashore.

9. Figure 4.7 shows pixel values in an 8 × 8 image. Construct a four-level image pyramid from these values and compute the differences in the pyramid.
142 142 142 142   0   0   0   0
142 142 142 142   0   0   0   0
142 142 142 142   0   0   0   0
142 142 142 142   0   0   0   0
  0   0   0   0 142 142 142 142
  0   0   0   0 142 142 142 142
  0   0   0   0 142 142 142 142
  0   0   0   0 142 142 142 142

FIGURE 4.7
Pixel values in an 8 × 8 image.

10. Apply the two-dimensional Discrete Cosine Transform to the pixel values shown in Figure 4.7. Rearrange the output of the transform in a one-dimensional array using the zig-zag scanning pattern shown in Figure 4.6. How many of the transform coefficients are small enough to be ignored? Set the values of the last fifteen transform coefficients to zero and apply the inverse Discrete Cosine Transform to recover the image. How much have the pixel values changed?

4.11 References

[1] M. Burrows and D. J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, 1994.
[2] The bzip2 and libbzip2 official home page, http://sources.redhat.com/bzip2/index.html
[3] The gzip home page, http://www.gzip.org
[4] The International Telecommunication Union, http://www.itu.int/home/index.html
[5] The JPEG home page, http://www.jpeg.org
[6] The MPEG home page, http://mpeg.telecomitalialab.com
[7] T. A. Welch, A technique for high performance data compression, IEEE Comput., 17(6), 8–20, 1984.
[8] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, IT-23(3), 337–343, 1977.
[9] J. Ziv and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans. Inform. Theory, IT-24(5), 530–536, 1978.
Chapter 5
Fundamentals of Channel Coding

5.1 Introduction

In Chapter 2 we discussed the discrete memoryless channel as a model for the transmission of information from a transmitting source to a receiving destination. The mutual information, $I(A;B)$, of a channel with input $A$ and output $B$ measures the amount of information a channel is able to convey about the source. If the channel is noiseless then this is equal to the information content of the source $A$, that is $I(A;B) = H(A)$. However, in the presence of noise there is an uncertainty or equivocation, $H(A|B)$, that reduces the mutual information to $I(A;B) = H(A) - H(A|B)$. In practice the presence of noise manifests itself as random errors in the transmission of the symbols. Depending on the information being transmitted and its use, random errors may be tolerable (e.g., occasional bit errors in the transmission of an image may not yield any noticeable degradation in picture quality) or fatal (e.g., with Huffman codes discussed in Chapter 3, any bit errors in the encoded variable-length output will create a cascade of errors in the decoded sequence due to the loss of coding synchronisation). Furthermore, the errors tolerable at low error rates of 1 in a million (or a probability of error of $10^{-6}$) may become intolerable at higher error rates of 1 in 10 or 1 in 100. When the errors introduced by the information channel are unacceptable then channel coding is needed to reduce the error rate and improve the reliability of the transmission.

The use of channel coders with source coders in a modern digital communication system to provide efficient and reliable transmission of information in the presence of noise is shown in Figure 5.1. Our discussion of channel codes will be principally concerned with binary block codes, that is, both the channel coder inputs and outputs will be in binary and fixed-length block codes will be used. Since a digital communication system uses a binary channel (most typically a BSC) and the source coder will encode the source to a binary code, then the intervening channel coder will code binary messages to binary codes. The operation of a channel coder is provided by the following definition.
DEFINITION 5.1 Channel Coding
The channel encoder separates or segments the incoming bit stream (the output of the source encoder) into equal length blocks of $L$ binary digits and maps each $L$-bit message block into an $N$-bit code word where $N > L$ and the extra $N - L$ check bits provide the required error protection. There are $M = 2^L$ messages and thus $2^L$ code words of length $N$ bits. The channel decoder maps the received $N$-bit word to the most likely code word and inversely maps the $N$-bit code word to the corresponding $L$-bit message.

FIGURE 5.1
Noisy communication system. (The source message passes through the source encoder, which produces an $L$-bit message, the channel encoder, which produces an $N$-bit code word, and the modulator; noise in the digital channel causes bit errors; the demodulator delivers the received $N$-bit word to the channel decoder, which passes an $L$-bit message to the source decoder and the receiver. A feedback channel runs from the channel decoder to the channel encoder for use when an error is detected.)

The channel coder performs a simple mapping operation from the input $L$-bit message to the corresponding $N$-bit code word where $N > L$. The rationale behind this mapping operation is that the redundancy introduced by the extra $N - L$ bits can be used to detect the presence of bit errors or indeed identify which bits are in error and correct them (by inverting the bits). How does this work? Since $N > L$, there are $2^N - 2^L$ received words of length $N$ bits that are not code words (where $2^N$ is the size of the space of all words of length $N$ bits and $2^L$ is the size of the subset corresponding to the code words). The key idea is that a bit error will change the code word to a non-code word which can then be detected. It should be obvious that this is only possible if $N > L$.

The channel decoder has the task of detecting that there has been a bit error and, if possible, correcting the bit error.
The channel decoder can resolve bit errors by two different systems for error-control.

DEFINITION 5.2 Automatic-Repeat-Request (ARQ)
If the channel decoder performs error detection then errors can be detected and a feedback channel from the channel decoder to the channel encoder can be used to control the retransmission of the code word until the code word is received without detectable errors.
DEFINITION 5.3 Forward Error Correction (FEC)
If the channel decoder performs error correction then errors are not only detected but the bits in error can be identified and corrected (by bit inversion).

In this chapter we present the theoretical framework for the analysis and design of channel codes, define metrics that can be used to evaluate the performance of channel codes and present Shannon's Fundamental Coding Theorem, one of the main motivators for the continuing research in the design of ever more powerful channel codes.

5.2 Code Rate

A price has to be paid to enable a channel coder to perform error detection and error correction. The extra $N - L$ bits require more bits to be pushed into the channel than were generated from the source coder, thus requiring a channel with a higher bit-rate or bandwidth.

Assume that the source coder generates messages at an average bit rate of $n_s$ bits per second, that is, 1 bit transmitted every $T_s = 1/n_s$ seconds. If the channel encoder maps each $L$-bit message (of information) into an $N$-bit code word then the channel bit rate will be $n_c = \frac{N}{L} n_s$ and since $N > L$ there will be more bits transmitted through the channel than bits entering the channel encoder. Thus the channel must transmit bits faster than the source encoder can produce them.

DEFINITION 5.4 Code Rate (Channel Codes)
In Chapter 2 the general expression for the code rate was given as:

$$R = \frac{H(M)}{n} \qquad (5.1)$$

For the case of channel coding we assume $M$ equally likely messages where $M = 2^L$ and each message is transmitted as the code word of length $N$. Thus $H(M) = \log_2 M = L$ and $n = N$, yielding the code rate for channel codes:

$$R = \frac{L}{N} = \frac{n_s}{n_c} = \frac{T_c}{T_s} \qquad (5.2)$$

where $T_c = 1/n_c$ is the duration of one channel bit.

The code rate, $R$, measures the relative amount of information conveyed in each code word and is one of the key measures of channel coding performance. A higher value for $R$ (up to its maximum value of 1) implies that there are fewer redundant $N - L$ check bits relative to the code word length $N$.
The upside in a higher value for the code rate is that more message information is transmitted with each code word since for fixed $N$ this implies larger values of $L$. The downside is that with fewer check bits and less redundancy a higher code rate will make it more difficult to cope with transmission errors in the system. In Section 5.7 we introduce Shannon's Fundamental Theorem which states that $R$ must not exceed the channel capacity for error-free transmission.

In some systems the message rate, $n_s$, and channel rate, $n_c$, are fixed design parameters and for a given message length $L$ the code word length is selected based on $N = \lfloor L n_c / n_s \rfloor$, that is, the largest integer not exceeding $L n_c / n_s$. If $L n_c / n_s$ is not an integer then "dummy" bits have to be transmitted occasionally to keep the message and code streams synchronised with each other.

EXAMPLE 5.1
Consider a system where the message generation rate is $n_s = 3$ bps and the channel bit rate is fixed at $n_c = 4$ bps. This requires a channel coder with an overall $R = 3/4$ code rate.

If we choose to design a channel coder with $L = 3$ then obviously we must choose $N = 4$ without any loss of synchronisation since $L n_c / n_s$ is an integer value. The code rate of this channel coder is $R = 3/4$.

Let us now attempt to design a channel coder with $L = 5$. Then $L n_c / n_s = 6\frac{2}{3}$ and $N = 6$ and the channel coder has an apparent $R = 5/6$ code rate. However, this system will experience a cumulative loss of synchronisation due to the unaccounted for gap of $\frac{2}{3}$ bit per code word transmitted. To compensate for this, 2 "dummy" bits need to be transmitted for every 3 code word transmissions. That is, for every 3 message blocks of $L \times 3 = 15$ bits duration the channel encoder will transmit 3 code words of $N \times 3 = 18$ bits duration plus 2 "dummy" bits, that is, 20 bits in total for an overall $R = 15/20 = 3/4$ code rate. This is depicted in Figure 5.2.

FIGURE 5.2
Channel encoder mapping from message bits (represented by the X's) to the code bits (represented by the circles), including dummy digits for Example 5.1. (Three message blocks of $L = 5$ bits, 15 bits in total, are mapped to three code words of $N = 6$ bits followed by 2 dummy digits, 20 bits in total, giving $R = 3/4$.)
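The arithmetic of Example 5.1 can be checked with a short Python sketch. The function name code_word_plan and the shape of its return value are illustrative assumptions; exact fractions are used so that the gap per code word is not lost to rounding.

```python
from fractions import Fraction


def code_word_plan(L, n_s, n_c):
    """Return (N, apparent rate, code words per cycle, dummy bits per cycle,
    overall rate) for message length L and bit rates n_s and n_c."""
    exact = Fraction(L * n_c, n_s)              # L * n_c / n_s, maybe fractional
    N = exact.numerator // exact.denominator    # largest integer not exceeding it
    gap = exact - N                             # channel bits unaccounted for per code word
    words = gap.denominator                     # code words after which the gap is whole
    dummy = gap.numerator                       # dummy bits that fill the gap
    overall = Fraction(L * words, N * words + dummy)
    return N, Fraction(L, N), words, dummy, overall


if __name__ == "__main__":
    print(code_word_plan(3, 3, 4))   # L = 3: no dummy bits are needed
    print(code_word_plan(5, 3, 4))   # L = 5: 2 dummy bits every 3 code words
```

For $L = 5$ the sketch returns $N = 6$, an apparent rate of 5/6, and 2 dummy bits for every 3 code words, which brings the overall rate back to $n_s / n_c = 3/4$ as in the example.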
5.3 Decoding Rules

When a code word is transmitted through a channel it may be subject to bit errors due to the presence of noise. Thus the received word may be different from the transmitted code word and the receiver will need to make a decision on which code word was transmitted based on some form of decoding rule. The form of decoding rule we adopt will govern how channel codes are used and the level of error protection they provide. We begin the discussion by providing the framework for discussing decoding rules in general and examine two specific decoding rules of interest.

Let $\mathbf{a}_i = a_{i1} a_{i2} \ldots a_{iN}$ be the $i$th $N$-bit code word (for $i = 1, 2, \ldots, M$) that is transmitted through the channel and let $\mathbf{b} = b_1 b_2 \ldots b_N$ be the corresponding $N$-bit word produced at the output of the channel. Let $B_M$ represent the set of $M$ valid $N$-bit code words and let the complement of this set be $B_M^c$, the set of remaining $N$-bit binary sequences which are not code words. Then $B_N = B_M \cup B_M^c$ is the set of all possible $N$-bit binary sequences and $\mathbf{b} \in B_N$ whereas $\mathbf{a}_i \in B_M$. The channel decoder has to apply a decoding rule on the received word $\mathbf{b}$ to decide which code word $\mathbf{a}_i$ was transmitted, that is, if the decision rule is $D(\cdot)$ then $\mathbf{a}_i = D(\mathbf{b})$. Let $P_N(\mathbf{b} \mid \mathbf{a}_i)$ be the probability of receiving $\mathbf{b}$ given $\mathbf{a}_i$ was transmitted. For a discrete memoryless channel this probability can be expressed in terms of the channel probabilities as follows:

$$P_N(\mathbf{b} \mid \mathbf{a}_i) = \prod_{t=1}^{N} P(b_t \mid a_{it}) \qquad (5.3)$$

Let $P_N(\mathbf{a}_i)$ be the a priori probability for the message corresponding to the code word $\mathbf{a}_i$. Then by Bayes' Theorem the probability that message $\mathbf{a}_i$ was transmitted given $\mathbf{b}$ was received is given by:

$$P_N(\mathbf{a}_i \mid \mathbf{b}) = \frac{P_N(\mathbf{b} \mid \mathbf{a}_i) P_N(\mathbf{a}_i)}{P_N(\mathbf{b})} \qquad (5.4)$$

If the decoder decodes $\mathbf{b}$ into the code word $\mathbf{a}_i$ then the probability that this is correct is $P_N(\mathbf{a}_i \mid \mathbf{b})$ and the probability that it is wrong is $1 - P_N(\mathbf{a}_i \mid \mathbf{b})$. Thus to minimise the error the code word $\mathbf{a}_i$ should be chosen so as to maximise $P_N(\mathbf{a}_i \mid \mathbf{b})$. This leads to the minimum-error decoding rule.
DEFINITION 5.5 Minimum-Error Decoding Rule We choose:

$$D_{ME}(\mathbf{b}) = \mathbf{a}^* \tag{5.5}$$

where $\mathbf{a}^* \in \mathbf{B}_M$ is such that:

$$P_N(\mathbf{a}^*|\mathbf{b}) \ge P_N(\mathbf{a}_i|\mathbf{b}) \quad \forall i \tag{5.6}$$

Using Equation 5.4 and noting that $P_N(\mathbf{b})$ is independent of the code word $\mathbf{a}_i$ the condition simplifies to:

$$P_N(\mathbf{b}|\mathbf{a}^*) P_N(\mathbf{a}^*) \ge P_N(\mathbf{b}|\mathbf{a}_i) P_N(\mathbf{a}_i) \quad \forall i \tag{5.7}$$

requiring both knowledge of the channel probabilities and channel input probabilities. This decoding rule guarantees minimum error in decoding.

An alternative decoding rule, the maximum-likelihood decoding rule, is based on maximising $P_N(\mathbf{b}|\mathbf{a}_i)$, the likelihood that $\mathbf{b}$ is received given $\mathbf{a}_i$ was transmitted. This decoding rule is not necessarily minimum error but it is simpler to implement since the channel input probabilities are not required.

DEFINITION 5.6 Maximum-Likelihood Decoding Rule We choose:

$$D_{ML}(\mathbf{b}) = \mathbf{a}^* \tag{5.8}$$

where $\mathbf{a}^* \in \mathbf{B}_M$ is such that:

$$P_N(\mathbf{b}|\mathbf{a}^*) \ge P_N(\mathbf{b}|\mathbf{a}_i) \quad \forall i \tag{5.9}$$

requiring only knowledge of the channel probabilities. This decoding rule does not guarantee minimum error in decoding.

NOTE The maximum-likelihood decoding rule is the same as the minimum-error decoding rule when the channel input probabilities are equal.

Assume we have a decoding rule $D(\cdot)$. Denote $B_i$ as the set of all possible received words $\mathbf{b}$ such that $D(\mathbf{b}) = \mathbf{a}_i$, that is the set of all $N$-bit words that are decoded to the code word $\mathbf{a}_i$, and define its complement as $B_i^c$. Then the probability of a decoding error given code word $\mathbf{a}_i$ was transmitted is given by:

$$P(E|\mathbf{a}_i) = \sum_{\mathbf{b} \in B_i^c} P_N(\mathbf{b}|\mathbf{a}_i) \tag{5.10}$$
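and the overall probability of decoding error is:

$$P(E) = \sum_{i=1}^{M} P(E|\mathbf{a}_i) P(\mathbf{a}_i) \tag{5.11}$$

EXAMPLE 5.2
Consider a BSC with the following channel probabilities:

$$P = \begin{bmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{bmatrix}$$

and channel encoder with $(L, N) = (2, 3)$, that is using a channel code with $M = 2^2 = 4$ code words of $N = 3$ bits in length. Assume the code words transmitted into the channel, and their corresponding probability of occurrence, are:

Code word                $P_N(\mathbf{a}_i)$
$\mathbf{a}_1 = (000)$   0.4
$\mathbf{a}_2 = (011)$   0.2
$\mathbf{a}_3 = (101)$   0.1
$\mathbf{a}_4 = (110)$   0.3

Say the received word at the output of the channel is $\mathbf{b} = 111$. If we apply the maximum likelihood decoding rule of Equation 5.9 then we choose the $\mathbf{a}_i$ that maximises $P_N(\mathbf{b}|\mathbf{a}_i)$. Calculating these probabilities for all of the possible $M = 4$ code words gives:

$$\begin{aligned}
P_N(\mathbf{b}|\mathbf{a}_1) &= P_N(111|000) = P(1|0) \times P(1|0) \times P(1|0) = 0.064 \\
P_N(\mathbf{b}|\mathbf{a}_2) &= P_N(111|011) = P(1|0) \times P(1|1) \times P(1|1) = 0.144 \\
P_N(\mathbf{b}|\mathbf{a}_3) &= P_N(111|101) = P(1|1) \times P(1|0) \times P(1|1) = 0.144 \\
P_N(\mathbf{b}|\mathbf{a}_4) &= P_N(111|110) = P(1|1) \times P(1|1) \times P(1|0) = 0.144
\end{aligned}$$

from which we have that code words $\mathbf{a}_2$, $\mathbf{a}_3$ and $\mathbf{a}_4$ are equally likely to have been transmitted in the maximum likelihood sense if $\mathbf{b} = 111$ was received. For the purposes of decoding we need to make a decision and choose one of $\mathbf{a}_2$, $\mathbf{a}_3$ and $\mathbf{a}_4$, and we choose $D_{ML}(\mathbf{b} = 111) = \mathbf{a}_2$. If we apply the minimum error decoding rule of Equation 5.7 we choose the $\mathbf{a}_i$ that maximises $P_N(\mathbf{b}|\mathbf{a}_i) P_N(\mathbf{a}_i)$. Calculating these probabilities for all of the $M$ code words and using the provided a priori $P_N(\mathbf{a}_i)$ gives:

$$\begin{aligned}
P_N(\mathbf{b}|\mathbf{a}_1) P_N(\mathbf{a}_1) &= 0.064 \times 0.4 = 0.0256 \\
P_N(\mathbf{b}|\mathbf{a}_2) P_N(\mathbf{a}_2) &= 0.144 \times 0.2 = 0.0288 \\
P_N(\mathbf{b}|\mathbf{a}_3) P_N(\mathbf{a}_3) &= 0.144 \times 0.1 = 0.0144 \\
P_N(\mathbf{b}|\mathbf{a}_4) P_N(\mathbf{a}_4) &= 0.144 \times 0.3 = 0.0432
\end{aligned}$$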
from which we have that code word $\mathbf{a}_4$ minimises the error in decoding the received word $\mathbf{b} = 111$, that is $D_{ME}(\mathbf{b} = 111) = \mathbf{a}_4$.

Although the minimum error decoding rule guarantees minimum error it is rarely used in practice, in favour of maximum likelihood decoding, due to the unavailability of the a priori probabilities, $P_N(\mathbf{a}_i)$. Since with arbitrary sources and/or efficient source coding (see Section 3.5.3) one can assume, without serious side effects, that messages are equiprobable, the use of maximum likelihood decoding will usually lead to minimum error.

5.4 Hamming Distance

An important parameter for analysing and designing codes for robustness in the presence of errors is the number of bit errors between the transmitted code word, $\mathbf{a}_i$, and the received word, $\mathbf{b}$, and how this relates to the number of bit "errors" or differences between two different code words, $\mathbf{a}_i$ and $\mathbf{a}_j$. This measure is provided by the Hamming distance on the space of binary words of length $N$. The properties of the Hamming distance, attributed to R.W. Hamming, who established the fundamentals of error detecting and error correcting codes, are instrumental in establishing the operation and performance of channel codes for both error detection and error correction.

DEFINITION 5.7 Hamming Distance Consider the two $N$-length binary words $\mathbf{a} = a_1 a_2 \ldots a_N$ and $\mathbf{b} = b_1 b_2 \ldots b_N$. The Hamming distance between $\mathbf{a}$ and $\mathbf{b}$, $d(\mathbf{a}, \mathbf{b})$, is defined as the number of bit positions in which $\mathbf{a}$ and $\mathbf{b}$ differ. The Hamming distance is a metric on the space of all binary words of length $N$ since for arbitrary words, $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, the Hamming distance obeys the following conditions:

1. $d(\mathbf{a}, \mathbf{b}) \ge 0$ with equality when $\mathbf{a} = \mathbf{b}$
2. $d(\mathbf{a}, \mathbf{b}) = d(\mathbf{b}, \mathbf{a})$
3. $d(\mathbf{a}, \mathbf{b}) + d(\mathbf{b}, \mathbf{c}) \ge d(\mathbf{a}, \mathbf{c})$ (triangle inequality)

EXAMPLE 5.3
Let $N = 8$ and let:

$\mathbf{a} = 11010001$
$\mathbf{b} = 00010010$
$\mathbf{c} = 01010011$
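Then we have that $d(\mathbf{a}, \mathbf{b}) = 4$ since $\mathbf{a}$ differs from $\mathbf{b}$ in the following 4 bit locations: $a_1 \ne b_1$, $a_2 \ne b_2$, $a_7 \ne b_7$, $a_8 \ne b_8$. Similarly, $d(\mathbf{b}, \mathbf{c}) = 2$ since $\mathbf{b}$ and $\mathbf{c}$ differ by 2 bits, bits 2 and 8, and $d(\mathbf{a}, \mathbf{c}) = 2$ since $\mathbf{a}$ and $\mathbf{c}$ also differ by 2 bits, bits 1 and 7. We can also verify the triangle inequality since:

$$d(\mathbf{a}, \mathbf{b}) + d(\mathbf{b}, \mathbf{c}) \ge d(\mathbf{a}, \mathbf{c}) \;\Rightarrow\; 4 + 2 \ge 2$$

5.4.1 Hamming Distance Decoding Rule for BSCs

Consider the maximum likelihood decoding rule where the code word, $\mathbf{a}_i$, which maximises the conditional probability $P_N(\mathbf{b}|\mathbf{a}_i)$ (or likelihood), has to be found. Consider a BSC with bit error probability $q$. Let the Hamming distance $D = d(\mathbf{b}, \mathbf{a}_i)$ represent the number of bit errors between the transmitted code word $\mathbf{a}_i$ and received word $\mathbf{b}$. Then this gives the following expression for the conditional probability:

$$P_N(\mathbf{b}|\mathbf{a}_i) = q^{D} (1 - q)^{N - D} \tag{5.12}$$

If $q < 0.5$ then Equation 5.12 is maximised when $\mathbf{a}_i$ is chosen such that $d(\mathbf{b}, \mathbf{a}_i)$ is minimised.

RESULT 5.1 Hamming Distance Decoding Rule
The binary word, $\mathbf{b}$, of length $N$ is received upon transmission of one of the $M$ possible $N$-bit binary code words, $\mathbf{a}_i \in \mathbf{B}_M$, through a BSC. Assuming the maximum likelihood decoding rule we choose the most likely code word as follows:

• if $\mathbf{b} = \mathbf{a}_i$ for a particular $i$ then the code word $\mathbf{a}_i$ was sent
• if $\mathbf{b} \ne \mathbf{a}_i$ for any $i$, we find the code word $\mathbf{a}^* \in \mathbf{B}_M$ which is closest to $\mathbf{b}$ in the Hamming sense:

$$d(\mathbf{b}, \mathbf{a}^*) \le d(\mathbf{b}, \mathbf{a}_i) \quad \forall i \tag{5.13}$$

  - if there is only one candidate $\mathbf{a}^*$ then the $t$-bit error, where $t = d(\mathbf{b}, \mathbf{a}^*)$, is corrected and $\mathbf{a}^*$ was sent
  - if there is more than one candidate $\mathbf{a}^*$ then the $t$-bit error, where $t = d(\mathbf{b}, \mathbf{a}^*)$, can only be detected

EXAMPLE 5.4
Consider the following channel code: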
Message ($L = 2$)   Code word ($N = 3$)
00                  000
01                  001
10                  011
11                  111

There are $M = 2^L = 4$ messages and 4 corresponding code words. With $N = 3$ length code words there are $2^N = 8$ possible received words, 4 of these will be the correct code words and 4 of these will be non-code words. If the received word, $\mathbf{b}$, belongs to the set of code words $\{000, 001, 011, 111\}$ then the Hamming distance decoding rule would imply that we decode the received word as the code word (i.e., $\mathbf{a}^*$ is the same as $\mathbf{b}$). That is if $\mathbf{b} = 000$ then $\mathbf{a}^* = 000$, and so on. If the received word, $\mathbf{b}$, belongs to the set of non-code words $\{010, 100, 101, 110\}$ then the Hamming distance decoding rule would operate as follows:

$\mathbf{b} = b_1 b_2 b_3$   Closest code word                            Action
010                          000 ($b_2$ in error), 011 ($b_3$ in error)   1-bit error detected
100                          000 ($b_1$ in error)                         1-bit error corrected
101                          001 ($b_1$ in error), 111 ($b_2$ in error)   1-bit error detected
110                          111 ($b_3$ in error)                         1-bit error corrected

5.4.2 Error Detection/Correction Using the Hamming Distance

An important indication of the error robustness of a code is the Hamming distance between two different code words, $\mathbf{a}_i$ and $\mathbf{a}_j$. From Example 5.4 it is apparent that errors are detected when the received word, $\mathbf{b}$, is equidistant from more than one code word and errors are corrected when the received word is closest to only one code word. Both forms of error robustness rely on there being sufficient distance between code words so that non-code words can be detected and even corrected. The analysis for specific behaviour of a code to errors is tedious and unproductive. The most useful result is when we consider the general error robustness of a code. That is, whether a code can detect or correct all errors up to $t$ bits no matter where the errors occur and in which code words.
To this end we need to define the following important measure of a code's error performance.

DEFINITION 5.8 Minimum Distance of a Code The minimum distance of a block code $K_n$, where $K_n$ identifies the set of length $n$ code words, is given by:

$$d(K_n) = \min \{ d(\mathbf{a}, \mathbf{b}) \mid \mathbf{a}, \mathbf{b} \in K_n \text{ and } \mathbf{a} \ne \mathbf{b} \} \tag{5.14}$$

that is, the smallest Hamming distance over all pairs of distinct code words.
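Definitions 5.7 and 5.8 translate directly into a short Python sketch. Code words are represented as strings of '0' and '1', which is an assumption of this illustration, and the function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which the equal-length words a and b differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct code words (Equation 5.14)."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


# Words of Example 5.3 and the code of Example 5.4
print(hamming_distance("11010001", "00010010"))        # 4
print(minimum_distance(["000", "001", "011", "111"]))  # 1
```

The pairwise search costs $M(M-1)/2$ distance computations, which is fine for the small codes used in this chapter.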
$t$-bit error detection

A block code $K_n$ is said to detect all combinations of up to $t$ errors provided that for each code word $\mathbf{a}_i$ and each received word $\mathbf{b}$ obtained by corrupting up to $t$ bits in $\mathbf{a}_i$, the resulting $\mathbf{b}$ is not a code word (and hence can be detected via the Hamming distance decoding rule). This is an important property for ARQ error control schemes where codes for error detection are required.

RESULT 5.2 Error Detection Property
A block code, $K_n$, detects up to $t$ errors if and only if its minimum distance is greater than $t$:

$$d(K_n) > t \tag{5.15}$$

PROOF Let $\mathbf{a}^*$ be the code word transmitted from the block code $K_n$ and $\mathbf{b}$ the received word. Assume there are $t$ bit errors in the transmission. Then $d(\mathbf{a}^*, \mathbf{b}) = t$. To detect that $\mathbf{b}$ is in error it is sufficient to ensure that $\mathbf{b}$ does not correspond to any of the code words $\mathbf{a}_i$, that is, $d(\mathbf{b}, \mathbf{a}_i) > 0 \;\; \forall i$. Using the triangle inequality we have that for any code word $\mathbf{a}_i \ne \mathbf{a}^*$:

$$d(\mathbf{a}^*, \mathbf{b}) + d(\mathbf{b}, \mathbf{a}_i) \ge d(\mathbf{a}^*, \mathbf{a}_i) \quad \text{or} \quad d(\mathbf{b}, \mathbf{a}_i) \ge d(\mathbf{a}^*, \mathbf{a}_i) - d(\mathbf{a}^*, \mathbf{b})$$

To ensure that $d(\mathbf{b}, \mathbf{a}_i) > 0$ we must have that $d(\mathbf{a}^*, \mathbf{a}_i) - d(\mathbf{a}^*, \mathbf{b}) > 0$ or $d(\mathbf{a}^*, \mathbf{a}_i) > d(\mathbf{a}^*, \mathbf{b})$. Since $d(\mathbf{a}^*, \mathbf{b}) = t$ we get the final result that:

$$d(\mathbf{a}^*, \mathbf{a}_i) > t$$

which is guaranteed to be true if and only if $d(K_n) > t$.

Figure 5.3 depicts the two closest code words, $\mathbf{a}_i$ and $\mathbf{a}_j$, at a distance of $t + 1$ (i.e., $d(K_n) = t + 1$). The hypersphere of distance $t + 1$ drawn around each code word may touch another code word but no code word falls within the hypersphere of another code word since this would violate $d(K_n) = t + 1$. Clearly any received word $\mathbf{b}$ of distance $\le t$ from any code word will fall within the hypersphere of radius $t + 1$ from one or more code words and hence be detectable since it can never be mistaken for a code word.

$t$-bit error correction

A block code $K_n$ is said to correct all combinations of up to $t$ errors provided that for each code word $\mathbf{a}_i$ and each received word $\mathbf{b}$ obtained by corrupting up to $t$ bits in $\mathbf{a}_i$, the Hamming distance decoding rule leads uniquely to $\mathbf{a}_i$. This is an important property for FEC error control schemes where codes for error correction are required.
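The two definitions above can be checked by brute force for small codes: corrupt every code word with every pattern of up to $t$ bit errors and see what the decoder would face. The sketch below does this for words held as bit strings; the function names are hypothetical and the two-word code {000, 111} is used purely as an assumed illustration.

```python
from itertools import combinations


def distance(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def flip(word, positions):
    """Return word with the bits at the given positions inverted."""
    bits = list(word)
    for p in positions:
        bits[p] = "1" if bits[p] == "0" else "0"
    return "".join(bits)


def corrupted(word, t):
    """Yield every word obtained from `word` by 1..t bit errors."""
    for weight in range(1, t + 1):
        for positions in combinations(range(len(word)), weight):
            yield flip(word, positions)


def detects_all(code, t):
    """True if no pattern of up to t errors turns a code word into another code word."""
    code_set = set(code)
    return all(b not in code_set for a in code for b in corrupted(a, t))


def corrects_all(code, t):
    """True if the nearest code word is unique and correct for every pattern of up to t errors."""
    for a in code:
        for b in corrupted(a, t):
            ranked = sorted((distance(b, c), c) for c in code)
            unique = len(ranked) == 1 or ranked[0][0] < ranked[1][0]
            if not (unique and ranked[0][1] == a):
                return False
    return True


print(detects_all(["000", "111"], 2), corrects_all(["000", "111"], 1))   # True True
print(detects_all(["000", "001", "011", "111"], 1))                      # False
```

The exhaustive check grows quickly with $n$ and $t$, which is why the minimum distance conditions of Results 5.2 and 5.3 are the practical test.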
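FIGURE 5.3
Diagram of $t$-bit error detection for $d(K_n) = t + 1$.

RESULT 5.3 Error Correction Property
A block code, $K_n$, corrects up to $t$ errors if and only if its minimum distance is greater than $2t$:

$$d(K_n) > 2t \tag{5.16}$$

PROOF Let $\mathbf{a}^*$ be the code word transmitted from the block code $K_n$ and $\mathbf{b}$ the received word. Assume there are $t$ bit errors in the transmission. Then $d(\mathbf{a}^*, \mathbf{b}) = t$. To detect that $\mathbf{b}$ is in error and ensure that the Hamming distance decoding rule uniquely yields $\mathbf{a}^*$ (the error can be corrected) it is sufficient that $d(\mathbf{a}^*, \mathbf{b}) < d(\mathbf{b}, \mathbf{a}_i)$ for every code word $\mathbf{a}_i \ne \mathbf{a}^*$. Using the triangle inequality we have that for any such code word $\mathbf{a}_i$:

$$d(\mathbf{a}^*, \mathbf{b}) + d(\mathbf{b}, \mathbf{a}_i) \ge d(\mathbf{a}^*, \mathbf{a}_i) \quad \text{or} \quad d(\mathbf{b}, \mathbf{a}_i) \ge d(\mathbf{a}^*, \mathbf{a}_i) - d(\mathbf{a}^*, \mathbf{b})$$

To ensure that $d(\mathbf{a}^*, \mathbf{b}) < d(\mathbf{b}, \mathbf{a}_i)$, or $d(\mathbf{b}, \mathbf{a}_i) > d(\mathbf{a}^*, \mathbf{b})$, we must have that $d(\mathbf{a}^*, \mathbf{a}_i) - d(\mathbf{a}^*, \mathbf{b}) > d(\mathbf{a}^*, \mathbf{b})$ or $d(\mathbf{a}^*, \mathbf{a}_i) > 2 d(\mathbf{a}^*, \mathbf{b})$. Since $d(\mathbf{a}^*, \mathbf{b}) = t$ we get the final result that:

$$d(\mathbf{a}^*, \mathbf{a}_i) > 2t$$

which is guaranteed to be true if and only if $d(K_n) > 2t$.

Figure 5.4 depicts the two closest code words, $\mathbf{a}_i$ and $\mathbf{a}_j$, at a distance of $2t + 1$ (i.e., $d(K_n) = 2t + 1$). The hyperspheres of distance $t$ drawn around each code word do not touch each other. Clearly any received word $\mathbf{b}$ of distance $\le t$ from any code word will only fall within the hypersphere of radius $t$ from that code word and hence can be corrected.

EXAMPLE 5.5
The even-parity check code has $N = L + 1$ and is created by adding a parity-check bit to the original $L$-bit message such that the resultant code word has an even number of 1's. We consider the specific case for $L = 2$ where the code table is: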
Message   Code word
00        000
01        011
10        101
11        110

By computing the Hamming distance between all pairs of distinct code words we find that $d(K_N) = 2$, that is, no two code words are less than a distance of 2 from each other. By Result 5.2 since $d(K_N) = 2 > t = 1$ the even-parity check code is able to detect all single-bit errors. It can be shown that $d(K_N) = 2$ for any length even-parity check code.

FIGURE 5.4
Diagram of $t$-bit error correction for $d(K_n) = 2t + 1$.

EXAMPLE 5.6
The repetition code is defined for single-bit messages ($L = 1$) where the message bit is repeated $(N - 1)$ times (for $N$ odd) and then the errors are corrected by using a majority vote decision rule (which is functionally equivalent to the Hamming distance decoding rule but more efficiently implemented). Consider the $N = 3$ repetition code:

Message   Code word
0         000
1         111

where the message is repeated twice to form a code word of length $N = 3$. The minimum distance of the code is, by inspection, $d(K_3) = 3$, and from Result 5.2, $d(K_3) > t$ with $t = 2$, and this code can detect all double-bit errors. Furthermore by Result 5.3, $d(K_3) > 2t$ with $t = 1$, and this code can also correct all single-bit errors. The decoder can be designed for either 2-bit error detection or 1-bit error correction. The Hamming decoding rule as described will perform 1-bit error correction. To see how the Hamming and majority vote decoding rules work to provide 1-bit error correction we consider what happens when any of the 8 possible words are received:
Received word   Closest code word   Majority bit   Message?
000             000                 0              0
001             000                 0              0
010             000                 0              0
011             111                 1              1
100             000                 0              0
101             111                 1              1
110             111                 1              1
111             111                 1              1

The majority vote decoding rule simply selects the bit that occurs most often in the received word (the majority bit). With $N$ odd this is guaranteed to always produce a clear majority (decision). From the above table if, for example, the received word is $\mathbf{b} = 011$ then we have two 1's, one 0 and the majority bit is 1, and hence the message is 1. It can be easily seen that the majority vote selects the code word that is closest to the received word and hence is equivalent to the Hamming distance decoding rule.

For an $N$-bit repetition code it can be shown that $d(K_N) = N$ and hence the code will be able to perform $\lfloor N/2 \rfloor$-bit error correction, where the operator $\lfloor x \rfloor$ is the largest integer less than or equal to $x$, or detect up to $N - 1$ errors if only error detection is considered.

EXAMPLE 5.7
Consider the code from Example 5.4:

Message ($L = 2$)   Code word ($N = 3$)
00                  000
01                  001
10                  011
11                  111

We see that $d(K_3) = 1$ since the first code word 000 is of distance 1 from the second code word 001. This implies that from Results 5.2 and 5.3 this code cannot detect all single-bit errors, let alone correct any errors. In Example 5.4 the Hamming distance decoding rule was either providing 1-bit detection or 1-bit correction. The discrepancy is explained by noting that Results 5.2 and 5.3 are restricted to codes which are able to detect or correct all $t$-bit errors. With this in mind it is obvious this code cannot detect a single-bit error in bit 3 of the code word 000 since this would produce the code word 001 and not be detected.

EXAMPLE 5.8
Consider the following code $K_6$ for $L = 3$-bit messages:
Message ($L = 3$)   Code word ($N = 6$)
000                 000000
001                 001110
010                 010101
011                 011011
100                 100011
101                 101101
110                 110110
111                 111000

By computing the distance between all pairs of distinct code words, requiring $\binom{8}{2} = 28$ computations of the Hamming distance, we find that $d(K_6) = 3$ and this code can correct all 1-bit errors. For example, if $\mathbf{a} = 010101$ is sent and $\mathbf{b} = 010111$ is received then the closest code word is 010101 and the most likely single-bit error in $b_5$ is corrected. This will always be the case no matter which code word is sent and which bit is in error; all single-bit errors will be corrected. However the converse statement is not true. All possible received words of length $N = 6$ do not simply imply a code word (no errors) or a code word with a single-bit error. Consider $\mathbf{b} = 111111$. The closest code words are 110110, 101101 and 011011, implying that a 2-bit error has been detected. Thus a code for $t$-bit error correction may sometimes provide greater than $t$-bit error detection. In FEC error control systems where there is no feedback channel this is undesirable as there is no mechanism for dealing with error detection.

5.5 Bounds on $M$, Maximal Codes and Perfect Codes

5.5.1 Upper Bounds on $M$ and the Hamming Bound

The $d(K_N)$ for a particular block code $K_N$ specifies the code's error correction and error detection capabilities as given by Results 5.2 and 5.3. Say we want to design a code, $K_N$, of length $N$ with minimum distance $d(K_N)$. Is there a limit (i.e., upper bound) on the number of code words (i.e., messages), $M$, we can have? Or say we want to design a code of length $N$ for messages of length $L$. What is the maximum error protection (i.e., maximum $d(K_N)$) that we can achieve for a code with these parameters? Finally, say we want to design a code with messages of length $L$ that is capable of $t$-bit error correction. What is the smallest code word length, $N$, that we can use? The answer to these important design questions is given by the parameter $B(N, d(K_N))$ defined below.
DEFINITION 5.9 Upper Bound on $M$
For a block code $K_N$ of length $N$ and minimum distance $d(K_N)$ the maximum number of code words, and hence messages, $M$, to guarantee that such a code can exist is given by:

$$M = B(N, d(K_N)) \tag{5.17}$$

RESULT 5.4
The following are elementary results for $B(N, d(K_N))$:

1. $B(N, 1) = 2^N$
2. $B(N, 2) = 2^{N-1}$
3. $B(N, 2t + 1) = B(N + 1, 2t + 2)$
4. $B(N, N) = 2$

where $B(N, 2t + 1) = B(N + 1, 2t + 2)$ indicates that if we know the bound for odd $d(K_N) = 2t + 1$ then we can obtain the bound for the next even $d(K_N) + 1 = 2t + 2$, and vice versa.

EXAMPLE 5.9
Using the relation $B(N, 2t + 1) = B(N + 1, 2t + 2)$ with $t = 0$ gives $B(N, 1) = B(N + 1, 2)$. If we are given that $B(N, 1) = 2^N$ then this also implies $B(N + 1, 2) = 2^N$, and making the substitution $N' = N + 1$ and then replacing $N'$ by $N$ yields $B(N, 2) = 2^{N-1}$.

A proof of Result 5.4 can be found in the coding theory literature. Additional results for $B(N, d(K_N))$ can be found by considering the important Hamming or sphere-packing bound stated in the following theorem:

THEOREM 5.1 Hamming Bound
If the block code of length $N$ is a $t$-bit error correcting code, then the number of code words, $M$, must satisfy the following inequality:

$$M \le \frac{2^N}{\sum_{i=0}^{t} \binom{N}{i}} \tag{5.18}$$

which is an upper bound (the Hamming bound) on the number of code words.
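Equation 5.18 is easy to evaluate numerically. The sketch below is a minimal illustration; the names `sphere_volume` and `hamming_bound` are hypothetical, and the bound is rounded down because $M$ must be an integer.

```python
from math import comb


def sphere_volume(N, t):
    """Denominator of Equation 5.18: number of N-bit words within distance t of a given word."""
    return sum(comb(N, i) for i in range(t + 1))


def hamming_bound(N, t):
    """Largest integer M allowed by the Hamming bound for a t-bit error correcting code."""
    return 2**N // sphere_volume(N, t)


for N in (4, 6, 7):
    print(N, hamming_bound(N, 1))
# 4 3
# 6 9
# 7 16
```

The bound only rules codes out; a value returned here does not guarantee that a code with that many code words exists.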
PROOF Let $V(N, t)$ be defined as the number of words of length $N$ that are within a Hamming distance of $t$ from a code word $\mathbf{a}_j$. There will be $\binom{N}{i}$ such words at a Hamming distance of $i$, and by adding up the number of words of Hamming distance $i$ for $i = 0, 1, 2, \ldots, t$ we obtain:

$$V(N, t) = \sum_{i=0}^{t} \binom{N}{i} \tag{5.19}$$

An alternative interpretation which is sometimes useful is to consider $V(N, t)$ as the volume of the hypersphere of radius $t$ centred at $\mathbf{a}_j$ in the space of all $N$-length words. Since the code is a $t$-bit error correcting code then for any pair of code words, $\mathbf{a}_j$ and $\mathbf{a}_k$, this implies $d(\mathbf{a}_j, \mathbf{a}_k) > 2t$ and thus no word will be within a Hamming distance of $t$ from more than one code word. Consider all $M$ code words and the set of words that are within a Hamming distance of $t$, $V(N, t)$, from each code word, that is $N_t = M V(N, t)$. To guarantee that no word is within a Hamming distance of $t$ from more than one code word we must ensure that the possible number of distinct sequences of length $N$, $2^N$, be at least $N_t$, that is:

$$M \sum_{i=0}^{t} \binom{N}{i} \le 2^N \tag{5.20}$$

which proves the Hamming bound. This is shown in Figure 5.5 where it is evident that, since for $t$-bit error correction the hyperspheres of radius $t$ around each code word must not touch, this can only happen if the space of all words of length $N$ is at least the sum of the volumes of all the hyperspheres.

It should be noted that the condition for equality with the Hamming bound occurs when all words of length $N$ reside within one of the hyperspheres of radius $t$, that is, when each word of length $N$ is a code word or of distance $t$ or less from a code word.

We know from Result 5.3 that for a $t$-bit error correcting code $d(K_N) > 2t$. Thus a $t$-bit error correcting code will have a minimum distance of $d(K_N) = 2t + 1$ or more. Consider the problem of finding the upper bound for a code with $d(K_N) = 2t + 1$ (i.e., $B(N, 2t + 1)$). Since $d(K_N) > 2t$ this is a $t$-bit error correcting code and hence subject to the Hamming bound from Theorem 5.1 we expect that:

$$M = B(N, 2t + 1) \le \frac{2^N}{\sum_{i=0}^{t} \binom{N}{i}} \tag{5.21}$$
FIGURE 5.5
Proof of Hamming bound for $t$-bit error correcting codes. Solid dots represent code words; open dots represent all other words of length $N$. The figure shows $M$ spheres of radius $t$ and volume $V(N, t)$ inside the $2^N$ space of all words of length $N$.

EXAMPLE 5.10
Consider the important case of 1-bit error correcting codes where $d(K_N) = 3$. The Hamming bound for 1-bit error correction is:

$$M \le \frac{2^N}{\binom{N}{0} + \binom{N}{1}} = \frac{2^N}{N + 1}$$

and hence the upper bound on the number of messages with $d(K_N) = 3$ is:

$$B(N, 3) \le \frac{2^N}{N + 1}$$

Using $B(N, 2t + 1) = B(N + 1, 2t + 2)$ for $t = 1$ then gives:

$$B(N, 4) \le \frac{2^{N-1}}{N}$$

which is the upper bound on the number of messages for a code with $d(K_N) = 4$.

It should be noted that the Hamming bound of Theorem 5.1 is a necessary but not sufficient condition. That is, if we find values of $N$, $M$ and $t$ that satisfy Equation 5.18 this is not sufficient to guarantee that such a code actually exists.
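For very short codes the gap between the Hamming bound and what is actually achievable can be probed by exhaustive search. The following sketch compares the bound $2^N/(N+1)$ of Example 5.10 with the largest code of minimum distance 3 found by backtracking. The function `max_code_size` is a hypothetical helper written for this illustration, and the search is only practical for very small $N$.

```python
def max_code_size(N, d):
    """Largest number of N-bit words with pairwise Hamming distance >= d (backtracking search)."""
    best = 0

    def dist(x, y):
        return bin(x ^ y).count("1")        # words are held as integers 0 .. 2^N - 1

    def extend(chosen, start):
        nonlocal best
        best = max(best, len(chosen))
        for w in range(start, 2**N):
            if all(dist(w, c) >= d for c in chosen):
                chosen.append(w)
                extend(chosen, w + 1)       # only look at larger words to avoid repeats
                chosen.pop()

    extend([], 0)
    return best


for N in (3, 4, 5):
    bound = 2**N // (N + 1)                 # Hamming bound for t = 1, rounded down
    print(N, bound, max_code_size(N, 3))
# 3 2 2
# 4 3 2
# 5 5 4
```

For $N = 4$ and $N = 5$ the bound admits more code words than any code of minimum distance 3 can contain, which is the sense in which the bound is necessary but not sufficient.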
EXAMPLE 5.11
Consider designing a 1-bit error correcting code ($t = 1$) using code words of length $N = 4$. We can satisfy Equation 5.18 with $M = 3$. However no code with $M = 3$ code words of length $N = 4$ and a minimum distance of at least $d(K_4) = 2t + 1 = 3$ can actually be designed. In fact the maximum possible value of $M$ is 2, that is, $B(4, 3) = 2$.

5.5.2 Maximal Codes and the Gilbert Bound

The code $K_3 = \{000, 011\}$ has $d(K_3) = 2$ and contains $M = 2$ code words of length $N = 3$ implying $L = 1$ and a code rate $R = 1/3$. The code can be made more efficient (i.e., higher code rate) by augmenting it to the code $K_3 = \{000, 011, 101, 110\}$ which also has $d(K_3) = 2$ but contains $M = 4$ code words of length $N = 3$ implying $L = 2$ and a code rate $R = 2/3$. The second code is more efficient than the first code without sacrificing the minimum distance. We can define such codes as follows:

DEFINITION 5.10 Maximal Codes
A code $K_N$ of length $N$ and minimum distance $d(K_N)$ with $M$ code words is said to be maximal if it is not part of, or cannot be augmented to, another code $K'_N$ of length $N$, minimum distance $d(K_N)$ but with $M + 1$ code words. It can be shown that a code is maximal if and only if for all words $\mathbf{b}$ of length $N$ there is a code word $\mathbf{a}_i$ such that $d(\mathbf{b}, \mathbf{a}_i) < d(K_N)$.

Thus maximal codes are more efficient in that for the same $d(K_N)$ they provide the maximum code rate possible.

EXAMPLE 5.12
The code $K_3 = \{000, 011\}$ with $d(K_3) = 2$ mentioned above is not maximal since for word $\mathbf{b} = 101$, $d(\mathbf{a}_1, \mathbf{b}) = d(000, 101) = 2$ and $d(\mathbf{a}_2, \mathbf{b}) = d(011, 101) = 2$ and thus there is no code word such that $d(\mathbf{b}, \mathbf{a}_i) < 2$.

If code $K_N$ is a maximal code then it satisfies the Gilbert bound first proposed by Gilbert.
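The maximality condition of Definition 5.10 can be tested directly by scanning all $2^N$ words. The sketch below is a hypothetical helper, assuming code words are given as bit strings; it returns a witness word that could be added to the code when the code is not maximal.

```python
from itertools import product


def distance(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def is_maximal(code, d):
    """Return (True, None) if every N-bit word is closer than d to some code word,
    otherwise (False, b) where b is a word that could be added without reducing d."""
    n = len(code[0])
    for bits in product("01", repeat=n):
        b = "".join(bits)
        if all(distance(b, a) >= d for a in code):
            return False, b
    return True, None


print(is_maximal(["000", "011"], 2))                 # (False, '101')
print(is_maximal(["000", "011", "101", "110"], 2))   # (True, None)
```

The first result reproduces Example 5.12: the word 101 is at distance 2 from both code words, so the two-word code can be augmented.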
THEOREM 5.2 Gilbert Bound
If a block code $K_N$ of length $N$ is a maximal code with minimum distance $d(K_N)$ then the number of code words, $M$, must satisfy the following inequality:

$$M \ge \frac{2^N}{\sum_{i=0}^{d(K_N) - 1} \binom{N}{i}} \tag{5.22}$$

which provides a lower bound (the Gilbert bound) on the number of code words.

PROOF If code $K_N$ is a maximal code then each word of length $N$ must be of distance $d(K_N) - 1$ or less from at least one code word. The number of words within a Hamming distance of $d(K_N) - 1$ from a code word is given by Equation 5.19:

$$V(N, d(K_N) - 1) = \sum_{i=0}^{d(K_N) - 1} \binom{N}{i} \tag{5.23}$$

Consider all $M$ code words and the set of words that are within a Hamming distance of $d(K_N) - 1$ from each code word. If a code is maximal then to ensure that all possible distinct sequences of length $N$, $2^N$, are guaranteed to reside within a distance of $d(K_N) - 1$ from at least one code word we must have that $2^N$ be no greater than the total number of words, $M V(N, d(K_N) - 1)$, that are within a distance of $d(K_N) - 1$ of at least one code word, that is:

$$M \sum_{i=0}^{d(K_N) - 1} \binom{N}{i} \ge 2^N \tag{5.24}$$

which proves the Gilbert bound. This is shown in Figure 5.6 where it is evident that to ensure that each word of length $N$ falls within one of the hyperspheres of radius $d(K_N) - 1$ surrounding a code word the space of all words of length $N$ must be no greater than the sum of the volumes of all the hyperspheres.

The condition for equality with the Gilbert bound occurs when the hyperspheres of radius $d(K_N) - 1$ are filled uniquely with words of length $N$. That is, each word of length $N$ will be of distance less than $d(K_N)$ from precisely one code word.

EXAMPLE 5.13
Consider the code $K_3 = \{000, 011\}$ with $d(K_3) = 2$ just discussed. The Gilbert
bound is satisfied since:

$$M = 2 \ge \frac{2^N}{\sum_{i=0}^{d(K_N) - 1} \binom{N}{i}} = \frac{8}{1 + 3} = 2$$

However we know that this code is not maximal. Hence the Gilbert bound is a necessary but not sufficient condition for maximal codes.

FIGURE 5.6
Proof of Gilbert bound for a code with minimum distance $d$. Solid dots represent code words; open dots represent all other words of length $N$. The figure shows $M$ spheres of radius $d - 1$ covering the $2^N$ space of all words of length $N$.

5.5.3 Redundancy Requirements for $t$-bit Error Correction

Instead of asking how many errors a code can correct, we can ask how much redundancy needs to be present in a code in order for it to be able to correct a given number of errors. Let $r = N - L$ be the number of check bits (the redundancy) that are added by the channel coder. The Hamming bound from Equation 5.18 provides the following lower bound on the redundancy needed for a $t$-bit error correcting code:

$$r \ge \log_2 (V(N, t)) \quad \text{or} \quad 2^r \ge V(N, t) \tag{5.25}$$

where $V(N, t)$ is given by Equation 5.19. The Gilbert bound from Equation 5.22 provides an upper bound on the redundancy if the code is to be maximal. For a $t$-bit
error correcting code the minimum distance must be at least $d(K_N) = 2t + 1$, and thus:

$$r \le \log_2 (V(N, 2t)) \quad \text{or} \quad 2^r \le V(N, 2t) \tag{5.26}$$

EXAMPLE 5.14
Suppose we want to construct a code with $L = 12$ which is capable of 3-bit error correction. How many check bits are needed for a maximal code? Define $r = N - 12$ and $t = 3$. A computation of $2^r$ together with $V(N, 3)$ and $V(N, 6)$, evaluated from Equation 5.19 for a range of values of $N$, is shown by the table below:

$N$   $r = N - 12$   $V(N, 3)$   $2^r$      $V(N, 6)$
22    10             1794        1024       110056
23    11             2048        2048       145499
24    12             2325        4096       190051
25    13             2626        8192       245506
26    14             2952        16384      313912
27    15             3304        32768      397594
28    16             3683        65536      499178
29    17             4090        131072     621616
30    18             4526        262144     768212
31    19             4992        524288     942649
32    20             5489        1048576    1149017
33    21             6018        2097152    1391842

We see that we need between 11 and 20 check bits in order to design a maximal code to correct three errors. Since both the Hamming and Gilbert bounds are necessary but not sufficient then all we can say is that a code may exist and if it does it will have between 11 and 20 check bits.

5.5.4 Perfect Codes for $t$-bit Error Correction

A block code with $L$-bit length messages will allocate $M = 2^L$ words of length $N$ bits (where $N > L$) as the code words. Assume the code is designed for $t$-bit error correction. The space of all words of length $N$, $2^N$, will fall into one of the following two mutually exclusive sets:

Set $C$ The code words themselves and all the words formed by corrupting each code word by all combinations of up to $t$-bit errors. That is, all the words falling uniquely within one of the hyperspheres of radius $t$ centred at each code word. Since the code can correct $t$-bit errors these words are guaranteed to be of distance $t$ or less from precisely one of the code words and the errors can be corrected.
Set $\bar{C}$ Non-code words of distance greater than $t$ from one or more code words. That is, all the words which fall outside the hyperspheres of radius $t$ surrounding each code word. Since such words may be equidistant to more than one code word errors may only be detected.

If all the words of length $N$ belong to Set $C$ then the code is a perfect code.

DEFINITION 5.11 Perfect Codes Perfect codes for $t$-bit error correction possess the following properties:

1. All received words are either a code word or a code word with up to $t$-bit errors which, by definition, can be corrected. Thus there is no need to handle detectable errors.
2. The code is maximal.
3. The code provides maximum code rate (and minimum redundancy) for $t$-bit error correction.
4. The minimum distance is $2t + 1$.

A code is defined as a perfect code for $t$-bit errors provided that for every word $\mathbf{b}$ of length $N$, there exists precisely one code word of distance $t$ or less from $\mathbf{b}$. This implies equality with the Hamming bound. That is, a perfect code satisfies:

$$M = \frac{2^N}{\sum_{i=0}^{t} \binom{N}{i}} \;\Rightarrow\; 2^{N-L} = \sum_{i=0}^{t} \binom{N}{i} \tag{5.27}$$

EXAMPLE 5.15
Consider the $(N, L) = (6, 3)$ block code from Example 5.8:

Message ($L = 3$)   Code word ($N = 6$)
000                 000000
001                 001110
010                 010101
011                 011011
100                 100011
101                 101101
110                 110110
111                 111000

This is a code for 1-bit error correction since $d(K_6) = 3$. However this is not a perfect code for 1-bit error correction since:
• $2^{N-L} = 2^3 = 8 \ne \sum_{i=0}^{1} \binom{6}{i} = \binom{6}{0} + \binom{6}{1} = 7$
• the received word 111111 is not at a distance of 1 from any of the code words and using the Hamming decoding rule it is at a distance of 2 from three candidate code words: 011011, 101101 and 110110. Thus there is a 2-bit error detected.

EXAMPLE 5.16
Most codes will not be perfect. Perfect codes for 1-bit error correction form a family of codes called the Hamming codes, which are discussed in Section 7.7. An example of the $(N, L) = (7, 4)$ Hamming code is given by:

Message   Code word   Message   Code word
0000      0000000     1000      1000011
0001      0001111     1001      1001100
0010      0010110     1010      1010101
0011      0011001     1011      1011010
0100      0100101     1100      1100110
0101      0101010     1101      1101001
0110      0110011     1110      1110000
0111      0111100     1111      1111111

This is a code for 1-bit error correction since $d(K_7) = 3$. This is also a perfect code for 1-bit error correction since:

• $2^{N-L} = 2^3 = 8 = \sum_{i=0}^{1} \binom{7}{i} = \binom{7}{0} + \binom{7}{1} = 8$
• any 7-bit received word will either be a code word or be of distance 1 from precisely one code word and hence 1-bit error correction will always be performed, even if there has been, say, a 2-bit error. Consider received word 0000011 as the code word 0000000 with a 2-bit error. It will be decoded incorrectly as the code word 1000011 with a 1-bit error. It should be noted that a 1-bit error is more likely than a 2-bit error and the Hamming decoding rule chooses the most likely code word.

5.6 Error Probabilities

There are two important measures for a channel code:
1. The code rate. Codes with higher code rate are more desirable.
2. The probability of error. Codes with lower probability of error are more desirable. For error correcting codes the block error probability is important. For error detecting codes the probability of undetected error is important.

There is usually a trade-off between the code rate and probability of error. To achieve a lower probability of error it may be necessary to sacrifice code rate. In the next section Shannon's Fundamental Coding Theorem states what limits are imposed when designing channel codes. In this section we examine the different error probabilities and the trade-offs with code rate.

5.6.1 Bit and Block Error Probabilities and Code Rate

The bit error probability, $P_b(K_n)$, is the probability of bit error between the message and the decoded message. For a BSC system without a channel coder we have $P_b(K_n) = q$. Otherwise for a channel coder with $L > 1$ the calculation of the bit error probability is tedious since all combinations of message blocks, and the corresponding bit errors, have to be considered.

The block error probability, $P_e(K_n)$, is the probability of a decoding error by the channel decoder. That is, it is the probability that the decoder picks the wrong code word when applying the Hamming distance decoding rule. For the case of $L = 1$ the bit error and block error probabilities are the same; otherwise they are different. For $L > 1$ the calculation of the block error probability is straightforward if we know the minimum distance of the code.

EXAMPLE 5.17
Consider Code $K_6$ from Example 5.8. Say message 000 is transmitted as code word 000000. Two bit errors occur in the last two bits and the received word is 000011. The channel decoder will then select the nearest code word which is 100011. Thus a block error has occurred. If this always occurs then $P_e(K_n) = 1$.
However the code word 100011 is decoded as message 100 which has only the first bit wrong and the remaining two bits correct when compared with the original message. If this always occurs then $P_b(K_n) = 0.333$.

The block error probability is used to assess and compare the performance of codes since it is easier to calculate and provides the worst-case performance.
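Example 5.17 can be replayed with a small nearest-code-word decoder. The table below is the code $K_6$ of Example 5.8; the helper names are hypothetical, and returning the first closest code word is an assumption made only to keep the sketch short (the received word used here has a unique closest code word).

```python
from fractions import Fraction

# Code K6 from Example 5.8: message -> code word
K6 = {
    "000": "000000", "001": "001110", "010": "010101", "011": "011011",
    "100": "100011", "101": "101101", "110": "110110", "111": "111000",
}


def distance(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def decode(received, table):
    """Return the message whose code word is closest to the received word."""
    return min(table, key=lambda msg: distance(received, table[msg]))


sent_message = "000"
received = "000011"                         # 000000 with errors in the last two bits
decoded = decode(received, K6)
block_error = decoded != sent_message
bit_error_fraction = Fraction(distance(decoded, sent_message), len(sent_message))
print(decoded, block_error, bit_error_fraction)   # 100 True 1/3
```

One wrong block here corresponds to only one wrong message bit out of three, which is why the block error figure is the more pessimistic of the two.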
RESULT 5.5 Block Error Probability for a $t$-bit Error Correcting Code

$$P_e(K_n) = 1 - \sum_{i=0}^{t} \binom{n}{i} q^i (1 - q)^{n-i} \tag{5.28}$$

PROOF The probability of exactly $t$ bit errors in a word of length $n$ is $\binom{n}{t} q^t (1 - q)^{n-t}$. A $t$-bit error correcting code is able to handle up to $t$-bit errors. Thus the probability of no decoding error is $P_c(K_n) = \sum_{i=0}^{t} \binom{n}{i} q^i (1 - q)^{n-i}$ and the probability of error is $P_e(K_n) = 1 - P_c(K_n)$.

For some codes there is a direct trade-off between the block error probability and the code rate as shown in the following example.

EXAMPLE 5.18
Assume a BSC with $q = 0.001$. Then for the no channel code case $P_e(K_1) = 0.001 = 1 \times 10^{-3}$ but at least we have $R = 1$.

Consider using a $N = 3$ repetition code:

Message   Code word
0         000
1         111

Since $d(K_3) = 3$ this is a 1-bit error correcting code and

$$P_e(K_3) = 1 - \sum_{i=0}^{t} \binom{n}{i} q^i (1 - q)^{n-i} = 1 - \binom{3}{0} (0.999)^3 - \binom{3}{1} (0.999)^2 (0.001)^1 = 3 \times 10^{-6}$$

Since $L = 1$ then $P_b(K_3) = P_e(K_3) = 3 \times 10^{-6}$. We see that the block error (or bit error) probability has decreased from $1 \times 10^{-3}$ to $3 \times 10^{-6}$, but the code rate has also decreased from $1$ to $\tfrac{1}{3}$.

Now consider using a $N = 5$ repetition code:

Message   Code word
0         00000
1         11111
Since $d(K_5) = 5$ this is a 2-bit error correcting code and

$$P_e(K_5) = 1 - \sum_{i=0}^{2} \binom{5}{i} q^i (1-q)^{5-i} = 1 - \binom{5}{0}(0.999)^5 - \binom{5}{1}(0.999)^4(0.001)^1 - \binom{5}{2}(0.999)^3(0.001)^2 \approx 1 \times 10^{-8}$$

We now see that the block error (or bit error) probability has decreased from $1 \times 10^{-3}$ to $3 \times 10^{-6}$ to $1 \times 10^{-8}$, but the code rate has also decreased from $1$ to $1/3$ to $1/5$, respectively. Thus there is a clear trade-off between error probability and code rate with repetition codes.

5.6.2 Probability of Undetected Block Error

An undetected block error occurs if the received word, $\mathbf{v}$, is a different code word than was transmitted. Let $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_M$ be the $M$ code words and assume that code word $\mathbf{c}_i$ is transmitted and $\mathbf{v} = \mathbf{c}_j$ is received where $j \ne i$, that is, the received word is also a code word, but different to the transmitted code word. The Hamming decoding rule will, of course, assume there has been no error in transmission and select $\mathbf{c}_j$ as the code word that was transmitted. This is a special case of a decoding error where there is no way the decoder can know that there is an error in transmission, since a valid code word was received. Thus the error is undetected. For error detecting codes the probability of undetected block error is an important measure of the performance of the code. It should be noted that the probability of undetected block error is also relevant for error correcting codes, but such codes are more likely to experience decoding errors which would otherwise be detectable.

EXAMPLE 5.19
Consider the Hamming code of Example 5.16. If the code word 0000000 is transmitted and the word 0000011 is received, the decoder will decode this as code word 1000011; thus the actual 2-bit error has been treated as a more likely 1-bit error. There is a decoding error, but this is not an undetected block error since the received word is not a code word and an error is present, be it 1-bit or 2-bit. Indeed, if the Hamming code were to operate purely as an error detecting code it would be able to detect the 2-bit error (since the minimum distance of the code is 3, all 2-bit errors can be detected).
However, if the code word 0000000 is transmitted and the word 1000011 is received then, since 1000011 is a code word, the decoder will assume there has been no error in transmission.
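The two cases of Example 5.19 can be reproduced with a small sketch of Hamming distance decoding. The generator rows in `G` are an assumption: they define one possible (7,4) Hamming code that contains the code word 1000011, which may not be the exact code of Example 5.16. All function names are illustrative.

```python
from itertools import product

# Hypothetical generator rows for a (7,4) Hamming code, chosen only so that
# 1000011 is a code word; the code of Example 5.16 may be different.
G = ["1000011", "0100101", "0010110", "0001111"]


def xor_words(a, b):
    """Bitwise modulo-2 sum of two binary words given as strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def codebook(rows):
    """All modulo-2 sums of subsets of the generator rows."""
    words = set()
    for coeffs in product((0, 1), repeat=len(rows)):
        word = "0" * len(rows[0])
        for c, row in zip(coeffs, rows):
            if c:
                word = xor_words(word, row)
        words.add(word)
    return words


def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def decode(received, code):
    """Hamming distance decoding: pick the nearest code word."""
    return min(sorted(code), key=lambda w: hamming_distance(w, received))


CODE = codebook(G)
sent = "0000000"
for received in ("0000011", "1000011"):
    decoded = decode(received, CODE)
    print(received, "->", decoded,
          "| error detectable:", received not in CODE,
          "| decoding error:", decoded != sent)
```

In both cases the decoder outputs 1000011 instead of 0000000, but only in the first case is the received word not a code word, so only there could a pure error detecting scheme have flagged the problem.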
The probability of undetected block error is calculated as:

$$P_u(K_n) = \sum_{i=1}^{M} P(\mathbf{c}_i) \sum_{\substack{j=1 \\ j \ne i}}^{M} P(\mathbf{c}_j \mid \mathbf{c}_i) = \sum_{i=1}^{M} P(\mathbf{c}_i) \sum_{\substack{j=1 \\ j \ne i}}^{M} (1-q)^{n - d(\mathbf{c}_i, \mathbf{c}_j)} q^{d(\mathbf{c}_i, \mathbf{c}_j)} = \frac{1}{M} \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} (1-q)^{n - d(\mathbf{c}_i, \mathbf{c}_j)} q^{d(\mathbf{c}_i, \mathbf{c}_j)} \tag{5.29}$$

where it is assumed that code words are equally likely to occur, that is, $P(\mathbf{c}_i) = 1/M$.

The main complication is the double summation and the need to consider all pairs of code words and their Hamming distance, that is, having to enumerate the Hamming distance between all pairs of distinct code words. For some codes, in particular the binary linear codes discussed in Chapter 6, the distribution of the Hamming distance, $d(\mathbf{c}_i, \mathbf{c}_j)$, for given $\mathbf{c}_i$ is independent of $\mathbf{c}_i$, and this simplifies the calculation as follows:

RESULT 5.6 Probability of Undetected Block Error for Binary Linear Codes
Let $d$ be the Hamming distance between the transmitted code word, $\mathbf{c}_i$, and the received code word, $\mathbf{c}_j$, where $j \ne i$. Let $A_d$ be the number of choices of $\mathbf{c}_j$ which yield the same $d$, where it is assumed this is independent of the transmitted code word $\mathbf{c}_i$. Then the probability of undetected block error is:

$$P_u(K_n) = \sum_{d = d(K_n)}^{n} A_d (1-q)^{n-d} q^{d} \tag{5.30}$$

EXAMPLE 5.20
Consider the Hamming code of Example 5.16. The Hamming code is an example of a binary linear code. The reader can verify that for any choice of $\mathbf{c}_i$:

• there are 7 choices of $\mathbf{c}_j$ such that $d(\mathbf{c}_i, \mathbf{c}_j) = 3$
• there are 7 choices of $\mathbf{c}_j$ such that $d(\mathbf{c}_i, \mathbf{c}_j) = 4$
• there is 1 choice of $\mathbf{c}_j$ such that $d(\mathbf{c}_i, \mathbf{c}_j) = 7$
• there are no choices of $\mathbf{c}_j$ for $d = 5$ and $d = 6$

Thus the probability of undetected block error is:

$$P_u(K_n) = 7(1-q)^4 q^3 + 7(1-q)^3 q^4 + q^7$$

For comparison the block error probability of the Hamming code, from Equation 5.28, is:

$$P_e(K_n) = 1 - \sum_{i=0}^{1} \binom{7}{i} q^i (1-q)^{7-i} = 1 - (1-q)^7 - 7q(1-q)^6$$

Assume a BSC with $q = 0.001$; then if the Hamming code is used as a 1-bit error correcting code it will suffer a block decoding error with probability $P_e(K_n) \approx 2 \times 10^{-5}$, or an incorrect decoding once every 50,000 code words. On the other hand, if the Hamming code is used as a 2-bit error detecting code it will suffer from an undetected block error with probability $P_u(K_n) \approx 7 \times 10^{-9}$, an undetected block error once every 140,000,000 or so code words. The fact that error detection is so much more robust than error correction explains why practical communication systems are ARQ when a feedback channel is physically possible.

5.7 Shannon's Fundamental Coding Theorem

In the previous section we saw, especially for repetition codes, that there is a trade-off between the code rate, $R$, and the block error probability, $P_e(K_n)$. This is generally true, but we would like a formal statement on the achievable code rates and block error probabilities. Such a statement is provided by Shannon's Fundamental Coding Theorem.

THEOREM 5.3 Shannon's Fundamental Coding Theorem for BSCs
Every BSC of capacity $C > 0$ can be encoded with an arbitrary reliability and with code rate, $R(K_n) < C$, arbitrarily close to $C$ for increasing code word lengths $n$. That is, there exist codes $K_1, K_2, K_3, \ldots$ such that $P_e(K_n)$ tends to zero and $R(K_n)$ tends to $C$ with increasing $n$:

$$\lim_{n \to \infty} P_e(K_n) = 0, \qquad \lim_{n \to \infty} R(K_n) = C \tag{5.31}$$

The proof of Shannon's Fundamental Coding Theorem can be quite long and tedious. However it is instructive to examine how the theorem is proved (by concentrating on
the structure of the proof) since this may lead to some useful insight into achieving the limits posed by the theorem (achieving error-free coding with code rate as close to channel capacity as possible).

PROOF Let $\epsilon_1 > 0$ be an arbitrarily small number and suppose we have to design a code $K_n$ of length $n$ such that $R(K_n) = C - \epsilon_1$. To do this we need messages of length $L = n(C - \epsilon_1)$ since:

$$R(K_n) = \frac{L}{N} = \frac{n(C - \epsilon_1)}{n} = C - \epsilon_1 \tag{5.32}$$

This means we need $M = 2^{n(C - \epsilon_1)}$ code words.

We prove the theorem by making use of random codes. Given any number $n$ we can pick $M$ out of the $2^n$ binary words in a random way and obtain the random code $K_n$. If $M = 2^{n(C - \epsilon_1)}$ then we know that $R(K_n) = C - \epsilon_1$, but $P_e(K_n)$ is a random variable, and we consider:

$$\bar{P}_e(n) = E[P_e(K_n)]$$

which is the expected value of $P_e(K_n)$ for a fixed value of $n$, but a completely random choice of $M$ code words. The main, and difficult, part of the proof (see [1, 8]) is to show that:

$$\lim_{n \to \infty} \bar{P}_e(n) = 0$$

We can intuitively see this by noting that a smaller value of $P_e(K_n)$ results for codes with a larger $d(K_n)$, and a larger minimum distance is possible if $M \ll 2^n$. Since for a BSC $C \le 1$, $\epsilon_1 > 0$ and $M = 2^{n(C - \epsilon_1)}$, we see that only for sufficiently large $n$ can we get $M \ll 2^n$ and thus a smaller $P_e(K_n)$.

The surprising part of the proof is that we can achieve a small $P_e(K_n)$ and large $R(K_n)$ with a random choice of $K_n$. Thus the theorem is only of theoretical importance since no practical coding scheme has yet been devised that realises the promises of Shannon's Fundamental Coding Theorem for low error, but high code rate codes. Another stumbling block is that a large $n$ is needed. However this is a sufficient, not necessary, condition of the proof. It may be possible, with clever coding schemes, to approach the limits of the theorem but without resorting to excessively large values of $n$.
Indeed, the family of turbo codes discovered in 1993 and discussed in Chapter 9 nearly achieves the limits of the theorem without being overly complex or requiring large values of $n$.

The converse of the theorem also exists:
THEOREM 5.4 Converse of Shannon's Fundamental Coding Theorem for BSCs
For every BSC of capacity $C$, wherever codes $K_n$ of length $n$ have code rates $R(K_n) > C$ then the codes tend to be unreliable, that is:

$$\lim_{n \to \infty} P_e(K_n) = 1 \tag{5.33}$$

EXAMPLE 5.21
Consider designing a channel coder for a BSC with $q = 0.001$. The channel capacity of the BSC is:

$$C = 1 - \left[ (1-q) \log_2 \frac{1}{1-q} + q \log_2 \frac{1}{q} \right] = 0.9886$$

By Shannon's Fundamental Coding Theorem we should be able to design a code with $M = 2^L$ code words and code word length $N$ such that if the code rate $R = L/N$ satisfies $R < C$ then for sufficiently large $N$ we can achieve an arbitrarily small block error probability $P_e(K_n)$ and make $R$ as close to $C$ as we like. However neither the Theorem nor its proof provides any practical mechanism or coding scheme for constructing such codes.

Conversely, if we choose $M = 2^L$ and $N$ such that $R > C$ then we can never find a code with arbitrarily small block error probability, and indeed the block error probability will tend to 1 (completely unreliable codes). However this does not mean that we can't find a specific code with $R > C$ which will have a reasonably small block error probability.

5.8 Exercises

1. The input to the channel encoder arrives at 4 bits per second (bps). The binary channel can transmit at 5 bps. In the following, for different values of the block encoder length, $L$, determine the length of the code word and where and when any dummy bits are needed and how error protection can be encoded:
(a) $L = 2$
(b) $L = 3$
(c) $L =$
230 Fundamentals of Information Theory and Coding Design(d) Ä2. Consider the following binary channel:0101A B1/23/41/32/3(a) Find the maximum likelihood decoding rule and consequent error proba-bility.(b) Find the minimum error decoding rule and consequent error probability.(c) How useful is the above channel and source with a minimum error de-coding rule?3. A BSC has the following channel matrix and input probability distribution:0101A B0.90.10.90.10.20.8(a) Find the maximum likelihood decoding rule and consequent error prob-ability.(b) Find the minimum error decoding rule and consequent error probability.(c) For practical BSCs Ô Õ and the inputs can be assumed to be closeto equiprobable (i.e., È´¼µ È´½µ). Show why this implies that themaximum likelihood decoding rule is indeed minimum error.*4. Consider the following BEC:0101A0.90.9?0.70.3B
Fundamentals of Channel Coding 231Design a decoding rule for each pair of outputs in terms of the correspondingpair of inputs (e.g., ´ ¼µ ¼¼ means a followed by a ¼ at the outputdecides in favour of a ¼ followed by a ¼ being transmitted at the input of thechannel):(a) Design the decoding rule assuming maximum likelihood decoding.(b) Design the decoding rule assuming minimum error decoding.Assume inputs are produced independently of one another (e.g., È´¼¼µÈ´¼µÈ´¼µ ¼ )5. Consider the following Ä ¿ and Æ channel code:Message Code word0 0 0 0 0 0 0 0 00 0 1 0 0 1 1 1 00 1 0 0 1 0 0 1 10 1 1 0 1 1 1 0 11 0 0 1 0 0 1 1 11 0 1 1 0 1 0 0 11 1 0 1 1 0 1 0 01 1 1 1 1 1 0 1 0(a) Perform Hamming channel decoding for the following received words,indicating the Hamming distance between the received word and de-coded word, and any error detection/correction that arises:i. 0 1 1 1 0 1ii. 1 1 1 0 0 1iii. 0 1 0 0 1 0iv. 0 0 1 1 1 0v. 1 0 0 0 1 0(b) Calculate the minimum distance of the code and hence comment on theerror detection/correction properties.6. What is the minimum code word length you can use for designing a single-biterror correcting code for a binary stream that is block encoded on blocks oflength 4?7. Repeat Qu. 6 for the case of double-bit error correction.8. Prove the following result for binary codes:´Æ µ ¾ ´Æ ½ µby considering a code of length Æ with Å ´Æ µ code words and thenextracting a code of length Æ ½ from this code.
232 Fundamentals of Information Theory and Coding Design9. Prove the following result for binary codes:´Æ ¾Ø · ½µ ´Æ · ½ ¾Ø· ¾µby considering a code of length Æ · ½ with Å ´Æ · ½ µ code wordsand then removing one bit from each code word in such a way that a new codeof length Æ with the same number of code words but distance ½ is created.10. Consider a channel coding with Ä ½ and Æ ¾½. What is the best errordetection and error correction that can be designed?11. A channel coding system uses 4-byte code words and all triple-bit or less errorshave to be detected. What is the Ä for maximum code rate?*12. A channel coding system can only process message blocks and code blockswhich occupy an integer number of bytes (i.e., 8, 16, 32, etc.). Specify a Äand Æ for maximum code rate and best error detection / correction. What isthe error detection / correction capabilities of your design?13. Is it possible to design a maximal code for Ø ¿ bit error correction if codewords are 16 bits in length and only 4 code words are used? How about if 5code words are used?*14. Write a program that attempts to generate up to Å words of length Æ suchthat the minimum distance between the Å words is ´ÃÆµ. One way to dothis is as follows:(a) Start with Ñ ¼.(b) Randomly generate a word of length Æ, ÖÆ.(c) Calculate the Hamming distance ´ÖÆ µ between the generated word,ÖÆ , and each of the Ñ words found so far, ½ ¾ Ñ .(d) If ´ÖÆ µ ´ÃÆ µ for any , reject ÖÆ ; otherwise deﬁne Ñ·½ ÖÆand let Ñ Ñ · ½.(e) Repeat steps (b) to (d) and terminate when Ñ Å.Can a code with Æ , Å ¾ and ´ÃÆ µ ¿ exist? Use your program toanswer this question!15. You are required to design a single-bit error correcting code for messages oflength Ä ½½. What are the minimum number of check bits that you need?Is this is a maximal code? Is this a perfect code?16. Prove that a perfect code for Ø-bit error correction is also a maximal code.*17. Prove that a perfect code for Ø-bit error correction has ´ÃÒµ ¾Ø· ½.18. 
Consider the block code of length $n$:
Fundamentals of Channel Coding 233(a) If the repetition code is used, derive the expression for the block errorprobability È ´ÃÒµ as a function of Ò.(b) A BSC has error probability Õ ¼ ½. Find the smallest length of arepetition code such that È ´ÃÒµ ½¼ ¾ ¾. What is the code rate?*19. Calculate the block error probability, È ´Ã µ, for code Ã from Example 5.8for a BSC with Õ ¼ ¼¼½. Now provide a reasonable estimate for the bit-errorprobability, È ´Ã µ, and compare this value with È ´Ã µ.20. Consider the following design parameters for different channel codes:Code Ã Code Ã Code Ã Code Ã´Ãµ 3 2 4 5Ä 26 11 11 21Æ 31 12 16 31(a) Indicate whether the code implies an FEC system, an ARQ system orboth. In each case specify the error detection/correction properties of thesystem.(b) Derive a simple worst-case expression for the block error probability,È ´ÃÒµ, as a function of the BSC channel error probability, Õ, for bothtypes of error-control systems.(c) Will any of the FEC systems fail to operate (i.e., not be able to correcterrors, only detect them)?21. A channel encoder receives 3 bits per second and is connected to a BSC. IfÕ ¼ ¾ , how many bits per second can we send through the channel for the-oretical error-free transmission according to Shannon’s Fundamental CodingTheorem?22. Consider three single-bit error correcting codes of length Ò ½¾ Ò ½ ,respectively. Calculate the block-error probability È ´ÃÒµ for each code as-suming Õ ¼ ¼½ and a code rate of Ê ¼ . What happens to È ´ÃÒµfor increasing length Ò? Is this contrary to Shannon’s Fundamental CodingTheorem?*23. Let us explore Shannon’s Fundamental Coding Theorem further. Considererror correcting codes of length Ò ½¾ ¾ and a code rate ofÊ ½ ¾. What is the best Ø-bit error correction that can be achieved for each Òand what is the corresponding block error probability È ´ÃÒµ for Õ ¼ ¼½? Isthis in line with the expectations of Shannon’s Fundamental Coding Theorem?Explain! 
Repeat your analysis for different values of $R$ and $q$.
234 Fundamentals of Information Theory and Coding Design(a) 165 bits per minute enter the channel encoder?(b) 170 bits per minute enter the channel encoder?(c) 175 bits per minute enter the channel encoder?25. A BSC transmits bits with error probability, Õ. Letters of the alphabet arriveat 2,100 letters per second. The letters are encoded with a binary code ofaverage length 4.5 bits per letter. The BSC is transmitting at 10,000 bps. Canan arbitrarily error-free coding be theoretically devised if:(a) Õ ¼ ½(b) Õ ¼ ¼½(c) Õ ¼ ¼¼(d) Õ ¼ ¼¼½5.9 References T.M. Cover, and J.A. Thomas, Elements of Information Theory, John Wiley Sons, New York, 1991. E.N. Gilbert, A comparison of signalling alphabets, Bell System Tech. J., 31,504-522, 1952. R.W. Hamming, Error detecting and error correcting codes, Bell System Tech.J., 29, 147-160, 1950. S. Haykin, Communication Systems, John Wiley Sons, New York, 4th ed.,2001. J. C. A. van der Lubbe, Information Theory, Cambridge University Press, Lon-don, 1997. S. Roman, Coding and Information Theory, Springer-Verlag, New York, 1992. C.E. Shannon, A mathematical theory of communication, Bell System Tech. J.,28, 379-423, 623-656, 1948. R. Wells, Applied Coding and Information Theory for Engineers, Prentice-Hall, New York, 1999.
Chapter 6
Error-Correcting Codes

6.1 Introduction

We have seen that the existence of redundancy makes it possible to compress data and so reduce the amount of space or time taken to transmit or store it. A consequence of removing redundancy is that the data become more susceptible to noise.

Conversely, it is possible to increase the redundancy of a set of data by adding to it. The added redundancy can be used to detect when the data have been corrupted by noise and even to undo the corruption. This idea is the basis of the theory of error-correcting codes.

EXAMPLE 6.1
The simplest example of an error-correcting code is the parity check. (A special case of this was discussed in Example 5.5 of Chapter 5.) Suppose that we have a sequence of binary digits that is to be transmitted over a noisy channel that may have the effect of changing some of the 1's to 0's and some of the 0's to 1's. We can implement a parity check by breaking the sequence into blocks of, say, seven bits and adding a redundant bit to make a block of eight bits.

The added bit is a parity bit. It is 0 if there is an even number of 1s in the seven-bit block and 1 if there is an odd number of 1s in the seven-bit block. Alternatively, it is the result of XORing all seven bits together.

When the eight bits are transmitted over a noisy channel, there may be one or more differences between the block that was transmitted and the block that is received. Suppose first that the parity bit is changed. It will then not be consistent with the other seven bits. If the parity bit is unchanged, but one of the other seven bits is changed, the parity bit will once again be inconsistent with the other seven bits. Checking to see if the parity bit is consistent with the other seven bits enables us to check whether there has been a single error in the transmission of the eight-bit block. It does not tell us where the error has occurred, so we cannot correct it.

The parity check also cannot tell us whether more than one error has occurred.
If two bits are changed, the parity bit will still be consistent with the other seven bits. The
same is true if four, six or eight bits are changed. If three bits are changed, the parity bit will not be consistent with the other seven bits, but this is no different from the situation where only one bit is changed. The same is true if five or seven bits are changed.

The parity check can be carried out easily by XORing all eight bits together. If the result is 0, then either no error has occurred, or two, four, six or eight errors have occurred. If the result is 1, then one, three, five or seven errors have occurred.

The parity check is a simple error-detection technique with many limitations. More effective error-correcting capabilities require more sophisticated approaches.

EXAMPLE 6.2
Another simple way of introducing redundancy (discussed in Chapter 5, Example 5.6) is by repetition. If we have a sequence of bits, we can send each bit three times, so that the sequence 01100011 is transmitted as 000111111000000000111111. At the receiving end, the sequence is broken up into blocks of three bits. If there is an error in a block, one of the bits will be different, for example 001 instead of 000. In this case, we take the most frequently occurring bit as the correct one. This scheme allows us to detect and correct one error. If two errors occur within a block, they will be detected, but the correction procedure will give the wrong answer as the majority of bits will be erroneous. If all three bits are changed, this will not be detected.

If it is likely that errors will occur in bursts, this scheme can be modified by breaking the original sequence into blocks and transmitting the blocks three times each.
The error correction procedure will then be applied to the corresponding bits in each of the three copies of the block at the receiving end.

This is an inefficient way of adding redundancy, as it triples the amount of data but only makes it possible to correct single errors.

The theory of error-correcting codes uses many concepts and results from the mathematical subject of abstract algebra. In particular, it uses the notions of rings, fields and linear spaces. The following sections give an introduction to these notions. A rigorous treatment of this material can be found in any textbook on abstract algebra.

6.2 Groups

A group is a collection of objects that can be combined in pairs to produce another object of the collection according to certain rules that require that there be an object
that does not change anything it combines with, and objects that undo the changes resulting from combining with other objects. The formal definition is as follows.

DEFINITION 6.1 Group A group, $(G, *)$, is a pair consisting of a set $G$ and an operation $*$ on that set, that is, a function from the Cartesian product $G \times G$ to $G$, with the result of operating on $a$ and $b$ denoted by $a * b$, which satisfies

1. associativity: $a * (b * c) = (a * b) * c$ for all $a, b, c \in G$;
2. existence of identity: there exists $e \in G$ such that $e * a = a$ and $a * e = a$ for all $a \in G$;
3. existence of inverses: for each $a \in G$ there exists $a^{-1} \in G$ such that $a * a^{-1} = e$ and $a^{-1} * a = e$.

If the operation also satisfies the condition that $a * b = b * a$ for all $a, b \in G$ (commutativity), the group is called a commutative group or an Abelian group. It is common to denote the operation of an Abelian group by $+$, its identity element by $0$ and the inverse of $a \in G$ by $-a$.

EXAMPLE 6.3
The simplest group has just two elements. It is Abelian, so we can denote the elements by $0$ and $a$ and the operation by $+$. From the definition, we must have $0 + 0 = 0$, $0 + a = a$, and $a + 0 = a$. $a$ must be its own inverse, so $a + a = 0$. We can describe the operation by the following table, where the first operand is listed in the column on the left, the second operand is listed in the row at the top and the results of the operation appear in the body of the table.

$$\begin{array}{c|cc} + & 0 & a \\ \hline 0 & 0 & a \\ a & a & 0 \end{array}$$

An operation table will be symmetric about the main diagonal if and only if the group is Abelian.

When we use an operation table to define a group operation, each element of the group will appear exactly once in each row and each column of the body of the table.

DEFINITION 6.2 Order of a Group The order of a group is the number of elements in it.
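For a finite set, the axioms of Definition 6.1 can be checked by brute force from the operation table. The sketch below is illustrative only: the function names are hypothetical, and the second element of the two-element group of Example 6.3 is labelled "a" purely for convenience.

```python
from itertools import product


def is_group(elements, op):
    """Check closure and the three axioms of Definition 6.1 exhaustively."""
    elems = list(elements)
    # Closure: the operation must map G x G into G.
    if any(op(a, b) not in elems for a, b in product(elems, repeat=2)):
        return False
    # Associativity: a*(b*c) == (a*b)*c for all triples.
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elems, repeat=3)):
        return False
    # Identity: some e with e*a == a and a*e == a for all a.
    identities = [e for e in elems
                  if all(op(e, a) == a and op(a, e) == a for a in elems)]
    if not identities:
        return False
    e = identities[0]
    # Inverses: every a has some b with a*b == e and b*a == e.
    return all(any(op(a, b) == e and op(b, a) == e for b in elems)
               for a in elems)


def is_abelian(elements, op):
    """True if the operation table is symmetric about the main diagonal."""
    elems = list(elements)
    return all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))


# Operation table of the two-element group of Example 6.3.
table = {("0", "0"): "0", ("0", "a"): "a", ("a", "0"): "a", ("a", "a"): "0"}
print(is_group(["0", "a"], lambda x, y: table[(x, y)]))    # True
print(is_abelian(["0", "a"], lambda x, y: table[(x, y)]))  # True

# Changing a + a to a leaves "a" without an inverse, so this is not a group.
bad = dict(table)
bad[("a", "a")] = "a"
print(is_group(["0", "a"], lambda x, y: bad[(x, y)]))      # False
```

The exhaustive associativity check needs on the order of $|G|^3$ evaluations, which is fine for the small tables in this section.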
EXAMPLE 6.4
It is easy to show that there is only one group of order 3. Let us denote the elements of the group by $0, 1, 2$ and the operation by $+$.

If we let $0$ denote the identity element, we can start building the operation table as follows:

$$\begin{array}{c|ccc} + & 0 & 1 & 2 \\ \hline 0 & 0 & 1 & 2 \\ 1 & 1 & & \\ 2 & 2 & & \end{array}$$

Now suppose that $1 + 2 = 1$. This would make $1$ appear twice in the second row of the table. This breaks the rule that each element appears exactly once in each row of the body of the table, so we cannot have $1 + 2 = 1$. If $1 + 2 = 2$, $2$ would appear twice in the third column of the table, so we cannot have this either. This leaves $1 + 2 = 0$, which does not break any rules. To complete the second row of the table we have to have $1 + 1 = 2$, and the table now looks like this:

$$\begin{array}{c|ccc} + & 0 & 1 & 2 \\ \hline 0 & 0 & 1 & 2 \\ 1 & 1 & 2 & 0 \\ 2 & 2 & & \end{array}$$

We can now fill in the second and third columns with the missing elements to complete the table:

$$\begin{array}{c|ccc} + & 0 & 1 & 2 \\ \hline 0 & 0 & 1 & 2 \\ 1 & 1 & 2 & 0 \\ 2 & 2 & 0 & 1 \end{array}$$

This is the operation table for addition modulo 3.

In the example above, it is claimed that there is "only one group of order 3." This is true in the sense that any set of three elements with an operation that satisfies the group axioms must have the same structure as the group

$$\begin{array}{c|ccc} + & 0 & 1 & 2 \\ \hline 0 & 0 & 1 & 2 \\ 1 & 1 & 2 & 0 \\ 2 & 2 & 0 & 1 \end{array}$$

regardless of the names of the elements. We regard
$$\begin{array}{c|ccc} + & \alpha & \beta & \gamma \\ \hline \alpha & \alpha & \beta & \gamma \\ \beta & \beta & \gamma & \alpha \\ \gamma & \gamma & \alpha & \beta \end{array}$$

as the same group because, if we replace $\alpha$ by $0$, $\beta$ by $1$ and $\gamma$ by $2$ everywhere in the table above, we will finish up with the previous table.

To make these ideas precise, we need to consider functions that preserve the structure of the group in the following way.

DEFINITION 6.3 Group Homomorphism Let $(G, *)$ and $(H, \circ)$ be two groups and let $f : G \to H$ be a function defined on $G$ with values in $H$. $f$ is a group homomorphism if it preserves the structure of $G$, that is, if

$$f(a * b) = f(a) \circ f(b) \tag{6.1}$$

for all $a, b \in G$.

DEFINITION 6.4 Group Isomorphism A group homomorphism that is both one-one and onto is a group isomorphism.

Two groups that are isomorphic have the same structure and the same number of elements.

If $f : G \to H$ is a homomorphism, the image of the identity of $G$ is the identity of $H$.

EXAMPLE 6.5
There are two groups of order 4. We will denote the elements of the first by $0, 1, 2, 3$ and use $+$ for the operation. The operation table is:

$$\begin{array}{c|cccc} + & 0 & 1 & 2 & 3 \\ \hline 0 & 0 & 1 & 2 & 3 \\ 1 & 1 & 2 & 3 & 0 \\ 2 & 2 & 3 & 0 & 1 \\ 3 & 3 & 0 & 1 & 2 \end{array}$$

For the other, we will use $e, a, b, c$ for the elements and $*$ for the operation. The operation table is
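$$\begin{array}{c|cccc} * & e & a & b & c \\ \hline e & e & a & b & c \\ a & a & e & c & b \\ b & b & c & e & a \\ c & c & b & a & e \end{array}$$

These groups obviously do not have the same structure. In the first group $1 + 1 = 2$ and $3 + 3 = 2$, but in the second group combining any element with itself under $*$ gives the identity, $e$.

EXAMPLE 6.6
There is one group of order 5. Its operation table is

$$\begin{array}{c|ccccc} + & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 0 & 1 & 2 & 3 & 4 \\ 1 & 1 & 2 & 3 & 4 & 0 \\ 2 & 2 & 3 & 4 & 0 & 1 \\ 3 & 3 & 4 & 0 & 1 & 2 \\ 4 & 4 & 0 & 1 & 2 & 3 \end{array}$$

EXAMPLE 6.7
There are two groups of order 6. The operation table of the first is

$$\begin{array}{c|cccccc} + & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline 0 & 0 & 1 & 2 & 3 & 4 & 5 \\ 1 & 1 & 2 & 3 & 4 & 5 & 0 \\ 2 & 2 & 3 & 4 & 5 & 0 & 1 \\ 3 & 3 & 4 & 5 & 0 & 1 & 2 \\ 4 & 4 & 5 & 0 & 1 & 2 & 3 \\ 5 & 5 & 0 & 1 & 2 & 3 & 4 \end{array}$$

For the second, we will use six symbols for the elements and $*$ for the operation. The operation table is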
Error-Correcting Codes 241£ À Á ÂÀ Á ÂÁ Â ÀÀ Â ÁÀ À Â ÁÁ Á Â ÀÂ Â Á ÀThe table is not symmetric, so this group is not Abelian. This is the smallest exampleof a non-commutative group.For any positive integer Ô, there is at least one group of order Ô. There may be morethan one group of order Ô.DEFINITION 6.5 The Cyclic Groups For each positive integer Ô, there is agroup called the cyclic group of order Ô, with set of elementsÔ ¼ ½ ´Ô ½µand operation ¨ deﬁned by¨ · (6.2)if · Ô, where · denotes the usual operation of addition of integers, and¨ · Ô (6.3)if · Ô, where denotes the usual operation of subtraction of integers. Theoperation in the cyclic group is addition modulo Ô. We shall use the sign ·instead of¨ to denote this operation in what follows and refer to “the cyclic group ´ Ô ·µ ”or simply “the cyclic group Ô.”There are also plenty of examples of inﬁnite groups.EXAMPLE 6.8The set of integers, ¿ ¾ ½ ¼ ½ ¾ ¿ , with the operation ofaddition is a group. ¼ is the identity and is the inverse of . This is an Abeliangroup.EXAMPLE 6.9The set of real numbers Ê with the operation of addition is another Abelian group.Again, ¼ is the identity and Ö is the inverse of Ö ¾ Ê.
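Returning to the cyclic groups of Definition 6.5, addition modulo $p$ is easy to tabulate. The following sketch (with illustrative function names that are not from the text) builds the operation table of the cyclic group of order $p$ and finds the inverse of each element.

```python
def cyclic_table(p):
    """Operation table of the cyclic group of order p (addition modulo p)."""
    return [[(a + b) % p for b in range(p)] for a in range(p)]


def additive_inverse(a, p):
    """The element b of {0, ..., p-1} with a + b = 0 modulo p."""
    return (p - a) % p


# The table for p = 3 matches the one derived in Example 6.4.
for row in cyclic_table(3):
    print(row)

# Every element has an inverse, e.g. in the cyclic group of order 5:
print([(a, additive_inverse(a, 5)) for a in range(5)])
```

Each row of the table is a cyclic shift of the row above it, which is one way to see that every element appears exactly once in each row and each column.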
EXAMPLE 6.10
The set of positive real numbers $\mathbb{R}^+$ with the operation of multiplication is also an Abelian group. This time, the identity is $1$ and the inverse of $r \in \mathbb{R}^+$ is $1/r$.

EXAMPLE 6.11
The set of $n \times n$ real matrices with the operation of matrix addition is a group. The zero matrix is the identity and the inverse of a matrix $M$ is the matrix $-M$, consisting of the same elements as $M$ but with their signs reversed.

EXAMPLE 6.12
The set of non-singular $n \times n$ real matrices with the operation of matrix multiplication is a group for any positive integer $n$. The identity matrix is the group identity and the inverse of a matrix is its group inverse. These groups are not commutative for $n \ge 2$.

Groups are not used directly in the construction of error-correcting codes. They form the basis for more complex algebraic structures which are used.

6.3 Rings and Fields

Having two operations that interact with each other makes things more interesting.

DEFINITION 6.6 Ring A ring is a triple $(R, +, \times)$ consisting of a set $R$, and two operations $+$ and $\times$, referred to as addition and multiplication, respectively, which satisfy the following conditions:

1. associativity of $+$: $a + (b + c) = (a + b) + c$ for all $a, b, c \in R$;
2. commutativity of $+$: $a + b = b + a$ for all $a, b \in R$;
3. existence of additive identity: there exists $0 \in R$ such that $0 + a = a$ and $a + 0 = a$ for all $a \in R$;
4. existence of additive inverses: for each $a \in R$ there exists $-a \in R$ such that $a + (-a) = 0$ and $(-a) + a = 0$;
5. associativity of $\times$: $a \times (b \times c) = (a \times b) \times c$ for all $a, b, c \in R$;
6. distributivity of $\times$ over $+$: $a \times (b + c) = (a \times b) + (a \times c)$ for all $a, b, c \in R$.
The additive part of a ring, $(R, +)$, is an Abelian group.

In any ring $(R, +, \times)$, $0 \times r = 0$ and $r \times 0 = 0$ for all $r \in R$.

In many cases, it is useful to consider rings with additional properties.

DEFINITION 6.7 Commutative Ring A ring $(R, +, \times)$ is a commutative ring if $\times$ is a commutative operation, that is, $a \times b = b \times a$ for all $a, b \in R$.

DEFINITION 6.8 Ring with Unity A ring $(R, +, \times)$ is a ring with unity if it has an identity element of the multiplication operation, that is, if there exists $1 \in R$ such that $1 \times a = a$ and $a \times 1 = a$ for all $a \in R$.

DEFINITION 6.9 Division Ring A ring $(R, +, \times)$ is a division ring if it is a ring with unity and every non-zero element has a multiplicative inverse, that is, for every $a \in R$, $a \ne 0$, there exists $a^{-1} \in R$ such that $a \times a^{-1} = 1$ and $a^{-1} \times a = 1$.

DEFINITION 6.10 Field A commutative division ring in which $0 \ne 1$ is a field.

EXAMPLE 6.13
There is one ring with two elements. The additive part of the ring is the cyclic group of order 2, with operation table

$$\begin{array}{c|cc} + & 0 & 1 \\ \hline 0 & 0 & 1 \\ 1 & 1 & 0 \end{array}$$

and the multiplicative part has the operation table

$$\begin{array}{c|cc} \times & 0 & 1 \\ \hline 0 & 0 & 0 \\ 1 & 0 & 1 \end{array}$$

These tables represent the arithmetic operations of addition and multiplication modulo 2. The ring will be denoted by $(\mathbb{Z}_2, +, \times)$ or simply by $\mathbb{Z}_2$.

If we consider $0$ and $1$ to represent truth values, then these tables are the truth tables of the bit operations XOR and AND, respectively.
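The correspondence between arithmetic modulo 2 and bit operations can be confirmed exhaustively, since there are only four pairs of operands. The sketch below uses Python's built-in `^` (bitwise XOR) and `&` (bitwise AND) operators; the helper names are illustrative.

```python
def add_mod2(a, b):
    """Addition in the two-element ring: ordinary addition reduced modulo 2."""
    return (a + b) % 2


def mul_mod2(a, b):
    """Multiplication in the two-element ring: ordinary product modulo 2."""
    return (a * b) % 2


for a in (0, 1):
    for b in (0, 1):
        # Addition modulo 2 agrees with XOR; multiplication agrees with AND.
        print(a, b, add_mod2(a, b), a ^ b, mul_mod2(a, b), a & b)
```

This is why hardware implementations of binary codes can perform all of their field arithmetic with XOR and AND gates.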
$\mathbb{Z}_2$ is a commutative ring, a ring with unity and a division ring, and hence a field. It is the smallest field and the one of most relevance to the construction of error-correcting codes.

We can add a multiplicative structure to the cyclic groups to form rings.

DEFINITION 6.11 The Cyclic Rings For every positive integer $p$, there is a ring $(\mathbb{Z}_p, +, \times)$, called the cyclic ring of order $p$, with set of elements

$$\mathbb{Z}_p = \{0, 1, \ldots, (p-1)\}$$

and operations $+$ denoting addition modulo $p$, and $\times$ denoting multiplication modulo $p$.

The previous example described the cyclic ring of order 2. The following examples show the operation tables of larger cyclic rings.

EXAMPLE 6.14
The operation tables of the cyclic ring of order 3 are

$$\begin{array}{c|ccc} + & 0 & 1 & 2 \\ \hline 0 & 0 & 1 & 2 \\ 1 & 1 & 2 & 0 \\ 2 & 2 & 0 & 1 \end{array} \qquad \begin{array}{c|ccc} \times & 0 & 1 & 2 \\ \hline 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 2 \\ 2 & 0 & 2 & 1 \end{array}$$

$\mathbb{Z}_3$ is a field.

EXAMPLE 6.15
The operation tables of the cyclic ring of order 4 are

$$\begin{array}{c|cccc} + & 0 & 1 & 2 & 3 \\ \hline 0 & 0 & 1 & 2 & 3 \\ 1 & 1 & 2 & 3 & 0 \\ 2 & 2 & 3 & 0 & 1 \\ 3 & 3 & 0 & 1 & 2 \end{array} \qquad \begin{array}{c|cccc} \times & 0 & 1 & 2 & 3 \\ \hline 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 2 & 3 \\ 2 & 0 & 2 & 0 & 2 \\ 3 & 0 & 3 & 2 & 1 \end{array}$$

$\mathbb{Z}_4$ is not a field; $2$ does not have a multiplicative inverse.
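The conclusions of Examples 6.14 and 6.15 can be checked by searching the multiplication table for inverses. This is a brute-force sketch with hypothetical function names; it is meant for small orders only.

```python
def inverses(n):
    """Map each non-zero element of the cyclic ring of order n that has a
    multiplicative inverse to that inverse."""
    found = {}
    for a in range(1, n):
        for b in range(1, n):
            if (a * b) % n == 1:
                found[a] = b
                break
    return found


def is_field(n):
    """A cyclic ring is a field when every non-zero element is invertible."""
    return n > 1 and len(inverses(n)) == n - 1


print(inverses(3), is_field(3))  # {1: 1, 2: 2} True
print(inverses(4), is_field(4))  # {1: 1, 3: 3} False
```

In the cyclic ring of order 4 the element 2 is missing from the result because every multiple of 2 is either 0 or 2 modulo 4, so no product with 2 can equal 1.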
EXAMPLE 6.16
The operation tables of the cyclic ring of order 5 are

$$\begin{array}{c|ccccc} + & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 0 & 1 & 2 & 3 & 4 \\ 1 & 1 & 2 & 3 & 4 & 0 \\ 2 & 2 & 3 & 4 & 0 & 1 \\ 3 & 3 & 4 & 0 & 1 & 2 \\ 4 & 4 & 0 & 1 & 2 & 3 \end{array} \qquad \begin{array}{c|ccccc} \times & 0 & 1 & 2 & 3 & 4 \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 2 & 3 & 4 \\ 2 & 0 & 2 & 4 & 1 & 3 \\ 3 & 0 & 3 & 1 & 4 & 2 \\ 4 & 0 & 4 & 3 & 2 & 1 \end{array}$$

$\mathbb{Z}_5$ is a field.

The cyclic ring of order $p$ is a field if and only if $p$ is a prime number.

The following are examples of infinite rings.

EXAMPLE 6.17
The integers $(\mathbb{Z}, +, \times)$ with the usual operations of addition and multiplication form a commutative ring with unity. It is not a division ring.

EXAMPLE 6.18
The real numbers $(\mathbb{R}, +, \times)$ with the usual operations of addition and multiplication form a field.

EXAMPLE 6.19
The complex numbers $(\mathbb{C}, +, \times)$ with the operations of complex addition and complex multiplication form a field.

EXAMPLE 6.20
The set of $n \times n$ real matrices with the operations of matrix addition and matrix multiplication forms a ring with unity. The zero matrix is the additive identity and the identity matrix is the multiplicative identity. For $n \ge 2$ it is not commutative, and it is not a division ring as only non-singular matrices have multiplicative inverses.

Just as in the case of groups, we can talk about rings having the same structure and about functions which preserve structure.
DEFINITION 6.12 Ring Homomorphism Let $(R, +, \times)$ and $(S, \oplus, \otimes)$ be two rings and let $f : R \to S$ be a function defined on $R$ with values in $S$. $f$ is a ring homomorphism if it preserves the structure of $R$, that is, if

$$f(a + b) = f(a) \oplus f(b) \tag{6.4}$$

and

$$f(a \times b) = f(a) \otimes f(b) \tag{6.5}$$

for all $a, b \in R$.

DEFINITION 6.13 Ring Isomorphism A ring homomorphism that is both one-one and onto is a ring isomorphism.

As with groups, two rings are isomorphic if they have the same structure and the same number of elements.

6.4 Linear Spaces

Linear spaces consist of things which can be added, subtracted and re-scaled.

DEFINITION 6.14 Linear Space A linear space over a field $F$ is a 6-tuple $(V, F, \oplus, +, \times, \circ)$, where $(V, \oplus)$ is a commutative group, $(F, +, \times)$ is a field and $\circ : F \times V \to V$ is a function that satisfies the following conditions:

1. $a \circ (b \circ v) = (a \times b) \circ v$ for all $a, b \in F$ and all $v \in V$;
2. $(a + b) \circ v = (a \circ v) \oplus (b \circ v)$ for all $a, b \in F$ and all $v \in V$;
3. $a \circ (v \oplus w) = (a \circ v) \oplus (a \circ w)$ for all $a \in F$ and all $v, w \in V$;
4. $1 \circ v = v$ for all $v \in V$.

Most of the elementary examples of linear spaces are spaces of geometric vectors; so an alternative name for a linear space is a vector space. The elements of $V$ are called vectors and the operation $\oplus$ is called vector addition, while the elements of $F$ are called scalars and the function $\circ : F \times V \to V$ is called multiplication by scalars.

While we have been careful to use different symbols for the various operations in the definition above, we shall hereafter abuse notation by using the same symbol, $+$, for addition in $V$ as well as for addition in $F$, and by omitting the symbol $\circ$ and denoting
multiplication by a scalar by juxtaposition of a scalar and a vector. We shall also use the same symbol, $0$, for both the group identity in $V$ and the additive identity in $\mathbb{F}$.

The following properties are simple consequences of the definition above:

1. $0v = 0$ for all $v \in V$, where the $0$ on the left hand side belongs to $\mathbb{F}$ and the $0$ on the right hand side belongs to $V$;
2. $\alpha 0 = 0$ for all $\alpha \in \mathbb{F}$, where the $0$ on both sides belongs to $V$;
3. $(-\alpha)v = \alpha(-v) = -(\alpha v)$ for all $\alpha \in \mathbb{F}$ and $v \in V$.

The archetypal linear spaces are the $n$-dimensional real vector spaces.

EXAMPLE 6.21
Let $\mathbb{R}^n$ denote the $n$-fold Cartesian product $\mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}$, consisting of $n$-tuples $(x_1, x_2, \ldots, x_n)$ for $x_i \in \mathbb{R}$, $1 \leq i \leq n$. If we define vector addition by

$$(x_1, x_2, \ldots, x_n) + (y_1, y_2, \ldots, y_n) = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)$$

and multiplication by a scalar by

$$\alpha(x_1, x_2, \ldots, x_n) = (\alpha x_1, \alpha x_2, \ldots, \alpha x_n)$$

for $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$, $\mathbb{R}^n$ becomes a linear space over $\mathbb{R}$.

EXAMPLE 6.22
For an example of a linear space that is not $\mathbb{R}^n$, consider the set $\mathcal{C}$ of continuous functions on the real line. For $f$ and $g \in \mathcal{C}$, define the sum $f + g$ by

$$(f + g)(x) = f(x) + g(x)$$

for all $x \in \mathbb{R}$, and the product of $r \in \mathbb{R}$ and $f \in \mathcal{C}$ by

$$(rf)(x) = r f(x)$$

for all $x \in \mathbb{R}$, where the right hand side of the equation denotes the product of $r$ and $f(x)$ as real numbers.

If $f$ and $g$ are continuous, so are $f + g$ and $rf$. The other properties that addition in $\mathcal{C}$ and multiplication by scalars must satisfy follow from the properties of the real numbers.

Parts of linear spaces can be linear spaces in their own right.
DEFINITION 6.15 Linear Subspace A subset $S$ of a linear space $V$ over a field $\mathbb{F}$ is a linear subspace of $V$ if for all $v, w \in S$ and all $\alpha \in \mathbb{F}$, $v + w \in S$ and $\alpha v \in S$.

EXAMPLE 6.23
If $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ then the set

$$S_1 = \{ rx : r \in \mathbb{R} \}$$

is a linear subspace of $\mathbb{R}^n$. Geometrically, this is a line through the origin.

If $y = (y_1, y_2, \ldots, y_n)$ also belongs to $\mathbb{R}^n$ then the set

$$S_2 = \{ rx + sy : r, s \in \mathbb{R} \}$$

is a linear subspace of $\mathbb{R}^n$. This is a plane through the origin.

EXAMPLE 6.24
The set $\{ f \in \mathcal{C} : f(0) = 0 \}$ is a linear subspace of $\mathcal{C}$. To check this, note that if $f(0) = 0$ and $g(0) = 0$, then $(f + g)(0) = f(0) + g(0) = 0$, and if $r \in \mathbb{R}$, then $(rf)(0) = r f(0) = 0$.

It is quite easy to create subspaces of a linear space.

DEFINITION 6.16 Linear Combination If $V$ is a linear space over $\mathbb{F}$, a (finite) linear combination of elements of $V$ is a sum

$$\sum_{i=1}^{N} \alpha_i s_i \tag{6.6}$$

where $N$ is a positive integer, the $\alpha_i \in \mathbb{F}$ and the $s_i \in V$.

This definition includes the case where $N = 1$.

DEFINITION 6.17 Linear Span Let $S$ be a subset of the linear space $V$. The linear span of $S$ is the set

$$\operatorname{span}(S) = \left\{ \sum_{i=1}^{N} \alpha_i s_i : N \geq 1,\ \alpha_i \in \mathbb{F},\ s_i \in S \right\} \tag{6.7}$$

consisting of all finite linear combinations of elements of $S$.
The linear span of $S$ is also known as the linear subspace generated by $S$.

It is possible for different sets to generate the same linear subspace.

EXAMPLE 6.25
Consider the linear subspace of $\mathbb{R}^3$ generated by the vectors $(1, 0, 0)$ and $(0, 1, 0)$. This is the set $\{ (x, y, 0) \in \mathbb{R}^3 : x, y \in \mathbb{R} \}$. It is also generated by the vectors $(1, 1, 0)$ and $(1, -1, 0)$.

The same subspace is also generated by the vectors $(1, 0, 0)$, $(0, 1, 0)$ and $(3, -2, 0)$. There is a redundancy in this, since we can express $(3, -2, 0)$ as a linear combination of the other two vectors,

$$(3, -2, 0) = 3(1, 0, 0) + (-2)(0, 1, 0)$$

and so we can reduce any linear combination of the three vectors to a linear combination of the first two by putting

$$\alpha(1, 0, 0) + \beta(0, 1, 0) + \gamma(3, -2, 0) = (\alpha + 3\gamma)(1, 0, 0) + (\beta - 2\gamma)(0, 1, 0)$$

for any $\alpha, \beta, \gamma \in \mathbb{R}$.

It is important to distinguish sets that have the redundancy property illustrated in the example above from sets that do not possess this property.

DEFINITION 6.18 Linearly Independent A subset $S$ of a linear space $V$ over $\mathbb{F}$ is linearly independent if for any set of vectors $\{ s_1, s_2, \ldots, s_n \}$ contained in $S$, the equation

$$\sum_{i=1}^{n} \alpha_i s_i = 0 \tag{6.8}$$

implies that all the $\alpha_i = 0$.

A set that is not linearly independent is linearly dependent.

EXAMPLE 6.26
$\{ (1, 1, 1), (1, -1, 1) \}$ is a linearly independent subset of $\mathbb{R}^3$. The equation

$$\alpha_1 (1, 1, 1) + \alpha_2 (1, -1, 1) = (0, 0, 0)$$

implies the three equations

$$\begin{aligned}
\alpha_1 + \alpha_2 &= 0 \\
\alpha_1 - \alpha_2 &= 0 \\
\alpha_1 + \alpha_2 &= 0
\end{aligned}$$
The first and third equations are identical, but the only solution of the first and second equations is $\alpha_1 = 0$, $\alpha_2 = 0$.

$\{ (1, 1, 1), (1, -1, 1), (0, 1, 0) \}$ is a linearly dependent subset of $\mathbb{R}^3$. The equation

$$\alpha_1 (1, 1, 1) + \alpha_2 (1, -1, 1) + \alpha_3 (0, 1, 0) = (0, 0, 0)$$

implies the three equations

$$\begin{aligned}
\alpha_1 + \alpha_2 &= 0 \\
\alpha_1 - \alpha_2 + \alpha_3 &= 0 \\
\alpha_1 + \alpha_2 &= 0
\end{aligned}$$

This set of equations has infinitely many solutions, for example, $\alpha_1 = 1$, $\alpha_2 = -1$, $\alpha_3 = -2$.

A linearly independent set that generates a vector space has the property that removing any vector from the set will produce a set that no longer generates the space while adding a vector to the set will produce a set that is no longer linearly independent. Such sets are important enough to be given a special name.

DEFINITION 6.19 Basis If $V$ is a linear space over $\mathbb{F}$, a basis for $V$ is a linearly independent subset of $V$ that generates the whole of $V$.

EXAMPLE 6.27
The set of vectors $\{ (1, 0, 0, \ldots, 0), (0, 1, 0, \ldots, 0), (0, 0, 1, \ldots, 0), \ldots, (0, 0, 0, \ldots, 1) \}$ forms a basis for $\mathbb{R}^n$, known as the standard basis. There are lots of others.

For $n = 3$, the standard basis is $\{ (1, 0, 0), (0, 1, 0), (0, 0, 1) \}$. Other bases are $\{ (1, 1, 0), (1, -1, 0), (0, 0, 1) \}$, and $\{ (1, 2, 3), (1, 2, 1), (\,\ldots\,, 1, 2) \}$.

It can be shown that every basis of a linear space contains the same number of vectors. This makes the following definition unambiguous.

DEFINITION 6.20 Dimension The dimension of a linear space is the number of elements in any basis for the space.

EXAMPLE 6.28
The existence of the standard basis for $\mathbb{R}^n$ shows that its dimension is $n$.
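Linear independence of a finite set of vectors in $\mathbb{R}^n$ can be tested by computing the rank of the matrix whose rows are the vectors: the set is independent exactly when the rank equals the number of vectors, and two sets generate the same subspace when each has the same rank as their union. The sketch below assumes rational entries so that the arithmetic is exact; the names `rank`, `is_linearly_independent` and `same_span` are hypothetical helpers, not part of the text.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of rational vectors, by Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_linearly_independent(vectors):
    """A finite set is independent when its rank equals its size."""
    return rank(vectors) == len(vectors)


def same_span(a, b):
    """Two finite sets span the same subspace when neither enlarges the other."""
    return rank(a) == rank(b) == rank(list(a) + list(b))


print(is_linearly_independent([(1, 1, 1), (1, -1, 1)]))
print(is_linearly_independent([(1, 1, 1), (1, -1, 1), (0, 1, 0)]))
print(same_span([(1, 0, 0), (0, 1, 0)], [(1, 1, 0), (1, -1, 0)]))
```

A set of $n$ vectors in $\mathbb{R}^n$ is a basis precisely when `is_linearly_independent` returns `True` for it.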
The dimension of a linear space is also the maximum number of elements that can be contained in a linearly independent subset of that space. There are linear spaces in which it is possible to find arbitrarily large linearly independent subsets. These spaces do not have a finite basis.

EXAMPLE 6.29
Consider the space $\mathcal{C}$ of continuous functions on the real line. Define the polynomial functions by

$$\begin{aligned}
p_0(x) &= 1 \\
p_1(x) &= x \\
p_2(x) &= x^2 \\
&\;\;\vdots \\
p_n(x) &= x^n
\end{aligned}$$

for $x \in \mathbb{R}$.

Any collection of these functions is a linearly independent subset of $\mathcal{C}$, but no such collection generates it. $\mathcal{C}$ therefore does not have a finite basis.

6.5 Linear Spaces over the Binary Field

We have seen that the ring $\mathbb{Z}_2$, with addition and multiplication tables

$$\begin{array}{c|cc}
+ & 0 & 1 \\ \hline
0 & 0 & 1 \\
1 & 1 & 0
\end{array}
\qquad
\begin{array}{c|cc}
\times & 0 & 1 \\ \hline
0 & 0 & 0 \\
1 & 0 & 1
\end{array}$$

is the smallest example of a field. We will be using this field almost exclusively in the development of the theory of error-correcting codes. From now on, we will refer to it as the binary field and denote it by $\mathbb{B}$.

For any $n$, there are exactly $2^n$ $n$-tuples of elements of $\mathbb{B}$. For convenience, we will denote these simply by concatenating the bits, without commas in between or round brackets before and after them. (This also makes the elements of $\mathbb{B}^n$ look like sequences of bits.) Using these conventions, we have

$$\mathbb{B} = \{ 0, 1 \}$$

$$\mathbb{B}^2 = \{ 00, 01, 10, 11 \}$$
$$\mathbb{B}^3 = \{ 000, 001, 010, 011, 100, 101, 110, 111 \}$$

$$\mathbb{B}^4 = \{ 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111 \}$$

and so on.

We can define addition on $\mathbb{B}^n$ component-wise, so that, for example, in $\mathbb{B}^4$,

$$0011 + 0101 = 0110$$

while multiplication by elements of $\mathbb{B}$ is defined very simply by setting $0b = 0$ and $1b = b$ for all $b \in \mathbb{B}^n$. These operations make $\mathbb{B}^n$ into a linear space over $\mathbb{B}$.

We can represent the members of $\mathbb{B}^n$ as the vertices of the unit cube in $n$-dimensional space.

The definitions of linear combinations, linear independence, bases and dimension given above for general linear spaces all apply to linear spaces over $\mathbb{B}$. A linear subspace of $\mathbb{B}^n$ of dimension $k$ has exactly $2^k$ elements.

EXAMPLE 6.30
The following are linear subspaces of $\mathbb{B}^4$:

$\{ 0000 \}$
$\{ 0000, 1111 \}$
$\{ 0000, 1000, 0010, 1010 \}$
$\{ 0000, 1100, 0110, 0011, 1010, 1111, 0101, 1001 \}$

Their dimensions are zero, one, two and three, respectively.

The following are not linear subspaces of $\mathbb{B}^4$:

$\{ 0101 \}$
$\{ 0101, 1010 \}$
$\{ 0101, 1010, 1111 \}$
$\{ 1000, 0100, 0010, 0001 \}$
$\{ 0000, 1000, 0010, 1010, 1111 \}$
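The sets in Example 6.30 can be checked mechanically. Over the binary field the only scalars are 0 and 1, so a non-empty subset of $\mathbb{B}^n$ is a linear subspace exactly when it is closed under component-wise addition. The following Python sketch represents words as bit strings; that representation and the helper names `add`, `is_subspace` and `dimension` are assumptions made for illustration.

```python
def add(x, y):
    """Component-wise sum of two bit strings over the binary field (XOR)."""
    return "".join("1" if a != b else "0" for a, b in zip(x, y))


def is_subspace(words):
    """A non-empty set of bit strings is a subspace if it is closed under addition."""
    s = set(words)
    return bool(s) and all(add(x, y) in s for x in s for y in s)


def dimension(words):
    """Dimension k of a subspace with 2**k elements."""
    return len(set(words)).bit_length() - 1


subspaces = [
    ["0000"],
    ["0000", "1111"],
    ["0000", "1000", "0010", "1010"],
    ["0000", "1100", "0110", "0011", "1010", "1111", "0101", "1001"],
]
not_subspaces = [
    ["0101"],
    ["0101", "1010"],
    ["0101", "1010", "1111"],
    ["1000", "0100", "0010", "0001"],
    ["0000", "1000", "0010", "1010", "1111"],
]

print([dimension(s) for s in subspaces if is_subspace(s)])
print([is_subspace(s) for s in not_subspaces])
```

Closure under addition already forces the zero word to be present, because $x + x = 0$ for every $x \in \mathbb{B}^n$.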
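DEFINITION 6.21 Coset If $L$ is a linear subspace of $\mathbb{B}^n$, and $x \in \mathbb{B}^n$, then the set of vectors

$$x + L = \{ x + y : y \in L \} \tag{6.9}$$

is a coset of $L$.

Cosets are also known as affine subspaces. $x + L = L$ if and only if $x \in L$, and $y + L = z + L$ if and only if $y \in z + L$. In particular, $0 + L = L$ as $0$ always belongs to $L$.

EXAMPLE 6.31
$L = \{ 00, 11 \}$ is a subspace of $\mathbb{B}^2$. The cosets of $L$ are:

$$\begin{aligned}
00 + L &= 11 + L = L \\
01 + L &= 10 + L = \{ 01, 10 \}
\end{aligned}$$

EXAMPLE 6.32
$L = \{ 000, 111 \}$ is a subspace of $\mathbb{B}^3$. The cosets of $L$ are:

$$\begin{aligned}
000 + L &= 111 + L = L \\
001 + L &= 110 + L = \{ 001, 110 \} \\
010 + L &= 101 + L = \{ 010, 101 \} \\
100 + L &= 011 + L = \{ 100, 011 \}
\end{aligned}$$

EXAMPLE 6.33
$L = \{ 000, 101, 010, 111 \}$ is a subspace of $\mathbb{B}^3$. The cosets of $L$ are:

$$\begin{aligned}
000 + L &= 101 + L = 010 + L = 111 + L = L \\
001 + L &= 100 + L = 011 + L = 110 + L = \{ 001, 100, 011, 110 \}
\end{aligned}$$

DEFINITION 6.22 Weight The weight of an element of $\mathbb{B}^n$ is the number of ones in it.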
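The cosets listed in Examples 6.31 to 6.33 can be enumerated by adding each word of $\mathbb{B}^n$ to the subspace and discarding repeats. This is a small sketch under the assumption that words are stored as bit strings; `add` and `cosets` are illustrative names.

```python
from itertools import product


def add(x, y):
    """Component-wise sum of two bit strings over the binary field (XOR)."""
    return "".join("1" if a != b else "0" for a, b in zip(x, y))


def cosets(subspace, n):
    """Distinct cosets x + L of a subspace L of B^n, each as a sorted list."""
    seen = set()
    result = []
    for bits in product("01", repeat=n):
        x = "".join(bits)
        if x in seen:
            continue  # x already lies in a coset found earlier
        coset = sorted(add(x, y) for y in subspace)
        seen.update(coset)
        result.append(coset)
    return result


for c in cosets(["000", "111"], 3):
    print(c)
```

Since the cosets partition $\mathbb{B}^n$ into sets of equal size, a subspace of dimension $k$ has $2^{n-k}$ cosets.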
We repeat the following definition from Chapter 5.

DEFINITION 6.23 Hamming Distance The Hamming distance between $x$ and $y \in \mathbb{B}^n$ is the number of places where they differ.

The Hamming distance between $x$ and $y$ is equal to the weight of $x - y$.

EXAMPLE 6.34
The following table shows the weights of the elements of $\mathbb{B}^2$.

$$\begin{array}{cc}
\text{Element} & \text{Weight} \\ \hline
00 & 0 \\
01 & 1 \\
10 & 1 \\
11 & 2
\end{array}$$

The following table shows the Hamming distance between the pairs of elements of $\mathbb{B}^2$.

$$\begin{array}{c|cccc}
 & 00 & 01 & 10 & 11 \\ \hline
00 & 0 & 1 & 1 & 2 \\
01 & 1 & 0 & 2 & 1 \\
10 & 1 & 2 & 0 & 1 \\
11 & 2 & 1 & 1 & 0
\end{array}$$

EXAMPLE 6.35
The following table shows the weights of the elements of $\mathbb{B}^3$.

$$\begin{array}{cc}
\text{Element} & \text{Weight} \\ \hline
000 & 0 \\
001 & 1 \\
010 & 1 \\
011 & 2 \\
100 & 1 \\
101 & 2 \\
110 & 2 \\
111 & 3
\end{array}$$
The following table shows the Hamming distance between the pairs of elements of $\mathbb{B}^3$.

$$\begin{array}{c|cccccccc}
 & 000 & 001 & 010 & 011 & 100 & 101 & 110 & 111 \\ \hline
000 & 0 & 1 & 1 & 2 & 1 & 2 & 2 & 3 \\
001 & 1 & 0 & 2 & 1 & 2 & 1 & 3 & 2 \\
010 & 1 & 2 & 0 & 1 & 2 & 3 & 1 & 2 \\
011 & 2 & 1 & 1 & 0 & 3 & 2 & 2 & 1 \\
100 & 1 & 2 & 2 & 3 & 0 & 1 & 1 & 2 \\
101 & 2 & 1 & 3 & 2 & 1 & 0 & 2 & 1 \\
110 & 2 & 3 & 1 & 2 & 1 & 2 & 0 & 1 \\
111 & 3 & 2 & 2 & 1 & 2 & 1 & 1 & 0
\end{array}$$

6.6 Linear Codes

We construct error-correcting codes by finding subsets of $\mathbb{B}^n$ with desirable properties.

DEFINITION 6.24 Binary Block Code A binary block code is a subset of $\mathbb{B}^n$ for some $n$. Elements of the code are called code words.

EXAMPLE 6.36
We can encode the alphabet using a subset of $\mathbb{B}^5$. We let $00001$ stand for A, let $00010$ stand for B, and so on, until $11010$ stands for Z. The remaining six elements of $\mathbb{B}^5$ are not included in the code.

Such simple codes do not have error-correcting properties. We need to have codes with more structure.

DEFINITION 6.25 Linear Code A linear code is a linear subspace of $\mathbb{B}^n$.

Linear codes are subsets of $\mathbb{B}^n$ with a linear structure added. Because multiplication is trivial in linear spaces over a binary field, a binary code is a linear code if and only if the sum of two code words is also a code word.
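The closure criterion just stated gives a direct test of whether a block code is linear. The sketch below applies it to the alphabet code of Example 6.36, under the assumption that the letters are numbered 1 to 26 and written as five-bit binary numbers; `add` and `is_linear` are hypothetical helper names.

```python
import string


def add(x, y):
    """Component-wise sum of two bit strings over the binary field (XOR)."""
    return "".join("1" if a != b else "0" for a, b in zip(x, y))


def is_linear(code_words):
    """A non-empty binary code is linear if sums of code words are code words."""
    s = set(code_words)
    return bool(s) and all(add(x, y) in s for x in s for y in s)


# Assumed numbering: A -> 00001, B -> 00010, ..., Z -> 11010.
alphabet_code = {
    letter: format(i, "05b")
    for i, letter in enumerate(string.ascii_uppercase, start=1)
}

print(alphabet_code["A"], alphabet_code["B"], alphabet_code["Z"])
print(is_linear(alphabet_code.values()))
print(is_linear(["000", "011", "101", "110"]))
```

The alphabet code fails the test because, for instance, the sum of any code word with itself is $00000$, which is not a code word.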
DEFINITION 6.26 Minimum Distance The minimum distance of a linear code is the minimum of the weights of the non-zero code words.

The minimum distance of a linear code is also the minimum of the Hamming distances between pairs of distinct code words.

The relationships between the minimum distance of a code and its capabilities in respect of detecting and correcting errors that were discussed in Chapter 5 also hold for linear codes. The additional structure gives us systematic ways of constructing codes with good error detecting and error correcting properties. Details of such results can be found in Chapter 4 of […], Chapter 7 of […] and Chapter 4 of […].

EXAMPLE 6.37
In $\mathbb{B}^2$, $\{ 00, 01 \}$ and $\{ 00, 10 \}$ are linear codes with minimum distance 1 while $\{ 00, 11 \}$ is a linear code with minimum distance 2.

EXAMPLE 6.38
The following two-dimensional codes in $\mathbb{B}^3$ have minimum distance 1:

$\{ 000, 001, 010, 011 \}$
$\{ 000, 001, 100, 101 \}$
$\{ 000, 010, 100, 110 \}$
$\{ 000, 001, 110, 111 \}$
$\{ 000, 010, 101, 111 \}$
$\{ 000, 100, 011, 111 \}$

The code $\{ 000, 011, 101, 110 \}$ has minimum distance 2.

Let $L$ be a linear code. If $L$ is a $k$-dimensional subspace of $\mathbb{B}^n$, then we can find a basis for $L$ consisting of $k$ code words $c_1, c_2, \ldots, c_k$ in $\mathbb{B}^n$. Every code word in $L$ is a linear combination of these basis code words. There is a convenient matrix notation for linear codes that uses this fact.

DEFINITION 6.27 Generator Matrix A generator matrix for a linear code is a binary matrix whose rows are the code words belonging to some basis for the code.

A generator matrix for a $k$-dimensional linear code in $\mathbb{B}^n$ is a $k \times n$ matrix whose rank is $k$. Conversely, any $k \times n$ binary matrix with rank $k$ is a generator matrix for some code.
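Because every code word is a linear combination of the rows of a generator matrix, the whole code can be listed by running through all $2^k$ binary coefficient vectors, and the minimum distance can then be read off from the weights. The following is a minimal Python sketch; the matrix used is a basis of the distance-2 code from Example 6.38, and the names `encode`, `code_words`, `weight`, `distance` and `minimum_distance` are illustrative rather than taken from the text.

```python
from itertools import combinations, product


def encode(b, G):
    """Row vector b times generator matrix G, with arithmetic modulo 2."""
    n = len(G[0])
    return tuple(sum(b[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n))


def code_words(G):
    """All linear combinations of the rows of G."""
    return {encode(b, G) for b in product((0, 1), repeat=len(G))}


def weight(w):
    """Number of ones in a word."""
    return sum(w)


def distance(x, y):
    """Hamming distance: number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def minimum_distance(code):
    """Minimum weight of the non-zero code words of a linear code."""
    return min(weight(w) for w in code if any(w))


G = [[0, 1, 1],
     [1, 0, 1]]
code = code_words(G)
print(sorted(code))
print(minimum_distance(code))
# The two characterisations of minimum distance agree for a linear code.
print(min(distance(x, y) for x, y in combinations(code, 2)))
```

Enumerating all $2^k$ combinations is only practical for small $k$, which is all the examples in this section require.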
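We can compute the code words from the generator matrix by multiplying it on the left by all the row vectors of dimension $k$.

EXAMPLE 6.39
The code $\{ 0000, 0001, 1000, 1001 \}$ is a two-dimensional linear code in $\mathbb{B}^4$. $\{ 0001, 1000 \}$ is a basis for this code, which gives us the generator matrix

$$G = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

To find the code words from the generator matrix, we perform the following matrix multiplications:

$$\begin{aligned}
\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix} &= \begin{bmatrix} 1 & 0 & 0 & 1 \end{bmatrix}
\end{aligned}$$

This gives us the four code words, $0000$, $1000$, $0001$ and $1001$, with which we started.

EXAMPLE 6.40
$\{ 0000, 0011, 0110, 1100, 0101, 1111, 1010, 1001 \}$ is a three-dimensional linear code in $\mathbb{B}^4$.

$\{ 0011, 0110, 1100 \}$ is a basis for this code, which gives us the generator matrix

$$G = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$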
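To recover the code words from the generator matrix, we perform the following matrix multiplications:

$$\begin{aligned}
\begin{bmatrix} 0 & 0 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 0 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 1 & 1 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 0 & 1 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 1 & 0 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 0 & 0 & 1 \end{bmatrix}
\end{aligned}$$

$\{ 0101, 1001, 1010 \}$ is also a basis for the code. It gives us the generator matrix

$$G' = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

The code words can also be recovered from this matrix, but in a different order.

$$\begin{aligned}
\begin{bmatrix} 0 & 0 & 0 \end{bmatrix} G' &= \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 0 & 1 \end{bmatrix} G' &= \begin{bmatrix} 1 & 0 & 1 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 & 0 \end{bmatrix} G' &= \begin{bmatrix} 1 & 0 & 0 & 1 \end{bmatrix}
\end{aligned}$$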
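$$\begin{aligned}
\begin{bmatrix} 0 & 1 & 1 \end{bmatrix} G' &= \begin{bmatrix} 0 & 0 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 & 0 \end{bmatrix} G' &= \begin{bmatrix} 0 & 1 & 0 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 & 1 \end{bmatrix} G' &= \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 0 \end{bmatrix} G' &= \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} G' &= \begin{bmatrix} 0 & 1 & 1 & 0 \end{bmatrix}
\end{aligned}$$

EXAMPLE 6.41

$$G = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

is a $3 \times 5$ binary matrix of rank 3. The code words of the three-dimensional linear code for which $G$ is a generator matrix can be found by the following matrix multiplications.

$$\begin{aligned}
\begin{bmatrix} 0 & 0 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 0 & 0 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 1 & 1 & 1 \end{bmatrix} \\
\begin{bmatrix} 0 & 1 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 0 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \end{bmatrix}
\end{aligned}$$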
$$\begin{aligned}
\begin{bmatrix} 1 & 0 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 1 & 1 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 0 \end{bmatrix} G &= \begin{bmatrix} 0 & 0 & 1 & 1 & 0 \end{bmatrix} \\
\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} G &= \begin{bmatrix} 1 & 1 & 0 & 0 & 1 \end{bmatrix}
\end{aligned}$$

The code is

$$\{ 00000, 00001, 00110, 00111, 11000, 11001, 11110, 11111 \}$$

Because a linear space can have many bases, a linear code can have many generator matrices. This raises the question of when two generator matrices generate the same code.

DEFINITION 6.28 Elementary Row Operation An elementary row operation on a binary matrix consists of replacing a row of the matrix with the sum of that row and any other row.

If we have a generator matrix $G$ for a linear code $L$, all other generator matrices for $L$ can be obtained by applying a sequence of elementary row operations to $G$.

EXAMPLE 6.42
In a previous example, we saw that the linear code

$$\{ 0000, 0011, 0110, 1100, 0101, 1111, 1010, 1001 \}$$

has the following generator matrices:

$$G = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$

and

$$G' = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

We can change $G$ to $G'$ by the following sequence of elementary row operations.
1. Replace the third row with the sum of the second and third rows. This gives

$$G_1 = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

2. Replace the first row with the sum of the first and second rows. This gives

$$G_2 = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

3. Replace the second row with the sum of the first and second rows. This gives

$$G_3 = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

4. Finally, replace the second row with the sum of the second and third rows. This gives

$$G' = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$

For error-correcting purposes, two codes that have the same minimum distance have the same properties. One way in which we can change a linear code without changing its minimum distance is to change the order of the bits in all the code words.

DEFINITION 6.29 Equivalent Codes Two codes are equivalent if each can be constructed from the other by reordering the bits of each code word in the same way.

The generator matrices of equivalent codes can be obtained from each other by interchanging columns.

DEFINITION 6.30 Canonical Form The generator matrix $G$ of a $k$-dimensional linear code in $\mathbb{B}^n$ is in canonical form if it is of the form

$$G = [\, I : A \,]$$

where $I$ is a $k \times k$ identity matrix and $A$ is an arbitrary $k \times (n - k)$ binary matrix.
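The four row operations of Example 6.42 can be replayed in a few lines to confirm that they turn $G$ into $G'$ and that the set of code words is unchanged at every step. This is a sketch only; `add_row` and `code_words` are hypothetical helper names, and rows are indexed from zero in the code.

```python
from itertools import product


def add_row(M, target, source):
    """Copy of M with row `target` replaced by (row target + row source) mod 2."""
    M = [row[:] for row in M]
    M[target] = [(a + b) % 2 for a, b in zip(M[target], M[source])]
    return M


def code_words(G):
    """All linear combinations of the rows of G over the binary field."""
    n = len(G[0])
    return {
        tuple(sum(b[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n))
        for b in product((0, 1), repeat=len(G))
    }


G = [[0, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0]]

# (target, source) pairs for steps 1 to 4 of Example 6.42, zero-indexed.
steps = [(2, 1), (0, 1), (1, 0), (1, 2)]

original_code = code_words(G)
M = G
for target, source in steps:
    M = add_row(M, target, source)
    assert code_words(M) == original_code  # row operations preserve the code

for row in M:
    print(row)
```

Column interchanges, by contrast, generally change the set of code words while preserving the weights of the code words, which is why they lead to an equivalent code rather than the same code.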
It is possible to convert the generator matrix for a code using elementary row operations (which do not change the set of code words) and column interchanges into the generator matrix of an equivalent code in the canonical form. The code words derived from a generator matrix in canonical form are also said to be in canonical form or in systematic form.

If the generator matrix $G$ is in canonical form, and $w$ is any $k$-bit word, the code word $s = wG$ is in systematic form and the first $k$ bits of $s$ are the same as the bits of $w$.

EXAMPLE 6.43
We have seen in a previous example that

$$G = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

is a generator matrix for the code $\{ 0000, 0001, 1000, 1001 \}$.

To reduce this to canonical form, we first interchange the first and second columns to get

$$G_1 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

and then interchange the first and last columns to get

$$G_2 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

This is now in canonical form, with

$$A = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$

The code generated by the canonical form of the generator matrix is

$$\{ 0000, 1000, 0100, 1100 \}$$

Note that both codes have minimum distance 1.

EXAMPLE 6.44

$$G = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$
is a generator matrix of the linear code $\{ 0000, 0011, 0110, 1100, 0101, 1111, 1010, 1001 \}$.

To reduce $G$ to canonical form, we begin by applying the elementary row operation of replacing the second row of $G$ by the sum of the first and second rows to give

$$G_1 = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$

Next we replace the third row by the sum of the second and third rows to give

$$G_2 = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

Finally we interchange the first and third columns to give

$$G_3 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$

This is in canonical form with

$$A = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$

The code generated by the canonical form of $G$ is the same as the code generated by $G$.

EXAMPLE 6.45

$$G = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

is a generator matrix of the linear code

$$\{ 00000, 00001, 00110, 00111, 11000, 11001, 11110, 11111 \}$$

To reduce $G$ to canonical form we start by replacing the third row with the sum of the second and third rows to give

$$G_1 = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \end{bmatrix}$$
Next, we replace the second row with the sum of the first and second rows to give

$$G_2 = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \end{bmatrix}$$

We interchange the first and last columns, obtaining

$$G_3 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

and finally interchange the second and third columns, to get

$$G_4 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$$

$G_4$ is now in canonical form, with

$$A = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}$$

It generates the code

$$\{ 00000, 10000, 01010, 00101, 11010, 10101, 01111, 11111 \}$$

DEFINITION 6.31 Parity Check Matrix The parity check matrix of a linear code with $k \times n$ generator matrix $G$ is the $(n - k) \times n$ matrix $H$ satisfying

$$G H^T = 0 \tag{6.10}$$

where $H^T$ is the transpose of $H$ and $0$ denotes the $k \times (n - k)$ zero matrix.

If $G$ is in canonical form, $G = [\, I : A \,]$, with $I$ the $k \times k$ identity matrix, then $H = [\, A^T : I \,]$, where $I$ is the $(n - k) \times (n - k)$ identity matrix. (This is true if $G$ and $H$ are binary matrices. In general, we should have $H = [\, -A^T : I \,]$. In the binary case, $-A = A$.) If $G$ is not in canonical form, we can find $H$ by reducing $G$ to canonical form, finding the canonical form of $H$ using the equation above, and then reversing the column operations used to convert $G$ to canonical form to convert the canonical form of $H$ to the parity check matrix of $G$.
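The construction $H = [\, A^T : I \,]$ is easy to carry out programmatically. The sketch below builds $H$ from a generator matrix that is already in canonical form and checks equation (6.10); it uses the canonical matrix $G_4$ obtained in Example 6.45, and the function names `parity_check_from_canonical` and `times_transpose` are assumptions made for this illustration.

```python
def parity_check_from_canonical(G):
    """Build H = [A^T : I] from a binary generator matrix G = [I : A]."""
    k, n = len(G), len(G[0])
    identity = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
    if [row[:k] for row in G] != identity:
        raise ValueError("generator matrix is not in canonical form")
    A = [row[k:] for row in G]
    H = []
    for i in range(n - k):
        a_transpose_row = [A[j][i] for j in range(k)]
        identity_row = [1 if c == i else 0 for c in range(n - k)]
        H.append(a_transpose_row + identity_row)
    return H


def times_transpose(G, H):
    """The product G H^T with arithmetic modulo 2."""
    return [[sum(g * h for g, h in zip(g_row, h_row)) % 2 for h_row in H]
            for g_row in G]


G4 = [[1, 0, 0, 0, 0],
      [0, 1, 0, 1, 0],
      [0, 0, 1, 0, 1]]

H = parity_check_from_canonical(G4)
for row in H:
    print(row)
print(times_transpose(G4, H))
```

For a generator matrix that is not in canonical form, the column interchanges used in the reduction would still have to be undone on $H$ by hand, as described above; this sketch does not automate that step.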
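EXAMPLE 6.46

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{bmatrix}$$

is in canonical form with

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 1 & 0 \end{bmatrix}$$

$H$ is obtained by transposing $A$ and adjoining a $2 \times 2$ identity matrix to get

$$H = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$

As expected, we have

$$G H^T = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$$

EXAMPLE 6.47
We have seen in a previous example that

$$G = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

is a generator matrix of the linear code

$$\{ 00000, 00001, 00110, 00111, 11000, 11001, 11110, 11111 \}$$

which can be reduced to the canonical form

$$G_c = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$$

by the following operations:

1. Replace the third row by the sum of the second and third rows.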
2. Replace the second row by the sum of the first and second rows.
3. Interchange the first and fifth columns.
4. Interchange the second and third columns.

The canonical form of the parity check matrix is

$$H_c = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$$

We have

$$G_c H_c^T = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$$

To find the parity check matrix of $G$, we apply the column operations used to reduce $G$ to canonical form to $H_c$ in reverse order. We start by interchanging the second and third columns to get

$$H_1 = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

We interchange the first and fifth columns to get

$$H = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \end{bmatrix}$$

We now check that

$$G H^T = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$$

EXAMPLE 6.48

$$G = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$

is a generator matrix of the linear code $\{ 000, 101, 011, 110 \}$. It is in canonical form; so the parity check matrix is

$$H = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$$
Let us see what happens to elements of $\mathbb{B}^3$ when they are multiplied by $H^T$:

$$\begin{aligned}
000\,H^T &= 0 \\
001\,H^T &= 1 \\
010\,H^T &= 1 \\
011\,H^T &= 0 \\
100\,H^T &= 1 \\
101\,H^T &= 0 \\
110\,H^T &= 0 \\
111\,H^T &= 1
\end{aligned}$$

The product of any of the code words in the code generated by $G$ with $H^T$ is zero, while the remaining products are non-zero.

The example above should not be surprising, for we have the following result.

RESULT 6.1
If $L$ is a $k$-dimensional linear code in $\mathbb{B}^n$, and $G$ is a generator matrix for $L$, every code word in $L$ can be obtained by taking some $b \in \mathbb{B}^k$ and multiplying it by $G$, to get the code word $bG$. If we now multiply this by the transpose of the parity check matrix $H$, we get

$$(bG)H^T = b(GH^T) = b0 = 0 \tag{6.11}$$

The parity check matrix gives us an easy way of determining if a word $w \in \mathbb{B}^n$ belongs to the linear code $L$: we compute $wH^T$. If the result is the zero matrix, $w$ is a code word. If not, then $w$ is not a code word. As we shall see, the parity check matrix enables us to do more than just check if a word is a code word.

EXAMPLE 6.49

$$G = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$

is a generator matrix of the linear code $\{ 0000, 0011, 1100, 1111 \}$. We reduce it to canonical form by interchanging the first and last columns to get

$$G_c = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$

which generates the code $\{ 0000, 1010, 0101, 1111 \}$.
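The canonical form of the parity check matrix is

$$H_c = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$

If we look at the products

$$\begin{array}{llll}
0000\,H_c^T = 00 & 0001\,H_c^T = 01 & 0010\,H_c^T = 10 & 0011\,H_c^T = 11 \\
0100\,H_c^T = 01 & 0101\,H_c^T = 00 & 0110\,H_c^T = 11 & 0111\,H_c^T = 10 \\
1000\,H_c^T = 10 & 1001\,H_c^T = 11 & 1010\,H_c^T = 00 & 1011\,H_c^T = 01 \\
1100\,H_c^T = 11 & 1101\,H_c^T = 10 & 1110\,H_c^T = 01 & 1111\,H_c^T = 00
\end{array}$$

we see that only the code words in the code generated by $G_c$ have products equal to $00$.

If we now interchange the first and last columns of $H_c$ to get the parity check matrix of $G$, we get

$$H = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}$$

and the products are

$$\begin{array}{llll}
0000\,H^T = 00 & 0001\,H^T = 10 & 0010\,H^T = 10 & 0011\,H^T = 00 \\
0100\,H^T = 01 & 0101\,H^T = 11 & 0110\,H^T = 11 &
\end{array}$$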
$$\begin{array}{llll}
0111\,H^T = 01 & 1000\,H^T = 01 & 1001\,H^T = 11 & 1010\,H^T = 11 \\
1011\,H^T = 01 & 1100\,H^T = 00 & 1101\,H^T = 10 & 1110\,H^T = 10 \\
1111\,H^T = 00 & & &
\end{array}$$

and again only the code words in the code generated by $G$ have products equal to $00$.

In this example, we have $G_c = H_c$ and $G = H$. A generator matrix can be the same as its parity check matrix.

(Although we have not introduced any terminology regarding linear transformations, readers who are familiar with this topic will realise that a linear code $L$ is the range of the linear transformation determined by the generator matrix, and the kernel of the linear transformation determined by the transpose of the associated parity check matrix.)

6.7 Encoding and Decoding

The use of a linear code for error correction involves an encoding step to add redundancy and a later decoding step which attempts to correct errors before it removes the redundancy. This is essentially the same process as was described in Chapter 5, but the additional structure of linear codes enables us to find new algorithms for the encoding and decoding processes.

In the simplest case, the encoding step uses a generator matrix of a linear code, while the decoding step uses the corresponding parity check matrix.

The encoding process takes a string, $b$, $k$ bits in length and produces a code word, $w$, $n$ bits in length, $w = bG$. The first $k$ bits of $w$ are the message bits, and the remaining bits are the check bits. The latter represent the redundancy which has been added to the message.

If $G$ is in canonical form, the first $k$ bits of $w$ are the bits of $b$ and the code is said to be in systematic form. In this case, code words can be decoded by simply removing the last $(n - k)$ bits.
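The encoding step, the parity check test and systematic decoding can be put together in a short Python sketch. It uses the canonical generator matrix and parity check matrix of Example 6.46; the function names `encode`, `syndrome`, `is_code_word` and `decode_systematic` are hypothetical, and no error correction is attempted here.

```python
def encode(b, G):
    """Code word w = bG for a k-bit message b, with arithmetic modulo 2."""
    n = len(G[0])
    return [sum(b[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n)]


def syndrome(w, H):
    """The product w H^T modulo 2; all zeros exactly when w is a code word."""
    return [sum(a * h for a, h in zip(w, row)) % 2 for row in H]


def is_code_word(w, H):
    return not any(syndrome(w, H))


def decode_systematic(w, k):
    """For a canonical generator matrix, the message is the first k bits."""
    return w[:k]


# Generator and parity check matrices from Example 6.46.
G = [[1, 0, 0, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 1, 0]]
H = [[0, 1, 1, 1, 0],
     [1, 1, 0, 0, 1]]

message = [1, 0, 1]
w = encode(message, G)
print(w, is_code_word(w, H), decode_systematic(w, len(G)))

corrupted = w[:]
corrupted[4] ^= 1  # flip the last bit to simulate a single error
print(corrupted, syndrome(corrupted, H))
```

A non-zero result from `syndrome` shows that the received word is not a code word, but deciding which code word to restore requires the decoding procedure discussed next.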
Suppose that a code word is corrupted by noise between encoding and decoding. One or more bits will be changed, zeros becoming ones and ones becoming zeros. It is possible that the result will be another code word, in which case it will not be possible to detect that corruption has occurred. Otherwise, the corrupted string will not be a code word, and attempts may be made to restore the uncorrupted code word as part of the decoding process.

The decoding process therefore has two stages. If the received string is not a code word, the uncorrupted code word has to be restored. The code word is then decoded to recover the original string.

If we assume that the corruption occurred during transmission through a binary symmetric channel, and use the maximum likelihood decoding strategy, it follows that a corrupted string should be restored to the code word which is closest to it (in terms of the Hamming distance). (Other assumptions about the characteristics of the noise process may lead to other procedures.)

We suppose that the generator matrix $G$ generates the linear code $L$, and that the code word $w$ is corrupted by a noise vector $e$, and the result is the vector $x = w + e$. If $x$ is not a code word, the de