Entropy and Redundancy in English
After having defined entropy and redundancy, it is useful to consider an example of these
concepts applied to the English language. Shannon, in his paper "Prediction and Entropy of
Printed English," gives two methods of estimating the entropy of English. The redundancy,
or number of constraints imposed on the text of the English language, causes a decrease in its
overall entropy. For example, the rule "i before e except after c" and the fact that a q must
always be followed by a u are dependencies that make the English language more redundant.
Rules of grammar, parts of speech, and the fact that we cannot make up words make English
redundant as well.
Redundancy in the English language is actually beneficial at times, for how else might one
discern what is said in a noisy room? The redundancy allows one to infer what is said when
only part of a message comes across. For example, if one hears "Turn phat mufic down!", one
can make a fairly good guess as to what the speaker meant.
One possible way of calculating the entropy of English uses N-grams. One can statistically
calculate the entropy of the next letter when the previous N - 1 letters are known. As N
increases, the entropy approaches H, or the entropy of English. Following are the calculated
values from Shannon's paper. FN is the entropy associated with the Nth letter when the
previous N - 1 letters are known. The difficulty of calculating the statistics for FN is O(26^N),
because there are that many sequences of N letters. Note that F0 is simply the maximum
entropy for the set of letters, where each has an equal probability.
           F0    F1    F2    F3    Fword
26 letter  4.70  4.14  3.56  3.3   2.62
27 letter  4.76  4.03  3.32  3.1   2.14
The 27-letter sequences include the space as a letter.
Onecanalmostalwaysfillinthespacesfromasequenceofwordswithnospaces. Therefore, spaces
are basically redundant and will cause
lower calculated entropies when taken into account. Only in the case where no statistics are
taken into account, F0, is the entropy higher when the space is added. This simply adds
another possible symbol, which means more uncertainty.
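To make the N-gram method concrete, here is a minimal Python sketch (not from Shannon's
paper) of how FN can be estimated from a corpus: it tallies how often each letter follows
each (N - 1)-letter context and computes the weighted conditional entropy. The corpus file
name is a hypothetical placeholder.

```python
import math
from collections import Counter

def conditional_entropy(text, n):
    """Estimate FN: the entropy of the Nth letter given the previous N - 1.

    text should already be reduced to the alphabet of interest
    (e.g. lowercase a-z, optionally including the space).
    """
    context_counts = Counter()   # counts of each (N - 1)-letter context
    ngram_counts = Counter()     # counts of each full N-letter sequence
    for i in range(len(text) - n + 1):
        ngram = text[i:i + n]
        context_counts[ngram[:-1]] += 1
        ngram_counts[ngram] += 1

    total = sum(ngram_counts.values())
    entropy = 0.0
    for ngram, count in ngram_counts.items():
        p_joint = count / total                       # P(context and letter)
        p_cond = count / context_counts[ngram[:-1]]   # P(letter | context)
        entropy -= p_joint * math.log2(p_cond)
    return entropy

# Hypothetical usage on a cleaned corpus file:
# corpus = open("corpus.txt").read()
# print(conditional_entropy(corpus, 2))   # an estimate of F2
```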
Another strategy Shannon suggests is to calculate the entropy associated with every word in
the English language, and take a weighted average. Shannon uses an approximating function
to estimate the entropy for over 8,000 words. The calculated value he gets for the entropy per
word is 11.82 bits, and since the average word has 4.5 letters, the entropy is 2.62 bits per
letter. This is given as Fword in the above table.
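A minimal sketch of the weighted-average idea, using Shannon's published figures to check
the arithmetic (the word_entropy function shows the general form; any real frequency table
must have probabilities summing to 1):

```python
import math

def word_entropy(word_probs):
    """Entropy per word from a {word: probability} table."""
    return -sum(p * math.log2(p) for p in word_probs.values() if p > 0)

# Checking Shannon's arithmetic with his published figures:
h_word = 11.82            # bits per word
letters_per_word = 4.5    # average word length in letters
print(h_word / letters_per_word)   # about 2.6 bits/letter, the Fword entry
```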
We have already discussed how to calculate redundancy from entropy. The maximum
entropy, Hmax, occurs when all the symbols are equally likely, and is equal to
-log2(1/26) = 4.7 bits/letter. Therefore, using the formula 1 - H/Hmax, we can estimate the
redundancy of English. Shannon initially estimated this value at 50%, meaning that about
half the letters in the English language are redundant!
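As a quick worked check of the formula (the entropy plugged in here is an illustrative
mid-range figure, not a single number from the paper):

```python
import math

h_max = math.log2(26)       # 4.70 bits/letter: every letter equally likely
h_est = 2.35                # illustrative per-letter entropy estimate
redundancy = 1 - h_est / h_max
print(f"{redundancy:.0%}")  # 50%, in line with Shannon's early estimate
```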
Later in the same article, Shannon describes a rather ingenious way of calculating the
entropy of the English language. It incorporates many more features of the language, such as
line of thought and context, that statistical methods cannot explicitly account for. Crossword
puzzles and games such as Hangman and Wheel of Fortune exploit redundancy by assuming
that humans can guess letters in a word or phrase based on their previous knowledge of the
language. Shannon's idea was to exploit this natural measure of redundancy... the
human mind. He asked subjects to guess the letters in a phrase one by one. If the subject
guessed correctly, then he/she moved on to the next letter. If not, the subject was told the
next letter. Of 129 letters in a phrase, 69% were guessed correctly. This suggests
approximately a 69% redundancy in the English language. Say we reproduce only those
letters which were guessed wrong, the remaining 31%. Then, by giving an identical copy of
the subject the same guessing task, we could get back the original sentence: the subject can
obviously get 69% of the symbols right, and he/she has the other 31%, so he/she can
reproduce the original text with only about 31% of the information. Actually, the subject
would need slightly more than 31% of the information, since he/she would also need to know
where the letters are that he/she is going to guess wrong on, so the redundancy might be a
little less than 69%. This scheme is sound in theory, but not in practice: sampling error in
the choice of sentences and subjects can significantly distort the results. Nevertheless, the
experiment is a practical illustration of redundancy, and sheds light on how to code English.
There is no need to create a statistical grammar of the English language in order to calculate
its entropy; humans have the grammar naturally built in.
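A minimal sketch of why an identical predictor at both ends allows the text to be rebuilt
from the misses alone; predict here is a hypothetical stand-in for the human subject (or any
deterministic model):

```python
def encode(text, predict):
    """Keep the true letter only where the predictor guesses wrong;
    record a None marker wherever it guesses right."""
    stream = []
    for i, letter in enumerate(text):
        guess = predict(text[:i])      # predict from the prefix seen so far
        stream.append(None if guess == letter else letter)
    return stream

def decode(stream, predict):
    """Replay the identical predictor to rebuild the original text."""
    text = ""
    for correction in stream:
        guess = predict(text)
        text += guess if correction is None else correction
    return text
```

The None markers are exactly the extra positional information mentioned above: the receiver
must know where the misses occur, not just what the missed letters were, which is why
slightly more than 31% of the information is needed.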
Statistically calculating the redundancy of the English language has numerous practical
applications. Text stored as ASCII occupies 8 binary digits per character. However, this is
highly inefficient, considering that some calculations place the entropy of English at around
1 bit per letter. This means that theoretically, there is a compression scheme that is 8 times as good as
ASCII. Although modern computers have enough memory that this inefficiency is not
crucial, Huffman compression and Lempel-Ziv compression algorithms save significant
space when text is stored.
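To see this exploited in practice, here is a quick sketch using Python's standard zlib module
(an implementation of DEFLATE, which combines Lempel-Ziv matching with Huffman coding):

```python
import zlib

text = ("The redundancy allows one to infer what is said when "
        "only part of a message comes across. " * 20).encode("ascii")

compressed = zlib.compress(text, 9)
print(f"{len(text)} bytes -> {len(compressed)} bytes")
# English text stored at 8 bits per character typically shrinks to a
# fraction of its size, because its letters are far from equally likely.
```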
Normally, when people say the English language is redundant, they are speaking of the
numerous synonyms which clutter our dictionaries. Redundancy in the sense of information
theory is a measure of how efficiently the letters/symbols are used in the language. The fact
that English is a redundant language is not necessarily bad. That our language is spoken as
well as written brings in many issues besides efficiency. We want to be understood in a noisy
room, we want the sound of words to correspond to their meaning, and we want to be able to
pronounce words with ease. Information rates are only one small part of the analysis of the
English language.
A very interesting illustration of how well a language can be described statistically is given
by the nth-order approximations of the English language, reproduced here from Shannon's
paper "The Mathematical Theory of Communication." Would a monkey who knew the n-
gram frequencies of the letters in English where n is large be able to produce credible English
text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one
door and a human is behind the other, could a third-party observer tell which was the
monkey? This question rings of the Turing test for artificial intelligence, to which there is no
easy answer. (A sketch of such an n-gram generator follows the sample approximations
below.)
Zero-order approximation
XFOML RXKHRJFFJUJ ALPWXFWJXYJ
FFJEYVJCQSGHYD
QPAAMKBZAACIBZLKJQD
First-order approximation
OCRO HLO RGWR NMIELWIS EU LL
NBNESEBYA TH EEI ALHENHTTPA
OOBTTVA NAH BRL
Second-order approximation
ON IE ANTSOUTINYS ARE T INCTORE
ST BE S DEAMY ACHIN D ILONASIVE
TUCOOWE AT TEASONARE FUSO
TIZIN ANDY TOBE SEACE CTISBE
Third-order approximation
IN NO IST LAT WHEY CRATICT
FROURE BIRS GROCID PONDENOME
OF DEMONSTURES OF THE REPTAGIN
IS REGOACTIONA OF CRE
First-order word approximation
REPRESENTING AND SPEEDILY IS AN
GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME
THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE MESSAGE HAD
BE THESE
Second-order word approximation
THE HEAD AND IN FRONTAL ATTACK
ON AN ENGLISH WRITER THAT THE
CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR
THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR
AN UNEXPECTED
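Finally, a minimal sketch of the N-gram monkey itself, assuming a training corpus string is
available (corpus.txt is a hypothetical placeholder): it samples each next letter from the
empirical distribution conditioned on the previous n - 1 letters, just as the approximations
above do.

```python
import random
from collections import Counter, defaultdict

def build_model(text, n):
    """Map each (n - 1)-letter context to a Counter of next letters."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, n, length, seed=""):
    out = seed
    while len(out) < length:
        context = out[-(n - 1):] if n > 1 else ""
        counter = model.get(context)
        if not counter:                  # unseen context: restart at random
            context = random.choice(list(model))
            out += context               # splice in the new context
            continue
        letters, weights = zip(*counter.items())
        out += random.choices(letters, weights=weights)[0]
    return out

# Hypothetical usage:
# corpus = open("corpus.txt").read().lower()
# model = build_model(corpus, 3)
# print(generate(model, 3, 200))        # third-order-style pseudo-English
```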