Entropy and Redundancy in English
After having defined entropy and redundancy, it is useful to consider an example of these
concepts applied to the English language. Shannon, in his paper "Prediction and Entropy of
Printed English," gives two methods of estimating the entropy of English. The redundancy,
or number of constraints imposed on the text of the English language, causes a decrease in its
overall entropy. For example, the rule "i before e except after c" and the fact that a q must
always be followed by a u are dependencies that make the English language more redundant.
Rules of grammar, parts of speech, and the fact that we cannot make up words make English
redundant as well.
Redundancy in the English language is actually beneficial at times, for how else might one
discern what is said in a noisy room? The redundancy allows one to infer what is said when
only part of a message comes across. For example, if one hears "Turn phat mufic down!", one
can make a fairly good guess as to what the speaker meant.
One possible way of calculating the entropy of English uses N-grams. One can statistically
calculate the entropy of the next letter when the previous N - 1 letters are known. As N
increases, this conditional entropy approaches H, the entropy of English. Following are the calculated
values from Shannon's paper. FN is the entropy associated with the Nth letter when the
previous N - 1 letters are known. The difficulty of calculating the statistics for FN is O(26^N),
because there are that many sequences of N letters. Note that F0 is simply the maximum
entropy for the set of letters, where each has an equal probability.
              F0      F1      F2      F3      Fword
26 letters    4.70    4.14    3.56    3.3     2.62
27 letters    4.76    4.03    3.32    3.1     2.14
(all entropies in bits per letter)
The 27-letter alphabet includes the space as a letter. Onecanalmostalwaysfillinthespacesfrom
asequenceofwordswithnospaces. Therefore, spaces are basically redundant and will cause
lower calculated entropies when taken into account. Only in the case where no statistics are
taken into account, F0, is the entropy higher when the space is added. This simply adds
another possible symbol, which means more uncertainty.
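To make the N-gram calculation concrete, the following Python sketch estimates FN from a corpus by counting N-letter sequences and their (N - 1)-letter prefixes and computing the conditional entropy of the last letter given its prefix. This is only an illustration, not Shannon's own procedure; the function name and the tiny repeated sample text are assumptions, and a trustworthy estimate requires a very large corpus.

import math
from collections import Counter

def fn_entropy(text, n):
    # Estimate FN: entropy of the next letter given the previous n - 1 letters.
    # Sketch only -- the number of possible n-letter sequences grows as 26^n
    # (27^n with the space), so real estimates need a very large corpus.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    h = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total                 # P(prefix and letter)
        p_cond = count / prefixes[gram[:-1]]    # P(letter | prefix)
        h -= p_joint * math.log2(p_cond)
    return h

sample = "the quick brown fox jumps over the lazy dog " * 100
print(fn_entropy(sample, 1))   # roughly F1, the single-letter entropy
print(fn_entropy(sample, 2))   # roughly F2, given one preceding letter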
Another strategy Shannon suggests is to calculate the entropy associated with every word in
the English language, and take a weighted average. Shannon uses an approximating function
to estimate the entropy for over 8,000 words. The calculated value he gets for the entropy per
word is 11.82 bits, and since the average word has 4.5 letters, the entropy is 2.62 bits per
letter. This is given as Fword in the above table.
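The per-letter figure follows from a simple division, sketched here with the numbers quoted above.

bits_per_word = 11.82       # Shannon's estimated entropy per word
letters_per_word = 4.5      # average word length used in the paper
print(bits_per_word / letters_per_word)   # about 2.63 bits per letter (the Fword entry, quoted as 2.62)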
We have already discussed how to calculate redundancy from entropy. The maximum
entropy occurs when all the symbols are equally likely, and is equal to -log2(1/26) = log2(26)
≈ 4.7 bits/letter. Therefore, using the formula redundancy = 1 - H/Hmax, we can estimate the
redundancy of English. Shannon initially estimated this value at 50%, meaning that about half
the letters in English text are redundant!
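As a quick numerical sketch (the value of H below is a hypothetical estimate, not a figure from Shannon's paper):

import math

h_max = math.log2(26)        # maximum entropy, about 4.70 bits/letter
h_est = 2.3                  # hypothetical estimate of H, the true entropy of English
print(1 - h_est / h_max)     # redundancy of about 0.51, near Shannon's initial 50% estimate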
Discussed later in the same article is a rather ingenious way of calculating the entropy of the
English language. It incorporates many more features of the language, such as train of
thought and context, which statistical methods cannot explicitly account for. Crossword puzzles
and games such as Hangman and Wheel of Fortune exploit redundancy by assuming that
humans can guess letters in a word or phrase based on their previous knowledge of the
language. Shannon's ingenious idea was to exploit this natural measure of redundancy: the
human mind. He asked subjects to guess the letters in a phrase one by one. If a subject
guessed correctly, he or she moved on to the next letter; if not, the subject was told the
correct letter. Of 129 letters in a phrase, 69% were guessed correctly, suggesting a redundancy
of roughly 69% in the English language. Suppose we record only the letters that were guessed
wrong, about 31% of them. Then, by "cloning" the subject who guessed from scratch, we could
recover the original sentence: the clone can get the 69% of the symbols right on its own, and it
has the other 31% recorded, so the original text can be reproduced from only about 31% of the
information. Actually, slightly more than 31% is needed, since one must also know where the
wrongly guessed letters occur, so the redundancy is somewhat less than 69%. Theoretically this is a good
example, but practically, it is not. Sampling error in terms of sentences and subjects can
cause significant distortion of the results. Nevertheless, the experiment is a practical
illustration of redundancy, and it sheds light on how to encode English. There
is no need to create a statistical grammar of the English language in order to calculate its
entropy. Humans have the grammar naturally built in.
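The coding idea behind the experiment can be sketched in a few lines of Python. The predictor below stands in for the "cloned" subject; it is a deliberately naive assumption (always guess "e"), whereas Shannon's figures come from human subjects making informed guesses.

def encode(text, predict):
    # Record, at each position, either a hit (None) or the actual letter.
    out = []
    for i, ch in enumerate(text):
        guess = predict(text[:i])
        out.append(None if guess == ch else ch)
    return out

def decode(coded, predict):
    # Replay the same predictor to rebuild the original text.
    text = ""
    for item in coded:
        text += predict(text) if item is None else item
    return text

naive = lambda prefix: "e"          # hypothetical stand-in for the human subject
message = "the sense of the sentence"
assert decode(encode(message, naive), naive) == message

Only the misses need to be stored, plus their positions, which is why a good predictor allows the text to be reconstructed from a fraction of the original information.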
Statistically calculating the redundancy of the English language has numerous practical
applications. ASCII text is normally stored with 8 binary digits (one byte) per character. However,
this is highly inefficient, considering that some calculations place the entropy of English at around
1 bit per letter. This means that, theoretically, there is a compression scheme roughly 8 times as
compact as one byte per character. Although modern computers have enough memory that this inefficiency is not
crucial, Huffman compression and Lempel-Ziv compression algorithms save significant
space when text is stored.
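To see this redundancy exploited in practice, one can compress a chunk of English with Python's zlib module, whose DEFLATE format combines Lempel-Ziv matching with Huffman coding. The sample string and the compression level are arbitrary choices for illustration.

import zlib

text = ("Redundancy in the English language is actually beneficial at times, "
        "for how else might one discern what is said in a noisy room? ") * 20
raw = text.encode("ascii")             # stored at 8 bits per character
packed = zlib.compress(raw, 9)         # LZ77 matching plus Huffman codes
print(len(raw), "->", len(packed), "bytes")
print(8 * len(packed) / len(raw), "bits per character after compression")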
Normally, when people say the English language is redundant, they are speaking of the
numerous synonyms which clutter our dictionaries. Redundancy in the sense of information
theory is a measure of how efficiently the letters/symbols are used in the language. The fact
that English is a redundant language is not necessarily bad. That our language is spoken as
well as written brings in many issues besides efficiency. We want to be understood in a noisy
room, we want the sound of words to correspond to their meaning, and we want to be able to
pronounce words with ease. Information rates are only one small part of the analysis of the
English language.
A very interesting illustration of how well a language can be described statistically is given
by the nth-order approximations of the English language, reproduced here from Shannon's
paper "The Mathematical Theory of Communication." Would a monkey who knew the n-
gram frequencies of the letters in English, for large n, be able to produce credible English
text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one
door and a human is behind the other, could a third-party observer tell which was the
monkey? This question rings of the Turing test for artificial intelligence, to which there is no
easy answer.
Zero-order approximation
XFOML RXKHRJFFJUJ ALPWXFWJXYJ
FFJEYVJCQSGHYD
QPAAMKBZAACIBZLKJQD
First-order approximation
OCRO HLO RGWR NMIELWIS EU LL
NBNESEBYA TH EEI ALHENHTTPA
OOBTTVA NAH BRL
Second-order approximation
ON IE ANTSOUTINYS ARE T INCTORE
ST BE S DEAMY ACHIN D ILONASIVE
TUCOOWE AT TEASONARE FUSO
TIZIN ANDY TOBE SEACE CTISBE
Third-order approximation
IN NO IST LAT WHEY CRATICT
FROURE BIRS GROCID PONDENOME
OF DEMONSTURES OF THE REPTAGIN
IS REGOACTIONA OF CRE
First-order word approximation
REPRESENTING AND SPEEDILY IS AN
GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME
THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE MESSAGE HAD
BE THESE
Second-order word approximation
THE HEAD AND IN FRONTAL ATTACK
ON AN ENGLISH WRITER THAT THE
CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR
THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR
AN UNEXPECTED
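The samples above can be imitated with a small Markov-chain "monkey" that only knows n-gram frequencies. The sketch below is a toy version trained on an inline scrap of text; the corpus string is merely an assumption for illustration, and producing output as plausible as Shannon's third-order sample would require a far larger corpus.

import random
from collections import defaultdict

def ngram_monkey(corpus, n, length=120):
    # Map each n-letter context to the letters that follow it, then generate
    # text by repeatedly sampling a next letter for the current context.
    table = defaultdict(list)
    for i in range(len(corpus) - n):
        table[corpus[i:i + n]].append(corpus[i + n])
    out = corpus[:n]
    while len(out) < length:
        followers = table[out[-n:]]
        out += random.choice(followers) if followers else " "
    return out

corpus = ("the head and in frontal attack on an english writer that the "
          "character of this point is therefore another method for the letters ")
print(ngram_monkey(corpus.lower(), 3))

With a large corpus and n around 3 or 4, the output begins to contain pronounceable pseudo-words much like Shannon's examples.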
