This study, initially born out of curiosity, examined a special class of English words using methodology from similar studies. It aimed to determine whether these words follow patterns similar or dissimilar to those of other English words, and to provide a comprehensive methodology for carrying out this and similar research using Microsoft® Excel. English words whose weighted sum is 100 percent were studied to investigate how the letters of the English alphabet are used in constructing such words and to see whether these words differ in construction from free text and dictionary words. For simplicity, these words are colloquially referred to as centograqs in this study.
Centograqs
1. CENTOGRAQ
Empirical Study of Distribution of Letters in English Words: A
Special Case of Centograqs
Omisile Kehinde Olugbenga
2. Background
• One of my colleagues posted on WhatsApp that if we assign
1 to A, 2 to B, … 26 to Z,
we could weight English words with percent as unit of
measurement.
• He went on to create summation for words like
‘HARDWORK’ = 98 percent
‘KNOWLEDGE’ = 96 percent
‘LOVE’ = 54 percent
‘LUCK’ = 47 percent and ended up posting that
‘ATTITUDE’ sums to 100 percent!
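The arithmetic in the post can be checked with a short Python sketch. The function name `letter_sum` is an assumption made for illustration; it is not from the original post.

```python
def letter_sum(word: str) -> int:
    """Sum the alphabet positions of the letters (A=1 ... Z=26), ignoring case.

    Characters outside a-z are skipped.
    """
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")


# The words quoted on this slide
for w in ["HARDWORK", "KNOWLEDGE", "LOVE", "LUCK", "ATTITUDE"]:
    print(w, letter_sum(w))
```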
5. Objectives
1. The number of English words that actually weighted-sum up to 100
percent and their weighted-average length.
2. Average length of Centograqs.
3. The distribution of initial letters (first letters) and terminal letters (last
letters).
4. The distribution of letters used to form long words – words longer than
nine letters.
5. The composite distribution of letters in terms of the most common letter
used in constructing words.
6. Distribution of words with unique letters (isograms), that is words whose
letters are all unique.
7. Distribution of repeated letters within words and twin bigrams/digraphs.
6. What is a centograq?
•There was a need to have a single name to
call words whose weighted-sum is 100% so I
came up with the colloquial word – Centograq.
•Hence, a centograq is a word whose letter-by-
letter weighted-sum is 100 percent, where each
letter's weight is its position in the English
alphabet (A = 1, B = 2, … Z = 26).
7. How many words are centograqs?
•On the website www.gist.github.com, one of
the participants, Peter Magenheimer, had
written Python code to extract text strings
from the English dictionary, evaluate their sums,
and return only those that summed to 100
percent. Using this Python code, he
was able to extract 2302 centograqs.
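The extraction step described here can be sketched as follows. This is not Peter Magenheimer's code, which is not reproduced in these slides. The names `letter_sum` and `find_centograqs` and the small sample list are assumptions for illustration only.

```python
from typing import Iterable, List


def letter_sum(word: str) -> int:
    """Sum the alphabet positions of the letters (A=1 ... Z=26), ignoring case."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")


def find_centograqs(words: Iterable[str], target: int = 100) -> List[str]:
    """Keep only purely alphabetic ASCII words whose letter sum equals `target`."""
    return [
        w for w in words
        if w.isascii() and w.isalpha() and letter_sum(w) == target
    ]


# Hypothetical sample; a real run would read a full dictionary word list instead.
sample = ["attitude", "love", "tousy", "lubricant", "luck"]
print(find_centograqs(sample))
```

Applied to a full dictionary file, the same filter would yield the list of centograqs. The count obtained depends on which dictionary is used.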
24. Most common letters
Showing the top seven and the bottom seven letters.
Letter Frequency of Occurrence
E 2225
A 1852
I 1826
R 1462
O 1363
N 1292
L 1267
K 196
W 185
V 168
Z 68
X 55
J 32
Q 28
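A letter count of the kind charted on this slide can be sketched with `collections.Counter`. The helper name `letter_frequencies` is hypothetical, and the three sample words are taken from elsewhere in this deck, so the counts below are not the study's figures.

```python
from collections import Counter
from typing import Iterable


def letter_frequencies(words: Iterable[str]) -> Counter:
    """Count how often each letter a-z occurs across all words (case-insensitive)."""
    counts: Counter = Counter()
    for w in words:
        counts.update(ch for ch in w.lower() if "a" <= ch <= "z")
    return counts


# Small illustrative sample, not the full list of 2302 centograqs
sample = ["attitude", "lubricant", "syndicate"]
freq = letter_frequencies(sample)
print(freq.most_common(3))
```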
25. Comparison of most common letters in English
Words between centograqs, free texts, and email
text
Centograqs E A I R O N L T S C U D M P H
Text E T A O I N S R H L D C U M F
Email E T O A I N S R H L D U C M P
26. Top 7 Initial and Terminal Letters
Initial
Letter Number of Words
s 250
p 224
c 183
a 179
u 166
m 132
t 126
Terminal
Letter Number of Words
e 489
y 229
n 187
d 185
r 181
s 168
l 156
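Initial and terminal letter counts like these can be sketched in a few lines of Python. The function name `initial_terminal_counts` is an assumption, and the sample uses only a handful of centograqs listed in this deck.

```python
from collections import Counter
from typing import Iterable, Tuple


def initial_terminal_counts(words: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (first-letter counts, last-letter counts), ignoring case and empty strings."""
    cleaned = [w.lower() for w in words if w]
    initial = Counter(w[0] for w in cleaned)
    terminal = Counter(w[-1] for w in cleaned)
    return initial, terminal


# Illustrative sample only
sample = ["syndicate", "staminode", "subrepand", "trembling", "tousy"]
initial, terminal = initial_terminal_counts(sample)
print(initial.most_common())
print(terminal.most_common())
```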
27. Number of Words with Unique Letters
Rows: Length of Word. Columns: Number of Duplications (0 to 6) and Total. A dash marks an empty cell.
Length 0 1 2 3 4 5 6 Total
5 2 3 1 - - - - 6
6 36 36 12 - - - - 84
7 121 154 36 5 - - - 316
8 140 298 162 36 2 - - 638
9 105 228 197 74 9 - - 613
10 14 103 159 88 23 2 - 389
11 - 21 56 73 32 7 - 189
12 - - 9 16 17 9 1 52
13 - - - 1 4 3 5 13
14 - - - - - 2 - 2
Total 418 843 632 293 87 23 6 2302
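A cross-tabulation of this kind can be sketched in Python. It assumes that the "number of duplications" of a word is its length minus its number of distinct letters, which is how the notes describe the [Length] and [unique] columns. The names `duplications` and `crosstab` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def duplications(word: str) -> int:
    """Length of the word minus its number of distinct letters (assumed definition)."""
    w = word.lower()
    return len(w) - len(set(w))


def crosstab(words: Iterable[str]) -> Counter:
    """Count words by (length, number of duplications)."""
    return Counter((len(w), duplications(w)) for w in words)


# The six five-letter centograqs named in the notes
sample = ["tousy", "struv", "totty", "buzzy", "nutty", "pussy"]
table = crosstab(sample)
for (length, dups), n in sorted(table.items()):
    print(length, dups, n)
```

Under that assumed definition, the six five-letter words split 2, 3, and 1 across zero, one, and two duplications, the same three counts that appear in the length-5 row.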
28. Example of Centograqs that are isograms
baculiform fatherling lageniform neoblastic
pelargonic plumbagine impugnable
anhydremic unmiracled coislander
Colubrinae conjugated athrogenic
Purbeckian syndicate dysphemia
cystidean cymophane exophasic
sulfamine subrepand muraenoid lubricant
Juncoides guildsman bufotalin steckling
staminode asyndetic trembling tranceful
29. Comparison between All Centograqs
and Unique Centograqs
Position Class 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Initial Unique centograqs S P U T C A D O M B G E V N F
Initial All centograqs S P C A U M T D B R G E H F O
Terminal Unique centograqs Y E N D R T L M S C G P A H W
Terminal All centograqs E Y N D R S L T A C M G H P O
Common Letter Unique centograqs I R E A O T N S L U Y P C H M
Common Letter All centograqs E A I R O N L T S C U D M P H
30. Twin Bigrams
•We also studied the pattern of twin bigrams, that is,
the same letter repeated one after the other within
a word.
•Of the 2302 centograqs, 484 (21%) have at least
one twin bigram, and 28 of these have exactly two
twin bigrams; no word has more than two.
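Detecting twin bigrams can be sketched as a scan over adjacent letter pairs. The function name `twin_bigrams` is hypothetical, and reporting the 1-based position of the first letter of each pair is an assumption; the slides do not state which position convention was used.

```python
from typing import List, Tuple


def twin_bigrams(word: str) -> List[Tuple[int, str]]:
    """Return (position, bigram) for every pair of identical adjacent letters.

    Position is the 1-based index of the first letter of the pair (assumed convention).
    A run of three identical letters would be reported as two overlapping pairs.
    """
    w = word.lower()
    return [(i + 1, w[i] * 2) for i in range(len(w) - 1) if w[i] == w[i + 1]]


for w in ["attitude", "buzzy", "totty", "tousy"]:
    print(w, twin_bigrams(w))
```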
32. Position of Twin Bigrams in Centograqs
Position Number of Words
2 72
3 120
4 68
5 62
6 63
7 45
8 36
9 15
10 2
11 1
33. Comparison between the Twin Bigrams in
Centograqs and Google English Corpus
Centograqs: LL SS OO EE RR TT MM NN PP FF GG CC DD BB ZZ AA KK
Centograqs, letters with no twin bigram: H I J Q U V W X Y
Google*: LL SS EE OO TT FF PP RR MM CC NN DD GG II BB AA ZZ XX UU HH
Google*, letters with no twin bigram: Q J W K Y V
*Rick Wicklin, 2014
36. Vkrhu jhf zry lirasmt sabyk cidwdrkeuk
(‘Thank you for reading about centograqs’).
Editor's Notes
The list of letters alone is ‘alphaLIST’ (A1:A26).
The formula looks up each letter in the range named ‘spread’ ($F$2:$T$2) in the list of English letters named ‘alphaLIST’ ($A$1:$A$26) and assigns the corresponding number based on the letter's relative position in that list.
All the extracted letters were converted to lowercase, hence the function LOWER().
The remaining part of the formula extracts one letter from the given word contained in the column [Word] based on the length of the word (@[Length]) and the position of the letter in the word.
The formula was written as an array formula, hence the curly brackets before and after the formula, so that the table references would remain absolute and not relative except ‘E2’.
CHAR() and CODE() were used in the formula to enhance the output of IFERROR() which would return ‘0’ should the output of CHAR() be an error.
This formula is more compact than one using the IF() function, which would have required writing the expression being evaluated twice.
The formula matched each letter in the columns [1st] to [14th] from the list ‘alpha’ and sums them using SUMPRODUCT().
Note that this is written as an array formula.
I deducted 14 from the sum to account for the ‘0’ which had been added to the top of the list ‘alpha’ to avoid blanks.
Hence the difference between the [length] of the word and the number of [unique] letters gives insight into the words whose values are [completely unique] and which ones have at least one repeated letter.
The shortest words are tousy, struv, totty, buzzy, nutty, and pussy.
The longest words are Batrachoididae and Biddulphiaceae, both of which are scientific terms.
The average word length for centograqs is considerably higher than the average length of English words in a text, which has been widely documented at approximately five letters (Bochkarev, Shevlyakova, & Solovyev, 2015).
But for distinct words, which resembles the use of a dictionary, the average word length is approximately eight letters.
One out of every five centograqs ends with the letter ‘e’.
One out of every ten centograqs ends with the letter ‘s’.
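The average length of centograqs can be worked out from the row totals of the length table on slide 27. The sketch below is a hedged illustration; the variable names are assumptions, and the counts are copied from the Total column of that table.

```python
# Number of centograqs by word length, taken from the Total column on slide 27
counts_by_length = {
    5: 6, 6: 84, 7: 316, 8: 638, 9: 613,
    10: 389, 11: 189, 12: 52, 13: 13, 14: 2,
}

total_words = sum(counts_by_length.values())
total_letters = sum(length * n for length, n in counts_by_length.items())

# Weighted average: each length is weighted by the number of words of that length
average_length = total_letters / total_words
print(total_words, round(average_length, 2))
```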
Findings in this section are similar to those of Rick Wicklin (Wicklin, 2014), who further analyzed “Peter Norvig’s analysis of 774-billion word corpus of documents that were digitized at Google” (Norvig, 2012). Wicklin found that ‘LL’ is the most common twin bigram, followed by ‘SS’, ‘EE’, ‘OO’, and ‘TT’. He also found 20 letters that have twin bigrams, including ‘XX’, ‘UU’, and ‘HH’, which most likely come from non-English words or proper nouns from other languages that appear in the corpus. However, unlike in the corpus, ‘KK’ appears in the English dictionary and among centograqs.
A table was constructed showing the relative positions of the letters of centograqs.
A second table was generated that ranked the letters by position.
The new table is then used in scrambling words.