Identifying Cyclic Words with the Help of Google

University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Identifying Cyclic Words with the Help of
Google Books N-grams Corpus
Costin-Gabriel CHIRU and Vladimir-Nicolae Dinu
costin.chiru@cs.pub.ro, vladimir.dinu92@yahoo.com

Introduction (1)
• Purpose:
– A system for identifying English words whose use is cyclic or
varies regularly over time.
• Cross-platform system for indexing and analyzing the graphs of
word usage over time.
• Usefulness:
– Depends on the meaning of the cyclic word:
• Generic events → events that are about to happen: rebellion,
revolution, war
• Economic field → public interest in different products, which influences
the dynamics of the sales and stocks of companies
• Resources:
– Google Books N-grams Corpus – for indexing words’ usage
in time – analysis done at unigram level
– WordNet lexical database – for filtering the words of
interest
27.06.2017 ICIW 2017 2

Introduction (2)
• Analysis:
– Based on the graphs generated from the number of
uses of each word in the publications from 1800 until
2008 (from Google Corpus unigrams)
• Algorithms:
– Autocorrelation
– Dynamic Time Warping (DTW)
• Results:
– The words that were identified as being cyclic
– The years where these words were cyclic and
– The length of the cycle in years (before repeating the
cycle)

Similar Approaches (1)
• Petersen et al. (2012) analyzed the evolution of 10^7 words from
English, Spanish and Hebrew to highlight the co-evolution of
language and culture:
– The correlations between words are influenced by co-evolutionary
social, technological and political factors
– The birth of different words is most commonly related to new social
and technological trends
– A new word requires some time to get into regular use (30 – 50 years)
• Roth (2014) examined the role of different systems (economy,
science, art, etc.) in 3 different societies (English, French and
German) with the purpose of ranking them:
– Assumed that the public opinion related to each system may be
expressed as the number of times words from that system were used
– For English: beginning: law → religion → arts / end: policy → law →
health → education
– For French: beginning: art → religion → justice → policy / end: policy
→ art → economy
– For German: beginning: law → science → art → religion / end: policy
→ legal system → art → science

Similar Approaches (2)
• Acerbi et al. (2013) analysed the trend in the use of emotional words in
20th-century books, using six feelings (anger, disgust, fear,
happiness, sadness and surprise):
– a descending trend in the use of emotional words over the last century, except
in American books during its second half
– also investigated the difference between words and phrases related to
the individual and to the collective → the individual has seen a great
increase in American books
– there can be distinguished periods of happiness and sadness
correlated with important historical events
• Islam, Milios and Kešelj (2012) compared six corpus-based methods
for estimating word relatedness using Google Books Ngram Corpus:
– The most accurate is the Relatedness based on Tri-grams, which led to
a Pearson correlation coefficient with gold standard of 0.916
• Wijaya and Yeniterzi (2011) analysed the changes that occurred in
the meaning of a word over time:
– Using k-means and topic modelling to cluster the words co-occurring
with a given word over time.

Implementation Details - Resources
• Google Books N-grams Corpus:
– Contains the words written in over 5 million books
published between 1500 and 2008 (over 500 billion words
in 7 languages)
– We only used the unigrams dataset (2 types of files)
• One with information about the number of uses of different words
• Another with the total number of words indexed for each year → used for
normalization
– Because the corpus has been criticized (OCR errors and poor
coverage), we restricted the analysis to the period 1800-2008
• WordNet:
– Contains only English words grouped based on their:
• part-of-speech (POS) → different structures for nouns, verbs,
adjectives and adverbs
• semantics → words with similar meaning are clustered in synsets
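The normalization mentioned above (raw yearly counts divided by the yearly corpus totals) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalize_counts` and the dictionary-based inputs are assumptions, and the 1800-2008 window is taken from the slides.

```python
def normalize_counts(word_counts, yearly_totals, first_year=1800, last_year=2008):
    """Turn raw yearly counts into relative frequencies.

    word_counts   -- hypothetical mapping {year: occurrences of the word}
    yearly_totals -- hypothetical mapping {year: total words indexed that year}
    Years outside [first_year, last_year] or without a total are skipped.
    """
    frequencies = {}
    for year in range(first_year, last_year + 1):
        total = yearly_totals.get(year, 0)
        if total > 0:
            # A year with no recorded use of the word gets frequency 0.0.
            frequencies[year] = word_counts.get(year, 0) / total
    return frequencies


# Usage with made-up numbers (not corpus data):
example = normalize_counts({1800: 5, 1801: 10}, {1800: 100, 1801: 200})
```

Dividing by the yearly total keeps the growth of the corpus itself from being mistaken for a change in a word's popularity.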

Implementation Details - Modularization
• Three-tier organization: the data access module on
the 1st tier, the services module on the 2nd and
the presentation tier on the 3rd
• Data Access Module
– a table “total” with data referring to the entire unigram
corpus → used for normalizing the data
– 26 tables (one for each letter) containing the words
starting with that letter (for efficiency)
– 26 tables for the results obtained by our application
• Services Module
– Services for accessing the database (CRUD operations)
– Implementations of the two algorithms used for
identifying the words’ cyclicity

Modularization - the Presentation Tier
• Contains three modules: the indexer, the analyzer and the graphical
user interface (GUI)
• The indexer - indexes the data from the n-gram files into the 26+1
tables
– Filters the data with the help of WordNet + heuristics: word length > 2;
characters limited to letters, quotes and dashes; no more than 3
identical consecutive characters; information about the word for > 10
years; the dataset should have information for > 95% of the years
• The analyzer - responsible for identifying the words’ cyclicity
– Normalizes the data using the “total” table (counts → frequencies)
– Runs the 2 algorithms, varying the running parameters: the length of the
interval in which we search for the cycle (shifting the start date in 10-year
steps) and the length of the cycle (from 1/6 to 1/3 of the total interval)
– Outputs the best results obtained by each algorithm + parameters
• The GUI - web interface
– Shows general information about the dimension of the indexed data
– Shows the best results obtained using the two algorithms
– Offers the possibility of choosing a word and viewing its usage over time
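The indexer's filtering heuristics can be sketched as a single predicate. This is a hedged illustration only: the name `passes_heuristics` is hypothetical, the WordNet lookup is replaced by a plain set (`known_words`), "quotes" is read as apostrophes, and the 95% rule is interpreted as coverage between the first and last year in which the word appears. The slides do not specify any of these details.

```python
import re

# Assumption: "letters, quotes, dashes" means ASCII letters, apostrophes, hyphens.
ALLOWED_CHARS = re.compile(r"[A-Za-z'-]+")
# Four identical characters in a row, i.e. more than 3 consecutive repeats.
FOUR_IN_A_ROW = re.compile(r"(.)\1{3}")


def passes_heuristics(word, years_with_data, known_words):
    """Return True if the word survives the filters listed on the slide."""
    if word.lower() not in known_words:       # stand-in for the WordNet check
        return False
    if len(word) <= 2:                        # word length must be > 2
        return False
    if not ALLOWED_CHARS.fullmatch(word):     # only allowed characters
        return False
    if FOUR_IN_A_ROW.search(word):            # no more than 3 identical in a row
        return False
    years = set(years_with_data)
    if len(years) <= 10:                      # data for more than 10 years
        return False
    span = max(years) - min(years) + 1
    return len(years) / span > 0.95           # assumed reading of the 95% rule


# Usage with a toy vocabulary (not WordNet):
vocabulary = {"history", "it"}
keep = passes_heuristics("history", range(1800, 2009), vocabulary)
```

Running the cheap string checks before the coverage check avoids building the year set for tokens that are obviously noise.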

Algorithms - Autocorrelation
• Analysis method for time series used for determining the
correlation of a time series with its own values, shifted in
time, backward and/or forward
• It is assumed that the measurements were performed at
equidistant moments in time
• This method may be used for identifying the covariance or
correlation between time-series, but its most practical use is
in forecasting
• For measurements Y = (y1, y2, ..., yN) at time moments X = (x1, x2, ..., xN),
the autocorrelation with delay k is computed as follows, where \bar{y} is the
mean of the measurements:

$$r_k = \frac{\sum_{i=1}^{N-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{N-k} (y_i - \bar{y})^2}$$
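The autocorrelation with delay k can be sketched directly from the formula on this slide. The sketch assumes both sums run over the first N-k terms, as the slide shows, and that the mean is taken over the whole series; the function name `autocorrelation` and the zero returned for a perfectly constant series are illustrative choices, not part of the original system.

```python
import math


def autocorrelation(y, k):
    """Autocorrelation of the series y at delay k (0 <= k < len(y))."""
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("delay k must satisfy 0 <= k < len(y)")
    mean = sum(y) / n
    numerator = sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k))
    denominator = sum((y[i] - mean) ** 2 for i in range(n - k))
    # Assumption: a series with no variation is reported as uncorrelated.
    return numerator / denominator if denominator else 0.0


# Usage: a synthetic series with a 20-step period (not corpus data).
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
at_full_period = autocorrelation(series, 20)   # close to +1
at_half_period = autocorrelation(series, 10)   # close to -1
```

Scanning k over the candidate cycle lengths and keeping the delay with the highest value is one simple way to estimate the length of a cycle.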

Algorithms - DTW
• Used for detecting time series with similar shapes by allowing an
elastic transformation between two time series
• Dynamic programming algorithm – complexity O (M*N)
• Restriction: the series to be sampled at equidistant points in time.
• We used DTW to compare the time series obtained from the words’
usage in time with some pre-defined cyclic ones: sinusoidal or only
the absolute values of sinusoidal with various periods (to allow the
detection of cycles of various dimensions)
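The comparison against pre-defined cyclic shapes can be sketched with the classic O(M*N) dynamic-programming formulation of DTW. The absolute-difference local cost, the names `dtw_distance` and `sinusoid_template`, and the unit-amplitude templates are assumptions made for illustration; a word's frequency series would need to be rescaled to a comparable range before such a comparison.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences, O(len(a) * len(b))."""
    m, n = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] also matches b[j-2]
                                     cost[i][j - 1],      # b[j-1] also matches a[i-2]
                                     cost[i - 1][j - 1])  # both indices advance
    return cost[m][n]


def sinusoid_template(length, period, rectified=False):
    """Pre-defined cyclic shape: a sinusoid, or its absolute value if rectified."""
    values = [math.sin(2 * math.pi * t / period) for t in range(length)]
    return [abs(v) for v in values] if rectified else values


# Usage: score a synthetic series against templates with several periods.
observed = sinusoid_template(60, 20)
scores = {period: dtw_distance(observed, sinusoid_template(60, period))
          for period in (10, 20, 30)}
```

The template with the lowest distance is the best candidate for the cycle length, which is why templates with several periods are tried.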

Results
• Most of the detected cyclic words are from
the pharmaceutical domain
[Figure: usage graphs for the example words anaprox, augmentin, didanosine and propylthiouracil]

Results
Letter | Number of analyzed words | Detected cyclic words
A | 2994 | abacus, abdominoplasty, agave, aircrewman, allogeneic, alphanumerical, alphavirus, anaprox, anatomical, anticipation, ape
B | 2241 | basuco, beatrice, belief, bland, blarney, bobbysoxer, botch, brunt, brussels, buoyancy
C | 4105 | capacitive, catapres, clioquinol, codex, cognac, cognizant, collision, colonization, conceding, counterinsurgency, cowherd, cushion, cyberphobia
D | 2446 | dadaism, dbms, deathbed, decadron, decapitated, defunct, delavirdine, deoxythymidine, desertification, desyrel, didanosine, dislocate, dissect, domesticated, dronabinol
E | 1808 | egotrip, egyptologist, empennage, enalapril, enclosure, enthrall, eumycota, evergreen, excrement, extensively
F | 1652 | fainthearted, festering, fiddler, figment, fleshiness, frisian
G | 1280 | geological, gifted, glassy, gulf
H | 1585 | haldol, helmsman, herbaceous, hermes, hillbilly, history, honeycomb, horticultural, hydroxyzine, hyena, hypervolaemia
I | 1875 | illegible, immersion, inderal, induct, informercial, interlace, intralinguistic
J | 371 | joust
L | 1506 | lac, legitimately, leo, lifelessness, limnodromus, lindsay, linkup, llama, lopressor, lyophilise
M | 2298 | manifestation, marge, mentha, metricate, microelectronic, microphone, molehill, monosyllabic, montgomery, multiethnic, munro
N | 876 | nadolol, naltrexone, ncdc, nelson, neosporin, nonproliferation, nureyev, nydrazid
O | 952 | ominous, omnipresent, onerous, opponent, optative, oswald, outlandish, outpouring, overcome, overflight
P | 3474 | paedophile, paintbox, paramount, paternally, pectoralis, personify, pharmacogenetics, pimpled, plantago, plentitude, plop, polygonal, popular, postindustrial, privatize, propylthiouracil, psittacosaur, pyramid
R | 1918 | rarely, recoverable, reluctantly, remodel, renegade, resident, resoluteness, retrovirus, reverberating, ritalin, robertson, rocephin, roleplaying, root
S | 4338 | saquinavir, saturate, schtik, scott, scrutinise, seats, sectarianism, sedum, serratus, shoed, soliton, speaker, sporanox, sunchoke, supporter, swiss, switchblade
T | 2127 | teleconference, temp, theologian, tonocard, topicalization, toradol, tracing, transparence, tranylcypromine
U | 1434 | underboss, unfettered, unfinished, unimpeded
V | 780 | vacate, velban, videodisc
W | 875 | waking, willis, workings
Z | 101 | zinacef, zovirax

Discussions
• Both algorithms may be used for detecting if a graph
varies regularly
• Autocorrelation offers the best results when the graph
has a shape that repeats at certain intervals, but
without having a specific form
• The DTW algorithm compares the graph with a predefined
shape → it detects that the time series varies regularly
only if the two shapes are alike
• Autocorrelation – more generic results, while DTW –
more specific ones
• Autocorrelation:
– Advantage: the curves may have any repeatable shape
– Disadvantage: the graph may also autocorrelate when it is
almost constant in time
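The disadvantage noted above can be illustrated with a synthetic, almost constant series that still yields a high autocorrelation. This is only a sketch: the series is made up, the formula follows the autocorrelation slide (both sums over the first N-k terms), and `relative_spread` is a hypothetical helper, not something the slides describe, that shows how small the variation actually is.

```python
def autocorrelation(y, k):
    """Autocorrelation at delay k, with both sums over the first N-k terms."""
    n = len(y)
    mean = sum(y) / n
    numerator = sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k))
    denominator = sum((y[i] - mean) ** 2 for i in range(n - k))
    return numerator / denominator if denominator else 0.0


def relative_spread(y):
    """Hypothetical flatness measure: value range divided by the mean."""
    mean = sum(y) / len(y)
    return (max(y) - min(y)) / mean


# A series that drifts by about 2% over 200 points: nearly flat, no cycle.
nearly_flat = [100 + 0.01 * t for t in range(200)]
r5 = autocorrelation(nearly_flat, 5)
spread = relative_spread(nearly_flat)
```

Here `r5` is well above 0.9 even though the series has no cycle at all, so a flatness check of this kind is one possible way to screen out such false alarms before trusting the autocorrelation score.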

Conclusions
• System capable of:
– indexing the unigram dataset provided by Google
– analyzing the graph of each indexed word
– establishing whether the graphic representation is cyclic
• Analysis was done using 2 algorithms: autocorrelation and DTW
• Most of the identified cyclic words are from the pharmaceutical domain
– Interpretation: the interest in pharmaceutical products tends to be
sinusoidal, with ups and downs
• Both algorithms have advantages and disadvantages –
autocorrelation is more general, while DTW is more specific
• Autocorrelation may end up giving false alarms in the case of
constant use of a word
• DTW will fail to identify cyclic words if their shape is not
sinusoidal
• Future work: clustering the cyclic words (events, products,
personalities, locations, sentiments, actions) → custom conclusions
may be drawn

Questions
Thank you very much!
This work has been funded by University Politehnica of Bucharest, through the
“Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–
EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza
seriilor de timp (Applying machine learning techniques in time series analysis),
Contract number 09/26.09.2016.
