University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Identifying Cyclic Words with the Help of
Google Books N-grams Corpus
Costin-Gabriel CHIRU and Vladimir-Nicolae Dinu
costin.chiru@cs.pub.ro, vladimir.dinu92@yahoo.com
Introduction (1)
• Purpose:
– A system for identifying English words whose use is cyclic or
varies regularly in time.
• Cross-platform system for indexing and analyzing the graphs of
word usage over time.
• Usefulness:
– Depends on the meaning of the cyclic word:
• Generic events → events that are about to happen: rebellion,
revolution, war
• Economic field → public interest in different products, which influences
the dynamics of the sales and stocks of companies
• Resources:
– Google Books N-grams Corpus – for indexing words’ usage
in time – analysis done at unigram level
– WordNet lexical database – for filtering the words of
interest
27.06.2017 ICIW 2017 2
Introduction (2)
• Analysis:
– Based on the graphs generated from the number of
uses of each word in the publications from 1800 until
2008 (from Google Corpus unigrams)
• Algorithms:
– Autocorrelation
– Dynamic Time Warping (DTW)
• Results:
– The words that were identified as being cyclic
– The years where these words were cyclic and
– The length of the cycle in years (before repeating the
cycle)
Similar Approaches (1)
• Petersen et al. (2012) analyzed the evolution of 10^7 words from
English, Spanish and Hebrew to highlight the co-evolution of
language and culture:
– The correlations between words are influenced by co-evolutionary
social, technological and political factors
– The birth of different words is most commonly related to new social
and technological trends
– A new word requires some time to get into regular use (30 – 50 years)
• Roth (2014) examined the role of different systems (economy,
science, art, etc.) in 3 different societies (English, French and
German) with the purpose of ranking them:
– Assumed that the public opinion related to each system may be
expressed as the number of times words from that system were used
– For English: beginning: law → religion → arts / end: policy → law →
health → education
– For French: beginning: art → religion → justice → policy / end: policy
→ art → economy
– For German: beginning: law → science → art → religion / end: policy
→ legal system → art → science
Similar Approaches (2)
• Acerbi et al. (2013) analysed the trend in using emotional words in
20th-century books using six feelings (anger, disgust, fear,
happiness, sadness and surprise)
– a descending trend in the use of emotional words over the last century,
except in American books during its second half
– also investigated the difference between words and phrases related to
the individual and to the collective → those related to the individual have
seen a great increase in American books
– there can be distinguished periods of happiness and sadness
correlated with important historical events
• Islam, Milios and Kešelj (2012) compared six corpus-based methods
for estimating word relatedness using Google Books Ngram Corpus:
– The most accurate is the Relatedness based on Tri-grams, which led to
a Pearson correlation coefficient with gold standard of 0.916
• Wijaya and Yeniterzi (2011) analysed the changes that occurred in
the meaning of a word over time:
– Using k-means and topic modelling to cluster the words co-occurring
with a given word over time.
Implementation Details - Resources
• Google Books N-grams Corpus:
– Contains the words written in over 5 million books
published between 1500 and 2008 (over 500 billion words
in 7 languages)
– We only used the unigrams dataset (2 types of files)
• One with information about the number of uses of different words
• Another with the total number of words indexed for each year → used for
normalization
– Because the corpus has been criticized (OCR errors and poor
coverage), we restricted the analysis to the period 1800 -
2008
• WordNet:
– Contains only English words grouped based on their:
• part-of-speech (POS) → different structures for nouns, verbs,
adjectives and adverbs
• semantics → words with similar meaning are clustered in synsets
Implementation Details - Modularization
• Three-tier organization: the data access module on
the 1st tier, the services modules on the 2nd and
the presentation tier on the 3rd
• Data Access Module
– a table “total” - data referring to the entire unigram
corpus → used for normalizing the data
– 26 tables (one for each letter) containing the words
starting with that letter (for efficiency)
– 26 tables for the results obtained by our application
• Services Module
– Services for accessing the database (CRUD operations)
– Implementations of the two algorithms used for
identifying the words’ cyclicity
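The normalization that the “total” table supports can be sketched as below. This is a minimal illustration, not the authors' code: the function name `normalize`, the dictionary layout and the sample numbers are assumptions made for the example.

```python
def normalize(counts_by_year, totals_by_year):
    """Turn raw yearly counts of one word into relative frequencies.

    counts_by_year: {year: number of uses of the word}      (hypothetical layout)
    totals_by_year: {year: total number of words indexed}   (the "total" table)
    """
    freqs = {}
    for year, count in counts_by_year.items():
        total = totals_by_year.get(year, 0)
        if total > 0:  # skip years for which nothing was indexed
            freqs[year] = count / total
    return freqs


# Made-up numbers, for illustration only.
totals = {1800: 1_000_000, 1801: 2_000_000}
counts = {1800: 50, 1801: 50}
frequencies = normalize(counts, totals)
```

The same raw count (50) gives a lower frequency in 1801 than in 1800 because more words were indexed in 1801, which is why the analysis works on frequencies rather than counts.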
Modularization - the Presentation Tier
• Contains three modules: the indexer, the analyzer and the graphical
user interface (GUI)
• The indexer - indexes the data from the n-grams files in the 26+1
tables
– Filters the data with the help of WordNet + heuristics: the word’s length > 2;
characters = letters, quotes, dashes; the word cannot contain > 3
identical consecutive characters; information about the word is available for > 10
years; the dataset should have information for > 95% of the years
• The analyzer - responsible for identifying the words’ cyclicity
– Normalizes the data using the “total” table (counts → frequencies)
– Runs the 2 algorithms, varying the running parameters: the length of the
interval in which we search for the cycle (moving the starting date in steps of 10
years) and the length of the cycle (from 1/6 to 1/3 of the total interval)
– Outputs the best results obtained by each algorithm + parameters
• The GUI - web interface
– Shows general information about the dimension of the indexed data
– Shows the best results obtained using the two algorithms
– Offers the possibility of choosing a word and viewing its usage over time
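The indexer's filtering heuristics could look roughly like the sketch below. It is only an illustration under stated assumptions: `keep_word` is a hypothetical name, `vocabulary` is a plain set standing in for a WordNet lookup, "quotes" is read as the apostrophe, and both year conditions from the slide are applied to the 1800-2008 window.

```python
import re

ALLOWED = re.compile(r"[A-Za-z'\-]+")   # letters, quotes (apostrophes), dashes
RUN_OF_FOUR = re.compile(r"(.)\1{3}")   # more than 3 identical consecutive characters


def keep_word(word, years_with_data, vocabulary,
              first_year=1800, last_year=2008):
    """Return True if the word passes the WordNet check and the heuristics."""
    if word.lower() not in vocabulary:        # stand-in for the WordNet filter
        return False
    if len(word) <= 2:                        # word's length > 2
        return False
    if not ALLOWED.fullmatch(word):           # only allowed characters
        return False
    if RUN_OF_FOUR.search(word):              # e.g. "aaaah" is rejected
        return False
    covered = {y for y in years_with_data if first_year <= y <= last_year}
    if len(covered) <= 10:                    # information for > 10 years
        return False
    span = last_year - first_year + 1
    return len(covered) / span > 0.95         # information for > 95% of the years


# Toy vocabulary, for illustration only.
vocab = {"history", "aaaah", "ab"}
all_years = range(1800, 2009)
print(keep_word("history", all_years, vocab))
print(keep_word("aaaah", all_years, vocab))
```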
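One possible reading of the analyzer's parameter sweep is sketched below. The slides do not give the exact procedure, so several things are assumptions: every interval is taken to end in 2008, the start year moves in 10-year steps, cycle lengths are whole years between 1/6 and 1/3 of the interval, and `min_span` is a hypothetical cut-off for intervals too short to analyse.

```python
def candidate_settings(first_year=1800, last_year=2008, step=10, min_span=60):
    """Yield (start_year, span_in_years, cycle_length) triples to try."""
    for start in range(first_year, last_year + 1, step):
        span = last_year - start + 1
        if span < min_span:               # assumed cut-off, not from the slides
            break
        shortest = max(2, span // 6)      # cycle of at least 1/6 of the interval
        longest = span // 3               # cycle of at most 1/3 of the interval
        for cycle in range(shortest, longest + 1):
            yield start, span, cycle


settings = list(candidate_settings())
print(len(settings), "parameter combinations; first:", settings[0])
```

Each triple would then be passed to both algorithms, and the best-scoring combination kept per word.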
Algorithms - Autocorrelation
• Analysis method for time series used for determining the
correlation of a time series with its own values, shifted in
time, backward and/or forward
• It is assumed that the measurements were performed at
equidistant moments in time
• This method may be used for identifying the covariance or
correlation between time-series, but its most practical use is
in forecasting
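• For measurements Y = (y1, y2, ..., yN) at time
moments X = (x1, x2, ..., xN), the autocorrelation with
delay k is computed as follows, where \bar{y} is the mean of the measurements:
$$r_k = \frac{\sum_{i=1}^{N-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{N-k} (y_i - \bar{y})^2}$$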
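A small sketch of the autocorrelation at delay k, written to follow the formula on this slide (both sums run over the first N − k terms; many textbooks sum the denominator over all N terms instead). The names `autocorrelation` and `best_cycle`, and the synthetic series used to try them, are assumptions for illustration, not the authors' implementation.

```python
import math


def autocorrelation(y, k):
    """Autocorrelation r_k of the sequence y at delay k (0 <= k < len(y))."""
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("delay k must satisfy 0 <= k < len(y)")
    mean = sum(y) / n
    num = sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k))
    den = sum((y[i] - mean) ** 2 for i in range(n - k))
    if den == 0:          # constant segment: the ratio is undefined
        return 0.0
    return num / den


def best_cycle(y, min_lag, max_lag):
    """Delay in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorrelation(y, k))


# Synthetic series with a 12-step cycle, for illustration only.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(best_cycle(series, 8, 16))
```

Searching delays between 1/6 and 1/3 of the series length (8 to 16 for 48 points) mirrors the cycle-length range mentioned for the analyzer.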
Algorithms - DTW
• Used for detecting time series with similar shapes by allowing an
elastic transformation between two time series
• Dynamic programming algorithm – complexity O(M*N)
• Restriction: the series must be sampled at equidistant points in time.
• We used DTW to compare the time series obtained from the words’
usage in time with some pre-defined cyclic ones: sinusoids, or
the absolute values of sinusoids, with various periods (to allow the
detection of cycles of various lengths)
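The comparison against pre-defined cyclic shapes can be sketched with a textbook DTW, as below. This is an illustrative sketch under assumptions: the absolute difference is used as the local cost, no warping window is applied, and the helper names (`dtw_distance`, `sinusoid`, `best_template_period`) are hypothetical rather than taken from the authors' system.

```python
import math


def dtw_distance(a, b):
    """Classic O(M*N) dynamic-programming DTW with absolute-difference cost."""
    m, n = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[m][n]


def sinusoid(length, period, rectified=False):
    """Template series: a sinusoid, or its absolute value if rectified."""
    values = [math.sin(2 * math.pi * t / period) for t in range(length)]
    return [abs(v) for v in values] if rectified else values


def best_template_period(series, periods):
    """Period whose sinusoidal template is closest to the series under DTW."""
    return min(periods, key=lambda p: dtw_distance(series, sinusoid(len(series), p)))


# Synthetic input with a 20-step cycle, for illustration only.
series = sinusoid(60, 20)
print(best_template_period(series, [10, 20, 30]))
```

Real word-frequency series would presumably need rescaling to the template's amplitude range before the distances are comparable; that step is not shown here.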
Results
• Most of the detected cyclic words are from
the pharmaceutical domain
[Figure: usage-over-time graphs for anaprox, augmentin, didanosine and propylthiouracil]
Results
Letter | Number of analyzed words | Detected cyclic words
A | 2994 | abacus, abdominoplasty, agave, aircrewman, allogeneic, alphanumerical, alphavirus, anaprox, anatomical, anticipation, ape
B | 2241 | basuco, beatrice, belief, bland, blarney, bobbysoxer, botch, brunt, brussels, buoyancy
C | 4105 | capacitive, catapres, clioquinol, codex, cognac, cognizant, collision, colonization, conceding, counterinsurgency, cowherd, cushion, cyberphobia
D | 2446 | dadaism, dbms, deathbed, decadron, decapitated, defunct, delavirdine, deoxythymidine, desertification, desyrel, didanosine, dislocate, dissect, domesticated, dronabinol
E | 1808 | egotrip, egyptologist, empennage, enalapril, enclosure, enthrall, eumycota, evergreen, excrement, extensively
F | 1652 | fainthearted, festering, fiddler, figment, fleshiness, frisian
G | 1280 | geological, gifted, glassy, gulf
H | 1585 | haldol, helmsman, herbaceous, hermes, hillbilly, history, honeycomb, horticultural, hydroxyzine, hyena, hypervolaemia
I | 1875 | illegible, immersion, inderal, induct, informercial, interlace, intralinguistic
J | 371 | joust
L | 1506 | lac, legitimately, leo, lifelessness, limnodromus, lindsay, linkup, llama, lopressor, lyophilise
M | 2298 | manifestation, marge, mentha, metricate, microelectronic, microphone, molehill, monosyllabic, montgomery, multiethnic, munro
N | 876 | nadolol, naltrexone, ncdc, nelson, neosporin, nonproliferation, nureyev, nydrazid
O | 952 | ominous, omnipresent, onerous, opponent, optative, oswald, outlandish, outpouring, overcome, overflight
P | 3474 | paedophile, paintbox, paramount, paternally, pectoralis, personify, pharmacogenetics, pimpled, plantago, plentitude, plop, polygonal, popular, postindustrial, privatize, propylthiouracil, psittacosaur, pyramid
R | 1918 | rarely, recoverable, reluctantly, remodel, renegade, resident, resoluteness, retrovirus, reverberating, ritalin, robertson, rocephin, roleplaying, root
S | 4338 | saquinavir, saturate, schtik, scott, scrutinise, seats, sectarianism, sedum, serratus, shoed, soliton, speaker, sporanox, sunchoke, supporter, swiss, switchblade
T | 2127 | teleconference, temp, theologian, tonocard, topicalization, toradol, tracing, transparence, tranylcypromine
U | 1434 | underboss, unfettered, unfinished, unimpeded
V | 780 | vacate, velban, videodisc
W | 875 | waking, willis, workings
Z | 101 | zinacef, zovirax
Discussions
• Both algorithms may be used for detecting if a graph
varies regularly
• Autocorrelation offers the best results when the graph
has a shape that repeats at certain intervals, but
without having a specific form
• The DTW algorithm compares the graph with a predefined
shape → it detects that the time series varies regularly
only if the two shapes are alike
• Autocorrelation – more generic results, while DTW –
more specific ones
• Autocorrelation:
– Advantage: the curves may have any repeatable shape
– Disadvantage: the graph may also autocorrelate when it is
almost constant in time
Conclusions
• System capable of:
– indexing the unigram dataset provided by Google
– analyzing the graph of each indexed word
– establishing whether the graphic representation is cyclic
• Analysis was done using 2 algorithms: autocorrelation and DTW
• Most identified cyclic words are from the pharmaceutical domain
– Interpretation: the interest in pharmaceutical products tends to be
sinusoidal, with ups and downs
• Both algorithms have advantages and disadvantages –
autocorrelation is more general, while DTW is more specific
• Autocorrelation may end up giving false alarms in the case of
constant use of a word
• DTW will fail to identify cyclic words if their graph has a shape other
than a sinusoid
• Future work: clustering the cyclic words (events, products,
personalities, locations, sentiments, actions) → custom conclusions
may be drawn
Questions
Thank you very much!
This work has been funded by University Politehnica of Bucharest, through the
“Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–
EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza
seriilor de timp (Applying machine learning techniques in time series analysis),
Contract number 09/26.09.2016.

 

Recently uploaded (20)

Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

Identifying cyclic words with the help of google

  • 1. Author, Scientific advisor. Universitatea Politehnica București, Faculty of Automatic Control and Computers, Department of Computers. Identifying Cyclic Words with the Help of Google Books N-grams Corpus. Costin-Gabriel CHIRU and Vladimir-Nicolae Dinu (costin.chiru@cs.pub.ro, vladimir.dinu92@yahoo.com)
  • 2. Introduction (1)
      • Purpose:
          – A system for identifying English words whose use is cyclic or varies regularly in time.
          – A cross-platform system for indexing and analyzing the graphs of word usage over time.
      • Usefulness:
          – Depends on the meaning of the cyclic word:
              • Generic events → events that are about to happen: rebellion, revolution, war
              • Economic field → public interest in different products, which influences the dynamics of the sales and stocks of companies
      • Resources:
          – Google Books N-grams Corpus: for indexing the words' usage in time (analysis done at unigram level)
          – WordNet lexical database: for filtering the words of interest
      (ICIW 2017, 27.06.2017)
  • 3. Introduction (2)
      • Analysis:
          – Based on the graphs generated from the number of uses of each word in the publications from 1800 until 2008 (from the Google Corpus unigrams)
      • Algorithms:
          – Autocorrelation
          – Dynamic Time Warping (DTW)
      • Results:
          – The words that were identified as being cyclic
          – The years in which these words were cyclic
          – The length of the cycle in years (before the cycle repeats)
  • 4. Similar Approaches (1)
      • Petersen et al. (2012) analyzed the evolution of 10^7 words from English, Spanish and Hebrew to highlight the co-evolution of language and culture:
          – The correlations between words are influenced by co-evolutionary social, technological and political factors
          – The birth of different words is most commonly related to new social and technological trends
          – A new word requires some time to get into regular use (30 to 50 years)
      • Roth (2014) examined the role of different systems (economy, science, art, etc.) in 3 different societies (English, French and German) with the purpose of ranking them:
          – Assumed that the public opinion related to each system may be expressed as the number of times words from that system were used
          – For English: beginning: law → religion → arts / end: policy → law → health → education
          – For French: beginning: art → religion → justice → policy / end: policy → art → economy
          – For German: beginning: law → science → art → religion / end: policy → legal system → art → science
  • 5. Similar Approaches (2)
      • Acerbi et al. (2013) analysed the trend in using emotional words in 20th-century books, using six feelings (anger, disgust, fear, happiness, sadness and surprise):
          – descending trend in using emotional words in the last century, except for the last half, when, in American books
          – also investigated the difference between words and phrases related to the individual and to the collective → the individual has seen a great increase in American books
          – periods of happiness and sadness can be distinguished, correlated with important historical events
      • Islam, Milios and Kešelj (2012) compared six corpus-based methods for estimating word relatedness using the Google Books Ngram Corpus:
          – The most accurate is the Relatedness based on Tri-grams, which led to a Pearson correlation coefficient of 0.916 with the gold standard
      • Wijaya and Yeniterzi (2011) analysed the changes that occurred in the meaning of a word over time:
          – Used k-means and topic modelling to cluster the words co-occurring with a given word over time.
  • 6. Implementation Details - Resources
      • Google Books N-grams Corpus:
          – Contains the words written in over 5 million books published between 1500 and 2008 (over 500 billion words in 7 languages)
          – We only used the unigrams dataset (2 types of files):
              • One with information about the number of uses of different words
              • Another with the total number of words indexed for each year → used for normalization
          – Due to criticism of the corpus (OCR errors and poor coverage), we restricted the analysis to the period 1800 - 2008
      • WordNet:
          – Contains only English words, grouped based on their:
              • part-of-speech (POS) → different structures for nouns, verbs, adjectives and adverbs
              • semantics → words with similar meaning are clustered in synsets
  • 7. Implementation Details - Modularization
      • Three-tier organization: the data access module on the 1st tier, the services modules on the 2nd and the presentation tier on the 3rd
      • Data Access Module
          – a table “total”: data referring to the entire unigram corpus → used for normalizing the data
          – 26 tables (one for each letter) containing the words starting with that letter (for efficiency)
          – 26 tables for the results obtained by our application
      • Services Module
          – Services for accessing the database (CRUD operations)
          – Implementations of the two algorithms used for identifying the words’ cyclicity
  • 8. Modularization - the Presentation Tier
      • Contains three modules: the indexer, the analyzer and the graphical user interface (GUI)
      • The indexer: indexes the data from the n-grams files in the 26+1 tables
          – Filters the data with the help of WordNet + heuristics: word’s length > 2; characters = letters, quotes, dashes; the word cannot contain > 3 identical consecutive characters; information about the word’s use for > 10 years; the dataset should have information for > 95% of the years
      • The analyzer: responsible for identifying the words’ cyclicity
          – Normalizes the data using the “total” table (counts → frequencies)
          – Runs the 2 algorithms while varying the running parameters: the length of the interval in which the cycle is searched (moving the start date in 10-year steps) and the length of the cycle (from 1/6 to 1/3 of the total interval)
          – Outputs the best results obtained by each algorithm + parameters
      • The GUI: web interface
          – Shows general information about the size of the indexed data
          – Shows the best results obtained using the two algorithms
          – Offers the possibility of choosing a word and viewing its usage in time
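The indexer's filtering heuristics and the analyzer's normalization step listed on slide 8 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `passes_filters` and `normalize`, the exact allowed character set, and the reading of "> 95% of the years" as coverage of the 1800 - 2008 period are all hypothetical, and the WordNet lookup is left out.

```python
import re

FIRST_YEAR, LAST_YEAR = 1800, 2008  # analysis period stated in the slides

# Letters, quotes and dashes only (the exact character set is an assumption).
ALLOWED = re.compile(r"^[A-Za-z'\-]+$")
# Four or more identical consecutive characters, i.e. "> 3".
REPEATED = re.compile(r"(.)\1{3,}")


def passes_filters(word, counts_by_year, min_coverage=0.95):
    """Return True if the word survives the indexing heuristics (WordNet check omitted)."""
    if len(word) <= 2:
        return False
    if not ALLOWED.match(word):
        return False
    if REPEATED.search(word):
        return False
    years = [y for y in counts_by_year if FIRST_YEAR <= y <= LAST_YEAR]
    if len(years) <= 10:
        return False
    # Assumed interpretation: share of the years in the analysis period with data.
    span = LAST_YEAR - FIRST_YEAR + 1
    return len(years) / span > min_coverage


def normalize(counts_by_year, totals_by_year):
    """Turn raw yearly counts into relative frequencies using the yearly totals."""
    return {
        year: count / totals_by_year[year]
        for year, count in counts_by_year.items()
        if totals_by_year.get(year)  # skip years with a missing or zero total
    }
```

Dividing by the yearly total matters because the number of indexed books grows over time, so raw counts would rise for almost every word regardless of how its relative usage changed.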
  • 9. Algorithms - Autocorrelation
      • Analysis method for time series, used for determining the correlation of a time series with its own values shifted in time, backward and/or forward
      • It is assumed that the measurements were performed at equidistant moments in time
      • This method may be used for identifying the covariance or correlation between time series, but its most practical use is in forecasting
      • For measurements Y = (y_1, y_2, ..., y_N) at time moments X = (x_1, x_2, ..., x_N), the autocorrelation with the delay k is computed as:
          r_k = \frac{\sum_{i=1}^{N-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{N-k} (y_i - \bar{y})^2}
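As a minimal sketch of the lag-k autocorrelation described on slide 9, the function below follows the sum limits as they appear in the slide's formula (both sums run to N-k; many references sum the denominator over all N points instead). The function name `autocorrelation` and the choice to return 0.0 for a series with zero variance are assumptions made for this illustration.

```python
import math


def autocorrelation(y, k):
    """Lag-k autocorrelation of an equally spaced series y."""
    n = len(y)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(y)")
    mean = sum(y) / n
    num = sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k))
    den = sum((y[i] - mean) ** 2 for i in range(n - k))
    if den == 0:
        return 0.0  # assumption: treat a flat series as uncorrelated
    return num / den


# A sinusoid with a period of 20 samples correlates strongly with itself
# at lag 20 and anti-correlates at lag 10 (half a period).
series = [math.sin(2 * math.pi * i / 20) for i in range(200)]
strong = autocorrelation(series, 20)
opposite = autocorrelation(series, 10)
```

A cycle length can then be read off as a lag at which the coefficient peaks; how the peak is thresholded is a design choice the slides do not specify.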
  • 10. Algorithms - DTW
      • Used for detecting time series with similar shapes by allowing an elastic transformation between two time series
      • Dynamic programming algorithm with complexity O(M*N)
      • Restriction: the series must be sampled at equidistant points in time
      • We used DTW to compare the time series obtained from the words’ usage in time with some pre-defined cyclic ones: sinusoids, or only the absolute values of sinusoids, with various periods (to allow the detection of cycles of various lengths)
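The comparison against pre-defined cyclic shapes on slide 10 can be illustrated with the textbook dynamic-programming form of DTW. This is a sketch, not the authors' code: the names `dtw_distance` and `sinusoid_template`, the absolute-difference local cost, and the absence of any warping-window constraint are assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic O(M*N) dynamic time warping distance between two series."""
    m, n = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[m][n]


def sinusoid_template(length, period, rectified=False):
    """Pre-defined cyclic shape: a sinusoid, or its absolute value."""
    values = [math.sin(2 * math.pi * t / period) for t in range(length)]
    return [abs(v) for v in values] if rectified else values
```

To use this on word frequencies, the series would presumably have to be rescaled to the template's range first (for example to [-1, 1] or [0, 1]); the slides do not say how this was done, so any scaling step is an additional assumption.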
  • 11. Results
      • Most of the detected cyclic words are from the pharmaceutical domain
      [Figure: usage graphs for anaprox, augmentin, didanosine, propylthiouracil]
  • 12. Results
      Letter | Number of analyzed words | Detected cyclic words
      A | 2994 | abacus, abdominoplasty, agave, aircrewman, allogeneic, alphanumerical, alphavirus, anaprox, anatomical, anticipation, ape
      B | 2241 | basuco, beatrice, belief, bland, blarney, bobbysoxer, botch, brunt, brussels, buoyancy
      C | 4105 | capacitive, catapres, clioquinol, codex, cognac, cognizant, collision, colonization, conceding, counterinsurgency, cowherd, cushion, cyberphobia
      D | 2446 | dadaism, dbms, deathbed, decadron, decapitated, defunct, delavirdine, deoxythymidine, desertification, desyrel, didanosine, dislocate, dissect, domesticated, dronabinol
      E | 1808 | egotrip, egyptologist, empennage, enalapril, enclosure, enthrall, eumycota, evergreen, excrement, extensively
      F | 1652 | fainthearted, festering, fiddler, figment, fleshiness, frisian
      G | 1280 | geological, gifted, glassy, gulf
      H | 1585 | haldol, helmsman, herbaceous, hermes, hillbilly, history, honeycomb, horticultural, hydroxyzine, hyena, hypervolaemia
      I | 1875 | illegible, immersion, inderal, induct, informercial, interlace, intralinguistic
      J | 371 | joust
      L | 1506 | lac, legitimately, leo, lifelessness, limnodromus, lindsay, linkup, llama, lopressor, lyophilise
      M | 2298 | manifestation, marge, mentha, metricate, microelectronic, microphone, molehill, monosyllabic, montgomery, multiethnic, munro
      N | 876 | nadolol, naltrexone, ncdc, nelson, neosporin, nonproliferation, nureyev, nydrazid
      O | 952 | ominous, omnipresent, onerous, opponent, optative, oswald, outlandish, outpouring, overcome, overflight
      P | 3474 | paedophile, paintbox, paramount, paternally, pectoralis, personify, pharmacogenetics, pimpled, plantago, plentitude, plop, polygonal, popular, postindustrial, privatize, propylthiouracil, psittacosaur, pyramid
      R | 1918 | rarely, recoverable, reluctantly, remodel, renegade, resident, resoluteness, retrovirus, reverberating, ritalin, robertson, rocephin, roleplaying, root
      S | 4338 | saquinavir, saturate, schtik, scott, scrutinise, seats, sectarianism, sedum, serratus, shoed, soliton, speaker, sporanox, sunchoke, supporter, swiss, switchblade
      T | 2127 | teleconference, temp, theologian, tonocard, topicalization, toradol, tracing, transparence, tranylcypromine
      U | 1434 | underboss, unfettered, unfinished, unimpeded
      V | 780 | vacate, velban, videodisc
      W | 875 | waking, willis, workings
      Z | 101 | zinacef, zovirax
  • 13. Discussions
      • Both algorithms may be used for detecting whether a graph varies regularly
      • Autocorrelation offers the best results when the graph has a shape that repeats at certain intervals, but without having a specific form
      • The DTW algorithm compares the graph with a predefined shape → it detects that the time series varies regularly only if the two shapes are alike
      • Autocorrelation gives more generic results, while DTW gives more specific ones
      • Autocorrelation:
          – Advantage: the curves may have any repeatable shape
          – Disadvantage: the graph may also autocorrelate when it is almost constant in time
  • 14. Conclusions
      • System capable of:
          – indexing the unigram dataset provided by Google
          – analyzing the graph of each indexed word
          – establishing whether the graphic representation is cyclic
      • The analysis was done using 2 algorithms: autocorrelation and DTW
      • Most identified cyclic words are from the pharmaceutical domain
          – Interpretation: the interest in pharmaceutical products tends to be sinusoidal, with ups and downs
      • Both algorithms have advantages and disadvantages: autocorrelation is more general, while DTW is more specific
      • Autocorrelation may end up giving false alarms in the case of constant use of a word
      • DTW will fail to identify cyclic words if their shape is not sinusoidal
      • Future work: clustering the cyclic words (events, products, personalities, locations, sentiments, actions) → custom conclusions may be drawn
  • 15. Questions
      Thank you very much!
      This work has been funded by University Politehnica of Bucharest, through the “Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza seriilor de timp (Applying machine learning techniques in time series analysis), Contract number 09/26.09.2016.
