1. Text mining and association analysis:
Exploring text data for creating topic models
Kyuson Lim
1Department of Mathematics & Statistics, McMaster University
E-mail: limk15@mcmaster.ca
3. Why text mining?
I Data transformation: business and industry are overwhelmed with unstructured data.
I Telecommunication industry: analysis of customers' reasons for terminating service.
I Government agency: news issues, local opinions, topic clustering of annual
reports.
I Data mining: web crawling of news headlines, social media, movie reviews.
I Business intelligence, exploratory data analysis, consumer behaviors.
I Easy to interpret, a variety of applications, attractive outcomes, and combinable
with other results.
I Variety of algorithms: BERT (Bidirectional Encoder Representations from
Transformers), BTM (Biterm Topic Model), LDA (Latent Dirichlet Allocation).
Figure: Example by Kyuson in 2020
4. How is it different?
I Natural language processing (NLP): used to understand human language by
analyzing text, speech, or grammatical syntax.
I Advanced ML models and AI (artificial intelligence), e.g., Siri.
I Geared towards mimicking natural human communication and syntactic meaning.
I Extracts grammatical structure and sentiment.
I Text mining: used to extract information from unstructured and structured
content.
I Extracts information from unstructured data.
I Statistical models analyze the qualitative meaning of content.
I Frequencies of words, patterns, and correlations among words explain the text.
Figure: Example of Journals (Author submitted)
5. Goal of the analysis
I Effectively portray the output and present a collection of keywords with their
connections.
I Wordcloud creation using data visualization.
I Hierarchical clustering and correlation graph.
I K-medoids clustering for classification.
I Association between issues and causal discovery on the timeline of words.
I Gaussian graphical model: visually portray connections.
I Local smoothing regression: fit timelines of word frequencies.
I BTM: topic clustering of documents, a powerful tool for co-occurrence-based
topic models.
6. What do we gain?
I Statistics Canada (StatsCan COVID-19):
I January to December 2020 of the Canadian Perspectives Survey Series (CPSS).
I Covers living and lifestyle issues of Canadians aged over 20.
I Topics in sociological and economic issues.
I Keywords and issues of the Covid-19 pandemic in 2020.
I Measures the impact on daily life during the pandemic.
I Interpretation is simple, concise, and scientific.
I Future usage: retrospective review and pandemic solutions.
I Exploratory data analysis with advanced analytic methods.
I Data visualization: an efficient and practical skill to learn.
I Application to real-life data: data analyst work, complex data.
7. Text mining pre-processing
I Web crawling and text tokenization.
1. On the Statistics Canada website, inspect the HTML source code to locate the
elements that contain the headlines.
2. Then, build R code to crawl each headline in a loop and save the results.
3. Tokenize and convert all words to lowercase.
I Pre-process to filter out unnecessary words.
I Eliminate adverbs and special characters using the packages "stopwords" and
"tokenizers".
I Construct a term table consisting of words and their frequencies.
I Crawl publication dates and convert them to date format.
Figure: Example of web crawling
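The crawl-tokenize-count pipeline above is done in R with "stopwords" and "tokenizers"; the tokenizing and term-table steps can be sketched in Python as below. The headlines and the stop-word list here are hypothetical stand-ins, not the crawled Statistics Canada data.

```python
import re
from collections import Counter

# Hypothetical headlines standing in for the crawled Statistics Canada data.
headlines = [
    "COVID-19 pandemic impacts Canadian workers",
    "Survey examines mental health of Canadians during the pandemic",
]

# A tiny illustrative stop-word list; the poster uses the R packages
# "stopwords" and "tokenizers" for this step.
stopwords = {"the", "of", "during", "a", "an", "and"}

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens = [w for h in headlines for w in tokenize(h) if w not in stopwords]
term_table = Counter(tokens)  # term table: word -> frequency
print(term_table.most_common(3))
```

The `Counter` plays the role of the term table; sorting it by frequency gives the ranked keyword list used by the later wordcloud and bar-graph panels.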
8. Literature review of wordcloud
I Visualizes keyword metadata of websites in free-form text.
I Popularized by Web 2.0 websites and blogs.
I Previously published by the author in a government report.
I Association rule: support analysis (co-occurrences).
I Commonly known as market basket analysis.
I A machine learning (ML) method for discovering interesting relations between
variables.
I Visually identifies influential factors and makes the data attractive and
understandable for readers.
Figure: Web 2.0 and association rule analysis: support
9. Wordcloud and co-occurences
I Wordcloud and co-occurrence bar graph of the 6 most frequently used words.
I "Covid" and "pandemic" have been used together 79 times.
I "Pandemic" and "Canadian" have been used together 21 times.
I Apart from "covid" and "pandemic", most articles feature "Canadian" and "impact".
I Overall frequencies and keywords combine into phrases.
I Supplement the unstructured wordcloud with a quantitative bar graph ranking
the importance of issues by keyword identification.
I Investigate which words constitute the topics of Covid-19's impact in 2020.
Figure: Frequency table and combination of wordcloud with co-occurences graph
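The co-occurrence counts behind the bar graph (e.g. "covid" with "pandemic") can be sketched as counting unordered word pairs within each headline. This is an illustrative Python sketch on made-up tokenized documents, not the actual crawled counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized headlines; the real counts (e.g. "covid" with
# "pandemic" 79 times) come from the crawled Statistics Canada data.
docs = [
    ["covid", "pandemic", "impact"],
    ["covid", "pandemic", "canadian"],
    ["pandemic", "canadian", "health"],
]

cooc = Counter()
for doc in docs:
    # Count each unordered pair of distinct words appearing in the same doc.
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

print(cooc[("covid", "pandemic")])  # pair count across all documents
```

Ranking `cooc` by count gives exactly the kind of top-pair list ("covid"/"pandemic" first) shown in the co-occurrence bar graph.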
10. Interpretation: Wordcloud and co-occurences
I Investigation of minor and major words.
I Identify unique words such as "data", "differences", "survey", "statistics"
and "study".
I Economic and sociological issues dominate the majority of articles.
I Interest in living issues: "price", "mental", "concerns", "home", and "workers".
I Future work on living issues: overcoming inflation, economic support.
I A unique overview, extensible to various text data and publishable as a Shiny app.
Figure: Frequency table and combination of wordcloud with co-occurences graph
11. Literature review of hierarchical clustering
I Seeks to construct a hierarchy of clusters, classifying the words into groups
based on their dissimilarity.
1. The number of times a word is used gives its coordinates in the space.
2. For any two words, the distance in the space is calculated as a measure of
dissimilarity.
3. At the beginning of the clustering process, each element is in a cluster of its own.
4. Within the distance matrix, we can then cluster the words.
5. For two clusters, the distance is the maximum distance among any pair of
elements from the two clusters (complete linkage).
6. Then the two clusters separated by the shortest distance are combined.
7. Iteratively, the two most similar (closest) clusters or words are joined until
2 clusters remain.
I Correlation analysis: co-occurrences of words across all documents in a
sparse matrix.
I If a word occurs in a particular document, then the sparse matrix entry for
that row and column is 1; otherwise it is 0.
I An efficient representation of the information contained in the term-document
matrix.
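The merge steps above (complete linkage: cluster distance = maximum pairwise distance, merge the closest pair until 2 clusters remain) can be sketched in a few lines of Python. The word "coordinates" here are made-up frequency vectors, not the survey data.

```python
def dist(p, q):
    # Euclidean distance between two frequency vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def complete_linkage(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between two clusters is the
                # MAXIMUM pairwise distance between their members.
                d = max(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Toy frequencies for 4 "words": two near the origin, two far away.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(complete_linkage(pts, 2))  # -> [[0, 1], [2, 3]]
```

In practice one would use `hclust` in R (or `scipy.cluster.hierarchy` in Python) and draw the dendrogram; this sketch only shows the merging rule from steps 5-7.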
12. Hierarchical clustering and correlation analysis
I Dendrogram: a tree structure that visualizes how clusters combine by distance.
I 2 clusters form: 18 words in one and 3 words ("covid", "health" and
"pandemic") in the other.
I The words "covid" and "pandemic" have a strong correlation (0.35).
I "Impact" has a negative correlation with the word "health" (-0.1).
I Correlations are calculated across all words.
Figure: A hierarchical clustering dendrogram and correlation analysis.
13. Interpretation: Hierarchical clustering and correlation
I Sub-hierarchies formulate topics with their issues.
I The words "examines", "study", "using", and "survey" are grouped together in the
same hierarchy.
I A main topic of 3 words and sub-topics of 18 words.
I A similar result for the dissimilarity between the word "impact" and the words
"health" and "pandemic".
Figure: A hierarchical clustering dendrogram and correlation analysis.
14. Literature review of K-medoid clustering
I Partitions the data into groups, attempting to minimize the distance between
points by defining a center point for each cluster.
I Uses the Manhattan distance to define dissimilarity.
I A k-medoid minimizes a sum of pairwise dissimilarities.
I A k-medoid chooses datapoints in the data as centers (called medoids).
I Build steps construct the clusters; swap steps adjust the boundaries of
points.
Figure: Iteration steps for constructing the k-medoid clustering.
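The k-medoid objective (pick actual datapoints as centers, minimize the total Manhattan dissimilarity) can be made concrete on toy data. This sketch exhaustively searches the objective instead of the build/swap heuristic used in practice (e.g. PAM in R's `cluster` package); the points are hypothetical.

```python
from itertools import combinations

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kmedoids_cost(points, medoids):
    # PAM objective: each point's dissimilarity to its nearest medoid, summed.
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)

def kmedoids(points, k):
    """Exhaustive k-medoids for tiny data: choose the k datapoints that
    minimize the total Manhattan dissimilarity (the PAM objective)."""
    best = min(combinations(range(len(points)), k),
               key=lambda ms: kmedoids_cost(points, ms))
    return sorted(best)

# Toy data: three points near the origin, two far away.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
medoids = kmedoids(pts, 2)
print(medoids)  # indices of the chosen medoid points
```

Unlike k-means centroids, the returned medoids are actual datapoints, which is why the cluster centers in the poster's figure sit on real words.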
15. K-medoid clustering analysis
I The method is greedy (locally optimal choices), a heuristic over many possible solutions.
I 19 words fall in cluster 2, and one word is contained in both clusters 2 and 3.
I A similar result, with most words contained in cluster 2.
I Determining the number of clusters:
I The "elbow" method calculates how much variability in the data is explained by the
clustering.
I Identify the point after which the explained variability stops increasing sharply as
the optimal choice for the number of clusters.
Figure: A k-medoid clustering analysis and variance explained.
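The "variability explained" behind the elbow plot is the between-cluster sum of squares over the total sum of squares; computing it for increasing k and looking for the levelling-off point gives the elbow. A sketch on hypothetical points:

```python
def variance_explained(points, labels):
    """Fraction of total variance explained by a clustering:
    between-cluster sum of squares / total sum of squares."""
    n, dim = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    # Total sum of squares around the grand mean.
    tss = sum(sum((p[d] - grand[d]) ** 2 for d in range(dim)) for p in points)
    bss = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        cen = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Between-cluster term: cluster size times squared centroid offset.
        bss += len(members) * sum((cen[d] - grand[d]) ** 2 for d in range(dim))
    return bss / tss

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(round(variance_explained(pts, [1, 1, 2, 2]), 3))
```

With the well-separated toy points, two clusters already explain almost all of the variance; the poster's 45.64% for two clusters reflects the much noisier word-frequency data.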
16. Interpretation: K-medoid clustering
I Similar to the hierarchical clustering analysis in sub-topics and main topics.
I The words "covid" and "pandemic" are separated from the major cluster.
I The word "health" is contained in the other cluster (cluster 2).
I No clusters overlap, an adequate fit for the data. Two clusters account
for 45.64% of the variability in the data.
I Reflects a meaningful relation between words, mirroring how people perceived
living conditions and issues.
word    | covid  pandemic Canada impact Canadians business impacts
cluster | 1      3        2      2      2         2        2
word    | people health   Canadian economics survey article data examine
cluster | 2      3        2        2         2      2       2    2
Table: Words classified by cluster in k-medoids
17. Time series analysis: local smoothing regression
I A non-parametric regression method that mixes a moving average (MA) with
polynomial regression.
I A smooth curve is fitted to the trend changes to identify whether one word impacted
another to cause issues in 2020.
I The overall result shows that causality cannot be established, as the trend is the same for all 7 words.
I The decreasing trend of the word "Canada" after June shifted to the word "Canadian" as people
aimed to describe more specific interests.
I "Canadian" issues appear most frequently in headlines from July to August, as does the
word "health", indicating that the health impact on Canadian people was most severe.
Figure: A time series data analysis by local smoothing regression
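Local smoothing regression (LOESS/LOWESS in R, e.g. via `loess()`) fits a small regression in a sliding window around each point. A crude Python sketch with uniform weights and a local linear fit, on hypothetical monthly frequencies (not the crawled data):

```python
def local_linear_smooth(x, y, span=3):
    """Smooth y over x by fitting a least-squares line through the `span`
    nearest points to each x[i] (a crude LOESS with uniform weights)."""
    out = []
    for i in range(len(x)):
        # Indices of the `span` points nearest to x[i].
        idx = sorted(range(len(x)), key=lambda j: abs(x[j] - x[i]))[:span]
        xs, ys = [x[j] for j in idx], [y[j] for j in idx]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slope = sxy / sxx if sxx else 0.0
        # Evaluate the local line at x[i].
        out.append(my + slope * (x[i] - mx))
    return out

# Hypothetical monthly frequencies of one word across 2020.
months = list(range(1, 13))
freq = [5, 7, 6, 9, 8, 12, 11, 10, 9, 7, 6, 5]
smooth = local_linear_smooth(months, freq)
```

A proper LOESS additionally down-weights distant points (tricube weights); the uniform-weight version above is enough to show why the fitted curve tracks the trend while damping month-to-month noise.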
18. Literature review of Gaussian Graphical Model
I Explicitly captures the statistical dependency between the variables of interest
in the form of a network graph.
I Each node in the graph corresponds to one of the words in the text data.
I Missing edges in the graph correspond to conditional independence relations.
I Start with the complete graph and take a stepwise approach, comparing BIC values
between graphs, to disconnect edges.
I Apply a threshold to the partial correlations and remove all edges below
the threshold.
Figure: Undirected Gaussian graphical model for the dependency structure
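The thresholding step keeps an edge between two words only when their partial correlation, controlling for the other words, is large enough. For three variables the partial correlation has a closed form, sketched below on hypothetical monthly frequencies (the threshold value is also illustrative):

```python
from math import sqrt

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return num / sqrt(sum((x - ma) ** 2 for x in a)
                      * sum((y - mb) ** 2 for y in b))

def partial_corr(a, b, c):
    """Correlation of a and b after controlling for c (3-variable case)."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / sqrt((1 - rac ** 2) * (1 - rbc ** 2))

# Hypothetical monthly frequencies of three words.
covid    = [4, 5, 7, 8, 10, 12]
pandemic = [5, 6, 8, 9, 11, 13]  # tracks "covid" exactly, shifted by 1
health   = [2, 3, 3, 4, 5, 6]

# Keep the edge only if the partial correlation exceeds a chosen threshold.
threshold = 0.3
edge = abs(partial_corr(covid, pandemic, health)) > threshold
```

With more words, the partial correlations come from the inverse covariance (precision) matrix, and a zero entry there is exactly a missing edge, i.e. conditional independence given all other words.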
19. Interpretation of Gaussian Graphical Model
I Conditional independence relationships between sets of words as practical inference.
I The words "Canadian" and "covid" cannot connect to "article" and "Canada" except through
the nodes "impact" and "pandemic", which establishes the conditional independence
relationship.
I Also, (health) ⊥ (impact, Canada) | (article), by the connected edges.
I (Canadian) ⊥ (impact, pandemic) | (covid) and (health, article) ⊥ (Canadian, covid) | (impact,
pandemic).
I Structural interpretation and partial correlation results:
I Most articles dealing with "health" issues are related to "pandemic" and "impact" in
2020.
I The words "covid" and "article" are conditionally independent (with 2), and "Canadian" and
"impact" are also conditionally independent (with 9).
Figure: Undirected Gaussian graphical model for the dependency structure
20. Literature review of biterm topic model (BTM)
I First introduced in 2013, addressing the inadequacy of topic models on
short documents by modelling co-occurrences of global terms.
I A strong method for topic clustering of short texts, as it is a probabilistic
generative model.
I Learns topics by directly modeling the generation of word co-occurrence patterns
across the corpus, rather than modeling each document as a mixture of topics.
I The R package BTM was used to perform the biterm topic modeling.
I Crawl plain-text data and pre-process and tokenize the inputs. The output gives
a unique tag for each sentence and the characteristics of words.
I Perform part-of-speech tagging on titles and extract co-occurrences of nouns,
adjectives and verbs within a 3-word distance.
I Build the biterm topic model with 5 topics and provide the set of biterms to
cluster, with tuning parameters supplied to analyze the data.
I The R package "ggraph" is used to visualize the topic clustering results.
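The "biterms" fed to the model are unordered word pairs co-occurring within a short window. This Python sketch shows the extraction step only (the modeling itself is done by the R BTM package); the token list is a hypothetical tagged headline.

```python
from collections import Counter

def biterms(tokens, window=3):
    """Extract unordered word pairs co-occurring within `window` positions,
    the raw input a biterm topic model is trained on (a sketch, not the
    BTM package itself)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Pair the current word with the next window-1 words.
        for v in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((w, v)))] += 1
    return pairs

# Hypothetical nouns/adjectives/verbs kept after part-of-speech tagging.
tokens = ["covid", "pandemic", "impact", "canadian", "health"]
print(biterms(tokens))
```

Pooling these pairs over all headlines is what lets BTM learn topics from corpus-wide co-occurrence even when each individual headline is only a few words long.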
21. Interpretation of the biterm topic model (BTM)
I Some unique, previously unobserved words appear, including "international",
"postsecondary", and "student".
I The economic issues and sociological issues are somewhat separated, yielding a different result.
I As a natural consequence of the Covid-19 pandemic, the words "medical", "protective",
"business" and "personal" appear to be of interest.
I BTM yields comprehensive, grouped topics of words through the application of mixture models.
I The words "mental" and "health" are closely connected, showing the issue of public health.
I The words "outlook", "price" and "service" show the worries and daily lives of Canadians under the
impact of Covid-19.
Figure: Biterm clusters for 5 topics
22. Conclusion and discussion
I There is some variation and there are minor differences among the methods.
I The BTM provides the strongest guidance on the data analyzed, with coherent and
consistent topics.
I Hierarchical clustering shows 2 clusters, while k-medoid clustering shows 3 clusters.
Figure: Models used to analyze the text data.
23. Conclusion and discussion
I Each method differs in its mathematical and statistical foundations, leading
us to explore the data and guiding us through different results of the analysis.
I Analyzing term frequencies and co-frequencies, clustering, and formulating topic
models together give a better understanding of the keyword topics of the Covid-19 pandemic
in Canada.
I The wordcloud showed keywords and co-occurrences observed in the data.
I Hierarchical clustering and k-medoid clustering grouped the words and investigated their
correlations.
I Observed 2 groups of keywords: a first group of main topics ("covid", "pandemic" and
"health") and a second group of sub-topics for issues.
I Local smoothing regression on the time series investigated whether differing
trends could support a causal relationship.
I The trend is similar for the 7 most frequent words, so no formal causal statement
can be made.
I The Gaussian graphical model drew conditional independence and structural
dependence relationships among the top 7 words.
I Through conditional independence, issues are organized into co-occurring phrases with
structural dependencies.
I BTM gave more concise and specific topics, clearly differentiating the relevant keywords.
I The 5 topics present the problems and issues, with keywords on Canadians overcoming
the Covid-19 pandemic.
24. References
I Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of
items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on
Management of data (pp. 207-216).
I Wijffels, J. (2020). BTM: Biterm topic models for short text. R package version 0.3.
URL: https://CRAN.R-project.org/package=BTM.
I Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam’s razor. Information
processing letters, 24(6), 377-380.
I Gershon, N., & Page, W. (2001). What storytelling can do for information visualization.
Communications of the ACM, 44(8), 31-37.
I Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
I Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent Keyword Analysis Using Time Series
and Copula Models. Applied Sciences, 9(19), 4071.
I Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A Word Cloud Geom for 'ggplot2'. R
package version 0.5.0.
I Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of
hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
I Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM,
CLARA, and CLARANS algorithms. In International conference on similarity search and
applications (pp. 171-187). Springer, Cham.
I James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 112, p. 18). New York: Springer.
I Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social
and Mobile Data. Springer Nature.
I Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple
keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.