Text mining and association analysis:
Exploring text data for creating topic models
Kyuson Lim
Department of Mathematics & Statistics, McMaster University
E-mail: limk15@mcmaster.ca
Content
- Introduction
- Motivation
- Analysis results of 6 methods
- Literature review
- Interpretation
- Conclusion
Why text mining?
- Data transformation: businesses and industries are overwhelmed with unstructured data.
  - Telecommunications industry: analysis of customers' reasons for termination.
  - Government agencies: news issues, local opinions, topic clustering of annual reports.
- Data mining: web crawling of news headlines, social media, movie reviews.
  - Business intelligence, exploratory data analysis, consumer behavior.
- Easy to interpret, wide variety of applications, attractive outcomes, can be combined with other results.
- Variety of algorithms: BERT (Bidirectional Encoder Representations from Transformers), BTM (Biterm Topic Model), LDA (Latent Dirichlet Allocation).
Figure: Example by Kyuson in 2020
How is it different?
- Natural language processing (NLP): used to understand human language by analyzing text, speech, or grammatical syntax.
  - Advanced ML models and AI (artificial intelligence), e.g. Siri.
  - Geared towards mimicking natural human communication and the meaning of syntax.
  - Extracts grammatical structure and sentiment.
- Text mining: used to extract information from unstructured and structured content.
  - Extracts information from unstructured data.
  - Statistical models analyze the qualitative meaning of content.
  - Frequencies of words, patterns, and correlations between words explain the text.
Figure: Example of Journals (Author submitted)
Goal of the analysis
- Effectively portray the output and present it as a collection of keywords and their connections.
  - Wordcloud: created using data visualization.
  - Hierarchical clustering and correlation graph.
  - K-medoids clustering for classification.
- Association between issues and causal discovery on the timeline of words.
  - Gaussian graphical model: visually portrays connections.
  - Local smoothing regression: fitted to the timelines of word frequencies.
  - BTM: topic clustering of documents, the ultimate tool for a co-occurrence-based topic model.
What do we gain?
- Statistics Canada (StatsCan COVID-19):
  - January to December 2020 of the Canadian Perspectives Survey Series (CPSS).
  - Covers living and lifestyle issues of people aged over 20 in Canada.
  - Topics in sociological and economic issues.
- Keywords and issues of the Covid-19 pandemic in 2020.
  - Measure of impact on daily life, endemics.
  - Interpretation is simple, concise and scientific.
  - Future usage: recap and pandemic solutions.
- Exploratory data analysis with advanced analytic methods.
  - Data visualization: an efficient and practical skill to acquire.
  - Application to real-life data: data analyst, complex data.
Text mining pre-processing
- Web crawling and text tokenization.
  1. On the Statistics Canada website, inspect the HTML source code to identify the code that holds the headlines and parses through the contents.
  2. Then build code in R that crawls each headline in a loop and saves it.
  3. Tokenize the text, converting all words to lowercase.
- Pre-process to filter out unnecessary words.
  - Eliminate adverbs and special characters using the packages "stopwords" and "tokenizers".
  - Construct a term table consisting of words and their frequencies.
  - Crawl the published dates and convert them to date format.
Figure: Example of web crawling
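The tokenizing and term-table steps above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the slides. The stop-word set and the two headlines are made up for the example, and the crawling step is left out.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; the slides rely on
# the R packages "stopwords" and "tokenizers" instead.
STOP_WORDS = {"the", "of", "on", "in", "and", "a", "to", "for", "during"}


def tokenize(headline):
    """Lowercase a headline and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", headline.lower())


def term_table(headlines):
    """Count how often each non-stop-word occurs across all headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(t for t in tokenize(headline) if t not in STOP_WORDS)
    return counts


# Made-up headlines, not the Statistics Canada data.
headlines = [
    "Impact of the COVID-19 pandemic on Canadian businesses",
    "Mental health of Canadians during the pandemic",
]
table = term_table(headlines)
for word, freq in table.most_common(3):
    print(word, freq)
```

Digits and punctuation are dropped by the regular expression, so "COVID-19" becomes the single token "covid".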
Literature review of wordcloud
- Visualizes keyword metadata on websites as free-form text.
  - Popularized mainly by Web 2.0 websites and blogs.
  - Past experience of publishing in a government report.
- Association rule: support analysis (co-occurrences).
  - Commonly known as market basket analysis.
  - A machine learning (ML) method for discovering interesting relations between variables.
  - Visually identifies influential factors and makes the data attractive and easy for readers to understand.
Figure: Web 2.0 and association rule analysis: support
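As a rough illustration of support in association-rule analysis, the sketch below counts how many documents contain both words of a pair and divides by the number of documents. The four tokenized documents are hypothetical, and Python is used here only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """For each unordered word pair, count the documents containing both."""
    pairs = Counter()
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


def support(pairs, word_a, word_b, n_documents):
    """Support of a pair: share of documents in which both words occur."""
    return pairs[tuple(sorted((word_a, word_b)))] / n_documents


# Hypothetical tokenized headlines.
docs = [
    ["covid", "pandemic", "impact"],
    ["covid", "pandemic", "health"],
    ["canadian", "pandemic"],
    ["canadian", "business"],
]
pairs = cooccurrence_counts(docs)
print(support(pairs, "pandemic", "covid", len(docs)))
```

Sorting each pair gives one key per unordered pair, so the order of the two words in a query does not matter.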
Wordcloud and co-occurrences
- Wordcloud and co-occurrence bar graph of the 6 most frequently used words.
  - "Covid" and "pandemic" are used together 79 times.
  - "pandemic" and "Canadian" are used together 21 times.
  - Apart from "covid" and "pandemic", most "article" co-occurrences are with "Canadian" and "impact".
- Overall data on frequencies and keywords are combined into phrases.
  - Supplement the unstructured wordcloud with a quantitative bar graph of the importance of issues identified by keywords.
  - Investigate which words constitute the topics of Covid-19's impact in 2020.
Figure: Frequency table and combination of wordcloud with co-occurrences graph
Interpretation: Wordcloud and co-occurrences
- Investigation of minor and major words.
  - Identify some unique words such as "data", "differences", "survey", "statistics" and "study".
  - Economic and sociological issues in the majority of articles.
- Interest in living issues: "price", "mental", "concerns", "home", and "workers".
  - Future work on living issues and on overcoming inflation, economic support.
  - A unique overview that can be extended to various text data and published as a Shiny app.
Figure: Frequency table and combination of wordcloud with co-occurrences graph
Literature review of hierarchical clustering
- Seeks to construct a hierarchy of clusters, classifying the words into groups based on their dissimilarity.
  1. The number of times each word is used gives its coordinates in the space.
  2. For any two words, the distance in the space is calculated as a measure of dissimilarity.
  3. At the beginning of the clustering process, each element is in a cluster of its own.
  4. Using the distance matrix, we can then cluster the words.
  5. For two clusters, the distance is the maximum distance between any pair of elements from the two clusters.
  6. Then the two clusters separated by the shortest distance are combined.
  7. Iteratively, the two most similar (closest) clusters or words are joined until 2 clusters remain.
- Correlation analysis: co-occurrences of words across all documents in the sparse matrix.
  - If a word occurs in a particular document, then the sparse matrix entry corresponding to that row and column is 1; otherwise it is 0.
  - An efficient representation of the information contained in the term-document matrix.
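The clustering steps listed above (start with singletons, measure cluster distance by the farthest pair, merge the closest two clusters, repeat) can be sketched as below. The Euclidean distance and the word-count coordinates are assumptions made for the example; the slides do not state which distance was used.

```python
def euclidean(u, v):
    """Straight-line distance between two count vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete (maximum-distance) linkage."""
    clusters = [[name] for name in points]  # every word starts alone

    def cluster_distance(c1, c2):
        # Distance between clusters = largest distance over all member pairs.
        return max(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the two clusters separated by the shortest distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical coordinates: each word's counts in three documents.
points = {
    "covid": (9, 8, 9),
    "pandemic": (8, 9, 8),
    "survey": (1, 0, 1),
    "study": (0, 1, 1),
    "data": (1, 1, 0),
}
result = complete_linkage(points, 2)
print(result)
```

Stopping at a chosen number of clusters corresponds to cutting the dendrogram at that level.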
Hierarchical clustering and correlation analysis
- Dendrogram: a tree structure that visualizes how clusters are combined by distance.
  - 2 clusters are established: one with 18 words and one with the 3 words "covid", "health" and "pandemic".
- The words "covid" and "pandemic" have a strong correlation (0.35).
  - The word "impact" has a negative correlation with the word "health" (-0.1).
  - Correlations are calculated using all words.
Figure: Hierarchical clustering dendrogram and correlation analysis.
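A correlation between two words can be computed from their 0/1 columns in the term-document matrix. The sketch below assumes the Pearson correlation and uses four hypothetical documents, so the values it prints are not the ones reported on the slide.

```python
from math import sqrt


def occurrence_vector(word, documents):
    """Binary column of the term-document matrix for one word."""
    return [1 if word in doc else 0 for doc in documents]


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical documents, each represented as a set of its words.
docs = [
    {"covid", "pandemic"},
    {"covid", "pandemic", "health"},
    {"impact", "canadian"},
    {"impact", "covid"},
]
r_covid_pandemic = pearson(occurrence_vector("covid", docs),
                           occurrence_vector("pandemic", docs))
r_impact_health = pearson(occurrence_vector("impact", docs),
                          occurrence_vector("health", docs))
print(round(r_covid_pandemic, 3), round(r_impact_health, 3))
```

Words that tend to appear in the same documents get a positive value, and words that rarely share a document get a negative one.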
Interpretation: Hierarchical clustering and correlation
- Sub-hierarchies formulate topics and their issues.
  - The words "examines", "study", "using", and "survey" are grouped together in the same hierarchy.
- A main topic of 3 words and sub-topics of 18 words.
  - Similar result for the dissimilarity between the word "impact" and the words "health" and "pandemic".
Figure: Hierarchical clustering dendrogram and correlation analysis.
Literature review of K-medoid clustering
- Partitions the data into groups and attempts to minimize the distance between the points and a point defined as the center of their cluster.
  - Uses the Manhattan distance to define the dissimilarity.
  - K-medoids minimizes a sum of pairwise dissimilarities.
- K-medoids chooses data points from the data as centers (called medoids).
  - Build steps construct the clusters and swap steps adjust the boundaries of points.
Figure: Iteration steps for constructing the k-medoid clustering.
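The build and swap steps can be illustrated with a simplified partitioning-around-medoids routine using the Manhattan distance. This is a sketch on six made-up points, not the implementation behind the slides, and it recomputes costs naively for readability.

```python
def manhattan(u, v):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(u, v))


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k):
    """Simplified PAM: greedy build, then swaps while the cost decreases."""
    medoids = []
    # Build: repeatedly add the data point that lowers the total cost most.
    while len(medoids) < k:
        candidates = [i for i in range(len(points)) if i not in medoids]
        medoids.append(
            min(candidates, key=lambda i: total_cost(points, medoids + [i])))
    # Swap: replace a medoid by a non-medoid whenever that reduces the cost.
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for i in range(len(points)):
                if i in medoids:
                    continue
                trial = [i if x == m else x for x in medoids]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
                    break
    labels = [min(medoids, key=lambda m: manhattan(p, points[m]))
              for p in points]
    return medoids, labels


# Made-up 2-D points forming two well-separated groups.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
medoids, labels = k_medoids(points, 2)
print(sorted(medoids), labels)
```

Each label is the index of the medoid the point is assigned to, so the centers are always actual data points.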
K-medoid clustering analysis
- The method is a greedy heuristic: it makes locally optimal choices among many possible solutions.
  - Cluster 2 contains 19 words, and one word is contained in both cluster 2 and cluster 3.
  - A similar result, with most words contained in cluster 2.
- Determining the number of clusters:
  - The "elbow" method calculates how much of the variability in the data is explained by the clustering.
  - The point of drastic increase is taken as the optimal cut-off for the number of clusters.
Figure: K-medoid clustering analysis and variance explained.
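One common way to quantify the variability explained by a clustering is one minus the ratio of within-cluster to total sum of squares. The sketch below assumes that definition, with cluster means and squared Euclidean distance; the slides do not say how the figure was computed. The points and labelings are made up.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n
                 for d in range(len(vectors[0])))


def sum_of_squares(vectors, center):
    """Sum of squared Euclidean distances to a center."""
    return sum(sum((a - c) ** 2 for a, c in zip(v, center)) for v in vectors)


def explained_variability(points, labels):
    """Share of total variability captured by the given cluster labels."""
    total = sum_of_squares(points, mean(points))
    within = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        within += sum_of_squares(members, mean(members))
    return 1 - within / total


# Made-up 1-D points and hypothetical labelings for k = 1, 2 and 4.
points = [(0,), (2,), (10,), (12,)]
labelings = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}
curve = {k: explained_variability(points, lab)
         for k, lab in labelings.items()}
print(curve)
```

Plotting such a curve against the number of clusters and looking for the point where the gain levels off is the usual reading of the elbow.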
Interpretation: K-medoid clustering
- Similar to the hierarchical clustering analysis in sub-topics and main topics.
  - The words "covid" and "pandemic" are separated from the major cluster.
  - The word "health" is contained in the other cluster (cluster 2).
- No clusters overlap, which indicates an adequate fit for the data. Two clusters account for 45.64% of the variability in the data.
  - Reflects a meaningful relation between words, where the result reflects how people perceived living conditions and issues.
Word     covid  pandemic  Canada  impact  Canadians  business  impacts
Cluster  1      3         2       2       2          2         2

Word     people  health  Canadian  economics  survey  article  data  examine
Cluster  2       3       2         2          2       2        2     2
Table: Words classified by cluster in k-medoids
Time series analysis: local smoothing regression
- A non-parametric regression method that combines a moving average (MA) with polynomial regression.
- A smooth curve is fitted to the trend changes to identify whether one word has an impact on another, causing some issues in 2020.
- The overall result shows that causality cannot be established, as the trend is the same for all 7 words.
- The decreasing trend of the word "Canada" after June shifted to the word "Canadian", as people aimed to describe more specific interests.
- "Canadian" issues appeared more frequently in the headlines from July to August, as did the word "health", indicating that the impact of health issues on Canadian people was most severe in that period.
Figure: Time series data analysis by local smoothing regression
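The idea of combining a moving window with polynomial regression can be sketched as a local linear fit with tricube weights. This is a simplified stand-in for the smoother used on the slides, whose degree and span are not stated. The monthly counts are made up and `span` is an assumed tuning parameter.

```python
def tricube(u):
    """Tricube kernel: weight falls smoothly to zero at |u| = 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(xs, ys, x0, span=0.75):
    """Weighted straight-line fit around x0 using the nearest points."""
    n = len(xs)
    k = min(n, max(3, round(span * n)))        # points in the neighborhood
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0  # local bandwidth
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


# Made-up monthly frequencies of one word in 2020 (months 1 to 12).
months = list(range(1, 13))
counts = [3, 5, 4, 8, 12, 15, 14, 11, 9, 6, 5, 4]
trend = [local_linear_fit(months, counts, m, span=0.5) for m in months]
print([round(t, 1) for t in trend])
```

A smaller `span` follows the monthly counts more closely, while a larger one gives a flatter trend.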
Literature review of Gaussian Graphical Model
- Explicitly captures the statistical dependencies between the variables of interest in the form of a network graph.
  - Each node in the graph corresponds to one of the words in the text data.
  - Missing edges in the graph correspond to conditional independence relations.
- Start with the complete graph and take a stepwise approach, comparing graphs by their BIC values, to remove edges.
  - Apply a specific threshold to the partial correlations and remove all edges below the threshold.
Figure: Undirected Gaussian graphical model for the dependency structure
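The thresholding step can be illustrated by computing partial correlations from the inverse of a correlation matrix and keeping only the edges above a cut-off. The sketch covers thresholding only, not the stepwise BIC search. The 3-by-3 correlation matrix, the word labels and the threshold of 0.1 are all hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(corr):
    """Partial correlation of i and j given the rest: -P_ij / sqrt(P_ii P_jj)."""
    prec = invert(corr)
    n = len(prec)
    return [[1.0 if i == j else
             -prec[i][j] / (prec[i][i] * prec[j][j]) ** 0.5
             for j in range(n)] for i in range(n)]


def graph_edges(pcor, names, threshold):
    """Keep an edge only if the absolute partial correlation is large enough."""
    n = len(names)
    return [(names[i], names[j]) for i in range(n) for j in range(i + 1, n)
            if abs(pcor[i][j]) >= threshold]


# Hypothetical correlations: "covid" and "health" are related only
# through "pandemic".
names = ["covid", "pandemic", "health"]
corr = [[1.0, 0.5, 0.25],
        [0.5, 1.0, 0.5],
        [0.25, 0.5, 1.0]]
pcor = partial_correlations(corr)
edges = graph_edges(pcor, names, threshold=0.1)
print(edges)
```

In this example the missing edge between "covid" and "health" reads as conditional independence given "pandemic".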
Interpretation of Gaussian Graphical Model
- Conditional independence relationships between sets of words serve as practical inference.
  - The words "Canadian" and "covid" cannot connect with "article" and "Canada" without the edges between them; they are connected through the nodes "impact" and "pandemic", which gives the conditional independence relationship.
  - Also, (health) ⊥ (impact, Canada) | (article), by the connected edges.
  - (Canadian) ⊥ (impact, pandemic) | (covid) and (health, article) ⊥ (Canadian, covid) | (impact, pandemic).
- Structural interpretation and result of partial correlation:
  - Most articles that deal with "health" issues are relevant to "pandemic" and "impact" in 2020.
  - The words "covid" and "article" are conditionally independent (with 2), and "Canadian" and "impact" are also conditionally independent (with 9).
Figure: Undirected Gaussian graphical model for the dependency structure
Literature review of biterm topic model (BTM)
- First introduced in 2013 to address the inadequacies of topic modelling on short documents by modelling word co-occurrences globally.
  - Well suited to topic clustering of short texts, as it is a probabilistic generative model.
  - Learns topics by directly modeling the generation of word co-occurrence patterns (biterms) across the whole corpus, rather than modeling each document as a mixture of topics.
- The R package BTM was used to perform the biterm topic modeling.
  - Crawl the plain-text data and pre-process it into tokenized inputs. The output gives a unique tag for each sentence and the characteristics of words.
  - Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs within a distance of 3 words.
  - Build the biterm topic model with 5 topics and provide the set of biterms to cluster, with tuning parameters supplied to analyze the data.
  - The R package "ggraph" is used to automatically process the topic clustering data.
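The biterm extraction step (word pairs at most 3 positions apart) can be sketched as below. This Python illustration covers only the extraction of biterms, not the fitting of the topic model. The token list is made up and part-of-speech filtering is assumed to have been done already.

```python
from collections import Counter


def extract_biterms(tokens, window=3):
    """Return unordered word pairs whose positions are at most `window` apart."""
    biterms = []
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + 1 + window]:
            if first != second:
                biterms.append(tuple(sorted((first, second))))
    return biterms


# Made-up headline, already reduced to nouns, adjectives and verbs.
tokens = ["mental", "health", "canadians", "pandemic", "impact"]
biterms = extract_biterms(tokens, window=3)
biterm_counts = Counter(biterms)
print(biterm_counts.most_common(3))
```

The topic model is fitted to the pooled biterms of all headlines, which is why short texts with few words each can still yield stable topics.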
Interpretation of the biterm topic model (BTM)
- Some unique words that had not appeared in the earlier analyses include "international", "postsecondary", and "student".
  - Economic and sociological issues are somewhat separated, yielding different results.
  - As a natural consequence of the Covid-19 pandemic, the words "medical", "protective", "business" and "personal" appear to be of interest.
- BTM yields comprehensive, grouped topics of words through the application of mixture models.
  - The words "mental" and "health" are closely connected, pointing to the issue of public health.
  - The words "outlook", "price" and "service" show Canadians' worries and living conditions under the impact of Covid-19.
Figure: Biterm clusters for 5 topics
Conclusion and discussion
- There are some variations and minor differences between the methods.
  - BTM provides the ultimate guidance on the data we analyzed, with coherent and consistent topics.
  - Hierarchical clustering shows 2 clusters, but k-medoid clustering shows 3 clusters.
Figure: Models used to analyze text data.
Conclusion and discussion
- Each method differs in its mathematical and statistical foundation, leading us to explore the data and guiding us through different results of the analysis.
- Analyzing term frequencies and term co-frequencies, clustering, and formulating topic models help us better understand the topics and keywords of the Covid-19 pandemic in Canada.
  - The wordcloud revealed the keywords and co-occurrences in the data.
  - Hierarchical clustering and k-medoid clustering grouped the keywords and investigated their correlation.
    - Observed 2 groups of keywords: a first group of main topics ("covid", "pandemic" and "health") and a second group of sub-topics for issues.
  - Local smoothing regression on the time series data investigated whether a causal relationship could be drawn from differing trends.
    - The trend is similar for the 7 most frequent words, so no formal statement on causal inference can be made.
  - The Gaussian graphical model drew conditional independence and structural dependence relationships between the top 7 ranked words.
    - By conditional independence, issues are organized by the co-occurrences of phrases into structural dependencies.
  - BTM gave more concise and specific topics that clearly differentiate the relevant keywords.
    - The 5 topics yield the problems and issues, with keywords, that Canadians must overcome to end the Covid-19 pandemic.
References
- Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216).
- Wijffels, J. (2020). BTM: Biterm topic models for short text. URL: https://CRAN.R-project.org/package=BTM. R package version 0.3, 1.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377-380.
- Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
- Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
- Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent keyword analysis using time series and copula models. Applied Sciences, 9(19), 4071.
- Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A word cloud geom for 'ggplot2'. R package version 0.5.0.
- Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
- Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications (pp. 171-187). Springer, Cham.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
- Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data. Springer Nature.
- Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.

More Related Content

Similar to Text mining and its association analysis.pdf

Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph FuturesPaul Groth
 
Pt2520 Unit 6 Data Mining Project
Pt2520 Unit 6 Data Mining ProjectPt2520 Unit 6 Data Mining Project
Pt2520 Unit 6 Data Mining ProjectJoyce Williams
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text AnalyticsRushikeshChikane2
 
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx
2To ADD names From ADD name Date ADD date Subject ADD ti.docxnovabroom
 
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx
2To ADD names From ADD name Date ADD date Subject ADD ti.docxjesusamckone
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
 
3282016 Additional Book Resourceshttpscourserooma.cap.docx
3282016 Additional Book Resourceshttpscourserooma.cap.docx3282016 Additional Book Resourceshttpscourserooma.cap.docx
3282016 Additional Book Resourceshttpscourserooma.cap.docxtamicawaysmith
 
Supervised Multi Attribute Gene Manipulation For Cancer
Supervised Multi Attribute Gene Manipulation For CancerSupervised Multi Attribute Gene Manipulation For Cancer
Supervised Multi Attribute Gene Manipulation For Cancerpaperpublications3
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesShantanu Deshpande
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGijnlc
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGkevig
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
 
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?Molly Gibbons (she/her)
 

Similar to Text mining and its association analysis.pdf (13)

Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
Pt2520 Unit 6 Data Mining Project
Pt2520 Unit 6 Data Mining ProjectPt2520 Unit 6 Data Mining Project
Pt2520 Unit 6 Data Mining Project
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text Analytics
 
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
 
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx2To  ADD names From  ADD name Date  ADD date Subject  ADD ti.docx
2To ADD names From ADD name Date ADD date Subject ADD ti.docx
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
 
3282016 Additional Book Resourceshttpscourserooma.cap.docx
3282016 Additional Book Resourceshttpscourserooma.cap.docx3282016 Additional Book Resourceshttpscourserooma.cap.docx
3282016 Additional Book Resourceshttpscourserooma.cap.docx
 
Supervised Multi Attribute Gene Manipulation For Cancer
Supervised Multi Attribute Gene Manipulation For CancerSupervised Multi Attribute Gene Manipulation For Cancer
Supervised Multi Attribute Gene Manipulation For Cancer
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniques
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics?
 

More from KyusonLim

ROC Korean drought presentation.pptx
ROC Korean drought presentation.pptxROC Korean drought presentation.pptx
ROC Korean drought presentation.pptxKyusonLim
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic netKyusonLim
 
ideas of mathematics -17tilings (final)
ideas of mathematics -17tilings (final)ideas of mathematics -17tilings (final)
ideas of mathematics -17tilings (final)KyusonLim
 
BlUP and BLUE- REML of linear mixed model
BlUP and BLUE- REML of linear mixed modelBlUP and BLUE- REML of linear mixed model
BlUP and BLUE- REML of linear mixed modelKyusonLim
 
Missing value imputation (slide)
Missing value imputation (slide)Missing value imputation (slide)
Missing value imputation (slide)KyusonLim
 
Survival analysis 1
Survival analysis 1Survival analysis 1
Survival analysis 1KyusonLim
 

More from KyusonLim (7)

ROC Korean drought presentation.pptx
ROC Korean drought presentation.pptxROC Korean drought presentation.pptx
ROC Korean drought presentation.pptx
 
Dag in mmhc
Dag in mmhcDag in mmhc
Dag in mmhc
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic net
 
ideas of mathematics -17tilings (final)
ideas of mathematics -17tilings (final)ideas of mathematics -17tilings (final)
ideas of mathematics -17tilings (final)
 
BlUP and BLUE- REML of linear mixed model
BlUP and BLUE- REML of linear mixed modelBlUP and BLUE- REML of linear mixed model
BlUP and BLUE- REML of linear mixed model
 
Missing value imputation (slide)
Missing value imputation (slide)Missing value imputation (slide)
Missing value imputation (slide)
 
Survival analysis 1
Survival analysis 1Survival analysis 1
Survival analysis 1
 

Recently uploaded

What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxCeline George
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdfUGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdfNirmal Dwivedi
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfstareducators107
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...EADTU
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17Celine George
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17Celine George
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesSHIVANANDaRV
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfPondicherry University
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Play hard learn harder: The Serious Business of Play
Play hard learn harder:  The Serious Business of PlayPlay hard learn harder:  The Serious Business of Play
Play hard learn harder: The Serious Business of PlayPooky Knightsmith
 
PANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxPANDITA RAMABAI- Indian political thought GENDER.pptx
PANDITA RAMABAI- Indian political thought GENDER.pptxakanksha16arora
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 

Recently uploaded (20)

What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdfUGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
UGC NET Paper 1 Unit 7 DATA INTERPRETATION.pdf
 
Simple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdfSimple, Complex, and Compound Sentences Exercises.pdf
Simple, Complex, and Compound Sentences Exercises.pdf
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...

Text mining and its association analysis.pdf

  • 1. A text mining and association analysis: Exploring text data for creating topic models
      Kyuson Lim, Department of Mathematics & Statistics, McMaster University
      E-mail: limk15@mcmaster.ca
  • 2. Content
      - Introduction
      - Motivation
      - Analysis results of 6 methods
      - Literature review
      - Interpretation
      - Conclusion
  • 3. Why text mining?
      - Data transformation: business and industry are overwhelmed with unstructured data.
          - Telecommunication industry: analysis of customers' reasons for termination.
          - Government agencies: news issues, local opinions, topic clustering of annual reports.
      - Data mining: web crawling of news headlines, social media, movie reviews.
          - Business intelligence, exploratory data analysis, consumer behavior.
      - Easy to interpret, wide variety of applications, attractive outcomes, can be combined with other results.
      - Variety of algorithms: BERT (Bidirectional Encoder Representations from Transformers), BTM (Biterm Topic Model), LDA (Latent Dirichlet Allocation).
      Figure: Example by Kyuson in 2020
  • 4. How is it different?
      - Natural language processing (NLP): used to understand human language by analyzing text, speech, or grammatical syntax.
          - Advanced ML models and AI (artificial intelligence), e.g. Siri.
          - Geared towards mimicking natural human communication and the meaning of syntax.
          - Extracts grammatical structure and sentiment.
      - Text mining: used to extract information from unstructured and structured content.
          - Extracts information from unstructured data.
          - Statistical models analyze the qualitative meaning of content.
          - Word frequencies, patterns, and correlations between words explain the text.
      Figure: Example of Journals (Author submitted)
  • 5. Goal of the analysis
      - Effectively portray the output and present it as a collection of keywords and their connections.
          - Word cloud, created using data visualization.
          - Hierarchical clustering and correlation graph.
          - K-medoids clustering for classification.
      - Association between issues and causal discovery on the timeline of words.
          - Gaussian graphical model: visually portrays connections.
          - Local smoothing regression: fitted to the timelines of word frequencies.
          - BTM: topic clustering of documents; the main tool here for co-occurrence-based topic modelling.
  • 6. What do we gain?
      - Statistics Canada (StatsCan COVID-19):
          - January to December 2020 of the Canadian Perspectives Survey Series (CPSS).
          - Covers living and lifestyle issues of people aged over 20 in Canada.
          - Topics on sociological and economic issues.
      - Keywords and issues of the Covid-19 pandemic in 2020.
          - Measure of impact on daily life, endemics.
          - Interpretation is simple, concise and scientific.
          - Future usage: recap and pandemic solutions.
      - Exploratory data analysis with advanced analytic methods.
          - Data visualization: an efficient and practical skill to acquire.
          - Application to real-life data: data analysts, complex data.
  • 7. Text mining pre-processing
      - Web crawling and tokenizing text.
          1. On the Statistics Canada website, inspect the HTML source code to find the markup that contains the headlines.
          2. Then, build code in R that crawls each headline in a loop and saves it.
          3. Tokenize the text, converting all words to lowercase.
      - Pre-process to filter out unnecessary words.
          - Eliminate adverbs and special characters using the packages "stopwords" and "tokenizers".
          - Construct a term table consisting of words and their frequencies.
          - Crawl the publication dates and convert them to date format.
      Figure: Example of web crawling
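
The tokenizing and term-table steps can be sketched roughly as follows. The slides did this in R with the "stopwords" and "tokenizers" packages; this is only a Python stand-in using the standard library, and the headlines, the stop-word list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical headlines; the real ones were crawled from Statistics Canada.
HEADLINES = [
    "Impact of COVID-19 on Canadian businesses",
    "Mental health of Canadians during the pandemic",
    "The impact of the pandemic on health workers",
]

# Tiny illustrative stop-word list (an assumption, not the R "stopwords" list).
STOPWORDS = {"of", "on", "the", "during", "a", "an", "and", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs, dropping digits and punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def term_table(docs):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


table = term_table(HEADLINES)
for word, freq in table.most_common(3):
    print(word, freq)
```

The resulting counter plays the role of the term table (words and frequencies) described on the slide.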
  • 8. Literature review of word clouds
      - Visualize keyword metadata on websites as free-form text.
          - Popularized mainly by Web 2.0 websites and blogs.
          - Past experience of publishing in a government report.
      - Association rules: support analysis (co-occurrences).
          - Commonly known as market basket analysis.
          - A machine learning (ML) method for discovering interesting relations between variables.
          - Identifies influential factors visually, making the data attractive and easy for readers to understand.
      Figure: Web 2.0 and association rule analysis: support
  • 9. Word cloud and co-occurrences
      - Word cloud and co-occurrence bar graph of the 6 most frequently used words.
          - "Covid" and "pandemic" are used together 79 times.
          - "pandemic" and "Canadian" are used together 21 times.
          - Apart from "covid" and "pandemic", most "articles" are paired with "Canadian" and "impact".
      - Overall frequencies and keywords are combined to form phrases.
          - The unstructured word cloud is supplemented by a quantitative bar graph that shows the importance of issues through keyword identification.
          - Investigate to find the words that constitute topics on the impact of Covid-19 in 2020.
      Figure: Frequency table and combination of word cloud with co-occurrence graph
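
Co-occurrence counts and the support measure from association rule analysis can be illustrated with a small sketch. The documents below are made-up word sets, not the crawled headlines, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical headlines, each reduced to its set of keywords.
DOCS = [
    {"covid", "pandemic", "impact"},
    {"covid", "pandemic", "canadian"},
    {"pandemic", "canadian"},
    {"health", "impact"},
]


def cooccurrence_counts(docs):
    """Count, for every unordered word pair, the number of documents containing both."""
    pairs = Counter()
    for words in docs:
        for a, b in combinations(sorted(words), 2):
            pairs[(a, b)] += 1
    return pairs


def support(pair, docs):
    """Support of a pair: the share of documents that contain every word in it."""
    return sum(1 for words in docs if set(pair) <= words) / len(docs)


pairs = cooccurrence_counts(DOCS)
print(pairs.most_common(2))
print(support(("covid", "pandemic"), DOCS))
```

Sorting the words before pairing keeps ("covid", "pandemic") and ("pandemic", "covid") as one key.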
  • 10. Interpretation: Word cloud and co-occurrences
      - Investigation of minor and major words.
          - Identify some unique words such as "data", "differences", "survey", "statistics" and "study".
          - Economic and sociological issues account for the majority of articles.
      - Interest in living issues: "price", "mental", "concerns", "home", and "workers".
          - Future work on living issues, overcoming inflation, and economic support.
          - A unique overview that can be improved across various text data and published as a Shiny app.
      Figure: Frequency table and combination of word cloud with co-occurrence graph
  • 11. Literature review of hierarchical clustering
      - Seeks to construct a hierarchy of clusters, classifying the words into groups based on their dissimilarity.
          1. The number of times each word is used gives its coordinates in the space.
          2. For any two words, the distance in the space is calculated as a measure of dissimilarity.
          3. At the beginning of the clustering process, each element is in a cluster of its own.
          4. Using the distance matrix, we can then cluster the words.
          5. For two clusters, the distance is the maximum distance between any pair of elements from the two clusters.
          6. The two clusters separated by the shortest distance are then combined.
          7. Iteratively, the two most similar (closest) clusters or words are joined until 2 clusters remain.
      - Correlation analysis: co-occurrences of words across all documents in the sparse matrix.
          - If a word occurs in a particular document, the sparse matrix entry for that row and column is 1; otherwise it is 0.
          - An efficient representation of the information contained in the term-document matrix.
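
The seven steps above describe agglomerative clustering with complete (maximum-distance) linkage. A minimal sketch, assuming Euclidean distance and made-up two-dimensional count vectors (the slides do not state the distance or give the data):

```python
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two count vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def complete_linkage(points, n_clusters=2):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    points maps each word to its coordinate vector.
    """
    clusters = [frozenset([word]) for word in points]  # step 3: singletons

    def dist(c1, c2):
        # Step 5: cluster distance is the maximum pairwise element distance.
        return max(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Step 6: pick the pair of clusters separated by the shortest distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical word coordinates (counts in two groups of documents).
WORDS = {
    "covid": (9, 8), "pandemic": (8, 9), "health": (7, 7),
    "impact": (1, 2), "survey": (2, 1), "data": (1, 1),
}
result = complete_linkage(WORDS, n_clusters=2)
print(result)
```

Recording the merge distances at each iteration would give the heights needed to draw a dendrogram.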
  • 12. Hierarchical clustering and correlation analysis
      - Dendrogram: a tree structure that visualizes how clusters are combined by distance.
          - 2 clusters are established: one of 18 words and one of 3 words ("covid", "health" and "pandemic").
      - The words "covid" and "pandemic" have a strong correlation (0.35).
          - The word "impact" has a negative correlation with the word "health" (-0.1).
          - Correlations are calculated over all words.
      Figure: Dendrogram of the hierarchical clustering and correlation analysis.
  • 13. Interpretation: Hierarchical clustering and correlation
      - Sub-hierarchies formulate topics together with their issues.
          - The words "examines", "study", "using", and "survey" are grouped together in the same hierarchy.
          - A main topic of 3 words and sub-topics of 18 words.
      - A similar result holds for the dissimilarity between the word "impact" and the words "health" and "pandemic".
      Figure: Dendrogram of the hierarchical clustering and correlation analysis.
  • 14. Literature review of K-medoid clustering
      - Partitions the data into groups and attempts to minimize the distance between the points of each cluster and a point defined as the center of that cluster.
          - Uses the Manhattan distance to define dissimilarity.
          - K-medoids minimizes a sum of pairwise dissimilarities.
          - K-medoids chooses actual data points as centers (called medoids).
      - A build step constructs the clusters and a swap step adjusts the assignment of points.
      Figure: Iteration steps for constructing the k-medoid clustering.
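
A rough sketch of the build and swap idea with Manhattan distance follows. It is a simplified greedy variant, not the PAM implementation used for the slides: the build step here just takes the first k points, and the data are hypothetical.

```python
def manhattan(p, q):
    """Manhattan (city-block) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k):
    """Greedy k-medoids: swap a medoid for a non-medoid whenever the cost drops."""
    medoids = list(range(k))  # simplified build step (assumption): first k points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # swap step: keep only strict improvements
                    medoids, best, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda j: manhattan(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels, best


# Hypothetical word coordinates forming two obvious groups.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels, cost = k_medoids(POINTS, k=2)
print(medoids, labels, cost)
```

Because every accepted swap strictly lowers the cost, the loop terminates, but only at a local optimum, which is the sense in which the method is greedy.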
  • 15. K-medoid clustering analysis
      - The method is greedy (it makes locally optimal choices), so it is a heuristic that can end in different solutions.
          - Cluster 2 contains 19 words, and one word lies in both cluster 2 and cluster 3.
          - A similar result, with most words contained in cluster 2.
      - Determining the number of clusters:
          - The "elbow" method calculates how much of the variability in the data is explained by the clustering.
          - The point of drastic increase is taken as the optimal cut-off for the number of clusters.
      Figure: K-medoid clustering analysis and variance explained.
  • 16. Interpretation: K-medoid clustering
      - Similar to the hierarchical clustering analysis in its sub-topics and main topics.
          - The words "covid" and "pandemic" are separated from the major cluster.
          - The word "health" is contained in the other cluster (cluster 2).
      - No clusters overlap, which indicates an adequate fit for the data. Two clusters account for 45.64% of the variability in the data.
          - Reflects a meaningful relation between words, showing how people perceived living conditions and issues.

      word     covid  pandemic  Canada  impact  Canadians  business  impacts
      cluster  1      3         2       2       2          2         2

      word     people  health  Canadian  economics  survey  article  data  examine
      cluster  2       3       2         2          2       2        2     2

      Table: Words classified by cluster in k-medoids
  • 17. Time series analysis: local smoothing regression
      - A non-parametric regression method that combines a moving average (MA) with polynomial regression.
      - A smooth curve is fitted to the trend to identify whether one word influences another and causes some issue in 2020.
          - Overall, the result shows that no causal statement is possible, as the trend is the same for all 7 words.
          - The decreasing trend of the word "Canada" after June shifted to the word "Canadian", as people aimed to describe more specific interests.
          - "Canadian" issues appear more frequently in the headlines from July to August, as does the word "health", indicating that the health impact on Canadians was most severe in that period.
      Figure: Time series data analysis by local smoothing regression
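
The combination of a moving window with polynomial regression can be sketched as a tricube-weighted local straight-line fit. This is a simplified illustration, not the smoother used for the slides; the span value, the monthly counts, and the function name are assumptions.

```python
def local_linear_fit(x0, xs, ys, span=0.5):
    """Smoothed value at x0 from a weighted line fitted to the nearest points."""
    n = len(xs)
    k = max(2, round(span * n))  # number of neighbours in the moving window
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    # Bandwidth just beyond the farthest neighbour so every weight stays positive.
    h = 1.0001 * max(abs(xs[i] - x0) for i in nearest) or 1.0
    weights = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]  # tricube
    total = sum(weights)
    mean_x = sum(w * xs[i] for w, i in zip(weights, nearest)) / total
    mean_y = sum(w * ys[i] for w, i in zip(weights, nearest)) / total
    sxx = sum(w * (xs[i] - mean_x) ** 2 for w, i in zip(weights, nearest))
    if sxx == 0:  # all neighbours share one x value: fall back to the mean
        return mean_y
    sxy = sum(
        w * (xs[i] - mean_x) * (ys[i] - mean_y) for w, i in zip(weights, nearest)
    )
    return mean_y + (sxy / sxx) * (x0 - mean_x)


# Hypothetical monthly frequencies of one keyword in 2020.
months = list(range(1, 13))
counts = [3, 5, 4, 8, 12, 15, 18, 17, 11, 9, 6, 4]
smoothed = [local_linear_fit(m, months, counts) for m in months]
print([round(v, 1) for v in smoothed])
```

Plotting the smoothed values for several words on one axis would allow the kind of trend comparison described on the slide.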
  • 18. Literature review of Gaussian Graphical Model
      - Explicitly captures the statistical dependencies between the variables of interest in the form of a network graph.
          - Each node in the graph corresponds to one of the words in the text data.
          - Missing edges in the graph correspond to conditional independence relations.
      - Start with the complete graph and take a stepwise approach, comparing graphs by their BIC values, to remove edges.
          - Apply a specific threshold to the partial correlations and remove all edges that fall below it.
      Figure: Undirected Gaussian graphical model for the dependency structure
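
The link between missing edges and partial correlations can be written compactly. The notation is introduced here and does not appear in the slides: the word variables X_1, ..., X_p are assumed jointly Gaussian with covariance matrix Σ, precision matrix Ω = Σ^{-1} with entries ω_ij, and τ is a hypothetical threshold.

```latex
\rho_{ij \mid \text{rest}} \;=\; -\,\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},
\qquad
\omega_{ij} = 0 \;\iff\; X_i \perp X_j \mid X_{V \setminus \{i,j\}},
\qquad
\text{remove edge } (i,j) \text{ if } \lvert \rho_{ij \mid \text{rest}} \rvert < \tau .
```

A zero entry of the precision matrix therefore corresponds exactly to a missing edge, and thresholding the partial correlations is an approximate way of deciding which entries to treat as zero.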
  • 19. Interpretation of Gaussian Graphical Model
      - Conditional independence relationships between sets of words serve as a practical inference.
          - The words "Canadian" and "covid" cannot connect with "article" and "Canada" without the edges between them; they are connected through the nodes "impact" and "pandemic", which gives the conditional independence relationship.
          - Also, (health) ⊥ (impact, Canada) | (article), by the connected edges.
          - (Canadian) ⊥ (impact, pandemic) | (covid) and (health, article) ⊥ (Canadian, covid) | (impact, pandemic).
      - Structural interpretation and partial correlation results:
          - Most articles that deal with "health" issues are relevant to "pandemic" and "impact" in 2020.
          - The words "covid" and "article" are conditionally independent (with 2), and "Canadian" and "impact" are also conditionally independent (with 9).
      Figure: Undirected Gaussian graphical model for the dependency structure
  • 20. Literature review of biterm topic model (BTM)
      - First introduced in 2013 to address the inadequacy of topic models on short documents by modelling word co-occurrences globally, over the whole corpus.
          - Well suited to topic clustering of short texts, as it is a probabilistic generative model.
          - Learns topics by directly modelling the generation of word co-occurrence patterns (biterms) in the whole corpus, rather than modelling each document as a mixture of topics.
      - The R package BTM was used to perform the biterm topic modelling.
          - The crawled plain text is pre-processed and tokenized as input. The output gives a unique tag for each sentence and the characteristics of each word.
          - Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs within a distance of 3 words.
          - Build the biterm topic model with 5 topics, providing the set of biterms to cluster; tuning parameters are supplied as inputs.
          - The R package "ggraph" is used to visualize the topic clustering output.
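
The extraction of co-occurrences within a distance of 3 words can be sketched as follows. This covers only the biterm extraction, not the topic model itself, which the slides fit with the R package BTM; the token list and the function name are hypothetical, and the tokens are assumed to be already filtered to nouns, adjectives and verbs.

```python
def biterms(tokens, window=3):
    """Return unordered word pairs whose positions differ by at most `window`."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:  # a word paired with itself carries no information
                pairs.append(tuple(sorted((left, right))))
    return pairs


# Hypothetical tagged and filtered headline.
TITLE = ["mental", "health", "canadians", "pandemic", "impact"]
extracted = biterms(TITLE, window=3)
print(extracted)
```

The pooled biterms from all titles form the input that the topic model clusters into topics.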
  • 21. Interpretation of the biterm topic model (BTM)
      - Some unique words that did not appear in the earlier analyses include "international", "postsecondary", and "student".
          - Economic and sociological issues are somewhat separated, yielding different results.
          - As a natural consequence of the Covid-19 pandemic, the words "medical", "protective", "business" and "personal" appear as topics of interest.
      - BTM yields comprehensive, grouped topics of words through the application of mixture models.
          - The words "mental" and "health" are closely connected, pointing to the issue of public health.
          - The words "outlook", "price" and "service" show Canadians' worries about living under the impact of Covid-19.
      Figure: Biterm clusters for 5 topics
  • 22. Conclusion and discussion
      - There is some variation and minor differences between the methods.
          - BTM provides the clearest guidance on the data we analyzed, with coherent and consistent topics.
          - Hierarchical clustering shows 2 clusters, whereas k-medoid clustering shows 3 clusters.
      Figure: Models used to analyze the text data.
  • 23. Conclusion and discussion
      - Each method differs in its mathematical and statistical foundation, leading us to explore the data through different results of the analysis.
      - Analyzing term frequencies and co-frequencies, clustering, and formulating topic models help to better understand the topics and keywords of the Covid-19 pandemic in Canada.
          - The word cloud showed the keywords and co-occurrences in the data.
          - Hierarchical clustering and k-medoid clustering grouped the keywords and investigated their correlations.
          - Two groups of keywords were observed: a first group of main-topic keywords ("covid", "pandemic" and "health") and a second group of sub-topics for issues.
      - Local smoothing regression on the time series data investigated whether differing trends suggest a causal relationship.
          - The trend is similar for the 7 most frequent words, so no formal statement on causal inference can be made.
      - A Gaussian graphical model drew conditional independence and structural dependence relationships between the 7 top-ranked words.
          - Through conditional independence, issues are organized by the co-occurrences of phrases into structural dependencies.
      - BTM gave more concise and specific topics, clearly differentiating the relevant keywords.
          - The 5 topics yield the problems and issues, with keywords, that Canadians must overcome to end the Covid-19 pandemic.
  • 24. References
      - Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216).
      - Wijffels, J. (2020). BTM: Biterm topic models for short text. R package version 0.3.1. URL: https://CRAN.R-project.org/package=BTM
      - Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377-380.
      - Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
      - Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
      - Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent keyword analysis using time series and copula models. Applied Sciences, 9(19), 4071.
      - Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A word cloud geom for 'ggplot2'. R package version 0.5.0.
      - Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
      - Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications (pp. 171-187). Springer, Cham.
      - James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
      - Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data. Springer Nature.
      - Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.