SlideShare a Scribd company logo
1 of 8
Download to read offline
TELKOMNIKA, Vol.16, No.4, August 2018, pp. 1752~1759
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v16i4.8194  1752
Received December 2, 2017; Revised January 24, 2018; Accepted July 3, 2018
WCLOUDVIZ: Word Cloud Visualization of Indonesian
News Articles Classification Based on Latent Dirichlet
Allocation
Retno Kusumaningrum*, Satriyo Adhy, Suryono
Department of Informatics, Universitas Diponegoro, Jl. Prof. Soedarto, SH Tembalang, Semarang, Central
Java, Indonesia
*Corresponding author, e-mail: retno@live.undip.ac.id
Abstract
Latent Dirichlet Allocation (LDA) is a widely implemented approach for extracting hidden topics in
documents generated by soft clustering of a word based on document co-occurrence as a multinomial
probability distribution over terms. Therefore, several visualizations have been developed, such as
matrices design, text-based design, tree design, parallel coordinates, and force-directed graphs.
Furthermore, based on a set of documents representing a class (category), we can implement
classification task by comparing topic proportion for each class and topic proportion for the testing
document by using Kullback-Leibler Divergence (KLD). Therefore, the purpose of this study is to develop a
system for visualizing the output of LDA as a classification task. The visualization system consists of two
parts: bar chart and dependent word cloud. The first visualization aims to show the trend of each category,
while the second visualization aims to show the words that represent each selected category in a word
cloud. This visualization is subsequently called WCloudViz. It provides clear, understandable and
preferably shared the result.
Keywords: Latent dirichlet allocation, Topic modeling, News articles classification, Data visualization,
Word cloud
Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Latent Dirichlet Allocation, also known as LDA, is one of topic modeling algorithms. In
LDA, each document is represented as a random mixture over a set of latent topics and each of
the topic is represented as a distribution over a vocabulary [1]. Until this moment, LDA has been
widely implemented in various domains, such as text document [2]–[8], image domain [9]–[14],
music domain [15], [16], and etc.
LDA can be implemented for classification task as well as clustering task. As clustering
task, the output of LDA is a set of topics in which each topic is represented by a set of words
and its probability . This LDA’s output is a corpus-level output. In addition, LDA also results
an output at document-level, i.e. topic proportion for each document . Furthermore, when
we have a set of documents representing a class (category), we can compute topic proportion
for each class . By implementing Kullback-Leibler Divergence, we can find the most similar
distribution between topic proportion for each class and topic proportion for testing document
based on the smallest value of KLD [12].
In accordance with clustering tasks, there are several visualizations that have been
developed. The first design is matrices design in which rows correspond to terms and columns
to topics. A developed matrices-based system is Termite [17]. The second design is text-based
design [18]. This visualization exploits the use of text to illustrate topic as well as the use of text
to explain the represented words for each topic, related documents for each topic, and related
topics for each topic. Based on the selected document, we can also find the document’s related
topics, document’s text, and document’s related documents.
The third design is tree design. This design is a popular design for visualizing LDA’s
output such as relationship-enriched visualization [19], Diachronic visualization [20],
LDAExplore [21], and etc. LDAExplore also implements parallel coordinates to show topic
proportion for each document. The first vertical coordinate is coordinate for displaying
TELKOMNIKA ISSN: 1693-6930 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum)
1753
document’s ID, while second vertical coordinate and so on is for representing values of topic
proportion (distribution) for each topic. A line is drawn to connect the document's ID and related
topic proportions for each remaining vertical line. The last design is force-directed graphs, such
as relationship-enriched visualization [19] and Diachronic visualization [20]. In addition, there
are some visualizations that have been developed for broader topic (information retrieval) such
as map view, tree view, and bubble view which are implemented in [22], graph-view [23], heat
map, interactive stream graph, and context focus which are implemented in [24].
However, those visualizations are focused on visualizing word-topic distribution
produced by LDA. Therefore, the aim of this study is to develop a system for visualizing output
of LDA as classification task. We implement it against the classification of Indonesian news
articles. This is shown to reduce the drawbacks of Indonesian online news presentations, i.e.
repetition of news contents and news splitting into some headlines. It is known as yellow
journalism which gives the negative impact for readers, that are, a shift of citizen characters
arising from excessive news report such as the outspread of "Alay" culture, the diversion of
political issues, common corruption cases, etc. This classification and visualization system is
expected to be used to know the news trends and the words that represent them for each
category. The system is subsequently called as WCloudViz.
In section 2, we provide the research method related to the design consideration
including task analysis, data and backend, as well as prototype design. The explanation about
result and discussion will be presented in section 4. Finally, in section 5, we conclude this study
with what we perceive to be the future work.
2. Research Method
As explained in previous section, WCloudViz is aimed to visualize the output of LDA as
classification task. Therefore, there are three tasks, which form WCloudViz. The first task is
identifying news trend for each class (category), and visualizing it as bar chart is the second
task. The last task is word cloud generation of word representing for each selected category.
The outline of WCloudViz formation is depicted in Figure 1. The following sub-sub sections
explain each task in detail.
Figure 1. The outline of WCloudViz formation
 ISSN: 1693-6930
TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759
1754
2.1. News Trend Identification
Since we aim to visualize the output of LDA as classification task, the process of news
trend identification will be conducted based on LDA classification model. The detail about how
to generate the LDA-based classification model for news articles classification can be read in
Ref. [25]. Moreover, the procedure of news trend identification described in Figure 2.
Figure 2. Procedure of news trend identification based on LDA classification model
Tokenization is a process to eliminate characters that are likely to be insignificant for
information extraction, such as white space and punctuation marks. Afterwards, the second
process is stopword removal, implemented for eliminating words that are likely to be
insignificant for information extraction. In this research, there are 758 words in Indonesian that
are considered as stopwords. Table 1 shows example of Indonesian stopwords and its English
translation for each word.
Table 1. Applications in Each Class
Part of Speech Stopwords in Indonesia English Translation
Pronoun saya, kamu, mereka, kita, ini, itu I, you, they, we, this, that
Adverb sekarang, agak currently, some
Preposition setelah, sebelum, kepada, dalam, atas, bawah after, before, to, in, on, under
Conjunction dan, karena and, because
Stemming, the last text processing, is a process to derive each word from their root
form. This research implements Sastrawi-stemmer library for stemming process. Sastrawi-
stemmer is incrementally built. As a base, it implements Nazief Nazief-Adriani algorithm.
Subsequently, it is improved by Confix Stripping (CS) algorithm and Enhanced Confix Stripping
(ECS) algorithm. The final improvement is a modification of Modified ECS. The selection of
Sastrawi-stemmer library for stemming process is based on several issues, namely it
overcomes overstemming problem by using basic words dictionary, it overcomes the
understemming problem by adding rules, as well as it is able to stem compound words correctly.
TELKOMNIKA ISSN: 1693-6930 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum)
1755
On the other hand, we already have the classification models including PWZ (word-
topic distribution) values and PZ (topic distribution) values. Both values are in corpus level. In
addition, we also have a set of words that represent each topic. Based on the model, we
compute the PZD (topic proportion for each document) for each unseen news article by using
the following formula.
  

N
i
i
tkkd kzp
1
)(
,, *  (1)
Where kd, is equal to PZD, the probability of k -th topic in d -th document,  kzp  is the
probabilityof topic z is equal to k , and is the sum of i -th term in the corresponding
document equal to term t and assigned as topic k , while N is the number of term in the
corresponding document.
Afterwards, we can compute the similarity distribution between pre-defined PZC (topic
proportion for each class) and pre-computed PZD (resulted based on Equation 1) based on
Kullback-Leibler Divergence (KLD). Let  ipP  be the topic proportion of the unseen
document and  iqQ  be the topic proportion of each class; then the KLD can be computed
by using the following formula.
2
DQPDPQ
KLD

 (2)
 






i i
i
i
q
p
pDPQ log (3)
 






i i
i
i
p
q
qDQP log (4)
Finally, the most similar distribution between topic proportion for each class and topic
proportion for unseen document is obtained based on the smallest value of KLD.
2.2. Bar Chart Visualization
Bar chart is visualized based on the number of documents for each class (category)
over the whole submitted new articles. Figure 3 shows the detailed procedure for generating bar
chart based on output of the previous task.
Figure 3. Procedure of bar chart visualization
 ISSN: 1693-6930
TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759
1756
According to the Figure 3, the input of this process is a collection of news articles that
have been classified from previous step into six categories (class) including economic, tourism,
criminal, sport, and politic. The next process is computing a number of articles for each
category. Subsequently, plot the bar chart with x-axis represents six news categories and y-axis
represents number of articles for each category.
2.3. Word Cloud Visualization
Based on the selected class (category), we can generate word cloud of represented
words based on the following procedure as shown in Figure 4. We use the initial value of the
biggest font size that is 150, and the number of words that are 20. Both values can be changed
in accordance with the requirements.
The first step of word cloud visualization is counting frequency of word occurrence in all
documents of selected category. Get maximum frequency and get top-20 most occurred words.
Subsequently, compute the font size of all obtained words based on the following formula. We
use default value of MaxFont is 150 pt.






 MaxFont
cyMaxFrequen
Frequency
FontSize (5)
As an exception and to provide the convenience of the reader, then the default value of the
smallest FontSize is set at 14 pt despite FontSize calculation results based on the
formula (5) gives the results of less than 14 pt.
Figure 4. Procedure of word cloud visualization
TELKOMNIKA ISSN: 1693-6930 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum)
1757
3. Results and Analysis
In this section, it is explained the results of research and its comprehensive discussion
about interaction and visualization. For the demonstrator of WCloudViz, we employ the news
articles data in Indonesian. These articles are obtained from local newspaper, i.e. Jawa Pos -
Radar Semarang. Articles are saved in .txt format for each headline.
3.1. Prototype Design
As explained in previous section, there are three tasks which form WCloudViz,
consisting of news trend identification, bar chart visualization, and word cloud visualization. Two
first tasks are represented in one interface as shown in Figure 5. The third task is represented in
another interface, which is loaded after the user clicks one of the bars in the chart. The interface
can be seen in Figure 6.
Figure 5. Bar chart visualization with embedded news trend identification
Figure 6. Word cloud visualization of selected category
3.2. Interaction and Visualization
Users can gain insight into the news categories that have become the reporting focus
through the bar chart visualization, which means that the news category with most report has
the highest bar chart and vice-versa. In addition, users can also obtain information on the topic
of each news category. Each topic is represented as a word cloud, i.e. word-occurrence in a set
of documents. Word cloud will be displayed when the user has selected one of news categories.
Hereafter, it can be referred as the dependent word cloud.
For example, Figure 6 shows the economic category and it is represented by 10 words
having high frequency, such as “harga” (price), “cabai” (chili), “Rp” (Indonesian currency),
“turun” (decrease), “ribu” (thousand), “BBM” (fuel), “solar” (diesel), “nelayan” (fisherman),
“dagang” (trade), and “emas” (gold). Furthermore, users can also draw conclusion based on the
 ISSN: 1693-6930
TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759
1758
word cloud, i.e. the current economic news headlines are widely reporting about the news
related to the increasing chili prices up to hundreds of thousands rupiahs, the declining rupiahs,
and the price of fuel, diesel fuel, as well as gold.
In accordance with the previous explanations, there are some points that can be
accomplished by word cloud. First, word cloud provides clarity in identifying trends and pattern
that will otherwise be unclear or difficult to see in a tabular format. Second, word cloud is easier
to understand since it is more visually appealing than the textual format of data. The bigger the
words represent, the more widely-reported the headlines are. The last point, word cloud is more
likely to be shared since it presents pictorial representation of the data. Unfortunately, the
resulting word cloud has not been depicted in accordance with the related topic or theme, thus it
is less impactful.
4. Conclusion
WCloudViz is a visualization system based on LDA. It consists of two parts: bar chart
and dependent word cloud. The first visualization aims to show the trend of each category, while
the second visualization aims to show the words that represent each selected category in word
cloud. This visualization provides clear, understandable, and preferably shared result.
The strategic future work shall form each word cloud in accordance with related topic or theme,
thus it can be more impactful and insightful.
Acknowledgement
The authors would like to acknowledge the research funding supported by Universitas
Diponegoro under the grant of Research for International Scientific Publication (number 1052-
36/UN7.5.1/PG/2016, year 2016)
References
[1] DM Blei, AY Ng, MI Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003; 3: 993–1022.
[2] DM Blei. Probabilistic Models of Text and Images. University of Carolina, Berkeley, 2004.
[3] V Ravi. Legal Documents Clustering using Latent Dirichlet Allocation. International Journal of Applied
Information Systems. 2012; 2(6): 34–37.
[4] I Bíró, J Szabó, AA Benczúr. Latent dirichlet allocation in web spam filtering. Proc. 4th Int. Work.
Advers. Inf. Retr. web, 2008: 29–32.
[5] W Sriurai. Improving Text Categorization by using a Topic Model. Adv. Comput., 2011; 2(6): 21–27.
[6] I Bíró. Document Classification with Latent Dirichlet Allocation. Unpubl. Dr. Diss. Eotvos Lorand …,
2009.
[7] L Bolelli, Ş Ertekin, CL Giles. Topic and trend detection in text collections using latent dirichlet
allocation. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes
Bioinformatics). 2009; 5478 LNCS: 776–780.
[8] XH Phan, LM Nguyen, S Horiguchi. Learning to Classify Short and Sparse Text & Web with Hidden
Topics from Large-scale Data Collections. Proceeding 17th Int. Conf. World Wide Web - WWW ’08,
no. January 2008: 91–100.
[9] K Barnard, P Duygulu, D Forsyth, N de Freitas, DM Blei, MI Jordan. Matching Words and Pictures. J.
Mach. Learn. Res., 2003; 3: 1107–1135.
[10] C Wang, DM Blei, L Fei-fei. Simultaneous image classification and annotation. IEEE Conference on
Computer Vision and Pattern Recognition. 2009:1903–1910.
[11] D Bratasanu, I Nedelcu, M Datcu. Bridging the Semantic Gap for Satellite Image Annotation and
Automatic Mapping Applications. IEEE J. Sel. Top. Earth Obs. Remote Sens. 2011; 4(1): 193-120.
[12] R Kusumaningrum, H Wei, R Manurung, A Murni. Integrated visual vocabulary in latent Dirichlet
allocation–based scene classification for IKONOS image. J. Appl. Remote Sens. Jan. 2014; 8(1):
83690-1-18.
[13] R Kusumaningrum, HM Manurung, AM Arymurthy. CIELab Color Moments: Alternative Descriptors
for LANDSAT Images Classification System. INKOM. 2014; 8(2): 109–114.
[14] M Liénou, H Maître, M Datcu. Using Latent Dirichlet Allocation. IEEE Geosci. Remote Sens. Lett.
2010; 7(1): 28–32.
[15] D Hu. Latent Dirichlet Allocation, Text, Images, & Music.”
[16] R Cai, C Zhang, C Wang, L Zhang, W Ma. MusicSense: Contextual Music Recommendation using
Emotional Allocation Modeling. Emotion. 2007: 553–556.
TELKOMNIKA ISSN: 1693-6930 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum)
1759
[17] J Chuang, CD Manning, J Heer. Termite: Visualization Techniques for Assessing Textual Topic
Models. in Proceedings of the International Working Conference on Advanced Visual Interfaces
(AVI), 2012: 74–77.
[18] AJB Chaney, DM Blei. Visualizing Topic Models. in Proceedings of the Sixth International Conference
on Weblogs and Social Media. 2003.
[19] A Smith, J Chuang, Y Hu, J Boyd-graber, L Findlater. Concurrent Visualization of Relationships
between Words and Topics in Topic Models. in Proceedings of the Workshop on Interactive
Language Learning, Visualization, and Interfaces. 2014: 79–82.
[20] J Lamirel, N Dugu, P Cuxac. Performing and Visualizing Temporal Analysis of Large Text Data
Issued for Open Sources: Past and Future. in Proceedings of the 12th International Conference on
Beyond Databases, Architecture and Structures. 2016: 1340846.
[21] A Ganesan, K Brantley, S Pan J Chen. LDAExplore: Visualizing Topic Models Generated Using
Latent Dirichlet Allocation. CoRR, 2015.
[22] MH Mamoon, HM El-Bakry, AA Salama. Visualization for Information Retrieval based on Fast Search
Technology. Indonesian Journal of Electrical Engineering and Informatics (IJEEI). December 2013;
1(4): 140-156.
[23] F Nadirman, A Ridha, Annisa. Searching and Visualization of References in Research Documents.
TELKOMNIKA (Telecommunication Computing Electronics and Control). June 2014; 12(2): 447-454.
[24] MNP Ma'ady, CK Yang, RP Kusumawardani, H Suryotrisongko. Temporal Exploration in 2D
Visualization of Emotions on Twitter Stream. TELKOMNIKA (Telecommunication Computing
Electronics and Control). February 2018; 16(1): 376-384.
[25] R Kusumaningrum, S Adhy, MIA Wiedjayanto, Suryono. Classification of Indonesian News Articles
based on Latent Dirichlet Allocation. in Proceedings of the 3rd International Conference on Software
and Data Engineering, 2016.

More Related Content

What's hot

NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241
Urjit Patel
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
Bhaskar Mitra
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
ijseajournal
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representation
suthi
 

What's hot (20)

A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Text summarization
Text summarizationText summarization
Text summarization
 
G04124041046
G04124041046G04124041046
G04124041046
 
similarity measure
similarity measure similarity measure
similarity measure
 
NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241
 
Language independent document
Language independent documentLanguage independent document
Language independent document
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representation
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 

Similar to WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classification Based on Latent Dirichlet Allocation

Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL DevelopmentProposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Jorge Barreto
 
A Survey of Object Oriented Programming LanguagesMaya Hris.docx
A Survey of Object Oriented Programming LanguagesMaya Hris.docxA Survey of Object Oriented Programming LanguagesMaya Hris.docx
A Survey of Object Oriented Programming LanguagesMaya Hris.docx
daniahendric
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
dannyijwest
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
IJwest
 

Similar to WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classification Based on Latent Dirichlet Allocation (20)

A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
 
Texts Classification with the usage of Neural Network based on the Word2vec’s...
Texts Classification with the usage of Neural Network based on the Word2vec’s...Texts Classification with the usage of Neural Network based on the Word2vec’s...
Texts Classification with the usage of Neural Network based on the Word2vec’s...
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
20433-39028-3-PB.pdf
20433-39028-3-PB.pdf20433-39028-3-PB.pdf
20433-39028-3-PB.pdf
 
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONEXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION
 
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
 
Mapping of extensible markup language-to-ontology representation for effectiv...
Mapping of extensible markup language-to-ontology representation for effectiv...Mapping of extensible markup language-to-ontology representation for effectiv...
Mapping of extensible markup language-to-ontology representation for effectiv...
 
Exploiting rhetorical relations to
Exploiting rhetorical relations toExploiting rhetorical relations to
Exploiting rhetorical relations to
 
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL DevelopmentProposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
 
A Metamodel and Graphical Syntax for NS-2 Programing
A Metamodel and Graphical Syntax for NS-2 ProgramingA Metamodel and Graphical Syntax for NS-2 Programing
A Metamodel and Graphical Syntax for NS-2 Programing
 
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...Rhetorical Sentence Classification for Automatic Title Generation in Scientif...
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
A Survey of Object Oriented Programming LanguagesMaya Hris.docx
A Survey of Object Oriented Programming LanguagesMaya Hris.docxA Survey of Object Oriented Programming LanguagesMaya Hris.docx
A Survey of Object Oriented Programming LanguagesMaya Hris.docx
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector Representations
 

More from TELKOMNIKA JOURNAL

Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...
TELKOMNIKA JOURNAL
 
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
TELKOMNIKA JOURNAL
 
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
TELKOMNIKA JOURNAL
 
Adaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imagingAdaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imaging
TELKOMNIKA JOURNAL
 
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
TELKOMNIKA JOURNAL
 

More from TELKOMNIKA JOURNAL (20)

Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...Amazon products reviews classification based on machine learning, deep learni...
Amazon products reviews classification based on machine learning, deep learni...
 
Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...Design, simulation, and analysis of microstrip patch antenna for wireless app...
Design, simulation, and analysis of microstrip patch antenna for wireless app...
 
Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...Design and simulation an optimal enhanced PI controller for congestion avoida...
Design and simulation an optimal enhanced PI controller for congestion avoida...
 
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...
 
Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...Conceptual model of internet banking adoption with perceived risk and trust f...
Conceptual model of internet banking adoption with perceived risk and trust f...
 
Efficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antennaEfficient combined fuzzy logic and LMS algorithm for smart antenna
Efficient combined fuzzy logic and LMS algorithm for smart antenna
 
Design and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fireDesign and implementation of a LoRa-based system for warning of forest fire
Design and implementation of a LoRa-based system for warning of forest fire
 
Wavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio networkWavelet-based sensing technique in cognitive radio network
Wavelet-based sensing technique in cognitive radio network
 
A novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bandsA novel compact dual-band bandstop filter with enhanced rejection bands
A novel compact dual-band bandstop filter with enhanced rejection bands
 
Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...Deep learning approach to DDoS attack with imbalanced data at the application...
Deep learning approach to DDoS attack with imbalanced data at the application...
 
Brief note on match and miss-match uncertainties
Brief note on match and miss-match uncertaintiesBrief note on match and miss-match uncertainties
Brief note on match and miss-match uncertainties
 
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
 
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G systemEvaluation of the weighted-overlap add model with massive MIMO in a 5G system
Evaluation of the weighted-overlap add model with massive MIMO in a 5G system
 
Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...Reflector antenna design in different frequencies using frequency selective s...
Reflector antenna design in different frequencies using frequency selective s...
 
Reagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensorReagentless iron detection in water based on unclad fiber optical sensor
Reagentless iron detection in water based on unclad fiber optical sensor
 
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...Impact of CuS counter electrode calcination temperature on quantum dot sensit...
Impact of CuS counter electrode calcination temperature on quantum dot sensit...
 
A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...A progressive learning for structural tolerance online sequential extreme lea...
A progressive learning for structural tolerance online sequential extreme lea...
 
Electroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networksElectroencephalography-based brain-computer interface using neural networks
Electroencephalography-based brain-computer interface using neural networks
 
Adaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imagingAdaptive segmentation algorithm based on level set model in medical imaging
Adaptive segmentation algorithm based on level set model in medical imaging
 
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...
 

Recently uploaded

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 

Recently uploaded (20)

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 

WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classification Based on Latent Dirichlet Allocation

  • 1. TELKOMNIKA, Vol.16, No.4, August 2018, pp. 1752~1759 ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018 DOI: 10.12928/TELKOMNIKA.v16i4.8194  1752 Received December 2, 2017; Revised January 24, 2018; Accepted July 3, 2018 WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classification Based on Latent Dirichlet Allocation Retno Kusumaningrum*, Satriyo Adhy, Suryono Department of Informatics, Universitas Diponegoro, Jl. Prof. Soedarto, SH Tembalang, Semarang, Central Java, Indonesia *Corresponding author, e-mail: retno@live.undip.ac.id Abstract Latent Dirichlet Allocation (LDA) is a widely implemented approach for extracting hidden topics in documents generated by soft clustering of a word based on document co-occurrence as a multinomial probability distribution over terms. Therefore, several visualizations have been developed, such as matrices design, text-based design, tree design, parallel coordinates, and force-directed graphs. Furthermore, based on a set of documents representing a class (category), we can implement classification task by comparing topic proportion for each class and topic proportion for the testing document by using Kullback-Leibler Divergence (KLD). Therefore, the purpose of this study is to develop a system for visualizing the output of LDA as a classification task. The visualization system consists of two parts: bar chart and dependent word cloud. The first visualization aims to show the trend of each category, while the second visualization aims to show the words that represent each selected category in a word cloud. This visualization is subsequently called WCloudViz. It provides clear, understandable and preferably shared the result. Keywords: Latent dirichlet allocation, Topic modeling, News articles classification, Data visualization, Word cloud Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved. 1. Introduction Latent Dirichlet Allocation, also known as LDA, is one of topic modeling algorithms. In LDA, each document is represented as a random mixture over a set of latent topics and each of the topic is represented as a distribution over a vocabulary [1]. Until this moment, LDA has been widely implemented in various domains, such as text document [2]–[8], image domain [9]–[14], music domain [15], [16], and etc. LDA can be implemented for classification task as well as clustering task. As clustering task, the output of LDA is a set of topics in which each topic is represented by a set of words and its probability . This LDA’s output is a corpus-level output. In addition, LDA also results an output at document-level, i.e. topic proportion for each document . Furthermore, when we have a set of documents representing a class (category), we can compute topic proportion for each class . By implementing Kullback-Leibler Divergence, we can find the most similar distribution between topic proportion for each class and topic proportion for testing document based on the smallest value of KLD [12]. In accordance with clustering tasks, there are several visualizations that have been developed. The first design is matrices design in which rows correspond to terms and columns to topics. A developed matrices-based system is Termite [17]. The second design is text-based design [18]. This visualization exploits the use of text to illustrate topic as well as the use of text to explain the represented words for each topic, related documents for each topic, and related topics for each topic. Based on the selected document, we can also find the document’s related topics, document’s text, and document’s related documents. The third design is tree design. This design is a popular design for visualizing LDA’s output such as relationship-enriched visualization [19], Diachronic visualization [20], LDAExplore [21], and etc. LDAExplore also implements parallel coordinates to show topic proportion for each document. The first vertical coordinate is coordinate for displaying
  • 2. TELKOMNIKA ISSN: 1693-6930  WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum) 1753 document’s ID, while second vertical coordinate and so on is for representing values of topic proportion (distribution) for each topic. A line is drawn to connect the document's ID and related topic proportions for each remaining vertical line. The last design is force-directed graphs, such as relationship-enriched visualization [19] and Diachronic visualization [20]. In addition, there are some visualizations that have been developed for broader topic (information retrieval) such as map view, tree view, and bubble view which are implemented in [22], graph-view [23], heat map, interactive stream graph, and context focus which are implemented in [24]. However, those visualizations are focused on visualizing word-topic distribution produced by LDA. Therefore, the aim of this study is to develop a system for visualizing output of LDA as classification task. We implement it against the classification of Indonesian news articles. This is shown to reduce the drawbacks of Indonesian online news presentations, i.e. repetition of news contents and news splitting into some headlines. It is known as yellow journalism which gives the negative impact for readers, that are, a shift of citizen characters arising from excessive news report such as the outspread of "Alay" culture, the diversion of political issues, common corruption cases, etc. This classification and visualization system is expected to be used to know the news trends and the words that represent them for each category. The system is subsequently called as WCloudViz. In section 2, we provide the research method related to the design consideration including task analysis, data and backend, as well as prototype design. The explanation about result and discussion will be presented in section 4. Finally, in section 5, we conclude this study with what we perceive to be the future work. 2. Research Method As explained in previous section, WCloudViz is aimed to visualize the output of LDA as classification task. Therefore, there are three tasks, which form WCloudViz. The first task is identifying news trend for each class (category), and visualizing it as bar chart is the second task. The last task is word cloud generation of word representing for each selected category. The outline of WCloudViz formation is depicted in Figure 1. The following sub-sub sections explain each task in detail. Figure 1. The outline of WCloudViz formation
  • 3.  ISSN: 1693-6930 TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759 1754 2.1. News Trend Identification Since we aim to visualize the output of LDA as classification task, the process of news trend identification will be conducted based on LDA classification model. The detail about how to generate the LDA-based classification model for news articles classification can be read in Ref. [25]. Moreover, the procedure of news trend identification described in Figure 2. Figure 2. Procedure of news trend identification based on LDA classification model Tokenization is a process to eliminate characters that are likely to be insignificant for information extraction, such as white space and punctuation marks. Afterwards, the second process is stopword removal, implemented for eliminating words that are likely to be insignificant for information extraction. In this research, there are 758 words in Indonesian that are considered as stopwords. Table 1 shows example of Indonesian stopwords and its English translation for each word. Table 1. Applications in Each Class Part of Speech Stopwords in Indonesia English Translation Pronoun saya, kamu, mereka, kita, ini, itu I, you, they, we, this, that Adverb sekarang, agak currently, some Preposition setelah, sebelum, kepada, dalam, atas, bawah after, before, to, in, on, under Conjunction dan, karena and, because Stemming, the last text processing, is a process to derive each word from their root form. This research implements Sastrawi-stemmer library for stemming process. Sastrawi- stemmer is incrementally built. As a base, it implements Nazief Nazief-Adriani algorithm. Subsequently, it is improved by Confix Stripping (CS) algorithm and Enhanced Confix Stripping (ECS) algorithm. The final improvement is a modification of Modified ECS. The selection of Sastrawi-stemmer library for stemming process is based on several issues, namely it overcomes overstemming problem by using basic words dictionary, it overcomes the understemming problem by adding rules, as well as it is able to stem compound words correctly.
  • 4. TELKOMNIKA ISSN: 1693-6930  WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum) 1755 On the other hand, we already have the classification models including PWZ (word- topic distribution) values and PZ (topic distribution) values. Both values are in corpus level. In addition, we also have a set of words that represent each topic. Based on the model, we compute the PZD (topic proportion for each document) for each unseen news article by using the following formula.     N i i tkkd kzp 1 )( ,, *  (1) Where kd, is equal to PZD, the probability of k -th topic in d -th document,  kzp  is the probabilityof topic z is equal to k , and is the sum of i -th term in the corresponding document equal to term t and assigned as topic k , while N is the number of term in the corresponding document. Afterwards, we can compute the similarity distribution between pre-defined PZC (topic proportion for each class) and pre-computed PZD (resulted based on Equation 1) based on Kullback-Leibler Divergence (KLD). Let  ipP  be the topic proportion of the unseen document and  iqQ  be the topic proportion of each class; then the KLD can be computed by using the following formula. 2 DQPDPQ KLD   (2)         i i i i q p pDPQ log (3)         i i i i p q qDQP log (4) Finally, the most similar distribution between topic proportion for each class and topic proportion for unseen document is obtained based on the smallest value of KLD. 2.2. Bar Chart Visualization Bar chart is visualized based on the number of documents for each class (category) over the whole submitted new articles. Figure 3 shows the detailed procedure for generating bar chart based on output of the previous task. Figure 3. Procedure of bar chart visualization
  • 5.  ISSN: 1693-6930 TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759 1756 According to the Figure 3, the input of this process is a collection of news articles that have been classified from previous step into six categories (class) including economic, tourism, criminal, sport, and politic. The next process is computing a number of articles for each category. Subsequently, plot the bar chart with x-axis represents six news categories and y-axis represents number of articles for each category. 2.3. Word Cloud Visualization Based on the selected class (category), we can generate word cloud of represented words based on the following procedure as shown in Figure 4. We use the initial value of the biggest font size that is 150, and the number of words that are 20. Both values can be changed in accordance with the requirements. The first step of word cloud visualization is counting frequency of word occurrence in all documents of selected category. Get maximum frequency and get top-20 most occurred words. Subsequently, compute the font size of all obtained words based on the following formula. We use default value of MaxFont is 150 pt.        MaxFont cyMaxFrequen Frequency FontSize (5) As an exception and to provide the convenience of the reader, then the default value of the smallest FontSize is set at 14 pt despite FontSize calculation results based on the formula (5) gives the results of less than 14 pt. Figure 4. Procedure of word cloud visualization
  • 6. TELKOMNIKA ISSN: 1693-6930  WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum) 1757 3. Results and Analysis In this section, it is explained the results of research and its comprehensive discussion about interaction and visualization. For the demonstrator of WCloudViz, we employ the news articles data in Indonesian. These articles are obtained from local newspaper, i.e. Jawa Pos - Radar Semarang. Articles are saved in .txt format for each headline. 3.1. Prototype Design As explained in previous section, there are three tasks which form WCloudViz, consisting of news trend identification, bar chart visualization, and word cloud visualization. Two first tasks are represented in one interface as shown in Figure 5. The third task is represented in another interface, which is loaded after the user clicks one of the bars in the chart. The interface can be seen in Figure 6. Figure 5. Bar chart visualization with embedded news trend identification Figure 6. Word cloud visualization of selected category 3.2. Interaction and Visualization Users can gain insight into the news categories that have become the reporting focus through the bar chart visualization, which means that the news category with most report has the highest bar chart and vice-versa. In addition, users can also obtain information on the topic of each news category. Each topic is represented as a word cloud, i.e. word-occurrence in a set of documents. Word cloud will be displayed when the user has selected one of news categories. Hereafter, it can be referred as the dependent word cloud. For example, Figure 6 shows the economic category and it is represented by 10 words having high frequency, such as “harga” (price), “cabai” (chili), “Rp” (Indonesian currency), “turun” (decrease), “ribu” (thousand), “BBM” (fuel), “solar” (diesel), “nelayan” (fisherman), “dagang” (trade), and “emas” (gold). Furthermore, users can also draw conclusion based on the
  • 7.  ISSN: 1693-6930 TELKOMNIKA Vol. 16, No. 4, August 2018: 1752-1759 1758 word cloud, i.e. the current economic news headlines are widely reporting about the news related to the increasing chili prices up to hundreds of thousands rupiahs, the declining rupiahs, and the price of fuel, diesel fuel, as well as gold. In accordance with the previous explanations, there are some points that can be accomplished by word cloud. First, word cloud provides clarity in identifying trends and pattern that will otherwise be unclear or difficult to see in a tabular format. Second, word cloud is easier to understand since it is more visually appealing than the textual format of data. The bigger the words represent, the more widely-reported the headlines are. The last point, word cloud is more likely to be shared since it presents pictorial representation of the data. Unfortunately, the resulting word cloud has not been depicted in accordance with the related topic or theme, thus it is less impactful. 4. Conclusion WCloudViz is a visualization system based on LDA. It consists of two parts: bar chart and dependent word cloud. The first visualization aims to show the trend of each category, while the second visualization aims to show the words that represent each selected category in word cloud. This visualization provides clear, understandable, and preferably shared result. The strategic future work shall form each word cloud in accordance with related topic or theme, thus it can be more impactful and insightful. Acknowledgement The authors would like to acknowledge the research funding supported by Universitas Diponegoro under the grant of Research for International Scientific Publication (number 1052- 36/UN7.5.1/PG/2016, year 2016) References [1] DM Blei, AY Ng, MI Jordan. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003; 3: 993–1022. [2] DM Blei. Probabilistic Models of Text and Images. University of Carolina, Berkeley, 2004. [3] V Ravi. Legal Documents Clustering using Latent Dirichlet Allocation. International Journal of Applied Information Systems. 2012; 2(6): 34–37. [4] I Bíró, J Szabó, AA Benczúr. Latent dirichlet allocation in web spam filtering. Proc. 4th Int. Work. Advers. Inf. Retr. web, 2008: 29–32. [5] W Sriurai. Improving Text Categorization by using a Topic Model. Adv. Comput., 2011; 2(6): 21–27. [6] I Bíró. Document Classification with Latent Dirichlet Allocation. Unpubl. Dr. Diss. Eotvos Lorand …, 2009. [7] L Bolelli, Ş Ertekin, CL Giles. Topic and trend detection in text collections using latent dirichlet allocation. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 2009; 5478 LNCS: 776–780. [8] XH Phan, LM Nguyen, S Horiguchi. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. Proceeding 17th Int. Conf. World Wide Web - WWW ’08, no. January 2008: 91–100. [9] K Barnard, P Duygulu, D Forsyth, N de Freitas, DM Blei, MI Jordan. Matching Words and Pictures. J. Mach. Learn. Res., 2003; 3: 1107–1135. [10] C Wang, DM Blei, L Fei-fei. Simultaneous image classification and annotation. IEEE Conference on Computer Vision and Pattern Recognition. 2009:1903–1910. [11] D Bratasanu, I Nedelcu, M Datcu. Bridging the Semantic Gap for Satellite Image Annotation and Automatic Mapping Applications. IEEE J. Sel. Top. Earth Obs. Remote Sens. 2011; 4(1): 193-120. [12] R Kusumaningrum, H Wei, R Manurung, A Murni. Integrated visual vocabulary in latent Dirichlet allocation–based scene classification for IKONOS image. J. Appl. Remote Sens. Jan. 2014; 8(1): 83690-1-18. [13] R Kusumaningrum, HM Manurung, AM Arymurthy. CIELab Color Moments: Alternative Descriptors for LANDSAT Images Classification System. INKOM. 2014; 8(2): 109–114. [14] M Liénou, H Maître, M Datcu. Using Latent Dirichlet Allocation. IEEE Geosci. Remote Sens. Lett. 2010; 7(1): 28–32. [15] D Hu. Latent Dirichlet Allocation, Text, Images, & Music.” [16] R Cai, C Zhang, C Wang, L Zhang, W Ma. MusicSense: Contextual Music Recommendation using Emotional Allocation Modeling. Emotion. 2007: 553–556.
  • 8. TELKOMNIKA ISSN: 1693-6930  WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles... (Retno Kusumaningrum) 1759 [17] J Chuang, CD Manning, J Heer. Termite: Visualization Techniques for Assessing Textual Topic Models. in Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI), 2012: 74–77. [18] AJB Chaney, DM Blei. Visualizing Topic Models. in Proceedings of the Sixth International Conference on Weblogs and Social Media. 2003. [19] A Smith, J Chuang, Y Hu, J Boyd-graber, L Findlater. Concurrent Visualization of Relationships between Words and Topics in Topic Models. in Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 2014: 79–82. [20] J Lamirel, N Dugu, P Cuxac. Performing and Visualizing Temporal Analysis of Large Text Data Issued for Open Sources: Past and Future. in Proceedings of the 12th International Conference on Beyond Databases, Architecture and Structures. 2016: 1340846. [21] A Ganesan, K Brantley, S Pan J Chen. LDAExplore: Visualizing Topic Models Generated Using Latent Dirichlet Allocation. CoRR, 2015. [22] MH Mamoon, HM El-Bakry, AA Salama. Visualization for Information Retrieval based on Fast Search Technology. Indonesian Journal of Electrical Engineering and Informatics (IJEEI). December 2013; 1(4): 140-156. [23] F Nadirman, A Ridha, Annisa. Searching and Visualization of References in Research Documents. TELKOMNIKA (Telecommunication Computing Electronics and Control). June 2014; 12(2): 447-454. [24] MNP Ma'ady, CK Yang, RP Kusumawardani, H Suryotrisongko. Temporal Exploration in 2D Visualization of Emotions on Twitter Stream. TELKOMNIKA (Telecommunication Computing Electronics and Control). February 2018; 16(1): 376-384. [25] R Kusumaningrum, S Adhy, MIA Wiedjayanto, Suryono. Classification of Indonesian News Articles based on Latent Dirichlet Allocation. in Proceedings of the 3rd International Conference on Software and Data Engineering, 2016.