This document summarizes a study on improving news recommendation in Nepal by classifying Nepali news articles with LSTM recurrent neural networks. The study collected over 72,000 news articles from five popular Nepali news websites, covering eight categories. The articles were preprocessed and features were extracted using TF-IDF and word2vec. An LSTM model was developed and tested on each website individually, achieving accuracies ranging from 77% to 89%. The overall LSTM model achieved 84.63% accuracy, outperforming an SVM baseline. Sports articles had the highest classification accuracy. Increasing the training data and using hybrid models were proposed as future work.
Improving Nepali News Recommendation Using Classification Based on LSTM Recurrent Neural Networks - ICCCS 2018 Paper
1. Improving Nepali News
Recommendation using
Classification based on LSTM
Recurrent Neural Networks
Ashok Basnet
Pokhara University
Arun K. Timalsina
Institute of Engineering
2. Topics covered
1. Background
2. Problem Statement
3. Literature Review
4. Research Objectives
5. Research Methodology
6. Experiment and Analysis
7. Conclusion
8. Future Works
9. References
Conf Record # 76 2
3. Background
● Online news content of Nepal
● More than 1,000 news portals producing news in more than 20 categories
● More than 100,000 news articles produced by popular news portals such as onlinekhabar.com, ratopati.com, dcnepal.com, etc.
● Deep learning mechanisms producing excellent results in document classification
4. Problem Statement
● More than 500 news articles are produced daily by popular news sites, so users cannot go through all the articles and may miss news in their categories of interest
● Many news portals recommend content manually, based only on the particular article the user is currently reading
5. Literature Review
● Ranjan et al. [1] demonstrated the outstanding performance of Long Short-Term Memory (LSTM) networks in document classification compared to other machine learning algorithms, achieving up to 93% accuracy.
● Kafle et al. [2] successfully applied neural-network-based word2vec embeddings to improve the vector representation of text. The word2vec model outperformed the TF-IDF method by 1.6 percentage points, with classification carried out by a Support Vector Machine.
6. Research Objectives
1. To improve the classification of Nepali news documents using an LSTM recurrent neural network
2. To compare different news portals and their classification accuracies
7. Research Methodology
1. Data Gathering
2. Data Preprocessing
3. Feature Extraction
4. Model Building
5. Model Validation
9. Technology Used
● Python
● Scrapy for news scraping
● gensim for word2vec embeddings
● Keras library for Python
● scikit-learn library for Python
10. System Configuration
● Model Name: MacBook Pro
● Processor Name: Intel Core i7
● Processor Speed: 2.8 GHz
● Number of Processors: 1
● Total Number of Cores: 2
● L2 Cache (per Core): 256 KB
● L3 Cache: 4 MB
● Memory: 8 GB
11. Data Gathering
S.No. Site Name Site URL No. of articles
1 DC Nepal dcnepal.com 12,872
2 Image Khabar imagekhabar.com 19,296
3 Online Khabar onlinekhabar.com 20,504
4 Ratopati ratopati.com 7,855
5 Ujyaalo Online ujyaaloonline.com 11,476
TOTAL 72,003
*The data were collected from 2015 to April 2018
12. Data Gathering - Categories
S.No. Category Name No. of Articles
1 Diaspora 6,224
2 Economy 14,067
3 Entertainment 7,588
4 Health 3,122
5 International 9,879
6 Opinion 2,675
7 Politics 8,352
8 Sports 13,018
TOTAL 64,925
13. Data Preprocessing
1. Removal of HTML tags from the news documents
2. Removal of white space and special symbols
3. Stop-word removal
14. Feature Extraction
Using the TF-IDF Vectorizer
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
Using the Word2Vec Vectorizer
Uses neural-network-based word embeddings to represent words and sentences as dense vectors.
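The TF and IDF formulas above can be transcribed directly into Python as a minimal sketch; the actual experiments used scikit-learn's TfidfVectorizer, which additionally handles smoothing and vocabulary building.

```python
import math
from collections import Counter

def tf(term, doc):
    # TF(t) = (count of t in the document) / (total terms in the document)
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # IDF(t) = log_e(total documents / documents containing t)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # The TF-IDF weight is simply the product of the two.
    return tf(term, doc) * idf(term, docs)
```

A term that appears in every document gets IDF of zero, so common words contribute nothing, which is exactly why stop-word-like terms are down-weighted.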
16. Model Development
● Recurrent neural network
● Word2vec models with 100, 200, and 300 features
● LSTM model
● 50 hidden units in the first LSTM layer
● 25 hidden units in the second LSTM layer
● 50 units in the dense layer
● 5 or 10 epochs
● Dropout of 0.2
● Softmax activation at the output
● Categorical cross-entropy as the loss function
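A sketch of how these bullets might translate into a Keras model. The layer order, optimizer, and sequence length are assumptions not stated on the slide, and the Keras import is deferred into the function so the sketch reads without TensorFlow installed.

```python
# Hyperparameters as listed on the slide
CONFIG = {
    "embedding_dim": 300,   # word2vec features: 100 / 200 / 300
    "lstm_units_1": 50,
    "lstm_units_2": 25,
    "dense_units": 50,
    "dropout": 0.2,
    "epochs": 10,           # 5 or 10
    "num_classes": 8,
}

def build_model(seq_len=100, cfg=CONFIG):
    """Assumed two-layer LSTM classifier over word2vec sequences."""
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout

    model = Sequential([
        LSTM(cfg["lstm_units_1"], return_sequences=True,
             input_shape=(seq_len, cfg["embedding_dim"])),
        Dropout(cfg["dropout"]),
        LSTM(cfg["lstm_units_2"]),
        Dense(cfg["dense_units"], activation="relu"),
        Dense(cfg["num_classes"], activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model
```

The softmax output with categorical cross-entropy matches the eight-way category prediction described in the paper.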
33. Conclusion
1. LSTM with 300 word2vec features produced a good improvement over existing SVM models, with 84.63% accuracy and 89% precision
2. Onlinekhabar.com had the highest accuracy and precision among the five websites compared in the experiments, with 89.26% accuracy and 91% precision
3. The sports category had the highest classification accuracy among the eight categories, with 93% precision
34. Future Works
1. Taking more websites into consideration and increasing the number of categories will help classify more news accurately
2. Use an effective stemming method to reduce the classification error rate
3. Use a hybrid approach with techniques such as CNN
35. References
• N. M. Ranjan, Y. R. Ghorpade, G. R. Kanthale, A. R. Ghorpade and A. S. Dubey, “Document Classification using LSTM Neural Network”, Journal of Data Mining and Management, Volume 2, Issue 2, 2017
• K. Kafle, D. Sharma, A. Subedi and A. K. Timalsina, “Improving Nepali Document Classification by Neural Network”, IOE Graduate Conference, 2016, pp. 317–322
• S. Kaur and N. K. Khiva, “Online News Classification using Deep Learning Technique”, International Research Journal of Engineering and Technology (IRJET), Volume 03, Issue 10, 2016
• C. Zhou, C. Sun, Z. Liu and F. Lau, “A C-LSTM Neural Network for Text Classification”, CoRR abs/1511.08630, 2015
• F. Al Zaghoul and S. Al Dhaheri, “Arabic Text Classification Based on Features Reduction Using Artificial Neural Networks”, UKSim 15th International Conference on Computer Modelling and Simulation, 2013
• C. Chan, A. Sun and E. Lim, “Automated Online News Classification with Personalization”, Proceedings of the 4th International Conference of Asian Digital Library (ICADL 2001), pp. 320–329, Bangalore, India, December 2001
Good afternoon, everyone. My name is Ashok Basnet, and I am going to present the research paper “Improving Nepali News Recommendation using Classification based on LSTM Recurrent Neural Networks”. This paper is co-authored by Arun Kumar Timalsina from the Institute of Engineering.
I will be covering the following topics in my presentation.
There has been rapid growth in news creation and consumption in Nepal in the past few years.
More than 1,000 news portals are available in Nepal, spreading news across more than 20 categories.
Together they have produced more than 100,000 (1 lakh) news articles since their establishment.
Recent advances in natural language processing and deep learning are producing excellent results in document classification.
So, due to the rapid growth of news content, there is a large volume of data available on the internet.
Users find it difficult to get news in the categories they are interested in.
News websites use a manual way of recommending news, which can be automated with machine-learning-based categorization.
Ranjan et al. used recurrent neural networks, which have been deployed in English news classification with accuracies above 93%.
There has also been significant work with Support Vector Machines using word2vec by Kafle et al., published recently at the IOE Graduate Conference.
Beyond this, there has been work on sentiment analysis of Nepali movie reviews using different machine learning algorithms.
The objective of my research was to improve the classification of Nepali news documents using a deep learning technique, the recurrent neural network,
and to compare five popular Nepali news sites on their classification accuracies.
The research was carried out with the following methodology.
The data were collected from various sources and cleaned of HTML tags and extra spaces; the data were then selected and stop words were removed from the sentences.
Feature extraction was done to obtain vectors from the words.
A 70%–30% train–test split was used, with the training set further divided into validation sets.
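This split can be sketched in plain Python; scikit-learn's train_test_split does the same job, and the 10% validation fraction below is an assumption, since the transcript only states that the training set was subdivided.

```python
import random

def split_dataset(examples, test_frac=0.30, val_frac=0.10, seed=42):
    """Shuffle, hold out 30% for testing, then carve a validation
    set out of the remaining training portion."""
    rng = random.Random(seed)   # fixed seed for reproducible splits
    data = list(examples)
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    test, rest = data[:n_test], data[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

With roughly 65,000 labeled articles, this split would leave about 19,500 articles for final testing.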
The input was fed to the LSTM network, and the model was evaluated across different measures.
Finally, the model was evaluated on the test set, and the category predictions were made.
The whole experiment was carried out in the Python programming language.
I used Scrapy for the news scraping,
word2vec for embeddings of the Nepali words,
the Keras library for implementing the LSTM model,
scikit-learn for implementing TF-IDF, the SVM, and the classification metrics,
and other supporting libraries.
I ran all the experiments on the system configuration shown: an Intel Core i7 processor with 8 GB of RAM.
I started by collecting data from five websites that are popular in Nepal, gathering around 72,000 articles in total.
The data were collected from 2015 to April 2018.
The news was grouped into eight categories, and after cleaning, around 65,000 articles remained, with a fair share in each category.
The preprocessing step was applied to the collected data set: HTML tags were removed, and white space and special symbols were also cleaned.
Stop-word removal used the standard list from NLTK (the Natural Language Toolkit).
The next step was feature extraction; I used both TF-IDF and word2vec for vectorization of the words.
The TF-IDF formula is as given earlier, and word2vec came from the gensim library.
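With gensim's Word2Vec, each document is typically reduced to a single vector by averaging its word vectors before classification. The averaging step is an assumption about the pipeline, and the toy three-dimensional embeddings below are illustrative only; the paper trained real embeddings on the Nepali corpus.

```python
def document_vector(tokens, embeddings, dim):
    """Average the word2vec vectors of the known words in a document."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim   # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy embeddings keyed by Nepali tokens (hypothetical values).
toy = {"खेल": [1.0, 0.0, 2.0], "जित": [3.0, 2.0, 0.0]}
```

In the experiments, dim would be 100, 200, or 300, matching the word2vec feature sizes compared in the model analysis.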
Principal Component Analysis (PCA) was used to visualize the data set. The plot shows the distribution of the data across the categories, separated by color coding.
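The PCA projection behind such a plot can be sketched with NumPy's eigendecomposition; in practice scikit-learn's PCA would be the convenient choice, and which tool the author used is not stated.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components] # columns for the largest eigenvalues
    return Xc @ top
```

Projecting the high-dimensional document vectors to two components lets each article be drawn as a colored point per category.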
So, the model I used was the LSTM recurrent neural network, which classifies the Nepali documents into the appropriate categories. For this, I used a word2vec model to convert words into embedding vectors that are fed to the LSTM. At midterm, I experimented with 50 hidden LSTM units, 5 epochs, and a dropout of 0.2.
Let’s go through the experiments I conducted and their results.
In this section, I will present the LSTM experiments for the five websites and their performance measures.
The overall data set was also evaluated, and a comparison by category was done.
I ran the same data set through an SVM and compared the results with the LSTM.
Now let’s proceed to the results section.
Here, the dcnepal.com data was run through the LSTM model we built, with the following output:
it classified sports with 91% precision, with an average precision of 85%.
Similarly, the model was run on the imagekhabar.com data, with the measures shown in the table.
For onlinekhabar, it predicted economy news very well, and sports and diaspora were also classified accurately.
Similarly, for ratopati.com, the health category had the highest precision,
and for ujyaaloonline.com, the sports category got good precision.
I merged the data from all these websites and evaluated the model, obtaining around 89% precision overall. Sports and international news were categorized with over 90% precision, while health was the lowest.
This plot of model accuracy versus epoch and model loss versus epoch shows good learning behavior on the training and validation sets.
The five websites were compared across measures such as accuracy, precision, recall, and F1-score.
I found onlinekhabar to have the highest accuracy and precision, and ratopati the lowest among them.
Here is the comparison graph for the same result.
The eight categories were also compared across precision, recall, and F1-score.
It shows that the sports category was predicted most accurately of all.
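For one category, the precision, recall, and F1 used in these comparisons reduce to the following minimal sketch, treating that category as the positive class:

```python
def precision_recall_f1(y_true, y_pred, category):
    """Compute precision, recall, and F1 for one news category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == category and t == category)
    fp = sum(1 for t, p in pairs if p == category and t != category)
    fn = sum(1 for t, p in pairs if p != category and t == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In the experiments these per-category scores came from scikit-learn's classification report, averaged across categories for the overall figures.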
The LSTM model was further analyzed for different hyperparameters, such as the number of epochs and the number of features.
We can see that as the number of features increases, the accuracy and the other metrics also increase, and the same holds for the number of epochs.
The best result was found with 300 features and 10 epochs of the LSTM model.
This is shown graphically for 5 epochs,
and for 10 epochs.
Here I present the model with different metrics. I ran similar experiments with an SVM model in order to compare it with the LSTM model I implemented; the details are in the report.
LSTM does well on large data sets, as expected.
Here is the comparison graph.
The collection of data was limited to eight categories and five websites, so the data available for the deep learning model was below 100,000 (1 lakh) articles. Taking more websites into consideration and increasing the number of categories would help classify more news. The thesis uses rule-based stemming in a basic form, owing to the unavailability of a standard stemming library for the Nepali language; effective stemming could improve the overall classification accuracy. The thesis is also limited to LSTM; other techniques such as CNN could be used along with LSTM to achieve higher accuracy and precision in Nepali news categorization.