This document summarizes a study on improving news recommendation in Nepal by classifying Nepali news articles with LSTM recurrent neural networks. The study collected over 72,000 news articles from five popular Nepali news websites, covering eight categories. The articles were preprocessed and features were extracted using TF-IDF and word2vec. An LSTM model was developed and tested on each website individually, achieving accuracies ranging from 77% to 89%. The overall LSTM model achieved 84.63% accuracy, outperforming an SVM baseline. Sports articles had the highest classification accuracy. Increasing the training data and using hybrid models were proposed as future work.
Improving Nepali News Recommendation Using Classification Based on LSTM Recurrent Neural Networks - ICCCS 2018 Paper
1. Improving Nepali News
Recommendation using
Classification based on LSTM
Recurrent Neural Networks
Ashok Basnet
Pokhara University
Arun K. Timalsina
Institute of Engineering
2. Topics covered
1. Background
2. Problem Statement
3. Literature Review
4. Research Objectives
5. Research Methodology
6. Experiment and Analysis
7. Conclusion
8. Future Works
9. References
Conf Record # 76 2
3. Background
● Online news content of Nepal
● More than 1,000 news portals producing news in more than 20 categories
● More than 100,000 news articles produced by popular news portals such as onlinekhabar.com, ratopati.com, dcnepal.com, etc.
● Deep learning mechanisms producing excellent results in document classification
4. Problem Statement
● More than 500 news articles are produced daily by popular news sites, so users cannot go through all the articles and may miss news in their categories of interest
● Many news portals recommend content manually, based only on the particular article the user is currently reading
5. Literature Review
● Ranjan et al. [1] demonstrated the outstanding performance of Long Short-Term Memory (LSTM) networks in document classification compared to other machine learning algorithms, achieving up to 93% accuracy.
● Kafle et al. [2] successfully applied neural-network-based word2vec embeddings to improve the vector representation of text. The word2vec model outperformed the TF-IDF method by 1.6 percentage points, with classification carried out by a Support Vector Machine.
6. Research Objectives
1. To improve the classification of Nepali news documents using an LSTM recurrent neural network
2. To compare different news portals and their classification accuracies
7. Research Methodology
1. Data Gathering
2. Data Preprocessing
3. Feature Extraction
4. Model Building
5. Model Validation
9. Technology Used
● Python
● Scrapy for news scraping
● gensim for word2vec embeddings
● Keras library for Python
● scikit-learn library for Python
10. System Configuration
● Model Name: MacBook Pro
● Processor Name: Intel Core i7
● Processor Speed: 2.8 GHz
● Number of Processors: 1
● Total Number of Cores: 2
● L2 Cache (per Core): 256 KB
● L3 Cache: 4 MB
● Memory: 8 GB
11. Data Gathering
S.No. Site Name Site URL No. of articles
1 DC Nepal dcnepal.com 12,872
2 Image Khabar imagekhabar.com 19,296
3 Online Khabar onlinekhabar.com 20,504
4 Ratopati ratopati.com 7,855
5 Ujyaalo Online ujyaaloonline.com 11,476
TOTAL 72,003
*The data were collected from 2015 to April 2018
12. Data Gathering - Categories
S.No. Category Name No. of Articles
1 Diaspora 6,224
2 Economy 14,067
3 Entertainment 7,588
4 Health 3,122
5 International 9,879
6 Opinion 2,675
7 Politics 8,352
8 Sports 13,018
TOTAL 64,925
13. Data Preprocessing
1. Removal of HTML tags from the news documents
2. Removal of white space and special symbols
3. Stop-word removal
14. Feature Extraction
Using the TF-IDF Vectorizer
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
Using the Word2Vec Vectorizer
Uses neural-network-based word embeddings to represent words and sentences as dense vectors.
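The TF and IDF formulas above can be transcribed directly into Python as a minimal sketch; the actual experiments used scikit-learn's TfidfVectorizer, which additionally handles smoothing and vocabulary building.

```python
import math
from collections import Counter

def tf(term, doc):
    # TF(t) = (count of t in the document) / (total terms in the document)
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # IDF(t) = log_e(total documents / documents containing t)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # The TF-IDF weight is simply the product of the two.
    return tf(term, doc) * idf(term, docs)
```

A term that appears in every document gets IDF of zero, so common words contribute nothing, which is exactly why stop-word-like terms are down-weighted.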
16. Model Development
● Recurrent neural network
● Word2vec models with 100, 200, and 300 features
● LSTM model
● 50 hidden units in the first LSTM layer
● 25 hidden units in the second LSTM layer
● 50 units in the dense layer
● 5 or 10 epochs
● Dropout of 0.2
● Softmax activation at the output
● Categorical cross-entropy as the loss function
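A sketch of how these bullets might translate into a Keras model. The layer order, optimizer, and sequence length are assumptions not stated on the slide, and the Keras import is deferred into the function so the sketch reads without TensorFlow installed.

```python
# Hyperparameters as listed on the slide
CONFIG = {
    "embedding_dim": 300,   # word2vec features: 100 / 200 / 300
    "lstm_units_1": 50,
    "lstm_units_2": 25,
    "dense_units": 50,
    "dropout": 0.2,
    "epochs": 10,           # 5 or 10
    "num_classes": 8,
}

def build_model(seq_len=100, cfg=CONFIG):
    """Assumed two-layer LSTM classifier over word2vec sequences."""
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout

    model = Sequential([
        LSTM(cfg["lstm_units_1"], return_sequences=True,
             input_shape=(seq_len, cfg["embedding_dim"])),
        Dropout(cfg["dropout"]),
        LSTM(cfg["lstm_units_2"]),
        Dense(cfg["dense_units"], activation="relu"),
        Dense(cfg["num_classes"], activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model
```

The softmax output with categorical cross-entropy matches the eight-way category prediction described in the paper.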
33. Conclusion
1. LSTM with 300 word2vec features produced a good improvement over existing SVM models, with 84.63% accuracy and 89% precision
2. Onlinekhabar.com had the highest accuracy and precision among the five websites compared in the experiments, with 89.26% accuracy and 91% precision
3. The sports category had the highest classification accuracy among the eight categories, with 93% precision
34. Future Works
1. Taking more websites into consideration and increasing the number of categories will help classify more news accurately
2. Use an effective stemming method to reduce the classification error rate
3. Use a hybrid approach with techniques such as CNN
35. References
• N. M. Ranjan, Y. R. Ghorpade, G. R. Kanthale, A. R. Ghorpade and A. S. Dubey, “Document Classification using LSTM Neural Network”, Journal of Data Mining and Management, Volume 2, Issue 2, 2017
• K. Kafle, D. Sharma, A. Subedi and A. K. Timalsina, “Improving Nepali Document Classification by Neural Network”, IOE Graduate Conference, 2016, pp. 317–322
• S. Kaur and N. K. Khiva, “Online News Classification using Deep Learning Technique”, International Research Journal of Engineering and Technology (IRJET), Volume 03, Issue 10, 2016
• C. Zhou, C. Sun, Z. Liu and F. Lau, “A C-LSTM Neural Network for Text Classification”, CoRR abs/1511.08630, 2015
• F. Al Zaghoul and S. Al Dhaheri, “Arabic Text Classification Based on Features Reduction Using Artificial Neural Networks”, UKSim 15th International Conference on Computer Modelling and Simulation, 2013
• C. Chan, A. Sun and E. Lim, “Automated Online News Classification with Personalization”, Proceedings of the 4th International Conference of Asian Digital Library (ICADL 2001), pp. 320–329, Bangalore, India, December 2001
Good afternoon, everyone. My name is Ashok Basnet, and I am going to present the research paper “Improving Nepali News Recommendation using Classification based on LSTM Recurrent Neural Networks”. This paper is co-authored by Arun Kumar Timalsina from the Institute of Engineering.
I will be covering the following topics in my presentation.
There has been rapid growth in news creation and consumption in Nepal in the past few years.
More than 1,000 news portals are available in Nepal, spreading news across more than 20 categories.
Together they have produced more than 100,000 (1 lakh) news articles since their establishment.
Recent advances in natural language processing and deep learning are producing excellent results in document classification.
So, due to the rapid growth of news content, there is a large volume of data available on the internet.
Users find it difficult to get news in the categories they are interested in.
News websites use a manual way of recommending news, which can be automated with machine-learning-based categorization.
Ranjan et al. used recurrent neural networks, which have been deployed in English news classification with accuracies above 93%.
There has also been significant work with Support Vector Machines using word2vec by Kafle et al., published recently at the IOE Graduate Conference.
Beyond this, there has been work on sentiment analysis of Nepali movie reviews using different machine learning algorithms.
The objective of my research was to improve the classification of Nepali news documents using a deep learning technique, the recurrent neural network,
and to compare five popular Nepali news sites on their classification accuracies.
The research was carried out with the following methodology.
The data were collected from various sources and cleaned of HTML tags and extra spaces; the data were then selected and stop words were removed from the sentences.
Feature extraction was done to obtain vectors from the words.
A 70%–30% train–test split was used, with the training set further divided into validation sets.
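This split can be sketched in plain Python; scikit-learn's train_test_split does the same job, and the 10% validation fraction below is an assumption, since the transcript only states that the training set was subdivided.

```python
import random

def split_dataset(examples, test_frac=0.30, val_frac=0.10, seed=42):
    """Shuffle, hold out 30% for testing, then carve a validation
    set out of the remaining training portion."""
    rng = random.Random(seed)   # fixed seed for reproducible splits
    data = list(examples)
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    test, rest = data[:n_test], data[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

With roughly 65,000 labeled articles, this split would leave about 19,500 articles for final testing.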
The input was fed to the LSTM network, and the model was evaluated across different measures.
Finally, the model was evaluated on the test set, and the category predictions were made.
The whole experiment was carried out in the Python programming language.
I used Scrapy for the news scraping,
word2vec for embeddings of the Nepali words,
the Keras library for implementing the LSTM model,
scikit-learn for implementing TF-IDF, the SVM, and the classification metrics,
and other supporting libraries.
I ran all the experiments on the system configuration shown: an Intel Core i7 processor with 8 GB of RAM.
I started by collecting data from five websites that are popular in Nepal, gathering around 72,000 articles in total.
The data were collected from 2015 to April 2018.
The news was grouped into eight categories, and after cleaning, around 65,000 articles remained, with a fair share in each category.
The preprocessing step was applied to the collected data set: HTML tags were removed, and white space and special symbols were also cleaned.
Stop-word removal used the standard list from NLTK (the Natural Language Toolkit).
The next step was feature extraction; I used both TF-IDF and word2vec for vectorization of the words.
The TF-IDF formula is as given earlier, and word2vec came from the gensim library.
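With gensim's Word2Vec, each document is typically reduced to a single vector by averaging its word vectors before classification. The averaging step is an assumption about the pipeline, and the toy three-dimensional embeddings below are illustrative only; the paper trained real embeddings on the Nepali corpus.

```python
def document_vector(tokens, embeddings, dim):
    """Average the word2vec vectors of the known words in a document."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim   # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy embeddings keyed by Nepali tokens (hypothetical values).
toy = {"खेल": [1.0, 0.0, 2.0], "जित": [3.0, 2.0, 0.0]}
```

In the experiments, dim would be 100, 200, or 300, matching the word2vec feature sizes compared in the model analysis.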
Principal Component Analysis (PCA) was used to visualize the data set. The plot shows the distribution of the data across the categories, separated by color coding.
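The PCA projection behind such a plot can be sketched with NumPy's eigendecomposition; in practice scikit-learn's PCA would be the convenient choice, and which tool the author used is not stated.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components] # columns for the largest eigenvalues
    return Xc @ top
```

Projecting the high-dimensional document vectors to two components lets each article be drawn as a colored point per category.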
So, the model I used was the LSTM recurrent neural network, which classifies the Nepali documents into the appropriate categories. For this, I used a word2vec model to convert words into embedding vectors that are fed to the LSTM. At midterm, I experimented with 50 hidden LSTM units, 5 epochs, and a dropout of 0.2.
Let’s go through the experiments I conducted and their results.
In this section, I will present the LSTM experiments for the five websites and their performance measures.
The overall data set was also evaluated, and a comparison by category was done.
I ran the same data set through an SVM and compared the results with the LSTM.
Now let’s proceed to the results section.
Here, the dcnepal.com data was run through the LSTM model we built, with the following output:
it classified sports with 91% precision, with an average precision of 85%.
Similarly, the model was run on the imagekhabar.com data, with the measures shown in the table.
For onlinekhabar, it predicted economy news very well, and sports and diaspora were also classified accurately.
Similarly, for ratopati.com, the health category had the highest precision,
and for ujyaaloonline.com, the sports category got good precision.
I merged the data from all these websites and evaluated the model, obtaining around 89% precision overall. Sports and international news were categorized with over 90% precision, while health was the lowest.
This plot of model accuracy versus epoch and model loss versus epoch shows good learning behavior on the training and validation sets.
The five websites were compared across measures such as accuracy, precision, recall, and F1-score.
I found onlinekhabar to have the highest accuracy and precision, and ratopati the lowest among them.
Here is the comparison graph for the same result.
The eight categories were also compared across precision, recall, and F1-score.
It shows that the sports category was predicted most accurately of all.
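For one category, the precision, recall, and F1 used in these comparisons reduce to the following minimal sketch, treating that category as the positive class:

```python
def precision_recall_f1(y_true, y_pred, category):
    """Compute precision, recall, and F1 for one news category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == category and t == category)
    fp = sum(1 for t, p in pairs if p == category and t != category)
    fn = sum(1 for t, p in pairs if p != category and t == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In the experiments these per-category scores came from scikit-learn's classification report, averaged across categories for the overall figures.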
The LSTM model was further analyzed for different hyperparameters, such as the number of epochs and the number of features.
We can see that as the number of features increases, the accuracy and the other metrics also increase, and the same holds for the number of epochs.
The best result was found with 300 features and 10 epochs of the LSTM model.
This is shown graphically for 5 epochs,
and for 10 epochs.
Here I present the model with different metrics. I ran similar experiments with an SVM model in order to compare it with the LSTM model I implemented; the details are in the report.
LSTM does well on large data sets, as expected.
Here is the comparison graph.
The collection of data was limited to eight categories and five websites, so the data available for the deep learning model was below 100,000 (1 lakh) articles. Taking more websites into consideration and increasing the number of categories would help classify more news. The thesis uses rule-based stemming in a basic form, owing to the unavailability of a standard stemming library for the Nepali language; effective stemming could improve the overall classification accuracy. The thesis is also limited to LSTM; other techniques such as CNN could be used along with LSTM to achieve higher accuracy and precision in Nepali news categorization.