Improving Nepali News Recommendation using Classification based on LSTM Recurrent Neural Networks
Ashok Basnet, Pokhara University
Arun K. Timalsina, Institute of Engineering
Topics covered
1. Background
2. Problem Statement
3. Literature Review
4. Research Objectives
5. Research Methodology
6. Experiment and Analysis
7. Conclusion
8. Future Works
9. References
Conf Record # 76 2
Background
● Online news content of Nepal
● More than 1000 news portals producing news in
more than 20 categories
● More than 100K news articles produced by
popular news portals such as onlinekhabar.com,
ratopati.com, and dcnepal.com
● Deep learning methods producing excellent
results in document classification
Problem Statement
● More than 500 news articles are published
daily by popular news sites, so users cannot
read them all and may miss news in the
categories that interest them
● Many news portals recommend content
manually, based only on the article the
user is currently reading
Literature Review
● Nihar et al. [1] showed that Long Short-Term Memory
(LSTM) outperforms other machine learning
algorithms in document classification, reaching an
accuracy of up to 93%.
● Kaushal Kafle et al. [2] tested word2vec, a neural
network based word embedding, to improve the vector
representation of text. With a Support Vector
Machine as the classifier, word2vec outperformed
the TF-IDF method by 1.6 percent.
Research Objectives
1. To improve the classification of Nepali
news documents using an LSTM recurrent
neural network
2. To compare the classification accuracy
achieved on different news portals
Research Methodology
1. Data Gathering
2. Data Preprocessing
3. Feature Extraction
4. Model Building
5. Model Validation
Research Methodology Flow Diagram
Technology Used
● Python
● Scrapy for news scraping
● gensim word2vec for word embeddings
● Keras library for the LSTM model
● scikit-learn library for TF-IDF, SVM, and classification metrics
System Configuration
● Model Name: MacBook Pro
● Processor Name: Intel Core i7
● Processor Speed: 2.8 GHz
● Number of Processors: 1
● Total Number of Cores: 2
● L2 Cache (per Core): 256 KB
● L3 Cache: 4 MB
● Memory: 8 GB
Data Gathering
S.No. Site Name Site URL No. of articles
1 DC Nepal dcnepal.com 12,872
2 Image Khabar imagekhabar.com 19,296
3 Online Khabar onlinekhabar.com 20,504
4 Ratopati ratopati.com 7,855
5 Ujyaalo Online ujyaaloonline.com 11,476
TOTAL 72,003
*The data were collected from 2015 till April 2018
Data Gathering - Categories
S.No. Category Name No. of Articles
1 Diaspora 6,224
2 Economy 14,067
3 Entertainment 7,588
4 Health 3,122
5 International 9,879
6 Opinion 2,675
7 Politics 8,352
8 Sports 13,018
TOTAL 64,925
Data Preprocessing
1. Removal of HTML tags from the news
document
2. Removal of white space and special symbols
3. Removal of stop words
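The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' pipeline: the stop-word set is a tiny hypothetical sample, the function name `preprocess` is made up, and treating everything outside the Devanagari letter range as a "special symbol" is an assumption.

```python
import re
from typing import List

# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"र", "को", "मा", "छ"}

# Anything that looks like an HTML tag
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: keep only Devanagari letters and signs (U+0900 to U+0963) plus whitespace
NON_DEVANAGARI_RE = re.compile(r"[^\u0900-\u0963\s]")


def preprocess(html: str) -> List[str]:
    """Return the cleaned tokens of one news document."""
    text = TAG_RE.sub(" ", html)             # 1. remove HTML tags
    text = NON_DEVANAGARI_RE.sub(" ", text)  # 2. remove symbols, digits and Latin text
    tokens = text.split()                    #    split() also collapses extra white space
    return [t for t in tokens if t not in STOP_WORDS]  # 3. remove stop words


print(preprocess("<p>नेपाल र भारत   को खेल!</p>"))
```

Removing tags before symbols matters here: the angle brackets themselves would otherwise be stripped by the symbol filter, leaving tag names behind as tokens.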
Feature Extraction
Using TF-IDF Vectorizer
TF(t) = (Number of times term t appears in a
document) / (Total number of terms in the
document)
IDF(t) = log_e (Total number of documents /
Number of documents with term t in it)
TF-IDF(t) = TF(t) × IDF(t)
Using Word2Vec Vectorizer
Represents the words of a sentence as embedding
vectors learned by a neural network.
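The TF and IDF definitions above translate directly into a few lines of Python. The sketch below follows the slide formulas literally and multiplies them to get the TF-IDF weight; the toy corpus and function names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF by default, so its values would differ from this literal version.

```python
import math
from collections import Counter
from typing import Dict, List


def tf(term: str, doc: List[str]) -> float:
    """Term frequency: occurrences of the term divided by document length."""
    return Counter(doc)[term] / len(doc)


def idf(term: str, docs: List[List[str]]) -> float:
    """Inverse document frequency with the natural logarithm."""
    containing = sum(1 for d in docs if term in d)
    # Assumes the term occurs in at least one document
    return math.log(len(docs) / containing)


def tfidf(doc: List[str], docs: List[List[str]]) -> Dict[str, float]:
    """TF-IDF weight of every distinct term in one document."""
    return {t: tf(t, doc) * idf(t, docs) for t in set(doc)}


# Toy corpus of already tokenized documents (made-up tokens)
docs = [["sports", "match", "sports"], ["match", "economy"]]
weights = tfidf(docs[0], docs)
print(weights)
```

In this toy corpus "match" appears in every document, so its IDF is log(1) = 0 and it carries no weight, while "sports" is weighted by both its frequency and its rarity.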
Data Visualization
Model Development
● Recurrent Neural Network
● Word2Vec model with 100, 200 and 300
features
● LSTM model
● 50 hidden units in the first LSTM layer
● 25 hidden units in the second LSTM layer
● 50 hidden units in the dense layer
● 5 or 10 epochs
● Dropout of 0.2
● Softmax activation at the output
● Categorical cross-entropy as the loss function
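The last two items of the list, softmax output and categorical cross-entropy loss, can be illustrated without any deep learning library. The sketch below is not the Keras model from the slides; it only shows, under made-up output scores for 8 categories, how raw scores become class probabilities and how the loss for one article is computed.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot: List[int], probs: List[float]) -> float:
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


# Hypothetical scores for 8 news categories; index 7 stands for sports here
logits = [0.1, 0.3, 0.2, 0.0, 0.4, 0.1, 0.5, 2.5]
probs = softmax(logits)
target = [0, 0, 0, 0, 0, 0, 0, 1]  # one-hot label: the article is a sports article
loss = categorical_cross_entropy(target, probs)
print(probs.index(max(probs)), round(loss, 3))
```

The predicted category is the index with the largest probability, and the loss shrinks toward zero as the probability assigned to the true category approaches 1.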
Experiments and Results
Model Experiment for dcnepal.com using LSTM
category precision recall f1-score support
diaspora 0.86 0.63 0.72 383
entertainment 0.80 0.78 0.79 564
health 0.86 0.67 0.75 178
international 0.88 0.78 0.83 888
society 0.84 0.93 0.88 1468
sports 0.91 0.74 0.82 379
avg / total 0.85 0.81 0.83 3860
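The "avg / total" row is consistent with a support-weighted average of the per-category scores, which is how scikit-learn's classification report computes its weighted average. The short check below, a sketch using only the numbers from the dcnepal.com table, recomputes that row; the variable and function names are illustrative.

```python
# (precision, recall, f1, support) per category, copied from the dcnepal.com table
rows = {
    "diaspora": (0.86, 0.63, 0.72, 383),
    "entertainment": (0.80, 0.78, 0.79, 564),
    "health": (0.86, 0.67, 0.75, 178),
    "international": (0.88, 0.78, 0.83, 888),
    "society": (0.84, 0.93, 0.88, 1468),
    "sports": (0.91, 0.74, 0.82, 379),
}

total_support = sum(r[3] for r in rows.values())


def weighted(col: int) -> float:
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    return sum(r[col] * r[3] for r in rows.values()) / total_support


summary = [round(weighted(c), 2) for c in range(3)]
print(summary, total_support)  # [0.85, 0.81, 0.83] 3860
```

For single-label classification the support-weighted recall equals the overall accuracy, which fits the 81.45% accuracy reported for dcnepal.com in the website comparison table.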
Model Experiment for imagekhabar.com using LSTM
category precision recall f1-score support
economy 0.84 0.91 0.87 1565
health 0.84 0.72 0.78 365
international 0.94 0.93 0.94 1105
opinion 0.86 0.77 0.81 82
politics 0.92 0.89 0.90 1738
sports 0.98 0.87 0.92 932
avg / total 0.90 0.89 0.90 5787
Model Experiment for onlinekhabar.com using LSTM
category precision recall f1-score support
diaspora 0.92 0.86 0.89 1398
economy 0.92 0.95 0.93 822
entertainment 0.88 0.87 0.88 831
health 0.74 0.77 0.76 169
opinion 0.91 0.89 0.90 648
sports 0.95 0.96 0.95 1678
technology 0.88 0.81 0.84 604
avg / total 0.91 0.90 0.90 6150
Model Experiment for ratopati.com using LSTM
category precision recall f1-score support
diaspora 0.66 0.62 0.64 158
economy 0.89 0.87 0.88 1059
entertainment 0.72 0.58 0.64 131
health 0.96 0.73 0.83 230
international 0.80 0.76 0.78 376
opinion 0.84 0.65 0.73 91
sports 0.70 0.74 0.72 310
avg / total 0.83 0.78 0.80 2355
Model Experiment for ujyaaloonline.com using LSTM
category precision recall f1-score support
economy 0.84 0.82 0.83 665
entertainment 0.89 0.85 0.87 751
international 0.95 0.82 0.88 659
politics 0.84 0.85 0.85 727
sports 0.91 0.93 0.92 639
avg / total 0.89 0.85 0.87 3441
Model Experiment for overall websites using LSTM
category precision recall f1-score support
diaspora 0.89 0.72 0.79 1896
economy 0.86 0.86 0.86 4194
entertainment 0.84 0.79 0.81 2208
health 0.82 0.75 0.78 909
international 0.92 0.85 0.88 3019
opinion 0.84 0.90 0.87 815
politics 0.88 0.85 0.86 2501
sports 0.96 0.93 0.95 3936
avg / total 0.89 0.85 0.87 19478
Model accuracy and loss graphs
Fig: Accuracy vs Epoch Fig: Loss vs Epoch
Comparison of 5 websites using LSTM model
Website Accuracy Precision Recall f1-score
Dcnepal.com 81.45% 85% 81% 83%
Imagekhabar.com 88.83% 90% 89% 90%
Onlinekhabar.com 89.26% 91% 90% 90%
Ratopati.com 77.87% 83% 78% 80%
Ujyaaloonline.com 85.29% 89% 85% 87%
Comparison chart of 5 websites using LSTM model
Comparison of category classification using LSTM
Comparison of LSTM model with different hyper-parameters
No. of features No. of epochs Accuracy Precision Recall f1-score
100 5 70.43% 85% 70% 77%
100 10 73.22% 84% 73% 78%
200 5 79.19% 87% 79% 83%
200 10 80.95% 87% 81% 84%
300 5 83.05% 89% 83% 86%
300 10 84.63% 89% 85% 87%
Analysis of LSTM metrics with number of features using 5 epochs
Analysis of LSTM metrics with number of features using 10 epochs
Comparison of LSTM with SVM
Model Accuracy Precision Recall f1-score
SVM 81.41% 85% 81% 80%
LSTM 84.63% 89% 85% 87%
Comparison Graph of LSTM with SVM
Conclusion
1. LSTM with 300 word2vec features improved on the SVM
model, reaching 84.63% accuracy and 89% precision
against 81.41% accuracy and 85% precision for SVM
2. Onlinekhabar.com had the highest accuracy and
precision among the five websites compared in the
experiments, with 89.26% accuracy and 91% precision
3. Sports was the best-classified of the 8 categories,
with 96% precision and 93% recall on the combined
dataset
Future Works
1. Take more websites and more categories
into consideration to classify more news
accurately
2. Use an effective stemming method to reduce
the classification error rate
3. Use a hybrid approach that combines LSTM
with techniques such as CNN
References
[1] N. M. Ranjan, Y. R. Ghorpade, G. R. Kanthale, A. R. Ghorpade and A. S. Dubey,
“Document Classification using LSTM Neural Network”, Journal of Data Mining and
Management, Volume 2, Issue 2, 2017
[2] K. Kafle, D. Sharma, A. Subedi and A. K. Timalsina, “Improving Nepali Document
Classification by Neural Network”, IOE Graduate Conference, 2016, pp. 317–322
[3] S. Kaur and N. K. Khiva, “Online news classification using Deep Learning Technique”,
International Research Journal of Engineering and Technology (IRJET), Volume 03, Issue
10, 2016
[4] C. Zhou, C. Sun, Z. Liu and F. Lau, “A C-LSTM Neural Network for Text Classification”,
CoRR abs/1511.08630, 2015
[5] F. Al. Zaghoul and S. Al. Dhaheri, “Arabic Text Classification Based on Features Reduction
Using Artificial Neural Networks”, UKSim 15th International Conference on Computer
Modelling and Simulation, 2013
[6] C. Chan, A. Sun and E. Lim, “Automated Online News Classification with Personalization”,
Proceedings of the 4th International Conference of Asian Digital Library (ICADL2001),
pp. 320–329, Bangalore, India, December 2001
THANK YOU

Improving Nepali News Recommendation Using Classification Based on LSTM Recurrent Neural Networks - ICCCS 2018 Paper

Editor's Notes

  1. Good afternoon everyone. My name is Ashok Basnet, and I am going to present the research paper "Improving Nepali News Recommendation using Classification based on LSTM Recurrent Neural Networks". The paper is co-authored by Arun Kumar Timalsina from the Institute of Engineering.
  2. I will be covering the following topics in my presentation.
  3. News creation and consumption in Nepal have grown rapidly in the past few years. More than 1000 news portals are available in Nepal, spreading news across more than 20 categories. They have produced more than 1 lakh (100,000) news articles since their establishment. Recent advances in natural language processing and deep learning are producing excellent results in document classification.
  4. Because of this rapid growth, a large volume of news data is available on the internet, and users find it difficult to get news in the categories they are interested in. News websites recommend news manually, which can be automated with machine-learning-based categorization.
  5. Mr. Nihar used recurrent neural networks for English news classification, with accuracies of more than 93%. There has been significant work on Support Vector Machines with word2vec by Kaushal Kafle, published recently at the IOE Graduate Conference. Other than this, there has been work on sentiment analysis of Nepali movie reviews using different machine learning algorithms.
  6. The objective of my research was to improve the classification of Nepali news documents using a deep learning technology, the recurrent neural network, and to compare five popular news sites of Nepal by their classification accuracy.
  7. The research was carried out in the following steps.
  8. The data was collected from various sources and cleaned for any HTML tags and extra spaces. The data was selected and stop words were removed from the sentences. The feature extraction was done to get vectors from the words The train test split was done as 70% - 30%, where training set was again divided into validation sets. The input was fed to LSTM network and the model is evaluated across different measures Finally, the model is evaluated with test sets and the prediction of the category is done.
  9. All the experiments were carried out in the Python programming language. I used Scrapy for news scraping, word2vec for embeddings of the Nepali words, the Keras library for implementing the LSTM model, scikit-learn for implementing TF-IDF, SVM and the classification metrics, and some other libraries.
  10. I ran all the experiments on the system configuration shown: an i7 processor and 8 GB of RAM.
  11. I started by collecting data from 5 different websites that are popular in Nepal. In total, I collected around 72,000 articles from these websites. The data were collected from 2015 to April 2018.
  12. The news articles were grouped into 8 categories, and the data were cleaned to leave around 65,000 articles. All the categories had a fair share.
  13. The data preprocessing step was applied to the collected data set. The HTML tags were removed, and white space and special symbols were also cleaned. Stop words were removed using the standard stop word list from NLTK (Natural Language Toolkit).
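A minimal sketch of this kind of cleaning is shown here, using only the standard library. The stop word set is a tiny hypothetical sample standing in for the NLTK list used in the talk, and the choices to keep Devanagari digits and to drop the danda are assumptions; `clean_text` is an illustrative name, not a function from the thesis.

```python
import html
import re

# Hypothetical, very small stop word sample for illustration only;
# the talk used the stop word list shipped with NLTK.
STOP_WORDS = {"र", "को", "मा", "छ", "पनि"}

TAG_RE = re.compile(r"<[^>]+>")
# Keep Devanagari letters, signs and digits, zero-width (non-)joiners and
# whitespace. Everything else, including the danda (U+0964, U+0965),
# Latin text and punctuation, is treated as a special symbol (assumption).
SPECIAL_RE = re.compile(r"[^\u0900-\u0963\u0966-\u097F\u200c\u200d\s]")

def clean_text(raw):
    """Return the list of cleaned tokens for one scraped article."""
    text = TAG_RE.sub(" ", raw)        # strip HTML tags
    text = html.unescape(text)         # decode entities such as &nbsp;
    text = SPECIAL_RE.sub(" ", text)   # drop special symbols
    tokens = text.split()              # also collapses extra white space
    return [token for token in tokens if token not in STOP_WORDS]

tokens = clean_text("<p>नेपाल  र भारत को खेल!&nbsp;</p>")
```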
  14. The next step was feature extraction. I used both TF-IDF and word2vec to vectorize the words. The TF-IDF formula is as given above, and word2vec was used from the gensim library.
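The formula on the slide is not part of this transcript, so the sketch below uses the textbook form, term frequency times log(N / document frequency), as an assumption. It may differ from the slide and from scikit-learn's default, which applies a smoothed IDF. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    documents: list of token lists.
    Returns one {term: weight} dict per document, where
    weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["खेल", "खेल", "टोली"], ["अर्थ", "बजार", "टोली"]]
weights = tf_idf(docs)
```

A term that appears in every document gets weight zero under this variant, which is why words common to all categories contribute little to the classification.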
  15. Principal Component Analysis was used for the visualization of the dataset. The plot shows the distribution of the data across different categories as separated by color coding.
  16. The model I used was the LSTM recurrent neural network, which classifies the Nepali documents into the appropriate categories. For this, I used the word2vec model to convert words into embedding vectors that are fed to the LSTM model. In the midterm, I experimented with 50 hidden units of LSTM, 5 epochs and a dropout of 0.2.
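A model of this shape could be sketched in Keras roughly as follows. The 50 hidden units, the dropout of 0.2 and the 8 output categories come from the talk. The sequence length, the embedding size of 300, the optimizer and the loss are assumptions, and the exact layer stack of the thesis is not given in this transcript.

```python
from tensorflow import keras

NUM_CATEGORIES = 8    # eight news categories, from the talk
HIDDEN_UNITS = 50     # LSTM hidden units, from the talk
DROPOUT = 0.2         # dropout rate, from the talk
EMBEDDING_DIM = 300   # assumed size of each word2vec vector
MAX_WORDS = 100       # assumed number of words kept per article

def build_model():
    """Build an LSTM classifier over pre-computed word2vec sequences."""
    model = keras.Sequential([
        # Each article is a (MAX_WORDS, EMBEDDING_DIM) matrix of word vectors.
        keras.layers.Input(shape=(MAX_WORDS, EMBEDDING_DIM)),
        keras.layers.LSTM(HIDDEN_UNITS, dropout=DROPOUT),
        # One softmax probability per news category.
        keras.layers.Dense(NUM_CATEGORIES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",                        # assumed optimizer
        loss="sparse_categorical_crossentropy",  # integer category labels
        metrics=["accuracy"],
    )
    return model

model = build_model()
```

Training for the 5 epochs mentioned would then be a call such as `model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)`, where the array names are placeholders for the vectorized splits.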
  17. Let's go through the experiments I conducted and their results. In this section, I will present the LSTM model experiments for the 5 websites and their performance measures. The overall data was also evaluated, and a comparison based on category was done. I ran the same dataset through SVM and compared the results with the LSTM. Now let's proceed to the results section.
  18. Here, the dcnepal.com data was run through the LSTM model that we built, giving the following output. It classified sports with a precision of 91%, and the average precision was 85%.
  19. Similarly, the model was run on the imagekhabar.com data, with the measures shown in the table.
  20. And this is for onlinekhabar: the model predicted economy news quite well, and sports and diaspora were also classified accurately.
  21. Similarly, for ratopati.com, the health category was found to have the highest precision.
  22. And for ujyaaloonline.com, the model got good precision for the sports category.
  23. I merged all the data from these websites and evaluated the model. Here I got a good precision for the model, around 89%. Sports and international news were categorized with over 90% precision. The lowest precision was for the health category.
  24. This is the plot of model accuracy versus epoch and model loss versus epoch, which shows fairly good learning on both the training and validation sets.
  25. The five websites were compared across various measures such as accuracy, precision, recall and F1-score. I found onlinekhabar to have the highest accuracy and precision; ratopati had the lowest among them.
  26. Here is the comparison graph for the same result
  27. The 8 categories were also compared across precision, recall and F1-score. It shows that the sports category was predicted most accurately of all.
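For reference, the per-category measures used in these comparisons can be computed as in the sketch below. It is written in plain Python for clarity, whereas the talk used the metrics from scikit-learn; the function names and the toy labels are illustrative only and are not results from the thesis.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for each category."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        # Guard against division by zero when a category is never
        # predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

def accuracy(y_true, y_pred):
    """Fraction of articles whose predicted category is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only.
y_true = ["sports", "sports", "health", "economy"]
y_pred = ["sports", "health", "health", "economy"]
report = per_class_metrics(y_true, y_pred)
```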
  28. The LSTM model was further analyzed for different hyperparameters, such as the number of epochs and the number of features. Here we can see that as the number of features increases, the accuracy and other metrics increase, and the same holds for the number of epochs. The best result was found with 300 features and 10 epochs of the LSTM model.
  29. This shows the graphical format for 5 epochs.
  30. For 10 epochs
  31. Here I present the model with different metrics. I ran similar experiments with an SVM model in order to compare it with the LSTM model that I implemented. The details are in the report. LSTM seems to do well on a large data set, as expected.
  32. Here is the comparison graph
  33. The collection of data was limited to eight categories and five websites, so the data available for the deep learning model was limited to below 1 lakh (100,000) articles. This can be improved by taking more websites into consideration, and increasing the number of categories will help to classify more news. The thesis uses rule-based stemming in its basic form, due to the unavailability of a standard stemming library for the Nepali language. Effective use of stemming can help to improve the overall accuracy of the classification. The thesis is limited to LSTM; other techniques such as CNN can be used along with LSTM to achieve a higher degree of accuracy and precision in categorizing Nepali news.
  34. And these are the references that I drew on.