SlideShare a Scribd company logo
1 of 17
Document Classification
using NLP and Deep
Learning
EEL 6825 Pattern Recognition
Presented By:
Saurabh Kumar Prasad
Introduction
What comes under Document?
Blog, News article, Books, Customer Review, Research Papers
Need for Document/Text Classification
• Natural language cannot be easily understood by computers
• Categorization of vast amount of textual data
Applications
• Spam Detection
• Sentiment Analysis
• Customer Query classification in chatbots
• Fake news detection
Steps for Document classification using
Machine Learning
• Data Preprocessing
• Feature Engineering
• Model Selection and Training
• Performance Tuning
• Testing
Dataset
AG corpus of News Article (http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)
Collected from 2000 news sources.
120000 training samples in csv format
7600 testing samples in csv format
4 Categories - World, Sports, Business, Sci/Tech.
Library used - pandas
Preprocessing
• Remove HTML tags
• Remove extra whitespaces
• Convert accented characters to ASCII characters
• Remove special characters
• Lowercase all texts
• Convert number words to numeric form
• Remove numbers
• Remove stopwords (NLTK)
• Lemmatization (NLTK)
Input = SPACE.com - TORONTO, Canada -- A secondteam of rocketeers competing for the #36;10 million Ansari X
Prize, a contest forprivately funded suborbital space flight, has officially announced the firstlaunch date for its
manned rocket.
After preprocessing = spacecom toronto canada a second team rocketeers compete million ansari x prize
contest for privately fund suborbital space flight officially announce first launch date man rocket
Feature Engineering
• Count Vectors- generates vocabulary for all unique words of sentence. This in turn creates feature
vector of the count of the words (Bag of Words BOW model)
• TF-IDF - given a score to the words based on its Term Frequency (#words/ #Total words) and its
Inverse Frequency
• TF-IDF N-grams – TF-IDF also including bi-grams, tri-grams
• TF-IDF N-grams character level
• Word Embedding - It preserves contextually similar words. Useful for deep learning. Creates an
embedding matrix using pre-trained vectors on large corpus
Model Selection – Baseline Models
Naïve Bayes Classifier
This classification technique is based on Bayes’ Theorem and assumes independence
among the predictors.
Baseline Models – Logistic Regression
This model uses a logistic/sigmoid function to calculate the probabilities of different values of the
categorical dependent variable in presence of one or more predictors. Logit function is an estimation
of log of odds in the favor of event and outputs a s-shaped curve with probability estimates.
Baseline Models – Random Forest and XG boost
Bagging Model (Random Forest) - Random Forest models are a type of ensemble models, particularly
bagging models. They are part of the tree-based model family.
XG Boost - Boosting models are another type of ensemble models and are part of tree based models.
This model is meta-algorithm for reducing bias and variance in supervised learning. One of its strong
points is that it converts weak learner i.e. classifiers slightly correlated to the true classification into
stronger ones.
Neural Networks
Neural networks or feedforward neural networks. These have multiple layers of neurons
arranged in fashion similar to that of human brain.
Activation function : transforming the summed weighted input from the node into the output for
that input. I have used rectified linear activation (Relu) function since it overcomes the vanishing
gradient problem, allowing models to learn faster and perform better.
Loss function : a function which needs to be minimized for correctioness of model
I have used softmax or normalized exponential function as there is multi class classification
Terms
Epoch : Since the training in neural networks is an iterative process, the training won’t just stop after
it is done. You have to specify the number of iterations you want the model to be training. Those
completed iterations are commonly called epochs. I have chosen 10
Batch size: The batch size is responsible for how many samples we want to use in one epoch, which
means how many samples are used in one forward/backward pass. I have chosen 50
Validation split : this is a spit of training data to calculate loss function and should be kept apart from
testing set in order to not let testing set to influence the model. Validation spit of 10 % is used
Deep Neural Networks CNN
A neural network with more than one hidden layer is considered a deep neural network.
Keras is the model-level library used . To handle low-level operations such as tensor manipulation and
differentiation tensorflow is the backend engine of Keras.
CNN : It is a specialized neural network that can detect specific patterns which is used to discern the
most important information in a sentence. The hidden layers called convolutional layer. I have used
one dimensional CNN which is unaffected by translations The patch of filters slide filter slide over
embedding matrix and extracts a specific pattern of n-gram.
CNN Layers Description
• Embedding Layer of Keras which takes the previously calculated integers and maps
them to a dense vector of the embedding using the embedding matrix from pre-
trained word vectors.
• Convulational layer
• GlobalMaxPooling1D layer after the embedding layer to downsample (the maximum
value of all features in the pool for each feature dimension)
• Dense layer with Renu activation function
• Dense layer with softmax activation function
CNN Training
RNN
In RNN the activation outputs are propagated in both directions. It results in looping
which provides a state to neurons giving it ability to remember the learnings.
The CNN layer is replaced by bidirectional GRU layer
Conclusion
While the classical models provided a good accuracy the neural network models improved
it further. The highest accuracy achieved using CNN as 91%. With more data the deep
learning models will outmatch classical models. Using word embedding provided an
additional 2-3% improvement in accuracy and faster training time. CNN provided the best
performance in neural networks but RNN still had good results. More training data could
make RNN perform better than CNN.
Demo

More Related Content

Similar to presentation.ppt

Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...ananth
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1arthi v
 
Facial Emotion Detection on Children's Emotional Face
Facial Emotion Detection on Children's Emotional FaceFacial Emotion Detection on Children's Emotional Face
Facial Emotion Detection on Children's Emotional FaceTakrim Ul Islam Laskar
 
Electi Deep Learning Optimization
Electi  Deep Learning OptimizationElecti  Deep Learning Optimization
Electi Deep Learning OptimizationNikolas Markou
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskSaurabh Saxena
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesValue Amplify Consulting
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningHoa Le
 
Multi-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningMulti-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningSanghamitra Deb
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsColleen Farrelly
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
 
Automated Speech Recognition
Automated Speech Recognition Automated Speech Recognition
Automated Speech Recognition Pruthvij Thakar
 
Demystifying Machine Learning
Demystifying Machine LearningDemystifying Machine Learning
Demystifying Machine LearningAyodele Odubela
 
Machine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning SystemsMachine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning SystemsClarence Chio
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner:  Data Mining And Rapid MinerRapidMiner:  Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidmining Content
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxDebabrataPain1
 

Similar to presentation.ppt (20)

C3 w1
C3 w1C3 w1
C3 w1
 
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
 
Building largescalepredictionsystemv1
Building largescalepredictionsystemv1Building largescalepredictionsystemv1
Building largescalepredictionsystemv1
 
Facial Emotion Detection on Children's Emotional Face
Facial Emotion Detection on Children's Emotional FaceFacial Emotion Detection on Children's Emotional Face
Facial Emotion Detection on Children's Emotional Face
 
Electi Deep Learning Optimization
Electi  Deep Learning OptimizationElecti  Deep Learning Optimization
Electi Deep Learning Optimization
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Multi-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learningMulti-modal sources for predictive modeling using deep learning
Multi-modal sources for predictive modeling using deep learning
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problems
 
Deep learning
Deep learningDeep learning
Deep learning
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
 
Automated Speech Recognition
Automated Speech Recognition Automated Speech Recognition
Automated Speech Recognition
 
Demystifying Machine Learning
Demystifying Machine LearningDemystifying Machine Learning
Demystifying Machine Learning
 
Machine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning SystemsMachine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning Systems
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner:  Data Mining And Rapid MinerRapidMiner:  Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptx
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 

presentation.ppt

  • 1. Document Classification using NLP and Deep Learning EEL 6825 Pattern Recognition Presented By: Saurabh Kumar Prasad
  • 2. Introduction What comes under Document? Blog, News article, Books, Customer Review, Research Papers Need for Document/Text Classification • Natural language cannot be easily understood by computers • Categorization of vast amount of textual data Applications • Spam Detection • Sentiment Analysis • Customer Query classification in chatbots • Fake news detection
  • 3. Steps for Document classification using Machine Learning • Data Preprocessing • Feature Engineering • Model Selection and Training • Performance Tuning • Testing
  • 4. Dataset AG corpus of News Article (http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) Collected from 2000 news sources. 120000 training samples in csv format 7600 testing samples in csv format 4 Categories - World, Sports, Business, Sci/Tech. Library used - pandas
  • 5. Preprocessing • Remove HTML tags • Remove extra whitespaces • Convert accented characters to ASCII characters • Remove special characters • Lowercase all texts • Convert number words to numeric form • Remove numbers • Remove stopwords (NLTK) • Lemmatization (NLTK) Input = SPACE.com - TORONTO, Canada -- A secondteam of rocketeers competing for the #36;10 million Ansari X Prize, a contest forprivately funded suborbital space flight, has officially announced the firstlaunch date for its manned rocket. After preprocessing = spacecom toronto canada a second team rocketeers compete million ansari x prize contest for privately fund suborbital space flight officially announce first launch date man rocket
  • 6. Feature Engineering • Count Vectors- generates vocabulary for all unique words of sentence. This in turn creates feature vector of the count of the words (Bag of Words BOW model) • TF-IDF - given a score to the words based on its Term Frequency (#words/ #Total words) and its Inverse Frequency • TF-IDF N-grams – TF-IDF also including bi-grams, tri-grams • TF-IDF N-grams character level • Word Embedding - It preserves contextually similar words. Useful for deep learning. Creates an embedding matrix using pre-trained vectors on large corpus
  • 7. Model Selection – Baseline Models Naïve Bayes Classifier This classification technique is based on Bayes’ Theorem and assumes independence among the predictors.
  • 8. Baseline Models – Logistic Regression This model uses a logistic/sigmoid function to calculate the probabilities of different values of the categorical dependent variable in presence of one or more predictors. Logit function is an estimation of log of odds in the favor of event and outputs a s-shaped curve with probability estimates.
  • 9. Baseline Models – Random Forest and XG boost Bagging Model (Random Forest) - Random Forest models are a type of ensemble models, particularly bagging models. They are part of the tree-based model family. XG Boost - Boosting models are another type of ensemble models and are part of tree based models. This model is meta-algorithm for reducing bias and variance in supervised learning. One of its strong points is that it converts weak learner i.e. classifiers slightly correlated to the true classification into stronger ones.
  • 10. Neural Networks Neural networks or feedforward neural networks. These have multiple layers of neurons arranged in fashion similar to that of human brain. Activation function : transforming the summed weighted input from the node into the output for that input. I have used rectified linear activation (Relu) function since it overcomes the vanishing gradient problem, allowing models to learn faster and perform better. Loss function : a function which needs to be minimized for correctioness of model I have used softmax or normalized exponential function as there is multi class classification
  • 11. Terms Epoch : Since the training in neural networks is an iterative process, the training won’t just stop after it is done. You have to specify the number of iterations you want the model to be training. Those completed iterations are commonly called epochs. I have chosen 10 Batch size: The batch size is responsible for how many samples we want to use in one epoch, which means how many samples are used in one forward/backward pass. I have chosen 50 Validation split : this is a spit of training data to calculate loss function and should be kept apart from testing set in order to not let testing set to influence the model. Validation spit of 10 % is used
  • 12. Deep Neural Networks CNN A neural network with more than one hidden layer is considered a deep neural network. Keras is the model-level library used . To handle low-level operations such as tensor manipulation and differentiation tensorflow is the backend engine of Keras. CNN : It is a specialized neural network that can detect specific patterns which is used to discern the most important information in a sentence. The hidden layers called convolutional layer. I have used one dimensional CNN which is unaffected by translations The patch of filters slide filter slide over embedding matrix and extracts a specific pattern of n-gram.
  • 13. CNN Layers Description • Embedding Layer of Keras which takes the previously calculated integers and maps them to a dense vector of the embedding using the embedding matrix from pre- trained word vectors. • Convulational layer • GlobalMaxPooling1D layer after the embedding layer to downsample (the maximum value of all features in the pool for each feature dimension) • Dense layer with Renu activation function • Dense layer with softmax activation function
  • 15. RNN In RNN the activation outputs are propagated in both directions. It results in looping which provides a state to neurons giving it ability to remember the learnings. The CNN layer is replaced by bidirectional GRU layer
  • 16. Conclusion While the classical models provided a good accuracy the neural network models improved it further. The highest accuracy achieved using CNN as 91%. With more data the deep learning models will outmatch classical models. Using word embedding provided an additional 2-3% improvement in accuracy and faster training time. CNN provided the best performance in neural networks but RNN still had good results. More training data could make RNN perform better than CNN.
  • 17. Demo