SlideShare a Scribd company logo
Text Classification
Russ M. Delos Santos
Overview
Research:
Text classification for RAX Studio
Suggested Use Case:
- Account management through email.
Natural Language
Processing
1. Automatic or semi-automatic
processing of human language
2. Can be used for various
applications like
a. Sentiment Analysis
b. Intent Classification
c. Topic Labeling
Data
Pre-process to desired
text format
Features
Transform the text to
vectors (numbers)
Model
Feed the data to the
model
Prediction
Set prediction criteria
once the model
converge
Output
Output the class
General Process
Dataset / Text Corpus
- Dictionary or vocabulary which is used to train
the model
● Either tagged (for supervised learning) or untagged
(for unsupervised).
● Size depends on the algorithm used. Should be pre-
processed to remove unwanted characters, to convert
to wanted format, etc.
Dataset / Text Corpus
- Open-source dataset samples
● Amazon Reviews
● NYSK Dataset (News Articles
● Enrol Email Dataset
● Ling Spam Dataset
Feature Extraction
- Transforms texts to numbers (vector space
model)
- Choices:
● One-hot encoding
● Bag-of-words + TF*IDF
● Word2vec
One-hot encoding
- Creates a binary encoding of words. 1 is
encoded on the index of the word in the corpus
Bag-of-words
- Takes the word count of the target word in the
corpus as the feature
TF*IDF
- Term Frequency * Inverse Document
Frequency
● Frequently occurring words are typically not
important / has less weight (stopwords such as “is, are,
the, etc.”)
● Weights are assigned per word.
TF*IDF
- Term Frequency * Inverse Document
Frequency
BOW + TF*IDF
BOW + TF*IDF
word2vec
Uses the weights of the hidden layer of a neural
network as features of the words
● Can predict a context or a word based on the nearby
words in the corpus
● Uses continuous bag-of-words or skip-gram model + 1-
1-1 neural network.
word2vec
- Gives better semantic/syntactic relationships
of words through vectors
Schedule
Machine Learning Model
- A classifier algorithm that transforms an input
to the desired class
● Naive Bayes
● K-nearest neighbors
● Multilayer Perceptron
● Recurrent Neural Network + Long short-term memory
Naive Bayes
- Probabilistic model that relies on word count
● Uses bag of words as features
● Assumes that the position of words doesn't matter and
words are independent of each other
Naive Bayes
- Probabilistic model that relies on word count
K-Nearest Neighbors
- Classifies the class based on the nearest
distance from a known class
Multilayer Perceptron
- A feed-forward neural network
● Has at least 2 hidden layers
● Sigmoid function - binary classification
● Softmax function - multiclass classification
Multilayer Perceptron
Assessment
Option 1
● Features: BOW + TF*IDF
● ML Algorithm: Naive Bayes
● Pros: Easier to implement
● Cons: Word count instead of
word sequence.
● Ex. ‘Live to eat’ and ‘Eat to live’
may mean the same’
Option 2
● Features: word2vec word
embeddings
● ML Algorithm: Multilayer
Perceptron
● Pros: Produces better results,
semantically and syntactically
● Cons: Needs a big labeled
dataset to perform well
Main Blocks
ML.NET Learning Curve
- Still studying the framework.
- Not as well documented compared to Python
frameworks/libraries
- Ex. Has a method called TextCatalog.FeaturizeText() but there’s
no indication of the kind of feature extraction.
Supervised Learning Needs Big Data
- We can use open-source datasets for benchmark.
- But we need datasets with specific labels for the algorithm to
work.
Main Blocks
Model Update Criteria
- Retraining the model for every unknown word is impractical.
- Suggestion:
- Set a minimum number of occurence of new words before a
model is to be retrained
- Ignore the rare, new words since it may not affect the
entire intent, sentiment, meaning, of the text.
Implementation Plan
- Email Cleaner
- Clean special characters, HTML tags, header and footer of
the email, etc.
- Set a standard file format (tsv, csv, txt, etc. or transform to
bin)
- Use spam dataset for the mean time as benchmark (binary
classification)
- Sentence Tokenizer + Feature Extraction
- Divide emails per sentence + word2vec
- Create Neural Network
- 1 input, 2 hidden, 1 output.
- Activation function - sigmoid
References
[1]D. Jurafsky and J. Martin, Speech and
language processing. Upper Saddle River,
N.J.: Pearson Prentice Hall, 2009.
[2]https://developers.google.com/machi
ne-learning/
[3]bunch of stackoverflow /
stackexchange / Kaggle threads
[4]bunch of Medium posts

More Related Content

What's hot

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
Arvind Devaraj
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Yuriy Guts
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
SOUMIT KAR
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
shaurya uppal
 
Bert
BertBert
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
Prakash Pimpale
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
ankit_ppt
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
Benjamin Bengfort
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
ankit_ppt
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
M. Atif Qureshi
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
Kuppusamy P
 
BERT
BERTBERT
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
Pranav Gupta
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
Robert Lujo
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
Knoldus Inc.
 
Text classification
Text classificationText classification
Text classification
James Wong
 

What's hot (20)

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
NLTK
NLTKNLTK
NLTK
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Bert
BertBert
Bert
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
 
BERT
BERTBERT
BERT
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
Language models
Language modelsLanguage models
Language models
 
Text classification
Text classificationText classification
Text classification
 

Similar to Text Classification

Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Pramati Technologies
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
MadhuriChandanbatwe
 
Amazon Product Sentiment review
Amazon Product Sentiment reviewAmazon Product Sentiment review
Amazon Product Sentiment review
Lalit Jain
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
NUPUR YADAV
 
Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...
Olga Zinkevych
 
DLT UNIT-3.docx
DLT  UNIT-3.docxDLT  UNIT-3.docx
DLT UNIT-3.docx
0567Padma
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
Bernardo Najlis
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
Matīss ‎‎‎‎‎‎‎  
 
BERT QnA System for Airplane Flight Manual
BERT QnA System for Airplane Flight ManualBERT QnA System for Airplane Flight Manual
BERT QnA System for Airplane Flight Manual
ArkaGhosh65
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
Simon Lia-Jonassen
 
part 1 - intorduction data structure 2021 mte.ppt
part 1 -  intorduction data structure  2021 mte.pptpart 1 -  intorduction data structure  2021 mte.ppt
part 1 - intorduction data structure 2021 mte.ppt
abdoSelem1
 
Authorcontext:ire
Authorcontext:ireAuthorcontext:ire
Authorcontext:ire
Soham Saha
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Spark Summit
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Saurabh Saxena
 
Protocol Buffers
Protocol BuffersProtocol Buffers
Protocol Buffers
Software Infrastructure
 
"Running Open-Source LLM models on Kubernetes", Volodymyr Tsap
"Running Open-Source LLM models on Kubernetes",  Volodymyr Tsap"Running Open-Source LLM models on Kubernetes",  Volodymyr Tsap
"Running Open-Source LLM models on Kubernetes", Volodymyr Tsap
Fwdays
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep Learning
IRJET Journal
 
ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLP
indico data
 
DSA 1- Introduction.pdf
DSA 1- Introduction.pdfDSA 1- Introduction.pdf
DSA 1- Introduction.pdf
AliyanAbbas1
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysisstat
 

Similar to Text Classification (20)

Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
 
Amazon Product Sentiment review
Amazon Product Sentiment reviewAmazon Product Sentiment review
Amazon Product Sentiment review
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
 
Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...Overview of text classification approaches algorithms & software v lyubin...
Overview of text classification approaches algorithms & software v lyubin...
 
DLT UNIT-3.docx
DLT  UNIT-3.docxDLT  UNIT-3.docx
DLT UNIT-3.docx
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
BERT QnA System for Airplane Flight Manual
BERT QnA System for Airplane Flight ManualBERT QnA System for Airplane Flight Manual
BERT QnA System for Airplane Flight Manual
 
No more bad news!
No more bad news!No more bad news!
No more bad news!
 
part 1 - intorduction data structure 2021 mte.ppt
part 1 -  intorduction data structure  2021 mte.pptpart 1 -  intorduction data structure  2021 mte.ppt
part 1 - intorduction data structure 2021 mte.ppt
 
Authorcontext:ire
Authorcontext:ireAuthorcontext:ire
Authorcontext:ire
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate HelpdeskDeep Learning Enabled Question Answering System to Automate Corporate Helpdesk
Deep Learning Enabled Question Answering System to Automate Corporate Helpdesk
 
Protocol Buffers
Protocol BuffersProtocol Buffers
Protocol Buffers
 
"Running Open-Source LLM models on Kubernetes", Volodymyr Tsap
"Running Open-Source LLM models on Kubernetes",  Volodymyr Tsap"Running Open-Source LLM models on Kubernetes",  Volodymyr Tsap
"Running Open-Source LLM models on Kubernetes", Volodymyr Tsap
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep Learning
 
ODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLPODSC East: Effective Transfer Learning for NLP
ODSC East: Effective Transfer Learning for NLP
 
DSA 1- Introduction.pdf
DSA 1- Introduction.pdfDSA 1- Introduction.pdf
DSA 1- Introduction.pdf
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysis
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Text Classification

  • 2. Overview Research: Text classification for RAX Studio Suggested Use Case: - Account management through email.
  • 3. Natural Language Processing 1. Automatic or semi-automatic processing of human language 2. Can be used for various applications like a. Sentiment Analysis b. Intent Classification c. Topic Labeling
  • 4. Data Pre-process to desired text format Features Transform the text to vectors (numbers) Model Feed the data to the model Prediction Set prediction criteria once the model converge Output Output the class General Process
  • 5. Dataset / Text Corpus - Dictionary or vocabulary which is used to train the model ● Either tagged (for supervised learning) or untagged (for unsupervised). ● Size depends on the algorithm used. Should be pre- processed to remove unwanted characters, to convert to wanted format, etc.
  • 6. Dataset / Text Corpus - Open-source dataset samples ● Amazon Reviews ● NYSK Dataset (News Articles ● Enrol Email Dataset ● Ling Spam Dataset
  • 7. Feature Extraction - Transforms texts to numbers (vector space model) - Choices: ● One-hot encoding ● Bag-of-words + TF*IDF ● Word2vec
  • 8. One-hot encoding - Creates a binary encoding of words. 1 is encoded on the index of the word in the corpus
  • 9. Bag-of-words - Takes the word count of the target word in the corpus as the feature
  • 10. TF*IDF - Term Frequency * Inverse Document Frequency ● Frequently occurring words are typically not important / has less weight (stopwords such as “is, are, the, etc.”) ● Weights are assigned per word.
  • 11. TF*IDF - Term Frequency * Inverse Document Frequency
  • 14. word2vec Uses the weights of the hidden layer of a neural network as features of the words ● Can predict a context or a word based on the nearby words in the corpus ● Uses continuous bag-of-words or skip-gram model + 1- 1-1 neural network.
  • 15. word2vec - Gives better semantic/syntactic relationships of words through vectors
  • 17. Machine Learning Model - A classifier algorithm that transforms an input to the desired class ● Naive Bayes ● K-nearest neighbors ● Multilayer Perceptron ● Recurrent Neural Network + Long short-term memory
  • 18. Naive Bayes - Probabilistic model that relies on word count ● Uses bag of words as features ● Assumes that the position of words doesn't matter and words are independent of each other
  • 19. Naive Bayes - Probabilistic model that relies on word count
  • 20. K-Nearest Neighbors - Classifies the class based on the nearest distance from a known class
  • 21. Multilayer Perceptron - A feed-forward neural network ● Has at least 2 hidden layers ● Sigmoid function - binary classification ● Softmax function - multiclass classification
  • 23. Assessment Option 1 ● Features: BOW + TF*IDF ● ML Algorithm: Naive Bayes ● Pros: Easier to implement ● Cons: Word count instead of word sequence. ● Ex. ‘Live to eat’ and ‘Eat to live’ may mean the same’ Option 2 ● Features: word2vec word embeddings ● ML Algorithm: Multilayer Perceptron ● Pros: Produces better results, semantically and syntactically ● Cons: Needs a big labeled dataset to perform well
  • 24. Main Blocks ML.NET Learning Curve - Still studying the framework. - Not as well documented compared to Python frameworks/libraries - Ex. Has a method called TextCatalog.FeaturizeText() but there’s no indication of the kind of feature extraction. Supervised Learning Needs Big Data - We can use open-source datasets for benchmark. - But we need datasets with specific labels for the algorithm to work.
  • 25. Main Blocks Model Update Criteria - Retraining the model for every unknown word is impractical. - Suggestion: - Set a minimum number of occurence of new words before a model is to be retrained - Ignore the rare, new words since it may not affect the entire intent, sentiment, meaning, of the text.
  • 26. Implementation Plan - Email Cleaner - Clean special characters, HTML tags, header and footer of the email, etc. - Set a standard file format (tsv, csv, txt, etc. or transform to bin) - Use spam dataset for the mean time as benchmark (binary classification) - Sentence Tokenizer + Feature Extraction - Divide emails per sentence + word2vec - Create Neural Network - 1 input, 2 hidden, 1 output. - Activation function - sigmoid
  • 27. References [1]D. Jurafsky and J. Martin, Speech and language processing. Upper Saddle River, N.J.: Pearson Prentice Hall, 2009. [2]https://developers.google.com/machi ne-learning/ [3]bunch of stackoverflow / stackexchange / Kaggle threads [4]bunch of Medium posts

Editor's Notes

  1. Key processes needed for natural language processing
  2. Can be used for benchmark testing depending on the needed classes
  3. Pros: Easy to implement Cons: Outputs a large matrix in which most values are zeroes
  4. Pros: Easy since its basicaly just counting Cons: Sequence of words doesnt matter, which is largely erroneous assumption
  5. Provides the needed weight per word
  6. Sequential word embeddings - order of words matter thus providing semantics and syntax
  7. Vector space model
  8. Sample implementation result of word2vec in Python. The corpus text is initialized as corpus_raw. As you can see, the ‘daughter’ and ‘infant’ is somehow equally distanced to ‘prisoners’. ‘Kingdom’ is very closely related to ‘madam’ or the queen as told in the story
  9. Classified to a class with the highest probability of class | keyword
  10. A lazy classifier since only the distance determines the class
  11. Deep learning - uses neural network