SlideShare a Scribd company logo
1 of 16
NLP WITH NAÏVE BAYES
CLASSIFIER
WHAT IS TEXT CLASSIFICATION?
• Text Classification is an automated process of classification of text into predefined
categories. We can classify Emails into spam or non-spam, news articles into different
categories like Politics, Stock Market, Sports, etc.
• This can be done with the help of Natural Language Processing and different
Classification Algorithms like Naive Bayes, SVM and even Neural Networks in Python.
WHAT IS NATURAL LANGUAGE PROCESSING?
• Short for natural language processing, NLP is a branch of artificial intelligence which is
focused on the enabling the computers to understand and interpret the human language.
The problem with interpreting the human language is that it is not a set of rules or binary
data that can be fed into the system and understanding the context of a conversation or
reading between the lines is altogether a different ball game.
• However, with the recent advancement in Machine Learning, Deep Learning with the
help of Neural Networks and easy to use models in python has opened the doors for us
to code our way into making computers understand the complex human Language.
DATASET
• Dataset
• The datasets contains a set of columns which have been divided into a training and a
test set. The training set contains a target column identifying whether the thing is noun
or verb
• Our job is to create a ML model to predict whether the test belong to a Noun , or verb or
any else, in the form of N or V .
KEY OBJECTIVES
• This notebook explains the concepts of NLP with respect to this current competition.
NLP is the field of study that focuses on the interactions between human language and
computers. NLP sits at the intersection of computer science, artificial intelligence, and
computational
linguistics[[source](https://en.wikipedia.org/wiki/Natural_language_processing)]. NLP is
a way for computers to analyze, understand, and derive meaning from human language
in a smart and useful way.
TEXT DATA PREPROCESSING
• 1. Data Cleaning
• Before we start with any NLP project we need to pre-process the data to get it all in a consistent
format.We need to clean, tokenize and convert our data into a matrix. Some of the basic text pre-
processing techniques includes:
• Make text all lower case or uppercase so that the algorithm does not treat the same words in different
cases as different
• Removing Noise i.e everything that isn’t in a standard number or letter i.e Punctuation, Numerical values,
common non-sensical text (/n)
• Tokenization: Tokenization is just the term used to describe the process of converting the normal text
strings into a list of tokens i.e words that we actually want. Sentence tokenizer can be used to find the list
of sentences and Word tokenizer can be used to find the list of words in strings.
• Stopword Removal: Sometimes, some extremely common words which would appear to be of little value
in helping select documents matching a user need are excluded from the vocabulary entirely. These
words are called stop words
MORE DATA CLEANING STEPS AFTER
TOKENIZATION:
• Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem,
base or root form — generally a written word form. Example if we were to stem the following words:
“Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result would be a single word “stem”.
• Lemmatization: A slight variant of stemming is lemmatization. The major difference between these is,
that, stemming can often create non-existent words, whereas lemmas are actual words. So, your root
stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you
can look up a lemma. Examples of Lemmatization are that “run” is a base form for words like “running”
or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same.
• Parts of speech tagging
• Create bi-grams or tri-grams And more...
TOKENIZATION
2. Tokenization
• Tokenization is a process that splits an input sequence into so-called tokens where the tokens
can be a word, sentence, paragraph etc. Base upon the type of tokens we want, tokenization
can be of various types, for instance
3. Stopwords Removal
• Now, let's get rid of the stopwords i.e words which occur very frequently but have no possible
value like a, an, the, are etc.
4. Token normalization
• Token normalisation means converting different tokens to their base forms. This can be done
either by:
• Stemming : removing and replacing suffixes to get to the root form of the word, which is called
the stem for instance cats - cat, wolves - wolv
• Lemmatization : Returns the base or dictionary form of a word, which is known as the lemma
TRANSFORMING TOKENS TO A VECTO
• Bag of Words
• The bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things:
• A vocabulary of known words.
• A measure of the presence of known words.
• Why is it is called a “bag” of words? That is because any information about the order or
structure of words in the document is discarded and the model is only concerned with
whether the known words occur in the document, not where they occur in the document.
• For example,
FOR EXAMPLE,
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx
NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx

More Related Content

Similar to NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx

INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for developmentAravind Reddy
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for developmentAravind Reddy
 
Big data
Big dataBig data
Big dataIshucs
 
An Overview of Natural Language Processing.pptx
An Overview of Natural Language Processing.pptxAn Overview of Natural Language Processing.pptx
An Overview of Natural Language Processing.pptxSoftxai
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingdhruv_chaudhari
 
Tokenization and how to use it from scratch
Tokenization and how to use it from scratchTokenization and how to use it from scratch
Tokenization and how to use it from scratchMahmoud Yasser
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining Rupak Roy
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
What is Text Analysis?
What is Text Analysis?What is Text Analysis?
What is Text Analysis?Ducat India
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingVeenaSKumar2
 
Natural Language Processing.pptx
Natural Language Processing.pptxNatural Language Processing.pptx
Natural Language Processing.pptxPriyadharshiniG41
 

Similar to NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx (20)

NLP todo
NLP todoNLP todo
NLP todo
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
Natural Language Processing for development
Natural Language Processing for developmentNatural Language Processing for development
Natural Language Processing for development
 
NLP.pptx
NLP.pptxNLP.pptx
NLP.pptx
 
Big data
Big dataBig data
Big data
 
An Overview of Natural Language Processing.pptx
An Overview of Natural Language Processing.pptxAn Overview of Natural Language Processing.pptx
An Overview of Natural Language Processing.pptx
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Tokenization and how to use it from scratch
Tokenization and how to use it from scratchTokenization and how to use it from scratch
Tokenization and how to use it from scratch
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
Textmining
TextminingTextmining
Textmining
 
What is Text Analysis?
What is Text Analysis?What is Text Analysis?
What is Text Analysis?
 
Ir 03
Ir   03Ir   03
Ir 03
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
Natural Language Processing.pptx
Natural Language Processing.pptxNatural Language Processing.pptx
Natural Language Processing.pptx
 

Recently uploaded

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 

Recently uploaded (20)

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 

NLP WITH NAÏVE BAYES CLASSIFIER (1).pptx

  • 1. NLP WITH NAÏVE BAYES CLASSIFIER
  • 2. WHAT IS TEXT CLASSIFICATION? • Text Classification is an automated process of classification of text into predefined categories. We can classify Emails into spam or non-spam, news articles into different categories like Politics, Stock Market, Sports, etc. • This can be done with the help of Natural Language Processing and different Classification Algorithms like Naive Bayes, SVM and even Neural Networks in Python.
  • 3. WHAT IS NATURAL LANGUAGE PROCESSING? • Short for natural language processing, NLP is a branch of artificial intelligence which is focused on the enabling the computers to understand and interpret the human language. The problem with interpreting the human language is that it is not a set of rules or binary data that can be fed into the system and understanding the context of a conversation or reading between the lines is altogether a different ball game. • However, with the recent advancement in Machine Learning, Deep Learning with the help of Neural Networks and easy to use models in python has opened the doors for us to code our way into making computers understand the complex human Language.
  • 4. DATASET • Dataset • The datasets contains a set of columns which have been divided into a training and a test set. The training set contains a target column identifying whether the thing is noun or verb • Our job is to create a ML model to predict whether the test belong to a Noun , or verb or any else, in the form of N or V .
  • 5. KEY OBJECTIVES • This notebook explains the concepts of NLP with respect to this current competition. NLP is the field of study that focuses on the interactions between human language and computers. NLP sits at the intersection of computer science, artificial intelligence, and computational linguistics[[source](https://en.wikipedia.org/wiki/Natural_language_processing)]. NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way.
  • 6. TEXT DATA PREPROCESSING • 1. Data Cleaning • Before we start with any NLP project we need to pre-process the data to get it all in a consistent format.We need to clean, tokenize and convert our data into a matrix. Some of the basic text pre- processing techniques includes: • Make text all lower case or uppercase so that the algorithm does not treat the same words in different cases as different • Removing Noise i.e everything that isn’t in a standard number or letter i.e Punctuation, Numerical values, common non-sensical text (/n) • Tokenization: Tokenization is just the term used to describe the process of converting the normal text strings into a list of tokens i.e words that we actually want. Sentence tokenizer can be used to find the list of sentences and Word tokenizer can be used to find the list of words in strings. • Stopword Removal: Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words
  • 7. MORE DATA CLEANING STEPS AFTER TOKENIZATION: • Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. Example if we were to stem the following words: “Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result would be a single word “stem”. • Lemmatization: A slight variant of stemming is lemmatization. The major difference between these is, that, stemming can often create non-existent words, whereas lemmas are actual words. So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma. Examples of Lemmatization are that “run” is a base form for words like “running” or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same. • Parts of speech tagging • Create bi-grams or tri-grams And more...
  • 8. TOKENIZATION 2. Tokenization • Tokenization is a process that splits an input sequence into so-called tokens where the tokens can be a word, sentence, paragraph etc. Base upon the type of tokens we want, tokenization can be of various types, for instance 3. Stopwords Removal • Now, let's get rid of the stopwords i.e words which occur very frequently but have no possible value like a, an, the, are etc. 4. Token normalization • Token normalisation means converting different tokens to their base forms. This can be done either by: • Stemming : removing and replacing suffixes to get to the root form of the word, which is called the stem for instance cats - cat, wolves - wolv • Lemmatization : Returns the base or dictionary form of a word, which is known as the lemma
  • 9. TRANSFORMING TOKENS TO A VECTO • Bag of Words • The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: • A vocabulary of known words. • A measure of the presence of known words. • Why is it is called a “bag” of words? That is because any information about the order or structure of words in the document is discarded and the model is only concerned with whether the known words occur in the document, not where they occur in the document. • For example,