4. • NLP, or natural language processing, is a set of techniques used to
make human language interpretable by a machine. In essence, it is the
automatic manipulation of natural language, such as speech
and text, by software, so that it can be analysed further and the
necessary information extracted from it.
• Computational linguistics, or the rule-based modelling of
human language, is combined with statistical, machine
learning, and deep learning models to form NLP. These make it
possible for computers to process spoken or written language.
5. Predictive Text
POS tagging
E-mail filters
Smart Assistants
Sentiment analysis
Language Translation
Data Analysis
6. Tokenization serves as a method of dividing a word, a sentence, a paragraph, or an entire written
document into manageable pieces.
By doing so, we can pick out specific keywords or words. Tokens are the smaller,
individual units. Analyzing the words that are used in the text aids in interpreting the
meaning of the text.
Tokenization also lets us determine the text's word count. For instance, "The ball is very big"
becomes ["The", "ball", "is", "very", "big"].
7. Stemming is the process of stripping a word back to its root by removing
attached suffixes and prefixes. It operates by removing the beginning or end of
the word while taking into account a list of frequently occurring prefixes and
suffixes that can be found in an inflected word.
Why stemming:
A smaller input dimensionality benefits machine learning techniques.
It densifies the training data.
Shrinking the dictionary's size helps keep the document's wording more consistent.
Form | Suffix | Stem
Books | -s | Book
coins | -s | coin
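A minimal stemming sketch using NLTK's PorterStemmer (assuming nltk is installed); the example words match the table above:

    # Strip common suffixes with the Porter stemmer (assumes nltk is installed)
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["books", "coins", "running"]:
        print(word, "->", stemmer.stem(word))
    # books -> book, coins -> coin, running -> run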
8. Lemmatization helps to perform a morphological analysis of words.
It relies on detailed dictionaries which the algorithm can refer to in order to
link each inflected form back to its lemma.
Form | Morphological Information | Lemma
Sleeps | Third person singular, present tense | Sleep
Bowling | -ing form of the verb | Bowl
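A short lemmatization sketch using NLTK's WordNetLemmatizer (assuming nltk and the "wordnet" corpus are installed); the words come from the table above:

    # Map inflected forms back to their lemma (assumes nltk.download("wordnet"))
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # passing the part of speech ("v" = verb) helps pick the correct lemma
    print(lemmatizer.lemmatize("sleeps", pos="v"))   # sleep
    print(lemmatizer.lemmatize("bowling", pos="v"))  # bowl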
9. Topic | Stemming | Lemmatization
Goal | Reduce inflectional forms (stemming chops off the ends of words in the hope of achieving this goal correctly) | Reduce inflectional forms (lemmatization does this properly, with the help of a vocabulary and a morphological analysis of words)
Implementation | Stemmers are easier to implement and run faster when compared to lemmatizers | Lemmatization is slightly more difficult to implement
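A small sketch contrasting the two approaches on the same word (assuming nltk and the "wordnet" corpus are installed):

    # Stemming vs. lemmatization on the word "studies"
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    word = "studies"
    print(PorterStemmer().stem(word))                    # 'studi'  (crude chop)
    print(WordNetLemmatizer().lemmatize(word, pos="v"))  # 'study'  (dictionary lemma)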
10. Stop words are frequent words that appear in sentences but add little meaning of
their own.
Stop words serve as transitional elements and help guarantee proper grammar.
A stop word is, in essence, a word that is filtered out before processing
natural language data.
This pre-processing technique is widely used.
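A minimal stop-word removal sketch using NLTK's English stop-word list (assuming nltk and the "stopwords" corpus are installed):

    # Filter out stop words before further processing (assumes nltk.download("stopwords"))
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    tokens = ["the", "ball", "is", "very", "big"]
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['ball', 'big']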
12. The Bag of Words model is used to preprocess text or documents.
It turns a document into a collection of its words and keeps track of how many
times each word appears overall.
The bag-of-words technique is one of the most popular ways to turn tokens into a
set of features.
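A bag-of-words sketch using scikit-learn's CountVectorizer (assuming scikit-learn is installed; the two toy documents are only illustrative):

    # Turn documents into word-count vectors
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the ball is big", "the ball is very big"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)      # document-term count matrix
    print(vectorizer.get_feature_names_out())    # vocabulary learned from the docs
    print(counts.toarray())                      # per-document word counts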
13. TF-IDF stands for Term Frequency-Inverse Document Frequency.
It helps to calculate the scores used in information retrieval (IR) or summarization.
TF-IDF can also be used to determine how relevant a term is in a particular document.
TF-IDF combines two measures:
how frequently a word appears in a document, and its inverse document frequency
across a collection of documents.
14. A word's significance within the context of the document corpus can be determined
with the use of TF-IDF.
When calculating TF-IDF, the number of times a word appears in a document is taken
into account, offset by the number of documents in the corpus that contain the word.
TF is calculated by dividing the frequency of the term by the total number of terms
in the document.
IDF is calculated by taking the logarithm of the number of documents divided by the
number of documents containing the term.
The final score is the product of the two: TF-IDF(t, d) = TF(t, d) x IDF(t).
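A TF-IDF sketch using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the toy corpus is only illustrative):

    # Weight each term by how frequent it is in a document and how rare it is in the corpus
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the ball is big", "the sky is blue", "the ball is very big"]
    vectorizer = TfidfVectorizer()
    scores = vectorizer.fit_transform(docs)      # TF-IDF weight per term per document
    print(vectorizer.get_feature_names_out())
    print(scores.toarray().round(2))             # higher weight = more distinctive term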
15. Word embeddings are one of the most common
ways to encode words as vectors of numbers. Those
vectors can be fed into machine learning
models for inference, and they also help to establish the
distance between two tokens.
Types:
• Word2vec
• GloVe
• fastText
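A Word2Vec sketch using gensim (assuming gensim 4.x is installed); the tiny corpus below is only illustrative, as real embeddings require far more text:

    # Train toy word embeddings and compare two tokens
    from gensim.models import Word2Vec

    sentences = [["the", "ball", "is", "big"],
                 ["the", "ball", "is", "round"],
                 ["the", "sky", "is", "blue"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

    vector = model.wv["ball"]                  # 50-dimensional vector for "ball"
    print(model.wv.similarity("ball", "sky"))  # similarity (distance) between two tokens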
16. NLP tasks like lemmatization, stemming, tokenization, noun phrase extraction, POS
tagging, N-grams, and sentiment analysis are carried out using the open-source
Python module TextBlob.
Although it is quicker than NLTK, it does not include functions like dependency
parsing or vectorization.
TextBlob can be used for text classification and sentiment analysis.
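A short TextBlob sketch (assuming the textblob package and the NLTK corpora it depends on are installed):

    # Tokenization, POS tagging, and sentiment with TextBlob
    from textblob import TextBlob

    blob = TextBlob("TextBlob makes simple NLP tasks very easy.")
    print(blob.words)      # tokenization
    print(blob.tags)       # POS tagging
    print(blob.sentiment)  # Sentiment(polarity=..., subjectivity=...)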