NLP Techniques for Text Analysis

Interaction
Hello
How are you?
I am great; thanks for asking.
How was your day?

Chatbots
Do you remember the last chatbot you interacted with?
https://www.pandorabots.com/mitsuku/
chatterbot-corpus/chatterbot_corpus/data/english at master · gunthercox/chatterbot-corpus · GitHub

What is NLP?
Natural language processing (NLP) is an integral part of AI, computer science,
and linguistics. NLP is about making computers as capable as human beings at
understanding natural communication such as text and speech. It comprises two
major functions: human-to-machine translation and machine-to-human
translation.

Applications of NLP
• Email filters. Email filters are one of the most basic and
initial applications of NLP online.
• Smart assistants
• Search results
• Predictive text
• Language translation
• Digital phone calls
• Data analysis
• Text analytics

Steps towards NLP
Data Preprocessing
• Tokenization
• Stop Words Removal
• Stemming
• Lemmatization
Modelling Techniques
• Bag of Words
• TF-IDF
• Word Embeddings
• Sentiment Analysis

Tool Used - Python
Python is a high-level, interpreted, general-purpose
programming language.
Its design philosophy emphasizes code readability with the use
of significant indentation.

Python Library
• NumPy
• Pandas
• Matplotlib
• Seaborn
• NLTK

Art to read the data
Data preprocessing is a data mining
technique used to transform
raw data into a useful and efficient
format.
Demo -
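The demo itself is not reproduced in this text. As a stand-in, here is a minimal sketch of the kind of cleanup that preprocessing usually involves. The function name preprocess, the cleaning rules, and the sample string are assumptions made for illustration, not the original demo code.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase the text, drop URLs and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

# Hypothetical raw input, made up for this sketch
raw = "Hello!!  How are YOU?   Visit https://example.com today."
clean = preprocess(raw)
print(clean)
```

Which rules to apply depends on the task. Stripping punctuation, for example, may throw away signal that a sentiment model could use.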

Tokenization –
Tokenization is the process of splitting text into smaller units
called tokens, such as words, punctuation marks, or sentences. It is
usually the first preprocessing step, because every later step works
on tokens rather than on raw text.

• Word tokenization splits a sentence into words and punctuation marks.
• Sentence tokenization splits a paragraph into sentences.
• Tokens are the input to later steps such as stop word removal,
stemming, and lemmatization.

Note: in data security, “tokenization” (or “masking”, or “obfuscation”)
means replacing sensitive values such as PANs or Personally Identifiable
Information with surrogate tokens of the same length and format. That is
a different technique from the NLP tokenization used here.
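In the NLP sense, tokenization splits text into sentences and words. Here is a minimal sketch using only the standard library re module. The function names and the splitting rules are assumptions for illustration. NLTK, listed among the libraries above, provides more robust tokenizers.

```python
import re

def simple_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence: str) -> list[str]:
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

text = "How are you? I am great; thanks for asking."
sentences = simple_sent_tokenize(text)
tokens = simple_word_tokenize(sentences[1])
print(sentences)
print(tokens)
```

The sentence splitter breaks on abbreviations such as "Dr. Smith", which is one reason library tokenizers are preferred in practice.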

Stemming –
Stemming is the process of reducing a word to its word
stem by stripping affixes (suffixes and prefixes). Unlike
a lemma, the stem need not be a real dictionary word.
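To show the idea of chopping suffixes, here is a toy stemmer. It is a sketch under stated assumptions: the suffix list and the minimum stem length are made up for illustration, and this is not the Porter algorithm or any library's stemmer.

```python
# Illustrative suffix list, checked in this order (an assumption of this sketch)
SUFFIXES = ("ing", "es", "ed", "ly", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["fishing", "fishes", "fish", "Caring"]]
print(stems)
```

The last word shows the weakness of blind chopping: the stem is no longer related to the original meaning.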

Advantages of Stemming
• Stemming is a useful "normalization" technique for words.
• Stemming is used in information retrieval systems like search engines.
• It is used to determine domain vocabularies in domain analysis.
• Stemming is fast because it only chops characters off words.

Fun Fact -
• Google Search adopted word stemming in 2003.
Previously, a search for “fish” would not have returned
“fishing” or “fishes”.

Lemmatization –
Lemmatization is a text normalization technique used
in Natural Language Processing (NLP). Essentially,
lemmatization reduces any form of a word to its base
(dictionary) form, called the lemma.
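Lemmatization can be pictured as a dictionary look-up. In this minimal sketch the table LEMMA_TABLE is a tiny hand-written assumption. Real lemmatizers, such as the one in NLTK, rely on a full lexical database and often on the word's part of speech.

```python
# Tiny hand-written look-up table (hypothetical, for illustration only)
LEMMA_TABLE = {
    "caring": "care",
    "cared": "care",
    "cares": "care",
    "mice": "mouse",
    "was": "be",
}

def lemmatize(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

lemmas = [lemmatize(w) for w in ["Caring", "mice", "was", "fish"]]
print(lemmas)
```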

Difference
Stemming removes the last few characters from a word,
often producing incorrect meanings and spellings.
Lemmatization considers the context and converts the
word to its meaningful base form, which is called the lemma.

Stemming vs Lemmatization
Stemming
• Stemming removes the last few characters from a word, often
producing incorrect meanings and spellings.
• For instance, a naive suffix-stripping stemmer would reduce
‘Caring’ to ‘Car’.
• Stemming is used on large datasets where performance is an issue.
• It is faster to process.
Lemmatization
• Lemmatization considers the context and converts the word to
its meaningful base form, which is called the lemma.
• For instance, lemmatizing the word ‘Caring’ would return ‘Care’.
• Lemmatization is computationally expensive since it involves
look-up tables and analysis of how the word is used.
• It is slower.

Stop Words –
Stop words are a set of commonly used words in a language.
Examples of stop words in English are “a”, “the”, “is”, and “are”.
In Text Mining and Natural Language Processing (NLP), stop words
are removed because they are so common that they carry very
little useful information.
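Stop word removal is a simple filter over tokens. Here is a minimal sketch. The STOP_WORDS set is a small hand-picked assumption, whereas libraries such as NLTK ship much longer lists.

```python
import re

# Small hand-picked stop word set (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "is", "are", "be", "can", "i", "so", "for"}

def remove_stop_words(text: str) -> list[str]:
    """Tokenize on letters and drop tokens found in STOP_WORDS."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept_1 = remove_stop_words("Can Listening be exhausting ?")
kept_2 = remove_stop_words("I like Teaching, so I teach")
print(kept_1)
print(kept_2)
```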

Stop Words Example
Sample Text with Stop Words | Sample Text without Stop Words
Aarush Coaching Classes – A stem learning place for kids | Aarush Coaching Classes, Stem, Learning, Place, kids
Can Listening be exhausting ? | Listening, Exhausting
I like Teaching, so I teach | Like, Teaching, Teach

Modelling Techniques in NLP
Bag of Words
TF-IDF
Word Embeddings
Sentiment Analysis

Bag of Words
A bag-of-words is a representation of text that
describes the occurrence of words within a
document. It involves two things:
• A vocabulary of known words.
• A measure of the presence of known words.

Bag of Words - Example
The bag-of-words model is an
orderless document representation:
only the counts of words matter. For
instance, for the text "John
likes to watch movies. Mary likes
movies too", the bag-of-words
representation will not reveal that the
verb "likes" always follows a person's
name in this text.
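The "John likes to watch movies" text can be turned into a bag of words in a few lines. This is a minimal sketch with the standard library. The lowercasing and the letters-only tokenizer are assumptions of the sketch.

```python
import re
from collections import Counter

text = "John likes to watch movies. Mary likes movies too"

# Tokenize: lowercase and keep runs of letters only
tokens = re.findall(r"[a-z]+", text.lower())

# The bag: word -> count, with word order discarded
bag = Counter(tokens)

# A fixed vocabulary turns the bag into a count vector
vocabulary = sorted(bag)
vector = [bag[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

The vector records that "likes" and "movies" each occur twice, but nothing about where they occur.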

TF-IDF
TF-IDF, short for term frequency–inverse
document frequency, is a numerical statistic
intended to reflect how important a word is to
a document in a collection or corpus.

TF-IDF Explanation
• TF-IDF is the product of two values, TF and IDF.
• TF is the number of times the term occurs in the document,
divided by the total number of terms in the document.
• IDF is obtained by dividing the total number of
documents by the number of documents containing the
term and then taking the logarithm of that quotient.
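The three bullets above translate almost directly into code. This is a minimal sketch. The three tiny documents are made up for illustration, the natural logarithm is an assumption (the base is not specified above), and library implementations often use smoothed variants of IDF.

```python
import math

# Made-up corpus: each document is already a list of tokens
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term: str, doc: list[str]) -> float:
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    """log(total documents / documents containing the term)."""
    containing = sum(1 for d in corpus if term in d)
    if containing == 0:
        return 0.0  # choice made in this sketch for unseen terms
    return math.log(len(corpus) / containing)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)

score_the = tf_idf("the", docs[0], docs)  # appears in every document
score_dog = tf_idf("dog", docs[1], docs)  # appears in one document only
print(score_the, score_dog)
```

A word that appears in every document gets an IDF of log(1) = 0, so it contributes nothing, while a rare word gets a higher score.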

That's it 😃! The text is now ready to feed into a machine learning
algorithm.

Word Embeddings
A word embedding is a learned representation for text
where words with similar meanings have similar
representations, typically vectors of real numbers.
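"Similar representation" is usually measured with cosine similarity between word vectors. In this minimal sketch the three vectors are hand-made toy values, NOT learned embeddings. Real vectors come from models such as Word2vec and have far more dimensions.

```python
import math

# Hand-made toy vectors (assumed values, not learned from data)
EMBEDDINGS = {
    "movie": [0.9, 0.8, 0.1],
    "film":  [0.9, 0.7, 0.2],
    "fish":  [0.1, 0.2, 0.9],
}

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(EMBEDDINGS["movie"], EMBEDDINGS["film"])
sim_far = cosine_similarity(EMBEDDINGS["movie"], EMBEDDINGS["fish"])
print(sim_close, sim_far)
```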

Word Embeddings Types
• Word2vec
• GloVe
• fastText

Sentiment Analysis
Sentiment analysis, also referred to as opinion mining, is an approach to
natural language processing (NLP) that identifies the emotional tone
behind a body of text.
“I really like the new design of your website!” → Positive
“The new design is awful!” → Negative
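The simplest way to get labels like the ones above is to count positive and negative words. Here is a minimal sketch. The two word lists are tiny hand-picked assumptions, and this lexicon approach is only one option (trained classifiers are another).

```python
import re

# Tiny hand-picked lexicons (assumptions for illustration only)
POSITIVE = {"like", "love", "great", "good"}
NEGATIVE = {"awful", "bad", "hate", "terrible"}

def sentiment(text: str) -> str:
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

label_1 = sentiment("I really like the new design of your website!")
label_2 = sentiment("The new design is awful!")
print(label_1, label_2)
```

Word counting ignores negation ("not good") and sarcasm, which is why real systems are usually trained on labelled data.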

Reference:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Advantages of NLP
• Less costly than employing human staff
• Provides quicker customer service response times
• Easy to implement

Adieu in NLP Style
Connect with me:
https://github.com/lipika-tech
https://www.youtube.com/c/aarushcoachingclasses
