3. Chatbots
Do you remember the last chatbot you interacted with?
https://www.pandorabots.com/mitsuku/
chatterbot-corpus/chatterbot_corpus/data/english at master · gunthercox/chatterbot-corpus · GitHub
4. What is NLP?
Natural language processing (NLP) is a field at the intersection of AI, computer
science, and linguistics. NLP is about enabling computers to understand natural
language, such as text and speech, as well as human beings do. It comprises
two major functions: human-to-machine translation and machine-to-human
translation.
5. Applications of NLP
•Email filters. Email filters are one of the most basic and
earliest applications of NLP online. ...
•Smart assistants. ...
•Search results. ...
•Predictive text. ...
•Language translation. ...
•Digital phone calls. ...
•Data analysis. ...
•Text analytics.
7. Tool Used - Python
Python is a high-level, interpreted, general-purpose
programming language.
Its design philosophy emphasizes code readability with the use
of significant indentation.
9. Art to read the data
Data preprocessing is a data mining
technique used to transform
raw data into a useful and efficient
format.
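A minimal sketch of text preprocessing, assuming the raw data is a plain English string. The function name `preprocess` and the chosen steps (lowercasing, removing punctuation, collapsing whitespace) are illustrative assumptions, not the steps from the original demo.

```python
import re
import string


def preprocess(text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


raw = "  Hello, World!   NLP is FUN...  "
print(preprocess(raw))  # hello world nlp is fun
```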
Demo -
10. Tokenization –
Tokenization, in NLP, is the process of splitting text into smaller
units called tokens, such as words, sub-words, or sentences.
The term has a different meaning in data security, where sensitive
data elements such as PANs or Personally Identifiable Information are
replaced by surrogate values, also called tokens, with the same length
and format as the original data. This is a form of format-preserving
data protection, sometimes called “masking” or “obfuscation”.
11. In the data-security sense:
•Tokens share some characteristics with the original data elements, such
as format and length.
•Each data element is mapped to a unique token.
•Tokens are deterministic: repeatedly generating a token for a given
value yields the same token.
•A tokenized database can be searched by tokenizing the query terms
and searching for those.
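In the NLP sense used in the rest of this deck, tokenization splits text into word tokens. A minimal sketch using only Python's standard `re` module; the function name `tokenize` and the rule "a token is a run of letters, digits, or apostrophes" are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("I like teaching, so I teach."))
# ['i', 'like', 'teaching', 'so', 'i', 'teach']
```

Libraries such as NLTK or spaCy provide more careful tokenizers that handle abbreviations, hyphens, and other languages.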
13. Stemming –
Stemming is the process of reducing a word to its word
stem by stripping affixes such as suffixes and prefixes. The stem
is not necessarily a valid word, unlike the dictionary form of a
word, which is known as a lemma.
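A toy suffix-stripping stemmer, to show the idea only. This is not the Porter algorithm or any library's stemmer; the suffix list and the name `toy_stem` are assumptions made for this sketch.

```python
# Suffixes are checked in this order; the list is deliberately tiny.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word: str) -> str:
    """Chop the first matching suffix off a word, without any dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["fishing", "fishes", "fish", "caring"]:
    print(w, "->", toy_stem(w))
# fishing -> fish
# fishes -> fish
# fish -> fish
# caring -> car
```

The last line shows the typical weakness of stemming: "car" is not the base form of "caring".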
14. Advantage of Stemming
• Stemming is a useful "normalization" technique for words
• Stemming is used in information retrieval systems like search engines.
• It is used to determine domain vocabularies in domain analysis.
• Stemming is fast because it simply chops off word endings, with no dictionary look-up
15. Fun Fact -
• Google Search adopted word stemming in 2003.
Previously, a search for “fish” would not have returned
“fishing” or “fishes”.
17. Lemmatization –
Lemmatization is a text normalization technique used
in Natural Language Processing (NLP). Essentially,
lemmatization converts any form of a word to its
base (dictionary) form, the lemma.
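A minimal sketch of dictionary-based lemmatization. The look-up table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and hold only a handful of entries; real lemmatizers rely on full dictionaries and usually on part-of-speech information.

```python
# Hypothetical, tiny look-up table used only for illustration.
LEMMA_TABLE = {
    "caring": "care",
    "fishes": "fish",
    "better": "good",
    "was": "be",
    "mice": "mouse",
}


def toy_lemmatize(word: str) -> str:
    """Return the lemma if the word is in the table, else the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["Caring", "mice", "was", "website"]:
    print(w, "->", toy_lemmatize(w))
# Caring -> care
# mice -> mouse
# was -> be
# website -> website
```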
18. Difference
Stemming removes the last few characters from a word,
often producing incorrect meanings and spellings.
Lemmatization considers the context and converts the
word to its meaningful base form, which is called the lemma.
19. Stemming vs Lemmatization
Stemming
• Stemming removes the last few
characters from a word, often
producing incorrect meanings
and spellings.
• For instance, stemming the word
‘Caring’ would return ‘Car’.
• Stemming is used in case of large
dataset where performance is an
issue.
• It is faster to process
Lemmatization
• Lemmatization considers the
context and converts the word to
its meaningful base form, which is
called the lemma.
• For instance, lemmatizing the word
‘Caring’ would return ‘Care’.
• Lemmatization is computationally
expensive since it relies on dictionary
look-up tables.
• It is slower
21. Stop Words–
Stop words are a set of commonly used words in a language.
Examples of stop words in English are “a”, “the”, “is”, and “are”.
Stop word lists are used in text mining and Natural
Language Processing (NLP) to filter out words that are so
common that they carry very little useful information.
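A minimal sketch of stop word removal. The set `STOP_WORDS` is a short, hypothetical list chosen for this example; libraries such as NLTK ship much longer lists. The input sentences come from the stop words example slide in this deck.

```python
import re

# Hypothetical, deliberately short stop word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "be", "can",
              "i", "so", "for", "to", "of", "and"}


def remove_stop_words(text: str) -> list[str]:
    """Tokenize the text and drop every token found in STOP_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words("I like Teaching, so I teach"))
# ['like', 'teaching', 'teach']
print(remove_stop_words("Can Listening be exhausting ?"))
# ['listening', 'exhausting']
```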
22. Stop Words Example
Sample text with stop words → Sample text without stop words
Aarush Coaching Classes – A stem learning place for kids → Aarush Coaching Classes, Stem, Learning, Place, kids
Can Listening be exhausting? → Listening, Exhausting
I like Teaching, so I teach → Like, Teaching, Teach
25. Bag of Words
A bag-of-words is a representation of text that
describes the occurrence of words within a
document. It involves two things: a vocabulary of
known words, and a measure of the presence of
those known words.
26. The bag-of-words model is an
orderless document representation:
only the counts of words matter. For
instance, for the text "John
likes to watch movies. Mary likes
movies too", the bag-of-words
representation will not reveal that the
verb "likes" always follows a person's
name in this text.
Bag of Words - Example
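A minimal sketch of a bag-of-words representation for the "John likes to watch movies" text, using only the standard library. Treating each sentence as its own document, and the names `tokenize`, `vocabulary`, and `bag_of_words`, are assumptions made for illustration.

```python
import re
from collections import Counter

docs = ["John likes to watch movies.", "Mary likes movies too."]


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# The vocabulary of known words, in alphabetical order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(text: str) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


print(vocabulary)
# ['john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
print(bag_of_words(docs[0]))  # [1, 1, 0, 1, 1, 0, 1]
print(bag_of_words(docs[1]))  # [0, 1, 1, 1, 0, 1, 0]

# Word order is lost: a shuffled sentence gives the same vector.
print(bag_of_words("movies watch to likes John") == bag_of_words(docs[0]))  # True
```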
27. TF-IDF
TF-IDF, short for term frequency–inverse
document frequency, is a numerical statistic that
is intended to reflect how important a word is to
a document in a collection or corpus.
28. TF-IDF Explanation
• TF-IDF is the product of two values, TF and IDF
• TF is the frequency of a term in a document divided by the
total number of terms in that document
• IDF is obtained by dividing the total number of
documents by the number of documents containing the
term and then taking the logarithm of that quotient.
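The definitions above can be sketched directly in Python. The three tokenized documents are made up for illustration, the function names `tf`, `idf`, and `tf_idf` are assumptions, and the natural logarithm is used because the slide does not name a base. Libraries such as scikit-learn use smoothed variants that give different numbers.

```python
import math

# Hypothetical, already tokenized corpus.
docs = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["mary", "also", "likes", "football"],
]


def tf(term: str, doc: list[str]) -> float:
    """Frequency of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: list[list[str]]) -> float:
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen term gets weight 0
    return math.log(len(corpus) / containing)


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF is the product of TF and IDF."""
    return tf(term, doc) * idf(term, corpus)


# "likes" occurs in every document, so its IDF and TF-IDF are 0.
print(tf_idf("likes", docs[0], docs))  # 0.0
# "john" occurs in one document only, so it scores higher there.
print(round(tf_idf("john", docs[0], docs), 4))  # 0.2197
```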
37. Sentiment Analysis
Sentiment analysis, also referred to as opinion mining, is an approach to
natural language processing (NLP) that identifies the emotional tone
behind a body of text.
“I really like the new design of your website!” → Positive
“The new design is awful!” → Negative
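A toy lexicon-based sentiment classifier that reproduces the two labels above. The word lists `POSITIVE` and `NEGATIVE` and the function `sentiment` are hypothetical; the sketch ignores negation, sarcasm, and intensity, which real sentiment models have to handle.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"like", "love", "great", "good"}
NEGATIVE = {"awful", "bad", "hate", "terrible"}


def sentiment(text: str) -> str:
    """Label text by counting positive words minus negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


print(sentiment("I really like the new design of your website!"))  # Positive
print(sentiment("The new design is awful!"))  # Negative
```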