2. Concept of NLP:-
● Computers can’t yet truly understand English in the way that
humans do – but thanks to AI and NLP, they are learning it
fast: they try to extract the meaning of a sentence and respond
accordingly.
● All AI technologies have one thing in common:
○ They break the problem up into very small pieces to simplify it
○ They reduce complexity by removing extra information
○ They use AI to solve each smaller piece separately
○ They tie the processed results together
○ Finally, they convert the processed result into numbers so that
the computer can understand it
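The steps above can be sketched in a few lines of Python. This is a hedged illustration, not any specific library’s pipeline; the sample text and the single stop word are made up for the demo:

```python
# Break text into small pieces, remove extra information,
# then convert the pieces to numbers via a vocabulary index.
text = "AI and NLP help computers process language"

tokens = text.lower().split()                # break into small pieces
tokens = [t for t in tokens if t != "and"]   # remove extra information

# Assign each distinct word a number (a simple vocabulary index).
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
numbers = [vocab[t] for t in tokens]         # convert to numbers
```

Each word is now represented by an integer the computer can work with.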
3. Corpus:-
● A corpus is a large and structured set of machine-readable
texts that have been produced in a natural communicative
setting.
OR
● A corpus can be defined as a collection of text documents. It
can be thought of as just a bunch of text files in a directory,
often alongside many other directories of text files
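Treating a corpus as “a bunch of text files in a directory” can be sketched in Python; the file names and contents below are invented purely for the demo:

```python
import tempfile
from pathlib import Path

def load_corpus(directory):
    # Read every .txt file in the directory into a list of documents.
    return [p.read_text(encoding="utf-8")
            for p in sorted(Path(directory).glob("*.txt"))]

# Demo: build a tiny two-document corpus in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    Path(d, "doc1.txt").write_text("Hello world.", encoding="utf-8")
    Path(d, "doc2.txt").write_text("NLP is fun.", encoding="utf-8")
    corpus = load_corpus(d)
```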
4. 1. Text Normalization :-
It comes under data processing.
It is the process of reducing the variations in a text’s word forms to a
common form when the variations mean the same thing.
Text normalization divides the text into smaller components called
tokens (usually the words in the text) and groups related tokens
together.
5. 2. Sentence Segmentation:-
Dividing the whole text (corpus) into individual sentences.
Before Sentence Segmentation:
“You want to see the dreams with close eyes and achieve them? They’ll
remain dreams, look for AIMs and your eyes have to stay open for a
change to be seen.”
After Sentence Segmentation:
1. You want to see the dreams with close eyes and achieve them?
2. They’ll remain dreams, look for AIMs and your eyes have to stay
open for a change to be seen.
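A minimal way to segment sentences is to split the text at sentence-ending punctuation. This regex sketch ignores harder cases such as abbreviations (“Dr.”, “e.g.”):

```python
import re

def segment_sentences(text):
    # Split at ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to "
          "stay open for a change to be seen.")
sentences = segment_sentences(corpus)
```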
6. 3. Tokenization:-
It is the process of splitting an individual sentence into smaller
units called tokens (a word, a phrase, a number, or a symbol).
TOKEN:- A token is a well-defined semantic unit inside a sentence that
contributes to the overall meaning of the sentence. A token may
represent a word, a phrase, a number, or a symbol.
Example: “Zain walked down four blocks to pick up ice cream.”
After tokenization (with part-of-speech tags):
Zain (Proper Noun) | walked (Verb) | down (Adv) | four (Num) |
blocks (Noun) | to (Part) | pick (Verb) | up (Adp) | ice (Noun) |
cream (Noun) | . (Punctuation)
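A simple tokenizer can be sketched with a regular expression that separates words from punctuation (this sketch does not assign part-of-speech tags):

```python
import re

def tokenize(sentence):
    # \w+ grabs each run of word characters;
    # [^\w\s] grabs each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("Zain walked down four blocks to pick up ice cream.")
```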
7. 4. Removal Of Stop words , Special Characters and Numbers:-
In this step, the tokens which are not necessary are removed from the
token list, to make it easier for the computer to focus on meaningful
terms.
Stopwords: Words in any language which do not add much meaning to a
sentence. They can safely be ignored without sacrificing the meaning
of the sentence.
Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.
1. You want to see the dreams with close eyes and achieve them?
● The removed tokens would be: to, the, and, ?
2. The outcome would be:
● You want see dreams with close eyes achieve them.
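Stop-word removal can be sketched as a list filter. The stop-word set below is just the example list from above, not a complete one:

```python
STOPWORDS = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(tokens):
    # Keep only alphabetic tokens that are not stop words; this also
    # drops numbers and special characters such as "?".
    return [t for t in tokens
            if t.isalpha() and t.lower() not in STOPWORDS]

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]
filtered = remove_stopwords(tokens)
```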
8. Converting text to a common case:-
We convert the whole text into a single case, preferably lower case.
This ensures that the machine’s case sensitivity does not treat the
same words as different just because they appear in different cases.
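Case folding is a one-liner in Python:

```python
# Different-cased variants of the same word collapse to one form.
words = ["Dream", "dream", "DREAM"]
lowered = [w.lower() for w in words]
```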
9. Stemming:-
● The process of extracting the root form of a word by removing its
affixes is known as stemming.
● The word extracted through stemming is called a stem.
Word      Affix   Stem
healing   -ing    heal
dreams    -s      dream
studies   -es     studi
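A crude suffix-stripping stemmer can be sketched as below. This is an illustration only, not the Porter algorithm; note that, like the table above, it produces non-words such as “studi”:

```python
def stem(word):
    # Strip a known suffix if enough of the word remains.
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "ies":
                return word[:-3] + "i"   # studies -> studi (not a word!)
            return word[:-len(suffix)]
    return word
```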
10. Lemmatization:-
Definition: In lemmatization, the word we get after affix removal (known as the
lemma) is a meaningful one; lemmatization therefore takes longer to execute
than stemming.
Lemma:- the base or root form of a word.
Word      Affix   Lemma
healing   -ing    heal
dreams    -s      dream
studies   -es     study
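Real lemmatizers (for example, the WordNet lemmatizer in NLTK) use a vocabulary and morphological analysis. A toy dictionary lookup is enough to illustrate the idea; the entries below are taken from the examples in this section:

```python
# Toy lemma dictionary -- an illustration, not a real lemmatizer.
LEMMAS = {"healing": "heal", "dreams": "dream",
          "studies": "study", "caring": "care"}

def lemmatize(word):
    # Look the word up; fall back to the word itself if unknown.
    return LEMMAS.get(word, word)
```

Unlike the stemmer above, every output here is a meaningful word.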
Difference between stemming and lemmatization:
1. Stemming: the stemmed word might not be meaningful (e.g. Caring ➔ Car).
2. Lemmatization: the lemma is always a meaningful word (e.g. Caring ➔ Care).