1. UNIT - III
Revathi A
Assistant Professor
Dept of Computational Intelligence
SRM Institute of Science and Technology,
Kattankulathur
2. INTRODUCTION TO NLP
• Natural language processing (NLP) is a machine learning technology that gives computers the ability to
interpret, manipulate, and comprehend human language.
• Ex: Amazon’s Alexa and Apple’s Siri utilize NLP to listen to user queries and find answers.
• We have large volumes of voice and text data from various communication channels like emails, text
messages, social media newsfeeds, video, audio, and more.
• Organizations use NLP software to automatically process this data, analyze the intent or sentiment in the
message, and respond in real time to human communication.
• When text mining and machine learning are combined, automated text analysis becomes possible
3. PREPROCESSING STEPS IN NLP
• Data preprocessing involves preparing and cleaning text data so that machines can analyze it. This
can be done in following:
• Tokenization. Text is split into smaller units called tokens, such as words or sentences, that can be
processed individually. (In data security, the same term instead means substituting sensitive information,
such as credit card data, with a nonsensitive token.)
• Stop word removal. Common words are removed from the text, so unique words that offer the most
information about the text remain.
• Lemmatization and stemming. Both reduce words to a base form. Stemming strips affixes to produce a
stem -- "walking" is reduced to "walk" -- while lemmatization groups together different inflected versions
of the same word under its dictionary form, or lemma.
• Part-of-speech tagging. Words are tagged based on which part of speech they correspond to -- such
as nouns, verbs or adjectives.
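The preprocessing steps above can be sketched as a minimal pure-Python pipeline. Note that the stop-word list and lemma table below are toy illustrations for this sketch, not standard resources (NLTK ships real ones):

```python
# Minimal preprocessing sketch: tokenize, remove stop words, lemmatize.
STOP_WORDS = {"the", "is", "a", "an", "and", "to"}              # illustrative subset
LEMMAS = {"walking": "walk", "walked": "walk", "mice": "mouse"}  # toy lookup table

def preprocess(text):
    tokens = text.lower().split()                         # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization via lookup

print(preprocess("The cat is walking to the garden"))
# ['cat', 'walk', 'garden']
```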
4. NLP ALGORITHMS
• There are many different natural language processing algorithms, but two main types are commonly
used:
• Rule-based system. This system uses carefully designed linguistic rules. This was used early in the
development of NLP and is still used.
• Machine learning-based system. Machine learning algorithms use statistical methods. Using a
combination of machine learning, deep learning and neural networks, natural language processing
algorithms hone their own rules through repeated processing and learning.
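A rule-based system can be illustrated with a toy IF-THEN sentiment classifier. The keyword lists are illustrative assumptions; real rule-based NLP systems use far richer linguistic rules:

```python
# Rule-based sketch: hand-written IF-THEN rules for sentiment (illustrative).
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def rule_based_sentiment(text):
    words = set(text.lower().split())
    if words & POSITIVE and not words & NEGATIVE:   # IF positive cue THEN positive
        return "positive"
    if words & NEGATIVE and not words & POSITIVE:   # IF negative cue THEN negative
        return "negative"
    return "neutral"

print(rule_based_sentiment("great product, good value"))
# positive
```

A machine learning-based system would instead learn such decision boundaries from labeled examples rather than from hand-written rules.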
5. TECHNIQUES AND METHODS OF NATURAL LANGUAGE
PROCESSING
• Syntax and semantic analysis are two main techniques used in natural language processing.
• Syntax is the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess
meaning from a language based on grammatical rules. Syntax NLP techniques include the following:
• Parsing. This is the grammatical analysis of a sentence. Parsing involves breaking this sentence into
parts of speech .
• Word segmentation. This is the act of taking a string of text and deriving word forms from it. For
example, a person scans a handwritten document into a computer. The algorithm can analyze the page
and recognize that the words are divided by white spaces.
• Sentence breaking. This places sentence boundaries in large texts.
• Morphological segmentation. This divides words into smaller parts called morphemes. This is
especially useful in machine translation and speech recognition.
• Stemming. This reduces inflected words to their root forms.
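Stemming, for instance, can be sketched as naive suffix stripping. The suffix list and length guard are illustrative; real stemmers such as the Porter stemmer apply much more careful rules:

```python
# Naive suffix-stripping stemmer sketch (illustrative suffix list).
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        # Strip the suffix only if a reasonably long stem remains.
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["walking", "jumped", "watches", "cars", "is"]])
# ['walk', 'jump', 'watch', 'car', 'is']
```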
6. TECHNIQUES AND METHODS OF NATURAL LANGUAGE
PROCESSING
• Word sense disambiguation. This derives the meaning of a word based on context.
• Named entity recognition (NER). NER determines words that can be categorized into groups.
• Natural language generation (NLG). NLG uses a database to determine the semantics behind words
and generate new text.
7. WHAT IS NATURAL LANGUAGE PROCESSING USED FOR?
• Text classification.
• This function assigns tags to texts to put them in categories.
• Useful for sentiment analysis, which helps the natural language processing algorithm determine the
sentiment, or emotion, behind a text.
• Text extraction.
• This function automatically summarizes text and finds important pieces of data.
• Ex: keyword extraction, which pulls the most important words from the text, which can be useful
for search engine optimization.
• Machine translation.
• In this process, a computer translates text from one language, such as English, to another language,
such as French, without human intervention.
• Natural language generation.
• This process uses NLP to analyze unstructured data and automatically produce content based on
that data. Ex: GPT-3
8. UMBRELLA OF PROBLEMS
The functions listed above are used in a variety of real-world applications, including the following:
•Customer feedback analysis. Tools using AI can analyze social media reviews and filter out comments
and queries for a company.
•Customer service automation. Voice assistants on a customer service phone line can use speech
recognition to understand what the customer is saying, so that it can direct their call correctly.
•Automatic translation. Tools such as Google Translate, Bing Translator and Translate Me can translate
text, audio and documents into another language.
•Academic research and analysis. Tools using AI can analyze huge amounts of academic material and
research papers based on the metadata of the text as well as the text itself.
•Analysis and categorization of healthcare records. AI-based tools can use insights to predict and,
ideally, prevent disease.
9. UMBRELLA OF PROBLEMS
•Plagiarism detection. Tools such as Copyleaks and Grammarly use AI technology to scan documents and
detect text matches and plagiarism.
•Stock forecasting and insights into financial trading. NLP tools can analyze market history and annual
reports that contain comprehensive summaries of a company's financial performance.
•Talent recruitment in human resources. Organizations can use AI-based tools to reduce hiring time by
automating the candidate sourcing and screening process.
•Automation of routine litigation. AI-powered tools can do research, identify possible issues and
summarize cases faster than human attorneys.
•Spam detection. NLP-enabled tools can be used to classify text for language that's often used in spam
or phishing attempts. For example, AI-enabled tools can detect bad grammar, misspelled names, urgent
calls to action and threatening terms.
10. TEXT MINING
• Text mining software uses natural language processing (NLP) together with rule-based systems and
machine learning to discover hidden relationships, patterns and sentiment in text documents.
• Unstructured text is preprocessed using NLP. This preprocessing can include any of these steps:
Cleaning: Removing small words (a, an, the) and correcting misspellings.
Stemming: Reducing a word to its stem by removing prefixes and suffixes (“hire” is the stem
for both “hiring” and “hired,” for example).
Tokenizing: Dividing text into distinct words and phrases.
Tagging parts of speech: Identifying the parts of speech within text, such as nouns, verbs and
adjectives.
Parsing syntax: Analyzing the structure of sentences and phrases to determine the role of
different words. This identifies the subject, verb and object of a sentence.
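The cleaning step above can be sketched in a few lines. The misspelling table and the "small word" cutoff are illustrative assumptions:

```python
# Cleaning sketch: fix known misspellings, then drop small words (a, an, the...).
CORRECTIONS = {"teh": "the", "recieve": "receive"}   # toy misspelling table

def clean(text):
    words = [CORRECTIONS.get(w, w) for w in text.lower().split()]
    return [w for w in words if len(w) > 3]          # drop words of 3 letters or fewer

print(clean("Teh manager will recieve a report"))
# ['manager', 'will', 'receive', 'report']
```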
11. TEXT MINING - METHODS AND TECHNIQUES
There are different methods and techniques for text mining; the most frequent basic methods are given below.
Word frequency: used to identify the most recurrent terms or concepts in a set of data. This is
particularly useful when analyzing customer reviews, social media conversations or customer
feedback.
Ex: if the words expensive, overpriced and overrated frequently appear in your customer reviews, it
may indicate you need to adjust your prices.
Collocation: a sequence of words that commonly appear near each other. The
most common types of collocations are bigrams (a pair of words that are likely to go together, like get
started, save time or decision making) and trigrams (a combination of three words, like within
walking distance or keep in touch).
Identifying collocations — and counting them as one single word — improves the granularity of the
text, allows a better understanding of its semantic structure and, in the end, leads to more accurate text
mining results.
Concordance: used to recognize the particular context or instance in which a word or
set of words appears. We all know that the human language can be ambiguous: the same word can be
used in many different contexts. Analyzing the concordance of a word can help understand its exact
meaning based on context.
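The three methods above can be sketched with the standard library. The review snippets are made-up illustrations:

```python
from collections import Counter

reviews = [
    "too expensive and overrated",
    "expensive for what you get",
    "nice design but expensive",
]

# Word frequency: most recurrent terms across all reviews.
tokens = [w for r in reviews for w in r.split()]
print(Counter(tokens).most_common(1))   # [('expensive', 3)]

# Collocation: bigrams counted within each review.
bigrams = Counter()
for r in reviews:
    t = r.split()
    bigrams.update(zip(t, t[1:]))

# Concordance: each occurrence of a word with one word of context on each side.
def concordance(word, texts, window=1):
    hits = []
    for t in texts:
        w = t.split()
        for i, tok in enumerate(w):
            if tok == word:
                hits.append(" ".join(w[max(0, i - window): i + window + 1]))
    return hits

print(concordance("expensive", reviews))
# ['too expensive and', 'expensive for', 'but expensive']
```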
13. CLEANING TEXT DATA
Pre-processing and normalizing text
Popular pre-processing techniques to clean and normalize text:
○ Text tokenization and lower casing
○ Removing special characters
○ Contraction expansion
○ Removing stopwords
○ Correcting spellings
○ Stemming
○ Lemmatization
14. PREPROCESSING DATA USING TOKENIZATION
● Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called
tokens.
● These tokens are very useful for finding patterns and are considered as a base step for stemming and
lemmatization.
● The Natural Language Toolkit (NLTK) provides the nltk.tokenize module, which comprises sub-modules for
○ word tokenization (word_tokenize)
○ sentence tokenization (sent_tokenize)
● Depending on the task, we can define our own conditions to divide the input text into meaningful
tokens.
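As an example of defining our own conditions, a regular-expression tokenizer can keep contractions and decimal numbers together while splitting off punctuation. The pattern below is one illustrative choice, not NLTK's:

```python
import re

text = "Don't panic: it's $9.99, okay?"

# Custom rule: words (with an internal apostrophe), numbers (with an optional
# decimal part), or single punctuation marks each become one token.
pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]"
print(re.findall(pattern, text))
# ["Don't", 'panic', ':', "it's", '$', '9.99', ',', 'okay', '?']
```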
15. TOKENIZATION OF WORDS
● We use the method word_tokenize() to split a sentence into words.
● The output of word tokenization can be converted to a DataFrame for better text understanding in
machine learning applications.
● It can also be provided as input for further text cleaning steps such as punctuation removal, numeric
character removal or stemming.
● Machine learning models need numeric data to be trained and make a prediction.
● Word tokenization becomes a crucial part of the text (string) to numeric data conversion.
● from nltk.tokenize import word_tokenize
● text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
● print(word_tokenize(text))
Output: ['Trying', 'to', 'grow', 'up', 'is', 'hurting', '.', 'You', 'make', 'mistakes', '.', 'You', 'try', 'to',
'learn', 'from', 'them', ',', 'and', 'when', 'you', 'don', '’', 't', ',', 'it', 'hurts', 'even', 'more', '.']
16. TOKENIZATION OF SENTENCES
● Sub-module available for the above is sent_tokenize.
● Why is sentence tokenization needed when we already have word tokenization? Ex: to count
average words per sentence. This can be accomplished using the NLTK sentence tokenizer together
with the NLTK word tokenizer to calculate the ratio.
● Such output serves as an important feature for machine training as the answer would be numeric.
● from nltk.tokenize import sent_tokenize
● print(sent_tokenize(text))
Output: ['Trying to grow up is hurting.', 'You make mistakes.', 'You try to learn from them, and when you don’t, it hurts even
more.']
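The average-words-per-sentence feature can be sketched as follows. Naive punctuation-based splitting is used here so the snippet is self-contained; sent_tokenize and word_tokenize would be more robust on real text:

```python
import re

def avg_words_per_sentence(text):
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(sentences)

text = "Trying to grow up is hurting. You make mistakes. You try to learn."
print(avg_words_per_sentence(text))   # (6 + 3 + 4) / 3 sentences
```

Such a numeric output can serve directly as a feature for machine training.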
17. TAGGING AND CATEGORIZING WORDS
• Tagging is the process of classifying words into their parts of speech and labeling them accordingly;
this is known as part-of-speech (POS) tagging.
• The "word classes" such as nouns, verbs, adjectives, and adverbs are not just the idle invention of
grammarians, but are useful categories for many language processing tasks. They arise from simple
analysis of the distribution of words in text.
• Parts of speech are also known as word classes or lexical categories.
• The collection of tags used for a particular task is known as a tagset.
• POS tags are used to describe the lexical terms that we have within our text.
18. Methods:
● Rule-based: hand-written IF -> THEN rules
● Stochastic (probability-based): e.g., the Hidden Markov Model
Example: THE/DT FANS/NOUN WATCH/VERB THE/DT RACE/NOUN
19. PART OF SPEECH TAGGING
Example:
I/PRO LIKE/VERB HIS/PRO WATCH/NOUN
THE/DT MAN/NOUN FANS/VERB THE/DT FLAME/NOUN
THE/DT FANS/NOUN WATCH/VERB THE/DT RACE/NOUN
20. PART OF SPEECH TAGGING
Why?
● Feature in text modeling
● Autocomplete
● Word ambiguity resolution
21. USING A TAGGER
A tagger processes a sequence of words and attaches a part-of-speech tag to each word.
● import nltk
● from nltk.tokenize import word_tokenize
● text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
● word = word_tokenize(text)
● nltk.pos_tag(word)
[('Trying', 'VBG'), ('to', 'TO'), ('grow', 'VB'), ('up', 'RP'), ('is', 'VBZ'), ('hurting', 'VBG'), ('.', '.'), ('You', 'PRP'),
('make', 'VBP'), ('mistakes', 'NNS'), ('.', '.'), ('You', 'PRP'), ('try', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('from', 'IN'),
('them', 'PRP'), (',', ','), ('and', 'CC'), ('when', 'WRB'), ('you', 'PRP'), ('don', 'VBP'), ('’', 'JJ'), ('t', 'NN'), (',', ','), ('it',
'PRP'), ('hurts', 'VBZ'), ('even', 'RB'), ('more', 'RBR'), ('.', '.')]
Text-to-speech systems usually perform tagging.
22. USING A TAGGER
Example text with some homonyms:
● text = word_tokenize("They refuse to permit us to obtain the refuse permit")
● nltk.pos_tag(text)
● Output: [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'),
('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
● Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). E.g. refUSE is a
verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need
to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech
systems usually perform POS-tagging.)
24. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
• N-Grams are phrases cut out of a sentence with N consecutive words.
• Unigram takes a sentence and gives us all the words in that.
• A Bigram takes a sentence and gives us sets of two consecutive words in the sentence.
• A Trigram gives sets of three consecutive words in a sentence.
• Let me explain with an example.
• Unigram - [Let] [me] [explain] [with] [an] [example.]
• Bigram [let me] [me explain] [explain with] [with an] [an example]
• Trigram [let me explain] [me explain with] [explain with an] [with an example]
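The unigram, bigram and trigram examples above can be generated with one small helper:

```python
def ngrams(sentence, n):
    """Return all runs of n consecutive words from the sentence."""
    words = sentence.lower().split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]

sentence = "Let me explain with an example"
print(ngrams(sentence, 1))   # unigrams: one word each
print(ngrams(sentence, 2))   # bigrams: pairs of consecutive words
print(ngrams(sentence, 3))   # trigrams: triples of consecutive words
```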
25. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
• A sentence (W) is a sequence of words (w1, w2, …, wn) and the probability of the same can be
calculated as follows;
P(W) = P(w1, w2, …, wn)
• Also, the probability of an upcoming word, given the preceding word sequence, can be calculated:
P(wn | w1, w2, …, wn-1)
• The model that calculates either P(W) or P(wn | w1, w2, …, wn-1) is called the language model.
How to calculate P(w1, w2, …, wn)?
P(w1, w2, …, wn) is a joint probability.
Let us calculate the joint probability P(A, B) for two events A and B.
The joint probability can be calculated using the conditional probability as follows:
• Conditional probability: P(B | A) = P(A, B) / P(A)
• Product rule: P(A, B) = P(A) * P(B | A)
• Chain rule of probability: P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
• This can be generalized and used to calculate the joint probability of our word sequence:
P(w1, w2, …, wn) = P(w1) * P(w2 | w1) * … * P(wn | w1, w2, …, wn-1)
26. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
Chain rule of probability: P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
• This can be generalized and used to calculate the joint probability of our word sequence:
P(w1, w2, …, wn) = P(w1) * P(w2 | w1) * … * P(wn | w1, w2, …, wn-1)
• Ex: probability of the sentence “the prime minister of our country”.
• To calculate the component P(our | the prime minister of), measure its relative frequency count:
P(our | the prime minister of) = Count(the prime minister of our) / Count(the prime minister of)
This can be read as, "out of the number of times we saw ‘the prime minister of’ in a corpus, how
many times was it followed by the word ‘our’?"
27. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
Solved Example:
Training corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Vellore do engineering</s>
Test data:
<s> students are from Vellore </s>
As per the Bigram model, the test sentence can be expanded as follows to estimate the bigram probability;
P(<s> students are from Vellore </s>)
= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
To estimate bigram probabilities, we can use the following equation:
P(wn | wn-1) = Count(wn-1 wn) / Count(wn-1)
28. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
P(<s> students are from Vellore </s>)
= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
= 1/2 * 1/2 * 1/2 * 2/3 * 1/2 = 1/24 ≈ 0.0417
count of <s> = 4, count of string <s> students = 2
count of word students = 2, count of string students are = 1
count of word are = 2, count of string are from = 1
count of word from = 3, count of string from Vellore = 2
count of word Vellore = 2, count of string Vellore </s> = 1
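The solved example can be checked with a short script that collects unigram and bigram counts from the training corpus and multiplies the component bigram probabilities:

```python
from collections import Counter

# Training corpus from the slides, with sentence boundary markers.
corpus = [
    "<s> I am from Vellore </s>",
    "<s> I am a teacher </s>",
    "<s> students are good and are from various cities </s>",
    "<s> students from Vellore do engineering </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)                 # Count(w)
    bigrams.update(zip(toks, toks[1:]))   # Count(w1 w2)

def bigram_prob(sentence):
    """P(sentence) under the bigram model: product of Count(w1 w2) / Count(w1)."""
    toks = sentence.split()
    p = 1.0
    for w1, w2 in zip(toks, toks[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(bigram_prob("<s> students are from Vellore </s>"))
```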