UNIT - III
Revathi A
Assistant Professor
Dept of Computational Intelligence
SRM Institute of Science and Technology,
Kattankulathur
INTRODUCTION TO NLP
• Natural language processing (NLP) is a machine learning technology that gives computers the ability to
interpret, manipulate, and comprehend human language.
•Ex: Amazon’s Alexa and Apple’s Siri utilize NLP to listen to user queries and find answers
• Organizations have large volumes of voice and text data from various communication channels like emails, text
messages, social media newsfeeds, video, audio, and more.
• They use NLP software to automatically process this data, analyze the intent or sentiment in the
message, and respond in real time to human communication.
• When text mining and machine learning are combined, automated text analysis becomes possible
PREPROCESSING STEPS IN NLP
• Data preprocessing involves preparing and cleaning text data so that machines can analyze it. Common
steps include the following:
• Tokenization. Text is broken into smaller units called tokens, such as words, punctuation marks or
sentences. Later steps operate on these tokens.
• Stop word removal. Common words are removed from the text, so unique words that offer the most
information about the text remain.
• Lemmatization and stemming. Both reduce different inflected versions of the same word to a common
base. Stemming strips affixes by rule, so "walking" is reduced to the stem "walk". Lemmatization uses
vocabulary and morphology to return the dictionary form (lemma), so "better" maps to "good".
• Part-of-speech tagging. Words are tagged based on which part of speech they correspond to -- such
as nouns, verbs or adjectives.
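The first three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the stop-word list and the suffix-stripping rule are assumptions made up for the example, not what a library such as NLTK actually does.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem what remains."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

result = preprocess("The dogs are walking to the park")
print(result)
```

A rule this simple will mis-stem many words, which is one reason practical pipelines rely on tested stemmers and lemmatizers.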
PREPROCESSING STEPS IN NLP
• There are many different natural language processing algorithms, but two main types are commonly
used:
• Rule-based system. This system uses carefully designed linguistic rules. This was used early in the
development of NLP and is still used.
• Machine learning-based system. Machine learning algorithms use statistical methods. Using a
combination of machine learning, deep learning and neural networks, natural language processing
algorithms hone their own rules through repeated processing and learning.
TECHNIQUES AND METHODS OF NATURAL LANGUAGE
PROCESSING
• Syntax and semantic analysis are two main techniques used in natural language processing.
• Syntax is the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess
meaning from a language based on grammatical rules. Syntax NLP techniques include the following:
• Parsing. This is the grammatical analysis of a sentence. Parsing involves breaking a sentence into its
constituent parts, such as parts of speech.
• Word segmentation. This is the act of taking a string of text and deriving word forms from it. For
example, a person scans a handwritten document into a computer. The algorithm can analyze the page
and recognize that the words are divided by white spaces.
• Sentence breaking. This places sentence boundaries in large texts.
• Morphological segmentation. This divides words into smaller parts called morphemes. This is
especially useful in machine translation and speech recognition.
• Stemming. This reduces inflected words to their root forms.
TECHNIQUES AND METHODS OF NATURAL LANGUAGE
PROCESSING
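• Semantics involves the meaning behind words. Semantic analysis techniques include the following:
• Word sense disambiguation. This derives the meaning of a word based on context.
• Named entity recognition (NER). NER identifies words or phrases that can be categorized into groups,
such as names of people, organizations or places.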
• Natural language generation (NLG). NLG uses a database to determine the semantics behind words
and generate new text.
WHAT IS NATURAL LANGUAGE PROCESSING USED FOR?
• Text classification.
• This function assigns tags to texts to put them in categories.
• Useful for sentiment analysis, which helps the natural language processing algorithm determine the
sentiment, or emotion, behind a text.
• Text extraction.
• This function automatically summarizes text and finds important pieces of data.
• Ex: keyword extraction, which pulls the most important words from the text, which can be useful
for search engine optimization.
• Machine translation.
• In this process, a computer translates text from one language, such as English, to another language,
such as French, without human intervention.
• Natural language generation.
• This process uses NLP to analyze unstructured data and automatically produce content based on
that data. Ex: GPT-3
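As a rough illustration of text classification for sentiment analysis, the sketch below tags a text by counting hits in two word lists. The lexicons are hypothetical and far too small for real use; practical systems usually learn such associations from labeled data.

```python
import re

# Hypothetical mini-lexicons, chosen only for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def classify_sentiment(text):
    """Assign a sentiment tag by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great phone"))
```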
UMBRELLA OF PROBLEMS
The functions listed above are used in a variety of real-world applications, including the following:
•Customer feedback analysis. Tools using AI can analyze social media reviews and filter out comments
and queries for a company.
•Customer service automation. Voice assistants on a customer service phone line can use speech
recognition to understand what the customer is saying, so that it can direct their call correctly.
•Automatic translation. Tools such as Google Translate, Bing Translator and Translate Me can translate
text, audio and documents into another language.
•Academic research and analysis. Tools using AI can analyze huge amounts of academic material and
research papers based on the metadata of the text as well as the text itself.
•Analysis and categorization of healthcare records. AI-based tools can use insights to predict and,
ideally, prevent disease.
UMBRELLA OF PROBLEMS
•Plagiarism detection. Tools such as Copyleaks and Grammarly use AI technology to scan documents and
detect text matches and plagiarism.
•Stock forecasting and insights into financial trading. NLP tools can analyze market history and annual
reports that contain comprehensive summaries of a company's financial performance.
•Talent recruitment in human resources. Organizations can use AI-based tools to reduce hiring time by
automating the candidate sourcing and screening process.
•Automation of routine litigation. AI-powered tools can do research, identify possible issues and
summarize cases faster than human attorneys.
•Spam detection. NLP-enabled tools can be used to classify text for language that's often used in spam
or phishing attempts. For example, AI-enabled tools can detect bad grammar, misspelled names, urgent
calls to action and threatening terms.
• Text mining software uses natural language processing (NLP) together with rule-based systems and
machine learning to discover hidden relationships, patterns and sentiment in text documents.
• Unstructured text is preprocessed using NLP. This preprocessing can include any of these steps:
Cleaning: Removing small words (a, an, the) and correcting misspellings.
Stemming: Reducing a word to its stem by removing prefixes and suffixes (“hire” is the stem
for both “hiring” and “hired,” for example).
Tokenizing: Dividing text into distinct words and phrases.
Tagging parts of speech: Identifying the parts of speech within text, such as nouns, verbs and
adjectives.
Parsing syntax: Analyzing the structure of sentences and phrases to determine the role of
different words. This identifies the subject, verb and object of a sentence.
TEXT MINING
There are different methods and techniques for text mining. The most frequently used basic methods
are given below.
Word frequency : used to identify the most recurrent terms or concepts in a set of data. This is
particularly useful when analyzing customer reviews, social media conversations or customer
feedback.
Ex: if the words "expensive", "overpriced" and "overrated" frequently appear in your customer reviews, it
may indicate you need to adjust your prices.
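A minimal sketch of word-frequency counting with the standard library follows. The reviews are invented for illustration, and the regex tokenizer is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical reviews, invented for illustration.
reviews = [
    "Great product but too expensive",
    "Overpriced and overrated",
    "Works well, though expensive for what it does",
]

def word_frequencies(texts):
    """Count how often each lowercase word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

freq = word_frequencies(reviews)
print(freq.most_common(3))
```

In practice, stop words would usually be removed first so that words like "and" do not dominate the ranking.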
Collocation - Collocation refers to a sequence of words that commonly appear near each other. The
most common types of collocations are bigrams (a pair of words that are likely to go together, like get
started, save time or decision making) and trigrams (a combination of three words, like within
walking distance or keep in touch).
Identifying collocations — and counting them as one single word — improves the granularity of the
text, allows a better understanding of its semantic structure and, in the end, leads to more accurate text
mining results.
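One simple way to surface candidate collocations is to count adjacent word pairs and keep those that recur. The sketch below does this on a made-up text; raw counts are a crude measure, and real tools typically use association scores such as pointwise mutual information.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs (sentence boundaries are ignored in this sketch)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

# Made-up text for illustration.
text = "Get started today. Save time now and save time later. Get started fast."
counts = bigram_counts(text)

# Keep pairs seen at least twice as candidate collocations.
frequent = [pair for pair, n in counts.items() if n >= 2]
print(frequent)
```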
Concordance: Concordance is used to recognize the particular context or instance in which a word or
set of words appears. We all know that the human language can be ambiguous: the same word can be
used in many different contexts. Analyzing the concordance of a word can help understand its exact
meaning based on context.
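A concordance is often displayed as keyword-in-context lines. The following sketch, with a hypothetical `concordance` helper and a made-up text, prints each occurrence of a word with a few neighbors on either side.

```python
def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring trailing punctuation.
        if token.lower().strip(".,!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

text = "I left my watch at home. We watch the race every year."
for line in concordance(text, "watch"):
    print(line)
```

Reading the two lines side by side shows "watch" used once as a noun and once as a verb.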
TEXT MINING - METHODS AND TECHNIQUES
CLEANING TEXT DATA
Pre-processing and normalizing text
Popular techniques to pre-process, clean, and normalize text:
○ Text tokenization and lower casing
○ Removing special characters
○ Contraction expansion
○ Removing stopwords
○ Correcting spellings
○ Stemming
○ Lemmatization
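Several of these cleaning steps can be chained in one function. The sketch below covers lower casing, contraction expansion and special-character removal; the contraction table is a tiny hypothetical sample, and plain string replacement is a simplification.

```python
import re

# A few contractions only; a real table would be far larger (assumption).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text):
    """Lowercase, expand contractions, drop special characters, tidy spaces."""
    text = text.lower().replace("’", "'")       # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():     # contraction expansion
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # remove special characters
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

cleaned = clean_text("It’s great!!! But I DON'T like the price :(")
print(cleaned)
```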
PREPROCESSING DATA USING TOKENIZATION
● Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called
tokens.
● These tokens are very useful for finding patterns and are considered as a base step for stemming and
lemmatization.
● The Natural Language Toolkit (NLTK) provides the nltk.tokenize module, which includes the
functions
○ word_tokenize
○ sent_tokenize
● Depending on the task, we can define our own conditions to divide the input text into meaningful
tokens.
TOKENIZATION OF WORDS
● We use the function word_tokenize() to split a sentence into words.
● The output of word tokenization can be converted to a DataFrame for better text understanding in
machine learning applications.
● It can also be provided as input for further text cleaning steps such as punctuation removal, numeric
character removal or stemming.
● Machine learning models need numeric data to be trained and make a prediction.
● Word tokenization becomes a crucial part of the text (string) to numeric data conversion.
● from nltk.tokenize import word_tokenize
● text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
● print(word_tokenize(text))
Output: ['Trying', 'to', 'grow', 'up', 'is', 'hurting', '.', 'You', 'make', 'mistakes', '.', 'You', 'try', 'to',
'learn', 'from', 'them', ',', 'and', 'when', 'you', 'don', '’', 't', ',', 'it', 'hurts', 'even', 'more', '.']
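To make the text-to-numeric step concrete, here is a minimal bag-of-words sketch: each distinct token gets an index, and each document becomes a vector of counts. The helper names and the two short token lists are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_count_vector(tokens, vocab):
    """Represent one tokenized document as counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

# Hypothetical pre-tokenized documents.
docs = [["you", "make", "mistakes"], ["you", "try", "you", "learn"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```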
TOKENIZATION OF SENTENCES
● The function available for this is sent_tokenize.
● Sentence tokenization is still needed when word tokenization is available. Ex: to compute the
average number of words per sentence, use the NLTK sentence tokenizer to count sentences and the
NLTK word tokenizer to count words, then take the ratio.
● Such output serves as a useful feature for training a machine learning model, because it is numeric.
● from nltk.tokenize import sent_tokenize
● print(sent_tokenize(text))
Output: ['Trying to grow up is hurting.', 'You make mistakes.', 'You try to learn from them, and when you don’t, it hurts even more.']
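The average-words-per-sentence feature can be sketched without NLTK by using naive regex splitters. These are simplified stand-ins for sent_tokenize and word_tokenize, so abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def average_words_per_sentence(text):
    """Ratio of total word count to sentence count (text must be non-empty)."""
    sentences = split_sentences(text)
    word_counts = [len(re.findall(r"[\w’']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

text = ("Trying to grow up is hurting. You make mistakes. "
        "You try to learn from them, and when you don’t, it hurts even more.")
average = average_words_per_sentence(text)
print(round(average, 2))
```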
TAGGING AND CATEGORIZING WORDS
• Tagging is the process of classifying words into their parts of speech and labeling them accordingly.
It is known as part-of-speech tagging, or POS tagging.
• The "word classes" such as nouns, verbs, adjectives, and adverbs are not just the idle invention of
grammarians, but are useful categories for many language processing tasks. They arise from simple
analysis of the distribution of words in text.
• Parts of speech are also known as word classes or lexical categories.
• The collection of tags used for a particular task is known as a tagset.
• POS tags are used to describe the lexical terms that we have within our text.
Methods:
● Rule-based
○ IF -> THEN rules
● Stochastic (probability-based)
○ Hidden Markov Model
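To illustrate the stochastic approach, the sketch below runs the Viterbi algorithm over a toy Hidden Markov Model with three tags. Every probability in it is hypothetical, picked only so the example is easy to follow; a real tagger would estimate these values from a tagged corpus.

```python
TAGS = ["DT", "NOUN", "VERB"]

# All probabilities below are hypothetical, chosen for illustration only.
START = {"DT": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DT":   {"DT": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DT": 0.2,  "NOUN": 0.3, "VERB": 0.5},
    "VERB": {"DT": 0.6,  "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DT":   {"the": 1.0},
    "NOUN": {"fans": 0.3, "watch": 0.2, "race": 0.5},
    "VERB": {"fans": 0.3, "watch": 0.5, "race": 0.2},
}

def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p][tag])
            prob = best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, 0.0)
            column[tag] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

tags = viterbi("the fans watch the race".split())
print(tags)
```

Although "fans" and "watch" can each be a noun or a verb, the transition probabilities favor a noun after a determiner and a verb after a noun, which is how the ambiguity gets resolved.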
PART OF SPEECH TAGGING
Example:
I/PRO LIKE/VERB HIS/PRO WATCH/NOUN
THE/DT MAN/NOUN FANS/VERB THE/DT FLAME/NOUN
THE/DT FANS/NOUN WATCH/VERB THE/DT RACE/NOUN
PART OF SPEECH TAGGING
Why?
● As a feature in text modeling
● Autocomplete
● Word ambiguity resolution
USING A TAGGER
Processes a sequence of words, and attaches a part of speech tag to each word
● import nltk
● from nltk.tokenize import word_tokenize
● text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
● words = word_tokenize(text)
● print(nltk.pos_tag(words))
Output: [('Trying', 'VBG'), ('to', 'TO'), ('grow', 'VB'), ('up', 'RP'), ('is', 'VBZ'), ('hurting', 'VBG'), ('.', '.'), ('You', 'PRP'),
('make', 'VBP'), ('mistakes', 'NNS'), ('.', '.'), ('You', 'PRP'), ('try', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('from', 'IN'),
('them', 'PRP'), (',', ','), ('and', 'CC'), ('when', 'WRB'), ('you', 'PRP'), ('don', 'VBP'), ('’', 'JJ'), ('t', 'NN'), (',', ','), ('it',
'PRP'), ('hurts', 'VBZ'), ('even', 'RB'), ('more', 'RBR'), ('.', '.')]
Text-to-speech systems usually perform POS tagging.
USING A TAGGER
Example text with some homonyms:
● text = word_tokenize("They refuse to permit us to obtain the refuse permit")
● nltk.pos_tag(text)
● Output: [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'),
('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
● Notice that refuse and permit each appear both as a verb (VBP or VB) and as a noun (NN). E.g. refUSE is a
verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need
to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech
systems usually perform POS tagging.)
• N-grams are sequences of N consecutive words taken from a sentence.
• A unigram split takes a sentence and gives us each of its words.
• A bigram split takes a sentence and gives us each pair of consecutive words in the sentence.
• A trigram split gives each run of three consecutive words in a sentence.
• Example sentence: "Let me explain with an example."
• Unigrams: [let] [me] [explain] [with] [an] [example]
• Bigrams: [let me] [me explain] [explain with] [with an] [an example]
• Trigrams: [let me explain] [me explain with] [explain with an] [with an example]
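Generating n-grams amounts to sliding a window of size N over the token list. A minimal sketch follows, using a hypothetical `ngrams` helper written for this example rather than a library function.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "let me explain with an example".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of k words yields k - N + 1 n-grams, so the six-word example gives five bigrams and four trigrams.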
BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
• A sentence W is a sequence of words (w1, w2, …, wn), and its probability is written as
P(W) = P(w1, w2, …, wn)
• The probability of an upcoming word given the preceding word sequence is
P(wn | w1, w2, …, wn-1)
• A model that calculates either P(W) or P(wn | w1, w2, …, wn-1) is called a language model.
How to calculate P(w1, w2, …, wn)?
P(w1, w2, …, wn) is a joint probability.
Let us calculate the joint probability P(A, B) for two events A and B.
The joint probability can be calculated using the conditional probability as follows;
Conditional probability: P(B | A) = P(A, B) / P(A)
• Rearranging gives the product rule: P(A, B) = P(A) * P(B | A)
• Chain rule of probability: P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
• This can be generalized and used to calculate the joint probability of our word sequence P(w1, w2, …, wn) as
follows:
P(w1, w2, …, wn) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * … * P(wn | w1, w2, …, wn-1)
BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
• Ex: to calculate the component P(our | the prime minister of), measure its relative frequency count as
follows:
P(our | the prime minister of) = count(the prime minister of our) / count(the prime minister of)
This can be read as, "out of the number of times we saw ‘the prime minister of’ in a corpus, how
many times was it followed by the word ‘our’".
BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
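Ex: probability of the sentence “the prime minister of our country” by the chain rule:
P(the prime minister of our country) = P(the) * P(prime | the) * P(minister | the prime)
* P(of | the prime minister) * P(our | the prime minister of) * P(country | the prime minister of our)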
Solved Example:
Training corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Vellore do engineering</s>
Test data:
<s> students are from Vellore </s>
As per the Bigram model, the test sentence can be expanded as follows to estimate the bigram probability;
P(<s> students are from Vellore </s>)
= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
To estimate bigram probabilities, we can use the following equation:
P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
The counts needed from the training corpus are:
count of <s> = 4, count of string <s> students = 2
count of word students = 2, count of string students are = 1
count of word are = 2, count of string are from = 1
count of word from = 3, count of string from Vellore = 2
count of word Vellore = 2, count of string Vellore </s> = 1
P(<s> students are from Vellore </s>)
= P(students | <s>) * P(are | students) * P(from | are)
* P(Vellore | from) * P(</s> | Vellore)
= 2/4 * 1/2 * 1/2 * 2/3 * 1/2 = 1/24 ≈ 0.0417
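The solved example can be checked with a short script that counts unigrams and bigrams in the training corpus and multiplies the maximum-likelihood bigram estimates. The function names are assumptions for this sketch, and no smoothing is applied, so any bigram absent from the corpus gets probability zero.

```python
from collections import Counter

corpus = [
    "<s> I am from Vellore </s>",
    "<s> I am a teacher </s>",
    "<s> students are good and are from various cities </s>",
    "<s> students from Vellore do engineering </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_probability(prev, word):
    """Maximum-likelihood estimate: count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_probability(sentence):
    """Multiply bigram probabilities over consecutive token pairs."""
    prob = 1.0
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_probability(prev, word)
    return prob

p = sentence_probability("<s> students are from Vellore </s>")
print(round(p, 4))
```

Here P(students | <s>) is 2/4, since two of the four training sentences begin with "students", and the product of the five factors is 1/24.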

INTRODUCTION TO Natural language processing

  • 1. UNIT - III Revathi A Assistant Professor Dept of Computational Intelligence SRM Institute of Science and Technology, Kattankulathur
  • 2. INTRODUCTION TO NLP
      • Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language.
      • Ex: Amazon’s Alexa and Apple’s Siri use NLP to listen to user queries and find answers.
      • Organizations have large volumes of voice and text data from communication channels such as emails, text messages, social media newsfeeds, video, and audio.
      • They use NLP software to automatically process this data, analyze the intent or sentiment in the message, and respond in real time to human communication.
      • When text mining and machine learning are combined, automated text analysis becomes possible.
  • 3. PREPROCESSING STEPS IN NLP
      • Data preprocessing involves preparing and cleaning text data so that machines can analyze it. It includes the following steps:
      • Tokenization. Text is split into smaller units called tokens, such as words or sentences. (This is unrelated to payment tokenization, which replaces sensitive data such as credit card numbers with a nonsensitive token.)
      • Stop word removal. Common words are removed from the text, so that the unique words that offer the most information about the text remain.
      • Lemmatization and stemming. Both reduce inflected forms of a word to a common base. Stemming strips affixes by rule (for example, "walking" is reduced to the stem "walk"), while lemmatization groups the inflected versions of a word under its dictionary form, or lemma.
      • Part-of-speech tagging. Words are tagged based on the part of speech they correspond to, such as nouns, verbs or adjectives.
  • 4. PREPROCESSING STEPS IN NLP
      • There are many different natural language processing algorithms, but two main types are commonly used:
      • Rule-based system. This system uses carefully designed linguistic rules. This approach was used early in the development of NLP and is still used.
      • Machine learning-based system. Machine learning algorithms use statistical methods. Using a combination of machine learning, deep learning and neural networks, natural language processing algorithms hone their own rules through repeated processing and learning.
  • 5. TECHNIQUES AND METHODS OF NATURAL LANGUAGE PROCESSING
      • Syntax and semantic analysis are the two main techniques used in natural language processing.
      • Syntax is the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess meaning from a language based on grammatical rules. Syntax techniques include the following:
          ○ Parsing. This is the grammatical analysis of a sentence. Parsing involves breaking a sentence into its parts of speech.
          ○ Word segmentation. This is the act of taking a string of text and deriving word forms from it. For example, when a person scans a handwritten document into a computer, the algorithm can analyze the page and recognize that the words are divided by white spaces.
          ○ Sentence breaking. This places sentence boundaries in large texts.
          ○ Morphological segmentation. This divides words into smaller parts called morphemes. It is especially useful in machine translation and speech recognition.
          ○ Stemming. This reduces inflected words to their root forms.
  • 6. TECHNIQUES AND METHODS OF NATURAL LANGUAGE PROCESSING
      • Semantic analysis techniques include the following:
          ○ Word sense disambiguation. This derives the meaning of a word based on its context.
          ○ Named entity recognition (NER). NER identifies words or phrases that can be categorized into groups, such as people, organizations and locations.
          ○ Natural language generation (NLG). NLG uses a database to determine the semantics behind words and generate new text.
  • 7. WHAT IS NATURAL LANGUAGE PROCESSING USED FOR?
      • Text classification.
          ○ This function assigns tags to texts to put them in categories.
          ○ It is useful for sentiment analysis, which helps the natural language processing algorithm determine the sentiment, or emotion, behind a text.
      • Text extraction.
          ○ This function automatically summarizes text and finds important pieces of data.
          ○ Ex: keyword extraction, which pulls the most important words from the text and can be useful for search engine optimization.
      • Machine translation.
          ○ In this process, a computer translates text from one language, such as English, to another language, such as French, without human intervention.
      • Natural language generation.
          ○ This process uses NLP to analyze unstructured data and automatically produce content based on that data. Ex: GPT-3
  • 8. UMBRELLA OF PROBLEMS
      The functions listed above are used in a variety of real-world applications, including the following:
      • Customer feedback analysis. Tools using AI can analyze social media reviews and filter out comments and queries for a company.
      • Customer service automation. Voice assistants on a customer service phone line can use speech recognition to understand what the customer is saying, so that they can direct the call correctly.
      • Automatic translation. Tools such as Google Translate, Bing Translator and Translate Me can translate text, audio and documents into another language.
      • Academic research and analysis. Tools using AI can analyze huge amounts of academic material and research papers based on the metadata of the text as well as the text itself.
      • Analysis and categorization of healthcare records. AI-based tools can use insights to predict and, ideally, prevent disease.
  • 9. UMBRELLA OF PROBLEMS
      • Plagiarism detection. Tools such as Copyleaks and Grammarly use AI technology to scan documents and detect text matches and plagiarism.
      • Stock forecasting and insights into financial trading. NLP tools can analyze market history and annual reports that contain comprehensive summaries of a company's financial performance.
      • Talent recruitment in human resources. Organizations can use AI-based tools to reduce hiring time by automating the candidate sourcing and screening process.
      • Automation of routine litigation. AI-powered tools can do research, identify possible issues and summarize cases faster than human attorneys.
      • Spam detection. NLP-enabled tools can be used to classify text for language that's often used in spam or phishing attempts. For example, AI-enabled tools can detect bad grammar, misspelled names, urgent calls to action and threatening terms.
  • 10. TEXT MINING
      • Text mining software uses natural language processing (NLP) together with rule-based systems and machine learning to discover hidden relationships, patterns and sentiment in text documents.
      • Unstructured text is preprocessed using NLP. This preprocessing can include any of these steps:
          ○ Cleaning: Removing small words (a, an, the) and correcting misspellings.
          ○ Stemming: Reducing a word to its stem by removing prefixes and suffixes (“hire” is the stem for both “hiring” and “hired,” for example).
          ○ Tokenizing: Dividing text into distinct words and phrases.
          ○ Tagging parts of speech: Identifying the parts of speech within text, such as nouns, verbs and adjectives.
          ○ Parsing syntax: Analyzing the structure of sentences and phrases to determine the role of different words. This identifies the subject, verb and object of a sentence.
  • 11. TEXT MINING - METHODS AND TECHNIQUES
      There are different methods and techniques for text mining. The most frequently used basic methods are given below.
      • Word frequency: used to identify the most recurrent terms or concepts in a set of data. This is particularly useful when analyzing customer reviews, social media conversations or customer feedback. Ex: if the words expensive, overpriced and overrated frequently appear in your customer reviews, it may indicate that you need to adjust your prices.
      • Collocation: a sequence of words that commonly appear near each other. The most common types of collocations are bigrams (a pair of words that are likely to go together, like get started, save time or decision making) and trigrams (a combination of three words, like within walking distance or keep in touch). Identifying collocations, and counting each one as a single word, improves the granularity of the text, allows a better understanding of its semantic structure and, in the end, leads to more accurate text mining results.
      • Concordance: used to recognize the particular context or instance in which a word or set of words appears. Human language can be ambiguous: the same word can be used in many different contexts. Analyzing the concordance of a word can help us understand its exact meaning based on context.
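The three basic methods on this slide can be sketched in a few lines of plain Python. This is only an illustration: the sample reviews are invented, the `tokenize` and `concordance` helpers are hypothetical names, and raw adjacent-pair counts stand in for a proper collocation score.

```python
import re
from collections import Counter

# Hypothetical sample reviews, invented for illustration only.
reviews = [
    "The plan is overpriced and the support is slow.",
    "Overpriced plan, but easy to get started.",
    "Easy to get started and it does save time.",
]

def tokenize(text):
    """Lowercase the text and return runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens_per_review = [tokenize(r) for r in reviews]

# Word frequency: count every token across all reviews.
word_freq = Counter(tok for toks in tokens_per_review for tok in toks)

# Collocation candidates: count adjacent word pairs (bigrams) within each review.
bigram_freq = Counter(
    pair for toks in tokens_per_review for pair in zip(toks, toks[1:])
)

def concordance(word, window=2):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    hits = []
    for toks in tokens_per_review:
        for i, tok in enumerate(toks):
            if tok == word:
                hits.append(" ".join(toks[max(0, i - window): i + window + 1]))
    return hits

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
print(concordance("overpriced"))
```

Pairs such as ("get", "started") that recur across reviews are the collocation candidates; a real analysis would normally filter stop words first and use a larger corpus.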
  • 12. TEXT MINING - METHODS AND TECHNIQUES
  • 13. CLEANING TEXT DATA
      Popular techniques to pre-process, clean, and normalize text:
          ○ Text tokenization and lower casing
          ○ Removing special characters
          ○ Contraction expansion
          ○ Removing stopwords
          ○ Correcting spellings
          ○ Stemming
          ○ Lemmatization
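Several of the cleaning steps listed above can be chained into one small function. The sketch below uses only the standard library; the `CONTRACTIONS` and `STOPWORDS` tables are tiny hypothetical stand-ins for the fuller lists a real project would use, and spelling correction, stemming and lemmatization are left out.

```python
import re

# Tiny illustrative lookup tables (assumptions, not standard lists).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"a", "an", "the", "is", "to", "and", "it", "you"}

def clean_text(text):
    """Lowercase, expand contractions, drop special characters, tokenize, remove stopwords."""
    text = text.lower().replace("’", "'")        # lower casing, normalise curly apostrophes
    for short, full in CONTRACTIONS.items():     # contraction expansion
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)        # remove special characters and digits
    tokens = text.split()                        # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("It's hard: you DON'T learn the rules @ once!"))
```

The order matters: contractions are expanded before special characters are stripped, otherwise the apostrophe would be removed and "don't" would no longer match the lookup table.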
  • 14. PREPROCESSING DATA USING TOKENIZATION
      ● Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens.
      ● These tokens are very useful for finding patterns and are considered a base step for stemming and lemmatization.
      ● The Natural Language Toolkit (NLTK) provides the nltk.tokenize module, which includes the functions:
          ○ word_tokenize (word tokenization)
          ○ sent_tokenize (sentence tokenization)
      ● Depending on the task, we can define our own conditions to divide the input text into meaningful tokens.
  • 15. TOKENIZATION OF WORDS
      ● We use the function word_tokenize() to split a sentence into words.
      ● The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications.
      ● It can also be provided as input for further text cleaning steps such as punctuation removal, numeric character removal or stemming.
      ● Machine learning models need numeric data to be trained and to make predictions.
      ● Word tokenization is a crucial part of converting text (strings) to numeric data.
      ● Example:
          from nltk.tokenize import word_tokenize
          text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
          print(word_tokenize(text))
      ● Output:
          ['Trying', 'to', 'grow', 'up', 'is', 'hurting', '.', 'You', 'make', 'mistakes', '.', 'You', 'try', 'to', 'learn', 'from', 'them', ',', 'and', 'when', 'you', 'don', '’', 't', ',', 'it', 'hurts', 'even', 'more', '.']
  • 16. TOKENIZATION OF SENTENCES
      ● The function for sentence tokenization is sent_tokenize.
      ● Why is sentence tokenization needed when word tokenization is available? Ex: to count the average number of words per sentence, the NLTK sentence tokenizer and the NLTK word tokenizer are used together to calculate the ratio.
      ● Such output serves as an important feature for machine learning, since the result is numeric.
      ● Example (text is the same string used for word tokenization):
          from nltk.tokenize import sent_tokenize
          print(sent_tokenize(text))
      ● Output:
          ['Trying to grow up is hurting.', 'You make mistakes.', 'You try to learn from them, and when you don’t, it hurts even more.']
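The average-words-per-sentence feature mentioned above can be sketched as follows. To keep the example runnable without downloading NLTK data, it swaps in naive regex splitters (`split_sentences` and `split_words` are hypothetical helpers) in place of `sent_tokenize` and `word_tokenize`; unlike `word_tokenize`, this word splitter does not count punctuation as tokens.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Naive word splitter: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9'’]+", sentence)

def average_words_per_sentence(text):
    """Ratio of total word count to sentence count (0.0 for empty input)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(split_words(s)) for s in sentences)
    return total_words / len(sentences)

text = ("Trying to grow up is hurting. You make mistakes. "
        "You try to learn from them, and when you don't, it hurts even more.")
print(average_words_per_sentence(text))
```

For the sample text the three sentences contain 6, 3 and 14 words, so the ratio is 23/3. With NLTK installed, the two helpers could be replaced by `sent_tokenize` and `word_tokenize`, keeping in mind that the latter also returns punctuation tokens.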
  • 17. TAGGING AND CATEGORIZING WORDS
      • Tagging is the process of classifying words into their parts of speech and labeling them accordingly. It is known as part-of-speech tagging, or POS tagging.
      • The "word classes" such as nouns, verbs, adjectives, and adverbs are not just the idle invention of grammarians, but are useful categories for many language processing tasks. They arise from simple analysis of the distribution of words in text.
      • Parts of speech are also known as word classes or lexical categories.
      • The collection of tags used for a particular task is known as a tagset.
      • POS tags are used to describe the lexical terms that we have within our text.
  • 18. Methods:
      ● Rule based
          ○ IF -> THEN rules
      ● Stochastic (probability-based)
          ○ Hidden Markov Model
      Ex: THE/DT FANS/NOUN WATCH/VERB THE/DT RACE/NOUN
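To make the IF -> THEN idea concrete, here is a deliberately tiny rule-based tagger. The three rules and the `DETERMINERS` set are invented for this illustration and only cover simple determiner-noun-verb patterns; they are not the rules of any real tagger, and the stochastic (HMM) method is not shown.

```python
DETERMINERS = {"the", "a", "an"}

def rule_based_tag(words):
    """Tag each word with hand-written IF -> THEN rules that look at the previous tag."""
    tagged = []
    prev_tag = None
    for word in words:
        if word.lower() in DETERMINERS:   # IF the word is a determiner THEN DT
            tag = "DT"
        elif prev_tag == "DT":            # IF it follows a determiner THEN NOUN
            tag = "NOUN"
        elif prev_tag == "NOUN":          # IF it follows a noun THEN VERB
            tag = "VERB"
        else:                             # fallback guess when no rule fires
            tag = "NOUN"
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(rule_based_tag("THE FANS WATCH THE RACE".split()))
print(rule_based_tag("THE MAN FANS THE FLAME".split()))
```

Because the rules use context rather than a fixed dictionary entry per word, FANS is tagged as a noun in the first sentence and as a verb in the second.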
  • 19. PART OF SPEECH TAGGING
      Example:
          I/PRO LIKE/VERB HIS/PRO WATCH/NOUN
          THE/DT MAN/NOUN FANS/VERB THE/DT FLAME/NOUN
          THE/DT FANS/NOUN WATCH/VERB THE/DT RACE/NOUN
  • 20. PART OF SPEECH TAGGING
      Why?
      ● Feature in text modeling
      ● Autocomplete
      ● Word ambiguity resolution
  • 21. USING A TAGGER
      A tagger processes a sequence of words and attaches a part-of-speech tag to each word.
      ● Example:
          import nltk
          from nltk.tokenize import word_tokenize
          text = "Trying to grow up is hurting. You make mistakes. You try to learn from them, and when you don’t, it hurts even more."
          words = word_tokenize(text)
          print(nltk.pos_tag(words))
      ● Output:
          [('Trying', 'VBG'), ('to', 'TO'), ('grow', 'VB'), ('up', 'RP'), ('is', 'VBZ'), ('hurting', 'VBG'), ('.', '.'), ('You', 'PRP'), ('make', 'VBP'), ('mistakes', 'NNS'), ('.', '.'), ('You', 'PRP'), ('try', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('from', 'IN'), ('them', 'PRP'), (',', ','), ('and', 'CC'), ('when', 'WRB'), ('you', 'PRP'), ('don', 'VBP'), ('’', 'JJ'), ('t', 'NN'), (',', ','), ('it', 'PRP'), ('hurts', 'VBZ'), ('even', 'RB'), ('more', 'RBR'), ('.', '.')]
      ● Text-to-speech systems usually perform POS tagging.
  • 22. USING A TAGGER
      Example text with some homonyms:
      ● Example:
          text = word_tokenize("They refuse to permit us to obtain the refuse permit")
          nltk.pos_tag(text)
      ● Output:
          [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
      ● Notice that refuse and permit each appear both as a verb (VBP and VB) and as a noun (NN). E.g. refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS tagging.)
  • 23. 23
  • 24. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
      • N-grams are phrases cut out of a sentence with N consecutive words.
      • A unigram takes a sentence and gives us each word in it individually.
      • A bigram takes a sentence and gives us sets of two consecutive words in the sentence.
      • A trigram gives sets of three consecutive words in a sentence.
      • Example sentence: "Let me explain with an example."
          ○ Unigram: [Let] [me] [explain] [with] [an] [example]
          ○ Bigram: [Let me] [me explain] [explain with] [with an] [an example]
          ○ Trigram: [Let me explain] [me explain with] [explain with an] [with an example]
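The unigram, bigram and trigram lists above can be reproduced with a short sliding-window helper. This is a minimal sketch in plain Python; `ngrams` here is a hypothetical local function, and the sentence is split on whitespace with the final period dropped.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Let me explain with an example".split()
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
```

A sentence of 6 words yields 6 unigrams, 5 bigrams and 4 trigrams; when n is larger than the number of tokens the function returns an empty list.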
  • 25. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
      • A sentence (W) is a sequence of words (w1, w2, …, wn), and its probability is written as P(W) = P(w1, w2, …, wn).
      • The probability of an upcoming word given the preceding word sequence is P(wn | w1, w2, …, wn-1).
      • A model that calculates either P(W) or P(wn | w1, w2, …, wn-1) is called a language model.
      • How to calculate P(w1, w2, …, wn)? P(w1, w2, …, wn) is a joint probability. Consider first the joint probability P(A, B) of two events A and B, which can be calculated using the conditional probability.
      • Conditional probability: P(B | A) = P(A, B) / P(A)
      • Rearranging (the product rule): P(A, B) = P(A) * P(B | A)
      • Chain rule of probability: P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
      • This can be generalized to calculate the joint probability of our word sequence: P(w1, w2, …, wn) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * … * P(wn | w1, w2, …, wn-1)
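  • 26. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
      • Chain rule of probability: P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
      • This can be generalized to calculate the joint probability of our word sequence: P(w1, w2, …, wn) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * … * P(wn | w1, w2, …, wn-1)
      • Ex: probability of the sentence “the prime minister of our country”: P(the prime minister of our country) = P(the) * P(prime | the) * P(minister | the prime) * P(of | the prime minister) * P(our | the prime minister of) * P(country | the prime minister of our)
      • Ex: to calculate the component P(our | the prime minister of), measure its relative frequency count: P(our | the prime minister of) = count(the prime minister of our) / count(the prime minister of)
      • This can be read as, "out of the number of times we saw ‘the prime minister of’ in a corpus, how many times was it followed by the word ‘our’".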
  • 27. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
      Solved Example:
      Training corpus:
          <s> I am from Vellore </s>
          <s> I am a teacher </s>
          <s> students are good and are from various cities </s>
          <s> students from Vellore do engineering </s>
      Test data:
          <s> students are from Vellore </s>
      As per the bigram model, the probability of the test sentence can be expanded as follows:
          P(<s> students are from Vellore </s>) = P(students | <s>) * P(are | students) * P(from | are) * P(Vellore | from) * P(</s> | Vellore)
      To estimate each bigram probability, we can use the following equation:
          P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
  • 28. BIGRAM, TRIGRAM, AND NGRAM MODELS IN NLP
      Counts from the training corpus:
      • count of <s> = 4, count of string <s> students = 2
      • count of word students = 2, count of string students are = 1
      • count of word are = 2, count of string are from = 1
      • count of word from = 3, count of string from Vellore = 2
      • count of word Vellore = 2, count of string Vellore </s> = 1
      P(<s> students are from Vellore </s>) = P(students | <s>) * P(are | students) * P(from | are) * P(Vellore | from) * P(</s> | Vellore) = 2/4 * 1/2 * 1/2 * 2/3 * 1/2 = 1/24 ≈ 0.0417
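The solved example can be checked by counting unigrams and bigrams directly from the four training sentences. The sketch below is an unsmoothed maximum-likelihood bigram model written in plain Python (the function names are illustrative). Counting from the corpus gives count(<s> students) = 2 and count(<s>) = 4, so P(students | <s>) = 2/4 and the sentence probability comes out as 1/24, about 0.0417.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am from Vellore </s>",
    "<s> I am a teacher </s>",
    "<s> students are good and are from various cities </s>",
    "<s> students from Vellore do engineering </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                    # count single tokens
    bigram_counts.update(zip(tokens, tokens[1:]))    # count adjacent token pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev word) / count(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

def sentence_prob(sentence):
    """Multiply the bigram probabilities of every adjacent token pair."""
    tokens = sentence.split()
    prob = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

p = sentence_prob("<s> students are from Vellore </s>")
print(p, float(p))
```

Exact fractions are used so each factor can be compared with the hand calculation. Without smoothing, any bigram not seen in training makes the whole product zero, and a context word that never occurs in the corpus raises a ZeroDivisionError.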