Module 7: Natural Language Processing
Natural Language Processing (NLP)
NLP is a field of computer science and artificial intelligence concerned with enabling computers to understand and manipulate human language. It bridges the gap between human communication and machine code, allowing computers to process information in the way we naturally use language.
Applications of NLP
NLP has a vast range of applications that are woven into our daily lives:
Machine Translation: Breaking down language barriers by translating text or speech from
one language to another [e.g., Google Translate].
Smart Assistants: Responding to voice commands and questions in a natural way [e.g.,
Siri, Alexa, Google Assistant].
Chatbots: Providing customer service or information through automated chat
conversations.
Cont...
• Sentiment Analysis: Extracting opinions and emotions from text data [e.g.,
social media monitoring].
• Text Summarization: Condensing large amounts of text into key points.
• Autocorrect and Predictive Text: Suggesting corrections and completions as
you type.
• Spam Filtering: Identifying and blocking unwanted emails.
• Search Engines: Ranking search results based on relevance to your query.
Challenges in Processing Human Language
Human language is complex and nuanced, which presents several
challenges for NLP:
Ambiguity: Words can have multiple meanings depending on context (e.g., "bat" can refer to a flying mammal or a piece of sports equipment).
Sarcasm and Irony: Computers struggle to understand the subtle
cues that convey these forms of expression.
Cont...
• Slang and Informal Language: Keeping up with ever-evolving slang
and informal language usage.
• Incomplete Sentences and Utterances: Human conversation often
involves shortcuts and missing information that can be confusing for
machines.
NLP researchers are constantly developing techniques to address
these challenges and improve the accuracy and robustness of NLP
systems.
Key NLP Tasks
Here's a glimpse into some fundamental NLP tasks that form the building
blocks for many applications:
• Tokenization: Breaking down text into smaller units like words,
punctuation marks, or phrases.
• Part-of-Speech (POS) tagging: Identifying the grammatical function of
each word in a sentence (e.g., noun, verb, adjective).
• Named Entity Recognition (NER): Recognizing and classifying named
entities in text, such as people, organizations, locations, dates,
monetary values, etc.
1. Tokenization:
Imagine you're dissecting a sentence. Tokenization is the first step,
where you break the sentence down into its individual building blocks.
These blocks can be:
• Words: "The", "quick", "brown", "fox"
• Punctuation marks: ".", ",", "?"
• Sometimes even phrases: "New York City" (depending on the
application)
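As a quick illustration (not from the slides), here is a minimal tokenization sketch using NLTK's word_tokenize; it assumes NLTK is installed and downloads the "punkt" tokenizer data on first use.

```python
# A small sketch using NLTK (assumed installed); the "punkt" tokenizer data is downloaded
# on first use (newer NLTK releases may also require the "punkt_tab" resource).
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize

print(word_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```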
2. POS Tagging:
After you have your tokens, POS tagging assigns a grammatical
role (part-of-speech) to each one. Here's an example:
Sentence: "The quick brown fox jumps over the lazy dog."
POS Tags: (Determiner, Adjective, Adjective, Noun) (Verbs)
(Preposition, Determiner, Adjective, Noun)
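A similar hedged sketch for POS tagging with NLTK's default tagger; the exact tags can vary with the tagger model, and recent NLTK versions may name the resource "averaged_perceptron_tagger_eng".

```python
# A small sketch with NLTK's default perceptron tagger (assumed installed);
# the tagger model is downloaded on first use.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
# (exact tags depend on the tagger model used)
```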
3. Named Entity Recognition (NER):
This focuses on identifying and classifying specific entities within the tokens. Imagine
circling important names on a page. NER does something similar, recognizing
entities like:
• People: "Albert Einstein"
• Organizations: "Google"
• Locations: "Paris"
• Dates: "July 4th, 2024"
• Monetary values: "$100"
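A hedged NER sketch with spaCy; it assumes the small English model has been installed (python -m spacy download en_core_web_sm), and the entity labels shown depend on that model.

```python
# A hedged spaCy sketch; entity labels (PERSON, ORG, GPE, DATE, MONEY, ...) come from
# the en_core_web_sm model, which must be downloaded separately.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein visited Google in Paris on July 4th, 2024 and spent $100.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Albert Einstein PERSON, Google ORG, Paris GPE, July 4th, 2024 DATE, $100 MONEY
```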
Practical Examples
1. Search Engines:
Tokenization: When you search for "best restaurants NYC", the search engine
breaks it down into tokens like "best", "restaurants", "NYC".
POS Tagging: It can identify "best" as an adjective, "restaurants" as a noun,
and "NYC" as a proper noun (likely a location).
NER: This helps the search engine understand you're looking for highly-rated
restaurants in New York City and refines the search results accordingly.
2. Social Media Analysis:
Tokenization: Analyzing a tweet like "Feeling great after winning the game #GoTeam! #Champions".
POS Tagging: It can identify "Feeling" as a verb, "great" as an adjective, "winning" as a verb (participle), "game" as a noun, and hashtags as proper nouns.
NER: This might not be relevant here, but NER could be used to identify the team mentioned in the hashtags for further analysis.
3. Spam Filtering:
Tokenization: Breaking down a spam email with the subject line "Free $$$ for you!".
POS Tagging: It can identify "Free" as an adjective, "$$$" as symbols, and "you" as a pronoun.
NER: This might not play much of a role here, but tokenization and POS tagging help identify the generic and promotional nature of the email, potentially flagging it as spam.
4. Machine Translation:
Text Cleaning and Normalization for NLP
• Text data often comes in a raw and messy format. It can contain
inconsistencies, irrelevant information, and variations in how words are
written.
• Cleaning and normalization are crucial steps in NLP to prepare the text for
further processing. Here's a breakdown of some common techniques:
1. Removing Stopwords:
Stopwords are very common words that carry little meaning on their own (e.g., "the", "a", "is"). Removing them can improve processing efficiency and focus the analysis on more content-rich words.
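A minimal stopword-removal sketch with NLTK's English stopword list (an illustration, assuming NLTK is installed; the list is downloaded on first use).

```python
# Remove common English stopwords from an already-tokenized sentence.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "on", "the", "mat"]
print([t for t in tokens if t not in stops])   # ['cat', 'mat']
```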
2. Removing Special Characters:
• Punctuation marks, symbols, and emojis can add noise to the data.
• Depending on the task, you might choose to remove them entirely or convert them to a standard format.
3. Lowercasing/Uppercasing:
Text data can be written in different cases (uppercase, lowercase). Converting everything to lowercase or uppercase ensures consistency and simplifies further processing.
4. Normalizing Text:
This can involve:
• Expanding Abbreviations: Converting abbreviations to their full forms
(e.g., "e.g." to "for example").
• Handling Emojis: Converting emojis to text descriptions or removing
them altogether.
• Handling Numbers: Converting numbers to text (e.g., "2023" to "two
thousand twenty-three") or leaving them as numerals depending on
the task.
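A minimal cleaning sketch combining steps 2-4 above (removing special characters, lowercasing, and expanding a few abbreviations); the abbreviation map and example sentence are purely illustrative.

```python
import re

ABBREVIATIONS = {"e.g.": "for example", "omg": "oh my god"}  # hypothetical map, extend as needed

def clean_text(text: str) -> str:
    text = text.lower()                           # 3. normalize case
    for short, full in ABBREVIATIONS.items():     # 4. expand abbreviations before stripping dots
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # 2. strip punctuation, symbols, emojis
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

print(clean_text("OMG!!! This phone is great 😍 (e.g., the camera)"))
# oh my god this phone is great for example the camera
```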
5. Lemmatization vs. Stemming:
These techniques aim to reduce words to their base forms. However, they have subtle
differences:
Lemmatization: This process tries to convert a word to its dictionary form (lemma),
considering its grammatical role in the sentence (e.g., "running" becomes "run",
"better" becomes "good"). It requires a morphological analysis of the word.
Stemming: This process chops off suffixes to arrive at a base form (stem) that might
not always be a real word (e.g., "running" becomes "run", "better" becomes "bet"). It's a
simpler and faster approach but can sometimes lead to incorrect base forms.
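A small sketch contrasting the two with NLTK's WordNetLemmatizer and PorterStemmer (assuming NLTK is installed; the WordNet data is downloaded on first use).

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# The lemmatizer needs the part of speech to find the dictionary form (lemma).
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good

# The stemmer just strips suffixes, with no dictionary lookup.
print(stemmer.stem("running"))                   # run
print(stemmer.stem("studies"))                   # studi (not a real word)
```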
Cont...
The choice between lemmatization and stemming depends on your specific application. Lemmatization is generally preferred for tasks where preserving meaning and grammatical accuracy is crucial. Stemming can be faster and sufficient for simpler tasks where the exact meaning of the base form isn't critical.
Additional Considerations
• Text Normalization Libraries: Libraries like NLTK (Python) and spaCy (Python) offer functionalities for many of these text cleaning and normalization tasks.
• Context-Specific Normalization: The specific techniques you apply might vary depending on your NLP task and the nature of your text data.
• Trade-offs: There can be trade-offs between cleaning too aggressively and losing information, and cleaning too lightly and leaving noise in. Finding the right balance depends on your specific needs.
Some examples
1. Social Media Sentiment Analysis:
Imagine analyzing tweets to understand public sentiment towards a
new product launch. You'd want to clean the text by:
• Removing stopwords: Words like "a", "the", "is" don't contribute much
to sentiment.
• Removing special characters: Emojis, hashtags, and punctuation can
be removed or converted for consistency.
• Lowercasing: Case variations shouldn't affect sentiment analysis.
• Normalizing slang and abbreviations: "OMG" could be converted to
"oh my god" for better understanding.
2. Web Scraping and Text Summarization:
You might scrape news articles to summarize the main
points. Here, cleaning involves:
Removing HTML tags and code: Irrelevant for textual
content.
Removing stopwords: Focus on the core information.
Normalizing text: Standardize dates, locations, etc.
3. Chatbot Development:
When building a chatbot, you need to understand user queries
effectively. Cleaning involves:
Correcting typos and misspellings: Users might make mistakes
while typing.
Removing irrelevant information: Greetings, salutations, and emojis
might not be crucial for understanding the intent.
Normalization: Standardize formats for dates, times, and
measurements.
4. Machine Translation:
Machine translation systems need clean and normalized text for accurate
translation. Cleaning involves:
Removing special characters: Symbols and emojis might not translate
well.
Handling named entities: Proper names (people, locations) should be
preserved.
Normalization: Standardize date and time formats across languages.
5. Text Classification:
Classifying emails as spam or not-spam requires
cleaned text. Cleaning involves:
Removing email headers and footers: Irrelevant for
classification.
Removing URLs and attachments: Not useful for content
analysis.
Normalization: Standardize greetings and salutations.
1. Bag-of-Words (BoW) Model:
Concept: BoW is a simple way to represent documents as numerical vectors.
Process:
• Each document is treated as a "bag" of words, ignoring the order and grammar of the words.
• A vocabulary of unique words is created across all documents in the corpus.
• Each document is represented by a vector where each element corresponds to a word in the vocabulary.
• The value of each element indicates the frequency (count) of the corresponding word appearing in that document.
Example:
Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."
Vocabulary: {the, cat, sat, on, mat, dog, chased}
Document 1 vector: [2, 1, 1, 1, 1, 0, 0] (two occurrences of "the", one each of "cat", "sat", "on", "mat")
Document 2 vector: [2, 1, 0, 0, 0, 1, 1]
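As a quick check of the counts above, here is a minimal sketch with scikit-learn's CountVectorizer (an assumption, not part of the slides); the library orders its vocabulary alphabetically, so the columns differ from the listing above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The dog chased the cat."]
vectorizer = CountVectorizer()            # lowercases and strips punctuation by default
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```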
Limitations:
Ignores word order and context.
Doesn't capture the relationships between words.
Can be sensitive to high-frequency stopwords.
2. Term Frequency-Inverse Document Frequency (TF-IDF):
Concept: TF-IDF builds upon BoW but considers the importance of words within a document and across the entire corpus.
Process:
• TF (Term Frequency) for a word in a document is calculated as its count divided by the total number of words in that document.
• IDF (Inverse Document Frequency) for a word is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing that word. A high IDF means the word is less frequent across documents and potentially more informative.
• The TF-IDF weight for a word is then calculated by multiplying TF and IDF.
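A tiny pure-Python sketch that follows the TF and IDF definitions above, reusing the two toy documents from the BoW example; widely used implementations (e.g., scikit-learn's TfidfVectorizer) apply smoothed variants of the same idea.

```python
import math

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "chased", "the", "cat"]]

def tf(word, doc):
    return doc.count(word) / len(doc)              # term count / document length

def idf(word, docs):
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)        # log(N / document frequency)

for word in ["the", "dog"]:
    print(word, round(tf(word, docs[1]) * idf(word, docs), 3))
# "the" occurs in every document, so idf = log(2/2) = 0 and its weight is 0.0;
# "dog" occurs only in document 2, so it receives a positive weight (~0.139).
```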
Benefits:
Gives more weight to important words (rare but informative).
Reduces the impact of stopwords.
3. Word Embeddings and Distributed
Representations (Word2Vec, GloVe):
Concept: Word embeddings map words to numerical vectors, capturing semantic relationships
between words. Similar words will have similar vector representations in high-dimensional space.
Techniques:
Word2Vec: Two popular architectures are Skip-gram and CBOW. They predict surrounding words
based on a given word (Skip-gram) or vice versa (CBOW). Words used for prediction and the target
word become closer in the vector space.
GloVe: Analyzes word co-occurrence statistics from a large corpus to learn word vectors. Words that
frequently co-occur are positioned closer in the vector space.
Benefits:
Captures semantic relationships between words.
Enables tasks like word similarity detection and analogy completion.
Can be used as input features for various NLP models.
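A minimal Word2Vec sketch with Gensim (an assumption, not from the slides); the toy corpus is far too small to learn meaningful embeddings and is only meant to show the API.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"],
             ["the", "dog", "sat", "on", "the", "mat"]]

# sg=1 selects the Skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])                   # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat", topn=2))  # nearest neighbours in the embedding space
```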
4. Language Models and Pre-trained
Transformers:
Concept: Language models are statistical methods that predict the next word in a sequence based on
the preceding words. Pre-trained transformers are powerful language models trained on massive
amounts of text data.
Techniques:
Traditional Language Models (e.g., n-grams): Predict the next word based on the n preceding words
(e.g., bigrams, trigrams).
Pre-trained Transformers (e.g., BERT, GPT-3): These are complex neural network architectures
trained on massive text corpora. They learn contextual representations of words and can be fine-tuned
for various NLP tasks like text classification, question answering, and summarization.
Benefits:
Can handle complex relationships between words in a sentence.
Achieve state-of-the-art performance on many NLP tasks.
Offer flexibility for fine-tuning to specific domains.
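As a hedged illustration, pre-trained transformers such as BERT can be tried through the Hugging Face transformers pipeline API; the example assumes the library is installed and downloads the bert-base-uncased weights on first run.

```python
from transformers import pipeline

# A masked-language-model pipeline: the model predicts the word hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The quick brown fox [MASK] over the lazy dog."):
    # Each prediction carries the suggested token and the model's confidence score.
    print(prediction["token_str"], round(prediction["score"], 3))
```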
Here's an analogy:
BoW and TF-IDF are like simple indexes in a library, listing all the words in each book (document).
Word embeddings are like advanced search features that consider synonyms and related terms.
Language models and pre-trained transformers are like highly knowledgeable librarians who can not only find relevant information but also understand the context and relationships between them.
Understanding Sentiment Analysis
Sentiment analysis, also known as opinion mining, is the process of computationally identifying and classifying the
emotional tone behind a piece of text. It aims to understand whether the sentiment expressed is positive, negative, or
neutral.
Here's a breakdown of the concept:
Applications:
• Social media monitoring: Analyze public opinion towards brands, products, or events.
• Customer reviews: Understand customer satisfaction and identify areas for improvement.
• Market research: Gauge audience sentiment towards specific topics or products.
• Spam filtering: Identify and filter out spam emails with negative or promotional tones.
Techniques:
• Lexicon-based approach: Uses pre-defined
dictionaries of words with positive, negative, and
neutral sentiment scores. The overall sentiment is
calculated based on the sentiment scores of the words
in the text.
• Machine learning: Trains models on labeled data (text
with known sentiment) to automatically classify new
text. Popular algorithms include Naive Bayes, Support
Vector Machines (SVM), and Logistic Regression.
• Deep learning: Utilizes neural networks like Recurrent
Neural Networks (RNNs) and Long Short-Term
Memory (LSTM) networks to capture complex
relationships between words and improve sentiment
classification accuracy.
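As a small illustration of the lexicon-based approach, NLTK ships the VADER analyzer; the sketch below assumes NLTK is installed and downloads the vader_lexicon resource on first use.

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely love this product!"))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
# A compound score above 0 suggests positive sentiment, below 0 negative.
```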
Building Sentiment Analysis Models
1. Data Preparation:
Collect a dataset of text samples with labeled sentiment (positive,
negative, or neutral).
Preprocess the text by cleaning it (removing noise, punctuation, stop
words) and potentially normalizing it (lowercasing,
stemming/lemmatization).
2. Feature Engineering:
For machine learning models, create features that represent the text. This could involve:
Bag-of-Words (BoW): Represent the text as a vector where each element indicates the frequency of a word in the vocabulary.
TF-IDF: Assigns weights to words based on their importance within the document and across the corpus.
Word Embeddings: Represent words as numerical vectors capturing semantic relationships.
3. Model Training:
1. Choose a suitable machine learning or deep learning algorithm for sentiment classification.
2. Train the model on your labeled data.
3. Evaluate the model's performance on a separate test dataset.
4. Evaluation:
Use metrics like accuracy, precision, recall, and F1-score to assess the model's performance. Fine-tune the model or explore different algorithms if performance is not satisfactory.
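A minimal end-to-end sketch of steps 1-4 with scikit-learn (an illustration, not a prescribed implementation); the tiny inline dataset and its labels are purely made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["I love this phone", "Terrible battery life", "Great camera and screen",
         "Worst purchase ever", "Absolutely fantastic value", "Not worth the money"]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# 1. Data preparation: split the labeled samples into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

# 2-3. Feature engineering (TF-IDF) and model training (logistic regression).
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluation: precision, recall, and F1 on the held-out test set.
print(classification_report(y_test, model.predict(X_test)))
```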
Interpreting Sentiment Analysis Results
Sentiment analysis models assign a sentiment score or class (positive, negative, neutral) to a piece of text. It's crucial to understand the limitations:
Models might misclassify sarcasm, irony, or complex emotions.
Contextual information beyond the text itself might be needed for accurate interpretation.
Cont..
Use the results as an indicator of overall
sentiment but don't rely solely on them for
drawing definitive conclusions. Analyze
the data with a critical eye and consider
the context in which the text was written.
Theoretical Explanation
Sentiment analysis builds upon the field of Natural Language Processing (NLP) and leverages various techniques from machine learning and deep learning:
Linguistics: Sentiment analysis relies on understanding the emotional connotation of words and phrases.
Machine Learning: Algorithms learn patterns from labeled data to classify new text samples.
Deep Learning: Deep neural networks can capture complex relationships between words and context, improving classification accuracy.
1. Introduction to Topic Modeling and Latent Dirichlet Allocation (LDA)
Topic modeling is an unsupervised technique for discovering the abstract themes (topics) that run through a collection of documents.
Cont...
Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm. Here's the basic idea:
• Each document is assumed to be a mixture of various topics in different proportions.
• Each topic is represented by a probability distribution over words in the vocabulary.
LDA analyzes the documents in a corpus and tries to discover these underlying topics and their distribution across documents.
3. Evaluating Topic Models and Selecting the Optimal Number of Topics
There's no single "best" number of topics for LDA. Here are some approaches to guide your selection:
• Perplexity: LDA calculates perplexity, a measure of how well the model fits unseen data. Lower perplexity often indicates a better fit. However, it can be sensitive to model parameters.
• Topic Coherence: Evaluate how well the words within a topic are semantically related. Various metrics like the coherence score (CoherenceModel in Gensim) can help assess this.
• Domain Knowledge: Consider your understanding of the domain and the expected number of relevant themes within the documents.
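A hedged Gensim sketch that fits a small LDA model and computes a coherence score with CoherenceModel; the toy corpus is far too small for meaningful topics, and num_topics=2 is an arbitrary illustrative choice.

```python
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

texts = [["cat", "dog", "pet", "food"],
         ["dog", "walk", "park", "pet"],
         ["stock", "market", "price", "trade"],
         ["market", "trade", "invest", "price"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)          # each topic is a weighted distribution over words

# Topic coherence (c_v) as one signal for choosing the number of topics.
coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("coherence:", coherence.get_coherence())
```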
4. Introduction to Text Generation Techniques
Text generation aims to create coherent and realistic sequences of words, similar
to human-written text. Here are two common approaches:
1. Markov Chains:
A Markov chain is a statistical model that predicts the next word based on the
probability of it appearing after a specific sequence of preceding words (n-grams).
Simple and computationally efficient, but generated text can be repetitive and lack
long-range coherence.
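A tiny pure-Python sketch of Markov-chain (bigram) text generation; the sample corpus and the generated output are purely illustrative.

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog chased the cat around the mat".split()

# Build a table mapping each word to the words observed to follow it.
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start_word, length=10):
    word = start_word
    words = [word]
    for _ in range(length - 1):
        followers = transitions.get(word)
        if not followers:                  # dead end: no observed follower
            break
        word = random.choice(followers)    # sample the next word by observed frequency
        words.append(word)
    return " ".join(words)

print(generate("the"))
```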
2. Recurrent Neural Networks (RNNs):
RNNs are a type of neural network architecture specifically designed for sequential data like text. They can learn complex relationships between words across longer sequences, leading to more sophisticated and grammatically correct text generation. However, training RNNs often requires large datasets and significant computational resources.
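A hedged sketch of an RNN next-word model in Keras (assuming TensorFlow is installed); the vocabulary size and sequence length are arbitrary illustrative values, and the data preparation needed for training is omitted.

```python
from tensorflow.keras import layers, models

vocab_size = 5000   # hypothetical vocabulary size
seq_len = 20        # length of the input word sequences

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),                 # map token IDs to dense vectors
    layers.SimpleRNN(128),                            # or layers.LSTM(128) for longer-range context
    layers.Dense(vocab_size, activation="softmax"),   # probability distribution over the next word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
# Training would look like model.fit(X, y, epochs=...), where X holds token-ID sequences
# of length seq_len and y holds the ID of the word that follows each sequence.
```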
