This document introduces natural language processing (NLP): its history, goals, challenges, and applications. It discusses how NLP aims to help machines perform language tasks such as translation, summarization, and question answering. Although language is complex, NLP draws on techniques from linguistics, machine learning, and computer science to build tools that analyze, understand, and generate human language.
2. The Dream
•It’d be great if machines could
–Process our email (usefully)
–Translate languages accurately
–Help us manage, summarize, and aggregate information
–Use speech as a UI (when needed)
–Talk to us / listen to us
•But they can’t:
–Language is complex, ambiguous, flexible, and subtle
–Good solutions need linguistics and machine learning knowledge
•So:
3. The Mystery
• What’s now impossible for computers (and any other species) to do is effortless for humans.
4. The Mystery (continued)
• Patrick Suppes, eminent philosopher, in his 1978
autobiography:
“…the challenge to psychological theory made by linguists to
provide an adequate theory of language learning may well
be regarded as the most significant intellectual challenge to
theoretical psychology in this century.”
• So far, this challenge is still unmet in the 21st century
• Natural language processing (NLP) is the discipline in which we study the tools that bring us closer to meeting this challenge
5. What is NLP?
• Fundamental goal: deep understanding of broad language
• Not just string processing or keyword matching!
6. What is NLP?
• Computers use (analyze, understand, generate) natural language
• Text Processing
• Lexical: tokenization, part-of-speech tags, heads, lemmas
• Parsing and chunking
• Semantic tagging: semantic role, word sense
• Certain expressions: named entities
• Discourse: coreference, discourse segments
• Speech Processing
• Phonetic transcription
• Segmentation (punctuation)
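Several of the lexical tasks above can be approximated with a few lines of regular-expression code. A minimal tokenizer sketch (the pattern and example sentence are illustrative, not from the slides; real tokenizers also handle abbreviations like "Dr."):

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens.
    Keeps simple contractions like "can't" as a single token."""
    return re.findall(r"\w+'\w+|\w+|[^\w\s]", text)

print(tokenize("Dr. Smith can't parse this."))
# -> ['Dr', '.', 'Smith', "can't", 'parse', 'this', '.']
```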
7. History of NLP
• First introduced in the 1950s by Alan Turing.
• Georgetown experiment in 1954: more than sixty Russian sentences automatically translated into English.
• Up to the 1980s, NLP was governed by hand-written rules only.
• From the 1980s onward, the introduction of ML gave NLP new dimensions.
• In recent years, there has been a flurry of results from deep learning techniques.
8. Why Should You Care?
Trends
1. An enormous amount of knowledge is now available in machine-readable form as natural language text
2. Conversational agents are becoming an important form of human-computer communication
3. Much of human-human communication is now mediated by computers
9. Motivation for NLP
• Understand language analysis & generation
• Communication
• Language is a window to the mind
• Data is in linguistic form
• Data can be structured (table form), semi-structured (XML form), or unstructured (sentence form).
10. Components of NLP
There are two components of NLP:
1. Natural Language Understanding (NLU)
Understanding involves the following tasks:
1. Mapping the given input in natural language into useful representations.
2. Analysing different aspects of the language.
2. Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in natural language from some internal representation. It involves:
1. Text planning − retrieving the relevant content from the knowledge base.
2. Sentence planning − choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
3. Text realization − mapping the sentence plan into sentence structure.
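The three NLG stages above can be sketched as a tiny pipeline. The knowledge base, phrasing, and function names below are made up for illustration:

```python
# Hypothetical knowledge base for a weather report.
knowledge_base = {"city": "Paris", "temp_c": 24, "sky": "clear"}

def text_planning(kb):
    # Text planning: retrieve the relevant content from the knowledge base.
    return {"city": kb["city"], "temp_c": kb["temp_c"], "sky": kb["sky"]}

def sentence_planning(plan):
    # Sentence planning: choose words and form meaningful phrases.
    return ["the weather in", plan["city"], "is", plan["sky"],
            "at", f'{plan["temp_c"]} °C']

def text_realization(phrases):
    # Text realization: map the sentence plan into a surface sentence.
    sentence = " ".join(phrases)
    return sentence[0].upper() + sentence[1:] + "."

print(text_realization(sentence_planning(text_planning(knowledge_base))))
# -> The weather in Paris is clear at 24 °C.
```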
12. Language Processing
• Level 1 – Speech sound (Phonetics & Phonology)
• Level 2 – Words & their forms (Morphology, Lexicon)
• Level 3 – Structure of sentences (Syntax, Parsing)
• Level 4 – Meaning of sentences (Semantics)
• Level 5 – Meaning in context & for a purpose (Pragmatics)
• Level 6 – Connected sentence processing in a larger body of text (Discourse)
13. Steps in NLP
Lexical Analysis − It involves identifying and analysing the structure of words. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
Syntactic Analysis (Parsing) − It involves analysing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words.
Semantic Analysis − It draws the exact, dictionary meaning from the text. The text is checked for meaningfulness by mapping syntactic structures onto objects in the task domain.
Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it; it also shapes the meaning of the immediately succeeding sentence.
Pragmatic Analysis − What was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge.
14. Ambiguity in Natural Language
NL has an extremely rich form and structure.
It is very ambiguous. There can be different levels of ambiguity:
• Lexical ambiguity − at the most primitive level, the word level.
For example, should the word “board” be treated as a noun or a verb?
• Syntactic ambiguity − a sentence can be parsed in different ways.
For example, “He lifted the beetle with red cap.” − Did he use the cap to lift the beetle, or did he lift a beetle that had a red cap?
• Referential ambiguity − referring to something using pronouns.
For example: Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
- One input can mean different meanings.
- Many inputs can mean the same thing.
17. Pattern Matching
Exact Pattern Matching
Problem: Find the first match of a pattern of length M in a text stream of length N (N ≫ M).
Pattern: needle (M = 6)
Text: ianahaystackanneedleina (N = 23)
Challenges:
• Brute-force is not good enough for all applications
• Theoretical challenge: Linear-time guarantee. Fundamental
Algorithmic Problem
• Practical challenge: Avoid backup in text stream. Often no room or
time to save text.
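The brute-force approach can be sketched as follows; note how a mismatch forces the search to back up and rescan the text from the next position, which is exactly what KMP avoids:

```python
def brute_force_search(pattern, text):
    """Return the index of the first match of pattern in text, or -1.
    Worst case is O(M * N) character comparisons."""
    M, N = len(pattern), len(text)
    for i in range(N - M + 1):
        j = 0
        while j < M and text[i + j] == pattern[j]:
            j += 1
        if j == M:
            return i  # full match starting at position i
        # otherwise: back up and rescan the text from position i + 1
    return -1

print(brute_force_search("needle", "ianahaystackanneedleina"))  # -> 14
```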
18. Knuth-Morris-Pratt (KMP) exact pattern-matching algorithm
Named after Don Knuth, Jim Morris, Vaughan Pratt
Classic algorithm that meets both challenges
• linear-time guarantee
• no backup in text stream
Basic plan (for binary alphabet)
• build DFA from pattern
• simulate DFA with text as input
input text → DFA for pattern → accept or reject
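A sketch of this plan in Python, generalized beyond a binary alphabet. This follows the standard DFA-based KMP construction; the lowercase-letter alphabet is an assumption for the sketch:

```python
def build_dfa(pattern, alphabet):
    """dfa[c][j]: state reached after reading character c in state j."""
    M = len(pattern)
    dfa = {c: [0] * M for c in alphabet}
    dfa[pattern[0]][0] = 1
    x = 0  # restart state
    for j in range(1, M):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]       # copy mismatch transitions
        dfa[pattern[j]][j] = j + 1      # set match transition
        x = dfa[pattern[j]][x]          # update restart state
    return dfa

def kmp_search(pattern, text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Scan the text once, never backing up; linear time guaranteed."""
    dfa, state, M = build_dfa(pattern, alphabet), 0, len(pattern)
    for i, c in enumerate(text):
        state = dfa[c][state]
        if state == M:                  # DFA reached its accept state
            return i - M + 1
    return -1

print(kmp_search("needle", "ianahaystackanneedleina"))  # -> 14
```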
19. Regular-expression pattern matching
Search for occurrences of one of multiple patterns in a text file
Ex. (genomics)
• Fragile X syndrome is a common cause of mental retardation.
• human genome contains triplet repeats of cgg or agg
• bracketed by gcg at the beginning and ctg at the end
• number of repeats is variable, and correlated with syndrome
• use regular expression to specify pattern: gcg(cgg|agg)*ctg
• do RE pattern matching on a person’s genome to detect Fragile X
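The slide’s regular expression can be run directly with Python’s re module; the genome string below is a short made-up fragment, not real data:

```python
import re

# Pattern from the slide: gcg, then any number of cgg/agg triplets, then ctg.
fragile_x = re.compile(r"gcg(cgg|agg)*ctg")

genome = "atggcgcggcggaggcggctgtta"   # hypothetical fragment
match = fragile_x.search(genome)
if match:
    print(match.group(0), "at index", match.start())
# -> gcgcggcggaggcggctg at index 3
```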
20. GREP
GREP stands for Global Regular Expression Print (from the ed editor command g/re/p). It was introduced by Ken Thompson.
Basic Plan for GREP
• build DFA from RE
• simulate DFA with text as input
text → DFA for pattern gcg(cgg|agg)*ctg → accept or reject
21. Bayesian Method
• Uses Bayes Rule
• The Naive Bayes assumption: assume that all features are independent given the class label Y
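A minimal sketch of a Naive Bayes text classifier with Laplace smoothing; the toy training data and function names are made up for illustration:

```python
import math
from collections import Counter

# Toy training data (hypothetical): class label -> bag of words.
train = {
    "spam": "win money now free money win".split(),
    "ham":  "meeting schedule project notes meeting".split(),
}

def train_nb(data):
    vocab = {w for words in data.values() for w in words}
    priors = {y: 1 / len(data) for y in data}  # uniform class priors
    likelihood = {}
    for y, words in data.items():
        counts = Counter(words)
        # Laplace smoothing so unseen words never zero out a class.
        likelihood[y] = {w: (counts[w] + 1) / (len(words) + len(vocab))
                         for w in vocab}
    return priors, likelihood

def classify(message, priors, likelihood):
    # Naive Bayes: argmax over y of log P(y) + sum of log P(w | y),
    # treating each word as independent given the class (words outside
    # the training vocabulary are skipped).
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihood[y][w])
                                     for w in message.split()
                                     if w in likelihood[y])
        for y in priors
    }
    return max(scores, key=scores.get)

priors, likelihood = train_nb(train)
print(classify("free money", priors, likelihood))  # -> spam
```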
24. HIDDEN MARKOV MODEL
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process (a memoryless process: its future and past are independent given the present) with hidden states.
We want to find: given the past outcomes, what is the probability of any possible outcome today?
25. Example
• If the weather yesterday was rainy and today is foggy, what is the probability that tomorrow will be sunny?
• Using Bayes rule:
• For n days:
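Under the first-order Markov assumption, tomorrow depends only on today, so yesterday's rain drops out of the calculation. A sketch with an assumed transition matrix (the probabilities are invented for illustration, not from the slides):

```python
# First-order Markov chain over weather states; rows sum to 1.
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# Memoryless property: given today (foggy), yesterday's rain is irrelevant,
# so P(sunny tomorrow | foggy today, rainy yesterday) = T["foggy"]["sunny"].
p_sunny_tomorrow = T["foggy"]["sunny"]
print(p_sunny_tomorrow)  # -> 0.2

def step(dist, T):
    """Propagate a distribution over states one day forward."""
    out = {s: 0.0 for s in T}
    for today, p in dist.items():
        for tomorrow, q in T[today].items():
            out[tomorrow] += p * q
    return out

# For n days, apply the transition matrix n times; e.g. two days ahead:
two_days = step(step({"foggy": 1.0}, T), T)
print(round(two_days["sunny"], 2))  # -> 0.32
```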
28. Applications
● Fighting Spam
○ Spam filters have become important as the first line of defense
against the ever-increasing problem of unwanted email.
● Information Extraction
○ Many important decisions in financial markets are increasingly
moving away from human oversight and control. Algorithmic trading
is becoming more popular.
● Summarization
○ Ability to summarize the meaning of documents and information is
becoming increasingly important.
● NLU interfaces to databases
○ intelligent natural language database interfaces have been developed; they provide flexible options for manipulating queries
● Intelligent Web searching
○ Natural language processing has made web search more intelligent by transforming it from keyword-based to expression-based
29. Applications cont...
● Machine Translation
○ is a subfield of computational linguistics that investigates the
use of software to translate text or speech from one language
to another.
● Natural Language Generation
○ the task of generating natural language from a machine representation system such as a knowledge base or a logical form
30. Applications cont...
● Speech Recognition
○ The process of enabling a computer to identify and respond to the sounds produced in human speech.