Natural language processing (NLP) concerns the interaction between human language and computers: making computers understand and generate human language. It covers tasks such as translation, summarization, named entity recognition, and relationship extraction. The input and output of an NLP system may be speech or written text, and the field spans both natural language understanding and natural language generation. Understanding involves analyzing syntax, semantics, and pragmatics; generation involves choosing words and forming sentences. Key challenges include ambiguity and resolving context and references.
2. INTRODUCTION
Natural language processing (NLP) describes the
interaction between human language and computers.
NLP is all about making computers understand and
generate human language.
NLP refers to AI methods for communicating with an
intelligent system in a natural language such as
English.
NLP helps developers perform tasks like translation,
summarization, named entity recognition, relationship
extraction, speech recognition, topic segmentation, etc.
4. WHEN IS NLP REQUIRED?
NLP is required when you want an intelligent system such as a robot to
perform as per your instructions, when you want to hear a
decision from a dialogue-based clinical expert system,
etc.
The input and output of an NLP system can be −
Speech
Written Text
NLP encompasses everything a computer needs
to understand natural language (typed or spoken) and
also to generate natural language.
5. Natural Language Understanding (NLU)
The computer's ability to understand what we say
Mapping the given input in natural language into useful representations.
Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of natural
language from some internal representation.
Text planning − retrieving the relevant content from the knowledge base.
Sentence planning − choosing the required words, forming meaningful phrases, and setting the tone of
the sentence.
Text realization − mapping the sentence plan onto sentence structure.
NLU is harder than NLG
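The three NLG stages above can be sketched as a toy pipeline. The knowledge base, topic name, and rules here are all invented for illustration, not taken from any library:

```python
# Toy sketch of the three NLG stages: text planning, sentence planning,
# and text realization. All data and names are illustrative.

KNOWLEDGE_BASE = {
    "train_status": {"subject": "the train", "predicate": "late"},
}

def text_planning(topic):
    """Text planning: retrieve the relevant content from the knowledge base."""
    return KNOWLEDGE_BASE[topic]

def sentence_planning(content):
    """Sentence planning: choose words and form a meaningful phrase."""
    return [content["subject"], "is", content["predicate"]]

def text_realization(plan):
    """Text realization: map the sentence plan onto a surface sentence."""
    sentence = " ".join(plan)
    return sentence[0].upper() + sentence[1:] + "."

print(text_realization(sentence_planning(text_planning("train_status"))))
# -> The train is late.
```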
6. DIFFICULTIES IN NLU
NL has an extremely rich form and structure.
It is very ambiguous.
There can be different levels of ambiguity −
Lexical ambiguity
Syntax Level ambiguity
Referential ambiguity
1) Lexical ambiguity (also called homonymy or semantic ambiguity) − ambiguity within a word.
It occurs when a word in the sentence has two or more possible
meanings.
Ex1:
My sister saw a bat.
This example has four different meanings:
My sister saw a bat (saw as the past tense of see) (bat the flying animal)
My sister saw a bat (saw as the past tense of see) (bat the wooden baseball bat)
My sister saw a bat (saw as in cutting with a saw) (bat the flying animal)
My sister saw a bat (saw as in cutting with a saw) (bat the wooden baseball bat)
Ex2:
The boy carries the light box.
This example has three different meanings:
(light) not a heavy box
(light) a box that has an electric lamp
(light) a shiny box
7. 2) Syntax-level ambiguity (also structural or grammatical ambiguity) − ambiguity within a sentence or
sequence of words.
A sentence can be parsed in different ways.
It occurs in the sentence because the sentence structure leads to two or more possible meanings.
Example (1):
I invited the person with the microphone.
This example has two different meanings:
I used the microphone to invite the person.
I invited the person who has the microphone.
Example (2): The turkey is ready to eat.
This example has two different meanings:
I cooked the turkey, and it is ready to be eaten
The turkey bird itself is ready to eat some food.
8. 3) Referential ambiguity − ambiguity in what a pronoun or referring expression points to.
We make reference to a certain entity, but the expression could be pointing to
more than one entity.
Referential ambiguity can result from the presence of pronouns.
For example: The boy told his father about the theft. He was very upset.
"He" is referentially ambiguous because it can refer to either the boy or the father.
For example, Rima went to Gauri. She said, “I am tired.”
− Exactly who is tired?
One input can have several different meanings.
Many inputs can mean the same thing.
9. TERMS OF NLP
Phonology − the study of how sounds are organized systematically.
E.g., for the letter "t" in "bet",
the vocal cords stop vibrating to produce the "t" sound, with the tongue behind the teeth cutting off the flow of air.
Morphology − the study of how words are constructed from primitive meaningful units.
Morphology focuses on how the components within a word (stems, root words, prefixes, suffixes, etc.) are arranged or
modified to create different meanings.
Adding "-s" or "-es" to the end of a noun marks plurality.
Adding "-d" or "-ed" to a verb marks past tense.
The suffix "-ly" added to an adjective creates an adverb − "happy" [adjective] and "happily" [adverb].
Morpheme − the primitive unit of meaning in a language;
the smallest meaningful part of a word.
E.g., the parts "un-", "break", and "-able" in the word "unbreakable".
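The suffix rules above can be turned into a tiny illustrative analyzer. This is a naive sketch, not a real stemmer: the rule list is invented, and it does not undo spelling changes (so "happily" yields the stem "happi"):

```python
# Naive morphological analysis: split a word into stem + grammatical
# feature using the suffix rules mentioned above. Illustrative only.

SUFFIX_RULES = [
    ("ies", "plural"),       # check longer suffixes before shorter ones
    ("es", "plural"),
    ("ed", "past tense"),
    ("ly", "adverb"),
    ("s", "plural"),
]

def analyze(word):
    for suffix, feature in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], feature
    return word, None            # no known suffix found

print(analyze("waited"))   # ('wait', 'past tense')
print(analyze("happily"))  # ('happi', 'adverb')
print(analyze("books"))    # ('book', 'plural')
```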
Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role
of words in the sentence and in phrases.
Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analyzing natural
language with the rules of a formal grammar
Subject, verb, noun, etc
Syntax refers to the principles and rules that govern the sentence structure of any individual
language.
10. Semantics − It is concerned with the meaning of words and how to combine words into meaningful
phrases and sentences.
For example, it understands that a text is about “politics” and “economics” even if it doesn’t
contain the actual words but related concepts such as “election,” “Democrat,” “speaker of the
house,” or “budget,” “tax” or “inflation.”
E.g., "colorless green idea" would be rejected by semantic analysis, because
"colorless" and "green" make no sense together here.
Word Sense Disambiguation
The word “orange,” for example, can refer to a color, a fruit, or even a city in Florida!
The same happens with the word “date,” which can mean either a particular day of the month, a
fruit, or a meeting.
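Disambiguation is often done by comparing the surrounding context with words typical of each sense. Below is a simplified, Lesk-style sketch; the sense inventory and signature words are invented for illustration:

```python
# Simplified word sense disambiguation: pick the sense of "date" whose
# signature words overlap most with the surrounding context. A toy
# version of the overlap idea; the sense inventory is invented.

SENSES = {
    "date": {
        "day":     {"calendar", "month", "day", "year"},
        "fruit":   {"palm", "sweet", "eat", "dried"},
        "meeting": {"romantic", "dinner", "movie", "couple"},
    }
}

def disambiguate(word, context_words):
    context = set(context_words)
    # choose the sense with the largest signature/context overlap
    best_sense, _ = max(SENSES[word].items(),
                        key=lambda item: len(item[1] & context))
    return best_sense

print(disambiguate("date", ["we", "ate", "a", "sweet", "dried", "date"]))
# -> fruit
```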
11. Pragmatics − It deals with using and understanding sentences in different situations and how the
interpretation of the sentence is affected.
John saw Mary in a garden with a cat.
Here we cannot tell whether John is with the cat or Mary is with the cat.
E.g., “close the window?” should be interpreted as a request instead of an order.
Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the
next sentence.
For example, the word “that” in the sentence “He wanted that” depends upon the prior discourse
context.
World Knowledge − the general knowledge about the world.
World knowledge is the everyday knowledge that all speakers share about the world.
It includes general knowledge about the structure of the world.
16. Skills
Survey Analytics
Social Media Monitoring
Descriptive Analytics − pros, cons
18. COMPONENTS OF NLP /
STEPS IN NLP
Morphological and Lexical Analysis
Syntactic Analysis
Semantic Analysis
Discourse Integration
Pragmatic Analysis
21. GRAMMARS AND LANGUAGES
Languages - set of strings from an alphabet
Symbols
Alphabets
Strings
Words
Symbol – a character / abstract entity that has no meaning by itself
Eg. letters (a–z), digits (0–9) and special characters ($, %, ^, &, etc.)
Alphabet – a finite set of symbols – denoted by Σ (sigma)
A={0,1} – A is an alphabet of two symbols 0 and 1
C={a,b,c} – C is an alphabet of three symbols a, b, c
D={!,@} – D is an alphabet of two symbols ! and @
String or word – a finite sequence of symbols from an alphabet
0110 and 1110 – strings from the alphabet A above
aabbcc and ab – strings from the alphabet C above
!@ and @!@ – strings from the alphabet D above
Language – a set of strings from an alphabet
Formal language (or simply language) – a set of strings over some finite alphabet (L over Σ)
– it is described using formal grammars
22. GRAMMARS
G = <T, N, S, R>
T is the set of terminals (the lexicon)
N is the set of non-terminals
For NLP, we usually distinguish a set P ⊂ N of pre-terminals, which
always rewrite as terminals.
S is the start symbol (one of the non-terminals)
R is the set of rules/productions of the form X → γ,
where X is a non-terminal and γ is a sequence of
terminals and non-terminals (may be empty).
• A grammar G generates a language L
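The 4-tuple above can be written down directly as data. A minimal sketch in Python, using a tiny illustrative fragment of English (the specific terminals and rules are invented for the example):

```python
# The 4-tuple G = <T, N, S, R> rendered as a plain Python structure.
# The rules are a tiny illustrative fragment, not a real grammar.

grammar = {
    "T": {"the", "giraffe", "dreams"},         # terminals (lexicon)
    "N": {"s", "np", "vp", "det", "n", "iv"},  # non-terminals
    "S": "s",                                  # start symbol
    "R": [                                     # productions X -> gamma
        ("s",   ["np", "vp"]),
        ("np",  ["det", "n"]),
        ("vp",  ["iv"]),
        ("det", ["the"]),                      # pre-terminals rewrite
        ("n",   ["giraffe"]),                  #   directly as terminals
        ("iv",  ["dreams"]),
    ],
}

# Sanity check: every rule's left side is a non-terminal, and every
# right-side symbol is a terminal or a non-terminal.
for lhs, rhs in grammar["R"]:
    assert lhs in grammar["N"]
    assert all(sym in grammar["T"] | grammar["N"] for sym in rhs)
print("grammar is well-formed")
```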
23. GRAMMATICAL STRUCTURE
1. sentence,
2. constituent,
3. phrase,
4. classification
5. structural rule
1) Sentence (S) − a sentence is a string of words
satisfying the grammatical rules of a language.
− Sentence is often abbreviated to "S".
Sentences are classified as:
simple
compound
complex
24. Simple - A simple sentence has the most basic elements that make it a sentence: a subject,
a verb, and a completed thought.
Ex. Joe waited for the train.
"Joe" = subject, "waited" = verb
Mary and Samantha took the bus.
"Mary and Samantha" = compound subject, "took" = verb
Compound - a sentence made up of two independent clauses (or complete sentences)
connected to one another with a coordinating conjunction. Coordinating conjunctions are
easy to remember if you think of the words "FAN BOYS":
For
And
Nor
But
Or
Yet
So
Ex: Joe waited for the train, but the train was late.
Mary and Samantha arrived at the bus station before noon, and they left on the bus before I arrived.
25. Complex − A complex sentence joins an independent clause with one or
more dependent clauses.
Dependent clauses begin with subordinating conjunctions such as:
after
although
as
because
before
even though
if
since
though
unless
until
when
whenever
whereas
wherever
while
The dependent clauses can go first in the sentence, followed by the
independent clause, as in the following:
Tip: When the dependent clause comes first, a comma should be used
to separate the two clauses.
Because Mary and Samantha arrived at the bus station before noon, I
did not see them at the station.
While he waited at the train station, Joe realized that the train was late.
After they left on the bus, Mary and Samantha realized that Joe was
waiting at the train station.
27. 2) Constituent − a syntactic arrangement that consists of parts, usually two, called
"constituents".
Example: the phrase "the man" is a construction consisting of two constituents, "the"
and "man".
29. SYNTACTIC PROCESSING
Grammar is essential for
describing the syntactic structure of well-formed
sentences.
Syntax refers to the structure of phrases and
the relation of words to each other within the
phrase.
This applies to natural languages like English, Hindi,
etc.
30. CONCEPT OF PARSER
A parser is used to implement parsing.
Input − text.
Output − a structural representation of the input, after
checking that the syntax is correct as per the grammar.
It also builds a data structure in the form of a parse
tree, abstract syntax tree, derivation
tree, concrete syntax tree, or other
hierarchical structure.
31. PARSER
A parse tree is defined as the graphical depiction of a derivation.
The start symbol is the root of the parse tree.
A parse tree has terminal (leaf) nodes and non-terminal
nodes (interior nodes).
34. PHRASE STRUCTURE RULES
Phrasal categories include: noun phrase, verb
phrase, prepositional phrase.
Lexical categories include: noun, verb, adjective,
adverb, and others.
Phrase structure rules are usually of the form A → B C
36. SYNTACTIC CATEGORIES (COMMON DENOTATIONS) IN
NLP
• np - noun phrase
• vp - verb phrase
• s - sentence
• det - determiner (article)
• n - noun
• tv - transitive verb (takes an object)
• iv - intransitive verb
• prep - preposition
• pp - prepositional phrase
• adj - adjective
37. SENTENCE USING PHRASE STRUCTURE
Every sentence has an internal structure.
Algorithm:
Apply the rules to a proposition.
The base proposition is S (the root, i.e. the
sentence).
The first production rule is (NP = noun phrase,
VP = verb phrase):
S → NP VP
Apply rules for the branches:
NP → noun, VP → verb NP
The verb and noun are expanded to terminal nodes, which could be
any word in the lexicon.
The end result is a tree with the words as terminal nodes,
which is referred to as the sentence.
AST –Abstract Syntax Tree
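The expansion algorithm above can be sketched in Python. The rules and lexicon are an invented toy fragment; the tree is built as nested tuples with the words at the leaves:

```python
# Sketch of the expansion algorithm above: rewrite S with the
# production rules until every branch ends in a word from the (toy)
# lexicon, yielding a tree with the words as terminal nodes.

RULES = {
    "S":  ("NP", "VP"),
    "NP": ("noun",),
    "VP": ("verb", "NP"),
}
LEXICON = {"noun": ["bird", "food"], "verb": ["pecks"]}

def build_tree(symbol, words):
    """Expand `symbol`, consuming the next matching words from `words`."""
    if symbol in LEXICON:                  # pre-terminal: attach a word
        word = words.pop(0)
        assert word in LEXICON[symbol], f"{word!r} is not a {symbol}"
        return (symbol, word)
    children = [build_tree(child, words) for child in RULES[symbol]]
    return (symbol, *children)

print(build_tree("S", ["bird", "pecks", "food"]))
# ('S', ('NP', ('noun', 'bird')),
#  ('VP', ('verb', 'pecks'), ('NP', ('noun', 'food'))))
```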
50. PARSE TREE
A parse tree is the graphical representation of symbols,
which can be terminals or non-terminals.
In parsing, the string is derived using the start symbol, and
the root of the parse tree is that start symbol.
The parse tree follows the precedence of operators: the
deepest sub-tree is traversed first, so the operator in the
parent node has lower precedence than the operator in
the sub-tree.
The parse tree follows these points:
All leaf nodes have to be terminals.
All interior nodes have to be non-terminals.
In-order traversal gives the original input string.
52. TYPES OF GRAMMAR
According to Noam Chomsky, there are four types of grammars − Type 0, Type 1,
Type 2, and Type 3. The following table shows how they differ from each other −
Grammar Type    Grammar Accepted             Language Accepted                  Automaton
Type 0          Unrestricted grammar         Recursively enumerable language    Turing machine
Type 1          Context-sensitive grammar    Context-sensitive language         Linear-bounded automaton
Type 2          Context-free grammar         Context-free language              Pushdown automaton
Type 3          Regular grammar              Regular language                   Finite state automaton
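As a rough illustration of the hierarchy, rule shapes can be inspected mechanically. The sketch below only distinguishes right-linear regular rules (Type 3) from general context-free rules (Type 2), and lumps everything else together; it is not a full classifier:

```python
# Rough classifier for rule shapes in the hierarchy above. It only
# distinguishes right-linear regular rules (Type 3) from general
# context-free rules (Type 2); anything else is reported as Type 0/1.

def classify(rules, nonterminals):
    """rules: list of (lhs, rhs), each side a list of symbols."""
    if any(len(lhs) != 1 or lhs[0] not in nonterminals for lhs, rhs in rules):
        return "Type 0 or Type 1"       # lhs is not a single non-terminal

    def right_linear(rhs):
        # terminals, optionally followed by exactly one non-terminal
        body = [s for s in rhs if s not in nonterminals]
        tail = [s for s in rhs if s in nonterminals]
        return len(tail) <= 1 and rhs == body + tail

    if all(right_linear(rhs) for _, rhs in rules):
        return "Type 3 (regular)"
    return "Type 2 (context-free)"

N = {"S", "A"}
print(classify([(["S"], ["a", "S"]), (["S"], ["b"])], N))  # Type 3 (regular)
print(classify([(["S"], ["a", "S", "b"])], N))             # Type 2 (context-free)
```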
55. CONTEXT FREE GRAMMAR
A context-free grammar (CFG) is a list of rules that define the set of all
well-formed sentences in a language.
Each rule has a left-hand side, which identifies a syntactic category, and
a right-hand side, which defines its alternative component parts, reading
from left to right.
E.g., the rule s --> np vp means that "a sentence is defined as a noun phrase
followed by a verb phrase." Figure 1 shows a simple CFG that describes
the sentences from a small subset of English.
58. A sentence in the language defined by a CFG is a series of words that can be derived by
systematically applying the rules, beginning with a rule that has s on its left-hand side.
A parse of the sentence is a series of rule applications in which a syntactic category is replaced
by the right-hand side of a rule that has that category on its left-hand side, and the final
rule application yields the sentence itself. E.g., a parse of the sentence "the giraffe dreams" is:
s => np vp => det n vp => the n vp => the giraffe vp => the giraffe iv => the giraffe dreams
Figure 1 shows a parse tree for the sentence "the giraffe dreams". Note
that the root of every subtree has a grammatical category that appears on the left-hand side of
a rule, and the children of that root are identical to the elements on the right-hand side of that
rule.
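Because each category in this fragment has exactly one rule, the derivation above is deterministic and easy to reproduce in code. A sketch, with the rule set transcribed from the text:

```python
# Reproduce the leftmost derivation of "the giraffe dreams" from the
# CFG fragment in the text (one rule per category, so deterministic).

RULES = {
    "s":   ["np", "vp"],
    "np":  ["det", "n"],
    "vp":  ["iv"],
    "det": ["the"],
    "n":   ["giraffe"],
    "iv":  ["dreams"],
}

def leftmost_derivation(start):
    form = [start]
    steps = [" ".join(form)]
    while any(sym in RULES for sym in form):
        # find the leftmost non-terminal and rewrite it
        i = next(idx for idx, sym in enumerate(form) if sym in RULES)
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(" ".join(form))
    return steps

print(" => ".join(leftmost_derivation("s")))
# s => np vp => det n vp => the n vp => the giraffe vp
#   => the giraffe iv => the giraffe dreams
```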
62. Example
Obtain the leftmost derivation for the string
aaabbabbba using the following grammar:
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
63. S ⇒lm aB
⇒ aaBB (by applying B → aBB)
⇒ aaaBBB (by applying B → aBB)
⇒ aaabBB (by applying B → b)
⇒ aaabbB (by applying B → b)
⇒ aaabbaBB (by applying B → aBB)
⇒ aaabbabB (by applying B → b)
⇒ aaabbabbS (by applying B → bS)
⇒ aaabbabbbA (by applying S → bA)
⇒ aaabbabbba (by applying A → a)
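A derivation like the one above can be checked mechanically: every step must rewrite the leftmost non-terminal using one of the grammar's rules. A small sketch:

```python
# Verify the leftmost derivation above step by step: each step must
# rewrite the leftmost non-terminal with one of the grammar's rules.

RULES = {
    "S": ["aB", "bA"],
    "A": ["aS", "bAA", "a"],
    "B": ["bS", "aBB", "b"],
}

def leftmost_step_ok(before, after):
    i = next((k for k, c in enumerate(before) if c in RULES), None)
    if i is None:
        return False                       # nothing left to rewrite
    return any(before[:i] + rhs + before[i + 1:] == after
               for rhs in RULES[before[i]])

derivation = ["S", "aB", "aaBB", "aaaBBB", "aaabBB", "aaabbB",
              "aaabbaBB", "aaabbabB", "aaabbabbS", "aaabbabbbA",
              "aaabbabbba"]

assert all(leftmost_step_ok(x, y) for x, y in zip(derivation, derivation[1:]))
print("leftmost derivation of", derivation[-1], "verified")
```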
64. Example:
Is the following grammar ambiguous?
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
Generate the string aabbab, and show the
derivation using leftmost derivation.
S ⇒lm aB
⇒ aaBB (by applying B → aBB)
⇒ aabSB (by applying B → bS)
65. ⇒ aabbAB (by applying S → bA)
⇒ aabbaB (by applying A → a)
⇒ aabbab (by applying B → b)
Derivation tree (S is the root node; indentation shows children):
S
  a
  B
    a
    B
      b
      S
        b
        A
          a
    B
      b
Reading the leaves left to right gives aabbab.
67. DETERMINISTIC AND NON-DETERMINISTIC
PARSERS
A deterministic algorithm produces only a single output for the same input,
even on different runs.
A non-deterministic algorithm can produce different outputs for the same input
on different executions:
a non-deterministic algorithm travels various routes to arrive at different
outcomes.
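One common way a parser copes with nondeterminism is backtracking: try each rule alternative in turn and undo choices that fail. A sketch of a backtracking recognizer, using the S → aB | bA grammar from the earlier examples (the pruning conditions rely on every right-hand side containing exactly one terminal, which holds for this grammar):

```python
# Nondeterministic choice simulated by backtracking: try every rule
# alternative for the leftmost non-terminal and abandon branches that
# cannot reach the target string.

RULES = {
    "S": ["aB", "bA"],
    "A": ["aS", "bAA", "a"],
    "B": ["bS", "aBB", "b"],
}

def derives(form, target):
    if form == target:
        return True
    # prune: too many terminals already, or form longer than target
    terminals = form.replace("A", "").replace("B", "").replace("S", "")
    if len(terminals) > len(target) or len(form) > len(target):
        return False
    i = next((k for k, c in enumerate(form) if c in RULES), None)
    if i is None:
        return False                       # all terminals, but != target
    return any(derives(form[:i] + rhs + form[i + 1:], target)
               for rhs in RULES[form[i]])

print(derives("S", "aabbab"))   # True
print(derives("S", "aa"))       # False
```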
71. Recursive Transition Network
Natural Language Processing Course, Parsing, Ahmad Abdollahzadeh, Computer Engineering Faculty, Amirkabir University of Technology,
1381.
Simple transition networks are often called finite state machines (FSMs).
Finite state machines are equivalent in expressive power to regular
grammars, and thus are not powerful enough to describe all languages
that can be described by a CFG.
To get the descriptive power of CFGs, you need a notion of recursion
in the network grammar.
A recursive transition network (RTN) is like a simple transition
network, except that it allows arc labels to refer to other networks
as well as word categories.
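A minimal sketch of the idea, assuming an invented two-network fragment (S and NP) and a toy lexicon. The recursion is the arc labeled NP inside the S network, which refers to another network rather than a word category:

```python
# Sketch of a recursive transition network: arc labels are either word
# categories or the names of other networks (that is the recursion).
# The networks and lexicon are an invented toy fragment.

LEXICON = {"the": "det", "dog": "n", "cat": "n", "saw": "v"}

# Each network: state -> list of (label, next_state); "end" accepts.
NETWORKS = {
    "S":  {"q0": [("NP", "q1")], "q1": [("v", "q2")], "q2": [("NP", "end")]},
    "NP": {"q0": [("det", "q1")], "q1": [("n", "end")]},
}

def traverse(net, state, words):
    """Yield the remaining words for every way `net` can be traversed."""
    if state == "end":
        yield words
        return
    for label, nxt in NETWORKS[net].get(state, []):
        if label in NETWORKS:                        # arc names a network:
            for rest in traverse(label, "q0", words):  # recurse into it
                yield from traverse(net, nxt, rest)
        elif words and LEXICON.get(words[0]) == label:  # word-category arc
            yield from traverse(net, nxt, words[1:])

def accepts(sentence):
    return any(rest == [] for rest in traverse("S", "q0", sentence))

print(accepts("the dog saw the cat".split()))   # True
print(accepts("the saw dog".split()))           # False
```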
75. Try yourself
I) Is the following grammar ambiguous?
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
Generate the string aabbab, and show the
derivation using leftmost and rightmost derivation.
II) Construct the parse tree for the above grammar.
III) Construct the parse tree for the following sentence:
The bird pecks the food.
IV) Construct the parse tree for the following grammar:
T → T + T | T - T
T → a | b | c
Input:
a - b + c
V) Construct the parse tree for:
E * E / E