"Natural language is the most important part of artificial
intelligence."
John Searle
"Natural language processing is a cornerstone of artificial
intelligence, allowing computers to read and understand human
language, as well as to produce and recognize speech."
Ginni Rometty
"Natural language processing is one of the most important
fields in artificial intelligence and also one of the most difficult."
Dan Jurafsky
4.
What is Natural Language Processing (NLP)?
Natural language processing is the set of methods for making human language accessible to
computers
(Jacob Eisenstein)
Natural language processing is the field at the intersection of computer science (artificial
intelligence) and linguistics
(Christopher Manning)
Making computers understand natural language so that they can perform tasks humans can do,
such as machine translation, summarization, and question answering
(Behrooz Mansouri)
6.
“Language is the ability to acquire and use complex systems of
communication, particularly the human ability to do so, and a language
is any specific example of such a system. The scientific study of
language is called linguistics.”
From Wikipedia
7.
Natural Language Processing: Terms
Natural language refers to the language that humans use to
communicate with each other, such as English, Spanish, or
Chinese
Processing
As distinguished from data processing
Question: How are data processing and natural language processing
different?
8.
Natural Language Processing: Terms
Consider the Unix wc program, which counts the total number of bytes,
words, and lines in a text file
● When used to count bytes and lines, wc is an ordinary data processing
application
● However, when it is used to count the words in a file, it requires knowledge
about what it means to be a word and thus becomes a language processing
system
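The wc distinction can be sketched in a few lines of Python. This is a rough illustration, not a reimplementation of the real wc: the function names and the second "linguistic" word definition are assumptions made here. Byte and line counts need no knowledge of language, while the word count changes as soon as the definition of a word changes.

```python
import re

def wc(text):
    """Return (lines, words, bytes), roughly the way Unix wc reports them."""
    n_bytes = len(text.encode("utf-8"))  # plain data processing
    n_lines = text.count("\n")           # plain data processing
    n_words = len(text.split())          # assumes a word is any whitespace-delimited run
    return n_lines, n_words, n_bytes

def count_words_linguistic(text):
    """A different, equally defensible definition of a word (an assumption)."""
    # A word is a run of letters, optionally joined by internal apostrophes or hyphens.
    return len(re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text))

sample = "Don't panic - it's 42.\n"
print(wc(sample))                      # whitespace view counts "-" and "42." as words
print(count_words_linguistic(sample))  # letter-based view does not
```

The two word counts disagree on the same input, which is the point of the slide: counting words already commits the program to a view of what a word is.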
9.
Computational Linguistics (CL)
• The science of doing what linguists do with language, but using computers
Natural Language Processing (NLP)
• The engineering discipline of doing what people do with language, but using
computers
Speech/Language/Text processing
Human Language Technology
10.
Natural Language Processing vs Computational
Linguistics
In linguistics, language is the object of study
● Computational methods may be brought to bear, just as in scientific disciplines like
computational biology and computational astronomy, but they play only a
supporting role
In contrast, natural language processing is focused on the design and analysis
of computational algorithms and representations for processing natural human
language
● The goal of natural language processing is to provide new computational
capabilities around human language: for example, extracting information from
texts, translating between languages, answering questions, holding a
conversation, taking instructions
11.
What does an NLP system need to
“know”?
• Language consists of many levels of structure
• Humans fluently integrate all of these in producing and
understanding language
• Ideally, so would a computer!
Question Answering
◾ What does “divergent” mean?
◾ What year was Abraham Lincoln born?
◾ How many states were in the United States that year?
◾ How much Chinese silk was exported to England at the end of the 18th century?
◾ What do scientists think about the ethics of human cloning?
Natural Language Processing
Applications
◾ Machine Translation
◾ Information Retrieval
◾ Question Answering
◾ Dialogue Systems
◾ Information Extraction
◾ Summarization
◾ Sentiment Analysis
◾ ...
Core Technologies
◾ Language modeling
◾ Part-of-speech tagging
◾ Syntactic parsing
◾ Named-entity recognition
◾ Word sense disambiguation
◾ Semantic role labeling
◾ ...
NLP lies at the intersection of computational linguistics and machine
learning.
18.
A few of the NLP Tasks
● Spell Checking, Keyword Search, Finding
Synonyms
● Part of Speech Tagging
● Extracting information from a website
○ Location, people, temporal expressions
● Classifying text
○ Sentiment analysis
● Machine translation
● Complex question answering
● Spoken dialog systems
19.
Knowledge & Information Extraction
Knowledge graphs (KGs) organize data from multiple sources, capture
information about entities of interest in a given domain or task (like people,
places or events), and forge connections between them
The Google Knowledge Graph is an
enormous database of information
that enables Google to provide
immediate, factual answers to your
questions
Text preprocessing
Preprocessing is the first, and a crucial, step of an NLP task
- The objective of preprocessing is to clean/harmonize the text,
reduce language fluctuations if necessary, and prepare the
tokens for being processed in the next steps
- Preprocessing partially addresses the problem of sparsity. Why?
We will learn:
Text cleaning/harmonization/reduction of fluctuations
- Text normalization
- Segmentation
- Stop words
- Stemming & Lemmatization
27.
Text Normalization
Normalization harmonizes the written forms of words with the
same meaning
Some examples:
- deleting periods
• U.S.A. → USA
- deleting hyphens
• anti-discriminatory → antidiscriminatory
- Accents
• French résumé → resume
- Umlauts
• German: Tuebingen → Tübingen
Sec. 2.2.3
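A minimal sketch of these normalization rules, assuming they are applied to one token at a time. The helper names are made up for this example. Diacritics are removed with Unicode decomposition from the standard library, so Tübingen becomes Tubingen; mapping the transliteration Tuebingen to the same form would need an extra German-specific rule that is not shown here.

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits "é" into "e" plus a combining accent; the combining marks are dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_token(token):
    # Naive rule (assumption): delete every period and hyphen inside the token.
    token = token.replace(".", "").replace("-", "")
    return strip_diacritics(token)

for raw in ["U.S.A.", "anti-discriminatory", "résumé", "Tübingen"]:
    print(raw, "->", normalize_token(raw))
```

Deleting every period is only safe at the token level; applied to running text it would also damage decimals and sentence boundaries.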
28.
Text Normalization
Case folding: reduce all letters to lower case
- It may cause ambiguity but is typically helpful
• General Motors vs. general motors
• Fed vs. fed
• CAT (City Airport Train) vs. cat
Longstanding Google example:
- Search C.A.T.
Do numbers, dates, etc. carry information?
- If included, the dictionary size may explode!
- Numbers and dates are commonly replaced by special tokens,
e.g.
• Numbers with <num>
• Dates with <dates>
Sec. 2.2.3
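Case folding and placeholder tokens can be combined in one small function. The sketch below is an illustration under stated assumptions: the regular expressions are hypothetical and recognize only ISO-style dates (YYYY-MM-DD) and plain integers or decimals, which is far narrower than what real text contains.

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumption: ISO dates only
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")       # integers and simple decimals

def fold_and_mask(text):
    # Dates are masked first so their digits are not replaced as separate numbers.
    text = DATE_RE.sub("<dates>", text)
    text = NUM_RE.sub("<num>", text)
    return text.lower()  # case folding: "Fed" and "fed" become the same token

print(fold_and_mask("The Fed raised rates by 0.25 points on 2023-07-26."))
```

After masking, every number maps to one dictionary entry instead of one entry per distinct value, which is how the placeholder keeps the dictionary from growing without bound.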
29.
Segmentation
Segmentation
- Splitting a compound word into tokens
French
- L'ensemble: one token or two? L? L’? Le?
German compound nouns
- Halsschlagader → Hals + Schlag + Ader?
- Compound words in German usually require a compound
splitter
- A possible algorithm (look in the link for more details):
Sec. 2.2.1
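To make the idea of a compound splitter concrete, here is a simple dictionary-based sketch. It is not the algorithm from the link mentioned on the slide: it is a generic recursive lookup written for illustration, the lexicon is a tiny hypothetical one, and German linking elements (such as the extra s in many compounds) are not handled.

```python
def split_compound(word, lexicon, min_len=3):
    """Return lexicon entries that concatenate to `word`, or None if no split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest candidate head first, then shorter ones.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None

lexicon = {"hals", "schlag", "ader"}  # hypothetical toy lexicon
print(split_compound("Halsschlagader", lexicon))
```

A production splitter would also score competing splits, for example by corpus frequency, because many compounds can be divided in more than one way.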
30.
Stop words
Stopwords
- The commonest words, like the, a, and, to, be
- They carry little or no semantic information
Stop words can also be important, especially in combination
with other words, e.g.:
- Phrases: “King of Denmark”, “To be or not to be”
- Titles, etc.: “Let it be”
- Definitional purposes: “flights to London”
Stop words are sometimes excluded from the corpus
- Commonly in bag-of-words approaches
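The trade-off can be seen in a small bag-of-words sketch. The stop list below is a tiny illustrative set chosen for this example, not a standard list, and tokenization is plain whitespace splitting.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not"}  # illustrative only

def bag_of_words(text, remove_stop_words=True):
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

print(bag_of_words("flights to London"))   # "to" is dropped, direction is lost
print(bag_of_words("To be or not to be"))  # nothing is left at all
```

With this list the phrase "To be or not to be" produces an empty bag, which is exactly the risk the slide describes.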
31.
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs (noun)
– Derivational Morphology
• Derive one word from another,
• Often change grammatical class
– build (verb), building (noun); health (noun), healthy (adjective)
32.
Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different
color
• Lemmatization implies doing “proper” reduction to
dictionary headword form
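A lookup-table sketch reproduces the example above. The table is hand-written for this one sentence and is purely an assumption for illustration; a real lemmatizer relies on a full dictionary plus part-of-speech information to pick the right headword.

```python
# Hypothetical hand-made table covering only the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(sentence):
    # Unknown tokens are returned unchanged.
    return " ".join(LEMMAS.get(tok, tok) for tok in sentence.lower().split())

print(lemmatize("the boy's cars are different colors"))
```

Note that "are" maps to "be", a form it shares no letters with; this is a mapping that suffix chopping cannot produce and a dictionary can.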
33.
Stemming
Morphological variants of a word (morphemes). Similar
terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in Information Retrieval. Grouping words with a
common stem together.
For example, a search for reads also finds read, reading, and
readable
Stemming consists of removing suffixes and conflating the
resulting morphemes. Occasionally, prefixes are also removed.
34.
Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
Original: for example compressed and compression are both
accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept
as equival to compress
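"Crude affix chopping" can be sketched as a longest-match suffix stripper. This toy is not Porter's algorithm and will not reproduce the full stemmed sentence above; the suffix list, the minimum stem length, and the double-s guard are all assumptions chosen so that the slide's automat and compress examples come out as shown.

```python
# Hypothetical suffix list, tried longest first.
SUFFIXES = sorted(["ion", "ing", "ed", "es", "ic", "e", "s"], key=len, reverse=True)

def crude_stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < min_stem:
            continue  # refuse to chop very short words
        if suffix == "s" and stem.endswith("s"):
            continue  # keep "compress" instead of producing "compres"
        return stem
    return word

for w in ["automate", "automates", "automatic", "automation",
          "compressed", "compression", "compress"]:
    print(w, "->", crude_stem(w))
```

The rules are language dependent and easy to get wrong: every suffix in the list is English-specific, and each guard exists only to block a particular bad chop.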
35.
Categories of Stemmer
The following diagram illustrates the various
categories of stemmer. Porter's algorithm is shown
by the red path.
Conflation methods
• Manual
• Automatic (stemmers)
– Affix removal
• Longest match
• Simple removal
– Successor variety
– Table lookup
– n-gram