Introduction to Natural Language Understanding
NLP is the KING!
"Natural language is the most important part of artificial
intelligence."
John Searle
"Natural language processing is a cornerstone of artificial
intelligence, allowing computers to read and understand human
language, as well as to produce and recognize speech."
Ginni Rometty
"Natural language processing is one of the most important
fields in artificial intelligence and also one of the most difficult."
Dan Jurafsky
What is Natural Language Processing (NLP)?
Natural language processing is the set of methods for making human language accessible to
computers
(Jacob Eisenstein)
Natural language processing is the field at the intersection of Computer science (Artificial
intelligence) and linguistics
(Christopher Manning)
Making computers understand natural language so that they can perform tasks humans can do,
such as machine translation, summarization, and question answering
(Behrooz Mansouri)
“Language is the ability to acquire and use complex systems of
communication, particularly the human ability to do so, and a language
is any specific example of such a system. The scientific study of
language is called linguistics.”
From Wikipedia
Natural Language Processing: Terms
Natural language refers to the language that humans use to
communicate with each other, such as English, Spanish, or
Chinese
Processing
As distinguished from data processing
Question: How are data processing and natural language processing
different?
Natural Language Processing: Terms
Consider the Unix wc program, which counts the total number of bytes,
words, and lines in a text file
● When used to count bytes and lines, wc is an ordinary data processing
application
● However, when it is used to count the words in a file, it requires knowledge
about what it means to be a word and thus becomes a language processing
system
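The wc distinction can be sketched in a few lines of Python. This is an illustrative approximation, not the actual wc implementation: the function name wc_counts is hypothetical, and "word" is assumed to mean a whitespace-delimited chunk. Counting bytes and newlines needs no knowledge of language, while the word count depends entirely on that assumed definition of a word.

```python
def wc_counts(text):
    """Return (lines, words, bytes) for a string, roughly in the spirit of wc.

    Assumption: a "word" is any maximal run of non-whitespace characters.
    """
    data = text.encode("utf-8")
    n_bytes = len(data)             # pure data processing
    n_lines = data.count(b"\n")     # pure data processing
    n_words = len(text.split())     # depends on what counts as a word
    return n_lines, n_words, n_bytes


sample = "Let it be.\nL'ensemble is one word?\n"
lines, words, size = wc_counts(sample)
print(lines, words, size)
```

Under the whitespace assumption, "be." and "L'ensemble" each count as one word; a different notion of word would give a different count, which is exactly where language knowledge enters.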
Computational Linguistics (CL)
• The science of doing what linguists do with language, but using computers
Natural Language Processing (NLP)
• The engineering discipline of doing what people do with language, but using
computers
Speech/Language/Text processing
Human Language Technology
Natural Language Processing vs Computational Linguistics
In linguistics, language is the object of study
● Computational methods may be brought to bear, just as in scientific disciplines like
computational biology and computational astronomy, but they play only a
supporting role
In contrast, natural language processing is focused on the design and analysis
of computational algorithms and representations for processing natural human
language
● The goal of natural language processing is to provide new computational
capabilities around human language: for example, extracting information from
texts, translating between languages, answering questions, holding a
conversation, taking instructions
What does an NLP system need to
“know”?
• Language consists of many levels of structure
• Humans fluently integrate all of these in producing and
understanding language
• Ideally, so would a computer!
Communication With Machines
[Figure: timeline of communication with machines, with eras labeled ~50s-70s, ~80s, and today]
Conversational Agents
Conversational agents contain:
● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech
Question Answering
◾ What does “divergent” mean?
◾ What year was Abraham Lincoln born?
◾ How many states were in the United States that year?
◾ How much Chinese silk was exported to England at the end
of the 18th century?
◾ What do scientists think about the ethics of human cloning?
Machine Translation
Natural Language Processing
Applications
◾ Machine Translation
◾ Information Retrieval
◾ Question Answering
◾ Dialogue Systems
◾ Information Extraction
◾ Summarization
◾ Sentiment Analysis
◾ ...
Core Technologies
◾ Language modeling
◾ Part-of-speech tagging
◾ Syntactic parsing
◾ Named-entity recognition
◾ Word sense disambiguation
◾ Semantic role labeling
◾ ...
NLP lies at the intersection of computational linguistics and machine
learning.
A few of the NLP Tasks
● Spell Checking, Keyword Search, Finding Synonyms
● Part of Speech Tagging
● Extracting information from a website
○ Location, people, temporal expressions
● Classifying text
○ Sentiment analysis
● Machine translation
● Complex question answering
● Spoken dialog systems
Knowledge & Information Extraction
Knowledge graphs (KGs) organize data from multiple sources, capture
information about entities of interest in a given domain or task (like people,
places or events), and forge connections between them
The Google Knowledge Graph is an enormous database of information that
enables Google to provide immediate, factual answers to your questions
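One common way to picture a knowledge graph is as a set of (subject, relation, object) triples. The sketch below is a toy illustration under that assumption; the class TinyKnowledgeGraph, its methods, and the relation names are hypothetical and say nothing about how the Google Knowledge Graph is actually built.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """A toy triple store: entities are nodes, relations are labeled edges."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        # Record one connection between two entities (or an entity and a value).
        self._edges[subject].append((relation, obj))

    def query(self, subject, relation):
        # Return every object linked to `subject` by `relation`.
        return [o for r, o in self._edges.get(subject, []) if r == relation]


kg = TinyKnowledgeGraph()
kg.add("Abraham Lincoln", "instance_of", "person")
kg.add("Abraham Lincoln", "born_in_year", 1809)
print(kg.query("Abraham Lincoln", "born_in_year"))
```

A factual question such as "What year was Abraham Lincoln born?" then reduces to a lookup, once the question has been mapped to an entity and a relation.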
Sentiment Analysis
Determine whether the sentiment expressed in a piece of text is positive,
negative, or neutral
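A very rough way to make this concrete is a lexicon-based classifier that counts positive and negative words. The word lists and the function name lexicon_sentiment below are invented for illustration; practical systems usually rely on much larger lexicons or trained models.

```python
# Hypothetical, tiny sentiment lexicons (illustration only).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def lexicon_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("I love this great phone"))
print(lexicon_sentiment("terrible battery"))
print(lexicon_sentiment("it arrived today"))
```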
Machine Translation
Low-resource languages can be challenging
6,800 living languages
600 with written tradition
100 spoken by 95% of population
Question Answering
IBM Watson Defeats Humans in "Jeopardy!"
Spoken Dialog Systems
Level of Linguistic Knowledge
Text preprocessing
● Preprocessing is the first and a crucial step of an NLP task
- The objective of preprocessing is to clean/harmonize the text,
reduce language fluctuations if necessary, and prepare the
tokens for processing in the next steps
- Preprocessing partially addresses the issue of sparsity. Why?
We will learn:
● Text cleaning/harmonization/reduction of fluctuations
- Text normalization
- Segmentation
- Stop words
- Stemming & Lemmatization
Text Normalization
● Normalization harmonizes the written forms of words with the
same meaning
● Some examples:
- Deleting periods
• U.S.A. → USA
- Deleting hyphens
• anti-discriminatory → antidiscriminatory
- Accents
• French résumé → resume
- Umlauts
• German: Tuebingen → Tübingen
Sec. 2.2.3
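The period, hyphen, and accent examples above can be sketched with the Python standard library. The function names strip_accents and normalize_token are hypothetical, and this is only one possible normalization policy. Note that it does not implement the umlaut example: mapping Tuebingen to Tübingen would need a language-specific rule, whereas accent stripping goes the other way.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. résumé -> resume."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" marks the combining characters left after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_token(token):
    """Delete periods and hyphens, then strip accents."""
    token = token.replace(".", "").replace("-", "")
    return strip_accents(token)


for raw in ["U.S.A.", "anti-discriminatory", "résumé"]:
    print(raw, "->", normalize_token(raw))
```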
Text Normalization
● Case folding: reduce all letters to lower case
- It may cause ambiguity but is typically helpful
• General Motors vs. general motors
• Fed vs. fed
• CAT (City Airport Train) vs. cat
● Longstanding Google example:
- Search C.A.T.
● Do numbers, dates, etc. carry information?
- If included, the dictionary size may explode!
- Numbers and dates are commonly replaced by special tokens,
e.g.
• Numbers with <num>
• Dates with <dates>
Sec. 2.2.3
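Case folding and special-token replacement can be sketched with regular expressions. The patterns below are assumptions made for illustration: only dates written as YYYY-MM-DD are recognized, and a number is any run of digits with an optional decimal part. The function name fold_and_mask is hypothetical.

```python
import re

# Assumed formats: ISO-style dates and plain (optionally decimal) numbers.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def fold_and_mask(text):
    """Lowercase the text, then replace dates and numbers with special tokens."""
    text = text.lower()
    text = DATE_RE.sub("<dates>", text)  # mask dates first so their digits survive as one token
    text = NUM_RE.sub("<num>", text)
    return text


print(fold_and_mask("The Fed met on 2020-03-15 and cut rates by 0.5 points"))
```

After folding, "Fed" and "fed" become the same token, which is the ambiguity the slide warns about.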
Segmentation
● Segmentation
- Splitting a compound word into tokens
● French
- L'ensemble → one token or two? L ? L’ ? Le ?
● German compound nouns
- Halsschlagader → Hals Schlag Ader?
- Compound words in German usually require a compound
splitter
- A possible algorithm (look in the link for more details):
Sec. 2.2.1
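As a rough illustration of compound splitting, the sketch below tries to cover a word with entries from a lexicon, preferring the longest prefix first. This is not the algorithm behind the link mentioned on the slide; the function name split_compound and the three-word lexicon are hypothetical, and real German splitters also handle linking elements and ambiguous splits.

```python
def split_compound(word, lexicon):
    """Return a list of lexicon entries that exactly cover `word`, or None."""
    word = word.lower()
    if not word:
        return []
    # Try the longest possible prefix first, backtracking if the rest fails.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


toy_lexicon = {"hals", "schlag", "ader"}  # hypothetical mini-lexicon
print(split_compound("Halsschlagader", toy_lexicon))
```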
Stop words
● Stop words
- The most common words, like the, a, and, to, be
- They carry little or no semantic information
● Stop words can also be important, especially in combination
with other words, e.g.:
- Phrases: “King of Denmark”, “To be or not to be”
- Titles, etc.: “Let it be”
- Definitional purposes: “flights to London”
● Stop words are sometimes excluded from the corpus
- Commonly in bag-of-words approaches
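The trade-off can be seen with a small filter. The stop list below extends the slide's examples with a few extra words (of, or, not, it) purely for illustration, and the function name remove_stop_words is hypothetical.

```python
# Hypothetical stop list: the slide's examples plus a few assumed additions.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("flights to London".split()))
print(remove_stop_words("To be or not to be".split()))
```

With this list, "flights to London" loses the word that gives the direction of travel, and "To be or not to be" is removed entirely.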
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs (noun)
– Derivational Morphology
• Derive one word from another
• Often change grammatical class
– build (verb), building (noun); health (noun), healthy (adjective)
Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different
color
• Lemmatization implies doing “proper” reduction to
dictionary headword form
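A dictionary lookup is the simplest way to picture lemmatization. The table below is hypothetical and covers only the slide's examples; real lemmatizers use full morphological dictionaries and usually part-of-speech information.

```python
# Hypothetical lookup table covering only the examples above.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}


def lemmatize(tokens, table=LEMMA_TABLE):
    """Map each token to its headword; unknown tokens are just lowercased."""
    return [table.get(t.lower(), t.lower()) for t in tokens]


print(" ".join(lemmatize("the boy's cars are different colors".split())))
```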
Stemming
Morphological variants of a word (morphemes). Similar
terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in Information Retrieval: grouping words with a
common stem together.
For example, a search on reads also finds read, reading, and
readable
Stemming consists of removing suffixes and conflating the
resulting morphemes. Occasionally, prefixes are also removed.
Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
Original: for example compressed and compression are both
accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept
as equival to compress
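The "crude affix chopping" idea can be sketched as a single pass of suffix stripping. This is deliberately much simpler than Porter's algorithm and will not reproduce all of its outputs; the suffix list, the minimum stem length, and the function name crude_stem are assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ["ation", "ion", "ing", "ed", "es", "s"]


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words like "compress" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["compressed", "compression", "compress", "reads", "reading"]:
    print(w, "->", crude_stem(w))
```

Even this toy version conflates compressed and compression with compress, but it has no notion of whether the result is a real word.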
Categories of Stemmer
The following diagram illustrates the various
categories of stemmer. Porter's algorithm is shown
by the red path.
Conflation methods
- Manual
- Automatic (stemmers)
  - Affix removal
    - Longest match
    - Simple removal
  - Successor variety
  - Table lookup
  - n-gram
Comparison of stemmers
Narural_Language_Processing_Coursework_UNIT1
