This document provides an introduction to natural language processing (NLP). It discusses what NLP is, why NLP is a difficult problem, the history of NLP, fundamental NLP tasks like word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis, and applications of NLP like information retrieval, question answering, text summarization and machine translation. The document aims to give readers an overview of the key concepts and challenges in the field of natural language processing.
3. Lecture Outline
● What is Natural Language Processing?
● Why is NLP hard?
● Brief history of NLP
● Fundamental tasks in NLP
● Some applications of NLP
4. What is Natural Language Processing?
● A field of computer science, artificial
intelligence, and computational linguistics
● To get computers to perform useful tasks
involving human languages
− Human-machine communication
− Improving human-human communication
● E.g., machine translation
− Extracting information from texts
5. Why is NLP interesting?
● Languages involve many human activities
− Reading, writing, speaking, listening
● Voice can be used as a user interface in many applications
− Remote controls, virtual assistants like Siri, ...
● NLP is used to acquire insights from massive amounts of textual data
− E.g., hypotheses from medical and health reports
● NLP has many applications
● NLP is hard!
6. Why is NLP hard?
● Highly ambiguous
● The sentence “I made her duck” may have several different meanings (from the Jurafsky and Martin book)
− I cooked waterfowl for her.
− I cooked waterfowl belonging to her.
− I created the (plaster?) duck she owns.
− I caused her to quickly lower her head or body.
− I waved my magic wand and turned her into undifferentiated waterfowl.
7. Why is NLP hard?
I shot an elephant in my pajamas.
8. Why is NLP hard?
● Natural languages are highly ambiguous at all levels
− Lexical (word’s meaning)
− Syntactic
− Semantic
− Discourse
● Natural languages are fuzzy
● Natural languages involve reasoning about the world
− E.g., it is unlikely that an elephant wears pajamas
9. Brief history of NLP
● Foundational Insights: 1940s and 1950s
− Two foundational paradigms
● Automaton
● Probabilistic/Information-Theoretic models
● The two camps: 1957-1970
− Symbolic paradigm: the work of Chomsky and others on
formal language theory and generative syntax (1950s ~
mid 1960s)
− Stochastic paradigm
● In departments of statistics
10. Brief history of NLP
● Four paradigms: 1970-1983, explosion in
research in speech and language processing
− Stochastic paradigm
− Logic-based paradigm
− Natural language understanding
− Discourse modeling paradigm
● Empiricism and Finite State Models Redux:
1983-1993
11. Brief history of NLP
● The Field Comes Together: 1994-1999
− Probabilistic and data-driven models had become
quite standard
● The Rise of Machine Learning: 2000-now
− Large amounts of spoken and textual data became available
− Widespread availability of high-performance computing systems
12. Fundamental Tasks in NLP
● Word Segmentation
● Part-of-speech (POS) tagging
● Syntactic Analysis
● Semantic Analysis
13. Word Segmentation
● In some languages, there is no space between words, or one word may span several space-separated syllables
− 毎年うちの研究室の学生が1-2名国語研でアルバイトさせ
てもらっているので、今日は新しくアルバイトする B4 学
生の紹介である。
− Nhật Bản luôn là thị trường thương mại quan trọng của Việt Nam (Nhật_Bản luôn là thị_trường thương_mại quan_trọng của Việt_Nam)
● In such languages, word segmentation is the first
step of NLP systems.
14. Word Segmentation
● A possible solution is maximum matching
− Start by pointing at the beginning of the string, then choose the longest word in the dictionary that matches the input at the current position.
− Nhật_Bản luôn là thị trường thương mại quan trọng của Việt Nam
● “Nhật_Bản” is a word in the dictionary, but “Nhật Bản luôn” is not
● Problems:
− Maximum matching cannot handle unknown words
− Dependencies between words in the same sentence are not exploited
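The greedy strategy can be sketched in a few lines of Python. The dictionary here is a toy English word list chosen only to make the behavior easy to follow, not a real lexicon:

```python
# A minimal sketch of greedy maximum matching; the dictionary below is a
# toy English word list, not a real lexicon.
def max_match(text, dictionary):
    """Segment `text` by repeatedly taking the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown material: fall back to one character
        words.append(match)
        i += len(match)
    return words

print(max_match("thetabledownthere", {"the", "table", "down", "there"}))
print(max_match("wecanonlysee", {"we", "can", "canon", "only", "see", "ly"}))
```

The second call illustrates the weakness noted above: greedy matching segments "wecanonlysee" as ['we', 'canon', 'ly', 'see'] rather than the intended "we can only see", because it never reconsiders an earlier choice.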
15. Word Segmentation
● Most successful word segmentation tools are
based on machine-learning techniques.
● Existing word segmentation tools achieve high accuracy
− vn.vitk (https://github.com/phuonglh/vn.vitk) obtained 97% accuracy on test data
16. POS Tagging
● Each word in a sentence can be assigned to a class, such as verb, adjective, noun, etc.
● POS tagging is the process of labeling each word in a sentence with its particular part of speech, based on:
− Its definition
− Its context in the sentence
● The/DT grand/JJ jury/NN commented/VBD on/IN
a/DT number/NN of/IN other/JJ topics/NNS ./.
17. Sequence Labeling
● Many NLP problems can be viewed as
sequence labeling
● Each token in a sequence is assigned a label.
● Labels of tokens are dependent on the labels
of other tokens in the sequence, particularly
their neighbors.
John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN ./.
18. Sequence Labeling as Classification
● Classify each token independently
● Use information about the surrounding tokens as features (sliding window).
[Figure: a classifier slides over “John saw the saw and decided to take it to the table.”, classifying each token from its window of neighbors, e.g., NNP for “John”.]
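The sliding-window idea can be sketched as a feature extractor. The feature names, the padding symbols, and the default window size below are illustrative choices, not a fixed standard:

```python
# A sketch of sliding-window features for per-token classification; the
# feature names, padding symbols, and window size are illustrative choices.
def window_features(tokens, i, size=1):
    """Features for token i: the word itself plus its neighbours."""
    feats = {"word": tokens[i].lower()}
    for off in range(1, size + 1):
        feats[f"word-{off}"] = tokens[i - off].lower() if i - off >= 0 else "<s>"
        feats[f"word+{off}"] = tokens[i + off].lower() if i + off < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw and decided to take it to the table .".split()
print(window_features(tokens, 1))  # features for the first "saw"
```

In practice each feature dictionary would be fed to a standard classifier (e.g., logistic regression) trained on annotated sentences.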
19. Probabilistic Sequence Models
• Model the probability of (token sequence, tag sequence) pairs, learned from an annotated data set
• Exploit dependencies between tokens
• Typical sequence models:
• Hidden Markov Models (HMMs)
• Conditional Random Fields (CRFs)
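As a sketch of how an HMM tagger decodes, here is a minimal Viterbi implementation over a hand-made two-tag model. All probabilities are invented for illustration; a real tagger estimates them from annotated data:

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a toy HMM."""
    # Each cell holds (probability of the best path ending in tag, that path).
    V = [{t: (start_p[t] * emit_p[t].get(tokens[0], 1e-8), [t]) for t in tags}]
    for tok in tokens[1:]:
        row = {}
        for t in tags:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(tok, 1e-8),
                 V[-1][prev][1] + [t])
                for prev in tags)
            row[t] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

# Toy two-tag model (numbers are made up for the sketch).
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"john": 0.3, "saw": 0.1}, "V": {"saw": 0.4}}
print(viterbi(["john", "saw"], tags, start_p, trans_p, emit_p))
```

Even with these toy numbers the decoder prefers tagging "saw" as a verb after a noun, which is the kind of inter-token dependency a per-token classifier cannot capture directly.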
20. Fundamental Tasks in NLP
● Word Segmentation
● Part-of-speech (POS) tagging
● Syntactic Analysis
● Semantic Analysis
21. Syntax Analysis
● The task of recognizing a sentence and assigning a
syntactic structure to it.
My dog also likes eating sausage.
(ROOT
(S
(NP (PRP$ My) (NN dog))
(ADVP (RB also))
(VP (VBZ likes)
(S
(VP (VBG eating)
(NP (NN sausage)))))
(. .)))
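Bracketed parses like the one above are easy to consume programmatically. Below is a minimal, hypothetical reader (not part of any standard library) that turns a Penn-style bracketing into nested (label, children) tuples:

```python
# A minimal reader for Penn-Treebank-style bracketed parses; a sketch of
# how parser output can be consumed, not a full treebank reader.
import re

def read_tree(s):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def parse(pos):
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = parse(pos)
            else:
                child, pos = tokens[pos], pos + 1
            children.append(child)
        return (label, children), pos + 1
    tree, _ = parse(0)
    return tree

print(read_tree("(S (NP (PRP$ My) (NN dog)) (VP (VBZ likes)))"))
```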
22. Syntax analysis
● An important task in NLP with many
applications
− Intermediate stage of representation for semantic
analysis
− Plays an important role in applications like question answering and information extraction
− E.g., What books were written by British women
authors before 1800?
23. Syntax analysis
● A challenging task in NLP
− Ambiguity problem: one sentence may have many possible parse trees
● Vietnamese language processing (VNLP) still lacks accurate syntactic parsers (to my knowledge)
− Accuracy is around 78-84%
25. Fundamental Tasks in NLP
● Word Segmentation
● Part-of-speech (POS) tagging
● Syntactic Analysis
● Semantic Analysis
26. Semantic Analysis
● Two levels
● Lexical semantics
− Representing the meaning of words
− Word sense disambiguation (e.g., the word “bank”)
● Compositional semantics
− How words combine to form a larger meaning
27. Meaning representations
• First-order predicate calculus
• E.g., Maharani serves vegetarian food.
=> Serves(Maharani, VegetarianFood)
• E.g., I only have five dollars and I don’t have a lot of time.
=> Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
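As a toy illustration of how such logical forms can be checked, the sketch below stores facts as tuples and evaluates the conjunction from the second example. The predicate names follow the slide; the "knowledge base" itself is invented for the sketch:

```python
# A toy knowledge base of ground facts, stored as tuples.
kb = {
    ("Serves", "Maharani", "VegetarianFood"),
    ("Have", "Speaker", "FiveDollars"),
}

def holds(fact, kb):
    """True if `fact` is asserted in the knowledge base."""
    return fact in kb

# Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)
result = (holds(("Have", "Speaker", "FiveDollars"), kb)
          and not holds(("Have", "Speaker", "LotOfTime"), kb))
print(result)
```

Real semantic analysis uses far richer machinery (variables, quantifiers, inference), but the idea of mapping text to formulas that can be evaluated against stored knowledge is the same.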
29. Lecture Outline
● What is Natural Language Processing?
● Why is NLP hard?
● Brief history of NLP
● Fundamental tasks in NLP
● Some NLP applications
30. Some applications
● Information Retrieval
● Information Extraction
● Question Answering
● Text Summarization
● Machine Translation
32. Architecture of an ad hoc IR system
[Figure: a query passes through query processing and is searched, using a vector space or probabilistic model, against an index built from the document collection, producing ranked documents.]
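The vector space model in the figure can be sketched as TF-IDF weighting plus cosine similarity. The documents and query below are toy examples; a real IR system would add an inverted index, tokenization, and length normalization:

```python
# A minimal vector-space retrieval sketch: TF-IDF weights and cosine
# similarity over toy tokenized documents.
import math
from collections import Counter

def tfidf_vectors(docs):
    """One TF-IDF vector (dict: term -> weight) per tokenized document."""
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["nlp", "is", "fun"], ["nlp", "parsing"], ["cooking", "is", "fun"]]
vecs = tfidf_vectors(docs)
query = {"nlp": 1.0, "parsing": 1.0}
ranked = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranked)  # document indices, most relevant first
```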
33. Information Extraction
● To extract, from unstructured text, information that is pre-specified or pre-defined in templates
− Fill a number of slots/attributes
● Example: use the template [PERSON, go, LOCATION, TIME] to extract information about where an individual goes.
− “President Obama went to Hanoi yesterday.”
− [PERSON = “President Obama”, go, LOCATION = “Hanoi”, TIME = “yesterday”]
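A crude pattern-based sketch of filling this template is shown below. The regular expression is a toy that only covers sentences shaped like the example; real IE systems use learned extractors rather than one hand-written pattern:

```python
# A toy pattern for the [PERSON, go, LOCATION, TIME] template; it only
# handles sentences shaped like the slide's example.
import re

PATTERN = re.compile(
    r"(?P<person>(?:President\s)?[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\s"
    r"(?:went|goes|will go)\sto\s"
    r"(?P<location>[A-Z][a-z]+)\s"
    r"(?P<time>yesterday|today|tomorrow)")

def extract(sentence):
    """Return the filled slots as a dict, or None if the pattern misses."""
    m = PATTERN.search(sentence)
    return m.groupdict() if m else None

print(extract("President Obama went to Hanoi yesterday."))
```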
34. Question Answering
● A system that automatically returns answers to a user’s question by retrieving information from a collection of documents.
● Differences from an information retrieval system:
− A QA system aims to return an exact answer rather than documents related to the user’s question.
● Q: Who invented the Internet? A: Robert E. Kahn and Vint Cerf.
− A QA system requires more complicated semantic analysis.
36. The figure is credited to Dr. Ngo Xuan Bach: http://tinyurl.com/jk2dv33
37. Text Summarization
● Text summarization is the process of distilling the most important information from a text to produce an abridged version for a particular task or user.
● Useful in the era of information explosion
● Categories of text summarization:
− Single-document/Multi-document summarization
− Extractive/Abstractive summarization
− Query-focused text summarization
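A minimal sketch of the extractive variant: score each sentence by the average corpus frequency of its content words and keep the top k. The stopword list is a toy; real systems use larger lists and stronger sentence-scoring features:

```python
# A minimal extractive summarizer: frequency-based sentence scoring.
from collections import Counter

STOP = {"the", "a", "an", "is", "of", "and", "to", "in"}  # toy stopword list

def summarize(sentences, k=1):
    """Return the k highest-scoring sentences, in original order."""
    words = [w.lower() for s in sentences for w in s.split()
             if w.lower() not in STOP]
    freq = Counter(words)
    def score(s):
        toks = [w.lower() for w in s.split() if w.lower() not in STOP]
        return sum(freq[w] for w in toks) / max(len(toks), 1)
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

sents = ["NLP systems parse text", "Parsing text is hard", "Bananas are yellow"]
print(summarize(sents, 1))
```

Because the output simply selects existing sentences, this is extractive; abstractive summarization instead generates new text.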
38. Example of text summarization
• https://www.bloomberg.com/view/articles/
2016-08-23/china-s-super-bus-exposes-dark-
side-of-p2p-lending
• It looked like the future: a wide, elevated Chinese bus that would speed
atop tracks straddling the road while multiple lanes of traffic flowed below.
And the future looked surprisingly near. In early August, a prototype of the
Transit Elevated Bus -- or TEB -- was tested in northern China.
• Demand for such loans has exploded in recent years, growing in volume
from $4.3 billion in 2013 to $71 billion in 2015. The appeal is twofold.
First, China's big state-owned banks have traditionally focused their
attention on other companies in the state sector, at the expense of
consumers and small businesses.
• Meanwhile, cash-rich Chinese are anxious to find yields higher than the
anemic rates paid by China's state banks, which typically fall below 3
percent. China's dodgy stock markets aren't a terribly appealing alternative,
while the attractiveness of Chinese real estate varies by region.
Output by Skype’s Summarization chatbot
39. Machine Translation
● The use of computers to automate some or all of the process of translating from one language to another.
● Fully automatic machine translation is one of the most challenging and hottest topics in NLP.
● Recent advances in deep learning have fueled the rise of neural machine translation.
40. Example (Google translation)
It looked like the future: a wide, elevated Chinese
bus that would speed atop tracks straddling the
road while multiple lanes of traffic flowed below.
And the future looked surprisingly near.
Nó trông giống như tương lai: một rộng, xe buýt cao
Trung Quốc sẽ tăng tốc trên đường ray trải dài
đường trong khi nhiều tuyến đường giao thông chảy
bên dưới. Và tương lai có vẻ ngạc nhiên gần.
42. Bernard Vauquois’ pyramid showing comparative depths of intermediary representation: interlingual machine translation at the peak, followed by transfer-based, then direct translation.
43. How to learn NLP?
• Have background knowledge of:
• Probability and statistics
• Basic math (linear algebra, calculus)
• Machine Learning
• Programming
• Read textbooks or attend online NLP courses:
• Speech and Language Processing, by Jurafsky, Daniel and Martin, James H.
• YouTube playlist (Dan Jurafsky & Chris Manning: Natural Language Processing): http://tinyurl.com/lb57fxf
44. How to learn NLP?
• Practice with programming exercises:
• 100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises
• NLP Programming Tutorial, by Graham Neubig: http://www.phontron.com/teaching.php
• Compete in Kaggle data science challenges
(kaggle.com)
45. Try some NLP applications
• Try Stanford CoreNLP and Stanford Parser
demo
• http://nlp.stanford.edu:8080/corenlp
• http://nlp.stanford.edu:8080/parser
• Solve SAT-style math questions
• http://euclid.allenai.org
46. References
1. Speech and Language Processing, by Jurafsky,
Daniel and Martin, James H.
2. An Introduction to Natural Language
Processing series (http://tinyurl.com/hdg58wx)
47. References
• An Introduction to Natural Language Processing - Section
1 (http://tinyurl.com/ztkwb2b)
• An Introduction to Natural Language Processing - Section
2: Some Brief History (http://tinyurl.com/j48or27)
• An Introduction to Natural Language Processing - Section
3: Fundamental Tasks in NLP (http://tinyurl.com/zk7dgzv)
• An Introduction to Natural Language Processing - Section
4: Some Applications (http://tinyurl.com/jk2dv33)
• An Introduction to Natural Language Processing (http://tinyurl.com/hdg58wx)