The document presents a neural network architecture for various natural language processing (NLP) tasks such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. It shows results comparable to state-of-the-art using word embeddings learned from a large unlabeled corpus, and improved results from joint training of the tasks. The network transforms words into feature vectors, extracts higher-level features through neural layers, and is trained via backpropagation. Benchmark results demonstrate performance on par with traditional task-specific systems without heavy feature engineering.
2.
• Presents a deep neural network architecture for NLP tasks
• Presents results comparable to the state of the art on 4 NLP tasks
• Part of Speech tagging
• Chunking
• Named Entity Recognition
• Semantic Role Labeling
• Presents word embeddings learned from a large unlabelled corpus and shows an improvement in results by using these features
• Presents results of joint training for the above tasks.
3. Motivation
• Proposes a unified neural network architecture and learning algorithm that can be applied to various NLP tasks.
• Instead of creating hand-crafted features, task-specific features (internal representations) are acquired from large amounts of labelled and unlabelled training data.
4. Task Introduction
• Part-of-Speech Tagging
  – Automatically assign a part-of-speech tag to each word in a text sequence.
• Chunking
  – Also called shallow parsing: the identification of parts of speech and short phrases (such as noun phrases).
• Named Entity Recognition
  – Classify elements of the text into predefined categories such as person, location, etc.
5. Semantic Role Labeling
• SRL, sometimes also called shallow semantic parsing, is the task of detecting the semantic arguments associated with the predicate (verb) of a sentence and classifying them into their specific roles.
• e.g. Mark sold the car to Mary. (Mark = agent, sold = predicate, the car = theme, Mary = recipient)
14. Lookup Tables
• Each of the K discrete features indexes into its own matrix, used as a lookup table: the feature's value selects a feature vector from the matrix.
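As a minimal, hedged sketch (the sizes and the random initialisation are illustrative, not the paper's exact setup), a lookup table is just a matrix whose rows are feature vectors, indexed by the discrete feature value:

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table per discrete feature: a trainable matrix with one row
# (feature vector) per possible value of that feature.
vocab_size, dim = 130_000, 50                      # e.g. the word feature
word_table = rng.normal(size=(vocab_size, dim)).astype(np.float32)

word_index = 42                                    # index of some word in the dictionary
word_vector = word_table[word_index]               # row lookup = that word's feature vector
print(word_vector.shape)                           # (50,)
```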
15. Words to Features: Window Approach
• Window size: for example, 5
• Raw text features:
  – lower-case word
  – capitalisation feature
16. Window Approach
Sentence: My Name is Bryan
Padded: PADDING PADDING My Name is Bryan PADDING PADDING
Windows (one per word):
  PADDING PADDING My Name is
  PADDING My Name is Bryan
  My Name is Bryan PADDING
  Name is Bryan PADDING PADDING
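A small sketch of the padding and window extraction shown above (window size 5; the function name is ours, not the paper's):

```python
def windows(tokens, size=5, pad="PADDING"):
    """Return one window of `size` tokens centred on each word."""
    half = size // 2
    padded = [pad] * half + tokens + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]

for w in windows(["My", "Name", "is", "Bryan"]):
    print(" ".join(w))
# PADDING PADDING My Name is
# PADDING My Name is Bryan
# My Name is Bryan PADDING
# Name is Bryan PADDING PADDING
```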
17. Words to Features
• Each word (e.g. "My") is mapped to two indices: a word index into the word lookup table (vocabulary size 130,000) and a caps index into the caps lookup table (5 options).
• The word lookup table returns a 50-dimensional vector and the caps lookup table a 5-dimensional vector; together they form the word's feature vector.
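A hedged sketch of these two lookups. The toy dictionary, the random tables, and the exact 5-way capitalisation coding are illustrative stand-ins, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
word_table = rng.normal(size=(130_000, 50)).astype(np.float32)  # word lookup table
caps_table = rng.normal(size=(5, 5)).astype(np.float32)         # caps lookup table (5 options)

def caps_index(word):
    # One plausible 5-way coding (assumption, not necessarily the paper's exact scheme):
    # 0: all lower, 1: all caps, 2: initial cap only, 3: non-initial cap, 4: other
    if word.islower(): return 0
    if word.isupper(): return 1
    if word[0].isupper() and word[1:].islower(): return 2
    if any(c.isupper() for c in word[1:]): return 3
    return 4

word_index = {"my": 7, "name": 8, "is": 9, "bryan": 10}          # toy dictionary

def word_features(word):
    w = word_table[word_index[word.lower()]]   # 50-dim word vector
    c = caps_table[caps_index(word)]           # 5-dim caps vector
    return np.concatenate([w, c])              # 55-dim per-word feature vector

print(word_features("My").shape)               # (55,)
```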
19.
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results
20. Extracting Higher Level Features from Word Feature Vectors
• Any feed-forward neural network with L layers can be seen as a composition of functions, one per layer l:
  f_θ(·) = f^L_θ ( f^{L−1}_θ ( … f^1_θ(·) … ) )
  where θ denotes the parameters to be trained.
21. Window Approach: the Window Vector
• For the word at position t, the lookup-table outputs of the words in its window are concatenated into a single fixed-size vector f^1_θ (the window vector).
• Example: for the word "My" the window is PADDING PADDING My Name is, giving a 275-dimensional window vector (5 words × 55 features per word).
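Continuing the toy sketch, the window vector is simply the concatenation of the five 55-dimensional per-word vectors (random placeholders stand in for the real lookup-table outputs here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are the 55-dim lookup-table outputs for
# PADDING, PADDING, My, Name, is
per_word = [rng.normal(size=55).astype(np.float32) for _ in range(5)]

window_vector = np.concatenate(per_word)   # f^1_theta for the word "My"
print(window_vector.shape)                 # (275,) = 5 words x 55 features
```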
22. Linear Layer (window approach)
• The fixed-size vector f^1_θ can be fed to one or several standard neural network layers which perform affine transformations over their inputs:
  f^l_θ = W^l f^{l−1}_θ + b^l
  where W^l ∈ ℝ^{n^l_hu × n^{l−1}_hu} and b^l ∈ ℝ^{n^l_hu} are the parameters to be trained, and n^l_hu is the number of hidden units of the l-th layer.
• HardTanh layer: several linear layers are often stacked, interleaved with a non-linearity function, to extract highly non-linear features. Without the non-linearity the network would be just a linear model.
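A minimal PyTorch sketch of these layers for the window approach; the hidden size and tag count are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

n_window_features = 275   # window vector f^1_theta (5 words x 55 features)
n_hidden = 300            # number of hidden units n_hu (illustrative)
n_tags = 45               # e.g. size of the tag set (illustrative)

window_net = nn.Sequential(
    nn.Linear(n_window_features, n_hidden),  # f^2 = W^2 f^1 + b^2
    nn.Hardtanh(),                           # cheap non-linearity
    nn.Linear(n_hidden, n_tags),             # one score per tag
)

f1 = torch.randn(1, n_window_features)       # a batch with one window vector
scores = window_net(f1)                      # shape (1, n_tags)
print(scores.shape)
```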
23. HardTanh Layer
• Provides the non-linearity between linear layers (non-linear features) in the window approach.
• HardTanh is used instead of the hyperbolic tangent to make the computation cheaper.
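For reference, the hard version of the tanh is the piecewise-linear function

```latex
\operatorname{HardTanh}(x) =
\begin{cases}
-1 & \text{if } x < -1,\\
x  & \text{if } -1 \le x \le 1,\\
1  & \text{if } x > 1.
\end{cases}
```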
24. 2424
• Window Approach works well for most NLP tasks . However, it
fails with Semantic Role Labelling.
Window Approach Remark
Reason: the tag of a word depends on the
verb ( predicate) chosen beforehand in
the sentence . If the verb falls outside the
window then one cannot expect this word
to be tagged correctly. Then it requires
the consideration of sentence approach.
25. Convolutional Layer: Sentence Approach
• A generalisation of the window approach: the same transformation is applied to every window of the sentence, so all windows in the sequence are taken into consideration.
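A hedged PyTorch sketch of the sentence approach: a 1-d convolution applies the same linear transformation to every window of word positions, and the paper then takes a max over time to obtain a fixed-size sentence representation. Sizes below are illustrative.

```python
import torch
import torch.nn as nn

n_word_features = 55    # per-word lookup-table output
n_hidden = 300          # convolution output size (illustrative)
window_size = 5

conv = nn.Conv1d(n_word_features, n_hidden, kernel_size=window_size,
                 padding=window_size // 2)       # one output vector per word position

sentence = torch.randn(1, n_word_features, 12)   # batch of 1 sentence with 12 words
per_window = conv(sentence)                      # (1, n_hidden, 12): one vector per window
sentence_vector, _ = per_window.max(dim=2)       # max over time -> (1, n_hidden)
print(sentence_vector.shape)
```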
33. Training: Word-Level Log-Likelihood
• A softmax over all tags turns the network scores for each word into probabilities, and training maximises the log-likelihood (cross-entropy).
• This is not ideal, because the tag of a word in a sentence also depends on the tags of its neighbouring words.
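In symbols (a standard log-softmax, which is what the word-level criterion amounts to): the network score f_θ(j | x) of each tag j is normalised with a softmax, and the log-probability of the correct tag y is maximised:

```latex
\log p(y \mid x, \theta) \;=\; f_\theta(y \mid x) \;-\; \log \sum_{j} e^{\,f_\theta(j \mid x)}
```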
34. Training: Sentence-Level Log-Likelihood
• Introduce a transition score A_{k,i} for jumping from tag k to tag i in successive words.
• The sentence score of a tag path [i]^T_1 sums the transition scores and the network scores along the path:
  s([x]^T_1, [i]^T_1, θ) = Σ_{t=1..T} ( A_{[i]_{t−1},[i]_t} + f_θ([i]_t, t) )
35. Training: Sentence-Level Log-Likelihood
• The conditional likelihood of the correct tag path is obtained by normalising its sentence score with respect to all possible tag paths.
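A minimal numpy sketch of this normalisation, assuming scores[t, k] is the network score for tag k at word t and A[i, j] is the transition score from tag i to tag j (the function name is ours). The logadd over all tag paths is computed with the standard forward recursion rather than by enumerating paths:

```python
import numpy as np
from scipy.special import logsumexp

def sentence_log_likelihood(scores, A, tags):
    """log p(tag path | sentence): path score minus logadd over all paths."""
    T, K = scores.shape
    # score of the given tag path: network scores plus transition scores
    path_score = scores[0, tags[0]] + sum(
        A[tags[t - 1], tags[t]] + scores[t, tags[t]] for t in range(1, T))
    # forward recursion: delta[k] = logadd of the scores of all paths ending in tag k
    delta = scores[0].copy()
    for t in range(1, T):
        delta = scores[t] + logsumexp(delta[:, None] + A, axis=0)
    return path_score - logsumexp(delta)

rng = np.random.default_rng(0)
print(sentence_log_likelihood(rng.normal(size=(6, 4)),
                              rng.normal(size=(4, 4)),
                              [0, 1, 1, 2, 3, 0]))
```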
37.
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results
38. Pre-processing
• Use lower-case words in the dictionary.
• Add a 'caps' feature to words that have at least one non-initial capital letter.
• Numbers within a word are replaced with the string 'NUMBER'.
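A small sketch of these pre-processing steps (the function name and the returned caps coding are illustrative):

```python
import re

def preprocess(word):
    """Return (dictionary key, caps feature) for a raw token."""
    # caps feature: does the word have at least one non-initial capital letter?
    caps = "noninitial_cap" if any(c.isupper() for c in word[1:]) else "other"
    # dictionary lookups use the lower-cased word, with digit runs replaced
    key = re.sub(r"\d+", "NUMBER", word.lower())
    return key, caps

print(preprocess("iPhone"))   # ('iphone', 'noninitial_cap')
print(preprocess("B52"))      # ('bNUMBER', 'other')
```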
42. Word Embeddings
• The lookup table can also be trained on unlabelled data by optimising it to learn a language model.
• This gives word features that map semantically similar words to similar vectors.
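The language-model training in the paper uses a pairwise ranking criterion of roughly this form (stated from memory, so treat the exact notation as an approximation): a genuine text window x should score at least a margin of 1 above the same window x^{(w)} with its centre word replaced by another dictionary word w:

```latex
\theta \;\mapsto\; \sum_{x \in \mathcal{X}} \; \sum_{w \in \mathcal{D}}
\max\bigl\{\, 0,\; 1 - f_\theta(x) + f_\theta(x^{(w)}) \,\bigr\}
```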
45. 4545
Tremendous Unlabelled Data
Lots of Unlabeled Data
• Two window approach (11) networks (100HU) trained on
two corpus
• LM1
– Wikipedia: 631 Mwords
– order dictionary words by frequency
– increase dictionary size: 5000, 10; 000, 30; 000, 50; 000,
100; 000
– 4 weeks of training
• LM2
– Wikipedia + Reuter=631+221=852M words
– initialized with LM1, dictionary size is 130; 000
– 30,000 additional most frequent Reuters words
– 3 additional weeks of training
49. 4949
Xiv
Natural Language Processing (almost) from Scratch
[Figure 5: two window-approach networks (Lookup Table → Linear → HardTanh → Linear) sharing the lookup tables LT_{W^1}, …, LT_{W^K} and the first linear layer M^1 (n^1_hu hidden units); the final linear layers M^2_{(t1)} and M^2_{(t2)} are task specific, each with n^2_hu,(t) = #tags outputs.]
Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with
the window approach architecture presented in Figure 1. Lookup tables as well as the first
hidden layer are shared. The last layer is task specific. The principle is the same with more
than two tasks.
5.2 Multi-Task Benchmark Results. Table 9 reports results obtained by jointly trained models for the POS, CHUNK, NER and SRL tasks using the same setup as Section 4.5. We trained jointly POS, CHUNK and NER using the window approach network. As we mentioned earlier, SRL can be trained only with the sentence approach network, due to long-range dependencies related to the verb.
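A hedged PyTorch sketch of the sharing scheme in Figure 5: the lookup table and the first linear layer are shared, and only the last linear layer is task specific. The class name and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskWindowNet(nn.Module):
    def __init__(self, vocab_size=130_000, dim=50, window=5,
                 n_hidden=300, n_tags_per_task=(45, 23)):
        super().__init__()
        # shared parameters: lookup table and first hidden layer
        self.lookup = nn.Embedding(vocab_size, dim)
        self.shared = nn.Sequential(nn.Linear(window * dim, n_hidden), nn.Hardtanh())
        # one task-specific output layer per task (#outputs = #tags of that task)
        self.heads = nn.ModuleList([nn.Linear(n_hidden, n) for n in n_tags_per_task])

    def forward(self, word_indices, task):
        # word_indices: (batch, window) indices into the shared lookup table
        x = self.lookup(word_indices).flatten(start_dim=1)  # (batch, window * dim)
        return self.heads[task](self.shared(x))             # (batch, #tags of task)

net = MultiTaskWindowNet()
batch = torch.randint(0, 130_000, (8, 5))
print(net(batch, task=0).shape, net(batch, task=1).shape)   # (8, 45) (8, 23)
```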
52. The Temptation
• Suffix Features
– Use the last two characters as a feature
• Gazetteers
  – 8,000 locations, person names, organizations and misc. entries from CoNLL 2003
• POS
– use POS as a feature for CHUNK & NER
• CHUNK
– use CHUNK as a feature for SRL
54. Ensembles
• Ten neural networks are combined in two ways:
  – Voting ensemble: vote over the ten network outputs on a per-tag basis.
  – Joined ensemble: the parameters of a combining output layer are trained on the existing training set while keeping the ten networks fixed.
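A minimal sketch of per-tag voting; the prediction array is illustrative and uses only three networks for brevity, whereas the slide combines ten:

```python
import numpy as np

# preds[i, t] = tag index predicted by network i for token t (toy data)
preds = np.array([
    [0, 2, 1, 1],
    [0, 2, 2, 1],
    [0, 1, 1, 1],
])

n_tags = preds.max() + 1
# majority vote per token position
voted = np.array([np.bincount(preds[:, t], minlength=n_tags).argmax()
                  for t in range(preds.shape[1])])
print(voted)  # -> [0 2 1 1]
```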
55. Conclusion
• Achievements
– "All purpose" neural network architecture for NLP tagging
– Limit task-specific engineering
– Rely on very large unlabeled datasets
– We do not plan to stop here
• Critiques
  – Why trade NLP expertise for neural network training skills?
• NLP goals are not limited to existing NLP tasks
• Excessive task-specific engineering is not desirable
– Why neural networks?
• Scale to massive datasets
• Discover hidden representations
• Most of the neural network technology already existed in 1997 (Bottou, 1997)