The document presents a neural network architecture for various natural language processing (NLP) tasks such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. It shows results comparable to state-of-the-art using word embeddings learned from a large unlabeled corpus, and improved results from joint training of the tasks. The network transforms words into feature vectors, extracts higher-level features through neural layers, and is trained via backpropagation. Benchmark results demonstrate performance on par with traditional task-specific systems without heavy feature engineering.
2.
• Presents a deep neural network architecture for NLP tasks
• Presents results comparable to the state of the art on 4 NLP tasks
• Part of Speech tagging
• Chunking
• Named Entity Recognition
• Semantic Role Labeling
• Presents word embeddings learned from a large unlabelled corpus and shows an improvement in results by using these features
• Presents results of joint training for the above tasks.
3. Motivation
• Proposes a unified neural network architecture and learning algorithm that can be applied to various NLP tasks.
• Instead of creating hand-crafted features, task-specific features (internal representations) are acquired from large amounts of labelled and unlabelled training data.
4. Task Introduction
• Part-of-Speech Tagging
  – Automatically assign a part-of-speech tag to each word in a text sequence.
• Chunking
  – Also called shallow parsing: the identification of parts of speech and short phrases (such as noun phrases).
• Named Entity Recognition
  – Classify elements of the text into predefined categories such as person, location, etc.
5. Semantic Role Labeling
• SRL, sometimes also called shallow semantic parsing, is the task of detecting the semantic arguments associated with the predicate (verb) of a sentence and classifying them into their specific roles.
• e.g. Mark sold the car to Mary. (Mark = agent, sold = predicate, the car = theme, Mary = recipient)
14. Lookup Tables
• Each of the K discrete features indexes into its own matrix, used as a lookup table: the feature's value selects a feature vector from the matrix.
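As a minimal, hedged sketch (the sizes and the random initialisation are illustrative, not the paper's exact setup), a lookup table is just a matrix whose rows are feature vectors, indexed by the discrete feature value:

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table per discrete feature: a trainable matrix with one row
# (feature vector) per possible value of that feature.
vocab_size, dim = 130_000, 50                      # e.g. the word feature
word_table = rng.normal(size=(vocab_size, dim)).astype(np.float32)

word_index = 42                                    # index of some word in the dictionary
word_vector = word_table[word_index]               # row lookup = that word's feature vector
print(word_vector.shape)                           # (50,)
```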
15. Words to Features: Window Approach
• Window size: for example, 5
• Raw text features:
  – lower-case word
  – capitalisation feature
16. Window Approach
Sentence: My Name is Bryan
Padded: PADDING PADDING My Name is Bryan PADDING PADDING
Windows (one per word):
  PADDING PADDING My Name is
  PADDING My Name is Bryan
  My Name is Bryan PADDING
  Name is Bryan PADDING PADDING
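A small sketch of the padding and window extraction shown above (window size 5; the function name is ours, not the paper's):

```python
def windows(tokens, size=5, pad="PADDING"):
    """Return one window of `size` tokens centred on each word."""
    half = size // 2
    padded = [pad] * half + tokens + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]

for w in windows(["My", "Name", "is", "Bryan"]):
    print(" ".join(w))
# PADDING PADDING My Name is
# PADDING My Name is Bryan
# My Name is Bryan PADDING
# Name is Bryan PADDING PADDING
```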
17. Words to Features
• Each word (e.g. "My") is mapped to two indices: a word index into the word lookup table (vocabulary size 130,000) and a caps index into the caps lookup table (5 options).
• The word lookup table returns a 50-dimensional vector and the caps lookup table a 5-dimensional vector; together they form the word's feature vector.
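A hedged sketch of these two lookups. The toy dictionary, the random tables, and the exact 5-way capitalisation coding are illustrative stand-ins, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
word_table = rng.normal(size=(130_000, 50)).astype(np.float32)  # word lookup table
caps_table = rng.normal(size=(5, 5)).astype(np.float32)         # caps lookup table (5 options)

def caps_index(word):
    # One plausible 5-way coding (assumption, not necessarily the paper's exact scheme):
    # 0: all lower, 1: all caps, 2: initial cap only, 3: non-initial cap, 4: other
    if word.islower(): return 0
    if word.isupper(): return 1
    if word[0].isupper() and word[1:].islower(): return 2
    if any(c.isupper() for c in word[1:]): return 3
    return 4

word_index = {"my": 7, "name": 8, "is": 9, "bryan": 10}          # toy dictionary

def word_features(word):
    w = word_table[word_index[word.lower()]]   # 50-dim word vector
    c = caps_table[caps_index(word)]           # 5-dim caps vector
    return np.concatenate([w, c])              # 55-dim per-word feature vector

print(word_features("My").shape)               # (55,)
```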
19.
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results
20. Extracting Higher Level Features from Word Feature Vectors
• Any feed-forward neural network with L layers can be seen as a composition of functions, one per layer l:
  f_θ(·) = f^L_θ ( f^{L−1}_θ ( … f^1_θ(·) … ) )
  where θ denotes the parameters to be trained.
21. Window Approach: the Window Vector
• For the word at position t, the lookup-table outputs of the words in its window are concatenated into a single fixed-size vector f^1_θ (the window vector).
• Example: for the word "My" the window is PADDING PADDING My Name is, giving a 275-dimensional window vector (5 words × 55 features per word).
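Continuing the toy sketch, the window vector is simply the concatenation of the five 55-dimensional per-word vectors (random placeholders stand in for the real lookup-table outputs here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are the 55-dim lookup-table outputs for
# PADDING, PADDING, My, Name, is
per_word = [rng.normal(size=55).astype(np.float32) for _ in range(5)]

window_vector = np.concatenate(per_word)   # f^1_theta for the word "My"
print(window_vector.shape)                 # (275,) = 5 words x 55 features
```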
22. Linear Layer (window approach)
• The fixed-size vector f^1_θ can be fed to one or several standard neural network layers which perform affine transformations over their inputs:
  f^l_θ = W^l f^{l−1}_θ + b^l
  where W^l ∈ ℝ^{n^l_hu × n^{l−1}_hu} and b^l ∈ ℝ^{n^l_hu} are the parameters to be trained, and n^l_hu is the number of hidden units of the l-th layer.
• HardTanh layer: several linear layers are often stacked, interleaved with a non-linearity function, to extract highly non-linear features. Without the non-linearity the network would be just a linear model.
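A minimal PyTorch sketch of these layers for the window approach; the hidden size and tag count are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

n_window_features = 275   # window vector f^1_theta (5 words x 55 features)
n_hidden = 300            # number of hidden units n_hu (illustrative)
n_tags = 45               # e.g. size of the tag set (illustrative)

window_net = nn.Sequential(
    nn.Linear(n_window_features, n_hidden),  # f^2 = W^2 f^1 + b^2
    nn.Hardtanh(),                           # cheap non-linearity
    nn.Linear(n_hidden, n_tags),             # one score per tag
)

f1 = torch.randn(1, n_window_features)       # a batch with one window vector
scores = window_net(f1)                      # shape (1, n_tags)
print(scores.shape)
```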
23. HardTanh Layer
• Provides the non-linearity between linear layers (non-linear features) in the window approach.
• HardTanh is used instead of the hyperbolic tangent to make the computation cheaper.
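For reference, the hard version of the tanh is the piecewise-linear function

```latex
\operatorname{HardTanh}(x) =
\begin{cases}
-1 & \text{if } x < -1,\\
x  & \text{if } -1 \le x \le 1,\\
1  & \text{if } x > 1.
\end{cases}
```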
24. 2424
• Window Approach works well for most NLP tasks . However, it
fails with Semantic Role Labelling.
Window Approach Remark
Reason: the tag of a word depends on the
verb ( predicate) chosen beforehand in
the sentence . If the verb falls outside the
window then one cannot expect this word
to be tagged correctly. Then it requires
the consideration of sentence approach.
25. Convolutional Layer: Sentence Approach
• A generalisation of the window approach: the same transformation is applied to every window of the sentence, so all windows in the sequence are taken into consideration.
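A hedged PyTorch sketch of the sentence approach: a 1-d convolution applies the same linear transformation to every window of word positions, and the paper then takes a max over time to obtain a fixed-size sentence representation. Sizes below are illustrative.

```python
import torch
import torch.nn as nn

n_word_features = 55    # per-word lookup-table output
n_hidden = 300          # convolution output size (illustrative)
window_size = 5

conv = nn.Conv1d(n_word_features, n_hidden, kernel_size=window_size,
                 padding=window_size // 2)       # one output vector per word position

sentence = torch.randn(1, n_word_features, 12)   # batch of 1 sentence with 12 words
per_window = conv(sentence)                      # (1, n_hidden, 12): one vector per window
sentence_vector, _ = per_window.max(dim=2)       # max over time -> (1, n_hidden)
print(sentence_vector.shape)
```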
33. Training: Word-Level Log-Likelihood
• A softmax over all tags turns the network scores for each word into probabilities, and training maximises the log-likelihood (cross-entropy).
• This is not ideal, because the tag of a word in a sentence also depends on the tags of its neighbouring words.
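In symbols (a standard log-softmax, which is what the word-level criterion amounts to): the network score f_θ(j | x) of each tag j is normalised with a softmax, and the log-probability of the correct tag y is maximised:

```latex
\log p(y \mid x, \theta) \;=\; f_\theta(y \mid x) \;-\; \log \sum_{j} e^{\,f_\theta(j \mid x)}
```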
34. Training: Sentence-Level Log-Likelihood
• Introduce a transition score A_{k,i} for jumping from tag k to tag i in successive words.
• The sentence score of a tag path [i]^T_1 sums the transition scores and the network scores along the path:
  s([x]^T_1, [i]^T_1, θ) = Σ_{t=1..T} ( A_{[i]_{t−1},[i]_t} + f_θ([i]_t, t) )
35. Training: Sentence-Level Log-Likelihood
• The conditional likelihood of the correct tag path is obtained by normalising its sentence score with respect to all possible tag paths.
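A minimal numpy sketch of this normalisation, assuming scores[t, k] is the network score for tag k at word t and A[i, j] is the transition score from tag i to tag j (the function name is ours). The logadd over all tag paths is computed with the standard forward recursion rather than by enumerating paths:

```python
import numpy as np
from scipy.special import logsumexp

def sentence_log_likelihood(scores, A, tags):
    """log p(tag path | sentence): path score minus logadd over all paths."""
    T, K = scores.shape
    # score of the given tag path: network scores plus transition scores
    path_score = scores[0, tags[0]] + sum(
        A[tags[t - 1], tags[t]] + scores[t, tags[t]] for t in range(1, T))
    # forward recursion: delta[k] = logadd of the scores of all paths ending in tag k
    delta = scores[0].copy()
    for t in range(1, T):
        delta = scores[t] + logsumexp(delta[:, None] + A, axis=0)
    return path_score - logsumexp(delta)

rng = np.random.default_rng(0)
print(sentence_log_likelihood(rng.normal(size=(6, 4)),
                              rng.normal(size=(4, 4)),
                              [0, 1, 1, 2, 3, 0]))
```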
37.
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results
38. Pre-processing
• Use lower-case words in the dictionary.
• Add a 'caps' feature to words that have at least one non-initial capital letter.
• Numbers within a word are replaced with the string 'NUMBER'.
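A small sketch of these pre-processing steps (the function name and the returned caps coding are illustrative):

```python
import re

def preprocess(word):
    """Return (dictionary key, caps feature) for a raw token."""
    # caps feature: does the word have at least one non-initial capital letter?
    caps = "noninitial_cap" if any(c.isupper() for c in word[1:]) else "other"
    # dictionary lookups use the lower-cased word, with digit runs replaced
    key = re.sub(r"\d+", "NUMBER", word.lower())
    return key, caps

print(preprocess("iPhone"))   # ('iphone', 'noninitial_cap')
print(preprocess("B52"))      # ('bNUMBER', 'other')
```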
42. Word Embeddings
• The lookup table can also be trained on unlabelled data by optimising it to learn a language model.
• This gives word features that map semantically similar words to similar vectors.
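The language-model training in the paper uses a pairwise ranking criterion of roughly this form (stated from memory, so treat the exact notation as an approximation): a genuine text window x should score at least a margin of 1 above the same window x^{(w)} with its centre word replaced by another dictionary word w:

```latex
\theta \;\mapsto\; \sum_{x \in \mathcal{X}} \; \sum_{w \in \mathcal{D}}
\max\bigl\{\, 0,\; 1 - f_\theta(x) + f_\theta(x^{(w)}) \,\bigr\}
```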
45. 4545
Tremendous Unlabelled Data
Lots of Unlabeled Data
• Two window approach (11) networks (100HU) trained on
two corpus
• LM1
– Wikipedia: 631 Mwords
– order dictionary words by frequency
– increase dictionary size: 5000, 10; 000, 30; 000, 50; 000,
100; 000
– 4 weeks of training
• LM2
– Wikipedia + Reuter=631+221=852M words
– initialized with LM1, dictionary size is 130; 000
– 30,000 additional most frequent Reuters words
– 3 additional weeks of training
49. 4949
Xiv
Natural Language Processing (almost) from Scratch
[Figure 5: two window-approach networks (Lookup Table → Linear → HardTanh → Linear) sharing the lookup tables LT_{W^1}, …, LT_{W^K} and the first linear layer M^1 (n^1_hu hidden units); the final linear layers M^2_{(t1)} and M^2_{(t2)} are task specific, each with n^2_hu,(t) = #tags outputs.]
Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with
the window approach architecture presented in Figure 1. Lookup tables as well as the first
hidden layer are shared. The last layer is task specific. The principle is the same with more
than two tasks.
5.2 Multi-Task Benchmark Results. Table 9 reports results obtained by jointly trained models for the POS, CHUNK, NER and SRL tasks using the same setup as Section 4.5. We trained jointly POS, CHUNK and NER using the window approach network. As we mentioned earlier, SRL can be trained only with the sentence approach network, due to long-range dependencies related to the verb.
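A hedged PyTorch sketch of the sharing scheme in Figure 5: the lookup table and the first linear layer are shared, and only the last linear layer is task specific. The class name and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskWindowNet(nn.Module):
    def __init__(self, vocab_size=130_000, dim=50, window=5,
                 n_hidden=300, n_tags_per_task=(45, 23)):
        super().__init__()
        # shared parameters: lookup table and first hidden layer
        self.lookup = nn.Embedding(vocab_size, dim)
        self.shared = nn.Sequential(nn.Linear(window * dim, n_hidden), nn.Hardtanh())
        # one task-specific output layer per task (#outputs = #tags of that task)
        self.heads = nn.ModuleList([nn.Linear(n_hidden, n) for n in n_tags_per_task])

    def forward(self, word_indices, task):
        # word_indices: (batch, window) indices into the shared lookup table
        x = self.lookup(word_indices).flatten(start_dim=1)  # (batch, window * dim)
        return self.heads[task](self.shared(x))             # (batch, #tags of task)

net = MultiTaskWindowNet()
batch = torch.randint(0, 130_000, (8, 5))
print(net(batch, task=0).shape, net(batch, task=1).shape)   # (8, 45) (8, 23)
```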
52. The Temptation
• Suffix Features
– Use the last two characters as a feature
• Gazetteers
  – 8,000 locations, person names, organizations and misc. entries from CoNLL 2003
• POS
– use POS as a feature for CHUNK & NER
• CHUNK
– use CHUNK as a feature for SRL
54. Ensembles
• Ten neural networks are combined in two ways:
  – Voting ensemble: vote over the ten network outputs on a per-tag basis.
  – Joined ensemble: the parameters of a combining output layer are trained on the existing training set while keeping the ten networks fixed.
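A minimal sketch of per-tag voting; the prediction array is illustrative and uses only three networks for brevity, whereas the slide combines ten:

```python
import numpy as np

# preds[i, t] = tag index predicted by network i for token t (toy data)
preds = np.array([
    [0, 2, 1, 1],
    [0, 2, 2, 1],
    [0, 1, 1, 1],
])

n_tags = preds.max() + 1
# majority vote per token position
voted = np.array([np.bincount(preds[:, t], minlength=n_tags).argmax()
                  for t in range(preds.shape[1])])
print(voted)  # -> [0 2 1 1]
```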
55. Conclusion
• Achievements
– "All purpose" neural network architecture for NLP tagging
– Limit task-specific engineering
– Rely on very large unlabeled datasets
– We do not plan to stop here
• Critiques
  – Why trade NLP expertise for neural network training skills?
• NLP goals are not limited to existing NLP tasks
• Excessive task-specific engineering is not desirable
– Why neural networks?
• Scale to massive datasets
• Discover hidden representations
• Most of the neural network technology already existed in 1997 (Bottou, 1997)