deep learning
for natural language processing
Sergey I. Nikolenko1,2
AINL FRUCT 2016
St. Petersburg, November 10, 2016
1
Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
2
Steklov Institute of Mathematics at St. Petersburg
Random facts: November 10 is the UNESCO World Science Day for Peace and Development;
on November 10, 1871, Henry Morton Stanley correctly presumed he finally found Dr. Livingstone
plan
• The deep learning revolution has not left natural language
processing alone.
• DL in NLP has started with standard architectures (RNN, CNN)
but then has branched out into new directions.
• Our plan for today:
(1) intro to distributed word representations;
(2) a primer on sentence embeddings and character-level models;
(3) a ((very-)very) brief overview of the most promising directions in
modern NLP based on deep learning.
• We will concentrate on directions that have given rise to new
models and architectures.
2
basic nn architectures
• Basic neural network architectures that have been adapted for
deep learning over the last decade:
• feedforward NNs are the basic building block;
• autoencoders map a (possibly distorted) input to itself, usually for
feature engineering;
• convolutional NNs apply NNs with shared weights to certain
windows in the previous layer (or input), collecting first local and
then more and more global features;
• recurrent NNs have a hidden state and propagate it further; they are
used for sequence learning;
• in particular, LSTM (long short-term memory) and GRU (gated
recurrent unit) units are an important class of RNN architectures often
used in NLP, good at capturing longer dependencies;
• main idea: in an LSTM, $c_t = f_t \odot c_{t-1} + \ldots$, so unless the LSTM
actually wants to forget something, $\frac{\partial c_t}{\partial c_{t-1}} = 1 + \ldots$, and the
gradients do not vanish.
• “Deep learning” refers to several layers; any network mentioned
above can be deep or shallow, usually in several different ways.
3
word embeddings,
sentence embeddings,
and character-level models
word embeddings
• Distributional hypothesis in linguistics: words with similar
meaning will occur in similar contexts.
• Distributed word representations map words to a Euclidean
space (usually of dimension several hundred):
• started in earnest in (Bengio et al. 2003; 2006), although there
were earlier ideas;
• word2vec (Mikolov et al. 2013): train weights that serve best for
simple prediction tasks between a word and its context:
continuous bag-of-words (CBOW) and skip-gram;
• GloVe (Pennington et al. 2014): train word weights to decompose
the (log) cooccurrence matrix.
5
word embeddings
• Difference between skip-gram and CBOW architectures:
• CBOW model predicts a word from its local context;
• skip-gram model predicts context words from the current word.
5
word embeddings
• The CBOW word2vec model operates as follows:
• inputs are one-hot word representations of dimension $V$;
• the hidden layer is the matrix of vector embeddings $W$;
• the hidden layer’s output is the average of the input vectors;
• as output we get a score $u_j$ for each word, and the posterior
is a simple softmax:
$$\hat{p}(i \mid c_1, \ldots, c_n) = \frac{\exp(u_i)}{\sum_{j'=1}^{V} \exp(u_{j'})}.$$
• In skip-gram, it’s the opposite:
• we predict each context word from the central word;
• so now there are several multinomial distributions, one softmax
for each context word:
$$\hat{p}(c_k \mid i) = \frac{\exp(u_{k,c_k})}{\sum_{j'=1}^{V} \exp(u_{j'})}.$$
5
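As a quick illustration of the CBOW and skip-gram setups (not part of the original slides), here is a minimal training sketch using the gensim library; the toy corpus and all parameter values are invented, and gensim ≥ 4 is assumed.

```python
# Minimal sketch: training CBOW and skip-gram word2vec with gensim (assumed >= 4.x).
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large tokenized text collection.
sentences = [
    ["deep", "learning", "for", "natural", "language", "processing"],
    ["words", "with", "similar", "meaning", "occur", "in", "similar", "contexts"],
]

# sg=0 -> CBOW (predict a word from its context); sg=1 -> skip-gram (predict context words).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Nearest neighbors in the embedding space (the "live demo" mentioned later in the talk).
print(skipgram.wv.most_similar("language", topn=3))
```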
word embeddings
• How do we train a model like that?
• E.g., in skip-gram we choose $\theta$ to maximize
$$L(\theta) = \prod_{i \in D} \left( \prod_{c \in C(i)} p(c \mid i; \theta) \right) = \prod_{(i,c) \in D} p(c \mid i; \theta),$$
and we parameterize
$$p(c \mid i; \theta) = \frac{\exp(\tilde{w}_c^\top w_i)}{\sum_{c'} \exp(\tilde{w}_{c'}^\top w_i)}.$$
6
word embeddings
• This leads to the total likelihood
$$\arg\max_\theta \prod_{(i,c) \in D} p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \log p(c \mid i; \theta) =$$
$$= \arg\max_\theta \sum_{(i,c) \in D} \left( \tilde{w}_c^\top w_i - \log \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i) \right),$$
which we maximize with negative sampling (see the sketch after this slide).
• Question: why do we need separate $\tilde{w}$ and $w$ vectors?
• Live demo: nearest neighbors, simple geometric relations.
6
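The negative-sampling objective mentioned above can be made concrete with a tiny numpy sketch (my illustration, not from the slides): one stochastic update pushes the context vector $\tilde{w}_c$ towards the center vector $w_i$ and pushes a few randomly sampled “negative” context vectors away. All sizes and the learning rate are arbitrary.

```python
# Minimal sketch of one skip-gram negative-sampling (SGNS) update in numpy.
import numpy as np

V, dim, lr = 10000, 100, 0.025
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=(V, dim))        # "input" vectors w_i
w_tilde = rng.normal(scale=0.01, size=(V, dim))  # "output"/context vectors w~_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, n_neg=5):
    """Push w~_context towards w_center; push a few random negatives away."""
    negatives = rng.integers(0, V, size=n_neg)
    grad_center = np.zeros(dim)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(w_tilde[c] @ w[center])
        g = lr * (label - score)
        grad_center += g * w_tilde[c]
        w_tilde[c] += g * w[center]
    w[center] += grad_center

sgns_step(center=42, context=7)
```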
how to use word vectors
• Next we can use recurrent architectures on top of word vectors.
• E.g., LSTMs for sentiment analysis:
• Train a network of LSTMs for language modeling, then use either
the last output or averaged hidden states for sentiment.
• We will see a lot of other architectures later.
7
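A minimal sketch of the “LSTM over word embeddings for sentiment” idea in Keras (my illustration; the vocabulary size, dimensions, and training data are placeholders):

```python
# Sketch: embedding layer -> LSTM -> sigmoid sentiment classifier (tensorflow.keras assumed).
import tensorflow as tf

vocab_size, embed_dim = 20000, 128
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.LSTM(64),                       # use the last hidden state...
    tf.keras.layers.Dense(1, activation="sigmoid")  # ...to predict sentiment polarity
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=...)   # trained on padded integer sequences
```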
up and down from word embeddings
• Word embeddings are the first step of most DL models in NLP.
• But we can go both up and down from word embeddings.
• First, a sentence is not necessarily the sum of its words.
• Second, a word is not quite as atomic as the word2vec model
would like to think.
8
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• The simplest idea is to use the sum and/or mean of word
embeddings to represent a sentence/paragraph:
• a baseline in (Le and Mikolov 2014);
• a reasonable method for short phrases in (Mikolov et al. 2013);
• shown to be effective for document summarization in (Kageback
et al. 2014).
9
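The sum/mean baseline is trivial to implement; a sketch (illustrative names only, `wv` is any word-to-vector lookup such as a trained word2vec model):

```python
# Sketch: the simplest sentence embedding is the mean of its word vectors.
import numpy as np

def sentence_embedding(tokens, wv, dim=100):
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:                    # no known words: return the zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```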
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and
Mikolov 2014):
• a sentence/paragraph vector is an additional vector for each
paragraph;
• acts as a “memory” to provide longer context;
9
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• Distributed Bag of Words Model of Paragraph Vectors (PV-DBOW)
(Le and Mikolov 2014):
• the model is forced to predict words randomly sampled from a
specific paragraph;
• the paragraph vector is trained to help predict words from the
same paragraph in a small window.
9
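Both paragraph-vector variants are available in gensim’s Doc2Vec; a minimal sketch (toy corpus, gensim ≥ 4 assumed):

```python
# Sketch: PV-DM (dm=1) and PV-DBOW (dm=0) paragraph vectors with gensim.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["neural", "nets", "for", "text"], tags=["d0"]),
    TaggedDocument(words=["paragraph", "vectors", "as", "memory"], tags=["d1"]),
]
pv_dm = Doc2Vec(docs, vector_size=50, min_count=1, dm=1)    # PV-DM
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0)  # PV-DBOW
print(pv_dm.dv["d0"][:5])  # learned paragraph vector for document "d0"
```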
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• A number of convolutional architectures (Ma et al., 2015;
Kalchbrenner et al., 2014).
• (Kiros et al. 2015): skip-thought vectors capture the meanings of
a sentence by training from skip-grams constructed on
sentences.
• (Djuric et al. 2015): model large text streams with hierarchical
neural language models with a document level and a token
level.
9
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• Recursive neural networks (Socher et al., 2012):
• a neural network composes a chunk of text with another part in a
tree;
• works its way up from word vectors to the root of a parse tree.
9
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• Recursive neural networks (Socher et al., 2012):
• by training this in a supervised way, one can get a very effective
approach to sentiment analysis (Socher et al. 2013).
9
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• A similar effect can be achieved with CNNs.
• Unfolding Recursive Auto-Encoder model (URAE) (Socher et al.,
2011) collapses all word embeddings into a single vector
following the parse tree and then reconstructs back the original
sentence; applied to paraphrasing and paraphrase detection.
9
deep recursive networks
• Deep recursive networks for sentiment analysis (Irsoy, Cardie,
2014).
• First idea: decouple leaves and internal nodes.
• In recursive networks, we apply the same weights throughout
the tree:
$x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b).$
• Now, we use different matrices for leaves (input words) and
hidden nodes:
• we can now have fewer hidden units than the word vector
dimension;
• we can use ReLU: sparse inputs and dense hidden units do not
cause a discrepancy.
10
deep recursive networks
• Second idea: add depth to get hierarchical representations:
$$h_v^{(i)} = f\left(W_L^{(i)} h_{l(v)}^{(i)} + W_R^{(i)} h_{r(v)}^{(i)} + V^{(i)} h_v^{(i-1)} + b^{(i)}\right).$$
• An excellent architecture for sentiment analysis... if you have
the parse trees.
10
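A toy sketch of the basic recursive composition rule $x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b)$ over a binary parse tree (single layer, random placeholder weights; not the full deep model from the paper):

```python
# Sketch: bottom-up recursive composition over a binary tree given as nested tuples.
import numpy as np

dim = 50
rng = np.random.default_rng(0)
W_L = rng.normal(scale=0.1, size=(dim, dim))
W_R = rng.normal(scale=0.1, size=(dim, dim))
b = np.zeros(dim)

def compose(node, wv):
    """node is either a word (leaf) or a (left, right) pair of subtrees."""
    if isinstance(node, str):
        return wv.get(node, np.zeros(dim))       # leaf: look up the word vector
    left, right = node
    x_l, x_r = compose(left, wv), compose(right, wv)
    return np.tanh(W_L @ x_l + W_R @ x_r + b)    # internal node: compose children

# e.g., compose((("a", "parse"), "tree"), word_vectors) yields one vector for the root
```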
character-level models
• Word embeddings have important shortcomings:
• vectors are independent but words are not; consider, in particular,
morphology-rich languages like Russian/Ukrainian;
• the same applies to out-of-vocabulary words: a word embedding
cannot be extended to new words;
• word embedding models may grow large; it’s just lookup, but the
whole vocabulary has to be stored in memory with fast access.
• E.g., “polydistributional” gets 48 results on Google, so you
probably have never seen it, and there’s very little training data:
• Do you have an idea what it means? Me neither.
11
character-level models
• Hence, character-level representations:
• began by decomposing a word into morphemes (Luong et al. 2013;
Botha and Blunsom 2014; Soricut and Och 2015);
• but this adds errors since morphological analyzers are also
imperfect, and basically a part of the problem simply shifts to
training a morphology model;
• two natural approaches on character level: LSTMs and CNNs;
• in any case, the model is slow but we do not have to apply it to
every word, we can store embeddings of common words in a
lookup table as before and only run the model for rare words – a
nice natural tradeoff.
11
character-level models
• C2W (Ling et al. 2015) is based on bidirectional LSTMs:
11
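The C2W idea can be sketched in a few lines of PyTorch: run a bidirectional LSTM over character embeddings and combine the final forward and backward states into a word vector (my sketch; alphabet size and dimensions are made up, not the paper’s exact architecture):

```python
# Sketch: a word embedding computed from characters with a bidirectional LSTM.
import torch
import torch.nn as nn

class CharToWord(nn.Module):
    def __init__(self, n_chars=128, char_dim=32, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, word_dim // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # char_ids: (batch, word_length)
        h, _ = self.bilstm(self.char_emb(char_ids))
        half = h.size(-1) // 2
        # last forward state + first backward state -> word vector
        return torch.cat([h[:, -1, :half], h[:, 0, half:]], dim=-1)

word = torch.tensor([[ord(c) for c in "polydistributional"]])
print(CharToWord()(word).shape)                  # torch.Size([1, 100])
```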
character-level models
• The approach of Deep Structured Semantic Model (DSSM)
(Huang et al., 2013; Gao et al., 2014a; 2014b):
• sub-word embeddings: represent a word as a bag of trigrams;
• the vocabulary shrinks to at most $|V|^3$ trigrams over the character
alphabet $V$ (tens of thousands instead of millions),
but collisions are very rare;
• the representation is robust to misspellings (very important for
user-generated texts).
11
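Sub-word hashing itself is a one-liner; a sketch of representing a word as a bag of letter trigrams with boundary markers (the ‘#’ padding follows the DSSM papers; the counter output format here is just illustrative):

```python
# Sketch: DSSM-style letter-trigram representation of a word.
from collections import Counter

def letter_trigrams(word):
    padded = f"#{word}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

print(letter_trigrams("good"))   # Counter({'#go': 1, 'goo': 1, 'ood': 1, 'od#': 1})
```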
character-level models
• ConvNet (Zhang et al. 2015): text understanding from scratch,
from the level of symbols, based on CNNs.
• Character-level models and extensions appear to be very
important, especially for morphology-rich languages like
Russian/Ukrainian.
11
word vectors with external information
• Other modifications of word embeddings add external
information.
• E.g., the RC-NET model (Xu et al. 2014) extends skip-grams with
relations (semantic and syntactic) and categorical knowledge
(sets of synonyms, domain knowledge etc.).
12
word vectors with external information
• The basic word2vec model gets a regularizer for every relation
that tries to bring it closer to a linear relation between the
vectors, so that, e.g.,
$$w_{\text{Hinton}} - w_{\text{Wimbledon}} \approx r_{\text{born at}} \approx w_{\text{Euler}} - w_{\text{Basel}}$$
12
word sense disambiguation
• Another important problem with both word vectors and
char-level models: homonyms.
• How do we distinguish different senses of the same word?
• the model usually just chooses one meaning;
• e.g., let’s check nearest neighbors for the word коса (a Russian
homonym: “braid”, “scythe”, or “sand spit”) and other homonyms.
• We have to add latent variables for the different meanings and infer
them from context.
• To train the meanings with latent variables, (Bartunov et al., 2015)
use Bayesian inference, specifically stochastic variational inference.
13
general approaches
text generation with rnns
• Language modeling and text generation is a natural direct
application of NN-based NLP; word embeddings started as a
“neural probabilistic language model” (Bengio et al., 2003).
• First idea – sequence learning with RNNs/LSTMs.
• Surprisingly, simple RNNs can produce quite reasonably-looking
text even by training character by character, with no knowledge
of the words (“The Unreasonable Effectiveness...”), including the
famous example from (Sutskever et al. 2011):
The meaning of life is the tradition of the ancient human reproduction: it is less
favorable to the good boy for when to remove her bigger...
• This is, of course, not “true understanding” (whatever that
means), only short-term memory effects.
• We need to go deeper in terms of both representations and
sequence modeling.
15
text generation with rnns
• One can change the diversity (sampling temperature) and get different styles of
absurdist texts. Random example with the seed «обещал, на рождество, но
спустя семь лет. имя глав».
• Low diversity gets you Lucky’s monologue from “Waiting for Godot”:
обещал, на рождество, но спустя семь лет. имя главного командования в составе
советского союза с 1976 года. после проведения в 1992 году в составе советского
союза (1977). в 1967 году в составе советского союза состоялся в 1952 году в составе
советско-финской войны 1877 - 1877 годов. в 1966 году в составе советского союза
с 1965 года по 1977 год...
• Mid-range diversity produces something close to meaningful text:
обещал, на рождество, но спустя семь лет. имя главного рода собственно вновь
образовалась в россии и народном состоянии. после присказа с постановлением
союза писателей россии и генеральной диссертации о спортивном училище с 1980
года. в 1970-х годах был основан в составе комитета высшего совета театра в
польши. в 1957 - 1962 годах - начальник батальона сан-аухаров...
• High diversity leads to Khlebnikov’s zaum:
обещал, на рождество, но спустя семь лет. имя главы философии пововпели nol-
lнози - врайу-7 на луосече. человеческая восстания покторов извоенного чомпде
и э. дроссенбурга, … карл уним-общекрипских. эйелем хфечак от этого списка
сравнивала имущно моря в юнасториансический индристское носительских
женатов в церкви испании....
16
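The “diversity” knob is just a softmax temperature applied to the language model’s output distribution before sampling; a sketch (illustrative, independent of any particular model):

```python
# Sketch: sampling the next symbol with a temperature parameter.
# Low temperature -> repetitive "safe" text; high temperature -> near-gibberish.
import numpy as np

def sample(logits, temperature=1.0, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# next_char = index_to_char[sample(model_logits, temperature=0.5)]
```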
poroshok
• And here are some poroshki generated with LSTMs from a
relatively small dataset:
заходит к солнцу отдаётесь
что он летел а может быть
и вовсе не веду на стенке
на поле пять и новый год
и почему то по башке
в квартире голуби и боли
и повзрослел и умирать
страшней всего когда ты выпил
без показания зонта
однажды я тебя не вышло
и ты
я захожу в макдоналисту
надену отраженный дождь
под ужин почему местами
и вдруг подставил человек
ты мне привычно верил крышу
до дна
я подползает под кроватью
чтоб он исписанный пингвин
и ты мне больше никогда
но мы же после русских классик
барто солдаты для любви
17
modern char-based language model: kim et al., 2015
18
dssm
• A general approach to NLP based on CNNs is given by the Deep
Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al.,
2014a; 2014b): a deep convolutional architecture trained on
pairs of similar texts.
• It can capture different kinds of targets:
• one-hot target vectors for classification (speech recognition,
image recognition, language modeling);
• vector-valued targets for semantic matching;
• to train with vector targets – reflection: bring source and target
vectors closer.
• DSSMs can be applied in a number of different contexts where
we can specify a supervised dataset:
• semantic word embeddings: word by context;
• web search: web documents by query;
• question answering: knowledge base relation/entity by pattern;
• recommendations: interesting documents by read/liked
documents;
• translation: target sentence by source sentence;
• text/image: labels by images or vice versa.
• Basically, this is an example of a general architecture that can
be trained to do almost anything.
• In particular, it can be used for information retrieval: model relevance by
bringing relevant documents closer to their queries (both
document and query go through the same convolutional
architecture).
• November 2, 2016: a post by Yandex saying that they use a
(modified) DSSM in their new Palekh search algorithm.
19
dqn
• (Guo, 2015): generating text with deep reinforcement learning.
• Begin with easy parts, then iteratively decode the hard parts
with DQN.
• Next, we proceed to specific NLP problems that have led to
interesting developments.
20
dependency parsing
dependency parsing
• We mentioned parse trees; but how do we construct them?
• Current state of the art – continuous-state parsing: the current state
is encoded in $\mathbb{R}^d$.
• Stack LSTMs (Dyer et al., 2015) – the parser manipulates three
basic data structures:
(1) a buffer $B$ that contains the sequence of words, with state $b_t$;
(2) a stack $S$ that stores partially constructed parses, with state $s_t$;
(3) a list $A$ of actions already taken by the parser, with state $a_t$.
• $b_t$, $s_t$, and $a_t$ are hidden states of stack LSTMs, i.e., LSTMs that have a
stack pointer: new inputs are added from the right, but the
current location of the stack pointer shows which cell’s state is
used to compute new memory cell contents.
22
dependency parsing with morphology
• Important extension – (Ballesteros et al., 2015):
• in morphologically rich natural languages, we have to take into
account morphology;
• so they represent the words by bidirectional character-level LSTMs;
• report improved results in Arabic, Basque, French, German,
Hebrew, Hungarian, Korean, Polish, Swedish, and Turkish;
• this direction probably can be further improved (and where’s
Russian or Ukrainian in the list above?..).
23
evaluation for sequence-to-sequence models
• Next we will consider specific models for machine translation,
dialog models, and question answering.
• But how do we evaluate NLP models that produce text?
• Quality metrics for comparing with reference sentences
produced by humans:
• BLEU (Bilingual Evaluation Understudy): reweighted precision (incl.
multiple reference translations);
• METEOR: harmonic mean of unigram precision and unigram recall;
• TER (Translation Edit Rate): number of edits between the output
and reference divided by the average number of reference words;
• LEPOR: combine basic factors and language metrics with tunable
parameters.
• The same metrics apply to paraphrasing and, generally, all
problems where the (supervised) answer should be a free-form
text.
24
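For instance, sentence-level BLEU can be computed with NLTK (my illustration; nltk is assumed to be installed and the sentences are toy examples):

```python
# Sketch: BLEU between a candidate translation and a (tokenized) reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```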
machine translation
machine translation
• Translation is a very convenient problem for modern NLP:
• on one hand, it is very practical, obviously important;
• on the other hand, it’s very high-level, virtually impossible without
deep understanding, so if we do well on translation, we probably
do something right about understanding;
• on the third hand (oops), it’s quantifiable (BLEU, TER etc.) and has
relatively large available datasets (parallel corpora).
26
machine translation
• Statistical machine translation (SMT): model the conditional
probability $p(y \mid x)$ of the target $y$ (translation) given the source $x$ (text).
• Classical SMT: model $\log p(y \mid x)$ with a linear combination of
features and then construct these features.
• NNs have been used both for reranking $n$-best lists of possible
translations and as part of the feature functions:
26
machine translation
• NNs are still used for feature engineering with state of the art
results, but here we are more interested in
sequence-to-sequence modeling.
• Basic idea:
• RNNs can be naturally used to probabilistically model a sequence
$X = (x_1, x_2, \ldots, x_T)$ as $p(x_1)$, $p(x_2 \mid x_1)$, ...,
$p(x_T \mid x_{<T}) = p(x_T \mid x_{T-1}, \ldots, x_1)$, and then the joint probability
$p(X)$ is just their product
$$p(X) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_k \mid x_{<k}) \cdots p(x_T \mid x_{<T});$$
• this is how RNNs are used for language modeling;
• we predict next word based on the hidden state learned from all
previous parts of the sequence;
• in translation, maybe we can learn the hidden state from one
sentence and apply to another.
26
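The chain-rule factorization above is exactly what an RNN language model computes step by step; a sketch of turning per-step next-token distributions into a sequence log-probability (illustrative helper, not tied to any specific framework):

```python
# Sketch: log p(X) = sum_t log p(x_t | x_<t), given the per-step distributions.
import numpy as np

def sequence_log_prob(step_distributions, token_ids):
    """step_distributions[t] is the model's p(. | x_<t); token_ids[t] is x_t."""
    return sum(np.log(step_distributions[t][token_ids[t]])
               for t in range(len(token_ids)))
```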
machine translation
• Direct application – bidirectional LSTMs (Bahdanau et al. 2014):
• But do we really translate word by word?
26
machine translation
• No, we first understand the whole sentence; hence
encoder-decoder architectures (Cho et al. 2014):
26
machine translation
• But compressing the entire sentence to a fixed-dimensional
vector is hard; quality drops dramatically with length.
• Soft attention (Luong et al. 2015a; 2015b; Jean et al. 2015):
• encoder RNNs are bidirectional, so at every word we have a
“focused” representation with both contexts;
• attention NN takes the state and local representation and outputs
a relevance score – should we translate this word right now?
26
machine translation
• Soft attention (Luong et al. 2015a; 2015b; Jean et al. 2015):
• formally very simple: we compute attention weights $\alpha_{ij}$ and
reweigh context vectors with them:
$$e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \operatorname{softmax}(e_{ij}; e_{i*}),$$
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \quad \text{and now} \quad s_i = f(s_{i-1}, y_{i-1}, c_i).$$
26
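A numpy sketch of this soft-attention step (my illustration; the bilinear scoring function is one common choice for $a(\cdot,\cdot)$, and all shapes are arbitrary):

```python
# Sketch: score encoder states against the previous decoder state, softmax, build context.
import numpy as np

def attention_context(s_prev, H, W_a):
    """s_prev: (d_s,) decoder state; H: (T_x, d_h) encoder states; W_a: (d_s, d_h)."""
    e = H @ (W_a.T @ s_prev)          # e_ij = a(s_{i-1}, h_j), here a bilinear score
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()              # alpha_ij = softmax over source positions j
    return alpha @ H                  # c_i = sum_j alpha_ij * h_j
```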
machine translation
• We get better word order in the sentence as a whole:
• Attention is an “old” idea (Larochelle, Hinton, 2010); it can be
applied to other RNN architectures, e.g., in image processing and
speech recognition, and to other sequence-based NLP tasks:
• syntactic parsing (Vinyals et al. 2014),
• modeling pairs of sentences (Yin et al. 2015),
• question answering (Hermann et al. 2015),
• “Show, Attend, and Tell” (Xu et al. 2015).
26
google translate
• Sep 26, 2016: Wu et al., Google’s Neural Machine Translation
System: Bridging the Gap between Human and Machine
Translation:
• this very recent paper shows how Google Translate actually works;
• the basic architecture is the same: encoder, decoder, attention;
• RNNs have to be deep enough to capture language irregularities,
so 8 layers for encoder and decoder each:
27
google translate
• Sep 26, 2016: Wu et al., Google’s Neural Machine Translation
System: Bridging the Gap between Human and Machine
Translation:
• but stacking LSTMs does not really work: 4-5 layers are OK, 8
layers don’t work;
• so they add residual connections between the layers, similar to
(He, 2015):
27
google translate
• Sep 26, 2016: Wu et al., Google’s Neural Machine Translation
System: Bridging the Gap between Human and Machine
Translation:
• and it makes sense to make the bottom layer bidirectional in
order to capture as much context as possible:
27
google translate
• Sep 26, 2016: Wu et al., Google’s Neural Machine Translation
System: Bridging the Gap between Human and Machine
Translation:
• GNMT also uses two ideas for word segmentation:
• wordpiece model: break words into wordpieces (with a separate
model); example from the paper:
Jet makers feud over seat width with big orders at stake
becomes
_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
• mixed word/character model: use word model but for
out-of-vocabulary words convert them into characters (specifically
marked so that they cannot be confused); example from the paper:
Miki becomes <B>M <M>i <M>k <E>i
27
dialog and conversation
dialog and conversational models
• Dialog models attempt to model and predict dialogue;
conversational models actively talk to a human.
• Applications – automatic chat systems for business etc., so we
want to convey information.
• Vinyals and Le (2015) use seq2seq (Sutskever et al. 2014):
• feed previous sentences ABC as context to the RNN;
• predict the next word of reply WXYZ based on the previous word
and hidden state.
• They get a reasonable conversational model, both general
(MovieSubtitles) and in a specific domain (IT helpdesk).
29
dialog and conversational models
• Hierarchical recurrent encoder decoder architecture (HRED); first
proposed for query suggestion in IR (Sordoni et al. 2015), used
for dialog systems in (Serban et al. 2015).
• The dialogue as a two-level system: a sequence of utterances,
each of which is in turn a sequence of words. To model this
two-level system, HRED trains:
(1) encoder RNN that maps each utterance in a dialogue into a single
utterance vector;
(2) context RNN that processes all previous utterance vectors and
combines them into the current context vector;
(3) decoder RNN that predicts the tokens in the next utterance, one at
a time, conditional on the context RNN.
29
dialog and conversational models
• HRED architecture:
• (Serban et al. 2015) report promising results in terms of both
language models (perplexity) and expert evaluation.
29
dialog and conversational models
• Some recent developments:
• (Li et al., 2016a) apply, again, reinforcement learning (DQN) to
improve dialogue generation;
• (Li et al., 2016b) add personas with latent variables, so dialogue
can be more consistent (yes, it’s the same Li);
• (Wen et al., 2016) use snapshot learning, adding some weak
supervision in the form of particular events occurring in the
output sequence (whether we still want to say something or have
already said it);
• (Su et al., 2016) improve dialogue systems with online active
reward learning, a tool from reinforcement learning.
• Generally, chatbots are becoming commonplace, but there is still a
long way to go before actual general-purpose dialogue.
29
question answering
question answering
• Question answering (QA) is one of the hardest NLP challenges,
close to true language understanding.
• Let us begin with evaluation:
• it’s easy to find datasets for information retrieval;
• these questions can be answered with knowledge base approaches:
map questions to logical queries over a graph of facts;
• in a multiple choice setting (Quiz Bowl), map the question and
possible answers to a semantic space and find nearest neighbors
(Socher et al. 2014);
• but this is not exactly general question answering.
• (Weston et al. 2015): a dataset of simple (for humans) questions
that do not require any special knowledge.
• But require reasoning and understanding of semantic
structure...
31
question answering
• Sample questions:
Task 1: Single Supporting Fact
Mary went to the bathroom.
John moved to the hallway.
Mary travelled to the office.
Where is Mary?
A: office
Task 4: Two Argument Relations
The office is north of the bedroom.
The bedroom is north of the bathroom.
The kitchen is west of the garden.
What is north of the bedroom? A: office
What is the bedroom north of? A: bathroom
Task 7: Counting
Daniel picked up the football.
Daniel dropped the football.
Daniel got the milk.
Daniel took the apple.
How many objects is Daniel holding?
A: two
Task 10: Indefinite Knowledge
John is either in the classroom or the playground.
Sandra is in the garden.
Is John in the classroom?
A: maybe
Is John in the office?
A: no
Task 15: Basic Deduction
Sheep are afraid of wolves.
Cats are afraid of dogs.
Mice are afraid of cats.
Gertrude is a sheep.
What is Gertrude afraid of?
A: wolves
Task 20: Agent’s Motivations
John is hungry.
John goes to the kitchen.
John grabbed the apple there.
Daniel is hungry.
Where does Daniel go? A: kitchen
Why did John go to the kitchen? A: hungry
• One problem is that we have to remember the context set
throughout the whole question... 31
question answering
• ...so the current state of the art is given by memory networks (Weston et
al. 2014).
• An array of objects (memory) and the following components
learned during training:
I (input feature map) converts the input to the internal feature
representation;
G (generalization) updates old memories after receiving new input;
O (output feature map) produces new output given a new input and
a memory state;
R (response) converts the output of O into the output response
format (e.g., text).
31
question answering
• Dynamic memory networks (Kumar et al. 2015).
• Episodic memory unit that chooses which parts of the input to
focus on with an attention mechanism:
31
question answering
• End-to-end memory networks (Sukhbaatar et al. 2015).
• A continuous version of memory networks, with multiple hops
(computational steps) per output symbol.
• Regular memory networks require supervision on each layer;
end-to-end ones can be trained with input-output pairs:
31
question answering
• There are plenty of other extensions; one problem is how to link
QA systems with knowledge bases to answer questions that
require both reasoning and knowledge.
• Google DeepMind is also working on QA (Hermann et al. 2015):
• a CNN-based approach to QA, also tested on the same dataset;
• perhaps more importantly, a relatively simple and straightforward
way to convert unlabeled corpora to questions;
• e.g., given a newspaper article and its summary, they construct
(context, query, answer) triples that could then be used for
supervised training of text comprehension models.
• I expect a lot of exciting things to happen here.
• But allow me to suggest...
31
what? where? when?
• «What? Where? When?»: a team game of answering questions.
Sometimes it looks like this...
32
what? where? when?
• ...but usually it looks like this:
32
what? where? when?
• Teams of ≤ 6 players answer questions; whoever gets the most
correct answers wins.
• db.chgk.info – database of about 300K questions.
• Some of them come from “Своя игра”, a Jeopardy clone but
often with less direct questions:
• Modern music:
The first word in the name of THIS band actually coincides with the
surname of the sixteenth president of the USA; it was distorted so that
the band could get a domain name matching its name.
• Russia in the early 20th century:
In THIS year, 5.3 billion poods of grain were harvested in Russia.
• Black and white:
HE constantly changes black to white and back, while his neighbor is
distinguished by constancy in this matter.
32
what? where? when?
• Most are “Что? Где? Когда?” questions, even harder for
automated analysis:
• Bilberries are plain-looking and rather unassuming. Which author claimed
that he had not accidentally used one of the names of the bilberry in his book?
• In the mid-1930s, a newspaper called «Советское метро» was published in
Moscow. Which two letters did we replace in the previous sentence?
• A Russian folk omen recommends weeding on June 18. According to the
second part of the same omen, this day can be considered favorable for IT.
Name IT with a word of Latin origin.
• A seducer from a Viennese operetta believes that after IT a woman’s
unapproachability decreases fourfold. Name IT in one word.
• I believe it is a great and very challenging QA dataset.
• How far in the future do you think it is? :)
32
thank you!
Thank you for your attention!
33

Artificial Intelligence In Microbiology by Dr. Prince C P
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 

AINL 2016: Nikolenko

  • 10. word embeddings • Distributional hypothesis in linguistics: words with similar meaning will occur in similar contexts. • Distributed word representations map words to a Euclidean space (usually of dimension several hundred): • started in earnest in (Bengio et al. 2003; 2006), although there were earlier ideas; • word2vec (Mikolov et al. 2013): train weights that serve best for simple prediction tasks between a word and its context: continuous bag-of-words (CBOW) and skip-gram; • Glove (Pennington et al. 2014): train word weights to decompose the (log) cooccurrence matrix. 5
  • 11. word embeddings • Difference between skip-gram and CBOW architectures: • CBOW model predicts a word from its local context; • skip-gram model predicts context words from the current word. 5
  • 12. word embeddings • The CBOW word2vec model operates as follows: • inputs are one-hot word representations of dimension $V$; • the hidden layer is the matrix of vector embeddings $W$; • the hidden layer’s output is the average of input vectors; • as output we get a score $u_j$ for each word, and the posterior is a simple softmax: $\hat{p}(i \mid c_1, \ldots, c_n) = \exp(u_i) / \sum_{j'=1}^{V} \exp(u_{j'})$. • In skip-gram, it’s the opposite: • we predict each context word from the central word; • so now there are several multinomial distributions, one softmax for each context word: $\hat{p}(c_k \mid i) = \exp(u_{k,c_k}) / \sum_{j'=1}^{V} \exp(u_{k,j'})$. 5
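To make the CBOW computation concrete, here is a minimal NumPy sketch of the forward pass; the matrix names, dimensions, and random initialization are assumptions for illustration, not the original word2vec code.

import numpy as np

V, d = 10000, 300                      # vocabulary size and embedding dimension (assumed)
W_in = np.random.randn(V, d) * 0.01    # input embeddings (the "hidden layer" matrix)
W_out = np.random.randn(d, V) * 0.01   # output embeddings

def cbow_posterior(context_ids):
    """p(i | c_1, ..., c_n): average the context vectors, then softmax over the vocabulary."""
    h = W_in[context_ids].mean(axis=0)        # average of input vectors
    u = h @ W_out                             # scores u_j for every word j
    u -= u.max()                              # numerical stability
    return np.exp(u) / np.exp(u).sum()        # softmax posterior

probs = cbow_posterior([42, 7, 1337, 3])      # toy context word indices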
  • 13. word embeddings • How do we train a model like that? • E.g., in skip-gram we choose $\theta$ to maximize $L(\theta) = \prod_{i \in D} \prod_{c \in C(i)} p(c \mid i; \theta) = \prod_{(i,c) \in D} p(c \mid i; \theta)$, and we parameterize $p(c \mid i; \theta) = \exp(\tilde{w}_c^\top w_i) / \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i)$. 6
  • 14. word embeddings • This leads to the total (log-)likelihood $\arg\max_\theta \prod_{(i,c) \in D} p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \log p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \left( \tilde{w}_c^\top w_i - \log \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i) \right)$, which we maximize with negative sampling. • Question: why do we need separate $\tilde{w}$ and $w$ vectors? • Live demo: nearest neighbors, simple geometric relations. 6
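As an illustration of the negative sampling surrogate, here is a small NumPy sketch scoring a single (word, context) pair against a few sampled negatives; the vector names and the number of negatives are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w_i, w_c, negatives):
    """-log sigma(w_c . w_i) - sum over negatives of log sigma(-w_n . w_i):
    the standard negative-sampling surrogate for the full softmax."""
    loss = -np.log(sigmoid(w_c @ w_i))
    for w_n in negatives:
        loss -= np.log(sigmoid(-w_n @ w_i))
    return loss

d = 100
w_i = np.random.randn(d)                       # center word vector w
w_c = np.random.randn(d)                       # context word vector from the separate w~ table
negs = [np.random.randn(d) for _ in range(5)]  # k = 5 sampled negative contexts
print(neg_sampling_loss(w_i, w_c, negs))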
  • 15. how to use word vectors • Next we can use recurrent architectures on top of word vectors. • E.g., LSTMs for sentiment analysis: • Train a network of LSTMs for language modeling, then use either the last output or averaged hidden states for sentiment. • We will see a lot of other architectures later. 7
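A minimal PyTorch sketch of this "word vectors + LSTM + classifier" pipeline could look as follows; the layer sizes and mean-pooling over hidden states are assumptions, not the exact architecture from the slide.

import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=256, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # can be initialized with word2vec/GloVe
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))          # (batch, seq_len, hidden)
        return self.out(h.mean(dim=1))                 # average hidden states, then classify

model = LSTMSentiment()
logits = model(torch.randint(0, 20000, (4, 17)))       # toy batch of 4 sentences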
  • 16. up and down from word embeddings • Word embeddings are the first step of most DL models in NLP. • But we can go both up and down from word embeddings. • First, a sentence is not necessarily the sum of its words. • Second, a word is not quite as atomic as the word2vec model would like to think. 8
  • 17. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph: • a baseline in (Le and Mikolov 2014); • a reasonable method for short phrases in (Mikolov et al. 2013) • shown to be effective for document summarization in (Kageback et al. 2014). 9
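The averaging baseline is essentially a one-liner; a small sketch, assuming `embeddings` is a word-to-vector dictionary such as pretrained word2vec or GloVe vectors:

import numpy as np

def sentence_vector(tokens, embeddings, dim=300):
    """Mean of word vectors; out-of-vocabulary tokens are simply skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)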
  • 18. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014): • a sentence/paragraph vector is an additional vector for each paragraph; • acts as a “memory” to provide longer context; 9
  • 19. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Distributed Bag of Words Model of Paragraph Vectors (PV-DBOW) (Le and Mikolov 2014): • the model is forced to predict words randomly sampled from a specific paragraph; • the paragraph vector is trained to help predict words from the same paragraph in a small window. 9
  • 20. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • A number of convolutional architectures (Ma et al., 2015; Kalchbrenner et al., 2014). • (Kiros et al. 2015): skip-thought vectors capture the meanings of a sentence by training from skip-grams constructed on sentences. • (Djuric et al. 2015): model large text streams with hierarchical neural language models with a document level and a token level. 9
  • 21. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Recursive neural networks (Socher et al., 2012): • a neural network composes a chunk of text with another part in a tree; • works its way up from word vectors to the root of a parse tree. 9
  • 22. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Recursive neural networks (Socher et al., 2012): • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013). 9
  • 23. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • A similar effect can be achieved with CNNs. • Unfolding Recursive Auto-Encoder model (URAE) (Socher et al., 2011) collapses all word embeddings into a single vector following the parse tree and then reconstructs back the original sentence; applied to paraphrasing and paraphrase detection. 9
  • 24. deep recursive networks • Deep recursive networks for sentiment analysis (Irsoy, Cardie, 2014). • First idea: decouple leaves and internal nodes. • In recursive networks, we apply the same weights throughout the tree: $x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b)$. • Now, we use different matrices for leaves (input words) and hidden nodes: • we can now have fewer hidden units than the word vector dimension; • we can use ReLU: sparse inputs and dense hidden units do not cause a discrepancy. 10
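A toy sketch of this bottom-up composition over a binary parse tree; the nested-tuple tree representation, the dimensions, and the tanh nonlinearity are assumptions.

import numpy as np

d = 50
W_L, W_R, b = np.random.randn(d, d), np.random.randn(d, d), np.zeros(d)

def compose(node):
    """x_v = f(W_L x_l(v) + W_R x_r(v) + b), applied recursively up the tree."""
    if isinstance(node, np.ndarray):          # leaf: a word vector
        return node
    left, right = node
    return np.tanh(W_L @ compose(left) + W_R @ compose(right) + b)

# ((w1 w2) w3) -- a toy tree over three random "word vectors"
w1, w2, w3 = (np.random.randn(d) for _ in range(3))
root = compose(((w1, w2), w3))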
  • 25. deep recursive networks • Second idea: add depth to get hierarchical representations: ℎ(𝑖) 𝑣 = 𝑓(𝑊 (𝑖) 𝐿 ℎ (𝑖) 𝑙(𝑣) + 𝑊 (𝑖) 𝑅 ℎ (𝑖) 𝑟(𝑣) + 𝑉 (𝑖) ℎ(𝑖−1) 𝑣 + 𝑏(𝑖) ). • An excellent architecture for sentiment analysis... if you have the parse trees. 10
  • 26. character-level models • Word embeddings have important shortcomings: • vectors are independent but words are not; consider, in particular, morphology-rich languages like Russian/Ukrainian; • the same applies to out-of-vocabulary words: a word embedding cannot be extended to new words; • word embedding models may grow large; it’s just lookup, but the whole vocabulary has to be stored in memory with fast access. • E.g., “polydistributional” gets 48 results on Google, so you probably have never seen it, and there’s very little training data: • Do you have an idea what it means? Me neither. 11
  • 27. character-level models • Hence, character-level representations: • began by decomposing a word into morphemes (Luong et al. 2013; Botha and Blunsom 2014; Soricut and Och 2015); • but this adds errors, since morphological analyzers are also imperfect, and part of the problem simply shifts to training a morphology model; • two natural approaches on the character level: LSTMs and CNNs; • in any case, the character-level model is slow, but we do not have to apply it to every word: we can store embeddings of common words in a lookup table as before and only run the model for rare words – a nice natural tradeoff. 11
  • 28. character-level models • C2W (Ling et al. 2015) is based on bidirectional LSTMs: 11
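In the spirit of C2W, a word vector can be computed from its characters by a bidirectional LSTM; the following PyTorch sketch is only an approximation, and the way the two final states are combined is an assumption.

import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, n_chars=100, char_dim=32, hidden=64, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, word_dim)

    def forward(self, char_ids):                       # char_ids: (batch, word_len)
        _, (h_n, _) = self.bilstm(self.char_emb(char_ids))
        fwd, bwd = h_n[0], h_n[1]                      # final states of both directions
        return self.proj(torch.cat([fwd, bwd], dim=-1))

enc = CharWordEncoder()
word_vec = enc(torch.randint(0, 100, (1, 12)))         # one 12-character word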
  • 29. character-level models • The approach of Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • sub-word embeddings: represent a word as a bag of letter trigrams; • the vocabulary shrinks from millions of words to the set of possible letter trigrams, at most $|A|^3$ for alphabet $A$ (tens of thousands in practice), and collisions are very rare; • the representation is robust to misspellings (very important for user-generated texts). 11
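The letter-trigram "word hashing" step can be sketched in a few lines; the '#' boundary marker follows the DSSM paper, the rest is illustrative.

from collections import Counter

def letter_trigrams(word):
    """Represent a word as a sparse bag of character trigrams with boundary markers."""
    padded = "#" + word.lower() + "#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

print(letter_trigrams("good"))    # Counter({'#go': 1, 'goo': 1, 'ood': 1, 'od#': 1})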
  • 30. character-level models • ConvNet (Zhang et al. 2015): text understanding from scratch, from the level of symbols, based on CNNs. • Character-level models and their extensions appear to be very important, especially for morphology-rich languages like Russian/Ukrainian. 11
  • 31. word vectors with external information • Other modifications of word embeddings add external information. • E.g., the RC-NET model (Xu et al. 2014) extends skip-grams with relations (semantic and syntactic) and categorical knowledge (sets of synonyms, domain knowledge etc.). 12
  • 32. word vectors with external information • The basic word2vec model gets a regularizer for every relation that tries to bring it closer to a linear relation between the vectors, so that, e.g., $w_{\text{Hinton}} - w_{\text{Wimbledon}} \approx r_{\text{born at}} \approx w_{\text{Euler}} - w_{\text{Basel}}$. 12
  • 33. word sense disambiguation • Another important problem with both word vectors and char-level models: homonyms. • How do we distinguish different senses of the same word? • the model usually just chooses one meaning; • e.g., let’s check nearest neighbors for the word коса and other homonyms. • We have to add latent variables for the different meanings and infer them from context. • The meanings with latent variables can be trained by Bayesian inference, specifically stochastic variational inference (Bartunov et al., 2015). 13
  • 35. text generation with rnns • Language modeling and text generation is a natural direct application of NN-based NLP; word embeddings started as a “neural probabilistic language model” (Bengio et al., 2003). • First idea – sequence learning with RNNs/LSTMs. • Surprisingly, simple RNNs can produce quite reasonably-looking text even by training character by character, with no knowledge of the words (“The Unreasonable Effectiveness...”), including the famous example from (Sutskever et al. 2011): The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger... • This is, of course, not “true understanding” (whatever that means), only short-term memory effects. • We need to go deeper in terms of both representations and sequence modeling. 15
  • 36. text generation with rnns • One can change diversity (sampling temperature) and get different styles of absurdistic texts. Random example with seed «обещал, на рождество, но спустя семь лет. имя глав». • Low diversity gets you Lucky’s monologue from “Waiting for Godot”: обещал, на рождество, но спустя семь лет. имя главного командования в составе советского союза с 1976 года. после проведения в 1992 году в составе советского союза (1977). в 1967 году в составе советского союза состоялся в 1952 году в составе советско-финской войны 1877 - 1877 годов. в 1966 году в составе советского союза с 1965 года по 1977 год... • Mid-range diversity produces something close to meaningful text: обещал, на рождество, но спустя семь лет. имя главного рода собственно вновь образовалась в россии и народном состоянии. после присказа с постановлением союза писателей россии и генеральной диссертации о спортивном училище с 1980 года. в 1970-х годах был основан в составе комитета высшего совета театра в польши. в 1957 - 1962 годах - начальник батальона сан-аухаров... • High diversity leads to Khlebnikov’s zaum: обещал, на рождество, но спустя семь лет. имя главы философии пововпели nol- lнози - врайу-7 на луосече. человеческая восстания покторов извоенного чомпде и э. дроссенбурга, … карл уним-общекрипских. эйелем хфечак от этого списка сравнивала имущно моря в юнасториансический индристское носительских женатов в церкви испании.... 16
  • 37. poroshok • And here are some poroshki generated with LSTMs from a relatively small dataset: заходит к солнцу отдаётесь что он летел а может быть и вовсе не веду на стенке на поле пять и новый год и почему то по башке в квартире голуби и боли и повзрослел и умирать страшней всего когда ты выпил без показания зонта однажды я тебя не вышло и ты я захожу в макдоналисту надену отраженный дождь под ужин почему местами и вдруг подставил человек ты мне привычно верил крышу до дна я подползает под кроватью чтоб он исписанный пингвин и ты мне больше никогда но мы же после русских классик барто солдаты для любви 17
  • 38. modern char-based language model: kim et al., 2015 18
  • 39. dssm • A general approach to NLP based on CNNs is given by Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • one-hot target vectors for classification (speech recognition, image recognition, language modeling). 19
  • 40. dssm • A general approach to NLP based on CNNs is given by Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • vector-valued targets for semantic matching. 19
  • 41. dssm • A general approach to NLP based on CNNs is given by Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • can capture different targets (one-hot, vector); • to train with vector targets – reflection: bring source and target vectors closer. 19
  • 42. dssm • A general approach to NLP based on CNNs is given by Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • DSSMs can be applied in a number of different contexts when we can specify a supervised dataset: • semantic word embeddings: word by context; • web search: web documents by query; • question answering: knowledge base relation/entity by pattern; • recommendations: interesting documents by read/liked documents; • translation: target sentence by source sentence; • text/image: labels by images or vice versa. • Basically, this is an example of a general architecture that can be trained to do almost anything. 19
  • 43. dssm • A general approach to NLP based on CNNs is given by Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): • Deep Structured Semantic Models (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b): a deep convolutional architecture trained on similar text pairs. • Can be used for information retrieval: model relevance by bringing relevant documents closer to their queries (both document and query go through the same convolutional architecture). • November 2, 2016: a post by Yandex saying that they use (modified) DSSM in their new Palekh search algorithm. 19
  • 44. dqn • (Guo, 2015): generating text with deep reinforcement learning. • Begin with easy parts, then iteratively decode the hard parts with DQN. • Next, we proceed to specific NLP problems that have led to interesting developments. 20
  • 46. dependency parsing • We mentioned parse trees; but how do we construct them? • Current state of the art – continuous-state parsing: current state is encoded in $\mathbb{R}^d$. • Stack LSTMs (Dyer et al., 2015) – the parser manipulates three basic data structures: (1) a buffer $B$ that contains the sequence of words, with state $b_t$; (2) a stack $S$ that stores partially constructed parses, with state $s_t$; (3) a list $A$ of actions already taken by the parser, with state $a_t$. • $b_t$, $s_t$, and $a_t$ are hidden states of stack LSTMs, LSTMs that have a stack pointer: new inputs are added from the right, but the current location of the stack pointer shows which cell’s state is used to compute new memory cell contents. 22
  • 47. dependency parsing with morphology • Important extension – (Ballesteros et al., 2015): • in morphologically rich natural languages, we have to take into account morphology; • so they represent the words by bidirectional character-level LSTMs; • report improved results in Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish, and Turkish; • this direction probably can be further improved (and where’s Russian or Ukrainian in the list above?..). 23
  • 48. evaluation for sequence-to-sequence models • Next we will consider specific models for machine translation, dialog models, and question answering. • But how do we evaluate NLP models that produce text? • Quality metrics for comparing with reference sentences produced by humans: • BLEU (Bilingual Evaluation Understudy): reweighted precision (incl. multiple reference translations); • METEOR: harmonic mean of unigram precision and unigram recall; • TER (Translation Edit Rate): number of edits between the output and reference divided by the average number of reference words; • LEPOR: combine basic factors and language metrics with tunable parameters. • The same metrics apply to paraphrasing and, generally, all problems where the (supervised) answer should be a free-form text. 24
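As a reminder of what these metrics compute, here is a toy sketch of clipped (modified) n-gram precision, the core ingredient of BLEU; real BLEU additionally combines several n-gram orders, multiple references, and a brevity penalty.

from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

print(modified_precision("the cat sat on the mat".split(),
                         "the cat is on the mat".split()))   # 5/6 unigram precision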
  • 50. machine translation • Translation is a very convenient problem for modern NLP: • on one hand, it is very practical, obviously important; • on the other hand, it’s very high-level, virtually impossible without deep understanding, so if we do well on translation, we probably do something right about understanding; • on the third hand (oops), it’s quantifiable (BLEU, TER etc.) and has relatively large available datasets (parallel corpora). 26
  • 51. machine translation • Statistical machine translation (SMT): model the conditional probability $p(y \mid x)$ of target $y$ (translation) given source $x$ (text). • Classical SMT: model $\log p(y \mid x)$ with a linear combination of features and then construct these features. • NNs have been used both for reranking $n$-best lists of possible translations and as part of the feature functions: 26
  • 52. machine translation • NNs are still used for feature engineering with state of the art results, but here we are more interested in sequence-to-sequence modeling. • Basic idea: • RNNs can be naturally used to probabilistically model a sequence $X = (x_1, x_2, \ldots, x_T)$ via $p(x_1)$, $p(x_2 \mid x_1)$, ..., $p(x_T \mid x_{<T}) = p(x_T \mid x_{T-1}, \ldots, x_1)$, and then the joint probability $p(X)$ is just their product $p(X) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_k \mid x_{<k}) \cdots p(x_T \mid x_{<T})$; • this is how RNNs are used for language modeling; • we predict the next word based on the hidden state learned from all previous parts of the sequence; • in translation, maybe we can learn the hidden state from one sentence and apply it to another. 26
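This chain-rule factorization is exactly what an RNN language model computes; a small PyTorch sketch scoring a sequence follows (the GRU-based model here is a toy stand-in, not any particular published architecture).

import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def log_prob(self, ids):                        # ids: (1, T)
        h, _ = self.rnn(self.emb(ids[:, :-1]))      # predict each token from its prefix
        logp = torch.log_softmax(self.out(h), dim=-1)
        return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum()

lm = RNNLM()
x = torch.randint(0, 1000, (1, 10))
print(lm.log_prob(x))                               # log p(x_2, ..., x_T | x_1)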
  • 53. machine translation • Direct application – bidirectional LSTMs (Bahdanau et al. 2014): • But do we really translate word by word? 26
  • 54. machine translation • No, we first understand the whole sentence; hence encoder-decoder architectures (Cho et al. 2014): 26
  • 55. machine translation • But compressing the entire sentence to a fixed-dimensional vector is hard; quality drops dramatically with length. • Soft attention (Luong et al. 2015a; 2015b; Jean et al. 2015): • encoder RNNs are bidirectional, so at every word we have a “focused” representation with both contexts; • attention NN takes the state and local representation and outputs a relevance score – should we translate this word right now? 26
  • 56. machine translation • Soft attention (Luong et al. 2015a; 2015b; Jean et al. 2015): • formally very simple: we compute attention weights $\alpha_{ij}$ and reweigh context vectors with them: $e_{ij} = a(s_{i-1}, h_j)$, $\alpha_{ij} = \exp(e_{ij}) / \sum_{j'} \exp(e_{ij'})$, $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$, and now $s_i = f(s_{i-1}, y_{i-1}, c_i)$. 26
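A NumPy sketch of this attention step; the bilinear scoring function below is a simplifying assumption (Bahdanau-style attention uses a small feedforward network for $a(s_{i-1}, h_j)$).

import numpy as np

def attention_context(s_prev, H, W_a):
    """Score each encoder state against the previous decoder state, softmax, and mix.
    H: (T_x, d) encoder states; W_a: toy bilinear scoring matrix (an assumption)."""
    e = H @ (W_a @ s_prev)                  # scores e_ij for all positions j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                    # attention weights, sum to 1
    return alpha @ H, alpha                 # context vector c_i and the weights

T_x, d = 7, 16
c_i, alpha = attention_context(np.random.randn(d),
                               np.random.randn(T_x, d),
                               np.random.randn(d, d))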
  • 57. machine translation • We get better word order in the sentence as a whole: • Attention is an “old” idea (Larochelle, Hinton, 2010) and can be applied far beyond translation, e.g., to image processing and speech recognition, and to other sequence-based NLP tasks: • syntactic parsing (Vinyals et al. 2014), • modeling pairs of sentences (Yin et al. 2015), • question answering (Hermann et al. 2015), • “Show, Attend, and Tell” (Xu et al. 2015). 26
  • 58. google translate • Sep 26, 2016: Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation: • this very recent paper shows how Google Translate actually works; • the basic architecture is the same: encoder, decoder, attention; • RNNs have to be deep enough to capture language irregularities, so 8 layers for encoder and decoder each: 27
  • 59. google translate • Sep 26, 2016: Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation: • but stacking LSTMs does not really work: 4-5 layers are OK, 8 layers don’t work; • so they add residual connections between the layers, similar to (He, 2015): 27
  • 60. google translate • Sep 26, 2016: Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation: • and it makes sense to make the bottom layer bidirectional in order to capture as much context as possible: 27
  • 61. google translate • Sep 26, 2016: Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation: • GNMT also uses two ideas for word segmentation: • wordpiece model: break words into wordpieces (with a separate model); example from the paper: Jet makers feud over seat width with big orders at stake becomes _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake • mixed word/character model: use word model but for out-of-vocabulary words convert them into characters (specifically marked so that they cannot be confused); example from the paper: Miki becomes <B>M <M>i <M>k <E>i 27
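To illustrate the wordpiece idea, here is a toy greedy longest-match segmenter over a fixed subword vocabulary; the real wordpiece model in GNMT learns its vocabulary from data, so this is only an assumption-laden sketch.

def segment(word, vocab):
    """Greedy longest-match segmentation; '_' marks a word-initial piece,
    as in the GNMT example above."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):            # try the longest piece first
            piece = ("_" if i == 0 else "") + word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            return ["_<unk>"]                         # fall back for uncoverable words
    return pieces

vocab = {"_fe", "ud", "_makers", "_mak", "ers"}
print(segment("feud", vocab))                         # ['_fe', 'ud']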
  • 63. dialog and conversational models • Dialog models attempt to model and predict dialogue; conversational models actively talk to a human. • Applications – automatic chat systems for business etc., so we want to convey information. • Vinyals and Le (2015) use seq2seq (Sutskever et al. 2014): • feed previous sentences ABC as context to the RNN; • predict the next word of reply WXYZ based on the previous word and hidden state. • They get a reasonable conversational model, both general (MovieSubtitles) and in a specific domain (IT helpdesk). 29
  • 64. dialog and conversational models • Hierarchical recurrent encoder decoder architecture (HRED); first proposed for query suggestion in IR (Sordoni et al. 2015), used for dialog systems in (Serban et al. 2015). • HRED views the dialogue as a two-level system: a sequence of utterances, each of which is in turn a sequence of words. To model this two-level system, HRED trains: (1) an encoder RNN that maps each utterance in a dialogue into a single utterance vector; (2) a context RNN that processes all previous utterance vectors and combines them into the current context vector; (3) a decoder RNN that predicts the tokens in the next utterance, one at a time, conditional on the context RNN. 29
  • 65. dialog and conversational models • HRED architecture: • (Serban et al. 2015) report promising results in terms of both language models (perplexity) and expert evaluation. 29
  • 66. dialog and conversational models • Some recent developments: • (Li et al., 2016a) apply, again, reinforcement learning (DQN) to improve dialogue generation; • (Li et al., 2016b) add personas with latent variables, so dialogue can be more consistent (yes, it’s the same Li); • (Wen et al., 2016) use snapshot learning, adding some weak supervision in the form of particular events occurring in the output sequence (whether we still want to say something or have already said it); • (Su et al., 2016) improve dialogue systems with online active reward learning, a tool from reinforcement learning. • Generally, chatbots are becoming commonplace but it is still a long way to go before actual general-purpose dialogue. 29
  • 68. question answering • Question answering (QA) is one of the hardest NLP challenges, close to true language understanding. • Let us begin with evaluation: • it’s easy to find datasets for information retrieval; • these questions can be answered with knowledge base approaches: map questions to logical queries over a graph of facts; • in a multiple choice setting (Quiz Bowl), map the question and possible answers to a semantic space and find nearest neighbors (Socher et al. 2014); • but this is not exactly general question answering. • (Weston et al. 2015): a dataset of simple (for humans) questions that do not require any special knowledge. • But they do require reasoning and understanding of semantic structure... 31
  • 69. question answering • Sample questions: Task 1: Single Supporting Fact Mary went to the bathroom. John moved to the hallway. Mary travelled to the office. Where is Mary? A: office Task 4: Two Argument Relations The office is north of the bedroom. The bedroom is north of the bathroom. The kitchen is west of the garden. What is north of the bedroom? A: office What is the bedroom north of? A: bathroom Task 7: Counting Daniel picked up the football. Daniel dropped the football. Daniel got the milk. Daniel took the apple. How many objects is Daniel holding? A: two Task 10: Indefinite Knowledge John is either in the classroom or the playground. Sandra is in the garden. Is John in the classroom? A: maybe Is John in the office? A: no Task 15: Basic Deduction Sheep are afraid of wolves. Cats are afraid of dogs. Mice are afraid of cats. Gertrude is a sheep. What is Gertrude afraid of? A: wolves Task 20: Agent’s Motivations John is hungry. John goes to the kitchen. John grabbed the apple there. Daniel is hungry. Where does Daniel go? A: kitchen Why did John go to the kitchen? A: hungry • One problem is that we have to remember the context set throughout the whole question... 31
  • 70. question answering • ...so the current state of the art are memory networks (Weston et al. 2014). • An array of objects (memory) and the following components learned during training: I (input feature map) converts the input to the internal feature representation; G (generalization) updates old memories after receiving new input; O (output feature map) produces new output given a new input and a memory state; R (response) converts the output of O into the output response format (e.g., text). 31
  • 71. question answering • Dynamic memory networks (Kumar et al. 2015). • Episodic memory unit that chooses which parts of the input to focus on with an attention mechanism: 31
  • 72. question answering • End-to-end memory networks (Sukhbaatar et al. 2015). • A continuous version of memory networks, with multiple hops (computational steps) per output symbol. • Regular memory networks require supervision on each layer; end-to-end ones can be trained with input-output pairs: 31
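One "hop" of an end-to-end memory network boils down to attention of the query over memory slots followed by a weighted read; a NumPy sketch, where the matrix names and the residual update are assumptions (the published model also uses separate learned embedding matrices for input, memory, and output).

import numpy as np

def memory_hop(q, memories_in, memories_out):
    """q: (d,) query; memories_in/out: (N, d) input and output memory embeddings."""
    scores = memories_in @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # attention over memory slots
    o = p @ memories_out                  # response vector read from memory
    return q + o                          # next-hop query (often with an extra linear map)

d, N = 32, 10
q2 = memory_hop(np.random.randn(d), np.random.randn(N, d), np.random.randn(N, d))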
  • 73. question answering • There are plenty of other extensions; one problem is how to link QA systems with knowledge bases to answer questions that require both reasoning and knowledge. • Google DeepMind is also working on QA (Hermann et al. 2015): • a CNN-based approach to QA, also tested on the same dataset; • perhaps more importantly, a relatively simple and straightforward way to convert unlabeled corpora to questions; • e.g., given a newspaper article and its summary, they construct (context, query, answer) triples that could then be used for supervised training of text comprehension models. • I expect a lot of exciting things to happen here. • But allow me to suggest... 31
  • 74. what? where? when? • «What? Where? When?»: a team game of answering questions. Sometimes it looks like this... 32
  • 75. what? where? when? • ...but usually it looks like this: 32
  • 76. what? where? when? • Teams of ≤ 6 players answer questions, whoever gets the most correct answers wins. • db.chgk.info – database of about 300K questions. • Some of them come from “Своя игра”, a Jeopardy clone but often with less direct questions: • Современная музыка На самом деле первое слово в названии ЭТОГО коллектива совпадает с фамилией шестнадцатого президента США, а исказили его для того, чтобы приобрести соответствующее названию доменное имя. • Россия в начале XX века В ЭТОМ году в России было собрано 5,3 миллиарда пудов зерновых. • Чёрное и белое ОН постоянно меняет черное на белое и наоборот, а его соседа в этом вопросе отличает постоянство. 32
  • 77. what? where? when? • Most are “Что? Где? Когда?” questions, even harder for automated analysis: • Ягоды черники невзрачные и довольно простецкие. Какой автор утверждал, что не случайно использовал в своей книге одно из названий черники? • В середине тридцатых годов в Москве выходила газета под названием «Советское метро». Какие две буквы мы заменили в предыдущем предложении? • Русская примета рекомендует 18 июня полоть сорняки. Согласно второй части той же приметы, этот день можно считать благоприятным для НЕЁ. Назовите ЕЁ словом латинского происхождения. • Соблазнитель из венской оперетты считает, что после НЕГО женская неприступность уменьшается вчетверо. Назовите ЕГО одним словом. • I believe it is a great and very challenging QA dataset. • How far in the future do you think it is? :) 32
  • 78. thank you! Thank you for your attention! 33