Supervised Learning of Universal Sentence Representations
from Natural Language Inference Data
2017/11/7 B4 Hiroki Shimanaka
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes
EMNLP 2017
Abstract & Introduction (1)
Many modern NLP systems rely on word embeddings, previously
trained in an unsupervised manner on large corpora, as base features.
Word embeddings
- Word2Vec (Mikolov et al., 2013)
- GloVe (Pennington et al., 2014)
Efforts to obtain embeddings for larger chunks of text, such as
sentences, have however not been so successful.
Unsupervised sentence embeddings
- SkipThought (Kiros et al., 2015)
- FastSent (Hill et al., 2016)
Abstract & Introduction (2)
In this paper, they show how universal sentence representations can be
trained using the supervised data of the Stanford Natural Language
Inference (SNLI) dataset.
These representations consistently outperform unsupervised methods like
SkipThought vectors (Kiros et al., 2015) on a wide range of transfer tasks.
Approach
This work combines two research directions, which they describe in
what follows.
First, they explain how the NLI task can be used to train universal sentence
encoding models using the SNLI task.
Second, they describe the architectures they investigated for the sentence
encoder, which, in their opinion, cover a suitable range of sentence
encoders currently in use.
SNLI (Stanford Natural Language Inference) dataset
The SNLI dataset consists of 570k human-generated English sentence
pairs, manually labeled with one of three categories: entailment,
contradiction, and neutral.
https://nlp.stanford.edu/projects/snli/
The Natural Language Inference task
Once the sentence vectors u and v are generated, three
matching methods are applied to extract
relations between u and v:
(i) concatenation of the two representations (u, v)
(ii) element-wise product u ∗ v
(iii) absolute element-wise difference |u − v|
The resulting vector, which captures
information from both the premise and the
hypothesis, is fed into a 3-class classifier
consisting of multiple fully-connected layers
culminating in a softmax layer.
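The three matching methods above can be sketched in plain Python. Here `u` and `v` stand for the premise and hypothesis sentence vectors; the MLP classifier itself is omitted, and the toy values are only illustrative:

```python
def match_features(u, v):
    """Combine premise vector u and hypothesis vector v into the
    feature vector fed to the 3-class classifier: concatenation,
    element-wise product, and absolute element-wise difference."""
    assert len(u) == len(v)
    product = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + product + abs_diff  # dimension is 4 * len(u)

features = match_features([1.0, 2.0], [3.0, -1.0])
# features = [1.0, 2.0, 3.0, -1.0, 3.0, -2.0, 2.0, 3.0]
```

Note that only the product and the absolute difference are symmetric in u and v; the concatenation preserves which sentence is the premise.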
Sentence encoder architectures (1)
LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014)
A sentence is represented by the last hidden vector.
BiGRU-last
It concatenates the last hidden state of a forward GRU and the last hidden
state of a backward GRU.
Sentence encoder architectures (2)
BiLSTM with mean/max pooling
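A minimal sketch of the pooling step, assuming the BiLSTM has already produced one hidden vector per word (the recurrent computation itself is omitted, and the values are toy numbers):

```python
def max_pool(hidden_states):
    """Element-wise max over the per-word BiLSTM hidden vectors:
    each dimension of the sentence vector u keeps its largest
    value across time steps."""
    return [max(dims) for dims in zip(*hidden_states)]

def mean_pool(hidden_states):
    """Element-wise mean over the per-word hidden vectors."""
    n = len(hidden_states)
    return [sum(dims) / n for dims in zip(*hidden_states)]

# Three words, 4-dimensional hidden vectors (toy values).
h = [[0.1, -0.2, 0.5, 0.0],
     [0.4,  0.3, -0.1, 0.2],
     [-0.3, 0.1, 0.2, 0.6]]
u_max = max_pool(h)  # [0.4, 0.3, 0.5, 0.6]
```

Max pooling lets different words dominate different dimensions of u, which is one intuition for why it transfers well.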
Sentence encoder architectures (3)
It uses an attention mechanism
over the hidden states of a
BiLSTM to generate a
representation u of an input
sentence.
- self-attentive sentence encoder (Liu et al., 2016; Lin et al., 2017)
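A toy sketch of the attention step, assuming the BiLSTM hidden states are given and using a single hypothetical query vector (the actual encoders use learned parameters and, in Lin et al., multiple attention heads):

```python
import math

def attend(hidden_states, query):
    """Self-attentive pooling: score each hidden state against a
    query vector, softmax the scores over time steps, and return
    the attention-weighted sum as the sentence vector u."""
    scores = [sum(hi * qi for hi, qi in zip(h, query))
              for h in hidden_states]
    max_s = max(scores)                      # for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

h = [[1.0, 0.0], [0.0, 1.0]]       # two one-hot "hidden states"
u = attend(h, query=[2.0, 0.0])    # weights favour the first state
```

Unlike max pooling, the weights here are a learned, input-dependent soft selection over time steps rather than a hard per-dimension maximum.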
Sentence encoder architectures (4)
Hierarchical ConvNet
It is similar to AdaSent (Zhao et al., 2015).
The final representation u =
[u1, u2, u3, u4] concatenates
representations at different levels of
the input sentence.
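The final concatenation can be sketched as follows, assuming each convolutional level has already been max-pooled into a vector u_i (the convolutions themselves are omitted; the values are toy numbers):

```python
def hierarchical_repr(level_vectors):
    """Concatenate the pooled per-level vectors u1..u4 into the
    final sentence representation u = [u1, u2, u3, u4], so that u
    mixes shallow and deep views of the input sentence."""
    u = []
    for ui in level_vectors:
        u.extend(ui)
    return u

# Four levels, each max-pooled to a 2-dimensional vector (toy values).
u = hierarchical_repr([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
# len(u) == 8
```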
Training details
For all their models trained on SNLI, they use SGD with a learning rate
of 0.1 and a weight decay of 0.99.
At each epoch, they divide the learning rate by 5 if the dev accuracy
decreases.
They use mini-batches of size 64, and training is stopped when the
learning rate goes under the threshold of 10^-5.
For the classifier, they use a multi-layer perceptron with one hidden layer
of 512 hidden units.
They use open-source GloVe vectors trained on Common Crawl 840B
with 300 dimensions as fixed word embeddings.
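The learning-rate schedule described above can be sketched as follows; `dev_accuracies` is a hypothetical list of per-epoch SNLI dev results, and the optimizer step itself is omitted:

```python
def run_schedule(dev_accuracies, lr=0.1, min_lr=1e-5, decay=5.0):
    """Replay the paper's schedule: divide the learning rate by 5
    whenever dev accuracy decreases, and stop once the learning
    rate drops below the 1e-5 threshold. Returns the learning
    rate in effect at each completed epoch."""
    history = []
    prev_acc = None
    for acc in dev_accuracies:
        if prev_acc is not None and acc < prev_acc:
            lr /= decay
        if lr < min_lr:
            break  # training stops here
        history.append(lr)
        prev_acc = acc
    return history

lrs = run_schedule([0.70, 0.75, 0.74, 0.76, 0.75])
# lrs ≈ [0.1, 0.1, 0.02, 0.02, 0.004]
```

This aggressive decay-on-plateau rule is what makes the stopping criterion (lr below 10^-5) terminate training quickly.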
Evaluation of sentence representations (1)
Binary and multi-class classification
sentiment analysis (MR, SST)
question-type (TREC)
product reviews (CR)
subjectivity/objectivity (SUBJ)
opinion polarity (MPQA)
Entailment and semantic relatedness
They also evaluate on the SICK dataset for both entailment (SICK-E) and
semantic relatedness (SICK-R).
Semantic Textual Similarity (STS14 (Agirre et al., 2014)).
Evaluation of sentence representations (2)
Paraphrase detection
Sentence pairs have been human-annotated according to whether they
capture a paraphrase/semantic equivalence relationship.
Caption-Image retrieval
The caption-image retrieval task evaluates joint image and language feature
models (Hodosh et al., 2013; Lin et al., 2014).
The goal is either to rank a large collection of images by their relevance with
respect to a given query caption (Image Retrieval), or to rank captions by
their relevance for a given query image (Caption Retrieval).
Result (1)
Result (2)
Result (3)
Conclusion
This paper studies the effects of training sentence embeddings with
supervised data by testing on 12 different transfer tasks.
They showed that models learned on NLI can perform better than
models trained in unsupervised conditions or on other supervised tasks.
They showed that a BiLSTM network with max pooling yields the best
current universal sentence encoding method, outperforming existing
approaches like SkipThought vectors.
Reference
Supervised Learning of Universal Sentence Representations from
Natural Language Inference Data, Conneau et al., EMNLP 2017


Editor's Notes

  • #3 Word embeddings have proven useful in many NLP tasks, but they do not easily yield vectors for whole sentences. Other tools for generating sentence vectors exist, but they have not achieved very good results either.
  • #4 This paper therefore shows how to learn sentence vectors using the supervised SNLI dataset. It also shows that this method achieves better results on a variety of tasks than unsupervised approaches such as SkipThought.
  • #5 This method combines two research directions: first, how the NLI task on the SNLI dataset can be used for universal sentence encoding; second, which sentence encoder is best suited to this task.
  • #6 Entailing, contradicting, and neutral sentences were collected via crowdsourcing based on Flickr30k captions.
  • #7 Three-way classification with a 512-dimensional hidden layer and a softmax layer.
  • #13 Classification tasks; semantic relatedness.
  • #14 Paraphrase recognition; caption and image retrieval.
  • #16 SkipThought-LN, the best previous sentence encoder, required 64M sentences and one month of training; the proposed method uses 57k sentences and one day.
  • #18 This paper studies the effect of training sentence embeddings with supervised data by testing on 12 different transfer tasks. They showed that models learned on NLI can perform better than models trained unsupervised or on other supervised tasks. They showed that a BiLSTM network with max pooling yields the best current universal sentence encoder, outperforming existing approaches such as SkipThought vectors.