DCNN for text
B01902004 蔡捷恩
A CNN for Modelling Sentences
Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. "A convolutional neural network for modelling sentences." arXiv:1404.2188 (2014).
Sentence model
• Sentence -> feature vector, that’s all !
• However, it is at the core of:
• Sentiment analysis, paraphrase detection,
entailment recognition, summarisation,
discourse analysis, machine translation,
grounded language learning, image retrieval …
Contribution
• Does not rely on a parse tree
• Easily applicable to any language?
How to model a sentence?
• Composition-based methods: compose word-meaning vectors into a sentence-meaning vector
  – Need human knowledge to compose, or
  – Automatically extracted logical forms
• Neural examples: RNN, TDNN
Brief network structure
• Interleaving k-max pooling and 1-dim convolution (TDNN-style) => generates a feature graph over the sentence
• A kind of syntax tree?
NN sentence model with a syntax tree (Recursive NN, RecNN)
• Refers to a syntax tree while training
• Weights are shared and stacked up to form the network
RNN for sentence model
Linear “structure”
Back to DCNN
• Convolution
• TDNN
• K-max pooling (dynamic k-max pooling)
Convolution
Narrow type, win = 5
Wide type, win = 5 (zero-padding)
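A minimal numpy sketch of the two convolution types over a single feature row (the toy sentence, the uniform filter and the helper name conv1d are illustrative, not from the paper):

```python
import numpy as np

def conv1d(row, filt, mode):
    """1-d convolution of one feature row with a filter of width win.
    narrow: len(row) - win + 1 outputs; wide: len(row) + win - 1 (zero-padded)."""
    win = len(filt)
    if mode == "wide":
        row = np.pad(row, win - 1)          # win-1 zeros on each side
    return np.array([np.dot(row[i:i + win], filt)
                     for i in range(len(row) - win + 1)])

row = np.arange(1., 8.)                     # a 7-word sentence, one feature dimension
filt = np.ones(5)                           # win = 5
print(conv1d(row, filt, "narrow").shape)    # (3,)  = 7 - 5 + 1
print(conv1d(row, filt, "wide").shape)      # (11,) = 7 + 5 - 1
```

The wide type guarantees that every filter weight reaches every word, including those at the margins.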
Max-TDNN
GOAL: recognize features independent of time-shift
(i.e. sequence position)
Take a look at DCNN
Need to be optimized during training
If we use Max-TDNN
K-max pooling
• Given k, no matter how long the input sequence is, pool the top-k values as output; "the order of the outputs corresponds to their order in the input" (see the sketch below)
• Better than max-TDNN because it:
  – Preserves the order of features
  – Discerns more finely how often a feature is highly activated
• Guarantees that the length of the input to the FC layer is independent of the sentence length
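A small numpy sketch of plain k-max pooling (function name and toy values are mine), next to the single value max-TDNN would keep:

```python
import numpy as np

def k_max_pooling(row, k):
    """Keep the k largest activations of a feature row, preserving their order."""
    if len(row) <= k:
        return row
    top_idx = np.sort(np.argpartition(row, -k)[-k:])   # positions of the k largest
    return row[top_idx]

row = np.array([0.2, 1.7, 0.3, 2.5, 0.1, 1.1])
print(k_max_pooling(row, 3))   # [1.7 2.5 1.1] -- order of appearance is kept
print(row.max())               # 2.5           -- max-TDNN keeps a single value
```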
Only the fully connected layer needs a fixed length
• Intermediate layers can be more flexible
• Dynamic k-max pooling!
Dynamic k-max Pooling
• k is a function of the input sentence length and the depth of the network:

  k_l = max( k_top , ⌈ (L - l) / L · s ⌉ )

  where
  – k_l: the k of the layer currently considered (conv. layer l)
  – k_top: the fixed k of the k-max pooling at the top
  – L: total # of conv. layers in the network (the depth)
  – s: input sentence length
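A direct transcription of the formula, with illustrative numbers (an 18-word sentence, L = 3 conv layers, k_top = 3):

```python
import math

def dynamic_k(l, L, s, k_top):
    """k for the pooling after conv layer l (1-indexed), given L conv layers,
    sentence length s and the fixed k_top used at the topmost pooling layer."""
    return max(k_top, math.ceil((L - l) / L * s))

# e.g. an 18-word sentence, L = 3 conv layers, k_top = 3
print([dynamic_k(l, 3, 18, 3) for l in (1, 2)])   # [12, 6]; the top layer uses k_top = 3
```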
Folding
• Feature detectors in different rows are
independent of each other until the top fully
connected layer
• Fix: simply sum every two adjacent rows (component-wise vector sum); see the sketch below
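A one-line sketch of folding under the assumption that the feature map is stored as a (d, width) array: every pair of adjacent rows is summed, so d rows become d/2 and the two dimensions in each pair are no longer independent:

```python
import numpy as np

def fold(feature_map):
    """Sum rows 0+1, 2+3, ... of a (d, width) feature map, giving (d/2, width)."""
    return feature_map[0::2] + feature_map[1::2]

fm = np.arange(12.).reshape(4, 3)    # 4 feature rows of width 3
print(fold(fm))                      # 2 rows: row0+row1 and row2+row3
```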
Properties
• Sensitive to the order of words
• Filters of the first layer model n-grams, n ≤ m
• Invariance to absolute position is captured by the upper convolutional layers
• Induces an internal feature graph over the sentence
Experiments
Sentiment analysis
Stanford Sentiment Treebank
Movie reviews, 5 sentiment classes and binary +/- labels
Experiments
Question type prediction
on TREC
Experiments
Twitter sentiment dataset, binary label
Experiments
• Visualizing feature detectors
Think about it
• Can this kind of k-max pooling be applied to image tasks?
A CNN for matching natural language sentences
Hu, Baotian, et al. "Convolutional neural network architectures for matching natural language sentences." Advances in Neural Information Processing Systems. 2014
Why a convolutional approach
• No prior knowledge needed
Contribution
• Hierarchical sentence modeling
• Captures rich matching patterns at different levels of abstraction
Convolutional Sentence Modeling
• Word2vec pre-trained word embeddings
• Max pooling over windows of 2
• Fixed input length
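A rough sketch of one such building block, with made-up dimensions (50-dim word vectors, 100 feature maps, window of 3) rather than the paper's settings: convolve over windows of concatenated word vectors, then max-pool every two neighbouring positions:

```python
import numpy as np

def conv_block(sent_vecs, W, win=3):
    """One convolution + 2-window max-pooling step over a list of word vectors."""
    feats = np.array([np.maximum(0.0, W @ np.concatenate(sent_vecs[i:i + win]))
                      for i in range(len(sent_vecs) - win + 1)])
    return np.array([feats[i:i + 2].max(axis=0)          # pool pairs of positions
                     for i in range(0, len(feats) - 1, 2)])

sent = [np.random.randn(50) for _ in range(10)]   # 10 words, 50-dim pre-trained vectors
W = np.random.randn(100, 3 * 50)                  # 100 feature maps, window of 3 words
print(conv_block(sent, W).shape)                  # (4, 100)
```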
A trick on zero-padding
• Sentence lengths can vary over a fairly broad range, so inputs are zero-padded to a fixed length
• Introduce a gate operation: g(z) = 0 (the zero vector) when z = 0, otherwise 1
• No bias term! Together with max pooling, the padding then has no effect (see the sketch below)
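A minimal sketch of the gating idea, assuming a ReLU convolution unit (the exact activation and shapes are placeholders): a padded, all-zero window is multiplied by g = 0, and with no bias it can never win the later max pooling:

```python
import numpy as np

def gated_unit(z, W):
    """Convolution of one window z with weights W: no bias, gated by g(z)."""
    g = 0.0 if not z.any() else 1.0          # g(z) = 0 iff z is the all-zero vector
    return g * np.maximum(0.0, W @ z)

W = np.random.randn(4, 6)
print(gated_unit(np.zeros(6), W))            # all zeros -> padding is invisible
print(gated_unit(np.ones(6), W))             # real windows still produce features
```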
Conv + Max pool
Composition
RNN vs ConvNet
(W = wins, L = loses)

                                ConvNet   RNN
  Hierarchical structure           W       L
  Parallelism                      W       L
  Capture far-away information     -       -
  Explainable                      W       L
  Variety                          L       W
Architecture-I
• Drawback: in the forward phase, the representation of each sentence is built without knowledge of the other
Architecture-II
• Build directly on the interaction space between 2 sentences
• From 1D to 2D convolution
Good trick at pooling
2D max-pooling
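A sketch of how the first ARC-II layer builds that interaction space, as I read the paper (dimensions and weights are illustrative): entry (i, j) of the 2-D map is computed from window i of sentence x together with window j of sentence y, and the later layers apply 2-D convolution and 2×2 max pooling on top of it:

```python
import numpy as np

def interaction_layer(sx, sy, W, win=3):
    """Build the 2-D matching feature map between two sentences (lists of word vectors)."""
    nx, ny = len(sx) - win + 1, len(sy) - win + 1
    fmap = np.zeros((nx, ny, W.shape[0]))
    for i in range(nx):
        for j in range(ny):
            z = np.concatenate(sx[i:i + win] + sy[j:j + win])   # both windows together
            fmap[i, j] = np.maximum(0.0, W @ z)
    return fmap

sx = [np.random.randn(50) for _ in range(7)]
sy = [np.random.randn(50) for _ in range(9)]
W = np.random.randn(64, 2 * 3 * 50)
print(interaction_layer(sx, sy, W).shape)   # (5, 7, 64)
```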
Model Generality
• Arc-II subsumes Arc-I as a special case
Cost function
• Large margin objective:
  e(x, y+, y-; Θ) = max(0, 1 + s(x, y-) - s(x, y+))
  where s(·, ·) is the matching score from the network, y+ the correct match and y- a sampled negative
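The hinge form above in two lines of Python (the scores are placeholder numbers):

```python
def large_margin_loss(score_pos, score_neg, margin=1.0):
    """Penalize the model unless s(x, y+) exceeds s(x, y-) by at least the margin."""
    return max(0.0, margin + score_neg - score_pos)

print(large_margin_loss(0.9, 0.2))   # 0.3 -- still inside the margin
print(large_margin_loss(1.5, 0.2))   # 0.0 -- margin satisfied, no loss
```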
Experiment – Sentence Completion
Experiment – Matching Response to Tweet
Experiment – Paraphrase Identification
• Determine whether two sentences have the
same meaning
Discussion
• Sequence is important
Text Understanding from Scratch
Zhang, Xiang, and Yann LeCun. "Text Understanding from Scratch." arXiv preprint arXiv:1502.01710 (2015).
Contribution
• Character-level input
• No OOV
• Works for both English and Chinese
The model
• Character encoding space
• Characters not in the alphabet, and spaces, are encoded as all-zero vectors
• Fixed-length window sliding over the characters, e.g. "H e l l o   w o r l …"
More detail
What about varying input lengths?
• Fix the frame to the longest input we expect to see (1014 characters in their experiments); see the sketch below
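A minimal sketch of the character quantization; the alphabet below is a reduced, illustrative one (the paper defines its own set of about 70 characters), and the 1014-character frame matches the slide:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789.,;:!?'\"()-"   # illustrative subset
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}
FRAME_LEN = 1014                      # fixed input length used in the experiments

def quantize(text, frame_len=FRAME_LEN):
    """One-hot encode characters; spaces and out-of-alphabet characters stay all-zero;
    the frame is padded (or truncated) to a fixed length."""
    x = np.zeros((frame_len, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:frame_len]):
        idx = CHAR_IDX.get(ch)        # None for space / unknown -> row stays all-zero
        if idx is not None:
            x[pos, idx] = 1.0
    return x

print(quantize("Hello world").sum(axis=1)[:12])   # 1 for encoded chars, 0 at the space
```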
Data augmentation - Thesaurus
• Thesaurus: “a book that lists words in groups
of synonyms and related concepts”
• http://www.libreoffice.org/
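A rough sketch of the augmentation with a toy synonym table standing in for the LibreOffice thesaurus; following the paper, both the number of replaced words and the synonym rank are drawn from geometric distributions, with synonyms ordered from most to least common:

```python
import random

# Toy synonym table standing in for the LibreOffice thesaurus; synonyms are
# assumed to be ordered from most to least common meaning.
THESAURUS = {"good": ["fine", "decent", "solid"],
             "movie": ["film", "picture"]}

def geometric(p):
    """Sample r >= 1 with probability decaying like p**r."""
    r = 1
    while random.random() < p:
        r += 1
    return r

def augment(words, p=0.5, q=0.5):
    """Replace a geometrically-sampled number of words with a geometrically-ranked synonym."""
    out, candidates = list(words), [i for i, w in enumerate(words) if w in THESAURUS]
    random.shuffle(candidates)
    for i in candidates[:geometric(p)]:
        syns = THESAURUS[out[i]]
        out[i] = syns[min(geometric(q), len(syns)) - 1]
    return out

print(augment("a good movie overall".split()))   # e.g. ['a', 'fine', 'film', 'overall']
```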
Comparison models
• Bag-of-words: the 5000 most frequent words
• Bag-of-centroids: word vectors clustered into 5000 centroids (k-means) on the Google News corpus
DBpedia Ontology Classification
DBpedia Ontology Classification
Amazon review sentiment analysis
• Ratings of 1 to 5 indicating the user's subjective rating of a product
• Collected by SNAP project
Amazon review sentiment analysis
Amazon review sentiment analysis
Yahoo! Answer Topic Classification
Yahoo! Answer Topic Classification
News Categorization in English
News Categorization in English
News Categorization in Chinese
• SogouCA and SogouCS
• pypinyin package + jieba Chinese
segmentation system
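A minimal sketch of how the two packages named above can turn Chinese text into a romanized character stream for the same character-level model (not necessarily the authors' exact preprocessing):

```python
import jieba                      # Chinese word segmentation
from pypinyin import lazy_pinyin  # character -> pinyin without tone marks

def to_pinyin(text):
    """Segment Chinese text into words, then spell each word in pinyin."""
    return " ".join("".join(lazy_pinyin(word)) for word in jieba.cut(text))

print(to_pinyin("新闻分类"))   # e.g. "xinwen fenlei"
```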
News Categorization in Chinese
Conclusion
• We can play a lot of tricks with pooling
Thank you


Editor's Notes

  • #3 University of Oxford
  • #6 Each word or phrase is a vector => compose word-meaning vectors into a sentence-meaning vector. Strength: no longer a naïve or rule-based composition, but one that adapts to the surrounding context.
  • #7 K-max pooling: how many values (k) to keep is determined by the rest of the network. Filters in the upper layers can integrate relations between words that are far apart => like a CNN for images, it works bottom-up, from fine detail toward the big picture. You can also think of its feature graph as a kind of syntax tree. Back to the earlier cross-language question: why can it work across languages? => because the "grammar" it uses is one only the network itself can read.
  • #9 The network output after reading the last word is used as the sentence vector.
  • #11 What convolution does: acts as a feature detector for 1- up to win-grams. Advantages of the wide type: it guarantees all filter weights reach the entire sentence; intuitively, traditional n-grams also record the beginning and the end. Note: the feature maps get wider with every convolution, unlike in CNNs for images => k-max pooling is essential, otherwise the model would surely overfit.
  • #12 Introduced to handle varying sentence lengths. Word representations are d-dimensional; after convolution, max pooling is applied directly along each dimension => think of it as picking the features most relevant to this sentence. Drawback: narrow convolution worsens the neglect of words at the edges. Drawback of max pooling: it cannot tell whether a relevant feature occurred once or twice, nor in what order.
  • #13 In case you forgot: wide convolution is used here.
  • #15 Similar in spirit to SPP (spatial pyramid pooling).
  • #16 Smoothly converges to k_top.
  • #18 Having covered the network architecture, let's talk about the properties of the sentence model.
  • #19 So m in the first layer is usually large (10). 4. No external support needed.
  • #20 The non-NN-based baselines use only unigram + bigram features.
  • #21 A QA dataset; the task is to predict which type of question it is.
  • #22 Training set: 1.6M automatically labelled tweets; testing set: 400 manually labelled ones.
  • #23 The top-five activations of four filters on the movie-review dataset.
  • #24 SPP applies the spatial pyramid only at the last layer.
  • #25 Solves semantic matching problems, like WordNet.
  • #26 First, a short summary of what I found while preparing this report.
  • #27 1. Through layer-by-layer composition and pooling.
  • #29 No bias => combined with max pooling, the influence of the padding disappears.
  • #30 Just for illustration purposes, we present a dramatic choice of parameters (by turning off some elements in W1).
  • #31 Requires BPTT; the time steps are sequential. The CNN reaches far-away information through increasingly high-level features, the RNN through its memory. The first-layer features can be taken out and inspected; the next paper does this too.
  • #32 With a Siamese setup, word order becomes a problem.
  • #37 wordEmbed => sum of word vectors + MLP, Siamese training. SenMLP => whole sentence as input, Siamese training.
  • #40 SENNA+MLP outperforms the bag-of-words-like models.
  • #43 69 characters; 6 conv layers and 3 FC layers, with a dropout of 0.5 every 3 conv layers.
  • #44 Two ConvNets, one small and one large; ReLU activation.
  • #45 Rather underwhelming.
  • #46 For images and speech we add some noise during training; what about text? WordNet sits behind this. It still has to be controlled: the probability of switching to a more common meaning is higher.
  • #47 They say that because they need large-scale datasets, and previous sentiment analysis work had not been done at that scale, they simply picked a few models of their own to compare against. 5000-dim features, multinomial logistic regression.
  • #48 DBpedia => a knowledge base maintained by a community, derived from Wikipedia; 14 ontology classes.
  • #50 Review texts that come with a rating.
  • #52 Unlike the IMDB dataset, which may be more guarded in tone => word2vec's training objective is syntactic structure, while this kind of training can head straight for the desired objective.
  • #57 English has an alphabet; what about Chinese? Pinyin => romanization. Segmentation?