4. Convolutional Neural Nets (CNN)
• Introduced by Yann LeCun in 1998*
• Have been super successful in the area of vision; almost become the bread and butter for computer vision problems.
• A CNN treats an image as a signal in the spatial domain.
• Have many nice properties that make them super useful:
• Spatially invariant – robust to translation (and, to a limited degree, small rotations)
• Local structure
• Fast (concurrent calculations in each layer)
* LeNet-5 in "Gradient-based learning applied to document recognition"
5. Basics of CNN
• Input: an image.
• An image is nothing but a signal in space.
• Represented by a matrix of values per channel (R, G, B).
• Each value ~ intensity of the Red, Green and Blue channels respectively.
• 2 key operations: Convolution & Pooling
6. Convolution
• In simplest terms: given 2 signals x() and h(), convolution combines the 2 signals.
• In the discrete case:
(x ∗ h)[n] = Σ_m x[m] · h[n − m]
or, for 2D signals such as images:
(x ∗ h)[i, j] = Σ_m Σ_n x[m, n] · h[i − m, j − n]
• For our case the image is x().
• h() is called the filter/kernel/feature detector – a well-known concept in the world of image processing.
7. • Ex: filters for edge detection, blurring, sharpening, etc.
• A filter is usually a small matrix – 3×3, 5×5, 7×7, etc.
• There are well-known predefined filters:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
8. • A convolved feature is nothing but taking a part of the image and applying the filter over it – taking pairwise products and adding them.

Filter (3×3):
1 0 1
0 1 0
1 0 1

Image (5×5):
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Applying the filter to the top-left 3×3 patch:
1*1 + 1*0 + 1*1
+ 0*0 + 1*1 + 1*0
+ 0*1 + 0*0 + 1*1
= 4
9. • A convolved feature map is nothing but sliding the filter over the entire image and applying convolution at each step, as in the sketch below:

Filter:
1 0 1
0 1 0
1 0 1

https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn
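A minimal NumPy sketch of this sliding ("valid") convolution, using the example filter and image from the previous slide (the function name conv2d_valid is ours, not a library call):

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def conv2d_valid(x, h):
    # Slide the filter over every position where it fully fits,
    # taking pairwise products and summing them.
    out_h = x.shape[0] - h.shape[0] + 1
    out_w = x.shape[1] - h.shape[1] + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + h.shape[0], j:j + h.shape[1]] * h)
    return out

print(conv2d_valid(image, kernel))  # top-left entry is 4, as in the worked example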
10. • Image processing over the past many decades has built many filters for specific tasks.
• In DL (CNN), rather than using predefined filters, we learn the filters:
? ? ?
? ? ?
? ? ?
• We start with small random values and update them using gradients, as sketched below.
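A minimal PyTorch sketch of that idea (the image and target feature map here are random dummies, purely for illustration):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
# conv.weight starts out as small random values ("? ? ?")
optimizer = torch.optim.SGD(conv.parameters(), lr=0.01)

image = torch.randn(1, 1, 5, 5)    # dummy 5x5 single-channel image
target = torch.randn(1, 1, 3, 3)   # dummy desired 3x3 feature map

for step in range(100):
    optimizer.zero_grad()
    loss = ((conv(image) - target) ** 2).mean()
    loss.backward()                # gradients w.r.t. the filter values
    optimizer.step()               # nudge the filter toward the target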
11. Pooling
• It’s a simple technique for down/sub-sampling.
• In CNNs, down-sampling, or "pooling", layers are often placed after convolutional layers.
• They are used mainly to reduce the feature map dimensionality for computational efficiency. This in turn often improves actual performance.
• Takes disjoint chunks of the image (typically 2×2) and aggregates them into a single value.
• Average, max, min, etc. The most popular is max-pooling, sketched below.
https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
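A minimal NumPy sketch of 2×2 max-pooling over disjoint chunks (assumes even input dimensions; the helper name is ours):

import numpy as np

def max_pool_2x2(x):
    # Split x into disjoint 2x2 blocks and keep the max of each block.
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 2, 1, 3]])
print(max_pool_2x2(fm))  # [[4 2]
                         #  [2 5]]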
12. Putting it all together
https://adeshpande3.github.io
13. Language Modeling
• Filter out good sentences from bad ones.
• Good = semantically and syntactically correct.
• Model it via a probability distribution over sequences of words: Pr(w1, w2, …, wn)
• Assign a probability to a sentence such that
S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
14. Language Modeling
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction :
• The office is about fifteen minuets from my house.
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition :
• P(I saw a van) >> P(eyes awe of an)
• Summarization, question answering, etc.
15. • Unigram language models: assume each word occurs completely independently.
• Overly simplistic!
• Bigram language models: a word in a sentence is influenced only by its immediate predecessor.
• This too is naïve, but goes a long way in understanding some key concepts.
16. • N-gram models: try to capture longer-range dependencies. By the chain rule:
Pr(w1, w2, …, wn) = ∏_{i=1..n} Pr(wi | w1, w2, …, wi−1)
and an n-gram model approximates each factor by conditioning only on the previous n−1 words.
• This captures how likely a sentence is in a given language; a toy bigram version is sketched below.
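A minimal count-based bigram sketch of this (the two-sentence "corpus" is made up for illustration):

from collections import Counter

corpus = ["the cat jumped over the dog",
          "the dog sat near the cat"]

tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens
                  for i in range(len(sent) - 1))

def sentence_prob(sentence):
    # Pr(w1..wn) ~ product of Pr(wi | wi-1); Pr(w1) is ignored for brevity.
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]  # zero for unseen bigrams
    return p

print(sentence_prob("the cat jumped over the dog"))  # 0.125 on this corpus
print(sentence_prob("jumped over the the cat dog"))  # 0.0 ("the the" unseen)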
17. Deep Learning + Language Modeling
• Traditionally uses architectures such as Recurrent Neural Networks (RNN).
• Sequential processing: one unit after another.
• Over time, advancements happened and concepts like 2-way ordering (bidirectional), memory (LSTM), attention, etc. got added.
• Some people explored the possibility of using CNNs for language modeling:
• Pixels are spread in space, so they are nothing but a signal in space.
• Words/tokens/characters are spread in time, so they are nothing but a signal in time.
19. • Input for any NLP task: sentences/paras/docs in the form of a matrix.
• Each row of this matrix represents a unit/token of text – character, morpheme, word, etc. (typically row = 1-hot or embedding representation of that unit).
• Unlike images, where the filter slides over local patches of the image, in NLP we typically use filters that slide over full rows of the matrix, i.e. the "width" of our filters is usually the same as the width of the input matrix. [1D or temporal convolutions]
• The height, or region size, varies; typically the window slides over 2-5 words at a time, as in the sketch below.
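A minimal PyTorch sketch of such a full-width 1D convolution over a sentence matrix (sentence length, embedding size, and filter count are made-up examples):

import torch
import torch.nn as nn

seq_len, embed_dim = 7, 50                     # 7 tokens, 50-dim embeddings
sentence = torch.randn(1, embed_dim, seq_len)  # (batch, channels, time)

# The filter spans the full embedding width; region size = 3 words.
conv = nn.Conv1d(in_channels=embed_dim, out_channels=100, kernel_size=3)
features = conv(sentence)            # (1, 100, 7 - 3 + 1) = (1, 100, 5)
pooled = features.max(dim=2).values  # max-over-time -> (1, 100)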
21. • Lots of the success of CNNs is attributed to:
• Location invariance: where an object appears in an image doesn't matter so much.
• Local compositionality: a bunch of local features combine/compose to give more complex objects.
22. • In CNN+NLP, both aforementioned properties go for a toss.
• Where a word appears in a sentence can change the meaning drastically:
Man bites dog.
Dog bites man.
• Parts of phrases could be separated by several other words. Words do compose in some ways, but how exactly this works, and what higher-level representations actually "mean" – these aren't as obvious as in the computer vision case.
"Tim said Robert has a lot of experience, he feels you should definitely meet him"
• With both key advantages gone, why are we even thinking of applying CNNs to text? RNNs should be the way to go.
23. • "All models are wrong, but some are useful"
• This is not about CNNs vs RNNs (maybe both are bad!)
• This is about:
• Understanding key difficulties
• Asking whether there are aspects of language modeling where CNNs can do a better job
• Better understanding the strengths & weaknesses of each model
• Turns out that CNNs applied to certain NLP problems perform quite well, especially classification tasks – Sentiment Analysis, Spam Detection or Topic Categorization.
• CNNs are usually fast, very fast.
24. Major works in this sub-area
• Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP 2014
• Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. COLING-2014
• Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with
Convolutional-Pooling Structure for Information Retrieval. CIKM ’14.
• Santos, C., & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech
Tagging. ICML-14.
• Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. NIPS 2015.
• Wenpeng Yin, Hinrich Schutze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based
convolutional neural network for modeling sentence pairs.
25. • Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schutze. 2016. Combining recurrent
and convolutional neural networks for relation classification. In Proceedings of NAACL HLT.
pages 534–539.
• Ying Wen, Weinan Zhang, Rui Luo, and Jun Wang.2016. Learning text representation using
recurrent convolutional neural network with highway layers. SIGIR Workshop on Neural
Information Retrieval
• Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with
gated convolutional networks. arXiv preprint arXiv:1612.08083
• Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. 2017. Comparative Study of CNN and RNN for Natural Language Processing. arXiv:1702.01923
• Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language
Models. (Uses a hybrid of CNN and RNN)
27. Character-Aware Neural Language Models *
• Problem statement: given t words w1, w2, …, wt, predict wt+1.
• Traditional models: words are fed as inputs in the form of word embeddings.
• Here the input word embedding is replaced by the output of a character-level CNN.
• Uses sub-word information.
• Traditionally, sub-word information is fed in terms of morphemes:
Unbreakable : un ("not") – break (root word) – able ("can be done")
* "Character-Aware Neural Language Models", Y. Kim et al. 2015
28. • Identifying morphemes is non-trivial; it requires morphological tagging as preprocessing.
• Y. Kim et al. instead leverage sub-word information through a character-level CNN:
• Learn an embedding for each character.
• A word w is then nothing but the embeddings of its constituent characters.
• For each word, apply convolution on its character embeddings to obtain features.
• These are then fed to an LSTM via highway layers.
• Does not use word embeddings at all.
• In most language models, a large % of the parameters sit in the word embeddings; thus we get a much smaller number of parameters to learn.
29. Details
C – vocabulary of characters.
D – dimensionality of character embeddings.
R ∈ ℝ^(D × |C|) – matrix of character embeddings (one column per character).
Let word wk = [c1, …, cl], i.e. made from l characters, where l is the length of wk.
The character-level representation of wk is the matrix Ck ∈ ℝ^(D × l), where the jth column corresponds to the character embedding of the jth character of word wk.
Apply a filter/kernel H ∈ ℝ^(D × w) to Ck to obtain a feature map fk ∈ ℝ^(l − w + 1). The ith element of fk is given by:
fk[i] = tanh(⟨Ck[∗, i : i + w − 1], H⟩ + b)
where Ck[∗, i : i + w − 1] is the ith to (i + w − 1)th columns of Ck, and ⟨A, B⟩ = Tr(ABᵀ) is the Frobenius inner product.
30. • To capture the most important feature, we take the max over time:
yk = max_i fk[i]
yk is the feature corresponding to filter H when applied to word wk (~ find the most important character n-gram).
• Likewise, they apply multiple filters H1, …, Hh. Then
yk = [yk^1, …, yk^h]
is the input representation of word wk (a sketch of the whole pipeline follows).
• At this point we can either:
• Construct an MLP over yk
• Feed yk to an LSTM
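A minimal PyTorch sketch of this character-level convolution with max-over-time pooling (all sizes are made-up; the authors' actual code is in the lstm-char-cnn repo linked under Resources):

import torch
import torch.nn as nn

num_chars, D = 70, 15      # |C| characters, D-dim character embeddings
w, h = 3, 25               # filter width and number of filters H1..Hh
l = 9                      # length of the word in characters

char_embed = nn.Embedding(num_chars, D)           # rows of R
conv = nn.Conv1d(in_channels=D, out_channels=h, kernel_size=w)

word_chars = torch.randint(0, num_chars, (1, l))  # char ids of one word
Ck = char_embed(word_chars).transpose(1, 2)       # (1, D, l)
fk = torch.tanh(conv(Ck))                         # (1, h, l - w + 1)
yk = fk.max(dim=2).values                         # max over time -> (1, h)
# yk is the word's input representation, fed onward via highway layers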
31. • Instead, to gain improvements, rather than feeding yk to the LSTM directly, they pass it through a Highway network*.
Highway network:
• Basic idea: carry some part of the input directly to the output, while the remaining input is processed and then taken forward. Very similar to residual networks.
z = t ⊙ F(y) + (1 − t) ⊙ y
• F() is typically an affine transformation followed by tanh.
• In Highway networks, we learn "what parts of the input are to be carried forward via the highway".
• This is done via a gating mechanism: a transform gate t = σ(W_T y + b_T) and a carry gate (1 − t), as sketched below.
* "Highway Networks", Srivastava et al. 2015
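A minimal PyTorch sketch of one highway layer (the dimension is a made-up example):

import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Linear(dim, dim)     # F(): affine, tanh applied below
        self.gate = nn.Linear(dim, dim)  # produces the transform gate t

    def forward(self, y):
        t = torch.sigmoid(self.gate(y))                  # what to transform
        return t * torch.tanh(self.F(y)) + (1 - t) * y   # carry gate = 1 - t

yk = torch.randn(1, 25)    # e.g. the CNN features of one word
out = Highway(25)(yk)      # same shape: partly carried, partly transformed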
34. Key take home
• CNNs + NLP surely holds a lot of promise.
• Pretty successful in the classification setting.
• Can prove to be a great tool to model the input side of NLP.
• What about non-classification settings?
• Sequence labeling (NER)
• Sequence generation (MT)
• As of today, not so successful, though people have tried a lot of ideas there too:
• de-convolutions in generative settings
• some architectures use different embeddings as different channels.
35. More Resources
• https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
• https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f
• wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
• https://blogs.technet.microsoft.com/machinelearning/2017/02/13/cloud-scale-text-classification-with-convolutional-neural-networks-on-microsoft-azure/
• https://www.aclweb.org/anthology/P/P14/P14-1062.xhtml
• https://github.com/yoonkim/lstm-char-cnn
• https://github.com/yoonkim/CNN_sentence
• https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32
• “Comparative Study of CNN and RNN for Natural Language Processing” Wenpeng Yin et. al 2017,
arXiv:1702.01923 [cs.CL]