 Detecting Misleading
Headlines in Online News 

Hands-on Experiences on Attention-based RNN
Kunwoo Park
24th June 2019
IBS deep learning summer school
Who am I
• Kunwoo Park (박건우)
• Post doc, Data Analytics, QCRI (2018 - present)
• PhD, School of Computing, KAIST (2018) 

with outstanding dissertation award
• Research interest
• Computational social science using machine learning
• Text style transfer using RNN and RL
2
This talk will..
• Help the audience understand the attention mechanism for text
• Introduce a recent research effort on detecting misleading
news headlines using deep neural networks
• Explain the building blocks of the state-of-the-art model and
show how they are implemented in TensorFlow (1.x)
• Give hands-on experience in implementing a text classifier
using the attention mechanism
3
clickbait
4
Target problem
• Detect incongruity between a news headline and its body text:

a headline that does not correctly represent the story
5
Overall model architecture
Deep Neural Net for
Encoding Headline
Deep Neural Net for
Encoding Body Text
Embedding
Layer
Output
Layer
Input
Layer
Goal: Detecting headline incongruity
from the textual relationship between body text and headline
6
Overall model architecture
Deep Neural Net for
Encoding Headline
Deep Neural Net for
Encoding Body Text
Embedding
Layer
Output
Layer
Input
Layer
7
Input data
• Transform words into vocabulary indices
headline:
[1, 30, 5, …, 9951, 2]
body text:
[ 875, 22, 39, …, 2481, 2,
9, 93, 9593, …, 431, 77,
1, 30, 5, …, 9951, 2, … ]
8
Define input layer in TF
• Using tf.placeholder (sketch below)
• Parameters
• data type: tf.int32
• shape: [None, self.max_words]
• name: used for debugging
headline:
[1, 30, 5, …, 9951, 2]
body text:
[ 875, 22, 39, …, 2481, 2,
9, 93, 9593, …, 431, 77,
1, 30, 5, …, 9951, 2, … ]
9
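The placeholder code on this slide was shown as an image; below is a minimal TF 1.x sketch under the same parameters (the variable names and maximum-length values are illustrative assumptions, not the author's exact code):

```python
import tensorflow as tf

# Assumed maximum lengths; the real model pads/truncates to self.max_words.
max_words_headline = 25
max_words_bodytext = 1000

# dtype tf.int32 holds vocabulary indices; the first dimension None allows
# any batch size. The name argument labels the tensor for debugging.
input_headline = tf.placeholder(
    tf.int32, shape=[None, max_words_headline], name="input_headline")
input_bodytext = tf.placeholder(
    tf.int32, shape=[None, max_words_bodytext], name="input_bodytext")
```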
Feed data into placeholders
• At the very end of the computation graph: usually at the optimizer (see sketch below)
headline:
[1, 30, 5, …, 9951, 2]
body text:
[ 875, 22, 39, …, 2481, 2,
9, 93, 9593, …, 431, 77,
1, 30, 5, …, 9951, 2, … ]
10
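A self-contained sketch of feeding a batch at session-run time; the trivial slicing op here merely stands in for the optimizer's training op, which is where the feed usually happens:

```python
import numpy as np
import tensorflow as tf

input_headline = tf.placeholder(tf.int32, [None, 5], name="input_headline")
first_tokens = input_headline[:, 0]  # stand-in op; really this would be train_op

with tf.Session() as sess:
    headline_batch = np.array([[1, 30, 5, 9951, 2]], dtype=np.int32)
    # Placeholders receive concrete values only here, at graph execution.
    print(sess.run(first_tokens, feed_dict={input_headline: headline_batch}))
```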
One-hot encoding
{“believe”: 0, “do”: 1, “you”: 2, “happens”: 3,
“if”: 4,“what”: 5, “wouldn't”: 6, “yoga”: 7}
Vocabulary
11
[[0, 0, 1, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 0, 1, 0, … ],
[1, 0, 0, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 1, 0, 0, … ],
[0, 0, 0, 1, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 1, 0, 0, 0, … ],
[0, 0, 1, 0, 0, 0, 0, 0, … ],
[0, 1, 0, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 0, 0, 1, … ]]
Drawbacks of one-hot
[[0, 0, 1, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 0, 1, 0, … ],
[1, 0, 0, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 1, 0, 0, … ],
[0, 0, 0, 1, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 1, 0, 0, 0, … ],
[0, 0, 1, 0, 0, 0, 0, 0, … ],
[0, 1, 0, 0, 0, 0, 0, 0, … ],
[0, 0, 0, 0, 0, 0, 0, 1, … ]]
{“believe”: 0, “do”: 1, “you”: 2, “happens”: 3,“if”: 4,
“what”: 5, “wouldn't”: 6, “yoga”: 7, … “a”:1000000000}
Vocabulary
12
Word embedding
• A mapping of a discrete variable for each word to a fixed-
dimensional vector of continuous numbers
[[0.23, 0.51],
[0.72, 0.13],
[0.01, 0.07],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.41, 0.38],
[0.33, 0.68],
[0.14, 0.22]]
[[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1]]
13
Sequence length
×
Vocab size
Sequence length
×
Embedding size
• A mapping of a discrete variable for each word to a fixed-
dimensional vector of continuous numbers
Word embedding
[[0.23, 0.51],
[0.72, 0.13],
[0.01, 0.07],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.41, 0.38],
[0.33, 0.68],
[0.14, 0.22]]
[[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1]]
14
?
Embedding matrix
Sequence length
×
Vocab size
Sequence length
×
Embedding size
Training from scratch
[[0.01, 0.07],
[0.33, 0.68],
[0.23, 0.51],
[0.41, 0.38],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.01, 0.07],
[0.72, 0.13],
[0.14, 0.22]]
[[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1]]
[[0.23, 0.51],
[0.72, 0.13],
[0.01, 0.07],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.41, 0.38],
[0.33, 0.68],
[0.14, 0.22]]
One-hot input Embedding matrix
15
Vocab size
×
Embedding size
Sequence length
×
Vocab size
Sequence length
×
Embedding size
Embedded input
Training from scratch
16
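A sketch of the from-scratch option in TF 1.x: the embedding matrix is a trainable variable learned jointly with the rest of the model (names and sizes are assumptions):

```python
import tensorflow as tf

vocab_size, embedding_size = 10000, 300  # assumed sizes

# Trainable embedding matrix: [Vocab size x Embedding size]
embedding = tf.get_variable(
    "embedding", [vocab_size, embedding_size],
    initializer=tf.random_uniform_initializer(-0.1, 0.1))

input_ids = tf.placeholder(tf.int32, [None, None])  # [batch, sequence length]
# Row lookup is equivalent to multiplying one-hot inputs by the matrix,
# without ever materializing the sparse one-hot tensors.
embedded_input = tf.nn.embedding_lookup(embedding, input_ids)
# embedded_input: [batch, Sequence length, Embedding size]
```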
Load pre-trained matrix
[[0.01, 0.07],
[0.33, 0.68],
[0.23, 0.51],
[0.41, 0.38],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.01, 0.07],
[0.72, 0.13],
[0.14, 0.22]]
[[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1]]
[[0.23, 0.51],
[0.72, 0.13],
[0.01, 0.07],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.41, 0.38],
[0.33, 0.68],
[0.14, 0.22]]
One-hot input Embedding matrix Embedded input
word2vec
GloVe
BERT
….
17
Vocab size
×
Embedding size
Sequence length
×
Vocab size
Sequence length
×
Embedding size
Load pre-trained matrix
18
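One common TF 1.x pattern for loading a pre-trained matrix (word2vec, GloVe, …); the random array below only stands in for vectors read from an embedding file keyed by the same vocabulary:

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_size = 10000, 300  # assumed sizes
pretrained = np.random.rand(vocab_size, embedding_size).astype(np.float32)
# In practice `pretrained` comes from word2vec/GloVe files, row-aligned
# with the vocabulary used to index the inputs.

embedding = tf.get_variable("embedding", [vocab_size, embedding_size],
                            trainable=True)  # set False to freeze the vectors
embedding_ph = tf.placeholder(tf.float32, [vocab_size, embedding_size])
embedding_init = embedding.assign(embedding_ph)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(embedding_init, feed_dict={embedding_ph: pretrained})
```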
Overall model architecture
Deep Neural Net for
Encoding Headline
Deep Neural Net for
Encoding Body Text
Embedding
Layer
Output
Layer
Input
Layer
19
Deep encoder
Deep neural network
20
[[0.23, 0.51],
[0.72, 0.13],
[0.01, 0.07],
[0.18, 0.77],
[0.04, 0.05],
[0.87, 0.92],
[0.41, 0.38],
[0.33, 0.68],
[0.14, 0.22]]
Embedded input
Sequence length
×
Embedding size
[[0.752, 0.757, 0.587],
[0.645, 0.397, 0.618],
[0.777, 0.099, 0.938],
[0.367, 0.139, 0.150],
[0.341, 0.069, 0.398],
[0.415, 0.655, 0.467],
[0.935, 0.659, 0.321],
[0.875, 0.699, 0.967],
[0.734, 0.966, 0.205]]
Hidden representation
Sequence length
×
Hidden size
Which neural net can we use?
• Feedforward neural network
• Convolutional network
• Recurrent neural network
21
Recurrent neural network
• Efficient in modeling inputs with sequential dependencies 

(e.g., text, time-series, …)
• To make an output at each step, RNNs combine the current
input with what has been learned so far
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: unrolled RNN with inputs x_1 … x_t and hidden states h_1 … h_t]
22
Long-term dependencies
• “the clouds are in the sky”
• “I grew up in France … I speak fluent French”
23
LSTM
[Figure: vanilla recurrent unit vs. LSTM cell]
24
Cell state
• A kind of memory unit that keeps past information
• The LSTM can add or remove information from the cell state
through special structures called gates
25
Forget gate layer
• Decide what information we’re going to throw away from the
cell state
• 1: “completely keep this”. 0: “completely get rid of this”
26
Taking input
• Decide what new information we’re going to store in the cell state
• Input gate layer: sigmoid decides which values we’ll update
• tanh layer: creates a vector of candidate values
27
Update cell state
• Combine the old cell state with the new candidate values
through f_t and i_t: C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
28
Decide output
• Output is a filtered version of the cell state C_t
29
GRU
• Update gate: combination of forget gate and input gate
• Merge cell state and hidden state
30
Bi-directional RNN
• Combining two RNNs together: 

One RNN reads inputs from left to right and 

another RNN reads inputs from right to left
• Able to understand context better
https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
31
How to build RNN in TF
1. Decide which cell you use for RNN
2. Decide the number of layers in RNN
3. Decide whether the RNN is uni- or bi-directional
32
LSTM or GRU
33
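The cell-construction code was an image in the original deck; a minimal TF 1.x sketch of step 1, choosing the recurrent cell (hidden size is an assumed value):

```python
import tensorflow as tf

hidden_size = 300  # assumed

# Choose one recurrent unit for the encoder:
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=hidden_size)
gru_cell = tf.nn.rnn_cell.GRUCell(num_units=hidden_size)
```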
Stacked RNN
34
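Likewise, a sketch of step 2, stacking cells into a deeper RNN:

```python
import tensorflow as tf

hidden_size, num_layers = 300, 2  # assumed

# Stack identical cells; the output of each layer feeds the next.
stacked_cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)])
```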
Uni-directional RNN
• tf.nn.dynamic_rnn()
• outputs: the sequence of hidden states 

[batch_size, max_sequences, output_size]
• state: the final state 

[batch_size, output_size]
35
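A sketch of running the cell with tf.nn.dynamic_rnn() as described above (placeholder shapes are assumptions):

```python
import tensorflow as tf

hidden_size, embedding_size = 300, 300  # assumed
embedded_input = tf.placeholder(tf.float32, [None, None, embedding_size])
seq_len = tf.placeholder(tf.int32, [None])  # true lengths before padding

cell = tf.nn.rnn_cell.GRUCell(hidden_size)
# outputs: [batch_size, max_sequences, output_size]
# state:   [batch_size, output_size] (a (c, h) tuple for LSTM cells)
outputs, state = tf.nn.dynamic_rnn(
    cell, embedded_input, sequence_length=seq_len, dtype=tf.float32)
```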
Bi-directional RNN
• outputs, states = (output_fw, output_bw), (state_fw, state_bw)
36
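And a sketch of the bi-directional variant, matching the output structure on the slide:

```python
import tensorflow as tf

hidden_size, embedding_size = 300, 300  # assumed
embedded_input = tf.placeholder(tf.float32, [None, None, embedding_size])

cell_fw = tf.nn.rnn_cell.GRUCell(hidden_size)  # reads left to right
cell_bw = tf.nn.rnn_cell.GRUCell(hidden_size)  # reads right to left
# outputs = (output_fw, output_bw), states = (state_fw, state_bw)
outputs, states = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, embedded_input, dtype=tf.float32)
combined = tf.concat(outputs, axis=2)  # common choice: concatenate directions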
Some body text is too long..
• h_t should contain all necessary information from the past,
over a thousand steps back
[Figure: unrolled RNN with inputs x_1 … x_t and final hidden state h_t]
37
37
A news article is hierarchical
38
Hierarchical RNN
• Word-level RNN: h_t^p = f(h_{t−1}^p, x_t^p; θ_f)
• Paragraph-level RNN: u_p = g(u_{p−1}, h_t^p; θ_g)
[Figure: a word-level RNN runs over each paragraph’s words x_1^p … x_t^p to produce h_t^p; a paragraph-level RNN over h_t^1, h_t^2, …, h_t^p produces u_1, u_2, …, u_p]
39
Hierarchical RNN
• Word-level RNN: h_t^p = f(h_{t−1}^p, x_t^p; θ_f)
• Paragraph-level RNN: u_p = g(u_{p−1}, h_t^p; θ_g)
[Figure: same hierarchical structure as above]
The maximum sequence length of each RNN can be reduced
significantly, so we can effectively train models with fewer parameters
40
Word-level RNN
41
Paragraph-level RNN
42
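The word- and paragraph-level RNN code was shown as screenshots; below is one possible hierarchical encoder as a hedged sketch (the fixed shapes and all names are illustrative assumptions, not the released implementation):

```python
import tensorflow as tf

num_para, num_words, emb_size, hidden_size = 30, 50, 300, 300  # assumed

# Embedded body text: [batch, paragraphs, words per paragraph, embedding size]
embedded = tf.placeholder(tf.float32, [None, num_para, num_words, emb_size])

# Word-level RNN: merge batch and paragraph dims so each paragraph is a sequence.
words = tf.reshape(embedded, [-1, num_words, emb_size])
with tf.variable_scope("word_rnn"):
    _, para_vec = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.GRUCell(hidden_size), words, dtype=tf.float32)

# Paragraph-level RNN: one step per paragraph representation h_t^p.
para_seq = tf.reshape(para_vec, [-1, num_para, hidden_size])
with tf.variable_scope("para_rnn"):
    para_outputs, body_state = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.GRUCell(hidden_size), para_seq, dtype=tf.float32)
# para_outputs: [batch, num_para, hidden_size] -> the u_p sequence
```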
What’s more?
• Across the body text, some paragraphs carry a strong signal
43
Neural Machine Translation
• RNN-based encoder-decoder architecture, known as seq2seq
44 Sutskever et al., 2014; Cho et al., 2014
Attention mechanism in NMT
45
[Figure: attention weights between source (German) and target (English) words]
https://aws.amazon.com/ko/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
Attention mechanism in NMT
46
https://aws.amazon.com/ko/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
[Figure: attention weights between source (German) and target (English) words]
Attention mechanism
47
• In detecting incongruity, we can pay a different amount of
attention to each paragraph
Attention mechanism
• In detecting incongruity, we can pay a different amount of
attention to each paragraph
[Figure: the body-text RNN (source) yields paragraph vectors u_1^B … u_p^B, and the headline RNN (target) yields u^H; an alignment model scores each pair, and a weighted sum gives u^B]
48
Alignment model
• Calculate attention weights between each paragraph (source)
and headline (target)

a_H(s) = align(u^H, u^B_s) = exp(score(u^H, u^B_s)) / ∑_{s′} exp(score(u^H, u^B_{s′}))

[Figure: alignment models score each paragraph vector u_1^B … u_p^B against the headline vector u^H; their weighted sum gives u^B]
49
Alignment model
• Score is a content-based function (Luong et al., 2015)
[Figure: same alignment structure as above]
50
Context vector
• Represents the body text with different attention weights
across paragraphs

u^B = ∑_s a_H(s) u^B_s

[Figure: the attention-weighted sum of paragraph vectors u_1^B … u_p^B forms the context vector u^B]
51
Attention in TF
• Using dot-product similarity
• bodytext_outputs: sequence of the hidden states
• headline_states: the last hidden state
52
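The slide’s code was an image; a sketch of dot-product attention over paragraph states, using the tensor names described above (shapes assumed):

```python
import tensorflow as tf

hidden_size = 300  # assumed
# bodytext_outputs: sequence of hidden states from the body-text RNN
bodytext_outputs = tf.placeholder(tf.float32, [None, None, hidden_size])
# headline_states: the last hidden state of the headline RNN
headline_states = tf.placeholder(tf.float32, [None, hidden_size])

# Dot-product score between the headline vector and every paragraph vector
scores = tf.reduce_sum(
    bodytext_outputs * tf.expand_dims(headline_states, 1), axis=2)  # [batch, p]
attn_weights = tf.nn.softmax(scores)  # a_H(s): softmax over paragraphs
# Context vector u^B: attention-weighted sum of the paragraph vectors
context = tf.reduce_sum(
    tf.expand_dims(attn_weights, 2) * bodytext_outputs, axis=1)
```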
Overall model architecture
Deep Neural Net for
Encoding Headline
Deep Neural Net for
Encoding Body Text
Embedding
Layer
Output
Layer
Input
Layer
53
Measure similarity
• u^H: last hidden state of the RNN for encoding the headline
• u^B: context vector that encodes the body text
• M: learnable similarity matrix, b: bias term
• σ: sigmoid function

p(label) = σ((u^H)^⊤ M u^B + b)
54
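A sketch of the bilinear output layer defined by the formula above (shapes are assumptions):

```python
import tensorflow as tf

hidden_size = 300  # assumed
u_h = tf.placeholder(tf.float32, [None, hidden_size])  # u^H: headline encoding
u_b = tf.placeholder(tf.float32, [None, hidden_size])  # u^B: body context vector

M = tf.get_variable("M", [hidden_size, hidden_size])   # learnable similarity matrix
b = tf.get_variable("b", [], initializer=tf.zeros_initializer())  # bias term

# (u^H)^T M u^B + b, computed for the whole batch at once
logits = tf.reduce_sum(tf.matmul(u_h, M) * u_b, axis=1) + b
p_label = tf.sigmoid(logits)  # σ(...): probability of an incongruent headline
```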
Measure similarity
p(label) = σ((u^H)^⊤ M u^B + b)
55
Define loss function
• Cross-entropy: standard loss function for classification

y: ground truth (0/1), p(y): model output
56
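A sketch of the cross-entropy loss in TF 1.x; working on logits rather than hand-applied sigmoids is the numerically stable route:

```python
import tensorflow as tf

labels = tf.placeholder(tf.float32, [None])  # y: ground truth (0/1)
logits = tf.placeholder(tf.float32, [None])  # pre-sigmoid model output

# Binary cross-entropy, averaged over the batch
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
```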
Optimizer
• Gradient clipping to prevent exploding gradients
57
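A sketch of gradient clipping wrapped around an optimizer; the toy loss and the hyperparameter values are assumptions:

```python
import tensorflow as tf

learning_rate, max_grad_norm = 1e-3, 5.0  # assumed hyperparameters
w = tf.get_variable("w", [10])
loss = tf.reduce_sum(tf.square(w))  # stand-in for the model's loss

optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
# Rescale all gradients together if their global norm exceeds max_grad_norm
clipped, _ = tf.clip_by_global_norm(grads, max_grad_norm)
train_op = optimizer.apply_gradients(list(zip(clipped, variables)))
```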
Overfitting
[Figure: error vs. model complexity; underfitting on the left, overfitting on the right]
58
How to prevent overfitting?
• Add more data! (most effective if possible)
• Data augmentation: add noise to inputs for better generalization
• Regularization: L1/L2, dropout, early stopping (see the sketch after this list)
• Reduce architecture complexity
59
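As one concrete example of the regularization options above, a sketch of dropout around a recurrent cell (the keep probability is an assumed value):

```python
import tensorflow as tf

hidden_size = 300  # assumed
keep_prob = tf.placeholder(tf.float32)  # e.g., 0.7 while training, 1.0 at test

# DropoutWrapper applies dropout to the cell's outputs at each time step
cell = tf.nn.rnn_cell.DropoutWrapper(
    tf.nn.rnn_cell.GRUCell(hidden_size), output_keep_prob=keep_prob)
```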
Evaluation results
60
Demo
61 Credit: Taegyun Kim
Dataset/code/paper
• https://github.com/david-yoon/detecting-incongruity
62
Attention for text classification
• Giving different weights over word sequences (Zhou et al., ACL 2016)
63
H = [h_1, h_2, ⋯, h_T]
M = tanh(H)
α = softmax(w^⊤ M)
r = H α^⊤
Attention for text classification
• Focusing on important sentence representations, each of which
pays a different amount of attention to words (Yang et al., NAACL 2016)
64
Attention for text classification
• Transfer learning on Transformer language models, trained with
multi-head attention (Vaswani et al., NIPS 2017; Devlin et al., NAACL 2019)
65
Hands-on experience
• Target problem: sentiment analysis on IMDB review dataset
66
Link: https://bit.ly/2xbelke
Thank you
Kunwoo Park
@ IBS deep learning summer school
