2. Outline
● Things you are interested in
○ TextCNN
○ Why you should try deep learning on text data
● On a real product
○ Acme review / Q&A moderation
○ Network spec
4. TextCNN: architecture
1. Architecture
a. Embedding
i. Trainable, or not (if using pretrained)
b. 1d convolution filters
i. Each conv “searches for this pattern”
ii. Pattern to be searched is trainable
iii. Hundreds to thousands of filters per region size
c. Maxpool results
i. Means “max density of certain pattern”
2. Keras code: just 10 lines
5. Let’s take an intuitive analogy!
(Keywords: Convolution, MaxPool)
6. 1. You want to classify whether a given animal is a Llama.
[Figure: “I’m a Llama” (sample L) / “I’m an Alpaca” (sample A)]
7. 2. So you pick out traits of a Llama as your filters.
[Figure: Traits (Filters)]
8. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 70%, 10%, 5%. Best match of the 1st trait is 70% (by MaxPool).]
9. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 10%, 80%, 60%, 15%. Best match of the 2nd trait is 80% (by MaxPool).]
10. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 10%, 10%, 40%, 60%. Best match of the 3rd trait is 60% (by MaxPool).]
11. 4. Finally, you have a similarity score for each trait (features!)
[Figure: Llama traits (filters): 70%, 80%, 60%]
12. 5. Make the final decision (model) out of your features.
(In a neural network, a multi-layer perceptron is the simplest choice.)
[Figure: trait features 70%, 80%, 60% feeding any classifier you love!]
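The whole analogy fits in a few lines of Python: each trait (filter) yields a list of match scores, and MaxPool keeps only the best one per trait; the maxima become the feature vector. The scores below are the ones from the slides:

```python
import numpy as np

# Match scores each "trait" filter produced while sliding over the sample
# (numbers taken from the analogy slides above).
trait_scores = [
    np.array([0.00, 0.70, 0.10, 0.05]),        # 1st trait
    np.array([0.00, 0.10, 0.80, 0.60, 0.15]),  # 2nd trait
    np.array([0.00, 0.10, 0.10, 0.40, 0.60]),  # 3rd trait
]

# MaxPool: keep only the best match per trait -> one feature per filter.
features = [float(s.max()) for s in trait_scores]
print(features)  # [0.7, 0.8, 0.6]
```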
13. Conv & MaxPool
Now you’ve got the idea, let’s dive into a bit more detail.
17. TextCNN: what is a 1d convolution?
Let’s talk in convolutions!
10*3 + 20*0 + 30*1 + 40*2 = 140
10*0 + 20*0 + 30*0 + 40*0 = 0
10*0 + 20*2 + 30*0 + 40*0 = 40
10*0 + 20*0 + 30*0 + 40*0 = 0
Note: you can specify an activation function in Conv1D to adjust the output; people seem to use ReLU in TextCNN.
[Figure: filter sliding over the text data]
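The sliding sums above are just dot products. A minimal sketch, with an illustrative token sequence and filter weights [3, 0, 1, 2] chosen so the first window reproduces 10*3 + 20*0 + 30*1 + 40*2 = 140:

```python
import numpy as np

# A minimal sketch of the sliding dot product behind Conv1D.
# The data and filter values are illustrative only.
data = np.array([10, 20, 30, 40, 10, 20, 30, 40])
filt = np.array([3, 0, 1, 2])  # region size 4; in TextCNN these weights are trained

region = len(filt)
outputs = [int(data[i:i + region] @ filt) for i in range(len(data) - region + 1)]
print(outputs[0])  # 140: the filter's match score at the first position
```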
18. TextCNN: what is a 1d convolution?
1. The idea is similar to ngram-BoW/TF-IDF, but:
a. No exact match is needed; words with similar
meanings also contribute.
b. It is weighted across tokens/dimensions.
c. Maxpool lets only the best-matched pattern
have the final effect.
2. It brings
a. training the embedding, and
b. training the feature finder
into your supervised learning process,
dedicated to your data.
3. Meanwhile, most traditional feature finding & extraction
is actually an unsupervised learning process.
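Point 1a can be seen in a toy example. All embedding vectors below are made up for illustration:

```python
import numpy as np

# Toy illustration: a conv filter scores the dot product with a word's
# embedding, so any word whose embedding is close to the learned pattern
# fires it, not just an exact token match.
emb = {
    "great":    np.array([0.9, 0.1]),
    "awesome":  np.array([0.8, 0.2]),   # similar meaning, similar vector
    "terrible": np.array([-0.9, 0.1]),
}
pattern = np.array([1.0, 0.0])  # a (trained) filter of region size 1

scores = {word: float(vec @ pattern) for word, vec in emb.items()}
# "awesome" scores almost as high as "great", while an exact-match
# n-gram feature for "great" would miss it entirely.
```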
19. Now you know the details.
Let’s return to the full picture!
20. TextCNN: architecture
1. Architecture
a. Embedding
i. Trainable, or not (if using pretrained)
b. 1d convolution filters
i. Each “searches for this pattern”
ii. Pattern to be searched is trainable
iii. Hundreds to thousands of filters per region size
c. Maxpool results
i. Means “max density of certain pattern”
2. Keras code: just 10 lines
21. TextCNN: code detail
1. Input: 140 as the max word count per doc.
2. 60k for the English dictionary size.
3. 300 is the conventional embedding size.
4. region size = [2, 3, 4], as in the example figure.
5. filters = 2, as in the example figure.
6. dropout = 0.5, because Hinton said so.
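The six settings above map to a Keras model of roughly ten lines. A sketch under those settings; the optimizer and sigmoid output layer are my own assumptions, not from the slides:

```python
from tensorflow.keras import layers, Model

max_len, vocab_size, embed_dim = 140, 60_000, 300  # points 1-3 above
region_sizes, n_filters = [2, 3, 4], 2             # points 4-5 (tiny, for the figure)

inp = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, embed_dim)(inp)  # trainable, or seeded pretrained
pooled = []
for k in region_sizes:
    conv = layers.Conv1D(n_filters, k, activation="relu")(emb)  # pattern finders
    pooled.append(layers.GlobalMaxPooling1D()(conv))            # best match per filter
x = layers.Dropout(0.5)(layers.Concatenate()(pooled))           # point 6
out = layers.Dense(1, activation="sigmoid")(x)                  # binary moderation label

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```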
22. Advanced trick: concatenate channels from multiple features
[Figure: Text_1 (ex: resume) → textcnn_vec1, Text_2 (ex: jobTitle) → textcnn_vec2; both feed an MLP, then the output stage: sigmoid, softmax, linear… according to your response type.]
23. Advanced trick: then concatenate regular features you love
[Figure: Text_1 (ex: resume) → textcnn_vec1, Text_2 (ex: jobTitle) → textcnn_vec2, plus Others (ex: sex/lang) → your fancy feature engineering → traditional features; all concatenated into an MLP, then the output stage: sigmoid, softmax, linear… according to your response type.]
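Both tricks can be sketched with the Keras functional API. The branch helper, the traditional-feature width, and the MLP size below are hypothetical choices, not the deck's real settings:

```python
from tensorflow.keras import layers, Model

def textcnn_branch(max_len, vocab_size=60_000, embed_dim=300,
                   region_sizes=(2, 3, 4), n_filters=100):
    # One TextCNN "channel": text in, fixed-size vector out.
    inp = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, embed_dim)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(n_filters, k, activation="relu")(emb))
              for k in region_sizes]
    return inp, layers.Concatenate()(pooled)

in1, vec1 = textcnn_branch(max_len=140)   # Text_1, e.g. resume
in2, vec2 = textcnn_branch(max_len=20)    # Text_2, e.g. jobTitle
in_extra = layers.Input(shape=(8,))       # traditional features (hypothetical width)

x = layers.Concatenate()([vec1, vec2, in_extra])
x = layers.Dense(64, activation="relu")(x)      # the MLP in the figure
out = layers.Dense(1, activation="sigmoid")(x)  # output stage per response type

model = Model([in1, in2, in_extra], out)
```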
25. Better features = Better performance
● Does much better than ngram
○ Embedding => better resolution than word level
○ Words with similar meanings also work
○ Feature extraction is included in supervised learning on your data
● No manual feature extraction
○ No need to reproduce feature extraction in the deploy (Java) domain.
○ Feature extraction from text data can be computationally expensive:
■ Dictionary-based features are slow for large sample sizes
■ Model-based features like NMF and LDA are super slow
● Pretrained embeddings give you a boost (word2vec, GloVe, FastText):
○ The idea is like transfer learning.
○ Your model can then fine-tune the embedding to fit your dataset.
26. Customizable / Reusable!
● Customize your model according to your data/purpose.
○ Switch the output layer / activation function for different response types/ranges.
○ Customize the architecture according to your data characteristics.
○ Mess around with hidden layers / dropout / different activation functions.
● Merits as an online model (SGD over BGD)
○ Sustainable model: old model + new samples = new model keeping the old experience!
○ Memory friendly: no need to load all samples in memory; choose a mini-batch size
fitting your usage.
● GPU speeds you up!
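The online-model merit can be sketched with `train_on_batch`: the same model object carries its old weights forward while only one mini-batch sits in memory. The tiny model and data below are toys, not the deck's real network:

```python
import numpy as np
from tensorflow.keras import layers, Sequential

# Sketch of the "sustainable model" idea: keep training the same model
# as new labeled mini-batches arrive; old weights are the starting point.
model = Sequential([layers.Input(shape=(3,)),
                    layers.Dense(4, activation="relu"),
                    layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="sgd", loss="binary_crossentropy")

rng = np.random.default_rng(0)
for _ in range(5):                             # new samples keep arriving
    x = rng.random((32, 3)).astype("float32")  # one mini-batch at a time
    y = (x.sum(axis=1) > 1.5).astype("float32")
    loss = model.train_on_batch(x, y)          # SGD step on top of old experience
```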
27. Deployment: TensorFlow Java API (doc)
1. Load the protobuf model
2. Feed in tokenized text
3. Get prediction results
28. No free lunch, it will cost you ...
● Big network => expensive computing power
○ The embedding layer is expensive, since it’s a huge fully-connected layer.
○ Reference: prediction throughput is ~750/sec for our deployed model (w/o GPU).
● Larger model
○ Both models we’re deploying are ~100 MB with 13M total/trainable parameters, while a simple
tree-based model can be < 10 MB.
○ In our case, it takes 4.5 hours to train 1 epoch on 1.2M reviews with 8 CPUs / 32 GB RAM.
● Solutions for throughput:
○ GPU acceleration
○ Predict in parallel in production
○ Use a coarse model to narrow down the search space; use the fine-grained model only to sort
promising candidates.
30. Preliminary
● User generated content (UGC) is a valuable asset at Acme.
● Bad UGC will ruin the user experience, or get us sued.
● Today we’re talking about the models moderating Reviews and Q&A.
32. Acme reviews classifier: class distribution
Somehow the old model tends to be overconfident, pushing samples toward 1/0.
33. Acme reviews classifier
Since the NN’s prediction distribution is smooth, the precision-recall curve thresholds are smoothly
scattered across the whole range, too.
34. Acme reviews classifier
● Target: auto-accept 80% of user content, since we don’t want humans moderating more.
● Currently NOT auto-rejecting anything, since stakeholders asked so.
New model: auto-accepting 80%, 82% class-1 precision
Old model: auto-accepting 80%, 74% class-1 precision
(0.82 - 0.74) / (1 - 0.74) ≈ 30% less bad content being approved
[Figure: class breakdown: Bad 33%, Good 66%]
35. Acme Q&A answers classifier
Old model: ROC AUC = 0.668
New model: ROC AUC = 0.844
36. Acme Q&A answers classifier
The old model’s predictions seem truncated for some reason.
37. Acme Q&A answers classifier
Since the NN’s prediction distribution is smooth, the precision-recall curve thresholds are smoothly
scattered across the whole range, too.
38. Acme Q&A answers classifier
● Target: auto-accept 80% of user content, since we don’t want humans moderating more.
● Currently NOT auto-rejecting anything, since stakeholders asked so.
New model: auto-accepting 68%, 90% class-1 precision
Old model: auto-accepting 18%, 83% class-1 precision
Auto-accepted volume up ~3.8x (18% → 68%); precision improved by 7 points
[Figure: class breakdown: Bad 30%, Good 70%]
39. Learning Curve (QnA Invalidator)
● LGB performance is stuck at 0.77 @ ~350k training samples
● TextCNN performance is stuck at 0.83 @ 1.2M training samples
● LSTM might still be growing past 0.83? But we don’t have more samples.