2. Outline
● Things you are interested in
○ TextCNN
○ Why you should try deep learning on text data
● On a real product
○ Acme review / Q&A moderation
○ Network spec
4. TextCNN: architecture
1. Architecture
a. Embedding
i. Trainable, or not (if using pretrained)
b. 1d convolution filters
i. Each conv “searches for this pattern”
ii. Pattern to be searched is trainable
iii. Hundreds to thousands of filters per region size
c. Maxpool results
i. Means “max density of certain pattern”
2. Keras code: just 10 lines
5. Let’s take an intuitive analogy!
(Keywords: Convolution, MaxPool)
6. 1. You want to classify whether a given animal is a Llama.
[Figure: “I’m a Llama” (sample L) / “I’m an Alpaca” (sample A)]
7. 2. So you pick out traits of a Llama as your filters.
[Figure: Traits (Filters)]
8. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 70%, 10%, 5%. Best match of the 1st trait is 70% (by MaxPool).]
9. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 10%, 80%, 60%, 15%. Best match of the 2nd trait is 80% (by MaxPool).]
10. 3. Then you run a convolution to find the best match for each trait.
[Figure: Llama traits (filters) scanning the sample by convolution; match scores 0%, 10%, 10%, 40%, 60%. Best match of the 3rd trait is 60% (by MaxPool).]
11. 4. Finally, you have a similarity score for each trait (features!)
[Figure: Llama traits (filters): 70%, 80%, 60%]
12. 5. Make the final decision (model) out of your features.
(In a neural network, a multi-layer perceptron is the simplest choice.)
[Figure: trait features 70%, 80%, 60% feeding any classifier you love!]
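The whole analogy fits in a few lines of Python: each trait (filter) yields a list of match scores, and MaxPool keeps only the best one per trait; the maxima become the feature vector. The scores below are the ones from the slides:

```python
import numpy as np

# Match scores each "trait" filter produced while sliding over the sample
# (numbers taken from the analogy slides above).
trait_scores = [
    np.array([0.00, 0.70, 0.10, 0.05]),        # 1st trait
    np.array([0.00, 0.10, 0.80, 0.60, 0.15]),  # 2nd trait
    np.array([0.00, 0.10, 0.10, 0.40, 0.60]),  # 3rd trait
]

# MaxPool: keep only the best match per trait -> one feature per filter.
features = [float(s.max()) for s in trait_scores]
print(features)  # [0.7, 0.8, 0.6]
```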
13. Conv & MaxPool
Now you’ve got the idea, let’s dive into a bit more detail.
17. TextCNN: what is a 1d convolution?
Let’s talk in convolutions!
10*3 + 20*0 + 30*1 + 40*2 = 140
10*0 + 20*0 + 30*0 + 40*0 = 0
10*0 + 20*2 + 30*0 + 40*0 = 40
10*0 + 20*0 + 30*0 + 40*0 = 0
Note: you can specify an activation function in Conv1D to adjust the output; people seem to use ReLU in TextCNN.
[Figure: filter sliding over the text data]
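The sliding sums above are just dot products. A minimal sketch, with an illustrative token sequence and filter weights [3, 0, 1, 2] chosen so the first window reproduces 10*3 + 20*0 + 30*1 + 40*2 = 140:

```python
import numpy as np

# A minimal sketch of the sliding dot product behind Conv1D.
# The data and filter values are illustrative only.
data = np.array([10, 20, 30, 40, 10, 20, 30, 40])
filt = np.array([3, 0, 1, 2])  # region size 4; in TextCNN these weights are trained

region = len(filt)
outputs = [int(data[i:i + region] @ filt) for i in range(len(data) - region + 1)]
print(outputs[0])  # 140: the filter's match score at the first position
```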
18. TextCNN: what is a 1d convolution?
1. The idea is similar to ngram-BoW/TF-IDF, but:
a. No exact match is needed; words with similar
meanings also contribute.
b. It is weighted across tokens/dimensions.
c. Maxpool lets only the best-matched pattern
have the final effect.
2. It brings
a. training the embedding, and
b. training the feature finder
into your supervised learning process,
dedicated to your data.
3. Meanwhile, most traditional feature finding & extraction
is actually an unsupervised learning process.
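Point 1a can be seen in a toy example. All embedding vectors below are made up for illustration:

```python
import numpy as np

# Toy illustration: a conv filter scores the dot product with a word's
# embedding, so any word whose embedding is close to the learned pattern
# fires it, not just an exact token match.
emb = {
    "great":    np.array([0.9, 0.1]),
    "awesome":  np.array([0.8, 0.2]),   # similar meaning, similar vector
    "terrible": np.array([-0.9, 0.1]),
}
pattern = np.array([1.0, 0.0])  # a (trained) filter of region size 1

scores = {word: float(vec @ pattern) for word, vec in emb.items()}
# "awesome" scores almost as high as "great", while an exact-match
# n-gram feature for "great" would miss it entirely.
```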
19. Now you know the details.
Let’s return to the full picture!
20. TextCNN: architecture
1. Architecture
a. Embedding
i. Trainable, or not (if using pretrained)
b. 1d convolution filters
i. Each “searches for this pattern”
ii. Pattern to be searched is trainable
iii. Hundreds to thousands of filters per region size
c. Maxpool results
i. Means “max density of certain pattern”
2. Keras code: just 10 lines
21. TextCNN: code detail
1. Input: 140 as the max word count per doc.
2. 60k for the English dictionary size.
3. 300 is the conventional embedding size.
4. region size = [2, 3, 4], as in the example figure.
5. filters = 2, as in the example figure.
6. dropout = 0.5, because Hinton said so.
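The six settings above map to a Keras model of roughly ten lines. A sketch under those settings; the optimizer and sigmoid output layer are my own assumptions, not from the slides:

```python
from tensorflow.keras import layers, Model

max_len, vocab_size, embed_dim = 140, 60_000, 300  # points 1-3 above
region_sizes, n_filters = [2, 3, 4], 2             # points 4-5 (tiny, for the figure)

inp = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, embed_dim)(inp)  # trainable, or seeded pretrained
pooled = []
for k in region_sizes:
    conv = layers.Conv1D(n_filters, k, activation="relu")(emb)  # pattern finders
    pooled.append(layers.GlobalMaxPooling1D()(conv))            # best match per filter
x = layers.Dropout(0.5)(layers.Concatenate()(pooled))           # point 6
out = layers.Dense(1, activation="sigmoid")(x)                  # binary moderation label

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```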
22. Advanced trick: concatenate channels from multiple features
[Figure: Text_1 (ex: resume) → textcnn_vec1, Text_2 (ex: jobTitle) → textcnn_vec2; both feed an MLP, then the output stage: sigmoid, softmax, linear… according to your response type.]
23. Advanced trick: then concatenate regular features you love
[Figure: Text_1 (ex: resume) → textcnn_vec1, Text_2 (ex: jobTitle) → textcnn_vec2, plus Others (ex: sex/lang) → your fancy feature engineering → traditional features; all concatenated into an MLP, then the output stage: sigmoid, softmax, linear… according to your response type.]
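Both tricks can be sketched with the Keras functional API. The branch helper, the traditional-feature width, and the MLP size below are hypothetical choices, not the deck's real settings:

```python
from tensorflow.keras import layers, Model

def textcnn_branch(max_len, vocab_size=60_000, embed_dim=300,
                   region_sizes=(2, 3, 4), n_filters=100):
    # One TextCNN "channel": text in, fixed-size vector out.
    inp = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, embed_dim)(inp)
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(n_filters, k, activation="relu")(emb))
              for k in region_sizes]
    return inp, layers.Concatenate()(pooled)

in1, vec1 = textcnn_branch(max_len=140)   # Text_1, e.g. resume
in2, vec2 = textcnn_branch(max_len=20)    # Text_2, e.g. jobTitle
in_extra = layers.Input(shape=(8,))       # traditional features (hypothetical width)

x = layers.Concatenate()([vec1, vec2, in_extra])
x = layers.Dense(64, activation="relu")(x)      # the MLP in the figure
out = layers.Dense(1, activation="sigmoid")(x)  # output stage per response type

model = Model([in1, in2, in_extra], out)
```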
25. Better features = Better performance
● Does much better than ngram
○ Embedding => better resolution than word level
○ Words with similar meanings also work
○ Feature extraction is included in supervised learning on your data
● No manual feature extraction
○ No need to reproduce feature extraction in the deploy (Java) domain.
○ Feature extraction from text data can be computationally expensive:
■ Dictionary-based features are slow for large sample sizes
■ Model-based features like NMF and LDA are super slow
● Pretrained embeddings give you a boost (word2vec, GloVe, FastText):
○ The idea is like transfer learning.
○ Your model can then fine-tune the embedding to fit your dataset.
26. Customizable / Reusable!
● Customize your model according to your data/purpose.
○ Switch the output layer / activation function for different response types/ranges.
○ Customize the architecture according to your data characteristics.
○ Mess around with hidden layers / dropout / different activation functions.
● Merits as an online model (SGD over BGD)
○ Sustainable model: old model + new samples = new model keeping the old experience!
○ Memory friendly: no need to load all samples in memory; choose a mini-batch size
fitting your usage.
● GPU speeds you up!
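The online-model merit can be sketched with `train_on_batch`: the same model object carries its old weights forward while only one mini-batch sits in memory. The tiny model and data below are toys, not the deck's real network:

```python
import numpy as np
from tensorflow.keras import layers, Sequential

# Sketch of the "sustainable model" idea: keep training the same model
# as new labeled mini-batches arrive; old weights are the starting point.
model = Sequential([layers.Input(shape=(3,)),
                    layers.Dense(4, activation="relu"),
                    layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="sgd", loss="binary_crossentropy")

rng = np.random.default_rng(0)
for _ in range(5):                             # new samples keep arriving
    x = rng.random((32, 3)).astype("float32")  # one mini-batch at a time
    y = (x.sum(axis=1) > 1.5).astype("float32")
    loss = model.train_on_batch(x, y)          # SGD step on top of old experience
```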
27. Deployment: TensorFlow Java API (doc)
1. Load the protobuf model
2. Feed in tokenized text
3. Get prediction results
28. No free lunch, it will cost you ...
● Big network => expensive computing power
○ The embedding layer is expensive, since it’s a huge fully-connected layer.
○ Reference: prediction throughput is ~750/sec for our deployed model (w/o GPU).
● Larger model
○ Both models we’re deploying are ~100 MB with 13M total/trainable parameters, while a simple
tree-based model can be < 10 MB.
○ In our case, it takes 4.5 hours to train 1 epoch on 1.2M reviews with 8 CPUs / 32 GB RAM.
● Solutions for throughput:
○ GPU acceleration
○ Predict in parallel in production
○ Use a coarse model to narrow down the search space; use the fine-grained model only to sort
promising candidates.
30. Preliminary
● User generated content (UGC) is a valuable asset at Acme.
● Bad UGC will ruin the user experience, or get us sued.
● Today we’re talking about the models moderating Reviews and Q&A.
32. Acme reviews classifier: class distribution
Somehow the old model tends to be overconfident, pushing samples toward 1/0.
33. Acme reviews classifier
Since the NN’s prediction distribution is smooth, the precision-recall curve thresholds are smoothly
scattered across the whole range, too.
34. Acme reviews classifier
● Target: auto-accept 80% of user content, since we don’t want humans moderating more.
● Currently NOT auto-rejecting anything, since stakeholders asked so.
New model: auto-accepting 80%, 82% class-1 precision
Old model: auto-accepting 80%, 74% class-1 precision
(0.82 - 0.74) / (1 - 0.74) ≈ 30% less bad content being approved
[Figure: class breakdown: Bad 33%, Good 66%]
35. Acme Q&A answers classifier
Old model: ROC AUC = 0.668
New model: ROC AUC = 0.844
36. Acme Q&A answers classifier
The old model’s predictions seem truncated for some reason.
37. Acme Q&A answers classifier
Since the NN’s prediction distribution is smooth, the precision-recall curve thresholds are smoothly
scattered across the whole range, too.
38. Acme Q&A answers classifier
● Target: auto-accept 80% of user content, since we don’t want humans moderating more.
● Currently NOT auto-rejecting anything, since stakeholders asked so.
New model: auto-accepting 68%, 90% class-1 precision
Old model: auto-accepting 18%, 83% class-1 precision
Auto-accepted volume up ~3.8x (18% → 68%); precision improved by 7 points
[Figure: class breakdown: Bad 30%, Good 70%]
39. Learning Curve (QnA Invalidator)
● LGB performance is stuck at 0.77 @ ~350k training samples
● TextCNN performance is stuck at 0.83 @ 1.2M training samples
● LSTM might still be growing past 0.83? But we don’t have more samples.