3. 1. Background
Paper: Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features
Authors: Microsoft Research
Keywords: neural networks, deep learning, CNN, etc.
Conference: KDD 2016
4. 1. Background
KDD (A*): Conference on Knowledge Discovery and Data Mining
SIGKDD: Special Interest Group on Knowledge Discovery in Data
KDD 2017: Aug. 13-17, Halifax, Canada
Submission deadline: December 9, 2016
5. 2. Abstract
Combining features
Important and useful, but manual crafting is time consuming, requires experience, and gives no accuracy guarantee, especially given the scale, variety, and volume of web features
Contribution
A deep neural network that combines features automatically
Tool
Computational Network Toolkit (CNTK), a multi-GPU platform
6. 3. Why choose it
Automatically combines features (reduces dimensionality)
Web-scale (massive data)
Handles features of different types and dimensions
Better performance
7. 4. What's the main idea
What is feature extraction?
Individual features
An individual measurable property of a phenomenon being observed (a representation of the data)
Combinatorial features
Defined in the joint space of individual features; they give the model shorter training time, a simpler structure, and better generalization
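To make "joint space of individual features" concrete, here is a minimal Python sketch of a hand-crafted combinatorial (cross) feature; the language values and the query/ad naming are made-up placeholders, not from the paper. Deep Crossing aims to learn such crossings automatically instead.

from itertools import product

languages = ["en", "fr", "de"]

# Enumerate the joint space: one indicator per (query_lang, ad_lang) pair.
# This is the kind of manually crafted combinatorial feature the paper avoids.
joint_space = {pair: idx for idx, pair in enumerate(product(languages, languages))}

def cross_feature(query_lang: str, ad_lang: str) -> list:
    """One-hot encoding of the combinatorial feature in the joint space."""
    vec = [0] * len(joint_space)
    vec[joint_space[(query_lang, ad_lang)]] = 1
    return vec

print(cross_feature("en", "fr"))  # 9-dimensional one-hot vector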
10. 5. How it works
5.1 Embedding layers
$X_j^O = \max(0, W_j X_j^I + b_j)$
$W_j$: $m_j \times n_j$ weight matrix; $X_j^I$: $n_j \times 1$ input; $b_j$: $m_j \times 1$ bias; $X_j^O$: $m_j \times 1$ output
With $m_j < n_j$, the embedding reduces the dimension of feature $j$
11. 5. How it works
5.1 Embedding layers
$X_j^O = \max(0, W_j X_j^I + b_j)$
Rectified linear unit (ReLU)
• Keeps all elements non-negative
An activation function defines the output of a node given its input; common choices:
• ReLU
• Logistic (sigmoid)
• Tanh
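A minimal NumPy sketch of this embedding layer, directly implementing the formula above (an illustration, not the paper's CNTK code); the sizes n_j and m_j below are assumed values for demonstration.

import numpy as np

def embedding_layer(x_j, W_j, b_j):
    """X_j^O = max(0, W_j X_j^I + b_j): per-feature linear map followed by ReLU."""
    return np.maximum(0.0, W_j @ x_j + b_j)

# Assumed sizes: a 10,000-dim sparse feature embedded down to 256
# dimensions (256 is the paper's stacking threshold).
n_j, m_j = 10_000, 256
rng = np.random.default_rng(0)
W_j = rng.normal(scale=0.01, size=(m_j, n_j))  # m_j x n_j weight matrix
b_j = np.zeros(m_j)                            # m_j x 1 bias
x_j = np.zeros(n_j); x_j[42] = 1.0             # one-hot input feature

print(embedding_layer(x_j, W_j, b_j).shape)    # (256,): dimension reduced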
13. 5. How it works
5.2 Stacking layers
$X^O = [X_0^O, X_1^O, \ldots, X_K^O]$
Stacking rules (threshold: 256):
• If $n_j > 256$: embedding, then stacking (set $m_j = 256$)
• If $n_j \le 256$: stacking without embedding
Feature                     Number
Inputs                      K
Embedding then stacking     n
Stacking (non-embedding)    K - n
Stacked in total            K
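A sketch of this stacking rule in NumPy; the 256 threshold is from the slide, while the feature widths and weights below are made-up placeholders. Features wider than 256 are embedded first, the rest pass through, and everything is concatenated into one vector.

import numpy as np

THRESHOLD = 256  # stacking threshold from the slide

def embed(x, W, b):
    return np.maximum(0.0, W @ x + b)

def stack_features(features, params):
    """X^O = [X_0^O, ..., X_K^O]: embed features with n_j > 256,
    pass the rest through unchanged, then concatenate all outputs."""
    outputs = []
    for j, x in enumerate(features):
        if x.shape[0] > THRESHOLD:      # embedding then stacking
            W, b = params[j]
            outputs.append(embed(x, W, b))
        else:                           # stacking without embedding
            outputs.append(x)
    return np.concatenate(outputs)

# Assumed toy widths: one wide feature (embedded) and one narrow feature.
rng = np.random.default_rng(0)
f0 = rng.random(5_000)                  # n_0 > 256 -> embedded to 256
f1 = rng.random(13)                     # n_1 <= 256 -> stacked directly
params = {0: (rng.normal(scale=0.01, size=(256, 5_000)), np.zeros(256))}
print(stack_features([f0, f1], params).shape)  # (269,)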
14. 5. How it works
5.3 Residual layers
$X^O = \mathcal{F}(X^I, \{W_0, W_1\}, \{b_0, b_1\}) + X^I$
• Inputs and outputs have the same size
• This is the first use of the residual unit beyond image recognition
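A NumPy sketch of the residual unit above: two layers for F plus the identity shortcut. The layer sizes and random weights are placeholder assumptions for illustration.

import numpy as np

def residual_unit(x, W0, b0, W1, b1):
    """X^O = F(X^I, {W0, W1}, {b0, b1}) + X^I.
    F is two linear layers with an inner ReLU; adding X^I back
    requires input and output to have the same size."""
    h = np.maximum(0.0, W0 @ x + b0)   # first layer + ReLU
    f = W1 @ h + b1                    # second layer (pre-activation)
    return np.maximum(0.0, f + x)      # add the shortcut, then ReLU

# Assumed sizes: 512-dim input/output, 256-dim internal layer.
d, h = 512, 256
rng = np.random.default_rng(0)
x = rng.random(d)
out = residual_unit(x,
                    rng.normal(scale=0.01, size=(h, d)), np.zeros(h),
                    rng.normal(scale=0.01, size=(d, h)), np.zeros(d))
print(out.shape)  # (512,): same size as the input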
16. 5. How it works
5.5 Objective function
$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$
The objective function is the loss function (or its negative)
$N$: number of samples
$y_i$: sample label
$p_i$: model output (prediction)
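The same log loss as a short NumPy sketch; the clipping constant eps is an assumption added for numerical safety, not something from the paper.

import numpy as np

def log_loss(y, p, eps=1e-15):
    """logloss = -(1/N) * sum(y_i*log(p_i) + (1-y_i)*log(1-p_i))."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0); eps is an assumption
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy example: labels y_i and model predictions p_i.
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.6, 0.1])
print(log_loss(y, p))  # ~0.24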
17. 5. How it works
5.6 Early Crossing vs. Late Crossing
Deep Crossing vs. DSSM (Deep Semantic Similarity Model)
Deep Crossing crosses features early: all inputs are stacked into one vector before the residual layers. DSSM crosses late: each input passes through its own branch, and the two are only combined at the end.
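A structural sketch of the contrast; the layer sizes and random weights are made-up placeholders, and the tiny MLP helper stands in for the real residual/branch layers of either model.

import numpy as np

def mlp(x, shapes, rng):
    """Tiny helper: random-weight ReLU MLP, a placeholder for real layers."""
    for d_in, d_out in shapes:
        x = np.maximum(0.0, rng.normal(scale=0.01, size=(d_out, d_in)) @ x)
    return x

rng = np.random.default_rng(0)
query, keyword = rng.random(128), rng.random(128)

# Early crossing (Deep Crossing style): stack first, then shared deep layers.
stacked = np.concatenate([query, keyword])           # features cross here
early_score = mlp(stacked, [(256, 64), (64, 1)], rng)

# Late crossing (DSSM style): separate branches, combined only at the end
# via cosine similarity.
q_vec = mlp(query, [(128, 64)], rng)
k_vec = mlp(keyword, [(128, 64)], rng)
late_score = q_vec @ k_vec / (np.linalg.norm(q_vec) * np.linalg.norm(k_vec) + 1e-12)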
21. 7. Experimentation
7.2 Performance on a Pair of Text Inputs
Production model: the model used in sponsored search, taken as the baseline
Performance: DSSM < Deep Crossing < production model
Deep Crossing's main advantage: it can handle many individual features
25. 7. Experimentation
7.3 Beyond Text Input
Performance changes considerably with the number of features, and log loss fluctuates strongly across different feature combinations, so feature selection is meaningful
26. 7. Experimentation
7.4 Comparison with Production Models
2.2 billion samples
Deep Crossing performs better with a much smaller dataset
Deep Crossing is easier to build and maintain
27. 8. Conclusions
Deep Crossing works well for automatic feature combination at large scale
It requires less time and experience
28. 9. Experience
• Deep learning models (LSTM, CNN, etc.) can extract features automatically; we can compare their efficiency against this model
• We could train on raw data instead of individual features
• The approach could be applied in other domains such as mobile sensing and recommender systems