Reading Group
Deep Crossing: Web-Scale Modeling
without Manually Crafted
Combinatorial Features
Presenter: Xiang Zhang
6/03/2017
Main Content
1. Background
2. Abstract
3. Why Choose It
4. What's the Main Idea
5. How It Works
6. Implementation
7. Experimentation
8. Conclusions
9. Experience
1. Background
Deep Crossing: Web-Scale Modeling without
Manually Crafted Combinatorial Features
 Authors from Microsoft Research
 Keywords: Neural Networks, Deep Learning, CNN, etc.
 Conference: KDD 2016
1. Background
KDD (A*): Conference on Knowledge Discovery and Data Mining
SIGKDD: Special Interest Group on Knowledge Discovery in Data
KDD 2017: Aug. 13-17, Halifax, Canada
Submission deadline (DDL): December 9, 2016
2. Abstract
 Combinatorial features
Important and useful, but crafting them manually is time-consuming, requires experience, and gives no accuracy guarantee, especially at large scale with high variety and volume of features
 Contribution
A deep neural network that combines features automatically
 Tool
Computational Network Toolkit (CNTK), multi-GPU platform
3. Why Choose It
 Automatically combines features (reduces dimensionality)
 Web-scale (massive data)
 Features of different types and dimensions
 Better performance
4. What's the Main Idea
What is feature extraction?
 Individual features
An individual measurable property of a phenomenon being observed (a representation of the data)
 Combinatorial features
Defined in the joint space of individual features; they give the model shorter training time, a simpler structure, and better generalization
4. What's the Main Idea
 Manually:
• Time
• Experience
 Automatically: what Deep Crossing does
5. How It Works
Model Architecture
[Figure: the Deep Crossing architecture — individual features pass through Embedding and Stacking layers, then multiple Residual Units, then a Scoring layer.]
Embedding: X_j^O = max(0, W_j X_j^I + b_j),
where W_j is an m_j × n_j matrix, X_j^I is the n_j-dimensional input feature, and b_j and X_j^O are m_j-dimensional. Dimensionality is reduced when m_j < n_j.
5. How It Works
5.1 Embedding Layers
X_j^O = max(0, W_j X_j^I + b_j)
 Rectified linear unit (ReLU)
• Keeps the output elements non-negative
 Activation function: defines the output of a node given its input; common choices:
• ReLU
• Logistic
• Tanh
• Sigmoid
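A minimal NumPy sketch of one embedding layer (the paper's implementation is in CNTK; all names and dimensions here are illustrative):

import numpy as np

def embed(X_I, W, b):
    # Embedding layer: X_O = max(0, W @ X_I + b), i.e. a ReLU of an affine map
    return np.maximum(0.0, W @ X_I + b)

# Illustrative sizes: n_j = 10000 (sparse input), m_j = 256 (embedding width)
n_j, m_j = 10000, 256
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(m_j, n_j))  # m_j x n_j weight matrix
b = np.zeros(m_j)                            # m_j-dimensional bias
X_I = np.zeros(n_j)
X_I[42] = 1.0                                # one-hot individual feature
X_O = embed(X_I, W, b)                       # non-negative, 256-dimensional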
5. How It Works
5.2 Stacking Layer
The outputs of all features are stacked (concatenated) into one vector:
X^O = [X_0^O, X_1^O, ..., X_K^O]
Stacking rules (threshold: 256):
• If n_j > 256: embed to m_j = 256, then stack (embedding & stacking)
• If n_j ≤ 256: stack directly, without embedding

Features                      Number
Inputs                        K
Embedded then stacked         n
Stacked without embedding     K - n
Stacked in total              K
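A minimal sketch of the stacking rule under the 256 threshold above (the feature list and the embeddings dict are illustrative, not from the paper):

import numpy as np

THRESHOLD = 256  # features wider than this are embedded before stacking

def stack_features(features, embeddings):
    # features:   list of raw feature vectors X_j^I
    # embeddings: dict mapping j -> (W_j, b_j) for features with n_j > THRESHOLD
    parts = []
    for j, X_I in enumerate(features):
        if X_I.size > THRESHOLD:
            W, b = embeddings[j]
            parts.append(np.maximum(0.0, W @ X_I + b))  # embed to 256 dims
        else:
            parts.append(X_I)                           # stack as-is
    return np.concatenate(parts)  # X^O = [X_0^O, X_1^O, ..., X_K^O]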
5. How It Works
5.3 Residual Layers
X^O = F(X^I, {W_0, W_1}, {b_0, b_1}) + X^I
• Inputs and outputs have the same size
• The first use of residual units beyond image recognition
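A minimal sketch of one residual unit, assuming two affine layers with a ReLU in between and a ReLU after the shortcut addition (a common reading of the unit; weight names are illustrative):

import numpy as np

def residual_unit(X_I, W0, b0, W1, b1):
    # F(X_I, {W0, W1}, {b0, b1}): two layers mapping back to the input size
    h = np.maximum(0.0, W0 @ X_I + b0)
    F = W1 @ h + b1
    # Shortcut: add the input, so input and output sizes match
    return np.maximum(0.0, F + X_I)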
5. How It Works
5.4 Scoring Layer
Sigmoid function: sigma(x) = 1 / (1 + e^(-x))
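A minimal sketch of the scoring step, assuming a single linear output passed through the sigmoid (weight names are illustrative):

import numpy as np

def score(X, w, b):
    # Scoring layer: p = sigmoid(w . X + b), a probability in (0, 1)
    z = w @ X + b
    return 1.0 / (1.0 + np.exp(-z))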
5. How It Works
5.5 Objective Function
logloss = -(1/N) * Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
The objective is this log loss (or, equivalently, its negative for maximization)
N: number of samples
y_i: label of sample i
p_i: output of the model (prediction) for sample i
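The log loss above written out in NumPy; the eps clipping is an added numerical safeguard, not part of the paper:

import numpy as np

def log_loss(y, p, eps=1e-12):
    # logloss = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Example: log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])) ~= 0.228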
5. How It Works
5.6 Early Crossing vs. Late Crossing
Deep Crossing crosses features early: all inputs are stacked into a single vector before the residual layers. DSSM (Deep Semantic Similarity Model) crosses late: each input is processed by its own branch, and the branches are combined only at the end.
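A toy sketch of the contrast, assuming simple one-layer branches (the cosine combination in the late-crossing case mirrors DSSM-style models; all names are illustrative):

import numpy as np

def relu_layer(x, W, b):
    return np.maximum(0.0, W @ x + b)

# Early crossing (Deep Crossing): stack the inputs first, one shared network
def early_cross(a, b, W, bias):
    return relu_layer(np.concatenate([a, b]), W, bias)

# Late crossing (DSSM-style): separate branches, combined only at the end
def late_cross(a, b, Wa, ba, Wb, bb):
    ea = relu_layer(a, Wa, ba)
    eb = relu_layer(b, Wb, bb)
    return ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12)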
6. Implementation
 Software
Computational Network Toolkit (CNTK)
Shares the same theoretical foundation as TensorFlow
 Hardware
Multi-GPU platform
Training time drops from 24 days (on 1 GPU) to 20 hours (on 32 GPUs)
7. Experimentation
7.1 Dataset
7. Experimentation
7.2 Performance on a Pair of Text Inputs
 DSSM: late crossing
 DC (Deep Crossing): early crossing
 DC outperforms DSSM
7. Experimentation
7.2 Performance on a Pair of Text Inputs
 Production model: the model used in sponsored search, taken as the baseline
 Performance: DSSM < DC < Production
 DC's main advantage: it can handle many individual features
7. Experimentation
7.3 Beyond Text Input
 Using all features works best
7. Experimentation
7.3 Beyond Text Input
 Using only the counting feature is weak
7. Experimentation
7.3 Beyond Text Input
 The counting feature is useful when combined with other features
7. Experimentation
7.3 Beyond Text Input
 Performance varies considerably with the number of features; log loss fluctuates widely across feature combinations, so feature selection is meaningful
7. Experimentation
7.4 Comparison with Production Models
 The production model is trained on 2.2 billion samples
 DC performs better with a much smaller dataset
 DC is easier to build and maintain
8. Conclusions
 Deep Crossing works well for automatic feature combination at large scale
 It requires less time and experience than manual feature crafting
9. Experience
• Deep learning models (LSTM, CNN, etc.) can extract features automatically; we could compare their efficiency against this model
• We could train on raw data instead of hand-picked individual features
• The approach could be applied in other domains, such as mobile sensing and recommender systems
Thanks!
Questions?