Reading Group
Deep Crossing: Web-Scale Modeling
without Manually Crafted
Combinatorial Features
Presenter: Xiang Zhang
6/03/2017
Main Content
1. Background
2. Abstract
3. Why Choose It
4. What's the Main Idea
5. How It Works
6. Implementation
7. Experimentation
8. Conclusions
9. Experience
1. Background
Deep Crossing: Web-Scale Modeling without
Manually Crafted Combinatorial Features
 Authors from Microsoft Research
 Keywords: Neural Networks, Deep Learning, CNN, etc.
 Conference: KDD 2016
1. Background
KDD (A*): Conference on Knowledge Discovery and Data Mining
SIGKDD: Special Interest Group on Knowledge Discovery in Data
KDD 2017: Aug. 13-17, Halifax, Canada
Submission deadline (DDL): December 9, 2016
2. Abstract
 Combinatorial features
Important and useful, but crafting them manually is time-consuming, requires experience, and gives no accuracy guarantee, especially at large scale with high variety and volume of features
 Contribution
A deep neural network that combines features automatically
 Tool
Computational Network Toolkit (CNTK), multi-GPU platform
3. Why Choose It
 Automatically combines features (reduces dimensionality)
 Web-scale (massive data)
 Features of different types and dimensions
 Better performance
4. What's the Main Idea
What is feature extraction?
 Individual features
An individual measurable property of a phenomenon being observed (a representation of the data)
 Combinatorial features
Defined in the joint space of individual features; they give the model shorter training time, a simpler structure, and better generalization
4. What's the Main Idea
 Manually:
• Time
• Experience
 Automatically: what Deep Crossing does
5. How It Works
Model Architecture
[Figure: the Deep Crossing architecture — individual features pass through Embedding and Stacking layers, then multiple Residual Units, then a Scoring layer.]
Embedding: X_j^O = max(0, W_j X_j^I + b_j),
where W_j is an m_j × n_j matrix, X_j^I is the n_j-dimensional input feature, and b_j and X_j^O are m_j-dimensional. Dimensionality is reduced when m_j < n_j.
5. How It Works
5.1 Embedding Layers
X_j^O = max(0, W_j X_j^I + b_j)
 Rectified linear unit (ReLU)
• Keeps the output elements non-negative
 Activation function: defines the output of a node given its input; common choices:
• ReLU
• Logistic
• Tanh
• Sigmoid
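A minimal NumPy sketch of one embedding layer (the paper's implementation is in CNTK; all names and dimensions here are illustrative):

import numpy as np

def embed(X_I, W, b):
    # Embedding layer: X_O = max(0, W @ X_I + b), i.e. a ReLU of an affine map
    return np.maximum(0.0, W @ X_I + b)

# Illustrative sizes: n_j = 10000 (sparse input), m_j = 256 (embedding width)
n_j, m_j = 10000, 256
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(m_j, n_j))  # m_j x n_j weight matrix
b = np.zeros(m_j)                            # m_j-dimensional bias
X_I = np.zeros(n_j)
X_I[42] = 1.0                                # one-hot individual feature
X_O = embed(X_I, W, b)                       # non-negative, 256-dimensional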
5. How It Works
5.2 Stacking Layer
The outputs of all features are stacked (concatenated) into one vector:
X^O = [X_0^O, X_1^O, ..., X_K^O]
Stacking rules (threshold: 256):
• If n_j > 256: embed to m_j = 256, then stack (embedding & stacking)
• If n_j ≤ 256: stack directly, without embedding

Features                      Number
Inputs                        K
Embedded then stacked         n
Stacked without embedding     K - n
Stacked in total              K
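A minimal sketch of the stacking rule under the 256 threshold above (the feature list and the embeddings dict are illustrative, not from the paper):

import numpy as np

THRESHOLD = 256  # features wider than this are embedded before stacking

def stack_features(features, embeddings):
    # features:   list of raw feature vectors X_j^I
    # embeddings: dict mapping j -> (W_j, b_j) for features with n_j > THRESHOLD
    parts = []
    for j, X_I in enumerate(features):
        if X_I.size > THRESHOLD:
            W, b = embeddings[j]
            parts.append(np.maximum(0.0, W @ X_I + b))  # embed to 256 dims
        else:
            parts.append(X_I)                           # stack as-is
    return np.concatenate(parts)  # X^O = [X_0^O, X_1^O, ..., X_K^O]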
5. How It Works
5.3 Residual Layers
X^O = F(X^I, {W_0, W_1}, {b_0, b_1}) + X^I
• Inputs and outputs have the same size
• The first use of residual units beyond image recognition
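A minimal sketch of one residual unit, assuming two affine layers with a ReLU in between and a ReLU after the shortcut addition (a common reading of the unit; weight names are illustrative):

import numpy as np

def residual_unit(X_I, W0, b0, W1, b1):
    # F(X_I, {W0, W1}, {b0, b1}): two layers mapping back to the input size
    h = np.maximum(0.0, W0 @ X_I + b0)
    F = W1 @ h + b1
    # Shortcut: add the input, so input and output sizes match
    return np.maximum(0.0, F + X_I)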
5. How It Works
5.4 Scoring Layer
Sigmoid function: sigma(x) = 1 / (1 + e^(-x))
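A minimal sketch of the scoring step, assuming a single linear output passed through the sigmoid (weight names are illustrative):

import numpy as np

def score(X, w, b):
    # Scoring layer: p = sigmoid(w . X + b), a probability in (0, 1)
    z = w @ X + b
    return 1.0 / (1.0 + np.exp(-z))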
5. How It Works
5.5 Objective Function
logloss = -(1/N) * Σ_{i=1}^{N} [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]
The objective is this log loss (or, equivalently, its negative for maximization)
N: number of samples
y_i: label of sample i
p_i: output of the model (prediction) for sample i
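The log loss above written out in NumPy; the eps clipping is an added numerical safeguard, not part of the paper:

import numpy as np

def log_loss(y, p, eps=1e-12):
    # logloss = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Example: log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])) ~= 0.228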
5. How It Works
5.6 Early Crossing vs. Late Crossing
Deep Crossing crosses features early: all inputs are stacked into a single vector before the residual layers. DSSM (Deep Semantic Similarity Model) crosses late: each input is processed by its own branch, and the branches are combined only at the end.
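A toy sketch of the contrast, assuming simple one-layer branches (the cosine combination in the late-crossing case mirrors DSSM-style models; all names are illustrative):

import numpy as np

def relu_layer(x, W, b):
    return np.maximum(0.0, W @ x + b)

# Early crossing (Deep Crossing): stack the inputs first, one shared network
def early_cross(a, b, W, bias):
    return relu_layer(np.concatenate([a, b]), W, bias)

# Late crossing (DSSM-style): separate branches, combined only at the end
def late_cross(a, b, Wa, ba, Wb, bb):
    ea = relu_layer(a, Wa, ba)
    eb = relu_layer(b, Wb, bb)
    return ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-12)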
6. Implementation
 Software
Computational Network Toolkit (CNTK)
Shares the same theoretical foundation as TensorFlow
 Hardware
Multi-GPU platform
Training time drops from 24 days (on 1 GPU) to 20 hours (on 32 GPUs)
7. Experimentation
7.1 Dataset
7. Experimentation
7.2 Performance on a Pair of Text Inputs
 DSSM: late crossing
 DC (Deep Crossing): early crossing
 DC outperforms DSSM
7. Experimentation
7.2 Performance on a Pair of Text Inputs
 Production model: the model used in sponsored search, taken as the baseline
 Performance: DSSM < DC < Production
 DC's main advantage: it can handle many individual features
7. Experimentation
7.3 Beyond Text Input
 Using all features works best
7. Experimentation
7.3 Beyond Text Input
 Using only the counting feature is weak
7. Experimentation
7.3 Beyond Text Input
 The counting feature is useful when combined with other features
7. Experimentation
7.3 Beyond Text Input
 Performance varies considerably with the number of features; log loss fluctuates widely across feature combinations, so feature selection is meaningful
7. Experimentation
7.4 Comparison with Production Models
 The production model is trained on 2.2 billion samples
 DC performs better with a much smaller dataset
 DC is easier to build and maintain
8. Conclusions
 Deep Crossing works well for automatic feature combination at large scale
 It requires less time and experience than manual feature crafting
9. Experience
• Deep learning models (LSTM, CNN, etc.) can extract features automatically; we could compare their efficiency against this model
• We could train on raw data instead of hand-picked individual features
• The approach could be applied in other domains, such as mobile sensing and recommender systems
Thanks!
Questions?