2. DARK KNOWLEDGE?
Geoff’s been busy at Google. Recently, he published a paper talking
about ‘Dark Knowledge’. Sounds creepy….
What problem is this referring to?
Model complexity with respect to deployment.
Ensembles (RF / GBM) and DNNs are slow to train and predict,
and require lots of memory (READ: $$$)
What’s the solution?
Train a simpler model that extracts the ‘dark knowledge’ from the DNN
(or ensemble) we want to mimic. The simpler model can then be
deployed at a cheaper ‘cost’.
3. WHY NOW?
CLEARLY, this is a good idea BUT why hasn’t there been more
investigation into an otherwise very promising approach?
Our Perception
We equate the knowledge in a trained model with its learned parameters
(i.e. weights)
…which is why we have trouble with this question:
How can you change the ‘form’ of the model but keep the same knowledge?
Answer: By using soft targets to train a simpler model to extract the
‘dark knowledge’ from the more complex model.
4. GAME-PLAN
1. Import Higgs-Boson Dataset (~11 million rows, ~5 GB)
2. Create FOUR splits of our dataset
3. Train a ‘cumbersome’ deep neural network
4. Predict targets for the transfer dataset; append them as ‘soft targets’ for
the distilled model.
5. Train ‘distilled’ model on soft targets to learn ‘Dark Knowledge’
6. Compare ‘distilled’ model vs. ‘cumbersome’ model on validation data
5. FOUR DATASETS?
The original Higgs-Boson Dataset: 11 million rows
Split this into…
1. data.train - 8.8 million rows (trains the ‘Cumbersome Model’)
2. data.test - 550k rows (tests the ‘Cumbersome Model’)
3. data.transfer - 1.1 million rows (trains the ‘Distilled Model’; labels = predictions from the ‘Cumbersome’ Model)
4. data.valid - 550k rows (model comparison)
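A minimal sketch of this import-and-split step using H2O’s Python API (the file path and the exact split ratios are assumptions, chosen only to roughly reproduce the row counts above):

import h2o

h2o.init()

# Assumed local path to the ~11-million-row Higgs dataset.
higgs = h2o.import_file("higgs.csv")

# ~80% / 5% / 10% / remaining 5%  =>  ~8.8M / 550k / 1.1M / 550k rows.
data_train, data_test, data_transfer, data_valid = higgs.split_frame(
    ratios=[0.80, 0.05, 0.10], seed=42)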
6. THE ‘CUMBERSOME’ NET
Inputs: 29 machine + human generated features
# of Layers: 3
# of Hidden Neurons: 1,024 / layer (3,072 total)
Activation Function: Rectifier w/ Dropout (default = 50%)
Input Dropout: 10%, L1-regularization: 0.0001
Total Training Time: 24 mins. 20 seconds
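A sketch of how a net like this could be set up with H2O’s Python deep learning estimator (the parameter names are real H2O options matching the spec above; the label column name ‘response’ and the feature list are assumptions following the split sketch):

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Assumed label column name; classification needs a categorical response.
data_train["response"] = data_train["response"].asfactor()
data_test["response"] = data_test["response"].asfactor()
features = [c for c in data_train.columns if c != "response"]

cumbersome = H2ODeepLearningEstimator(
    hidden=[1024, 1024, 1024],              # 3 layers x 1,024 neurons
    activation="RectifierWithDropout",      # rectifier w/ dropout
    hidden_dropout_ratios=[0.5, 0.5, 0.5],  # 50% hidden dropout (the default)
    input_dropout_ratio=0.1,                # 10% input dropout
    l1=1e-4)                                # L1-regularization = 0.0001
cumbersome.train(x=features, y="response",
                 training_frame=data_train, validation_frame=data_test)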
8. SOFT VS. HARD TARGETS
Hard Targets: The actual labels of the data (e.g. 1 if Higgs-Boson particle)
Soft Targets: The labels predicted by the cumbersome model, which will be
used to train the distilled model
[Diagram: the ‘cumbersome model’ predicts labels on the transfer dataset; these predictions (aka ‘soft’ targets) are used to train the ‘distilled model’.]
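A sketch of producing those soft targets (the p1 column name follows H2O’s usual binomial prediction output for a 0/1 response; using the class-1 probability as the soft target is an assumption about how the talk encoded them):

# The cumbersome model scores the transfer set.
preds = cumbersome.predict(data_transfer)      # columns: predict, p0, p1

# Append the class-1 probability as the 'soft target' label column.
data_transfer["soft_target"] = preds["p1"]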
9. TRAIN ‘DISTILLED’ NET
AFTER the cumbersome model predicts labels on the transfer data,
use these labels as ‘soft targets’ to train the distilled network
‘Cumbersome’ Net: 3 layers x 1,024 neurons / layer; Rectifier w/ Dropout; Input Dropout + L1-regular.
‘Distilled’ Net: 2 layers x 800 neurons / layer; Rectifier; No Input Dropout OR L1-regular.
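A sketch of the distilled net trained against that soft-target column (treating the soft target as a numeric response with a gaussian distribution is one plausible reading of the slide, not a confirmed detail of the talk):

distilled = H2ODeepLearningEstimator(
    hidden=[800, 800],            # 2 layers x 800 neurons
    activation="Rectifier",       # plain rectifier, no dropout
    input_dropout_ratio=0.0,      # no input dropout
    l1=0.0,                       # no L1-regularization
    distribution="gaussian")      # regress on the soft (probability) targets
distilled.train(x=features, y="soft_target", training_frame=data_transfer)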
11. THE REAL ACID TEST
So we have 2 models:
Cumbersome Model: Trained w/ DReD Net
Distilled Model: Trained w/ Soft Targets on Transfer Dataset
NOW, it’s time to score each model against the validation dataset
(which has hard targets)
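A sketch of that comparison (the confusion-matrix call for the cumbersome model is standard H2O API; the 0.5 cut-off used to turn the distilled net’s soft score back into a class, and the error-count bookkeeping, are assumptions):

# Distilled model: threshold its soft prediction at 0.5 (an assumed cut-off)
# and count disagreements with the numeric 0/1 hard labels.
dist_class = distilled.predict(data_valid)["predict"] > 0.5   # 0/1 column
print("Distilled errors:", (dist_class != data_valid["response"]).sum())

# Cumbersome model: H2O's own confusion matrix on the validation set.
data_valid["response"] = data_valid["response"].asfactor()
cumb_perf = cumbersome.model_performance(test_data=data_valid)
print(cumb_perf.confusion_matrix())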
12. NOT SHABBY…
Cumbersome Model Confusion Matrix: [not reproduced here]
Distilled Model Confusion Matrix: [not reproduced here]
A difference of 737 errors (!!)
13. WHAT NOW?
If you want to know more read:
“Distilling the Knowledge in a Neural Network” - G. Hinton, O. Vinyals, J. Dean
Alex: alex@h2o.ai Michal: michal@h2o.ai
Coming Soon: “The Hinton Trick” will be added to H2O’s algo
roadmap!
Next Test: Try some ensemble approaches (e.g. Random Forest,
Gradient Boosting Machine)
Result: Having learned the ‘dark knowledge’, our ‘simple’ net does a very
decent job compared to the complex net