2. DARK KNOWLEDGE?
Geoff’s been busy at Google. Recently, he published a paper talking
about ‘Dark Knowledge’. Sounds creepy….
What problem is this referring to?
Model complexity with respect to deployment.
Ensembles (RF / GBM) and DNNs are slow to train and predict,
and require lots of memory (READ: $$$)
What’s the solution?
Train a simpler model that extracts the ‘dark knowledge’ from the DNN
(or ensemble) we want to mimic. The simpler model can then be
deployed at a cheaper ‘cost’.
3. WHY NOW?
CLEARLY, this is a good idea BUT why hasn’t there been more
investigation into an otherwise very promising approach?
Our Perception
We equate the knowledge in a trained model with its learned parameters
(i.e. weights)
…which is why we have trouble with this question:
How can you change the ‘form’ of the model but keep the same knowledge?
Answer: By using soft targets to train a simpler model to extract the
‘dark knowledge’ from the more complex model.
4. GAME-PLAN
1. Import Higgs-Boson Dataset (~11 million rows, ~5 GB)
2. Create FOUR splits of our dataset
3. Train a ‘cumbersome’ deep neural network
4. Predict targets for the transfer dataset; append them as ‘soft targets’ for
the distilled model.
5. Train ‘distilled’ model on soft targets to learn ‘Dark Knowledge’
6. Compare ‘distilled’ model vs. ‘cumbersome’ model on validation data
5. FOUR DATASETS?
The original Higgs-Boson Dataset: 11 million rows
Split this into…
1. data.train - 8.8 million rows (trains the ‘Cumbersome Model’)
2. data.test - 550k rows (tests the ‘Cumbersome Model’)
3. data.transfer - 1.1 million rows (trains the ‘Distilled Model’; labels = predictions from the ‘Cumbersome’ Model)
4. data.valid - 550k rows (model comparison)
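A minimal sketch of this import-and-split step using H2O’s Python API (the file path and the exact split ratios are assumptions, chosen only to roughly reproduce the row counts above):

import h2o

h2o.init()

# Assumed local path to the ~11-million-row Higgs dataset.
higgs = h2o.import_file("higgs.csv")

# ~80% / 5% / 10% / remaining 5%  =>  ~8.8M / 550k / 1.1M / 550k rows.
data_train, data_test, data_transfer, data_valid = higgs.split_frame(
    ratios=[0.80, 0.05, 0.10], seed=42)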
6. THE ‘CUMBERSOME’ NET
Inputs: 29 machine + human generated features
# of Layers: 3
# of Hidden Neurons: 1,024 / layer (3,072 total)
Activation Function: Rectifier w/ Dropout (default = 50%)
Input Dropout: 10%, L1-regularization: 0.0001
Total Training Time: 24 mins. 20 seconds
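A sketch of how a net like this could be set up with H2O’s Python deep learning estimator (the parameter names are real H2O options matching the spec above; the label column name ‘response’ and the feature list are assumptions following the split sketch):

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Assumed label column name; classification needs a categorical response.
data_train["response"] = data_train["response"].asfactor()
data_test["response"] = data_test["response"].asfactor()
features = [c for c in data_train.columns if c != "response"]

cumbersome = H2ODeepLearningEstimator(
    hidden=[1024, 1024, 1024],              # 3 layers x 1,024 neurons
    activation="RectifierWithDropout",      # rectifier w/ dropout
    hidden_dropout_ratios=[0.5, 0.5, 0.5],  # 50% hidden dropout (the default)
    input_dropout_ratio=0.1,                # 10% input dropout
    l1=1e-4)                                # L1-regularization = 0.0001
cumbersome.train(x=features, y="response",
                 training_frame=data_train, validation_frame=data_test)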
8. SOFT VS. HARD TARGETS
Hard Targets: The actual labels of the data (e.g. 1 if Higgs-Boson particle)
Soft Targets: The labels predicted by the cumbersome model, which will be
used to train the distilled model
[Diagram: the ‘cumbersome model’ predicts labels on the transfer dataset; these predictions (aka ‘soft’ targets) are used to train the ‘distilled model’.]
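A sketch of producing those soft targets (the p1 column name follows H2O’s usual binomial prediction output for a 0/1 response; using the class-1 probability as the soft target is an assumption about how the talk encoded them):

# The cumbersome model scores the transfer set.
preds = cumbersome.predict(data_transfer)      # columns: predict, p0, p1

# Append the class-1 probability as the 'soft target' label column.
data_transfer["soft_target"] = preds["p1"]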
9. TRAIN ‘DISTILLED’ NET
AFTER the cumbersome model predicts labels on the transfer data,
use these labels as ‘soft targets’ to train the distilled network
‘Cumbersome’ Net: 3 layers x 1,024 neurons / layer; Rectifier w/ Dropout; Input Dropout + L1-regular.
‘Distilled’ Net: 2 layers x 800 neurons / layer; Rectifier; No Input Dropout OR L1-regular.
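A sketch of the distilled net trained against that soft-target column (treating the soft target as a numeric response with a gaussian distribution is one plausible reading of the slide, not a confirmed detail of the talk):

distilled = H2ODeepLearningEstimator(
    hidden=[800, 800],            # 2 layers x 800 neurons
    activation="Rectifier",       # plain rectifier, no dropout
    input_dropout_ratio=0.0,      # no input dropout
    l1=0.0,                       # no L1-regularization
    distribution="gaussian")      # regress on the soft (probability) targets
distilled.train(x=features, y="soft_target", training_frame=data_transfer)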
11. THE REAL ACID TEST
So we have 2 models:
Cumbersome Model: Trained w/ DReD Net
Distilled Model: Trained w/ Soft Targets on Transfer Dataset
NOW, it’s time to score each model against the validation dataset
(which has hard targets)
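A sketch of that comparison (the confusion-matrix call for the cumbersome model is standard H2O API; the 0.5 cut-off used to turn the distilled net’s soft score back into a class, and the error-count bookkeeping, are assumptions):

# Distilled model: threshold its soft prediction at 0.5 (an assumed cut-off)
# and count disagreements with the numeric 0/1 hard labels.
dist_class = distilled.predict(data_valid)["predict"] > 0.5   # 0/1 column
print("Distilled errors:", (dist_class != data_valid["response"]).sum())

# Cumbersome model: H2O's own confusion matrix on the validation set.
data_valid["response"] = data_valid["response"].asfactor()
cumb_perf = cumbersome.model_performance(test_data=data_valid)
print(cumb_perf.confusion_matrix())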
12. NOT SHABBY…
Cumbersome Model Confusion Matrix: [not reproduced here]
Distilled Model Confusion Matrix: [not reproduced here]
A difference of 737 errors (!!)
13. WHAT NOW?
If you want to know more read:
“Distilling the Knowledge in a Neural Network” - G. Hinton, O. Vinyals, J. Dean
Alex: alex@h2o.ai Michal: michal@h2o.ai
Coming Soon: “The Hinton Trick” will be added to H2O’s algo
roadmap!
Next Test: Try some ensemble approaches (e.g. Random Forest,
Gradient Boosting Machine)
Result: Having learned the ‘dark knowledge’, our ‘simple’ net does a very
decent job compared to the complex net