1
ICLR 2020 Recap
Selected Paper summaries and discussions
Sanyam Bhutani
ML Engineer & AI Content Creator
bhutanisanyam1
🎙: ctdsshow
Democratizing AI
Our mission to use AI for good permeates everything we do.
AI Transformation (Trusted Partner): Bringing AI to industry by helping companies transform their businesses with H2O.ai.
AI4Good (Impact/Social): Bringing AI to impact by augmenting non-profits and social ventures with technological resources and capabilities.
Open Source (Community): An industry leader in providing open-source, cutting-edge AI & ML platforms (H2O-3).
3
We are Established: Founded in Silicon Valley in 2012. Funding: $147M, Series D. Investors: Goldman Sachs, Ping An, Wells Fargo, NVIDIA, Nexus Ventures.
We Make World-class AI Platforms: H2O Open Source Machine Learning; H2O Driverless AI: Automatic Machine Learning; H2O Q: AI platform for business users.
We are Global: Mountain View, NYC, London, Paris, Ottawa, Prague, Chennai, Singapore.
1K Universities
20K Companies Using H2O Open Source
180K Meetup Members
220+ Experts
H2O.ai Snapshot
We are Passionate about Customers
4x customer growth in 2 years, across all industries and all continents
Aetna/CVS, Allergan, AT&T, Capital One, CBA, Citi, Coca-Cola, Bradesco, Dish, Disney, Franklin Templeton, Genentech, Kaiser Permanente, Lego, Merck, Pepsi, Reckitt Benckiser, Roche
4
Our Team is Made up of the World’s Leading Data Scientists
Your projects are backed by 10% of the world’s data science Grandmasters, who are relentless in solving your critical problems.
Make Your Company an AI Company
ICLR 2020
What is ICLR?
7
AGENDA
• What is ICLR?
• Paper Selection
• 8 Paper Summaries
• Q & A
8
9
Paper Summaries
• GAN-related use cases
• Deployment discussions
• Adversarial attacks
• Sesame Street (Transformers)
10
The Cutting edge of DL is about Engineering
- Jeremy Howard
11
U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation
- Junho Kim et al
12
• Image-to-Image Translation:
- Selfie2Anime
- Horse2Zebra
- Dog2Cat
- Photo2VanGogh
• A method for unsupervised image-to-image translation
• Attention! (Attention is all you need)
• Adaptive Layer-Instance Normalisation (AdaLIN)
U-GAT-IT
13
Architecture
• Appreciating the problem
• Attention! (Attention is all you need)
• Adaptive Layer-Instance Normalisation (AdaLIN); a sketch follows below
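The slides describe AdaLIN at a high level only; the following is a minimal PyTorch sketch of the normalisation as described in the U-GAT-IT paper. The class and variable names are my own, and gamma/beta are assumed to be produced elsewhere in the generator (the paper derives them from attention features via fully connected layers).

```python
# A minimal sketch of Adaptive Layer-Instance Normalization (AdaLIN);
# names and the 0.9 initial value of rho are assumptions for illustration.
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # rho interpolates between Instance Norm and Layer Norm; learned per channel.
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance Norm statistics: per sample, per channel (over H, W).
        in_mean = x.mean(dim=[2, 3], keepdim=True)
        in_var = x.var(dim=[2, 3], keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer Norm statistics: per sample (over C, H, W).
        ln_mean = x.mean(dim=[1, 2, 3], keepdim=True)
        ln_var = x.var(dim=[1, 2, 3], keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        # Adaptive mix, then scale/shift with gamma and beta supplied by the generator.
        rho = self.rho.clamp(0.0, 1.0)
        x_hat = rho * x_in + (1.0 - rho) * x_ln
        return x_hat * gamma.view(x.size(0), -1, 1, 1) + beta.view(x.size(0), -1, 1, 1)

x = torch.randn(2, 64, 8, 8)
gamma, beta = torch.ones(2, 64), torch.zeros(2, 64)
print(AdaLIN(64)(x, gamma, beta).shape)  # torch.Size([2, 64, 8, 8])
```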
14
• Using attention to guide different geometric transforms
• Introduction of a new normalising function
• Image-to-image translation (and backwards!)
To Summarise
15
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
- Dan Hendrycks et al
16
• Why do you need image augmentations?
• Train and test splits should come from similar distributions
• Comparison of recent techniques
• Why is AugMix promising?
Image Augmentations
17
How does it work?
18
• AugMix mixes augmented images and enforces consistent embeddings of the augmented images, which results in increased robustness and improved uncertainty calibration (sketch below)
• AutoAugment
• AugMix does not require tuning to work correctly, enabling plug-and-play data augmentation
To Summarise
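As a rough illustration of the mixing procedure summarised above, here is a toy numpy sketch. The augmentation operations are simple stand-ins (the real implementation uses PIL-based ops), and the Jensen-Shannon consistency loss used during training is not shown.

```python
# A toy sketch of AugMix-style mixing, assuming a float image in [0, 1].
import numpy as np

def toy_ops():
    # Stand-ins for the real augmentation operations (rotate, posterize, ...).
    return [
        lambda img: np.roll(img, shift=2, axis=0),   # crude "translate"
        lambda img: np.clip(img * 1.2, 0.0, 1.0),    # crude "brightness"
        lambda img: img[:, ::-1],                    # horizontal mirror
    ]

def augmix(image, width=3, depth=2, alpha=1.0, rng=np.random.default_rng(0)):
    ops = toy_ops()
    w = rng.dirichlet([alpha] * width)   # mixing weights across augmentation chains
    m = rng.beta(alpha, alpha)           # skip-connection weight with the original
    mix = np.zeros_like(image)
    for i in range(width):
        chained = image.copy()
        for _ in range(depth):           # apply a short random chain of ops
            chained = ops[rng.integers(len(ops))](chained)
        mix += w[i] * chained
    # Convex combination of the original image and the mixed augmentations.
    return m * image + (1.0 - m) * mix

img = np.random.default_rng(1).random((32, 32, 3))
print(augmix(img).shape)  # (32, 32, 3)
```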
19
ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators
- Kevin Clark et al
20
21
• Progress in NLP, measured by GLUE score
• What is the GLUE score?
Pre-Training Progress
22
23
• Progress in NLP, measured by GLUE score
• What is the GLUE score?
• Normalised by pre-training FLOPs
Pre-Training Progress
24
• The BERT family uses MLM (masked language modelling)
• Suggested: a bidirectional model that learns from all of the tokens rather than only the masked percentage
Masked LM & ELECTRA
25
• The BERT family uses MLM (masked language modelling)
• Suggested: a bidirectional model that learns from all of the tokens rather than only the masked percentage
Masked LM & ELECTRA
26
• The BERT family uses MLM (masked language modelling)
• Suggested: a bidirectional model that learns from all of the tokens rather than only the masked percentage
Masked LM & ELECTRA
ELECTRA pre-training outperforms MLM pre-training
27
• Replaced token detection: a new self-supervised task for language representation learning (toy sketch below)
• Trains a text encoder to distinguish input tokens from high-quality negative samples produced by a small generator network
• Works well even with relatively small amounts of compute
• 45x/8x training/inference speedup compared to BERT-Base
To Summarise
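To make the replaced-token-detection objective concrete, here is a toy sketch of how discriminator training examples could be built. The small generator is faked with random sampling, and all names are hypothetical; the point is that every token gets a label, so the model learns from all positions rather than only the masked ones.

```python
# Toy construction of ELECTRA-style replaced-token-detection examples.
import random

def make_rtd_example(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            # In ELECTRA a small generator proposes a plausible replacement;
            # here a random vocabulary item is a stand-in.
            replacement = rng.choice(vocab)
            corrupted.append(replacement)
            # Label 1 only if the token actually changed.
            labels.append(int(replacement != tok))
        else:
            corrupted.append(tok)
            labels.append(0)  # original token
    return corrupted, labels

tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "ate", "meal", "cooked", "painter", "car"]
print(make_rtd_example(tokens, vocab))
```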
28
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Zhenzhong Lan et al
29
• At some point, further increases in model size become harder due to GPU/TPU memory limitations
• Is getting better NLP models as easy as training larger models?
• How can we reduce parameters?
Introduction
30
• Token embeddings are sparsely populated -> reduce their size with factorised projections (sketch below)
• Re-use parameters across repeated layers
Proposed Changes
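A minimal PyTorch sketch of the two reductions listed above, assuming a toy configuration; the class name and dimensions are illustrative, not ALBERT's actual implementation.

```python
# Factorized embeddings (V x E then E -> H) plus cross-layer parameter sharing.
import torch
import torch.nn as nn

class TinyAlbertEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        # Factorized embedding: a small V x E table followed by a projection E -> H,
        # instead of a single V x H matrix as in BERT.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.embed_proj = nn.Linear(embed_dim, hidden_dim)
        # One transformer layer whose weights are re-used at every depth.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        h = self.embed_proj(self.token_embed(token_ids))
        for _ in range(self.num_layers):   # the same parameters applied repeatedly
            h = self.shared_layer(h)
        return h

x = torch.randint(0, 30000, (2, 16))
print(TinyAlbertEncoder()(x).shape)  # torch.Size([2, 16, 768])
```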
31
• Sentence Order Prediction (SOP) for capturing inter-sentence coherence (toy example below)
• Remove dropout!
• Adding more data increases performance
Three More Tricks!
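For illustration, a toy sketch of how Sentence Order Prediction examples might be constructed: positives are consecutive segments in order, negatives are the same segments swapped. The names and the 50/50 split are assumptions, not ALBERT's data pipeline.

```python
# Toy Sentence Order Prediction (SOP) example construction.
import random

def make_sop_pairs(segments, rng=random.Random(0)):
    pairs = []
    for a, b in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            pairs.append(((a, b), 1))   # correct order
        else:
            pairs.append(((b, a), 0))   # swapped order
    return pairs

doc = ["He opened the fridge.", "It was empty.", "So he ordered pizza."]
for pair, label in make_sop_pairs(doc):
    print(label, pair)
```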
32
Once-for-All: Train One Network and Specialize It for Efficient Deployment
- Han Cai et al
33
• Efficient deployment of DL models across devices
• Conventional approach: train specialised models (think SqueezeNet, MobileNet, etc.)
• Training costs $$$, engineering costs $$$
Introduction
34
• Train once, specialise for deployment
• Key idea: decouple model training from architecture search (toy sketch below)
• Algorithm proposed: progressive shrinking
Proposed Approach
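The following toy PyTorch sketch illustrates the train-once / sample-sub-networks idea with a single layer of elastic width. It is not the paper's progressive-shrinking algorithm, just the general shape of it: one shared set of weights is trained while progressively allowing smaller sub-networks to be sampled.

```python
# Toy "supernet" with elastic width; sub-networks share slices of one weight matrix.
import random
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    def __init__(self, in_dim=64, max_out=256):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x, out_dim):
        # A sub-network simply uses the first `out_dim` output units.
        return x @ self.weight[:out_dim].t() + self.bias[:out_dim]

layer = ElasticLinear()
widths = [256]                      # start with the largest network only
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
for step in range(30):
    if step == 10:
        widths += [192, 128]        # progressively allow smaller widths
    if step == 20:
        widths += [64]
    w = random.choice(widths)       # sample one sub-network per step
    x = torch.randn(8, 64)
    loss = layer(x, w).pow(2).mean()  # placeholder loss, just to show the loop
    opt.zero_grad()
    loss.backward()
    opt.step()
```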
35
• Train one once-for-all network, then specialise sub-networks for efficient deployment across devices
• Progressive shrinking decouples training cost from the number of deployment scenarios
To Summarise
36
37
Thieves on Sesame Street! Model Extraction of BERT-based APIs
- Kalpesh Krishna et al
38
• Query the model with random sentences to probe its behaviour
• After a large number of queries, you have labels and a dataset (toy sketch below)
• Note: these attacks are economically practical (cheaper than trying to train a model yourself)
• Note 2: this is not model distillation, it's IP theft
Attacks
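A toy sketch of the extraction recipe described above: query a black-box classifier with random word-salad inputs and keep its predictions as a training set for your own copy. The victim API here is a fake stand-in function; in the paper the attacker then fine-tunes their own BERT on the harvested query/label pairs.

```python
# Toy model-extraction data collection against a fake sentiment "API".
import random

VOCAB = ["movie", "terrible", "great", "plot", "acting", "boring", "loved"]

def victim_api(text):
    # Stand-in for the victim model (a BERT-based API in the paper).
    return "pos" if ("great" in text or "loved" in text) else "neg"

def extract_dataset(n_queries=1000, rng=random.Random(0)):
    dataset = []
    for _ in range(n_queries):
        query = " ".join(rng.choices(VOCAB, k=rng.randint(5, 12)))
        dataset.append((query, victim_api(query)))   # label comes from the API
    return dataset

stolen = extract_dataset()
print(stolen[:3])
# The attacker then trains their own model on `stolen`.
```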
39
40
• Membership classification: flagging suspicious queries
• API watermarking: for some % of queries, return a wrong output; the "watermarked" queries and their outputs are stored on the API side (toy sketch below)
• Note: both of these would fail against smart attacks
Suggested Solutions
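A toy sketch of the API-watermarking defence described above; the function names, labels, and 1% rate are illustrative assumptions. A model that later reproduces these deliberately wrong answers was likely trained on the API's outputs.

```python
# Toy API watermarking: flip the answer for a small fraction of queries and log them.
import random

WATERMARK_LOG = []   # (query, watermarked_label) pairs kept server-side

def watermarked_api(query, true_label_fn, labels=("pos", "neg"),
                    rate=0.01, rng=random.Random(0)):
    label = true_label_fn(query)
    if rng.random() < rate:
        wrong = rng.choice([l for l in labels if l != label])
        WATERMARK_LOG.append((query, wrong))
        return wrong            # deliberately incorrect, "watermarked" output
    return label

def toy_sentiment(text):
    return "pos" if "great" in text else "neg"

print(watermarked_api("a great movie", toy_sentiment))
```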
41
Controlling Text Generation with Plug and Play Language Models
- Rosanne Liu et al
42
43
• LMs can generate coherent, relatable text, either from scratch or by completing a passage started by the user
• BUT they are hard to steer or control
• They can also be triggered by certain adversarial attacks
Introduction
44
• Controlled generation: adding knobs with conditional probability
• Consists of 3 steps:
Controlling the Mammoth
45
Controlling the Mammoth
46
• Controlled generation: adding knobs with conditional probability (toy sketch below)
• Consists of 3 steps
• Also allows a reduction in toxicity: from 63% to ~5%!
Controlling the Mammoth
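A toy PyTorch sketch of the PPLM-style control loop: gradients from a small attribute model p(a|x) perturb a latent so that the frozen LM head drifts toward the desired attribute, while a KL term keeps the steered distribution close to the original. The tiny LM head and attribute classifier here are random stand-ins, not GPT-2 or the paper's attribute models.

```python
# Toy gradient-based steering of a latent toward an attribute classifier.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, vocab_size = 32, 100
lm_head = torch.nn.Linear(hidden_dim, vocab_size)   # stand-in for the frozen LM head p(x)
attr_clf = torch.nn.Linear(hidden_dim, 2)           # stand-in for the attribute model p(a|x)

hidden = torch.randn(1, hidden_dim)                   # latent from the LM's forward pass
delta = torch.zeros_like(hidden, requires_grad=True)  # perturbation applied to the latent
target_attr = torch.tensor([1])                       # desired attribute class
base_probs = F.softmax(lm_head(hidden), dim=-1).detach()

for _ in range(10):                                   # a few gradient steps per token
    h = hidden + delta
    attr_loss = F.cross_entropy(attr_clf(h), target_attr)
    # KL term keeps the steered next-token distribution close to the original LM's.
    kl = F.kl_div(F.log_softmax(lm_head(h), dim=-1), base_probs, reduction="batchmean")
    loss = attr_loss + 0.1 * kl
    loss.backward()
    with torch.no_grad():
        delta -= 0.05 * delta.grad
        delta.grad.zero_()

steered_probs = F.softmax(lm_head(hidden + delta), dim=-1)  # sample the next token from this
print(steered_probs.argmax().item())
```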
47
GENERATIVE MODELS FOR EFFECTIVE ML ON PRIVATE, DECENTRALIZED DATASETS
- Sean Augenstein et al
48
• Modelling is important: looking at data is a large part of the pipeline
• Manual data inspection is problematic for privacy-sensitive datasets
• Problem: your model resides on your server, the data on end devices
Introduction
49
• Modelling is important: looking at data is a large part of the pipeline
• Manual data inspection is problematic for privacy-sensitive datasets
• Problem: your model resides on your server, the data on end devices
Suggested Solutions
50
• DP (differentially private) federated GANs (toy sketch below):
- Train on user devices
- Inspect generated data instead of raw user data
• Repository showcases:
- Language modelling with a DP RNN
- Image modelling with DP GANs
Suggested Solutions
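As a rough illustration of the training-loop shape, here is a toy numpy sketch of one federated round with per-client clipping and server-side Gaussian noise. It abstracts the GAN itself into a flat parameter vector and is not the paper's DP-FedAvg implementation; all names are assumptions.

```python
# Toy DP federated averaging round: local updates, clipping, noise, averaging.
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_params, client_data):
    # Stand-in for a few local GAN training steps on the client's device.
    return global_params + 0.01 * rng.standard_normal(global_params.shape)

def dp_federated_round(global_params, clients, clip_norm=1.0, noise_std=0.1):
    updates = []
    for data in clients:
        delta = local_update(global_params, data) - global_params
        # Clip each client's contribution to bound its influence.
        delta = delta * min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
        updates.append(delta)
    avg = np.mean(updates, axis=0)
    # The server adds Gaussian noise for differential privacy.
    avg += rng.normal(0.0, noise_std * clip_norm / len(clients), size=avg.shape)
    return global_params + avg

params = np.zeros(10)
for _ in range(5):
    params = dp_federated_round(params, clients=[None] * 8)
print(params)
```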
Thank You! 🍵
bhutanisanyam1
🎙: ctdsshow
Questions?


Editor's Notes

• #4 Who is H2O.ai? H2O.ai was founded in 2012 and closed a Series D round in August 2019, with Goldman Sachs leading the round and Ping An (insurance and finance out of China) contributing as well; customers such as Wells Fargo and strategic partner NVIDIA also invested. H2O.ai is the creator and inventor of H2O open source: nearly 20,000 organizations, businesses, governments and universities use H2O. H2O.ai also brought H2O Driverless AI to market in late 2017; it is the premier product for automatic machine learning, and this presentation covers what it is, its value and who is using it. The team is over 200 people, including some of the world's best AI experts and Kaggle Grandmasters. Kaggle is an online tournament for data scientists, who compete for fame and money by delivering the best data science results: companies offer a challenge and some prize money, and data scientists spend time fine-tuning their models to get results. When they win a number of competitions, they can claim a Grandmaster title/status, similar to a chess Grandmaster. H2O.ai has 13 of the roughly 140 Grandmasters on the planet today. H2O.ai's talent also extends to distributed computing experts and visualization experts (Leland Wilkinson). Finally, H2O.ai is global: headquartered in Mountain View, CA, with offices in Prague (an AI Center of Excellence), London, NYC, and India.