SHOULD YOU BE AFRAID OF
TRANSFORMERS?
LECTURE DEEP LEARNING MEETUP, COLOGNE, 21.05.2019
DOMINIK SEISSER
Intro
• Over the past year, a string of deep learning innovations has destroyed previous state-of-the-art NLP benchmarks
• We'll look at how we got there, what the future might look like, and what you can do with it
• A brief history of deep learning for NLP, with a small intermezzo on ethics
About me
• 3 Gartner Cool Vendor AI/Big Data startups
• Gyana: Tech Lead/Head of AI, founded in Oxford, "Top 10" AI startup in London – geo intelligence
• Cognigy: building a leading NLU for conversational AI, intent mapping etc.
Learning about the world from data
• To understand natural language, we teach the machine to learn from large text corpora
• A story of iterative improvements in our capacity to machine-learn about the world from data
• Increasing compute capacity enables new algorithms and state-of-the-art improvements
Methodological Progress
• Before deep learning NLP
⚡ Curse of dimensionality
• Word vector representations
⚡ No syntactic/semantic context
• Sequence models
⚡ Long-range dependencies and complexity
Recent innovations
•Transformers
•New language modelling and training
techniques
•Fine-tuning/transfer learning
Classic Statistical NLP/NLU
• TF/IDF
• Bag of words, n-grams...
• Hidden Markov Models etc.
• Still relevant if you care about speed and compute cost!
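To give a feel for this classic toolbox, here is a minimal sketch of a TF-IDF bag-of-words baseline using scikit-learn; the toy texts and labels are made up for illustration and are not from the talk.

```python
# Minimal TF-IDF + linear classifier baseline (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, just to show the API shape.
texts = ["book a flight to berlin", "what is the weather today", "cancel my flight"]
labels = ["travel", "weather", "travel"]

# Word unigrams + bigrams weighted by TF-IDF, fed into a linear model.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["is it going to rain"]))
```

Even on modest hardware this trains in milliseconds, which is exactly the speed and compute-cost advantage the slide points at.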
Curse of dimensionality
From Steeve Huang, "Word2Vec and FastText Word Embedding with Gensim" (2018),
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
Vector representation breakthroughs
• Can we learn lower-dimensional word embeddings that capture some
meaning?
• Word2Vec
• Simple neural net Skip-Gram/CBOW model
• T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
• GloVe
• Word-word co-occurrence matrix
• J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” 2014, pp. 1532–1543.
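As a concrete illustration (not the authors' code), here is a gensim sketch of training skip-gram vectors on a made-up toy corpus; real embeddings need corpora with millions of sentences, so this only shows the API shape.

```python
# Training skip-gram word vectors with gensim (toy-sized sketch).
from gensim.models import Word2Vec

# A made-up, tokenized corpus; real training needs far more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the skip-gram objective (sg=0 would be CBOW).
model = Word2Vec(sentences, sg=1, min_count=1, window=2)

# Every word is now a dense vector; words in similar contexts end up close together.
print(model.wv["king"].shape)
print(model.wv.most_similar("king", topn=3))

# The famous analogy query; meaningless on this toy corpus, but with vectors
# trained on a large corpus it returns "queen".
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```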
Word2Vec: Skip-gram
Word2Vec Tutorial - The Skip-Gram Model, McCormick 2016
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
From vocabulary size V to hidden layer size N
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
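The slide's point, sketched as a bare-bones PyTorch module (my own illustration of the V → N → V shape, not code from the paper): an embedding matrix maps the one-hot vocabulary of size V to a hidden vector of size N, and a linear layer projects back to V to score context words. After training, the V×N embedding matrix is the word-vector table.

```python
# Skip-gram architecture sketch: vocabulary size V -> hidden size N -> V.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # V x N input vectors
        self.out = nn.Linear(embed_dim, vocab_size)        # N x V output projection

    def forward(self, center_word_ids):
        h = self.embed(center_word_ids)   # look up the N-dimensional hidden vector
        return self.out(h)                # scores over all V possible context words

V, N = 10000, 300
model = SkipGram(V, N)
logits = model(torch.tensor([42]))        # predict context words for word id 42
loss = nn.functional.cross_entropy(logits, torch.tensor([7]))  # observed context id 7
print(logits.shape, loss.item())
```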
Figure: word-vector arithmetic – king - man + woman ≈ queen
Sequence models
• RNNs, LSTM, GRU...
• State of the art up until last year: some variant of recurrent model + attention
• Suffer from a structure dilemma and the vanishing gradient problem
Attention
K. M. Hermann et al., “Teaching Machines to Read and
Comprehend,” arXiv:1506.03340 [cs], Jun. 2015.
Attention
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine
Translation by Jointly Learning to Align and Translate,”
arXiv:1409.0473 [cs, stat], Sep. 2014.
What is attention?
• A vector interface to query – pay attention to – salient features
• Answer → Question
• Translation → Source
• One position/token in a sentence → all positions/tokens in a sentence
• A form of regularization that enables learning
• Produces salient semantic and syntactic features, e.g. attending to the subject of a sentence
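A minimal numpy sketch of the core operation, written in the scaled dot-product form later used by Transformers (Bahdanau-style attention is additive, but the query/key/value reading is the same); all shapes and names here are illustrative.

```python
# Scaled dot-product attention: a query attends to all positions of a sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Similarity of each query with each key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # how strongly each position is attended to
    return weights @ V, weights          # weighted sum of the value vectors

d = 8
Q = np.random.randn(1, d)    # e.g. one target-side position (the "question")
K = np.random.randn(5, d)    # 5 source positions ("where to look")
V = np.random.randn(5, d)    # 5 source representations ("what to read out")
context, weights = attention(Q, K, V)
print(weights.round(2), context.shape)
```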
Transformers
• The first sequence transduction models built on attention only
• No RNNs or convolutions, just positional encodings
→ lower computational complexity + no sequential operations
• Attention mechanisms specialise in different tasks and capture semantic + syntactic structure
A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
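Because nothing in the model is recurrent, token order has to be injected explicitly. A sketch of the sinusoidal positional encodings described in Vaswani et al., written out in numpy (my paraphrase of the formula, not the authors' code):

```python
# Sinusoidal positional encodings: each position gets a unique, deterministic
# vector that is simply added to the token embeddings.
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model)[None, :]              # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])         # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])         # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): one 16-dimensional vector per position
```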
Transformers: co-reference resolution
A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
Should we be afraid of transformers?
• The OpenAI lab decided not to release its more powerful GPT-2 model because of AI safety concerns
• OpenAI started as a non-profit, has now turned for-profit
• A push to make AI research private and proprietary, for AI safety and for commercial gain
• Most public university AI labs bought up by industry
What does this mean?
• The safety concern is genuine – expect more efforts and developments in this direction as AI becomes increasingly relevant tech for modern warfare
• Are we at the brink of an AI cold war?
• Is responsible and open cooperation between scientists and nations possible?
• AI is not dangerous, people are → more important than ever to stand up for peace and open, free science
New kid on the block
BERT & Friends: recent innovations
• Language modelling and fine-tuning
• ULMFiT (J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018)
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018)
• GPT (A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018)
• BERT (J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], Oct. 2018)
Fine-tuning
• ULMFiT (J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018)
Contextual embeddings
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb.
2018)
Contextual embeddings
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018)
• Language-model next-token prediction task
• Bi-directional LSTM looking at the entire sentence
• Extract contextual embeddings from the hidden states
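Reduced to a PyTorch sketch (this is not ELMo's actual architecture, which uses character convolutions and a two-layer LSTM trained as a language model; it only shows where contextual embeddings come from): run a bi-directional LSTM over the sentence and read off per-token hidden states.

```python
# Contextual embeddings from a bi-directional LSTM (illustrative sketch only).
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 5000, 100, 128
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 (fake) token ids
hidden_states, _ = bilstm(embed(token_ids))        # shape (1, 7, 2 * hidden_dim)

# Unlike word2vec, the vector for each token now depends on the whole sentence,
# so "bank" in "river bank" and "bank account" gets different representations.
contextual_embeddings = hidden_states[0]
print(contextual_embeddings.shape)                 # torch.Size([7, 256])
```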
Generative Pretraining
• Generative Pre-Training (A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018)
BERT
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding,” arXiv:1810.04805 [cs], Oct. 2018.
• Bi-directional model achieved through masking
• Cloze task (masked language modelling)
• Sentence/phrase switching (next-sentence prediction)
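A minimal cloze demo with the pytorch-pretrained-BERT package linked later in these slides; the calls follow that repository's README, so treat the exact signatures as a sketch that may shift between versions.

```python
# Cloze task with a pre-trained BERT: predict the [MASK]ed token.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "[CLS] the cat sat on the [MASK] . [SEP]"
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    predictions = model(input_ids)        # scores of shape (1, seq_len, vocab_size)

mask_index = tokens.index("[MASK]")
predicted_id = predictions[0, mask_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))   # e.g. "mat"
```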
Fine-Tuning
• Take a pre-trained model
• Adapt a classification layer to the task
• Train for a few epochs
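Sketched with the same pytorch-pretrained-BERT package; the model name, toy example and hyperparameters are illustrative choices of mine, not a recipe from the talk.

```python
# Fine-tuning BERT for sequence classification (illustrative sketch).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder + a freshly initialised classification layer on top of [CLS].
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# One made-up training example; in practice you batch a labelled dataset.
tokens = tokenizer.tokenize("[CLS] this movie was great [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
label = torch.tensor([1])

model.train()
for epoch in range(3):                     # "a few epochs" usually means 2-4
    loss = model(input_ids, labels=label)  # the model returns the loss when labels are given
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(epoch, loss.item())
```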
Top F1 score by year on the Stanford Question Answering Dataset (SQuAD) 1.1, with human-level performance shown for reference:
• Dynamic Coattention Networks: 80.383 (Nov 01, 2016)
• AttentionReader+: 88.163 (Dec 22, 2017)
• BERT: 93.16 (Oct 05, 2018)
SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ Wikipedia articles.
How to get started
• Authors' implementation in TF: https://github.com/google-research/bert
• PyTorch implementation: https://github.com/huggingface/pytorch-pretrained-BERT
• Colab with free TPU: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
• Flair: https://github.com/zalandoresearch/flair (usage sketch below)
• Apache 2.0 licensed
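For example, Flair lets you pull contextual BERT embeddings for a sentence in a few lines. This is a hedged sketch based on the BertEmbeddings class Flair shipped around this time (newer releases have since replaced it with TransformerWordEmbeddings), so check the current API.

```python
# Contextual BERT embeddings via Flair (API as of the 0.4.x releases).
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embedding = BertEmbeddings("bert-base-uncased")   # downloads the pre-trained model
sentence = Sentence("Transformers are taking over NLP .")
embedding.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)      # one contextual vector per token
```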
Production
• Being computationally cheap to train doesn't make a huge model exactly cheap in production
• Multilingual BERT Base available in 102 languages, plus a separate Chinese model
• Training your own full model is expensive; TPUs are the best option
• Accumulating gradients and other tricks (sketch below): https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
• BERT As Service: https://github.com/hanxiao/bert-as-service
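The gradient-accumulation trick from the linked Hugging Face post, reduced to a generic PyTorch sketch (model, data loader and loss are placeholders, not part of the talk): run several small forward/backward passes and step the optimizer only once per group, simulating a batch larger than fits in GPU memory.

```python
# Gradient accumulation: simulate a large batch with limited GPU memory.
import torch
import torch.nn as nn

accumulation_steps = 8   # effective batch size = 8 x the per-step batch size

def train_epoch(model, loader, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = criterion(model(inputs), targets)
        # Scale the loss so the accumulated gradient matches one big-batch gradient.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()          # update weights only every N mini-batches
            optimizer.zero_grad()

# Tiny dummy setup just to show the call; replace with your BERT fine-tuning job.
model = nn.Linear(10, 2)
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]
train_epoch(model, loader, nn.CrossEntropyLoss(), torch.optim.SGD(model.parameters(), lr=0.1))
```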
Future
•Improvements on BERT approach –
RNNs are not dead
•Better, more complex context/meaning
representations
•Increase in compute capacity, data and models
that best exploit available resources
Recap
•Deep Learning NLP: Learning
ever better language models
•ImageNet moment – transfer
learning breakthrough
•Latest innovations around
Transformers and BERT
CONVERSATIONAL AI PLATFORM
DÜSSELDORF | SAN FRANCISCO
WE'RE HIRING! WWW.COGNIGY.COM
SHOULD WE BE AFRAID OF TRANSFORMERS?
LECTURE, DÜSSELDORF, MAY 2019
DOMINIK SEISSER | D.SEISSER@COGNIGY.COM
LINKEDIN.COM/IN/SEISSER
References
• J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” 2014.
• T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
• A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
• J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018.
• M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018.
• A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018.
• J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], Oct. 2018.
• Alammar, Jay. n.d. “The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer Learning).” Accessed May 22, 2019. https://jalammar.github.io/illustrated-bert/.
• “Better Language Models and Their Implications.” 2019. OpenAI. February 14, 2019. https://openai.com/blog/better-language-models/.
• “NLP’s ImageNet Moment Has Arrived.” 2018. Sebastian Ruder. July 12, 2018. http://ruder.io/nlp-imagenet/.
• “Transformer: A Novel Neural Network Architecture for Language Understanding.” n.d. Google AI Blog. Accessed May 22, 2019. http://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
