SHOULD YOU BE AFRAID OF
TRANSFORMERS?
LECTURE DEEP LEARNING MEETUP, COLOGNE, 21.05.2019
DOMINIK SEISSER
Intro
• Over the past year, a string of deep learning innovations has destroyed previous state-of-the-art NLP benchmarks
• We'll look at how we got there, what the future might look like, and what you can do with it
• A brief history of deep learning for NLP, with a small intermezzo on ethics
About me
• 3 Gartner Cool Vendor AI/Big Data startups
• Gyana: Tech Lead/Head of AI, founded in Oxford, "Top 10" AI startup in London – geo intelligence
• Cognigy: building a leading NLU for conversational AI, intent mapping etc.
Learning about the world from data
• To understand natural language, we teach the machine to learn from large text corpora
• A story of iterative improvements in our capacity to machine-learn about the world from data
• Increasing compute capacity enables new algorithms and state-of-the-art improvements
Methodological Progress
• Before deep learning NLP
⚡ Curse of dimensionality
• Word vector representations
⚡ No syntactic/semantic context
• Sequence models
⚡ Long-range dependencies and complexity
Recent innovations
•Transformers
•New language modelling and training
techniques
•Fine-tuning/transfer learning
Classic Statistical NLP/NLU
• TF/IDF
• Bag of words, n-grams...
• Hidden Markov Models etc.
• Still relevant if you care about speed and compute cost!
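To give a feel for this classic toolbox, here is a minimal sketch of a TF-IDF bag-of-words baseline using scikit-learn; the toy texts and labels are made up for illustration and are not from the talk.

```python
# Minimal TF-IDF + linear classifier baseline (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, just to show the API shape.
texts = ["book a flight to berlin", "what is the weather today", "cancel my flight"]
labels = ["travel", "weather", "travel"]

# Word unigrams + bigrams weighted by TF-IDF, fed into a linear model.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["is it going to rain"]))
```

Even on modest hardware this trains in milliseconds, which is exactly the speed and compute-cost advantage the slide points at.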
Curse of dimensionality
From Steeve Huang, "Word2Vec and FastText Word Embedding with Gensim" (2018),
https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
Vector representation breakthroughs
• Can we learn lower-dimensional word embeddings that capture some
meaning?
• Word2Vec
• Simple neural net Skip-Gram/CBOW model
• T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
• GloVe
• Word-word co-occurrence matrix
• J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” 2014, pp. 1532–1543.
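As a concrete illustration (not the authors' code), here is a gensim sketch of training skip-gram vectors on a made-up toy corpus; real embeddings need corpora with millions of sentences, so this only shows the API shape.

```python
# Training skip-gram word vectors with gensim (toy-sized sketch).
from gensim.models import Word2Vec

# A made-up, tokenized corpus; real training needs far more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the skip-gram objective (sg=0 would be CBOW).
model = Word2Vec(sentences, sg=1, min_count=1, window=2)

# Every word is now a dense vector; words in similar contexts end up close together.
print(model.wv["king"].shape)
print(model.wv.most_similar("king", topn=3))

# The famous analogy query; meaningless on this toy corpus, but with vectors
# trained on a large corpus it returns "queen".
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```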
Word2Vec: Skip-gram
Word2Vec Tutorial - The Skip-Gram Model, McCormick 2016
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
From vocabulary size V to hidden layer size N
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
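The slide's point, sketched as a bare-bones PyTorch module (my own illustration of the V → N → V shape, not code from the paper): an embedding matrix maps the one-hot vocabulary of size V to a hidden vector of size N, and a linear layer projects back to V to score context words. After training, the V×N embedding matrix is the word-vector table.

```python
# Skip-gram architecture sketch: vocabulary size V -> hidden size N -> V.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # V x N input vectors
        self.out = nn.Linear(embed_dim, vocab_size)        # N x V output projection

    def forward(self, center_word_ids):
        h = self.embed(center_word_ids)   # look up the N-dimensional hidden vector
        return self.out(h)                # scores over all V possible context words

V, N = 10000, 300
model = SkipGram(V, N)
logits = model(torch.tensor([42]))        # predict context words for word id 42
loss = nn.functional.cross_entropy(logits, torch.tensor([7]))  # observed context id 7
print(logits.shape, loss.item())
```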
Figure: word-vector arithmetic – king - man + woman ≈ queen
Sequence models
• RNNs, LSTM, GRU...
• State of the art up until last year: some variant of recurrent model + attention
• Suffer from a structure dilemma and the vanishing gradient problem
Attention
K. M. Hermann et al., “Teaching Machines to Read and
Comprehend,” arXiv:1506.03340 [cs], Jun. 2015.
Attention
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine
Translation by Jointly Learning to Align and Translate,”
arXiv:1409.0473 [cs, stat], Sep. 2014.
What is attention?
• A vector interface to query – pay attention to – salient features
• Answer → Question
• Translation → Source
• One position/token in a sentence → all positions/tokens in a sentence
• A form of regularization that enables learning
• Produces salient semantic and syntactic features, e.g. attending to the subject of a sentence
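A minimal numpy sketch of the core operation, written in the scaled dot-product form later used by Transformers (Bahdanau-style attention is additive, but the query/key/value reading is the same); all shapes and names here are illustrative.

```python
# Scaled dot-product attention: a query attends to all positions of a sequence.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Similarity of each query with each key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # how strongly each position is attended to
    return weights @ V, weights          # weighted sum of the value vectors

d = 8
Q = np.random.randn(1, d)    # e.g. one target-side position (the "question")
K = np.random.randn(5, d)    # 5 source positions ("where to look")
V = np.random.randn(5, d)    # 5 source representations ("what to read out")
context, weights = attention(Q, K, V)
print(weights.round(2), context.shape)
```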
Transformers
• The first sequence transduction models built on attention only
• No RNNs or convolutions, just positional encodings
→ lower computational complexity + no sequential operations
• Attention mechanisms specialise in different tasks and capture semantic + syntactic structure
A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
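Because nothing in the model is recurrent, token order has to be injected explicitly. A sketch of the sinusoidal positional encodings described in Vaswani et al., written out in numpy (my paraphrase of the formula, not the authors' code):

```python
# Sinusoidal positional encodings: each position gets a unique, deterministic
# vector that is simply added to the token embeddings.
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model)[None, :]              # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])         # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])         # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): one 16-dimensional vector per position
```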
Transformers: co-reference resolution
A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
Should we be afraid of transformers?
• The OpenAI lab decided not to release its more powerful GPT-2 model because of AI safety concerns
• OpenAI started as a non-profit, has now turned for-profit
• A push to make AI research private and proprietary, for AI safety and for commercial gain
• Most public university AI labs bought up by industry
What does this mean?
• The safety concern is genuine – expect more efforts and developments in this direction as AI becomes increasingly relevant tech for modern warfare
• Are we at the brink of an AI cold war?
• Is responsible and open cooperation between scientists and nations possible?
• AI is not dangerous, people are → more important than ever to stand up for peace and open, free science
New kid on the block
BERT & Friends: recent innovations
• Language modelling and fine-tuning
• ULMFiT (J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018)
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018)
• GPT (A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018)
• BERT (J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], Oct. 2018)
Fine-tuning
• ULMFiT (J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018)
Contextual embeddings
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb.
2018)
Contextual embeddings
• ELMo (M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018)
• Language-model next-token prediction task
• Bi-directional LSTM looking at the entire sentence
• Extract contextual embeddings from the hidden states
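Reduced to a PyTorch sketch (this is not ELMo's actual architecture, which uses character convolutions and a two-layer LSTM trained as a language model; it only shows where contextual embeddings come from): run a bi-directional LSTM over the sentence and read off per-token hidden states.

```python
# Contextual embeddings from a bi-directional LSTM (illustrative sketch only).
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 5000, 100, 128
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 (fake) token ids
hidden_states, _ = bilstm(embed(token_ids))        # shape (1, 7, 2 * hidden_dim)

# Unlike word2vec, the vector for each token now depends on the whole sentence,
# so "bank" in "river bank" and "bank account" gets different representations.
contextual_embeddings = hidden_states[0]
print(contextual_embeddings.shape)                 # torch.Size([7, 256])
```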
Generative Pretraining
• Generative Pre-Training (A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018)
BERT
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding,” arXiv:1810.04805 [cs], Oct. 2018.
• Bi-directional model achieved through masking
• Cloze task (masked language modelling)
• Sentence/phrase switching (next-sentence prediction)
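A minimal cloze demo with the pytorch-pretrained-BERT package linked later in these slides; the calls follow that repository's README, so treat the exact signatures as a sketch that may shift between versions.

```python
# Cloze task with a pre-trained BERT: predict the [MASK]ed token.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "[CLS] the cat sat on the [MASK] . [SEP]"
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    predictions = model(input_ids)        # scores of shape (1, seq_len, vocab_size)

mask_index = tokens.index("[MASK]")
predicted_id = predictions[0, mask_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))   # e.g. "mat"
```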
Fine-Tuning
• Take a pre-trained model
• Adapt a classification layer to the task
• Train for a few epochs
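Sketched with the same pytorch-pretrained-BERT package; the model name, toy example and hyperparameters are illustrative choices of mine, not a recipe from the talk.

```python
# Fine-tuning BERT for sequence classification (illustrative sketch).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder + a freshly initialised classification layer on top of [CLS].
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# One made-up training example; in practice you batch a labelled dataset.
tokens = tokenizer.tokenize("[CLS] this movie was great [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
label = torch.tensor([1])

model.train()
for epoch in range(3):                     # "a few epochs" usually means 2-4
    loss = model(input_ids, labels=label)  # the model returns the loss when labels are given
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(epoch, loss.item())
```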
Top F1 score by year on the Stanford Question Answering Dataset (SQuAD) 1.1, with human-level performance shown for reference:
• Dynamic Coattention Networks: 80.383 (Nov 01, 2016)
• AttentionReader+: 88.163 (Dec 22, 2017)
• BERT: 93.16 (Oct 05, 2018)
SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ Wikipedia articles.
How to get started
• Authors' implementation in TF: https://github.com/google-research/bert
• PyTorch implementation: https://github.com/huggingface/pytorch-pretrained-BERT
• Colab with free TPU: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
• Flair: https://github.com/zalandoresearch/flair (usage sketch below)
• Apache 2.0 licensed
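For example, Flair lets you pull contextual BERT embeddings for a sentence in a few lines. This is a hedged sketch based on the BertEmbeddings class Flair shipped around this time (newer releases have since replaced it with TransformerWordEmbeddings), so check the current API.

```python
# Contextual BERT embeddings via Flair (API as of the 0.4.x releases).
from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embedding = BertEmbeddings("bert-base-uncased")   # downloads the pre-trained model
sentence = Sentence("Transformers are taking over NLP .")
embedding.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)      # one contextual vector per token
```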
Production
• Being computationally cheap to train doesn't make a huge model exactly cheap in production
• Multilingual BERT Base available in 102 languages, plus a separate Chinese model
• Training your own full model is expensive; TPUs are the best option
• Accumulating gradients and other tricks (sketch below): https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
• BERT As Service: https://github.com/hanxiao/bert-as-service
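The gradient-accumulation trick from the linked Hugging Face post, reduced to a generic PyTorch sketch (model, data loader and loss are placeholders, not part of the talk): run several small forward/backward passes and step the optimizer only once per group, simulating a batch larger than fits in GPU memory.

```python
# Gradient accumulation: simulate a large batch with limited GPU memory.
import torch
import torch.nn as nn

accumulation_steps = 8   # effective batch size = 8 x the per-step batch size

def train_epoch(model, loader, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = criterion(model(inputs), targets)
        # Scale the loss so the accumulated gradient matches one big-batch gradient.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()          # update weights only every N mini-batches
            optimizer.zero_grad()

# Tiny dummy setup just to show the call; replace with your BERT fine-tuning job.
model = nn.Linear(10, 2)
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(16)]
train_epoch(model, loader, nn.CrossEntropyLoss(), torch.optim.SGD(model.parameters(), lr=0.1))
```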
Future
•Improvements on BERT approach –
RNNs are not dead
•Better, more complex context/meaning
representations
•Increase in compute capacity, data and models
that best exploit available resources
Recap
•Deep Learning NLP: Learning
ever better language models
•ImageNet moment – transfer
learning breakthrough
•Latest innovations around
Transformers and BERT
CONVERSATIONAL AI PLATFORM
DÜSSELDORF | SAN FRANCISCO
WE'RE HIRING! WWW.COGNIGY.COM
SHOULD WE BE AFRAID OF TRANSFORMERS?
LECTURE, DÜSSELDORF, MAY 2019
DOMINIK SEISSER | D.SEISSER@COGNIGY.COM
LINKEDIN.COM/IN/SEISSER
References
• J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” 2014.
• T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013.
• A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Jun. 2017.
• J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 [cs, stat], Jan. 2018.
• M. E. Peters et al., “Deep contextualized word representations,” arXiv:1802.05365 [cs], Feb. 2018.
• A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” 2018.
• J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], Oct. 2018.
• Alammar, Jay. n.d. “The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer Learning).” Accessed May 22, 2019. https://jalammar.github.io/illustrated-bert/.
• “Better Language Models and Their Implications.” 2019. OpenAI. February 14, 2019. https://openai.com/blog/better-language-models/.
• “NLP’s ImageNet Moment Has Arrived.” 2018. Sebastian Ruder. July 12, 2018. http://ruder.io/nlp-imagenet/.
• “Transformer: A Novel Neural Network Architecture for Language Understanding.” n.d. Google AI Blog. Accessed May 22, 2019. http://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
