Language Models are 

Unsupervised Multitask Learners

(GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145
Articles & Useful Links
• Official

• Technical Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

• Blog: https://blog.openai.com/better-language-models/

• GitHub: https://github.com/openai/gpt-2

• Unofficial

• Reddit: https://www.reddit.com/r/MachineLearning/comments/aqlzde/r_openai_better_language_models_and_their/
!2
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049: https://youtu.be/6zGgVIlStXs

• Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec et al. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website: https://blog.openai.com/language-unsupervised/

• Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” (2018)

• Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• Paper: https://arxiv.org/abs/1810.04805

• PR-121: https://youtu.be/GK4IO3qOnLc
!3
Dataset
!4
Dataset (BERT)
!5
BookCorpus (800M words) + Wikipedia (2,500M words)
Common Crawl?
!6
• Significant data quality issues.

• Best results were achieved when using a small subsample of Common Crawl that included only documents most similar to the target dataset

• Authors of GPT-2 wanted to avoid making
assumptions about the tasks to be performed
ahead of time.
WebText
• GPT-2 authors created a new web scrape which
emphasizes document quality

• They scraped web pages which have been curated/filtered by humans

• Manually filtering a full web scrape would be
exceptionally expensive

• Scraped all outbound links from Reddit that received at least 3 karma

• Heuristic indicator for whether other users found the link interesting, educational, or just funny
!7
Karma ≥ 3
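As a rough illustration of this collection heuristic, here is a minimal sketch of filtering outbound Reddit links by karma. The `submissions` records and their field names are hypothetical, not OpenAI's actual pipeline.

```python
# Hypothetical sketch of the WebText link-collection heuristic: keep outbound
# links from Reddit submissions that received at least 3 karma.

def collect_outbound_links(submissions, min_karma=3):
    """Yield external URLs from submissions that received >= min_karma."""
    for post in submissions:
        if post.get("score", 0) >= min_karma and post.get("is_self") is False:
            yield post["url"]

# Usage with toy data:
submissions = [
    {"score": 12, "is_self": False, "url": "https://example.com/article"},
    {"score": 1,  "is_self": False, "url": "https://example.com/low-karma"},
]
print(list(collect_outbound_links(submissions)))  # only the first URL survives
```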
WebText
• 45 million links

• Used content extractors to extract the text from HTML

• De-duplication

• Heuristic-based cleaning

• Slightly over 8 million documents

• 40 GB of text

• Removed ALL Wikipedia documents

• Since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks
!8
Input Representation
!9
Byte Pair Encoding (BPE)
• Sennrich, Rico et al. 

“Neural Machine Translation of Rare Words with Subword Units.” (2016)

• A practical middle ground between character-level and word-level language modeling

• Effectively interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences

• Combines the empirical benefits of word-level LMs with the generality of byte-level approaches

• This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization, or vocabulary size
!10
Byte Pair Encoding
(BPE)
Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
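To make the merge procedure concrete, below is a compact sketch of Sennrich-style BPE training on a toy word-frequency table. GPT-2's actual tokenizer operates on raw bytes and adds extra merging rules, so this only illustrates the core merge loop; the toy vocabulary is illustrative.

```python
# Minimal BPE learner: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    bigram, replacement = " ".join(pair), "".join(pair)
    return {word.replace(bigram, replacement): freq for word, freq in vocab.items()}

# Words represented as space-separated symbols ending with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):                       # learn 5 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent pair
    vocab = merge_pair(best, vocab)
    print("merged", best)
```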
Model
!12
Transformer
• Transformer-based 

• Follows the details of GPT-1

• Layer Normalization was moved to the input of each sub-block 

(similar to pre-activation in ResNet)

• Additional LayerNorm was added after the final self-attention
block.

• Vocab is expanded to 50,257

• A batch size of 512 is used
!13
Original Transformer
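To show what the "LayerNorm at the input of each sub-block" change looks like in practice, here is a minimal pre-norm Transformer block in PyTorch. The sizes and the use of `nn.MultiheadAttention` are illustrative assumptions, not GPT-2's actual implementation or hyperparameters, and the extra LayerNorm after the final block is omitted.

```python
# A minimal pre-norm (GPT-2 style) Transformer block: LayerNorm is applied at
# the *input* of the attention and MLP sub-blocks, similar to pre-activation ResNets.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)                    # LN before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)                    # LN before MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # masked self-attention
        x = x + attn_out                                    # residual connection
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 10, 768)         # (batch, sequence, features)
print(PreNormBlock()(x).shape)      # torch.Size([2, 10, 768])
```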
Experiments
!14
Model sizes
!15
[Figure: model size comparison — GPT-1, BERT-large, and the GPT-2 variants]
Zero-shot results
!16
Children’s Book Test
• Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations.” (2016)

• Reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct

• GPT-2 authors compute the probability of each choice and of the rest of the sentence conditioned on that choice according to the LM, and predict the choice with the highest probability (see the sketch below)
!17
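A minimal sketch of this scoring rule, assuming a HuggingFace GPT-2 checkpoint as a stand-in for the paper's models and toy candidate words rather than actual CBT items:

```python
# Fill the blank with each candidate, score the completed sentence with the LM,
# and pick the candidate that gives the highest total log-probability.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_log_prob(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean NLL per predicted token
    return -loss.item() * (ids.size(1) - 1)      # total log-probability

context = "The cat chased the mouse across the"
candidates = ["kitchen", "ocean", "equation"]
scores = {c: sentence_log_prob(f"{context} {c}.") for c in candidates}
print(max(scores, key=scores.get))               # most plausible completion
```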
LAMBADA
• LAnguage Modeling Broadened to Account for Discourse Aspects

• Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)

• The task is to predict the final word of sentences that require at least 50 tokens of context for a human to predict successfully

• GPT-2 improves the state-of-the-art perplexity from 99.8 to 8.63
!18
Winograd Schema Challenge
• Measures commonsense reasoning via the model's ability to resolve ambiguities (pronoun references) in text
!19
Winograd Schema Challenge
Trinh, Trieu H. and Quoc V. Le. “A Simple Method for Commonsense Reasoning.” (2018)
Summarization
• Added the text “TL;DR:” after the article and generated 100 tokens with top-k random sampling (k = 2)

• CNN and Daily Mail dataset

• Used the first 3 generated sentences of these 100 tokens as the summary (see the sketch below)
!21
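A hedged sketch of this setup, assuming a HuggingFace GPT-2 checkpoint as a stand-in for the paper's model; the article text is a placeholder:

```python
# Append "TL;DR:" to the article, sample 100 tokens with top-k (k=2) random
# sampling, and keep the first 3 generated sentences as the summary.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

article = "..."  # full news article goes here
prompt = article + "\nTL;DR:"
ids = tokenizer(prompt, return_tensors="pt").input_ids

out = model.generate(
    ids,
    do_sample=True, top_k=2,            # top-k random sampling with k = 2
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
generated = tokenizer.decode(out[0, ids.size(1):], skip_special_tokens=True)
summary = " ".join(generated.split(". ")[:3])  # first 3 generated sentences
print(summary)
```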
Translation
• ‘english sentence = french sentence’ format

• Generate text after ‘english sentence = ’

• Sample from the model with greedy decoding and use the first generated sentence as the translation

• GPT-2 gets 5 BLEU on WMT-14 English-French test set

• GPT-2 gets 11.5 BLEU on WMT-14 French-English test set

• Outperforms several unsupervised machine translation baselines (2017)

• But still much worse than the 33.5 BLEU of the current (2019) SOTA in unsupervised machine translation
!22
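A minimal sketch of the prompting format, again assuming a HuggingFace GPT-2 checkpoint; the example sentence pairs are illustrative, not drawn from WMT-14:

```python
# Condition on a few "english sentence = french sentence" pairs, then prompt
# with "<source> = " and greedy-decode; the first generated sentence/line is
# taken as the translation.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

examples = [
    ("The house is blue.", "La maison est bleue."),
    ("I like to read books.", "J'aime lire des livres."),
]
source = "The weather is nice today."
prompt = "\n".join(f"{en} = {fr}" for en, fr in examples) + f"\n{source} = "

ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, do_sample=False, max_new_tokens=40,   # greedy decoding
                     pad_token_id=tokenizer.eos_token_id)
completion = tokenizer.decode(out[0, ids.size(1):], skip_special_tokens=True)
translation = completion.split("\n")[0]   # first generated line
print(translation)
```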
Translation
• Surprising result!

• Authors of GPT-2 deliberately removed non-English webpages from WebText as a
filtering step

• Authors ran byte-level language detector on WebText

• Only 10MB of data in the French language

• (Approximately 500x smaller than the monolingual French corpus common in prior
unsupervised machine translation research)
!23
Question Answering
• GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD

• The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline that returns the most common answer for each question type (who, what, where, etc.)

• → Model capacity is important

• But GPT-2 achieves 63.1% accuracy on the 1% of questions it is most confident about
!24
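For reference, a minimal sketch of the exact-match metric used in SQuAD-style evaluation (normalize both strings, then require an exact string match); the normalization steps follow the common convention and the example answers are toy data:

```python
# Exact-match scoring for QA: lowercase, strip punctuation and articles,
# collapse whitespace, then compare strings exactly.
import re
import string

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)      # drop articles
    return " ".join(text.split())                    # collapse whitespace

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
```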
Generalization vs Memorization
• It is important to analyze how much test data also shows up in the training data
• Using Bloom filters, the authors measured what percentage of each test set is also found in the WebText training set (a simplified sketch follows)
!25
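A simplified sketch of this overlap analysis: build a Bloom filter over 8-grams of the training text, then measure what fraction of a test set's 8-grams are (probably) also in training. The filter sizes, hashing scheme, and toy texts are illustrative assumptions, not the paper's exact setup.

```python
# Approximate 8-gram overlap between a test set and the training data,
# using a small Bloom filter as the membership structure.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text, n=8):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

train_text = "the quick brown fox jumps over the lazy dog near the river bank"
test_text = "the quick brown fox jumps over the lazy dog in the story"

train_filter = BloomFilter()
for gram in ngrams(train_text):
    train_filter.add(gram)

test_grams = ngrams(test_text)
overlap = sum(g in train_filter for g in test_grams) / max(len(test_grams), 1)
print(f"{overlap:.1%} of test 8-grams found in training")
```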
WebText Underfitting
!26
Conclusions
• Unsupervised task learning is an additional promising area of research to explore 

• Performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. 

• on reading comprehension

• but not on other tasks like summarization, etc…

• Studied zero-shot performance of WebText LMs on many canonical NLP tasks
!27
Discussions
!28
Personal Thoughts
• Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating and analyzing it on various canonical datasets and tasks

• Compared to the hype, the model's actual results are fairly modest

• Scaling is important. Modern research at large companies has already transitioned to huge models

• Zero-shot learning is interesting
!29
What do you think about 

OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
• Propagates fear

• Reproducibility issues

• Creates unnecessary hype
!30
• May be used for malicious purposes such as

• Generate misleading news articles

• Automate the production of abusive or faked
content to post on social media

• Automate the production of spam/phishing
content
Thank you!
!31
