Grigory Sapunov
OpenTalks.AI / 2021.02.04
gs@inten.to
NLP in 2020
1. GPT-3 and the
“new way of learning”
GPT-3
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
https://arxiv.org/abs/2005.14165
The GPT-3 family of models is a recent upgrade of the well-known GPT-2. The largest of them, the 175B-parameter model called “GPT-3” itself, is about 100x larger than the largest GPT-2 (1.5B parameters).
GPT-3 is 10 screens taller!!!
In-context Learning
Prompt Engineering
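What is new with GPT-3 is that tasks are specified purely through the prompt: the frozen model is shown a handful of input-output examples and continues the pattern with no gradient updates (in-context, or few-shot, learning), and results depend heavily on how that prompt is worded (prompt engineering). A minimal sketch of a few-shot prompt; the task and examples here are illustrative, not taken from the talk:

```python
# A few-shot prompt for English -> French translation, built as plain text.
# The model sees the examples and is expected to continue the pattern for the
# final, unanswered line. No fine-tuning or gradient updates are involved.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: house
French: maison

English: I have no idea.
French:"""

# This string would be sent to a large LM (e.g. via the GPT-3 API); the text
# the model returns after "French:" is the prediction.
print(few_shot_prompt)
```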
2. Large Transformers
Large models
● GPT-3 (up to 175B parameters)
● ruGPT-3 (1.3B+)
● Chinese CPM (2.6B)
● T5 (up to 11B)
● mT5 (up to 13B)
● mBERT (110M)
● mBART (680M, trained for MT)
● MARGE (960M)
● XLM, XLM-R (570M)
● Turing-NLG (17B)
● T-ULRv2 (550M)
● M2M-100 (12B, trained for MT)
● MoE Transformer (600B*, MT)
● Switch Transformer (1.6T*)
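The entries marked with * (MoE Transformer, Switch Transformer) are sparsely activated mixture-of-experts models: the headline parameter count sums all expert weights, but each token is routed to only one (Switch) or a few experts, so per-token compute stays close to that of a much smaller dense model. A minimal NumPy sketch of Switch-style top-1 routing; all shapes and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 64, 256, 8, 10

# One weight matrix per expert: total parameters grow with n_experts,
# but each token only touches a single expert's weights.
experts = [rng.standard_normal((d_model, d_ff)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

tokens = rng.standard_normal((n_tokens, d_model))

# Router: softmax over experts, then pick the top-1 expert per token (Switch-style).
logits = tokens @ router_w
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)

# Each token is processed only by its chosen expert, scaled by the router probability.
out = np.stack([
    probs[i, chosen[i]] * (tokens[i] @ experts[chosen[i]])
    for i in range(n_tokens)
])
print(out.shape)  # (n_tokens, d_ff): per-token compute ~ one expert, not all 8
```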
Large models
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
Scaling laws
“Scaling Laws for Neural Language Models”
https://arxiv.org/abs/2001.08361
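The paper's central result: language-model test loss falls as a smooth power law in model size, dataset size, and compute across many orders of magnitude. The fitted forms for parameters and data, with constants approximately as reported by Kaplan et al. (2020):

```latex
% L = cross-entropy test loss, N = non-embedding parameters, D = dataset size in tokens.
% Each law holds when the other factors are not the bottleneck.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\; N_c \approx 8.8 \times 10^{13}
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095,\; D_c \approx 5.4 \times 10^{13}
```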
2*. Problems of Large Models
Costs
Large model training costs
“The Cost of Training NLP Models: A Concise Overview”
https://arxiv.org/abs/2004.08900
CO2 emissions
“Energy and Policy Considerations for Deep Learning in NLP”
https://arxiv.org/abs/1906.02243
Training Data Extraction
“Extracting Training Data from Large Language Models”
https://arxiv.org/abs/2012.07805
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs, it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing using automated systems that are not reliable themselves.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
“An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information about
how they combine, but without any reference to meaning:
a stochastic parrot.”
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
3. Multilinguality
Large multilingual models
● GPT-3 (up to 175B parameters)
● ruGPT-3 (1.3B+)
● Chinese CPM (2.6B)
● T5 (up to 11B)
● mT5 (up to 13B, 101 languages)
● mBERT (110M, 104 languages)
● mBART (680M, 25 languages, trained for MT)
● MARGE (960M, 26 languages)
● XLM, XLM-R (570M, 100 languages)
● Turing-NLG (17B)
● T-ULRv2 (550M, 94 languages)
● M2M-100 (12B, 100 languages, trained for MT)
● MoE Transformer (600B*, 100 languages → en, trained for MT)
● Switch Transformer (1.6T*, 101 languages like mT5)
http://robot-design.org/
Positive language transfer (MoE Transformer)
“GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding”
https://arxiv.org/abs/2006.16668
Positive language transfer (M2M-100)
“Introducing the First AI Model That Translates 100 Languages Without Relying on English”
https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
“Beyond English-Centric Multilingual Machine Translation”
https://arxiv.org/abs/2010.11125
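A hedged example of what non-English-centric translation looks like in practice, using the small publicly released 418M-parameter M2M-100 checkpoint via Hugging Face Transformers (the 12B model above is the same family at larger scale); the snippet follows the documented usage and downloads weights on first run:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Small public M2M-100 checkpoint (418M params).
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Russian -> Chinese directly, without pivoting through English.
tokenizer.src_lang = "ru"
encoded = tokenizer("Жизнь похожа на коробку конфет.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("zh"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```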
XGLUE Benchmark
https://github.com/microsoft/XGLUE
XGLUE Benchmark
“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”
https://arxiv.org/abs/2004.01401
XTREME Benchmark
“XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization”
https://arxiv.org/abs/2003.11080
XTREME Benchmark
https://sites.research.google/xtreme
4. Efficient Transformers
Architectural innovations
● Efficiency:
○ Lower attention computational complexity (see the sketch after this list)
○ Larger attention spans
○ Reformer, Linformer, Longformer, Big Bird, Performer, Axial Transformers
● Images:
○ iGPT, Vision Transformer (ViT), image processing transformer (IPT), DALL·E
● Memory:
○ Compressive Transformer, memory-augmented models
● many other improvements!
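To make the lower-complexity bullet concrete, here is a minimal NumPy sketch of the Linformer idea: project the length-n key and value sequences down to a fixed length k before attention, so the score matrix is n×k instead of n×n. Dimensions and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1024, 64, 128  # sequence length, head dim, projected length (k << n)

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Linformer-style learned projections that compress the sequence axis of K and V.
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)

K_proj = E @ K  # (k, d)
V_proj = F @ V  # (k, d)

scores = Q @ K_proj.T / np.sqrt(d)  # (n, k) score matrix instead of (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V_proj  # (n, d); cost scales as O(n*k) rather than O(n^2)
print(out.shape)
```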
“Efficient Transformers: A Survey”
https://arxiv.org/abs/2009.06732
5. Benchmarks
SuperGLUE (2019)
“SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”
https://arxiv.org/abs/1905.00537
SuperGLUE
https://super.gluebenchmark.com/leaderboard
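A hedged sketch of pulling a single SuperGLUE task (BoolQ) through the Hugging Face datasets library instead of the benchmark's own download scripts; config names follow the per-task SuperGLUE splits:

```python
from datasets import load_dataset

# SuperGLUE is distributed as per-task configs: "boolq", "cb", "copa", "rte",
# "wic", "wsc", "multirc", "record", plus the diagnostic sets.
boolq = load_dataset("super_glue", "boolq")

print(boolq)              # train / validation / test splits
print(boolq["train"][0])  # e.g. {'question': ..., 'passage': ..., 'label': ...}
```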
6. Conversational
Meena (2.6B model, Google)
“Towards a Conversational Agent that Can Chat About…Anything”
https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
https://arxiv.org/abs/2001.09977
BlenderBot (up to 9.4B model, FB)
“A state-of-the-art open source chatbot”
https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot
https://arxiv.org/abs/2004.13637
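A hedged example of chatting with a distilled public BlenderBot checkpoint through Hugging Face Transformers (the 9.4B model above uses the same recipe at far larger scale); checkpoint name and usage follow the released models:

```python
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"  # small public distillation of BlenderBot
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

utterance = "My friends are cool but they eat too many carbs."
inputs = tokenizer([utterance], return_tensors="pt")
reply_ids = model.generate(**inputs)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```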
Alexa Prize Grand Challenge 3
https://www.amazon.science/latest-news/amazon-announces-2020-alexa-prize-winner-emory-university
Their ultimate goal is to meet the Grand Challenge: earn a composite score of 4.0 or higher (out of 5) from the judges, and have the judges find that at least two-thirds of their conversations with the socialbot in the final round of judging remain coherent and engaging for 20 minutes.
Emora, the Emory University chatbot, earned first place with a 3.81 average rating and an average conversation duration of 7 minutes and 37 seconds.
7. APIs & Cloud
Cloud: The democratization of AI
“Across all these different technology areas, 93 percent are using cloud-based AI capabilities,
while 78 percent employ open-source AI capabilities. For example, online marketplace Etsy
has shifted its AI experimentation to a cloud provider to dramatically increase its
computing power and number of experiments. Learning how to manage and integrate these
disparate tools and techniques is fundamental for success.”
Deloitte’s State of AI in the Enterprise, 3rd Edition
https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/state-of-ai-and-intelligent-automation-in-business-survey.html
Cloud: API for GPT-3
https://openai.com/blog/openai-api/
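A hedged sketch of calling GPT-3 through the 2020-era OpenAI Python client's Completion endpoint (beta access required an API key; the engine name and sampling parameters here are illustrative):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # beta access key from the OpenAI dashboard

# 2020-era Completion endpoint: send a (few-shot) prompt, get a text continuation.
response = openai.Completion.create(
    engine="davinci",  # the largest GPT-3 engine exposed by the API at the time
    prompt="Translate English to French:\n\ncheese => fromage\nhouse =>",
    max_tokens=16,
    temperature=0.3,
)
print(response["choices"][0]["text"])
```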
Cloud: Machine Translation Landscape
https://try.inten.to/mt_report_2020
Old example: NLP cloud APIs / 60+ APIs
Customized models in the cloud
https://cloud.google.com/automl/
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
