NLP Research Papers -- Surya SG
Today's Agenda
• Trends in NLP Research Papers
• Real-Time Example of a Transformer
• Baseline and Overview of Transformers in NLP
• Quick Code Tour of the Transformers Library Features
• Summary of the Models
• Summary of Selected Research Papers
Trends at
ACL 2020
Shifting away from huge labeled datasets
• Unsupervised:
• Yadav et al. propose a retrieval-based QA approach that iteratively refines the query to a KB to retrieve evidence for answering a
certain question. Tamborrino et al. achieve impressive results on commonsense multiple choice tasks by computing a plausibility
score for each answer candidate using a masked LM.
• Data augmentation:
• Fabbri et al. propose an approach to automatically generate (context, question, answer) triplets to train a QA model. They retrieve
contexts that are similar to those in the original dataset, generate yes/no and templated WH questions for these contexts, and train
the model on the synthetic triplets. Jacob Andreas proposes replacing rare phrases with a more frequent phrase that appears in
similar contexts in order to improve compositional generalization in neural networks. Asai and Hajishirzi augment QA training data
with synthetic examples that are logically derived from the original training data, to enforce symmetry and transitivity consistency.
• Meta learning:
• Yu et al. use meta learning to transfer knowledge for hypernymy detection from high-resource to low-resource languages.
• Active learning:
• Li et al. developed an efficient annotation framework for coreference resolution that selects the most valuable samples to annotate
through active learning.
Language models are not all you need — retrieval is back
• Retrieval:
• Two of the invited talks at the Repl4NLP workshop mentioned retrieval-augmented LMs. Kristina Toutanova talked
about Google’s REALM, and about augmenting LMs with knowledge about entities (e.g. here, and here). Mike Lewis
talked about the nearest neighbor LM that improves the prediction of factual knowledge, and Facebook’s RAG
model that combines a generator with a retrieval component.
• Using external KBs:
• This has been commonly done for several years now. Guan et al. enhance GPT-2 with knowledge from commonsense
KBs for commonsense tasks. Wu et al. used such KBs for dialogue generation.
• Enhancing LMs with new abilities:
• Zhou et al. trained a LM to capture temporal knowledge (e.g. on the frequency and duration of events) using training
instances obtained through information extraction with patterns and SRL. Geva and Gupta inject numerical skills into
BERT by fine-tuning it on numerical data generated using templates and textual data that requires reasoning over
numbers.
Explainable
NLP
• It seems that this year looking at attention weights has
gone out of fashion and instead the focus is on generating
textual rationales, preferably ones that are faithful —
• i.e. reflect the discriminative model’s decision. Kumar and
Talukdar predict faithful explanations for NLI by generating
candidate explanations for each label, and using them to
predict the label. Jain et al. develop a faithful explanation
model that relies on post-hoc explanation methods (which
are not necessarily faithful) and heuristics to generate
training data.
• To evaluate explanation models, Hase and Bansal propose
to measure users’ ability to predict model behavior with
and without a given explanation.
Reflecting on current achievements, limitations, and thoughts about the future of NLP
We are solving datasets, not tasks.
There are inherent limitations in
current models and data.
We need to move away from
classification tasks.
We need to learn to handle ambiguity
and uncertainty.
Discussions about ethics (it's complicated)
• Who benefits from the system?
• Who could be harmed by it?
• Can users choose to opt out?
• Does the system enforce or
worsen systemic inequalities?
• Is it generally bettering the world?
1. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
• Original Abstract
• We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
• BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
Summary
• A Google AI team presents a new cutting-edge model for Natural Language Processing
(NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design
allows the model to consider the context from both the left and the right sides of each
word. While being conceptually simple, BERT obtains new state-of-the-art results on
eleven NLP tasks, including question answering, named entity recognition and other tasks
related to general language understanding.
What’s the core idea of this paper?
• Training a deep bidirectional model by randomly masking a percentage of input tokens –
thus, avoiding cycles where words can indirectly “see themselves”.
• Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
• Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
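The masked-LM idea can be demonstrated in a few lines with the Hugging Face Transformers library; this is a minimal illustrative sketch using the public bert-base-uncased checkpoint, not the paper's pre-training code.

```python
# Minimal sketch of BERT's "fill in the blank" objective at inference time,
# using the public bert-base-uncased checkpoint via the Transformers pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidate tokens for the masked position using both left and right context.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```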
What’s the key achievement?
• Advancing the state-of-the-art for 11 NLP tasks, including:
• getting a GLUE score of 80.4%, which is a 7.6% absolute improvement over the previous best result;
• achieving an F1 score of 93.2 on SQuAD v1.1 and outperforming human performance by 2%.
• Suggesting a pre-trained model, which doesn’t require any substantial
architecture modifications to be applied to specific NLP tasks.
What does the AI community think?
• BERT model marks a new era of NLP.
• In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B come after sentence A?”) provide great results for many NLP tasks.
• Pre-training of language models becomes a new standard.
• What are future research areas?
• Testing the method on a wider range of tasks.
• Investigating the linguistic phenomena that may or may not be captured by
BERT.
What are possible business applications?
• BERT may assist businesses with a wide range of NLP problems, including:
• chatbots for better customer experience;
• analysis of customer reviews;
• the search for relevant information, etc.
2. XLNet: Generalized Autoregressive
Pretraining for Language Understanding
• Original Abstract
• With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
• In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and
• (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.
• Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
Summary
• The researchers from Carnegie Mellon University and Google have developed a new
model, XLNet, for natural language processing (NLP) tasks such as reading comprehension,
text classification, sentiment analysis, and others.
• XLNet is a generalized autoregressive pretraining method that leverages the best of both
autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT)
while avoiding their limitations. The experiments demonstrate that the new model
outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on
18 NLP tasks.
What’s the core idea of this paper?
• XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL:
• Like BERT, XLNet uses a bidirectional context, which means it looks at the words before
and after a given token to predict what it should be. To this end, XLNet maximizes the
expected log-likelihood of a sequence with respect to all possible permutations of the
factorization order.
• As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus
avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the
assumption that unmasked tokens are independent of each other.
• To further improve architectural designs for pretraining, XLNet integrates the segment
recurrence mechanism and relative encoding scheme of Transformer-XL.
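As a rough illustration of conditioning on a chosen factorization order, the sketch below uses the public xlnet-base-cased checkpoint from the Hugging Face Transformers library with a permutation mask; it shows a single factorization order at inference time, not the paper's pretraining procedure.

```python
# Sketch: ask XLNet to predict one target position while conditioning on all
# other positions (one factorization order; pretraining averages over many).
import torch
from transformers import XLNetTokenizer, XLNetLMHeadModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

input_ids = tokenizer("The capital of France is Paris", return_tensors="pt").input_ids
seq_len = input_ids.shape[1]
target = seq_len - 3  # position of "Paris"; the last two ids are <sep> and <cls>

perm_mask = torch.zeros(1, seq_len, seq_len)
perm_mask[:, :, target] = 1.0          # no position (including the target) may see the target token
target_mapping = torch.zeros(1, 1, seq_len)
target_mapping[0, 0, target] = 1.0     # request a prediction only at the target position

with torch.no_grad():
    logits = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping).logits
print(tokenizer.decode([logits[0, 0].argmax().item()]))  # the model's guess for the hidden word
```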
What’s the key achievement?
• XLNet outperforms BERT on 20 tasks, often by a large margin.
• The new model achieves state-of-the-art performance on 18 NLP tasks
including question answering, natural language inference, sentiment
analysis, and document ranking.
• What are future research areas?
• Extending XLNet to new areas, such as computer vision and
reinforcement learning.
What does the AI community think?
• The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in
artificial intelligence.
• “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a
new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian
Ruder, a research scientist at DeepMind.
• “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the
latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
What are possible business applications?
XLNet may assist businesses with a wide range of NLP problems, including:
• chatbots for first-line customer support or answering product inquiries;
• sentiment analysis for gauging brand awareness and perception based on customer reviews and social media;
• the search for relevant information in document bases or online, etc.
3. RoBERTa: A Robustly Optimized BERT
Pretraining Approach
• Original Abstract
• Language model pretraining has led to significant performance gains but careful comparison
between different approaches is challenging. Training is computationally expensive, often
done on private datasets of different sizes, and, as we will show, hyperparameter choices have
significant impact on the final results.
• We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures
the impact of many key hyperparameters and training data size. We find that BERT was
significantly undertrained, and can match or exceed the performance of every model
published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
• These results highlight the importance of previously overlooked design choices, and raise
questions about the source of recently reported improvements. We release our models and
code.
Summary
• Natural language processing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and fine-tuning parameters difficult.
• In this study, Facebook AI and University of Washington researchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance.
• Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sentence prediction training objective. The resulting optimized model, RoBERTa (Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
What’s the core idea of this paper?
• The Facebook AI research team found that BERT was significantly undertrained and
suggested an improved recipe for its training, called RoBERTa:
• More data: 160GB of text instead of the 16GB dataset originally used to train BERT.
• Longer training: increasing the number of iterations from 100K to 300K and then
further to 500K.
• Larger batches: 8K instead of 256 in the original BERT base model.
• Larger byte-level BPE vocabulary with 50K subword units instead of character-level
BPE vocabulary of size 30K.
• Removing the next sentence prediction objective from the training procedure.
• Dynamically changing the masking pattern applied to the training data.
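Dynamic masking can be approximated with the Transformers library's data collator, which re-samples the masked positions every time a batch is built; this is an illustrative sketch, not RoBERTa's actual training pipeline.

```python
# Sketch of dynamic masking in the spirit of RoBERTa: the same sentence gets a
# freshly sampled mask each time it is collated into a batch.
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

examples = [tokenizer("Dynamic masking changes the masked positions on every pass over the data.")]
batch_1 = collator(examples)  # masked positions sampled on the fly
batch_2 = collator(examples)  # usually a different mask for the same sentence
print((batch_1["input_ids"] != batch_2["input_ids"]).any().item())
```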
What’s the key achievement?
• RoBERTa outperforms BERT in all individual tasks on the General Language
Understanding Evaluation (GLUE) benchmark.
• The new model matches the recently introduced XLNet model on the GLUE
benchmark and sets a new state of the art in four out of nine individual
tasks.
• What are future research areas?
• Incorporating more sophisticated multi-taskfinetuning procedures.
What are possible business applications?
Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including:
• dialogue systems;
• question answering;
• document classification, etc.
4. Emotion-Cause Pair Extraction: A New Task
to Emotion Analysis in Texts
• Original Abstract
• Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings:
• 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios;
• 2) the way to first annotate emotion and then extract the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document.
• We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conduct emotion-cause pairing and filtering.
• The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.
Summary
• Emotion cause extraction (ECE) is an approach used in natural language processing to
identify statements containing the causes behind vocabulary expressing emotion.
However, ECE requires emotions to first be annotated and ignores mutual relationships
between causes and emotional effects. The researchers sought to solve this problem by
simultaneously identifying pairs of emotions and causes in a task they call emotion-cause
pair extraction (ECPE).
• ECPE uses a two-step approach: the first step uses two multi-task learning networks to
identify emotion and cause clauses, while the second step pairs all causes and emotions,
and uses a trained filter to eliminate pairings that do not contain a causal relationship.
The resulting ECPE task is able to identify emotion-cause pairs at an accuracy on par with
existing ECE methods but without requiring emotion annotation.
Summary
What’s the core idea of this paper?
• The paper introduces a new emotion-cause pair extraction (ECPE) task to overcome the
limitations of the traditional ECE task, where emotion annotation is required prior to cause
extraction and mutual indicativeness of emotion and cause is not taken into account.
• The introduced approach consists of two steps:
• In the first step, the two individual tasks of emotion extraction and cause extraction are
performed via two kinds of multi-task learning networks:
• Inter-EC that uses emotion extraction to improve cause extraction;
• Inter-CE that leverages cause extraction to enhance emotion extraction.
• In the second step, the model combines all elements of the two sets into pairs by
applying a Cartesian product. Then, a logistic regression model is trained to eliminate
pairs that do not contain a causal relationship.
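A toy sketch of the second step is shown below; the pair features, scores, and training data are invented for illustration only, while the paper learns its filter from annotated emotion-cause pairs.

```python
# Toy sketch of ECPE step 2: Cartesian-product pairing of detected emotion and
# cause clauses, followed by a logistic-regression filter over pair features.
from itertools import product
from sklearn.linear_model import LogisticRegression

def featurize(e, c):
    # Hypothetical pair features: clause distance and the two detection scores.
    return [abs(e["idx"] - c["idx"]), e["score"], c["score"]]

# Made-up training pairs for the filter (1 = causally related, 0 = not).
X_train = [[0, 0.9, 0.8], [1, 0.8, 0.7], [5, 0.4, 0.3], [7, 0.2, 0.1]]
y_train = [1, 1, 0, 0]
clf = LogisticRegression().fit(X_train, y_train)

emotions = [{"idx": 2, "score": 0.95}]                           # from step 1
causes = [{"idx": 1, "score": 0.85}, {"idx": 8, "score": 0.30}]  # from step 1
pairs = [(e, c) for e, c in product(emotions, causes)
         if clf.predict_proba([featurize(e, c)])[0, 1] >= 0.5]
print(len(pairs))  # candidate pairs the filter keeps as emotion-cause pairs
```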
What’s the key achievement?
• ECPE is able to achieve F1 scores of 0.83 for emotion extraction, 0.65 for cause extraction, and 0.61 for emotion-cause pairing.
• On the ECE benchmark dataset, ECPE performs on par with existing ECE
methods that require emotion annotation before causal clauses can be
identified.
• What are future research areas?
• Altering the ECPE approach from a two-stepto a one-step process that
directly extracts emotion-cause pairs in an end-to-end fashion.
What are possible business applications?
• Sentiment analysis for marketing campaigns.
• Opinion monitoring from social media.
5. CTRL: A Conditional Transformer Language
Model For Controllable Generation
• Original Abstract
• Large-scale language models show promising text generation capabilities, but users
cannot easily control particular aspects of the generated text. We release CTRL, a 1.6
billion-parameter conditional transformer language model, trained to condition on
control codes that govern style, content, and task-specific behavior.
• Control codes were derived from structure that naturally co-occurs with raw text,
preserving the advantages of unsupervised learning while providing more explicit control
over text generation. These codes also allow CTRL to predict which parts of the training
data are most likely given a sequence.
• This provides a potential method for analyzing large amounts of data via model-based
source attribution. We have released multiple full-sized, pretrained versions of CTRL
at https://www.github.com/salesforce/ctrl.
Summary
• Language models used for text generation are very powerful, but they are often “black
boxes”, so users do not have much control over the output.
• To address this problem, the Salesforce research team has introduced the Conditional Transformer Language (CTRL) model that conditions on a set of control codes. With these codes, the users can control domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
• Moreover, all control codes can be traced back to a specific subset of the training data, allowing CTRL to predict the subset of the training data most likely leveraged for a particular sequence.
• This relationship between CTRL and its training data provides new possibilities for
analyzing the correlations learned from each domain.
What’s the core idea of this paper?
• Text generation tools are very powerful, but they do not give users much control over the
content, style or genre of the generated text.
• The Salesforce research team has released CTRL, a 1.6 billion-parameter conditional
transformer language model, that gives users more control over the generated content:
• CTRL exposes keywords called control codes which allow users to specify a domain,
style, topics, dates, entities, relationships between entities, plot points, and task-
related behavior.
• CTRL is trained on control codes derived from the structure that naturally co-occurs
with the raw text. In particular, CTRL leverages the fact that training data is usually
associated with a URL that contains information relevant to the text it represents.
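The control-code interface looks roughly like the sketch below when loading the released checkpoint through the Hugging Face Transformers library (the 1.6B-parameter model is several gigabytes, so this is illustrative rather than something to run casually); the prompt follows the "Reviews Rating:" style of control code shown in the paper.

```python
# Sketch: steering CTRL's generation with a control-code prefix.
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

prompt = "Reviews Rating: 5.0 I bought these headphones last week"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0]))  # review-style continuation conditioned on the code
```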
What’s the key achievement?
• Introducing and open-sourcing a language model that:
• enables more controllable text generation;
• provides new opportunities for analyzing large amounts of text via
model-based source attribution;
• can be used to detect artificially generated text.
What are future research areas?
• Introducing a greater variety of control codes to allow finer-grained control.
• Extending to other areas of NLP including abstractive summarization and commonsense reasoning.
• Analyzing the relationships between training data and language models.
• Exploring the possibilities to make the interface between humans and language models
more explicit and intuitive.
What are possible business applications?
• Improved and tailored text generation for question-answering systems and other human-computer interaction applications.
• Identifying artificially generated text, to detect malicious uses such as automatically generated essays or fake reviews.
6. ALBERT: A Lite BERT for Self-supervised
Learning of Language Representations
• Original Abstract
• Increasing model size when pretraining natural language representations often results in
improved performance on downstream tasks. However, at some point further model increases
become harder due to GPU/TPU memory limitations, longer training times, and unexpected
model degradation.
• To address these problems, we present two parameter-reduction techniques to lower memory
consumption and increase the training speed of BERT. Comprehensive empirical evidence
shows that our proposed methods lead to models that scale much better compared to the
original BERT.
• We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and
show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best
model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks
while having fewer parameters compared to BERT-large.
Summary
• The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance.
• Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing.
• In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence.
• The experiments demonstrate that the best version of ALBERT sets new state-of-
the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer
parameters than BERT-large.
What’s the core idea of this paper?
• It is not reasonable to further improve language models by making them larger because of
memory limitations of available hardware, longer training times, and unexpected degradation of
model performance with the increased number of parameters.
• To address this problem, the researchers introduce the ALBERT architecture that incorporates
two parameter-reduction techniques:
• factorized embedding parameterization, where the size of the hidden layers is separated
from the size of vocabulary embeddings by decomposing the large vocabulary-embedding
matrix into two small matrices;
• cross-layer parameter sharing to prevent the number of parameters from growing with the
depth of the network.
• The performance of ALBERT is further improved by introducing the self-supervised loss
for sentence-order prediction to address BERT’s limitations with regard to inter-sentence
coherence.
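The factorized embedding idea can be illustrated in a few lines of PyTorch; this is a conceptual sketch with assumed sizes (V=30000, E=128, H=4096), not the official implementation.

```python
# Sketch of factorized embedding parameterization: a V x E table plus an E x H
# projection replaces the full V x H embedding matrix (with E << H).
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=4096):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.projection = nn.Linear(embed_dim, hidden_dim)          # E x H

    def forward(self, token_ids):
        return self.projection(self.word_embeddings(token_ids))

factorized = FactorizedEmbedding()
full = nn.Embedding(30000, 4096)  # the unfactorized V x H alternative
print(sum(p.numel() for p in factorized.parameters()))  # ~4.4M parameters
print(sum(p.numel() for p in full.parameters()))        # ~122.9M parameters
```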
What’s the key achievement?
• With the introduced parameter-reduction techniques, the ALBERT
configuration with 18× fewer parameters and 1.7× faster training compared
to the original BERT-large model achieves only slightly worse performance.
• The much larger ALBERT configuration, which still has fewer parameters
than BERT-large, outperforms all of the current state-of-the-art language models by getting:
• 89.4% accuracy on the RACE benchmark;
• 89.4 score on the GLUE benchmark; and
• An F1 score of 92.2 on the SQuAD 2.0 benchmark.
What are possible business applications?
The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.
7. Explain Yourself! Leveraging Language
Models for Commonsense Reasoning
• Original Abstract
• Deep learning models perform poorly on tasks that require commonsense reasoning, which
often necessitates some form of world-knowledge or reasoning over information not
immediately present in the input.
• We collect human explanations for commonsense reasoning in the form of natural language
sequences and highlighted annotations in a new dataset called Common Sense Explanations
(CoS-E). We use CoS-E to train language models to automatically generate explanations that
can be used during training and inference in a novel Commonsense Auto-Generated
Explanation (CAGE) framework.
• CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We
further study commonsense reasoning in DNNs using both human and auto-generated
explanations including transfer to out-of-domain tasks. Empirical results indicate that we can
effectively leverage language models for commonsense reasoning.
Summary
• Natural language processing algorithms are limited to information contained in texts, and often these algorithms lack commonsense reasoning that allows them to make inferences as most humans do.
• The Salesforce research team suggests addressing this problem by training the language model to automatically generate commonsense explanations. This task is accomplished by providing the model with human explanations alongside the question answering samples.
• These autogenerated explanations are then used by a neural network to solve the CommonsenseQA (CQA) task. This two-step approach improved accuracy on the CommonsenseQA multiple-choice test by 10% compared to existing models.
What’s the core idea of this paper?
• Natural language processing struggles with inference based on common sense and
real-world knowledge.
• The paper suggests addressing this issue in two phases:
• First, the researchers train the model to generate Common Sense Explanations
(CoS-E) by providing human-generated explanations in the form of both open-
ended sentences and highlighted span annotations, alongside Commonsense
Question Answering (CQA) examples.
• In the second phase, the authors use this trained language model to generate
explanations for each sample in the training and validation sets. These
Commonsense Auto-Generated Explanations (CAGE) are then leveraged to
solve the CQA task.
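A rough sketch of the two phases is shown below; it uses plain GPT-2 as a stand-in for the fine-tuned explanation language model and an invented prompt and classifier input, so it only illustrates the data flow, not the paper's models or prompt format.

```python
# Conceptual sketch of the two phases: (1) generate an explanation with a
# language model, (2) feed question + explanation to the answer classifier.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")   # stand-in for the fine-tuned explanation LM
lm = GPT2LMHeadModel.from_pretrained("gpt2")

question = "Where would you see a concert and find a ticket booth?"
choices = ["clothing store", "auditorium", "classroom"]
prompt = f"{question} The choices are {', '.join(choices)}. A commonsense explanation is that"

ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=20, do_sample=False, pad_token_id=tok.eos_token_id)
explanation = tok.decode(out[0][ids.shape[1]:])

# Phase 2: the generated explanation is appended to the classifier's input.
classifier_input = f"{question} [SEP] {explanation.strip()}"
print(classifier_input)
```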
What’s the key achievement?
• The explanation-generating model improves performance in a natural
language reasoning test by 10% over the previous best model and improves
understanding of how neural networks apply knowledge.
• Moreover, the experiments demonstrate that the introduced approach can
be successfullytransferred to out-of-domain datasets.
What are future research areas?
• Combining the explanation-generating model into an answer prediction model.
• Extending the dataset of explanations to other tasks to create a more general explanatory language model.
• Removing bias from training datasets to eliminate bias in generated explanations.
What are possible business applications?
• The model with improved common-sense reasoning capabilities can be leveraged:
• to provide better customer service via chatbots;
• to improve the performance of information retrieval systems.
8. Detecting Concealed Information in Text
and Speech
• Original Abstract
• Motivated by infamous cheating scandals in various industries and political events, we address the problem of detecting concealed information in technical settings.
• In this work, we explore acoustic-prosodic and linguistic indicators of information concealment by collecting a unique corpus of professionals practicing for oral exams while concealing information.
• We reveal subtle signs of concealed information in speech and text, compare, and contrast them with those in deception detection literature, thus uncovering the link between concealing information and deception.
• We then present a series of experiments that automatically detect concealed information from text and speech. We compare the use of acoustic-prosodic, linguistic, and individual feature sets, using different machine learning models. Finally, we present a multi-task learning framework with acoustic, linguistic, and individual features, that outperforms human performance by over 15%.
Summary
• When confidential information is leaked, it is often difficult to tell who originally
obtained the leaked information and who it has been leaked to. Even though previous
work has demonstrated that changes in voice tone, lexicon, and speech patterns can
identify when someone is concealing information, research in this subject area is very
scarce. It is partly due to the lack of datasets that include ground truth labels indicating
information concealment.
• To address this issue, the present study introduces a new dataset collected from a
unique audio corpus of professional wine tasters practicing for oral exams while
concealing information. By leveraging this dataset, the researcher was able to develop a
new multi-task learning model for detecting concealed information that performs 11%
better than baseline models and 15% better than humans.
What’s the core idea of this paper?
While there are machine learning-based methods for detecting when someone does not
have information but pretends to, there are few comparable models for detecting when
someone is concealing leaked information.
In this study, Hu from Cornell University captured linguistic and acoustic-prosodic features
from a controlled human experiment to create a dataset of speech patterns when people
were speaking honestly and when they were concealing some information.
The author leverages this dataset to develop a multi-task learning framework where, as
well as identifying concealed information, the system is also predicting whether the
speaker’s answer is correct and the identity of the wine.
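A schematic PyTorch sketch of such a multi-task setup is shown below; the feature sizes, encoder, and heads are assumptions for illustration and do not reproduce the author's model.

```python
# Sketch: a shared encoder over acoustic-prosodic + linguistic features with
# three task heads (concealment, answer correctness, wine identity).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskConcealmentModel(nn.Module):
    def __init__(self, n_features=64, hidden=128, n_wines=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.concealment_head = nn.Linear(hidden, 2)     # concealing vs. not
        self.correctness_head = nn.Linear(hidden, 2)     # answer correct vs. not
        self.identity_head = nn.Linear(hidden, n_wines)  # which wine is described

    def forward(self, features):
        h = self.encoder(features)
        return self.concealment_head(h), self.correctness_head(h), self.identity_head(h)

model = MultiTaskConcealmentModel()
batch = torch.randn(4, 64)                        # 4 utterances' feature vectors
targets = [torch.zeros(4, dtype=torch.long)] * 3  # dummy labels for the three tasks
loss = sum(F.cross_entropy(logits, t) for logits, t in zip(model(batch), targets))
loss.backward()  # the shared encoder receives gradients from all three tasks
```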
What’s the key achievement?
A multi-task learning model outperformed baseline models by 11%
and humans by 15% at detecting when someone is concealing
information.
Moreover, the introduced framework outperforms humans even in
the case where some of the humans in the experiment knew one
another and could read social cues (e.g. gestures) that are not
available to the model.
What are future research areas?
• Studying individual differences in both detecting concealed information and concealing information.
• Exploring the predictive power of phonotactic variation features.
• Conducting domain adaptation with regards to detecting concealed information.
• Improving the scalability of the multi-task learning model.
What are possible business applications?
• Detecting insider trading in financial markets.
• Controlling data leaks within different testing procedures.
• Tracing and limiting the extent of information leaks around political campaigns.
9. Improving Visual Question Answering by
Referring to Generated Paragraph Captions
• Original Abstract
• Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering.
• Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match).
• Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion).
• Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.
Summary
• Computer models struggle with answering questions about visual images, a
task known as visual question answering (VQA).
• In this study, the researchers sought to improve VQA performance by
providing a VQA model with a text description of an image’s content
produced by a paragraph captioning model.
• The two models were fused over three stages to generate a consensus
answer to questions posed about the image.
• The resulting visual and textual question answering (VTQA) model was 1.92% more accurate than the standalone VQA model.
What’s the core idea of this paper?
• VQA models struggle with identifying all of the necessary information in images, and particularly the abstract concepts, required to answer questions.
• The researchers suggest using a pre-trained paragraph captioning model to provide additional information to the VQA model.
• The text and image input are fused at three levels:
• in the early fuse stage, visual features are fused with paragraph caption and object property features by cross-attention;
• in the late fuse stage, the inputs are fused again in the form of consensus, i.e. logits from each module are integrated into one vector;
• in the later fuse stage, the model accounts for the fact that some regions of the image are more likely to draw people’s attention, and thus questions and answers are more likely to be related to those regions. So, the model gives an extra score to the answers related to the salient regions.
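The late and later fusion steps reduce to simple score arithmetic, roughly as in the sketch below; the answer-vocabulary size, logits, and salient-answer indices are invented for illustration and do not come from the paper.

```python
# Sketch of late fusion (consensus over module logits) followed by the extra
# score for answers tied to salient image regions.
import torch

visual_logits = torch.randn(1, 1000)   # from the image-based module (1000 answer classes)
textual_logits = torch.randn(1, 1000)  # from the paragraph-caption module

consensus = visual_logits + textual_logits  # late fusion: combine module logits
salience_bonus = torch.zeros(1, 1000)
salience_bonus[0, [3, 17]] = 1.0            # answers linked to salient regions (invented indices)
final_scores = consensus + salience_bonus   # "later fusion": extra score before selection
print(final_scores.argmax(dim=-1).item())
```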
What’s the key achievement?
• Improving visual question answering performance by 1.92% compared to the baseline
VQA model.
What are future research areas?
• Improving VTQA models to extract more information from textual captions, and enhancing paragraph captioning models to generate better captions.
• Training the VTQA model jointly with the paragraph captioning model.
What are possible business applications?
• Improving image search and retrieval.
• Image annotation and interactivity for blind people.
• Creating “interactive” images for online education.
10. Thieves on Sesame Street! Model
Extraction of BERT-based APIs
• Original Abstract
• We study the problem of model extraction in natural language processing, in which an adversary with only query
access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary
and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that
the adversary does not need any real training data to successfully mount the attack.
• In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random
sequences of words coupled with task-specific heuristics form effective queries for model extraction on a
diverse set of NLP tasks, including natural language inference and question answering.
• Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the
NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only
slightly worse than the victim model.
• Finally, we study two defense strategies against model extraction—membership classification and API
watermarking—which while successful against naive adversaries, are ineffective against more sophisticated
ones.
Summary
• This paper highlights an exploit only made feasible by the shift towards transfer learning
methods within the NLP community: for a query budget of a few hundred dollars, an
attacker can extract a model that performs only slightly worse than the victim model on
SST2, SQuAD, MNLI, and BoolQ. On the SST2 task, the victim model had a 93.1% accuracy compared to their extracted model’s 90.1%.
• They show that an adversary does not need any real training data to mount the attack
successfully. The attacker does not even need to use grammatical or semantically
meaningful queries. They used random sequences of words coupled with task-specific
heuristics to form useful queries for model extraction on a diverse set of NLP tasks.
Summary
• Why It Matters: Outputs of modern NLP APIs on nonsensical text
provide strong signals about model internals, allowing adversaries to
train their own models and avoid paying for the API.
What’s the core idea of this paper?
• DEFENSES
• MEMBERSHIP CLASSIFICATION
• Our first defense uses membership inference, which is traditionally used to determine whether a classifier was trained on a particular input point.
• In our setting we use membership inference for “outlier detection”, where nonsensical and ungrammatical inputs (which are unlikely to be issued by a legitimate user) are identified.
• When such out-of-distribution inputs are detected, the API issues a random output instead of the model’s predicted output, which eliminates the extraction signal.
• WATERMARKING
• in which a tiny fraction of queries are chosen at random and modified to return a wrong output.
• These “watermarked queries” and their outputs are stored on the API side. Since deep neural networks have the ability to memorize arbitrary information, this defense anticipates that extracted models will memorize some of the watermarked queries, leaving them vulnerable to post-hoc detection if they are deployed publicly.
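The watermarking defense can be sketched as below; this is an illustrative reimplementation of the idea as described, not the authors' code, and `true_predict` / `suspect_predict` are hypothetical callables standing in for the victim API and a suspected extracted model.

```python
# Sketch: return a deliberately wrong label for a tiny fraction of queries and
# log them; an extracted model that memorized these outputs can be detected later.
import random

WATERMARK_RATE = 0.001
watermark_log = []  # (query, wrong_label) pairs kept on the API side

def answer_with_watermark(query, true_predict, num_labels):
    label = true_predict(query)
    if random.random() < WATERMARK_RATE:
        wrong = random.choice([l for l in range(num_labels) if l != label])
        watermark_log.append((query, wrong))
        return wrong
    return label

def watermark_hit_rate(suspect_predict):
    # A suspiciously high hit rate suggests the suspect model was extracted from the API.
    hits = sum(suspect_predict(q) == y for q, y in watermark_log)
    return hits / max(len(watermark_log), 1)
```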
What’s the key achievement?
• Our results show that fine-tuning large pretrained language models
simplifies the process of extraction for an attacker.
• Unfortunately, existing defenses against extraction, while effective in some
scenarios, are generally inadequate, and further research is necessary to
develop defenses robust in the face of adaptive adversaries who develop counter-attacks anticipating simple defenses.
What are future research areas?
• Other interesting future directions that follow from the results in this paper
include
• (1) leveraging nonsensical inputs to improve model distillation on tasks for
which it is difficult to procure input data;
• (2) diagnosing dataset complexity by using query efficiency as a proxy; and
• (3) further investigation of the agreement between victim models as a
method to identify proximity in input distribution and its incorporation into
an active learning setup for model extraction.
What are possible business applications?
• Protecting paid APIs from possible model theft.
• Decision analysis on API cost models for NLU and NLG.
11. WinoGrande: An Adversarial Winograd
Schema Challenge at Scale
• Original Abstract
• The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of
273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional
preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on
variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or
whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
• To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but
adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully
designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-
detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve
59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed.
• Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef
(85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande
when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true
capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in
existing and future benchmarks to mitigate such overestimation.
Summary
The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande, a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm.
The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
What’s the core idea of this paper?
• The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and
annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense
reasoning.
• They introduce WinoGrande, a new large-scale dataset for commonsense reasoning. Their approach has two key
features:
• A carefully designed crowdsourcing procedure:
• Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain
certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers.
• Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected
questions, 53K were deemed valid.
• A novel algorithm AfLite for systematic bias reduction:
• It generalizes human-detectable biases based on word occurrences to machine-detectable biases
based on embedding occurrences.
• After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples.
What’s the key achievement?
• WinoGrande is easy for humans and challenging for machines:
• Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%);
• RoBERTa achieves 79.1% test-set accuracy;
• whereas human performance achieves 94% accuracy.
• WinoGrande is also an effective resource for transfer learning. The RoBERTa-based model fine-tuned on
WinoGrande achieved a new state of the art on WSC and four other related datasets:
• 90.1% on WSC;
• 93.1% on DPR;
• 90.6% on COPA;
• 85.6% on KnowRef; and
• 97.1% on Winogender.
What are future research areas?
• Exploring new algorithmic approaches for systematicbias reduction.
• Debiasing other NLP benchmarks.
12. Exploring the Limits of Transfer Learning
with a Unified Text-to-Text Transformer
• Original Abstract
• Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
• In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
• By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
Summary
• The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem.
• Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
What’s the core idea of this paper?
• The paper has several important contributions:
• Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing
techniques.
• Introducing a new approach to transfer learning in NLP by suggesting to treat every NLP problem as a text-to-
text task:
• The model understands which task should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”).
• Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text,
the Colossal Clean Crawled Corpus (C4).
• Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
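The text-to-text interface with task prefixes looks like the sketch below when using a small public T5 checkpoint from the Transformers library (t5-small here only to keep the example light; it is not the 11B model from the paper).

```python
# Sketch: the same T5 model handles different tasks purely via the input prefix.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The quick brown fox jumped over the lazy dog again and again all afternoon.",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```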
What’s the key achievement?
• The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out
of 24 tasks considered, including:
• the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI
tasks;
• the Exact Match score of 90.06 on the SQuAD dataset;
• the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
• the ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.
What are future research areas?
• Researching the methods to achieve stronger performance with cheaper models.
• Exploring more efficient knowledge extraction techniques.
• Further investigating the language-agnosticmodels.
What are possible business applications?
• Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
13. Reformer: The Efficient Transformer
• Original Abstract
• Large Transformer models routinely achieve state-of-the-art results on a number of tasks
but training these models can be prohibitively costly, especially on long sequences. We
introduce two techniques to improve the efficiency of Transformers.
• For one, we replace dot-product attention by one that uses locality-sensitive hashing,
changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
• Furthermore, we use reversible residual layers instead of the standard residuals, which
allows storing activations only once in the training process instead of N times, where N is
the number of layers. The resulting model, the Reformer, performs on par with
Transformer models while being much more memory-efficient and much faster on long
sequences.
Summary
• The leading Transformer models have become so big that they can be realistically trained only in
large research laboratories. To address this problem, the Google Research team introduces
several techniques that improve the efficiency of Transformers. In particular, they suggest
• (1) using reversible layers to allow storing the activations only once instead of for each layer,
and
• (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-
product attention. Experiments on several text tasks demonstrate that the
introduced Reformer model matches the performance of the full Transformer but runs much
faster and with much better memory efficiency.
Summary
Figure: Locality-Sensitive Hashing Attention, showing the hash-bucketing, sorting, and chunking steps, the resulting causal attentions, and the corresponding attention matrices (a–d).
What’s the core idea of this paper?
The leading Transformer models require huge
computational resources because of the very high number
of parameters and several other factors:
• The activations of every layer need to be stored for back-propagation.
• The intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations.
• The complexity of attention on a sequence of length L is O(L^2).
To address these problems, the research team introduces
the Reformer model with the following improvements:
• using reversible layers to store only a single copy of activations;
• splitting activations inside the feed-forward layers and processing them in chunks;
• approximating attention computation based on locality-sensitive hashing.
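The locality-sensitive hashing step can be sketched conceptually as below; this is a simplified, single-round version of the angular LSH scheme, whereas the real model hashes shared query/key vectors over multiple rounds and attends within chunks of sorted buckets.

```python
# Conceptual sketch of LSH bucketing: positions that hash to the same bucket
# attend to each other, so full O(L^2) attention is never materialized.
import torch

def lsh_buckets(x, n_buckets=8, seed=0):
    """x: (seq_len, dim) shared query/key vectors; returns one bucket id per position."""
    torch.manual_seed(seed)
    projections = torch.randn(x.shape[-1], n_buckets // 2)
    rotated = x @ projections                         # random projection
    rotated = torch.cat([rotated, -rotated], dim=-1)  # angular LSH trick
    return rotated.argmax(dim=-1)                     # (seq_len,)

vectors = torch.randn(16, 64)
print(lsh_buckets(vectors).tolist())  # attention is then restricted within each bucket
```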
What’s the key achievement?
• By analyzing the introduced techniques one by one, the authors show that model accuracy
is not sacrificed by:
• switching to locality-sensitive hashing attention;
• using reversible layers.
• Reformer performs on par with the full Transformer model while demonstrating much
higher speed and memory efficiency:
• For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani et al.’s (2017) BLEU score of 27.3.
What are possible business applications?
• The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as:
• text generation;
• visual content generation;
• music generation;
• time-series forecasting.
14. Longformer: The Long-Document
Transformer
• Original Abstract
• Transformer-based models are unable to process long sequences due to their self-attention
operation, which scales quadratically with the sequence length. To address this limitation, we
introduce the Longformer with an attention mechanism that scales linearly with sequence
length, making it easy to process documents of thousands of tokens or longer.
• Longformer’s attention mechanism is a drop-in replacement for the standard self-attention
and combines a local windowed attention with a task motivated global attention. Following
prior work on long-sequence transformers, we evaluate Longformer on character-level
language modeling and achieve state-of-the-art results on text8 and enwik8.
• In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of
downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long
document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.
Summary
• Self-attention is one of the key factors behind the success of Transformer architecture. However,
it also makes transformer-based models hard to apply to long documents. The existing
techniques usually divide the long input into a number of chunks and then use complex
architectures to combine information across these chunks.
• The research team from the Allen Institute for Artificial Intelligence introduces a more elegant
solution to this problem. The suggested Longformer model employs an attention pattern that
combines local windowed attention with task-motivated global attention.
• This attention mechanism scales linearly with the sequence length and enables processing of
documents with thousands of tokens. The experiments demonstrate that Longformer achieves
state-of-the-art results on character-level language modeling tasks, and when pre-trained,
consistently outperforms RoBERTa on long-document tasks.
Summary
Figure: Full self-attention pattern vs. Longformer’s configuration of attention patterns.
What’s the core idea of this paper?
• The computational requirements of self-attention grow quadratically with sequence length, making it hard to
process on current hardware.
• To address this issue, the researchers present Longformer, a modified version of Transformer architecture that:
• allows memory usage to scale linearly, and not quadratically, with the sequence length;
• includes an attention mechanism that combines:
• a windowed local-context self-attention to build contextual representations;
• an end task motivated global attention to encode inductive bias about the task and build full sequence
representation.
• Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication
that is not supported in the existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce
a custom CUDA kernel for implementing these attention operations.
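In the Transformers library, the task-motivated global attention is exposed through a global_attention_mask; the sketch below (using the public allenai/longformer-base-4096 checkpoint) marks only the first token as global, a common choice for classification-style tasks.

```python
# Sketch: running Longformer over a long input with global attention on the first token.
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "A fairly long document. " * 400  # well beyond a 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # the first token attends to, and is attended by, every position

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```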
What’s the key achievement?
• The Longformer model achieves a new state of the art on character-level language modeling tasks:
• BPC (bits per character) of 1.10 on text8;
• BPC of 1.00 on enwik8.
• After pre-training and fine-tuning for six tasks, including classification, question answering, and
coreference resolution, the Longformer-base consistently outperforms the RoBERTa-base with:
• accuracy of 75.0 vs. 72.4 on WikiHop;
• F1 score of 75.2 vs. 74.2 on TriviaQA;
• joint F1 score of 64.4 vs. 63.5 on HotpotQA;
• average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
• accuracy of 95.7 vs. 95.3 on the IMDB classification task;
• F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
• The performance gains are especially remarkable for the tasks that require a long context (i.e.,
WikiHop and Hyperpartisan).
What are future research areas?
• Exploring other attention patterns that are more efficient due to dynamic adaptation to the input.
• Applying Longformer to other relevant long document tasks such as summarization.
What are possible business applications?
• The Longformer architecture can be very
advantageous for the downstream NLP tasks that
often require processing of long documents:
• document classification;
• question answering;
• coreference resolution;
• summarization;
• semantic search.
15. ELECTRA: Pre-training Text Encoders as
Discriminators Rather Than Generators
• Original Abstract
• Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
• As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network.
• Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
• As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
Summary
• The pre-training task for popular language models like BERT and XLNet involves masking a
small subset of unlabeled input and then training the network to recover this original
input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%).
• As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement.
• As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
What’s the core idea of this paper?
• Pre-training methods that are based on masked language modeling are computationally
inefficient as they use only a small fraction of tokens for learning.
• Researchers propose a new pre-training task called replaced token detection, where:
• some tokens are replaced by samples from a small generator network;
• a model is pre-trained as a discriminator to distinguish between original and replaced
tokens.
• The introduced approach, called ELECTRA (Efficiently Learning an Encoder
that Classifies Token Replacements Accurately):
• enables the model to learn from all input tokens instead of the small masked-out
subset;
• is not adversarial, despite the similarity to GAN, as the generator producing tokens
for replacement is trained with maximum likelihood.
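As a hedged illustration (assuming the Hugging Face Transformers library and the publicly released google/electra-small-discriminator checkpoint), the pre-trained discriminator can be queried directly to flag which tokens look replaced:

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# A sentence where a plausible word has been swapped in ("flew" instead of "cooked").
sentence = "The chef flew the meal in the kitchen"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one logit per token: replaced vs. original

predictions = (logits > 0).int().squeeze().tolist()
# Skip the [CLS] and [SEP] positions when aligning with the word tokens.
for token, is_replaced in zip(tokenizer.tokenize(sentence), predictions[1:-1]):
    print(f"{token:>10s}  {'REPLACED' if is_replaced else 'original'}")
```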
What’s the key achievement?
• Demonstrating that the discriminative task of distinguishing between real data and
challenging negative samples is more efficient than existing generative methods for
language representation learning.
• Introducing a model that substantially outperforms state-of-the-art approaches while
requiring less pre-training compute:
• ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT
model with a score of 75.1 and a much larger GPT model with a score of 78.8.
• An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25%
of their pre-training compute.
• ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and
SQuAD benchmarks while still requiring less pre-training compute.
What are possible business applications?
• Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
16. Language Models are Few-Shot Learners
Summary
• The OpenAI research team draws attention to the fact that the need for a labeled dataset
for every new language task limits the applicability of language models.
• Considering that there is a wide range of possible tasks and it’s often difficult to collect a
large labeled training dataset, the researchers suggest an alternative solution, which is
scaling up language models to improve task-agnostic few-shot performance.
• They test their solution by training a 175B-parameter autoregressive language model,
called GPT-3, and evaluating its performance on over two dozen NLP tasks. The
evaluation under few-shot learning, one-shot learning, and zero-shot learning
demonstrates that GPT-3 achieves promising results and even occasionally outperforms
the state of the art achieved by fine-tuned models.
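A minimal sketch of the in-context few-shot setup described above (prompt construction only; the actual completion call is model- and provider-specific and omitted here). The example pairs mirror the paper's English-to-French illustration.

```python
def few_shot_prompt(task_description: str, examples, query: str) -> str:
    """Pack k labeled demonstrations into the prompt; no gradient update is performed."""
    lines = [task_description, ""]
    for src, tgt in examples:                 # k labeled demonstrations
        lines.append(f"Input: {src}\nOutput: {tgt}\n")
    lines.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```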
What’s the
core idea of
this paper?
Sparse Transformer
What’s the
key
achievement?
What does
the AI
community
think?
• Sam Altman, CEO and co-founder of OpenAI
• Abubakar Abid, CEO and founder of Gradio
• Gary Marcus, CEO and founder of Robust.ai
• Geoffrey Hinton, Turing Award winner
What are future research areas?
• Improving pre-training sample efficiency.
What are possible business applications?
17. Beyond Accuracy: Behavioral Testing of
NLP models with CheckList
• Original Abstract
• Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors.
• Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly.
• We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Summary
• The authors point out the shortcomings of existing approaches to evaluating performance
of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate
where the model is failing and how to fix it. The alternative evaluation approaches usually
focus on individual tasks or specific capabilities.
• To address the lack of comprehensive evaluation approaches, the researchers
introduce CheckList, a new evaluation methodology for testing of NLP models. The
approach is inspired by principles of behavioral testing in software engineering.
• Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test
ideation. Multiple user studies demonstrate that CheckList is very effective at discovering
actionable bugs, even in extensively tested NLP models.
What’s the
core idea
of this
paper?
Existing approaches to evaluation of NLP models have many significant
shortcomings:
• The primary approach to the evaluation of models’ generalization capabilities, which is accuracy
on held-out data, may lead to performance overestimation, as the held-out data often contains
the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much
in figuring out where the NLP model is failing and how to fix these bugs.
• The alternative approaches are usually designed for evaluation of specific behaviors on
individual tasks and thus, lack comprehensiveness.
To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by behavioral testing in software engineering:
• CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named
entity recognition, and negation.
• Then, to break down potential capability failures into specific behaviors, CheckList suggests
different test types, such as prediction invariance or directional expectation tests in case of
certain perturbations.
• Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily (a minimal hand-rolled invariance-test sketch follows this list).
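The sketch below is a hand-rolled, hypothetical example of a CheckList-style invariance (INV) test for sentiment analysis; predict_sentiment is a stand-in for any model under test, and the official checklist package offers much richer templating and perturbation utilities.

```python
from typing import Callable, List

def invariance_test(predict: Callable[[str], str],
                    template: str,
                    fillers: List[str]) -> List[str]:
    """Fill a template with values that should NOT change the label
    (here: person names) and report any prediction flips."""
    predictions = {name: predict(template.format(name=name)) for name in fillers}
    baseline = next(iter(predictions.values()))
    return [f"{name}: {label}" for name, label in predictions.items()
            if label != baseline]

# hypothetical toy model wrapper under test
def predict_sentiment(text: str) -> str:
    return "positive" if "great" in text.lower() else "negative"

failures = invariance_test(
    predict_sentiment,
    template="{name} thought the flight was great.",
    fillers=["Mary", "Ahmed", "Keiko", "Diego"],
)
print("Invariance failures:", failures or "none")
```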
What’s the
key
achievement?
What does
the AI
community
think?
• The paper received the Best
Paper Award at ACL 2020, the
leading conference in natural
language processing.
What are possible business applications?
• CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
• Such comprehensive testing that
helps in identifying many actionable
bugs is likely to lead to more robust
NLP systems.
18. Tangled up in BLEU: Reevaluating the Evaluation of
Automatic Machine Translation Evaluation Metrics
• Original Abstract
• Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.
• We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy.
• Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected.
• Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
Summary
• The most recent Conference on Machine Translation (WMT) has revealed that,
based on Pearson’s correlation coefficient, automatic metrics poorly match human
evaluations of translation quality when comparing only a few best systems. Even
negative correlations were exhibited in some instances.
• The research team from the University of Melbourne investigates this issue by
studying the role of outlier systems, exploring how the correlation coefficient
reflects different patterns of errors (type I vs. type II errors), and what magnitude of
difference in the metric score corresponds to true improvements in translation
quality as judged by humans.
• Their findings suggest that small BLEU differences (i.e., 1–2 points) have little
meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over
BLEU. However, only human evaluations can be a reliable basis for drawing
important empirical conclusions.
What’s the core idea of this paper?
• Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
• However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
• For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
• The authors of this paper take a closer look at this problem and discover that:
• The identified problem with Pearson's correlation is due to the small sample size and not specific to comparing strong MT systems.
• Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
• The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
• Small BLEU differences of 1–2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
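A toy illustration (synthetic numbers, not taken from the paper) of the outlier effect described above, assuming SciPy is available:

```python
from scipy.stats import pearsonr

# hypothetical (metric_score, human_score) pairs for closely matched systems
metric = [27.1, 27.4, 27.6, 27.9, 28.2]
human = [0.12, 0.18, 0.10, 0.20, 0.15]   # only weakly related to the metric
print("without outlier: r = %.2f" % pearsonr(metric, human)[0])

# adding one clearly worse outlier system inflates the correlation
metric_out = metric + [15.0]
human_out = human + [-0.90]
print("with outlier:    r = %.2f" % pearsonr(metric_out, human_out)[0])
```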
What’s the key achievement?
• Conducting a thorough analysis of automatic evaluation metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems (a small scoring sketch follows this list):
• Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU
and TER.
• Moving away from using small changes in evaluation metrics as the sole basis to
draw important empirical conclusions, and always ensuring support from human
evaluations before claiming that one MT system significantly outperforms
another one.
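A hedged sketch, assuming the sacrebleu package, of scoring the same system outputs with BLEU, chrF, and TER rather than relying on BLEU alone:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["the cat sat on the mat", "he read the book quickly"]
references = [["the cat sat on the mat", "he read the book fast"]]  # one reference stream

# Report several metrics side by side instead of drawing conclusions from BLEU only.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```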
19. Towards a Human-like Open-Domain Chatbot
Summary
(Figure: Example of Meena generating a response, "The Next Generation" — Google AI Blog)
What’s the
core idea of
this paper?
Evolved
Transformer
What’s the key
achievement?
What does
the AI
community
think?
• Elliot Turner, CEO and founder of Hyperia
• Graham Neubig, Associate professor at Carnegie Mellon University
What are future research areas?
What are possible business applications?
The authors suggest some interesting applications for open-domain chatbots such as Meena:
• further humanizing computer interactions;
• improving foreign language practice;
• making interactive movie and video game characters relatable.
20. Recipes for Building an Open-Domain
Chatbot
• Original Abstract
• Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot.
• Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona.
• We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available.
• Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Summary
• The Facebook AI Research team shows that with appropriate training data and
generation strategy, large-scale models can learn many important
conversational skills, such as engagingness, knowledge, empathy, and persona
consistency. Thus, to build their state-of-the-art conversational agent,
called BlenderBot, they leveraged a model with 9.4B parameters, trained it on
a novel task called Blended Skill Talk, and deployed beam search with carefully
selected hyperparameters as a generation strategy.
• Human evaluations demonstrate that BlenderBot outperforms Meena in
pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in
terms of humanness.
What’s the core idea of this paper?
• The introduced recipe for building a state-of-the-art open-domain chatbot includes three key ingredients (a minimal decoding sketch follows this list):
• Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
• Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
• Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, it was demonstrated that the length of the agent's utterances is very important for chatbot performance (i.e., too short responses are often considered dull, while too long responses make the chatbot appear to waffle and not listen).
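As a hedged illustration of the decoding recipe (assuming the Hugging Face Transformers library and the publicly released facebook/blenderbot-400M-distill checkpoint, a distilled variant rather than the paper's 2.7B/9.4B models), beam search is combined with a minimum generation length so that replies are neither too short nor rambling:

```python
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("My dog just learned a new trick!", return_tensors="pt")
reply_ids = model.generate(
    **inputs,
    num_beams=10,     # beam search decoding
    min_length=20,    # discourage overly short, dull replies
    max_length=60,    # cap length so the bot does not ramble
)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```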
What’s the key achievement?
The introduced chatbot outperforms the previous best-performing open-domain chatbot, Meena. In pairwise match-ups, BlenderBot with 2.7B parameters wins:
• 75% of the time in terms of
engagingness;
• 65% of the time in terms of
humanness.
In an A/B comparison between
human-to-human and human-to-
BlenderBot conversations, the latter
were preferred 49% of the time as
more engaging.
What are future research areas?
Thanks for listening

More Related Content

What's hot

Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxDeep Learning Italia
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptxTuCaoMinh2
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understandinggohyunwoong
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithmszamakhan
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep LearningNatasha Latysheva
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 

What's hot (20)

BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 
BERT
BERTBERT
BERT
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
NLP
NLPNLP
NLP
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithms
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep Learning
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 

Similar to Nlp research presentation

IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET Journal
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERTAbdurrahimDerric
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyNUPUR YADAV
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...IAESIJAI
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...IJCI JOURNAL
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdfAnime196637
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniquesiosrjce
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Fwdays
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONijaia
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
 

Similar to Nlp research presentation (20)

LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
 
short_story.pptx
short_story.pptxshort_story.pptx
short_story.pptx
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERT
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
 
1808.10245v1 (1).pdf
1808.10245v1 (1).pdf1808.10245v1 (1).pdf
1808.10245v1 (1).pdf
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdf
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Nlp research presentation

  • 1. NLP Research Papers -- Surya SG
  • 2. Today's Agenda • Trends of NLP Research Paper • Real Time Example of Transformer • Baseline and Overview of Transformers in NLP • Quick Code Tour at the Transformers library features. • Summary of the models • Summary of selective Research Papers
  • 4. Shifting away from huge labeled datasets • Unsupervised: • Yadav et al. propose a retrieval-based QA approach that iteratively refines the query to a KB to retrieve evidence for answering a certain question. Tamborrino et al. achieve impressive results on commonsense multiple choice tasks by computing a plausibility score for each answer candidate using a masked LM. • Data augmentation: • Fabbri et al. propose an approach to automatically generate (context, question, answer) triplets to train a QA model. They retrieve contexts that are similar to those in the original dataset, generate yes/no and templated WH questions for these contexts, and train the model on the synthetic triplets. Jacob Andreas proposes replacing rare phrases with a more frequent phrase that appears in similar contexts in order to improve compositional generalization in neural networks. Asai and Hajishirzi augment QA training data with synthetic examples that are logically derived from the original training data, to enforce symmetry and transitivity consistency. • Meta learning: • Yu et al. use meta learning to transfer knowledge for hypernymy detection from high-resource to low-resource languages. • Active learning: • Li et al. developed an efficient annotation framework for coreference resolution that selects the most valuable samples to annotate through active learning.
  • 5. Language models is not all you need — retrieval is back • Retrieval: • Two of the invited talks at the Repl4NLP workshop mentioned retrieval-augmented LMs. Kristina Toutanova talked about Google’s REALM, and about augmenting LMs with knowledge about entities (e.g. here, and here). Mike Lewis talked about the nearest neighbor LM that improves the prediction of factual knowledge, and Facebook’s RAG model that combines a generator with a retrieval component. • Using external KBs: • this has been commonly done for several years now. Guan et al. enhance GPT-2 with knowledge from commonsense KBs for commonsense tasks. Wu et al. used such KBs for dialogue generation. • Enhancing LMs with new abilities: • Zhou et al. trained a LM to capture temporal knowledge (e.g. on the frequency and duration of events) using training instances obtained through information extraction with patterns and SRL. Geva and Gupta inject numerical skills into BERT by fine-tuning it on numerical data generated using templates and textual data that requires reasoning over numbers.
  • 6. Explainable NLP • It seems that this year looking at attention weights has gone out of fashion and instead the focus is on generating textual rationales, preferably ones that are faithful — • i.e. reflect the discriminative model’s decision. Kumar and Talukdar predict faithful explanations for NLI by generating candidate explanations for each label, and using them to predict the label. Jain et al. develop a faithful explanation model that relies on post-hoc explanation methods (which are not necessarily faithful) and heuristics to generate training data. • To evaluate explanation models, Hase and Bansal propose to measure users’ ability to predict model behavior with and without a given explanation.
  • 7. Reflecting on current achievements, limitations, and thoughts about the future of NLP We are solving datasets, not tasks. There are inherent limitations in current models and data. We need to move away from classification tasks. We need to learn to handle ambiguity and uncertainty.
  • 8. Discussions about ethics (it’s complicated) • Who benefits from the system? • Who could be harmed by it? • Can users choose to opt out? • Does the system enforce or worsen systemic inequalities? • Is it generallybettering the world?
  • 9. 1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding • Original Abstract • We introduce a new language representationmodel called BERT, which standsfor Bidirectional Encoder Representationsfrom Transformers.Unlike recent language representationmodels, BERT is designed to pre-train deep bidirectional representationsby jointly conditioning on both left and right contextin all layers. As a result, the pre-trained BERT representationscan be fine-tuned with justone additional output layer to create state-of-the-artmodels for a wide range of tasks, such as question answering and language inference, without substantialtask-specific architecturemodifications. • BERT is conceptually simple and empirically powerful. It obtains new state-of-the-artresults on eleven naturallanguage processingtasks, including pushing the GLUEbenchmark to 80.4% (7.6% absolute improvement), MultiNLIaccuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2(1.5% absolute improvement), outperforming human performance by 2.0%.
  • 10. Summary • A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERTobtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.
  • 12. What’s the core idea of this paper? • Training a deep bidirectional model by randomly masking a percentage of input tokens – thus, avoiding cycles where words can indirectly “see themselves”. • Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERTto better understand relationships between sentences. • Training a very big model (24 Transformer blocks, 1024-hidden, 340Mparameters) with lots of data (3.3 billion word corpus).
  • 13. What’s the key achievement? • Advancing the state-of-the-art for 11 NLP tasks, including: • getting a GLUE score of 80.4%, which is 7.6% of absolute improvement from the previous best result; • achieving 93.2% accuracy on SQuAD 1.1 and outperforming human performance by 2%. • Suggesting a pre-trained model, which doesn’t require any substantial architecture modifications to be applied to specific NLP tasks.
  • 14. What does the AI community think? • BERT model marks a new era of NLP. • In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B comes after sentence A?” ) provide great results for many NLP tasks. • Pre-training of language models becomes a new standard. • What are future research areas? • Testing the method on a wider range of tasks. • Investigating the linguistic phenomena that may or may not be captured by BERT.
  • 15. What are possible business applications? • BERT may assist businesses with a wide range of NLP problems, including: • chatbots for better customerexperience; • analysis of customer reviews; • the search for relevant information, etc.
  • 16. 2. XLNet: Generalized Autoregressive Pretraining for Language Understanding • Original Abstract • With the capability of modeling bidirectional contexts,denoising autoencodingbased pretraininglike BERT achieves better performance than pretraining approachesbased on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. • In light of these pros and cons, we propose XLNet, a generalized autoregressive pretrainingmethod that (1) enables learning bidirectional contextsby maximizing the expected likelihood over all permutations of the factorizationorder and • (2) overcomes the limitations of BERT thanks to its autoregressive formulation.Furthermore, XLNet integratesideas from Transformer-XL,the state-of-the-artautoregressive model, into pretraining. • Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, naturallanguage inference, sentiment analysis, and document ranking.
  • 17. Summary • The researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. • XLNet is a generalized autoregressive pretraining method that leverages the best of both autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT) while avoiding their limitations. The experiments demonstrate that the new model outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on 18 NLP tasks.
  • 19. What’s the core idea of this paper? • XLNet combines the bidirectional capability of BERT with the autoregressive technologyof Transformer-XL: • Like BERT, XLNet uses a bidirectional context, which means it looks at the words before and after a given token to predict what it should be. To this end, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order. • As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the assumptionthat unmasked tokens are independent of each other. • To further improve architectural designs for pretraining, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL.
  • 20. What’s the key achievement? • XLnet outperforms BERT on 20 tasks,often by a large margin. • The new model achieves state-of-the-artperformance on 18 NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking. • What are future research areas? • Extending XLNet to new areas, such as computer vision and reinforcement learning.
  • 21. What does the AI community think? • The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence. • “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian Ruder, a research scientist at Deepmind. • “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
  • 22. What are possible business applications? XLNetmayassistbusinesses witha wide range of NLP problems,including: chatbotsfor first-line customersupportor answeringproductinquiries; sentimentanalysisfor gaugingbrandawarenessand perceptionbasedon customerreviewsandsocial media; the search forrelevant informationindocument basesor online,etc.
  • 23. 3. RoBERTa: A Robustly Optimized BERT Pretraining Approach • Original Abstract • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. • We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. • These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
  • 24. Summary • Natural languageprocessing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and fine-tuning parameters difficult. • In this study, Facebook AI and the University of Washingtonresearchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance. • Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sequence prediction training objective. The resulting optimized model, RoBERTa(Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
  • 25. What’s the core idea of this paper? • The Facebook AI research team found that BERT was significantly undertrained and suggested an improved recipe for its training, called RoBERTa: • More data: 160GB of text instead of the 16GB dataset originally used to train BERT. • Longer training: increasing the number of iterations from 100K to 300K and then further to 500K. • Larger batches: 8K instead of 256 in the original BERT base model. • Larger byte-level BPE vocabulary with 50K subword units instead of character-level BPE vocabulary of size 30K. • Removing the next sequence prediction objective from the training procedure. • Dynamically changing the masking pattern applied to the training data.
  • 26. What’s the key achievement? • RoBERTa outperforms BERT in all individual tasks on the General Language Understanding Evaluation (GLUE) benchmark. • The new model matches the recently introduced XLNet model on the GLUE benchmark and sets a new state of the art in four out of nine individual tasks. • What are future research areas? • Incorporating more sophisticated multi-taskfinetuning procedures.
  • 27. What are possible business applications? Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including dialogue systems, question answering, document classification, etc.
  • 28. 4. Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts • Original Abstract • Emotion cause extraction (ECE), the task aimed at extracting the potentialcauses behind certain emotionsin text, has gained much attentionin recent yearsdue to its wide applications.However, it suffers from two shortcomings: • 1) the emotion must be annotatedbefore cause extraction in ECE, which greatly limitsits applicationsin real-world scenarios; • 2) the way to first annotateemotion and then extract the cause ignores the fact that they are mutuallyindicative.In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potentialpairsof emotions and corresponding causes in a document. • We propose a 2-step approachto address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conduct emotion-cause pairing and filtering. • The experimentalresults on a benchmark emotion cause corpus prove the feasibilityof the ECPE task as well as the effectiveness of our approach.
  • 29. Summary • Emotion cause extraction (ECE) is an approach used in natural language processing to identify statements containing the causes behind vocabulary expressing emotion. However, ECE requires emotions to first be annotated and ignores mutual relationships between causes and emotional effects. The researchers sought to solve this problem by simultaneously identifying pairs of emotions and causes in a task they call emotion-cause pair extraction (ECPE). • ECPE uses a two-step approach: the first step uses two multi-task learning networks to identify emotion and cause clauses, while the second step pairs all causes and emotions, and uses a trained filter to eliminate pairings that do not contain a causal relationship. The resulting ECPE task is able to identify emotion-cause pairs at an accuracy on par with existing ECE methods but without requiring emotion annotation.
  • 31. What’s the core idea of this paper? • The paper introduces a new emotion-cause pair extraction (ECPE) task to overcome the limitations of the traditional ECE task, where emotion annotation is required prior to cause extraction and mutual indicativeness of emotion and cause is not taken into account. • The introduced approach consists of two steps: • In the first step, the two individual tasks of emotion extraction and cause extraction are performed via two kinds of multi-task learning networks: • Inter-EC that uses emotion extraction to improve cause extraction; • Inter-CE that leverages cause extraction to enhance emotion extraction. • In the second step, the model combines all elements of the two sets into pairs by applying a Cartesian product. Then, a logistic regression model is trained to eliminate pairs that do not contain a causal relationship.
  • 32. What’s the core idea of this paper?
  • 33. What’s the key achievement? • ECPE is able to achieve F1 scores of 0.83 for emotion extraction, 0.65 for cause extraction, and 0.61 for emotion-causepairing. • On the ECE benchmark dataset, ECPE performs on par with existing ECE methods that require emotion annotation before causal clauses can be identified. • What are future research areas? • Altering the ECPE approach from a two-stepto a one-step process that directly extracts emotion-cause pairs in an end-to-end fashion.
  • 34. What are possible business applications? Sentiment analysis for marketing campaigns. Opinion monitoring from social media.
  • 35. 5. CTRL: A Conditional Transformer Language Model For Controllable Generation • Original Abstract • Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. • Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. • This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://www.github.com/salesforce/ctrl.
• 36. Summary • Language models used for text generation are very powerful, but they are often “black boxes”, so users do not have much control over the output. • To address this problem, the Salesforce research team has introduced the Conditional Transformer Language (CTRL) model that conditions on a set of control codes. With these codes, the users can control domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. • Moreover, all control codes can be traced back to a specific subset of the training data, allowing CTRL to predict the subset of the training data most likely leveraged for a particular sequence. • This relationship between CTRL and its training data provides new possibilities for analyzing the correlations learned from each domain.
  • 37. What’s the core idea of this paper? • Text generation tools are very powerful, but they do not give users much control over the content, style or genre of the generated text. • The Salesforce research team has released CTRL, a 1.6 billion-parameter conditional transformer language model, that gives users more control over the generated content: • CTRL exposes keywords called control codes which allow users to specify a domain, style, topics, dates, entities, relationships between entities, plot points, and task- related behavior. • CTRL is trained on control codes derived from the structure that naturally co-occurs with the raw text. In particular, CTRL leverages the fact that training data is usually associated with a URL that contains information relevant to the text it represents.
  • 38. What’s the key achievement? • Introducing and open-sourcing a language model that: • enables more controllable text generation; • provides new opportunities for analyzing large amounts of text via model-based source attribution; • can be used to detect artificially generated text.
• 39. What are future research areas? • Introducing a greater variety of control codes to allow finer-grained control. • Extending to other areas of NLP including abstractive summarization and commonsense reasoning. • Analyzing the relationships between training data and language models. • Exploring the possibilities to make the interface between humans and language models more explicit and intuitive.
• 40. What are possible business applications? Improved and tailored text generation for question-answering systems and other human-computer interaction applications. Identifying artificially generated text, to detect malicious uses such as automatically generated essays or fake reviews.
  • 41. 6. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations • Original Abstract • Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. • To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. • We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
• 42. Summary • The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance. • Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. • In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence. • The experiments demonstrate that the best version of ALBERT sets new state-of-the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
  • 43. What’s the core idea of this paper? • It is not reasonable to further improve language models by making them larger because of memory limitations of available hardware, longer training times, and unexpected degradation of model performance with the increased number of parameters. • To address this problem, the researchers introduce the ALBERT architecture that incorporates two parameter-reduction techniques: • factorized embedding parameterization, where the size of the hidden layers is separated from the size of vocabulary embeddings by decomposing the large vocabulary-embedding matrix into two small matrices; • cross-layer parameter sharing to prevent the number of parameters from growing with the depth of the network. • The performance of ALBERT is further improved by introducing the self-supervised loss for sentence-order prediction to address BERT’s limitations with regard to inter-sentence coherence.
  • 44. What’s the key achievement? • With the introduced parameter-reduction techniques, the ALBERT configuration with 18× fewer parameters and 1.7× faster training compared to the original BERT-large model achieves only slightly worse performance. • The much larger ALBERT configuration, which still has fewer parameters than BERT-large, outperforms all of the current state-of-the-artlanguage modes by getting: • 89.4% accuracy on the RACE benchmark; • 89.4 score on the GLUE benchmark; and • An F1 score of 92.2 on the SQuAD 2.0 benchmark.
• 45. What are possible business applications? The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.
  • 46. 7. Explain Yourself! Leveraging Language Models for Commonsense Reasoning • Original Abstract • Deep learning models perform poorly on tasks that require commonsense reasoning, which often necessitates some form of world-knowledge or reasoning over information not immediately present in the input. • We collect human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations (CoS-E). We use CoS-E to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework. • CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We further study commonsense reasoning in DNNs using both human and auto-generated explanations including transfer to out-of-domain tasks. Empirical results indicate that we can effectively leverage language models for commonsense reasoning.
• 47. Summary • Natural language processing algorithms are limited to information contained in texts, and often these algorithms lack commonsense reasoning that allows them to make inferences as most humans do. • The Salesforce research team suggests addressing this problem by training the language model to automatically generate commonsense explanations. This task is accomplished by providing the model with human explanations alongside the question answering samples. • These autogenerated explanations are then used by a neural network to solve the CommonsenseQA (CQA) task. This two-step approach improved accuracy on the CommonsenseQA multiple-choice test by 10% compared to existing models.
  • 49. What’s the core idea of this paper? • Natural language processing struggles with inference based on common sense and real-world knowledge. • The paper suggests addressing this issue in two phases: • First, the researchers train the model to generate Common Sense Explanations (CoS-E) by providing human-generated explanations in the form of both open- ended sentences and highlighted span annotations, alongside Commonsense Question Answering (CQA) examples. • In the second phase, the authors use this trained language model to generate explanations for each sample in the training and validation sets. These Commonsense Auto-Generated Explanations (CAGE) are then leveraged to solve the CQA task.
  • 50. What’s the key achievement? • The explanation-generating model improves performance in a natural language reasoning test by 10% over the previous best model and improves understanding of how neural networks apply knowledge. • Moreover, the experiments demonstrate that the introduced approach can be successfullytransferred to out-of-domain datasets.
• 51. What are future research areas? Combining the explanation-generating model into an answer prediction model. Extending the dataset of explanations to other tasks to create a more general explanatory language model. Removing bias from training datasets to eliminate bias in generated explanations.
• 52. What are possible business applications? • The model with improved common-sense reasoning capabilities can be leveraged: • to provide better customer service via chatbots; • to improve the performance of information retrieval systems.
• 53. 8. Detecting Concealed Information in Text and Speech • Original Abstract • Motivated by infamous cheating scandals in various industries and political events, we address the problem of detecting concealed information in technical settings. • In this work, we explore acoustic-prosodic and linguistic indicators of information concealment by collecting a unique corpus of professionals practicing for oral exams while concealing information. • We reveal subtle signs of concealed information in speech and text, compare and contrast them with those in deception detection literature, thus uncovering the link between concealing information and deception. • We then present a series of experiments that automatically detect concealed information from text and speech. We compare the use of acoustic-prosodic, linguistic, and individual feature sets, using different machine learning models. Finally, we present a multi-task learning framework with acoustic, linguistic, and individual features, that outperforms human performance by over 15%.
• 54. Summary • When confidential information is leaked, it is often difficult to tell who originally obtained the leaked information and who it has been leaked to. Even though previous work has demonstrated that changes in voice tone, lexicon, and speech patterns can identify when someone is concealing information, research in this area is scarce, partly due to the lack of datasets that include ground-truth labels indicating information concealment. • To address this issue, the present study introduces a new dataset collected from a unique audio corpus of professional wine tasters practicing for oral exams while concealing information. By leveraging this dataset, the researcher was able to develop a new multi-task learning model for detecting concealed information that performs 11% better than baseline models and 15% better than humans.
  • 56. What’s the core idea of this paper? While there are machine learning-based methods for detecting when someone does not have information but pretends to, there are few comparable models for detecting when someone is concealing leaked information. In this study, Hu from Cornell University captured linguistic and acoustic-prosodic features from a controlled human experiment to create a dataset of speech patterns when people were speaking honestly and when they were concealing some information. The author leverages this dataset to develop a multi-task learning framework where, as well as identifying concealed information, the system is also predicting whether the speaker’s answer is correct and the identity of the wine.
  • 57. What’s the key achievement? A multi-task learning model outperformed baseline models by 11% and humans by 15% at detecting when someone is concealing information. Moreover, the introduced framework outperforms humans even in the case where some of the humans in the experiment knew one another and could read social cues (e.g. gestures) that are not available to the model.
• 58. What are future research areas? Studying individual differences in both detecting concealed information and concealing information. Exploring the predictive power of phonotactic variation features. Conducting domain adaptation with regards to detecting concealed information. Improving the scalability of the multi-task learning model.
• 59. What are possible business applications? Detecting insider trading in financial markets. Controlling data leaks within different testing procedures. Tracing and limiting the extent of information leaks around political campaigns.
• 60. 9. Improving Visual Question Answering by Referring to Generated Paragraph Captions • Original Abstract • Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. • Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). • Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). • Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.
• 61. Summary • Computer models struggle with answering questions about visual images, a task known as visual question answering (VQA). • In this study, the researchers sought to improve VQA performance by providing a VQA model with a text description of an image’s content produced by a paragraph captioning model. • The two models were fused over three stages to generate a consensus answer to questions posed about the image. • The resulting visual and textual question answering (VTQA) model was 1.92% more accurate than the standalone VQA model.
  • 63. What’s the core idea of this paper? • VQA models struggle with identifying all of the necessary informationin images, and particularly abstract concepts,required to answer questions. • The researchers suggest using a pre-trained paragraphcaptioningmodel to provide additional information to the VQA model. • The text and image input are fused at three levels: • in the early fuse stage, visual features are fused with paragraphcaptionand object property features by cross-attention; • in the late fuse stage,the inputs are fused again in the form of consensus,i.e. logits from each module are integratedinto one vector; • in the later fuse stage,the model accountsfor the fact that some regions of the image are more likely to draw people’s attention,and thus questions and answers are more likely to be related to those regions. So, the model gives an extra score to the answers related to the salient regions.
  • 64. What’s the key achievement? • Improving visual question answering performance by 1.92% compared to the baseline VQA model.
• 65. What are future research areas? Improving VTQA models to extract more information from textual captions, and enhancing paragraph captioning models to generate better captions. Training the VTQA model jointly with the paragraph captioning model.
• 66. What are possible business applications? • Improving image search and retrieval. • Image annotation and interactivity for blind people. • Creating “interactive” images for online education.
  • 67. 10. Thieves on Sesame Street! Model Extraction of BERT-based APIs • Original Abstract • We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that the adversary does not need any real training data to successfully mount the attack. • In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random sequences of words coupled with task-specific heuristics form effective queries for model extraction on a diverse set of NLP tasks, including natural language inference and question answering. • Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model. • Finally, we study two defense strategies against model extraction—membership classification and API watermarking—which while successful against naive adversaries, are ineffective against more sophisticated ones.
• 68. Summary • This paper highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model on SST2, SQuAD, MNLI, and BoolQ. On the SST2 task, the victim model had a 93.1% accuracy compared to their extracted model’s 90.1%. • They show that an adversary does not need any real training data to mount the attack successfully. The attacker does not even need to use grammatical or semantically meaningful queries. They used random sequences of words coupled with task-specific heuristics to form useful queries for model extraction on a diverse set of NLP tasks.
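A hedged sketch of the query-generation idea follows: random word sequences plus a simple task-specific heuristic for SQuAD-style QA. The word list, the templates, and the `query_victim` placeholder are hypothetical; the point is only that the attacker needs no real training data, just the victim API's outputs.

```python
# Hedged sketch of nonsense-query generation for model extraction (illustrative only).
import random

random.seed(0)
wordlist = ["apple", "quantum", "river", "although", "seven", "purple",
            "engine", "slowly", "market", "theory", "window", "ocean"]

def random_paragraph(n_words=40):
    """A random sequence of words standing in for a 'context' paragraph."""
    return " ".join(random.choices(wordlist, k=n_words))

def random_question(paragraph):
    """Task-specific heuristic for extractive QA: reuse words from the paragraph
    inside a WH-question template."""
    span = random.sample(paragraph.split(), k=3)
    return "What " + " ".join(span) + "?"

def query_victim(paragraph, question):
    # Hypothetical placeholder for the black-box API under attack.
    raise NotImplementedError("call the victim QA API here")

synthetic_data = []
for _ in range(5):
    p = random_paragraph()
    q = random_question(p)
    synthetic_data.append((p, q))   # each pair would be sent to query_victim(p, q)
    print(q)
```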
  • 69. Summary • Why It Matters: Outputs of modern NLP APIs on nonsensical text provide strong signals about model internals, allowing adversaries to train their own models and avoid paying for the API.
  • 70. What’s the core idea of this paper? • DEFENSES • MEMBERSHIPCLASSIFICATION • Our first defense uses membership inference, which is traditionally used to determinewhether a classifier was trained on a particular input point. • In our setting we use membership inference for “outlier detection”,where nonsensicaland ungrammaticalinputs (which are unlikely to be issued by a legitimate user) are identified • When such out-of-distributioninputs are detected, the API issues a random outputinstead of the model’s predicted output, which eliminates the extractionsignal. • WATERMARKING • in which a tiny fractionof queries are chosen at random and modified to return a wrong output. • These “watermarked queries” and their outputs are stored on the API side. Since deep neural networks have the ability to memorize arbitrary information,this defense anticipatesthat extractedmodels will memorize some of the watermarked queries, leaving them vulnerable to post-hoc detection if they are deployed publicly
  • 71. What’s the key achievement? • Our results show that fine-tuning large pretrained language models simplifies the process of extraction for an attacker. • Unfortunately, existing defenses against extraction, while effective in some scenarios, are generally inadequate, and further research is necessary to develop defenses robust in the face of adaptive adversaries who develop counter-attacksanticipating simple defenses.
  • 72. What are future research areas? • Other interesting future directions that follow from the results in this paper include • (1) leveraging nonsensical inputs to improve model distillation on tasks for which it is difficult to procure input data; • (2) diagnosing dataset complexity by using query efficiency as a proxy; and • (3) further investigation of the agreement between victim models as a method to identify proximity in input distribution and its incorporation into an active learning setup for model extraction.
• 73. What are possible business applications? • Protecting paid NLP APIs from model theft. • Informing decision analysis on API cost models for NLU and NLG services.
• 74. 11. WinoGrande: An Adversarial Winograd Schema Challenge at Scale • Original Abstract • The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. • To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. • Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
• 75. Summary The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande, a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm. The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
  • 77. What’s the core idea of this paper? • The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense reasoning. • They introduce WinoGrande, a new large-scale dataset for commonsense reasoning. Their approach has two key features: • A carefully designed crowdsourcing procedure: • Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers. • Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected questions, 53K were deemed valid. • A novel algorithm AfLite for systematic bias reduction: • It generalizes human-detectable biases based on word occurrences to machine-detectable biases based on embedding occurrences. • After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples.
  • 78. What’s the key achievement? • WinoGrande is easy for humans and challenging for machines: • Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%); • RoBERTa achieves 79.1%test-set accuracy; • whereas human performance achieves 94% accuracy. • WinoGrande is also an effective resource for transfer learning. The RoBERTa-based model fine-tuned on WinoGrande achieved a new state of the art on WSC and four other related datasets: • 90.1%on WSC; • 93.1%on DPR; • 90.6%on COPA; • 85.6%on KnowRef; and • 97.1%on Winogender.
• 79. What are future research areas? • Exploring new algorithmic approaches for systematic bias reduction. • Debiasing other NLP benchmarks.
• 80. 12. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer • Original Abstract • Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. • In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. • By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
• 81. Summary • The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. • Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
  • 83. What’s the core idea of this paper? • The paper has several important contributions: • Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques. • Introducing a new approach to transfer learning in NLP by suggesting to treat every NLP problem as a text-to- text task: • The mode understands which tasks should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”). • Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4). • Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
  • 84. What’s the key achievement? • The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out of 24 tasks considered, including: • the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks; • the Exact Match score of 90.06 on SQuAD dataset; • the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6)and very close to human performance (89.8); • the ROUGE-2-F score of 21.55 on CNN/Daily Mail abstractive summarizationtask.
• 85. What are future research areas? • Researching the methods to achieve stronger performance with cheaper models. • Exploring more efficient knowledge extraction techniques. • Further investigating the language-agnostic models.
• 86. What are possible business applications? • Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
  • 87. 13. Reformer: The Efficient Transformer • Original Abstract • Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. • For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. • Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
• 88. Summary • The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest • (1) using reversible layers to allow storing the activations only once instead of for each layer, and • (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
  • 89. Summary Locality-Sensitive Hashing Attention showing the hash-bucketing, sorting, and chunking steps, and the resulting causal attentions, together with the corresponding attention matrices (a–d)
  • 90. What’s the core idea of this paper? The leading Transformer models require huge computational resources because of the very high number of parameters and several other factors: • The activations of every layer need to be stored for back-propagation. • The intermediate feed-forward layers accountfor a large fractionof memory use since their depth is often much larger than the depth of attentionactivations. • The complexity of attentionon a sequence of length L is O(L^2). To address these problems, the research team introduces the Reformer model with the following improvements: • using reversiblelayersto store only a single copy of activations; • splittingactivations inside the feed-forward layers and processing them in chunks; • approximatingattentioncomputationbased on locality-sensitive hashing.
  • 91. What’s the key achievement? • By analyzing the introduced techniques one by one, the authors show that model accuracy is not sacrificed by: • switching to locality-sensitive hashing attention; • using reversible layers. • Reformer performs on par with the full Transformer model while demonstrating much higher speed and memory efficiency: • For example, on the newstest2014 taskfor machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani’s et al. (2017)BLEU score of 27.3.
• 92. What are possible business applications? • The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as: • text generation; • visual content generation; • music generation; • time-series forecasting.
  • 93. 14. Longformer: The Long-Document Transformer • Original Abstract • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. • In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.
  • 94. Summary • Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. • The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. • This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
  • 95. Summary Full self-attention pattern vs. Longformer’s configuration of attention patterns
  • 96. What’s the core idea of this paper? • The computational requirements of self-attention grow quadratically with sequence length, making it hard to process on current hardware. • To address this issue, the researchers present Longformer, a modified version of Transformer architecture that: • allows memory usage to scale linearly, and not quadratically, with the sequence length; • includes an attention mechanism that combines: • a windowed local-context self-attention to build contextual representations; • an end task motivated global attention to encode inductive bias about the task and build full sequence representation. • Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in the existing deep learning libraries like PyTorch and Tensorflow, the authors also introduce a custom CUDA kernel for implementing these attention operations.
  • 97. What’s the key achievement? • The Longformer model achieves a new state of the art on character-level language modeling tasks: • BPC of 1.10 on text8; (Bits Per Character) • BPC of 1.00 on enwik8. • After pre-training and fine-tuning for six tasks, including classification, question answering, and coreference resolution, the Longformer-base consistently outperformers the RoBERTa-base with: • accuracy of 75.0 vs. 72.4 on WikiHop; • F1 score of 75.2 vs. 74.2 on TriviaQA; • joint F1 score of 64.4 vs. 63.5 on HotpotQA; • average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task; • accuracy of 95.7 vs. 95.3 on the IMDB classification task; • F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task. • The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).
• 98. What are future research areas? Exploring other attention patterns that are more efficient due to dynamic adaptation to the input. Applying Longformer to other relevant long document tasks such as summarization.
  • 99. What are possible business applications? • The Longformer architecture can be very advantageous for the downstream NLP tasks that often require processing of long documents: • document classification; • question answering; • coreference resolution; • summarization; • semantic search.
• 100. 15. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators • Original Abstract • Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. • As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. • Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. • As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
• 101. Summary • The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%). • As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement. • As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
  • 103. What’s the core idea of this paper? • Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning. • Researchers propose a new pre-training task called replaced token detection, where: • some tokens are replaced by samples from a small generator network; • a model is pre-trained as a discriminator to distinguish between original and replaced tokens. • The introduced approach, called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately): • enables the model to learn from all input tokens instead of the small masked-out subset; • is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood.
  • 104. What’s the key achievement? • Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning. • Introducing a model that substantially outperforms state-of-the-art approaches while requiring less pre-training compute: • ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8. • An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute. • ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.
  • 105. What are possible business applications? Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
  • 107. Summary • The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. • Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. • They test their solution by training a 175B-parameter autoregressive language model, called GPT-3, and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
  • 109. What’s the core idea of this paper? Sparse Transformer
• 111. What does the AI community think? Public reactions came from Sam Altman (CEO and co-founder of OpenAI), Abubakar Abid (CEO and founder of Gradio), Gary Marcus (CEO and founder of Robust.ai), and Geoffrey Hinton (Turing Award winner).
  • 113. What are possible business applications?
• 114. 17. Beyond Accuracy: Behavioral Testing of NLP models with CheckList • Original Abstract • Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. • Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. • We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • 115. Summary • The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. • To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. • Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.
  • 117. What’s the core idea of this paper? Existing approaches to evaluation of NLP models have many significant shortcomings: • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs. • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness. To address this problem, the research team introduces CheckList,a new methodology for evaluating NLP models, inspired by the behavioral testing in software engineering: • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation. • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests in case of certain perturbations. • Potential tests are structured as a matrix, with capabilities as rows and test types as columns. The suggested implementation of CheckList also introducesa variety of abstractionsto help users generate large numbers of test cases easily.
  • 119. What does the AI community think? • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
• 120. What are possible business applications? • CheckList can be used to create more exhaustive testing for a variety of NLP tasks. • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
• 121. 18. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics • Original Abstract • Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. • We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. • Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. • Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
  • 122. Summary • The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when comparing only a few best systems. Even negative correlations were exhibited in some instances. • The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. • Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.
  • 124. What’s the core idea of this paper? • Automaticmetrics are used as a proxyforhuman translation evaluation,which is considerablymore expensiveand time- consuming. • However, evaluatinghowwell different automaticmetrics concur with human evaluationis not a straightforwardproblem: • For example, the recent findings show that if the correlation between leadingmetrics and human evaluations is computed usinga large set of translationsystems,it is typicallyvery high (i.e., 0.9). However, if onlya few best systems are considered, the correlation reduces markedlyand can even be negativein some cases. • The authors ofthis paper take a closer lookat this problem and discoverthat: • The identified problem with Pearson’s correlationis due to the small sample size and not specific to comparingstrongMT systems. • Outlier systems,whose qualityis much higher or lower than the rest of the systems,havea disproportionate effect on the computed correlationand shouldbe removed. • The same correlation coefficient can reflect different patterns oferrors.Thus,a better approach for gaininginsights into metric reliabilityis to visualize metricscores against human scores. • Small BLEU differences of 1-2 points correspondto true improvements in translationquality(as judged by humans)onlyin 50% of cases.
  • 125. What’s the key achievement? • Conducting a thorough analysis of automatic metrics performance metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems: • Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU and TER. • Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
• 126. 19. Towards a Human-like Open-Domain Chatbot
• 128. Summary • Example of Meena generating a response, “The Next Generation” (Google AI Blog)
  • 129. What’s the core idea of this paper? Evolved Transformer
• 131. What does the AI community think? Public reactions came from Elliot Turner (CEO and founder of Hyperia) and Graham Neubig (Associate Professor at Carnegie Mellon University).
  • 132. What are future research areas?
• 134. What are possible business applications? The authors suggest some interesting applications for open-domain chatbots such as Meena: further humanizing computer interactions; improving foreign language practice; making interactive movie and video game characters relatable.
• 135. 20. Recipes for Building an Open-Domain Chatbot • Original Abstract • Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. • Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. • We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. • Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
• 136. Summary • The Facebook AI Research team shows that with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. Thus, to build their state-of-the-art conversational agent, called BlenderBot, they leveraged a model with 9.4B parameters, trained it on a novel task called Blended Skill Talk, and deployed beam search with carefully selected hyperparameters as a generation strategy. • Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.
  • 138. What’s the core idea of this paper? • The introduced recipe for building a state-of-the-artopen-domain chatbotincludes three key ingredients: • Largescale. The largest model has 9.4 billion parametersand was trained on 1.5 billion training examples of extractedconversations. • Blendedskills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaginguse of knowledge, and display of empathy. • Beam search used for decoding. The researchers show that this generation strategy,deployed with carefully selected hyperparameters,gives strongresults. In particular,it was demonstratedthat the lengths of the agent’sutterancesis very important for chatbot performance (i.e, too short responses are often considered dull and too long responses make the chatbot appear to waffle and not listen).
  • 139. What’s the key achievement? The introduced chatbot outperforms the previous best-performing open- domain chatbot Meena. Thus, in pairwise match-ups,BlenderBot with 2.7B parameters wins: • 75% of the time in terms of engagingness; • 65% of the time in terms of humanness. In an A/B comparison between human-to-human and human-to- BlenderBot conversations, the latter were preferred 49% of the time as more engaging.