NLP Research Papers -- Surya SG
Today's Agenda
• Trends in NLP Research Papers
• Real-Time Example of a Transformer
• Baseline and Overview of Transformers in NLP
• Quick Code Tour of the Transformers Library Features
• Summary of the Models
• Summary of Selected Research Papers
Trends at
ACL 2020
Shifting away from huge labeled datasets
• Unsupervised:
• Yadav et al. propose a retrieval-based QA approach that iteratively refines the query to a KB to retrieve evidence for answering a
certain question. Tamborrino et al. achieve impressive results on commonsense multiple choice tasks by computing a plausibility
score for each answer candidate using a masked LM.
• Data augmentation:
• Fabbri et al. propose an approach to automatically generate (context, question, answer) triplets to train a QA model. They retrieve
contexts that are similar to those in the original dataset, generate yes/no and templated WH questions for these contexts, and train
the model on the synthetic triplets. Jacob Andreas proposes replacing rare phrases with a more frequent phrase that appears in
similar contexts in order to improve compositional generalization in neural networks. Asai and Hajishirzi augment QA training data
with synthetic examples that are logically derived from the original training data, to enforce symmetry and transitivity consistency.
• Meta learning:
• Yu et al. use meta learning to transfer knowledge for hypernymy detection from high-resource to low-resource languages.
• Active learning:
• Li et al. developed an efficient annotation framework for coreference resolution that selects the most valuable samples to annotate
through active learning.
Language models are not all you need — retrieval is back
• Retrieval:
• Two of the invited talks at the Repl4NLP workshop mentioned retrieval-augmented LMs. Kristina Toutanova talked
about Google’s REALM, and about augmenting LMs with knowledge about entities (e.g. here, and here). Mike Lewis
talked about the nearest neighbor LM that improves the prediction of factual knowledge, and Facebook’s RAG
model that combines a generator with a retrieval component.
• Using external KBs:
• This has been commonly done for several years now. Guan et al. enhance GPT-2 with knowledge from commonsense
KBs for commonsense tasks. Wu et al. used such KBs for dialogue generation.
• Enhancing LMs with new abilities:
• Zhou et al. trained a LM to capture temporal knowledge (e.g. on the frequency and duration of events) using training
instances obtained through information extraction with patterns and SRL. Geva and Gupta inject numerical skills into
BERT by fine-tuning it on numerical data generated using templates and textual data that requires reasoning over
numbers.
Explainable
NLP
• It seems that this year looking at attention weights has
gone out of fashion and instead the focus is on generating
textual rationales, preferably ones that are faithful —
• i.e. reflect the discriminative model’s decision. Kumar and
Talukdar predict faithful explanations for NLI by generating
candidate explanations for each label, and using them to
predict the label. Jain et al. develop a faithful explanation
model that relies on post-hoc explanation methods (which
are not necessarily faithful) and heuristics to generate
training data.
• To evaluate explanation models, Hase and Bansal propose
to measure users’ ability to predict model behavior with
and without a given explanation.
Reflecting on current achievements, limitations, and thoughts about the future of NLP
We are solving datasets, not tasks.
There are inherent limitations in
current models and data.
We need to move away from
classification tasks.
We need to learn to handle ambiguity
and uncertainty.
Discussions about ethics (it's complicated)
• Who benefits from the system?
• Who could be harmed by it?
• Can users choose to opt out?
• Does the system enforce or
worsen systemic inequalities?
• Is it generally bettering the world?
1. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
• Original Abstract
• We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
• BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
Summary
• A Google AI team presents a new cutting-edge model for Natural Language Processing
(NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design
allows the model to consider the context from both the left and the right sides of each
word. While being conceptually simple, BERT obtains new state-of-the-art results on
eleven NLP tasks, including question answering, named entity recognition and other tasks
related to general language understanding.
What’s the core idea of this paper?
• Training a deep bidirectional model by randomly masking a percentage of input tokens –
thus, avoiding cycles where words can indirectly “see themselves”.
• Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
• Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
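The masked-LM idea can be demonstrated in a few lines with the Hugging Face Transformers library; this is a minimal illustrative sketch using the public bert-base-uncased checkpoint, not the paper's pre-training code.

```python
# Minimal sketch of BERT's "fill in the blank" objective at inference time,
# using the public bert-base-uncased checkpoint via the Transformers pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidate tokens for the masked position using both left and right context.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```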
What’s the key achievement?
• Advancing the state-of-the-art for 11 NLP tasks, including:
• getting a GLUE score of 80.4%, which is a 7.6% absolute improvement over the previous best result;
• achieving an F1 score of 93.2 on SQuAD v1.1 and outperforming human performance by 2%.
• Suggesting a pre-trained model, which doesn’t require any substantial
architecture modifications to be applied to specific NLP tasks.
What does the AI community think?
• BERT model marks a new era of NLP.
• In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B come after sentence A?”) provide great results for many NLP tasks.
• Pre-training of language models becomes a new standard.
• What are future research areas?
• Testing the method on a wider range of tasks.
• Investigating the linguistic phenomena that may or may not be captured by
BERT.
What are possible business applications?
• BERT may assist businesses with a wide range of NLP problems, including:
• chatbots for better customer experience;
• analysis of customer reviews;
• the search for relevant information, etc.
2. XLNet: Generalized Autoregressive
Pretraining for Language Understanding
• Original Abstract
• With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
• In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and
• (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining.
• Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
Summary
• The researchers from Carnegie Mellon University and Google have developed a new
model, XLNet, for natural language processing (NLP) tasks such as reading comprehension,
text classification, sentiment analysis, and others.
• XLNet is a generalized autoregressive pretraining method that leverages the best of both
autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT)
while avoiding their limitations. The experiments demonstrate that the new model
outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on
18 NLP tasks.
What’s the core idea of this paper?
• XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL:
• Like BERT, XLNet uses a bidirectional context, which means it looks at the words before
and after a given token to predict what it should be. To this end, XLNet maximizes the
expected log-likelihood of a sequence with respect to all possible permutations of the
factorization order.
• As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus
avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the
assumption that unmasked tokens are independent of each other.
• To further improve architectural designs for pretraining, XLNet integrates the segment
recurrence mechanism and relative encoding scheme of Transformer-XL.
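As a rough illustration of conditioning on a chosen factorization order, the sketch below uses the public xlnet-base-cased checkpoint from the Hugging Face Transformers library with a permutation mask; it shows a single factorization order at inference time, not the paper's pretraining procedure.

```python
# Sketch: ask XLNet to predict one target position while conditioning on all
# other positions (one factorization order; pretraining averages over many).
import torch
from transformers import XLNetTokenizer, XLNetLMHeadModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

input_ids = tokenizer("The capital of France is Paris", return_tensors="pt").input_ids
seq_len = input_ids.shape[1]
target = seq_len - 3  # position of "Paris"; the last two ids are <sep> and <cls>

perm_mask = torch.zeros(1, seq_len, seq_len)
perm_mask[:, :, target] = 1.0          # no position (including the target) may see the target token
target_mapping = torch.zeros(1, 1, seq_len)
target_mapping[0, 0, target] = 1.0     # request a prediction only at the target position

with torch.no_grad():
    logits = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping).logits
print(tokenizer.decode([logits[0, 0].argmax().item()]))  # the model's guess for the hidden word
```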
What’s the key achievement?
• XLNet outperforms BERT on 20 tasks, often by a large margin.
• The new model achieves state-of-the-art performance on 18 NLP tasks
including question answering, natural language inference, sentiment
analysis, and document ranking.
• What are future research areas?
• Extending XLNet to new areas, such as computer vision and
reinforcement learning.
What does the AI community think?
• The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in
artificial intelligence.
• “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a
new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian
Ruder, a research scientist at DeepMind.
• “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the
latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
What are possible business applications?
XLNet may assist businesses with a wide range of NLP problems, including:
• chatbots for first-line customer support or answering product inquiries;
• sentiment analysis for gauging brand awareness and perception based on customer reviews and social media;
• the search for relevant information in document bases or online, etc.
3. RoBERTa: A Robustly Optimized BERT
Pretraining Approach
• Original Abstract
• Language model pretraining has led to significant performance gains but careful comparison
between different approaches is challenging. Training is computationally expensive, often
done on private datasets of different sizes, and, as we will show, hyperparameter choices have
significant impact on the final results.
• We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures
the impact of many key hyperparameters and training data size. We find that BERT was
significantly undertrained, and can match or exceed the performance of every model
published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
• These results highlight the importance of previously overlooked design choices, and raise
questions about the source of recently reported improvements. We release our models and
code.
Summary
• Natural language processing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and fine-tuning parameters difficult.
• In this study, Facebook AI and University of Washington researchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance.
• Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sentence prediction training objective. The resulting optimized model, RoBERTa (Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
What’s the core idea of this paper?
• The Facebook AI research team found that BERT was significantly undertrained and
suggested an improved recipe for its training, called RoBERTa:
• More data: 160GB of text instead of the 16GB dataset originally used to train BERT.
• Longer training: increasing the number of iterations from 100K to 300K and then
further to 500K.
• Larger batches: 8K instead of 256 in the original BERT base model.
• Larger byte-level BPE vocabulary with 50K subword units instead of character-level
BPE vocabulary of size 30K.
• Removing the next sentence prediction objective from the training procedure.
• Dynamically changing the masking pattern applied to the training data.
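Dynamic masking can be approximated with the Transformers library's data collator, which re-samples the masked positions every time a batch is built; this is an illustrative sketch, not RoBERTa's actual training pipeline.

```python
# Sketch of dynamic masking in the spirit of RoBERTa: the same sentence gets a
# freshly sampled mask each time it is collated into a batch.
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

examples = [tokenizer("Dynamic masking changes the masked positions on every pass over the data.")]
batch_1 = collator(examples)  # masked positions sampled on the fly
batch_2 = collator(examples)  # usually a different mask for the same sentence
print((batch_1["input_ids"] != batch_2["input_ids"]).any().item())
```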
What’s the key achievement?
• RoBERTa outperforms BERT in all individual tasks on the General Language
Understanding Evaluation (GLUE) benchmark.
• The new model matches the recently introduced XLNet model on the GLUE
benchmark and sets a new state of the art in four out of nine individual
tasks.
• What are future research areas?
• Incorporating more sophisticated multi-taskfinetuning procedures.
What are possible business applications?
Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including:
• dialogue systems;
• question answering;
• document classification, etc.
4. Emotion-Cause Pair Extraction: A New Task
to Emotion Analysis in Texts
• Original Abstract
• Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings:
• 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios;
• 2) the way to first annotate emotion and then extract the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document.
• We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conduct emotion-cause pairing and filtering.
• The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.
Summary
• Emotion cause extraction (ECE) is an approach used in natural language processing to
identify statements containing the causes behind vocabulary expressing emotion.
However, ECE requires emotions to first be annotated and ignores mutual relationships
between causes and emotional effects. The researchers sought to solve this problem by
simultaneously identifying pairs of emotions and causes in a task they call emotion-cause
pair extraction (ECPE).
• ECPE uses a two-step approach: the first step uses two multi-task learning networks to
identify emotion and cause clauses, while the second step pairs all causes and emotions,
and uses a trained filter to eliminate pairings that do not contain a causal relationship.
The resulting ECPE task is able to identify emotion-cause pairs at an accuracy on par with
existing ECE methods but without requiring emotion annotation.
Summary
What’s the core idea of this paper?
• The paper introduces a new emotion-cause pair extraction (ECPE) task to overcome the
limitations of the traditional ECE task, where emotion annotation is required prior to cause
extraction and mutual indicativeness of emotion and cause is not taken into account.
• The introduced approach consists of two steps:
• In the first step, the two individual tasks of emotion extraction and cause extraction are
performed via two kinds of multi-task learning networks:
• Inter-EC that uses emotion extraction to improve cause extraction;
• Inter-CE that leverages cause extraction to enhance emotion extraction.
• In the second step, the model combines all elements of the two sets into pairs by
applying a Cartesian product. Then, a logistic regression model is trained to eliminate
pairs that do not contain a causal relationship.
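A toy sketch of the second step is shown below; the pair features, scores, and training data are invented for illustration only, while the paper learns its filter from annotated emotion-cause pairs.

```python
# Toy sketch of ECPE step 2: Cartesian-product pairing of detected emotion and
# cause clauses, followed by a logistic-regression filter over pair features.
from itertools import product
from sklearn.linear_model import LogisticRegression

def featurize(e, c):
    # Hypothetical pair features: clause distance and the two detection scores.
    return [abs(e["idx"] - c["idx"]), e["score"], c["score"]]

# Made-up training pairs for the filter (1 = causally related, 0 = not).
X_train = [[0, 0.9, 0.8], [1, 0.8, 0.7], [5, 0.4, 0.3], [7, 0.2, 0.1]]
y_train = [1, 1, 0, 0]
clf = LogisticRegression().fit(X_train, y_train)

emotions = [{"idx": 2, "score": 0.95}]                           # from step 1
causes = [{"idx": 1, "score": 0.85}, {"idx": 8, "score": 0.30}]  # from step 1
pairs = [(e, c) for e, c in product(emotions, causes)
         if clf.predict_proba([featurize(e, c)])[0, 1] >= 0.5]
print(len(pairs))  # candidate pairs the filter keeps as emotion-cause pairs
```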
What’s the key achievement?
• ECPE is able to achieve F1 scores of 0.83 for emotion extraction, 0.65 for cause extraction, and 0.61 for emotion-cause pairing.
• On the ECE benchmark dataset, ECPE performs on par with existing ECE
methods that require emotion annotation before causal clauses can be
identified.
• What are future research areas?
• Altering the ECPE approach from a two-stepto a one-step process that
directly extracts emotion-cause pairs in an end-to-end fashion.
What are possible business applications?
• Sentiment analysis for marketing campaigns.
• Opinion monitoring from social media.
5. CTRL: A Conditional Transformer Language
Model For Controllable Generation
• Original Abstract
• Large-scale language models show promising text generation capabilities, but users
cannot easily control particular aspects of the generated text. We release CTRL, a 1.6
billion-parameter conditional transformer language model, trained to condition on
control codes that govern style, content, and task-specific behavior.
• Control codes were derived from structure that naturally co-occurs with raw text,
preserving the advantages of unsupervised learning while providing more explicit control
over text generation. These codes also allow CTRL to predict which parts of the training
data are most likely given a sequence.
• This provides a potential method for analyzing large amounts of data via model-based
source attribution. We have released multiple full-sized, pretrained versions of CTRL
at https://www.github.com/salesforce/ctrl.
Summary
• Language models used for text generation are very powerful, but they are often “black
boxes”, so users do not have much control over the output.
• To address this problem, the Salesforce research team has introduced the Conditional Transformer Language (CTRL) model that conditions on a set of control codes. With these codes, the users can control domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
• Moreover, all control codes can be traced back to a specific subset of the training data, allowing CTRL to predict the subset of the training data most likely leveraged for a particular sequence.
• This relationship between CTRL and its training data provides new possibilities for
analyzing the correlations learned from each domain.
What’s the core idea of this paper?
• Text generation tools are very powerful, but they do not give users much control over the
content, style or genre of the generated text.
• The Salesforce research team has released CTRL, a 1.6 billion-parameter conditional
transformer language model, that gives users more control over the generated content:
• CTRL exposes keywords called control codes which allow users to specify a domain,
style, topics, dates, entities, relationships between entities, plot points, and task-
related behavior.
• CTRL is trained on control codes derived from the structure that naturally co-occurs
with the raw text. In particular, CTRL leverages the fact that training data is usually
associated with a URL that contains information relevant to the text it represents.
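The control-code interface looks roughly like the sketch below when loading the released checkpoint through the Hugging Face Transformers library (the 1.6B-parameter model is several gigabytes, so this is illustrative rather than something to run casually); the prompt follows the "Reviews Rating:" style of control code shown in the paper.

```python
# Sketch: steering CTRL's generation with a control-code prefix.
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

prompt = "Reviews Rating: 5.0 I bought these headphones last week"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0]))  # review-style continuation conditioned on the code
```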
What’s the key achievement?
• Introducing and open-sourcing a language model that:
• enables more controllable text generation;
• provides new opportunities for analyzing large amounts of text via
model-based source attribution;
• can be used to detect artificially generated text.
What are future research areas?
• Introducing a greater variety of control codes to allow finer-grained control.
• Extending to other areas of NLP including abstractive summarization and commonsense reasoning.
• Analyzing the relationships between training data and language models.
• Exploring the possibilities to make the interface between humans and language models
more explicit and intuitive.
What are possible business applications?
• Improved and tailored text generation for question-answering systems and other human-computer interaction applications.
• Identifying artificially generated text, to detect malicious uses such as automatically generated essays or fake reviews.
6. ALBERT: A Lite BERT for Self-supervised
Learning of Language Representations
• Original Abstract
• Increasing model size when pretraining natural language representations often results in
improved performance on downstream tasks. However, at some point further model increases
become harder due to GPU/TPU memory limitations, longer training times, and unexpected
model degradation.
• To address these problems, we present two parameter-reduction techniques to lower memory
consumption and increase the training speed of BERT. Comprehensive empirical evidence
shows that our proposed methods lead to models that scale much better compared to the
original BERT.
• We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and
show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best
model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks
while having fewer parameters compared to BERT-large.
Summary
• The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance.
• Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing.
• In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence.
• The experiments demonstrate that the best version of ALBERT sets new state-of-
the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer
parameters than BERT-large.
What’s the core idea of this paper?
• It is not reasonable to further improve language models by making them larger because of
memory limitations of available hardware, longer training times, and unexpected degradation of
model performance with the increased number of parameters.
• To address this problem, the researchers introduce the ALBERT architecture that incorporates
two parameter-reduction techniques:
• factorized embedding parameterization, where the size of the hidden layers is separated
from the size of vocabulary embeddings by decomposing the large vocabulary-embedding
matrix into two small matrices;
• cross-layer parameter sharing to prevent the number of parameters from growing with the
depth of the network.
• The performance of ALBERT is further improved by introducing the self-supervised loss
for sentence-order prediction to address BERT’s limitations with regard to inter-sentence
coherence.
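The factorized embedding idea can be illustrated in a few lines of PyTorch; this is a conceptual sketch with assumed sizes (V=30000, E=128, H=4096), not the official implementation.

```python
# Sketch of factorized embedding parameterization: a V x E table plus an E x H
# projection replaces the full V x H embedding matrix (with E << H).
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=4096):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_dim)  # V x E
        self.projection = nn.Linear(embed_dim, hidden_dim)          # E x H

    def forward(self, token_ids):
        return self.projection(self.word_embeddings(token_ids))

factorized = FactorizedEmbedding()
full = nn.Embedding(30000, 4096)  # the unfactorized V x H alternative
print(sum(p.numel() for p in factorized.parameters()))  # ~4.4M parameters
print(sum(p.numel() for p in full.parameters()))        # ~122.9M parameters
```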
What’s the key achievement?
• With the introduced parameter-reduction techniques, the ALBERT
configuration with 18× fewer parameters and 1.7× faster training compared
to the original BERT-large model achieves only slightly worse performance.
• The much larger ALBERT configuration, which still has fewer parameters
than BERT-large, outperforms all of the current state-of-the-art language models by getting:
• 89.4% accuracy on the RACE benchmark;
• 89.4 score on the GLUE benchmark; and
• An F1 score of 92.2 on the SQuAD 2.0 benchmark.
What are possible business applications?
The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.
7. Explain Yourself! Leveraging Language
Models for Commonsense Reasoning
• Original Abstract
• Deep learning models perform poorly on tasks that require commonsense reasoning, which
often necessitates some form of world-knowledge or reasoning over information not
immediately present in the input.
• We collect human explanations for commonsense reasoning in the form of natural language
sequences and highlighted annotations in a new dataset called Common Sense Explanations
(CoS-E). We use CoS-E to train language models to automatically generate explanations that
can be used during training and inference in a novel Commonsense Auto-Generated
Explanation (CAGE) framework.
• CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We
further study commonsense reasoning in DNNs using both human and auto-generated
explanations including transfer to out-of-domain tasks. Empirical results indicate that we can
effectively leverage language models for commonsense reasoning.
Summary
• Natural language processing algorithms are limited to information contained in texts, and often these algorithms lack commonsense reasoning that allows them to make inferences as most humans do.
• The Salesforce research team suggests addressing this problem by training the language model to automatically generate commonsense explanations. This task is accomplished by providing the model with human explanations alongside the question answering samples.
• These autogenerated explanations are then used by a neural network to solve the CommonsenseQA (CQA) task. This two-step approach improved accuracy on the CommonsenseQA multiple-choice test by 10% compared to existing models.
What’s the core idea of this paper?
• Natural language processing struggles with inference based on common sense and
real-world knowledge.
• The paper suggests addressing this issue in two phases:
• First, the researchers train the model to generate Common Sense Explanations
(CoS-E) by providing human-generated explanations in the form of both open-
ended sentences and highlighted span annotations, alongside Commonsense
Question Answering (CQA) examples.
• In the second phase, the authors use this trained language model to generate
explanations for each sample in the training and validation sets. These
Commonsense Auto-Generated Explanations (CAGE) are then leveraged to
solve the CQA task.
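A rough sketch of the two phases is shown below; it uses plain GPT-2 as a stand-in for the fine-tuned explanation language model and an invented prompt and classifier input, so it only illustrates the data flow, not the paper's models or prompt format.

```python
# Conceptual sketch of the two phases: (1) generate an explanation with a
# language model, (2) feed question + explanation to the answer classifier.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")   # stand-in for the fine-tuned explanation LM
lm = GPT2LMHeadModel.from_pretrained("gpt2")

question = "Where would you see a concert and find a ticket booth?"
choices = ["clothing store", "auditorium", "classroom"]
prompt = f"{question} The choices are {', '.join(choices)}. A commonsense explanation is that"

ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=20, do_sample=False, pad_token_id=tok.eos_token_id)
explanation = tok.decode(out[0][ids.shape[1]:])

# Phase 2: the generated explanation is appended to the classifier's input.
classifier_input = f"{question} [SEP] {explanation.strip()}"
print(classifier_input)
```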
What’s the key achievement?
• The explanation-generating model improves performance in a natural
language reasoning test by 10% over the previous best model and improves
understanding of how neural networks apply knowledge.
• Moreover, the experiments demonstrate that the introduced approach can
be successfullytransferred to out-of-domain datasets.
What are future research areas?
• Combining the explanation-generating model into an answer prediction model.
• Extending the dataset of explanations to other tasks to create a more general explanatory language model.
• Removing bias from training datasets to eliminate bias in generated explanations.
What are possible business applications?
• The model with improved common-sense reasoning capabilities can be leveraged:
• to provide better customer service via chatbots;
• to improve the performance of information retrieval systems.
8. Detecting Concealed Information in Text
and Speech
• Original Abstract
• Motivated by infamous cheating scandals in various industries and political events, we address the problem of detecting concealed information in technical settings.
• In this work, we explore acoustic-prosodic and linguistic indicators of information concealment by collecting a unique corpus of professionals practicing for oral exams while concealing information.
• We reveal subtle signs of concealed information in speech and text, compare, and contrast them with those in deception detection literature, thus uncovering the link between concealing information and deception.
• We then present a series of experiments that automatically detect concealed information from text and speech. We compare the use of acoustic-prosodic, linguistic, and individual feature sets, using different machine learning models. Finally, we present a multi-task learning framework with acoustic, linguistic, and individual features, that outperforms human performance by over 15%.
Summary
• When confidential information is leaked, it is often difficult to tell who originally
obtained the leaked information and who it has been leaked to. Even though previous
work has demonstrated that changes in voice tone, lexicon, and speech patterns can
identify when someone is concealing information, research in this subject area is very
scarce. It is partly due to the lack of datasets that include ground truth labels indicating
information concealment.
• To address this issue, the present study introduces a new dataset collected from a
unique audio corpus of professional wine tasters practicing for oral exams while
concealing information. By leveraging this dataset, the researcher was able to develop a
new multi-task learning model for detecting concealed information that performs 11%
better than baseline models and 15% better than humans.
What’s the core idea of this paper?
While there are machine learning-based methods for detecting when someone does not
have information but pretends to, there are few comparable models for detecting when
someone is concealing leaked information.
In this study, Hu from Cornell University captured linguistic and acoustic-prosodic features
from a controlled human experiment to create a dataset of speech patterns when people
were speaking honestly and when they were concealing some information.
The author leverages this dataset to develop a multi-task learning framework where, as
well as identifying concealed information, the system is also predicting whether the
speaker’s answer is correct and the identity of the wine.
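A schematic PyTorch sketch of such a multi-task setup is shown below; the feature sizes, encoder, and heads are assumptions for illustration and do not reproduce the author's model.

```python
# Sketch: a shared encoder over acoustic-prosodic + linguistic features with
# three task heads (concealment, answer correctness, wine identity).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskConcealmentModel(nn.Module):
    def __init__(self, n_features=64, hidden=128, n_wines=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.concealment_head = nn.Linear(hidden, 2)     # concealing vs. not
        self.correctness_head = nn.Linear(hidden, 2)     # answer correct vs. not
        self.identity_head = nn.Linear(hidden, n_wines)  # which wine is described

    def forward(self, features):
        h = self.encoder(features)
        return self.concealment_head(h), self.correctness_head(h), self.identity_head(h)

model = MultiTaskConcealmentModel()
batch = torch.randn(4, 64)                        # 4 utterances' feature vectors
targets = [torch.zeros(4, dtype=torch.long)] * 3  # dummy labels for the three tasks
loss = sum(F.cross_entropy(logits, t) for logits, t in zip(model(batch), targets))
loss.backward()  # the shared encoder receives gradients from all three tasks
```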
What’s the key achievement?
A multi-task learning model outperformed baseline models by 11%
and humans by 15% at detecting when someone is concealing
information.
Moreover, the introduced framework outperforms humans even in
the case where some of the humans in the experiment knew one
another and could read social cues (e.g. gestures) that are not
available to the model.
What are future research areas?
• Studying individual differences in both detecting concealed information and concealing information.
• Exploring the predictive power of phonotactic variation features.
• Conducting domain adaptation with regards to detecting concealed information.
• Improving the scalability of the multi-task learning model.
What are possible business applications?
• Detecting insider trading in financial markets.
• Controlling data leaks within different testing procedures.
• Tracing and limiting the extent of information leaks around political campaigns.
9. Improving Visual Question Answering by
Referring to Generated Paragraph Captions
• Original Abstract
• Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering.
• Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match).
• Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion).
• Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.
Summary
• Computer models struggle with answering questions about visual images, a
task known as visual question answering (VQA).
• In this study, the researchers sought to improve VQA performance by
providing a VQA model with a text description of an image’s content
produced by a paragraph captioning model.
• The two models were fused over three stages to generate a consensus
answer to questions posed about the image.
• The resulting visual and textual question answering (VTQA) model was 1.92% more accurate than the standalone VQA model.
What’s the core idea of this paper?
• VQA models struggle with identifying all of the necessary information in images, and particularly the abstract concepts, required to answer questions.
• The researchers suggest using a pre-trained paragraph captioning model to provide additional information to the VQA model.
• The text and image input are fused at three levels:
• in the early fuse stage, visual features are fused with paragraph caption and object property features by cross-attention;
• in the late fuse stage, the inputs are fused again in the form of consensus, i.e. logits from each module are integrated into one vector;
• in the later fuse stage, the model accounts for the fact that some regions of the image are more likely to draw people’s attention, and thus questions and answers are more likely to be related to those regions. So, the model gives an extra score to the answers related to the salient regions.
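The late and later fusion steps reduce to simple score arithmetic, roughly as in the sketch below; the answer-vocabulary size, logits, and salient-answer indices are invented for illustration and do not come from the paper.

```python
# Sketch of late fusion (consensus over module logits) followed by the extra
# score for answers tied to salient image regions.
import torch

visual_logits = torch.randn(1, 1000)   # from the image-based module (1000 answer classes)
textual_logits = torch.randn(1, 1000)  # from the paragraph-caption module

consensus = visual_logits + textual_logits  # late fusion: combine module logits
salience_bonus = torch.zeros(1, 1000)
salience_bonus[0, [3, 17]] = 1.0            # answers linked to salient regions (invented indices)
final_scores = consensus + salience_bonus   # "later fusion": extra score before selection
print(final_scores.argmax(dim=-1).item())
```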
What’s the key achievement?
• Improving visual question answering performance by 1.92% compared to the baseline
VQA model.
What are future research areas?
• Improving VTQA models to extract more information from textual captions, and enhancing paragraph captioning models to generate better captions.
• Training the VTQA model jointly with the paragraph captioning model.
What are possible business applications?
• Improving image search and retrieval.
• Image annotation and interactivity for blind people.
• Creating “interactive” images for online education.
10. Thieves on Sesame Street! Model
Extraction of BERT-based APIs
• Original Abstract
• We study the problem of model extraction in natural language processing, in which an adversary with only query
access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary
and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that
the adversary does not need any real training data to successfully mount the attack.
• In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random
sequences of words coupled with task-specific heuristics form effective queries for model extraction on a
diverse set of NLP tasks, including natural language inference and question answering.
• Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the
NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only
slightly worse than the victim model.
• Finally, we study two defense strategies against model extraction—membership classification and API
watermarking—which while successful against naive adversaries, are ineffective against more sophisticated
ones.
Summary
• This paper highlights an exploit only made feasible by the shift towards transfer learning
methods within the NLP community: for a query budget of a few hundred dollars, an
attacker can extract a model that performs only slightly worse than the victim model on
SST2, SQuAD, MNLI, and BoolQ. On the SST2 task, the victim model had a 93.1% accuracy compared to their extracted model’s 90.1%.
• They show that an adversary does not need any real training data to mount the attack
successfully. The attacker does not even need to use grammatical or semantically
meaningful queries. They used random sequences of words coupled with task-specific
heuristics to form useful queries for model extraction on a diverse set of NLP tasks.
Summary
• Why It Matters: Outputs of modern NLP APIs on nonsensical text
provide strong signals about model internals, allowing adversaries to
train their own models and avoid paying for the API.
What’s the core idea of this paper?
• DEFENSES
• MEMBERSHIP CLASSIFICATION
• Our first defense uses membership inference, which is traditionally used to determine whether a classifier was trained on a particular input point.
• In our setting we use membership inference for “outlier detection”, where nonsensical and ungrammatical inputs (which are unlikely to be issued by a legitimate user) are identified.
• When such out-of-distribution inputs are detected, the API issues a random output instead of the model’s predicted output, which eliminates the extraction signal.
• WATERMARKING
• in which a tiny fraction of queries are chosen at random and modified to return a wrong output.
• These “watermarked queries” and their outputs are stored on the API side. Since deep neural networks have the ability to memorize arbitrary information, this defense anticipates that extracted models will memorize some of the watermarked queries, leaving them vulnerable to post-hoc detection if they are deployed publicly.
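The watermarking defense can be sketched as below; this is an illustrative reimplementation of the idea as described, not the authors' code, and `true_predict` / `suspect_predict` are hypothetical callables standing in for the victim API and a suspected extracted model.

```python
# Sketch: return a deliberately wrong label for a tiny fraction of queries and
# log them; an extracted model that memorized these outputs can be detected later.
import random

WATERMARK_RATE = 0.001
watermark_log = []  # (query, wrong_label) pairs kept on the API side

def answer_with_watermark(query, true_predict, num_labels):
    label = true_predict(query)
    if random.random() < WATERMARK_RATE:
        wrong = random.choice([l for l in range(num_labels) if l != label])
        watermark_log.append((query, wrong))
        return wrong
    return label

def watermark_hit_rate(suspect_predict):
    # A suspiciously high hit rate suggests the suspect model was extracted from the API.
    hits = sum(suspect_predict(q) == y for q, y in watermark_log)
    return hits / max(len(watermark_log), 1)
```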
What’s the key achievement?
• Our results show that fine-tuning large pretrained language models
simplifies the process of extraction for an attacker.
• Unfortunately, existing defenses against extraction, while effective in some
scenarios, are generally inadequate, and further research is necessary to
develop defenses robust in the face of adaptive adversaries who develop counter-attacks anticipating simple defenses.
What are future research areas?
• Other interesting future directions that follow from the results in this paper
include
• (1) leveraging nonsensical inputs to improve model distillation on tasks for
which it is difficult to procure input data;
• (2) diagnosing dataset complexity by using query efficiency as a proxy; and
• (3) further investigation of the agreement between victim models as a
method to identify proximity in input distribution and its incorporation into
an active learning setup for model extraction.
What are possible business applications?
• Protecting paid APIs from possible model theft.
• Decision analysis on API cost models for NLU and NLG.
11. WinoGrande: An Adversarial Winograd
Schema Challenge at Scale
• Original Abstract
• The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of
273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional
preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on
variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or
whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
• To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but
adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully
designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-
detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve
59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed.
• Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef
(85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande
when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true
capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in
existing and future benchmarks to mitigate such overestimation.
Summary
The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande, a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm.
The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
What’s the core idea of this paper?
• The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and
annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense
reasoning.
• They introduce WinoGrande, a new large-scale dataset for commonsense reasoning. Their approach has two key
features:
• A carefully designed crowdsourcing procedure:
• Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain
certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers.
• Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected
questions, 53K were deemed valid.
• A novel algorithm AfLite for systematic bias reduction:
• It generalizes human-detectable biases based on word occurrences to machine-detectable biases
based on embedding occurrences.
• After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples.
What’s the key achievement?
• WinoGrande is easy for humans and challenging for machines:
• Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%);
• RoBERTa achieves 79.1% test-set accuracy;
• whereas human performance achieves 94% accuracy.
• WinoGrande is also an effective resource for transfer learning. The RoBERTa-based model fine-tuned on
WinoGrande achieved a new state of the art on WSC and four other related datasets:
• 90.1% on WSC;
• 93.1% on DPR;
• 90.6% on COPA;
• 85.6% on KnowRef; and
• 97.1% on Winogender.
What are future research areas?
• Exploring new algorithmic approaches for systematicbias reduction.
• Debiasing other NLP benchmarks.
12. Exploring the Limits of Transfer Learning
with a Unified Text-to-Text Transformer
• Original Abstract
• Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
• In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
• By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
Summary
• The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem.
• Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
What’s the core idea of this paper?
• The paper has several important contributions:
• Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing
techniques.
• Introducing a new approach to transfer learning in NLP by suggesting to treat every NLP problem as a text-to-
text task:
• The model understands which task should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”).
• Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text,
the Colossal Clean Crawled Corpus (C4).
• Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
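The text-to-text interface with task prefixes looks like the sketch below when using a small public T5 checkpoint from the Transformers library (t5-small here only to keep the example light; it is not the 11B model from the paper).

```python
# Sketch: the same T5 model handles different tasks purely via the input prefix.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The quick brown fox jumped over the lazy dog again and again all afternoon.",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```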
What’s the key achievement?
• The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out
of 24 tasks considered, including:
• the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI
tasks;
• the Exact Match score of 90.06 on the SQuAD dataset;
• the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
• the ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.
What are future research areas?
• Researching the methods to achieve stronger performance with cheaper models.
• Exploring more efficient knowledge extraction techniques.
• Further investigating the language-agnosticmodels.
What are possible business applications?
• Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
13. Reformer: The Efficient Transformer
• Original Abstract
• Large Transformer models routinely achieve state-of-the-art results on a number of tasks
but training these models can be prohibitively costly, especially on long sequences. We
introduce two techniques to improve the efficiency of Transformers.
• For one, we replace dot-product attention by one that uses locality-sensitive hashing,
changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
• Furthermore, we use reversible residual layers instead of the standard residuals, which
allows storing activations only once in the training process instead of N times, where N is
the number of layers. The resulting model, the Reformer, performs on par with
Transformer models while being much more memory-efficient and much faster on long
sequences.
Summary
• The leading Transformer models have become so big that they can be realistically trained only in
large research laboratories. To address this problem, the Google Research team introduces
several techniques that improve the efficiency of Transformers. In particular, they suggest
• (1) using reversible layers to allow storing the activations only once instead of for each layer,
and
• (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-
product attention. Experiments on several text tasks demonstrate that the
introduced Reformer model matches the performance of the full Transformer but runs much
faster and with much better memory efficiency.
Summary
Figure: Locality-Sensitive Hashing Attention, showing the hash-bucketing, sorting, and chunking steps, the resulting causal attentions, and the corresponding attention matrices (a–d).
What’s the core idea of this paper?
The leading Transformer models require huge
computational resources because of the very high number
of parameters and several other factors:
• The activations of every layer need to be stored for back-propagation.
• The intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations.
• The complexity of attention on a sequence of length L is O(L^2).
To address these problems, the research team introduces
the Reformer model with the following improvements:
• using reversible layers to store only a single copy of activations;
• splitting activations inside the feed-forward layers and processing them in chunks;
• approximating attention computation based on locality-sensitive hashing.
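The locality-sensitive hashing step can be sketched conceptually as below; this is a simplified, single-round version of the angular LSH scheme, whereas the real model hashes shared query/key vectors over multiple rounds and attends within chunks of sorted buckets.

```python
# Conceptual sketch of LSH bucketing: positions that hash to the same bucket
# attend to each other, so full O(L^2) attention is never materialized.
import torch

def lsh_buckets(x, n_buckets=8, seed=0):
    """x: (seq_len, dim) shared query/key vectors; returns one bucket id per position."""
    torch.manual_seed(seed)
    projections = torch.randn(x.shape[-1], n_buckets // 2)
    rotated = x @ projections                         # random projection
    rotated = torch.cat([rotated, -rotated], dim=-1)  # angular LSH trick
    return rotated.argmax(dim=-1)                     # (seq_len,)

vectors = torch.randn(16, 64)
print(lsh_buckets(vectors).tolist())  # attention is then restricted within each bucket
```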
What’s the key achievement?
• By analyzing the introduced techniques one by one, the authors show that model accuracy
is not sacrificed by:
• switching to locality-sensitive hashing attention;
• using reversible layers.
• Reformer performs on par with the full Transformer model while demonstrating much
higher speed and memory efficiency:
• For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani et al.’s (2017) BLEU score of 27.3.
What are possible business applications?
• The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as:
• text generation;
• visual content generation;
• music generation;
• time-series forecasting.
14. Longformer: The Long-Document
Transformer
• Original Abstract
• Transformer-based models are unable to process long sequences due to their self-attention
operation, which scales quadratically with the sequence length. To address this limitation, we
introduce the Longformer with an attention mechanism that scales linearly with sequence
length, making it easy to process documents of thousands of tokens or longer.
• Longformer’s attention mechanism is a drop-in replacement for the standard self-attention
and combines a local windowed attention with a task motivated global attention. Following
prior work on long-sequence transformers, we evaluate Longformer on character-level
language modeling and achieve state-of-the-art results on text8 and enwik8.
• In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of
downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long
document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.
Summary
• Self-attention is one of the key factors behind the success of Transformer architecture. However,
it also makes transformer-based models hard to apply to long documents. The existing
techniques usually divide the long input into a number of chunks and then use complex
architectures to combine information across these chunks.
• The research team from the Allen Institute for Artificial Intelligence introduces a more elegant
solution to this problem. The suggested Longformer model employs an attention pattern that
combines local windowed attention with task-motivated global attention.
• This attention mechanism scales linearly with the sequence length and enables processing of
documents with thousands of tokens. The experiments demonstrate that Longformer achieves
state-of-the-art results on character-level language modeling tasks, and when pre-trained,
consistently outperforms RoBERTa on long-document tasks.
Summary
Figure: Full self-attention pattern vs. Longformer’s configuration of attention patterns.
What’s the core idea of this paper?
• The computational requirements of self-attention grow quadratically with sequence length, making it hard to
process on current hardware.
• To address this issue, the researchers present Longformer, a modified version of Transformer architecture that:
• allows memory usage to scale linearly, and not quadratically, with the sequence length;
• includes an attention mechanism that combines:
• a windowed local-context self-attention to build contextual representations;
• an end task motivated global attention to encode inductive bias about the task and build full sequence
representation.
• Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication
that is not supported in the existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce
a custom CUDA kernel for implementing these attention operations.
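In the Transformers library, the task-motivated global attention is exposed through a global_attention_mask; the sketch below (using the public allenai/longformer-base-4096 checkpoint) marks only the first token as global, a common choice for classification-style tasks.

```python
# Sketch: running Longformer over a long input with global attention on the first token.
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "A fairly long document. " * 400  # well beyond a 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # the first token attends to, and is attended by, every position

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```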
What’s the key achievement?
• The Longformer model achieves a new state of the art on character-level language modeling tasks:
• BPC (bits per character) of 1.10 on text8;
• BPC of 1.00 on enwik8.
• After pre-training and fine-tuning for six tasks, including classification, question answering, and
coreference resolution, the Longformer-base consistently outperforms the RoBERTa-base with:
• accuracy of 75.0 vs. 72.4 on WikiHop;
• F1 score of 75.2 vs. 74.2 on TriviaQA;
• joint F1 score of 64.4 vs. 63.5 on HotpotQA;
• average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
• accuracy of 95.7 vs. 95.3 on the IMDB classification task;
• F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
• The performance gains are especially remarkable for the tasks that require a long context (i.e.,
WikiHop and Hyperpartisan).
What are future research areas?
• Exploring other attention patterns that are more efficient due to dynamic adaptation to the input.
• Applying Longformer to other relevant long document tasks such as summarization.
What are possible business applications?
• The Longformer architecture can be very
advantageous for the downstream NLP tasks that
often require processing of long documents:
• document classification;
• question answering;
• coreference resolution;
• summarization;
• semantic search.
15. ELECTRA: Pre-training Text Encoders as
Discriminators Rather Than Generators
• Original Abstract
• Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
• As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network.
• Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
• As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
Summary
• The pre-training task for popular language models like BERT and XLNet involves masking a
small subset of unlabeled input and then training the network to recover this original
input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%).
• As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement.
• As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
What’s the core idea of this paper?
• Pre-training methods that are based on masked language modeling are computationally
inefficient as they use only a small fraction of tokens for learning.
• Researchers propose a new pre-training task called replaced token detection, where:
• some tokens are replaced by samples from a small generator network;
• a model is pre-trained as a discriminator to distinguish between original and replaced
tokens.
• The introduced approach, called ELECTRA (Efficiently Learning an Encoder
that Classifies Token Replacements Accurately):
• enables the model to learn from all input tokens instead of the small masked-out
subset;
• is not adversarial, despite the similarity to GAN, as the generator producing tokens
for replacement is trained with maximum likelihood.
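As a hedged illustration (assuming the Hugging Face Transformers library and the publicly released google/electra-small-discriminator checkpoint), the pre-trained discriminator can be queried directly to flag which tokens look replaced:

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# A sentence where a plausible word has been swapped in ("flew" instead of "cooked").
sentence = "The chef flew the meal in the kitchen"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one logit per token: replaced vs. original

predictions = (logits > 0).int().squeeze().tolist()
# Skip the [CLS] and [SEP] positions when aligning with the word tokens.
for token, is_replaced in zip(tokenizer.tokenize(sentence), predictions[1:-1]):
    print(f"{token:>10s}  {'REPLACED' if is_replaced else 'original'}")
```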
What’s the key achievement?
• Demonstrating that the discriminative task of distinguishing between real data and
challenging negative samples is more efficient than existing generative methods for
language representation learning.
• Introducing a model that substantially outperforms state-of-the-art approaches while
requiring less pre-training compute:
• ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT
model with a score of 75.1 and a much larger GPT model with a score of 78.8.
• An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25%
of their pre-training compute.
• ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and
SQuAD benchmarks while still requiring less pre-training compute.
What are possible business applications?
• Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
16. Language Models are Few-Shot Learners
Summary
• The OpenAI research team draws attention to the fact that the need for a labeled dataset
for every new language task limits the applicability of language models.
• Considering that there is a wide range of possible tasks and it’s often difficult to collect a
large labeled training dataset, the researchers suggest an alternative solution, which is
scaling up language models to improve task-agnostic few-shot performance.
• They test their solution by training a 175B-parameter autoregressive language model,
called GPT-3, and evaluating its performance on over two dozen NLP tasks. The
evaluation under few-shot learning, one-shot learning, and zero-shot learning
demonstrates that GPT-3 achieves promising results and even occasionally outperforms
the state of the art achieved by fine-tuned models.
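A minimal sketch of the in-context few-shot setup described above (prompt construction only; the actual completion call is model- and provider-specific and omitted here). The example pairs mirror the paper's English-to-French illustration.

```python
def few_shot_prompt(task_description: str, examples, query: str) -> str:
    """Pack k labeled demonstrations into the prompt; no gradient update is performed."""
    lines = [task_description, ""]
    for src, tgt in examples:                 # k labeled demonstrations
        lines.append(f"Input: {src}\nOutput: {tgt}\n")
    lines.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```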
What’s the
core idea of
this paper?
Sparse Transformer
What’s the
key
achievement?
What does
the AI
community
think?
• Sam Altman, CEO and co-founder of OpenAI
• Abubakar Abid, CEO and founder of Gradio
• Gary Marcus, CEO and founder of Robust.ai
• Geoffrey Hinton, Turing Award winner
What are future research areas?
• Improving pre-training sample efficiency.
What are possible business applications?
17. Beyond Accuracy: Behavioral Testing of
NLP models with CheckList
• Original Abstract
• Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors.
• Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly.
• We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Summary
• The authors point out the shortcomings of existing approaches to evaluating performance
of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate
where the model is failing and how to fix it. The alternative evaluation approaches usually
focus on individual tasks or specific capabilities.
• To address the lack of comprehensive evaluation approaches, the researchers
introduce CheckList, a new evaluation methodology for testing of NLP models. The
approach is inspired by principles of behavioral testing in software engineering.
• Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test
ideation. Multiple user studies demonstrate that CheckList is very effective at discovering
actionable bugs, even in extensively tested NLP models.
What’s the
core idea
of this
paper?
Existing approaches to evaluation of NLP models have many significant
shortcomings:
• The primary approach to the evaluation of models’ generalization capabilities, which is accuracy
on held-out data, may lead to performance overestimation, as the held-out data often contains
the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much
in figuring out where the NLP model is failing and how to fix these bugs.
• The alternative approaches are usually designed for evaluation of specific behaviors on
individual tasks and thus, lack comprehensiveness.
To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by behavioral testing in software engineering:
• CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named
entity recognition, and negation.
• Then, to break down potential capability failures into specific behaviors, CheckList suggests
different test types, such as prediction invariance or directional expectation tests in case of
certain perturbations.
• Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily (a minimal hand-rolled invariance-test sketch follows this list).
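The sketch below is a hand-rolled, hypothetical example of a CheckList-style invariance (INV) test for sentiment analysis; predict_sentiment is a stand-in for any model under test, and the official checklist package offers much richer templating and perturbation utilities.

```python
from typing import Callable, List

def invariance_test(predict: Callable[[str], str],
                    template: str,
                    fillers: List[str]) -> List[str]:
    """Fill a template with values that should NOT change the label
    (here: person names) and report any prediction flips."""
    predictions = {name: predict(template.format(name=name)) for name in fillers}
    baseline = next(iter(predictions.values()))
    return [f"{name}: {label}" for name, label in predictions.items()
            if label != baseline]

# hypothetical toy model wrapper under test
def predict_sentiment(text: str) -> str:
    return "positive" if "great" in text.lower() else "negative"

failures = invariance_test(
    predict_sentiment,
    template="{name} thought the flight was great.",
    fillers=["Mary", "Ahmed", "Keiko", "Diego"],
)
print("Invariance failures:", failures or "none")
```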
What’s the
key
achievement?
What does
the AI
community
think?
• The paper received the Best
Paper Award at ACL 2020, the
leading conference in natural
language processing.
What are possible business applications?
• CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
• Such comprehensive testing that
helps in identifying many actionable
bugs is likely to lead to more robust
NLP systems.
18. Tangled up in BLEU: Reevaluating the Evaluation of
Automatic Machine Translation Evaluation Metrics
• Original Abstract
• Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.
• We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy.
• Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected.
• Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
Summary
• The most recent Conference on Machine Translation (WMT) has revealed that,
based on Pearson’s correlation coefficient, automatic metrics poorly match human
evaluations of translation quality when comparing only a few best systems. Even
negative correlations were exhibited in some instances.
• The research team from the University of Melbourne investigates this issue by
studying the role of outlier systems, exploring how the correlation coefficient
reflects different patterns of errors (type I vs. type II errors), and what magnitude of
difference in the metric score corresponds to true improvements in translation
quality as judged by humans.
• Their findings suggest that small BLEU differences (i.e., 1–2 points) have little
meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over
BLEU. However, only human evaluations can be a reliable basis for drawing
important empirical conclusions.
What’s the core idea of this paper?
• Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
• However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
• For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
• The authors of this paper take a closer look at this problem and discover that:
• The identified problem with Pearson's correlation is due to the small sample size and not specific to comparing strong MT systems.
• Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
• The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
• Small BLEU differences of 1–2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
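A toy illustration (synthetic numbers, not taken from the paper) of the outlier effect described above, assuming SciPy is available:

```python
from scipy.stats import pearsonr

# hypothetical (metric_score, human_score) pairs for closely matched systems
metric = [27.1, 27.4, 27.6, 27.9, 28.2]
human = [0.12, 0.18, 0.10, 0.20, 0.15]   # only weakly related to the metric
print("without outlier: r = %.2f" % pearsonr(metric, human)[0])

# adding one clearly worse outlier system inflates the correlation
metric_out = metric + [15.0]
human_out = human + [-0.90]
print("with outlier:    r = %.2f" % pearsonr(metric_out, human_out)[0])
```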
What’s the key achievement?
• Conducting a thorough analysis of automatic evaluation metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems (a small scoring sketch follows this list):
• Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU
and TER.
• Moving away from using small changes in evaluation metrics as the sole basis to
draw important empirical conclusions, and always ensuring support from human
evaluations before claiming that one MT system significantly outperforms
another one.
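A hedged sketch, assuming the sacrebleu package, of scoring the same system outputs with BLEU, chrF, and TER rather than relying on BLEU alone:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["the cat sat on the mat", "he read the book quickly"]
references = [["the cat sat on the mat", "he read the book fast"]]  # one reference stream

# Report several metrics side by side instead of drawing conclusions from BLEU only.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```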
19. Towards a Human-like Open-Domain Chatbot
Summary
(Figure: Example of Meena generating a response, "The Next Generation" — Google AI Blog)
What’s the
core idea of
this paper?
Evolved
Transformer
What’s the key
achievement?
What does
the AI
community
think?
• Elliot Turner, CEO and founder of Hyperia
• Graham Neubig, Associate professor at Carnegie Mellon University
What are future research areas?
What are possible business applications?
The authors suggest some interesting applications for open-domain chatbots such as Meena:
• further humanizing computer interactions;
• improving foreign language practice;
• making interactive movie and video game characters relatable.
20. Recipes for Building an Open-Domain
Chatbot
• Original Abstract
• Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot.
• Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona.
• We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available.
• Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Summary
• The Facebook AI Research team shows that with appropriate training data and
generation strategy, large-scale models can learn many important
conversational skills, such as engagingness, knowledge, empathy, and persona
consistency. Thus, to build their state-of-the-art conversational agent,
called BlenderBot, they leveraged a model with 9.4B parameters, trained it on
a novel task called Blended Skill Talk, and deployed beam search with carefully
selected hyperparameters as a generation strategy.
• Human evaluations demonstrate that BlenderBot outperforms Meena in
pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in
terms of humanness.
What’s the core idea of this paper?
• The introduced recipe for building a state-of-the-art open-domain chatbot includes three key ingredients (a minimal decoding sketch follows this list):
• Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
• Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
• Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, it was demonstrated that the length of the agent's utterances is very important for chatbot performance (i.e., too short responses are often considered dull, while too long responses make the chatbot appear to waffle and not listen).
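As a hedged illustration of the decoding recipe (assuming the Hugging Face Transformers library and the publicly released facebook/blenderbot-400M-distill checkpoint, a distilled variant rather than the paper's 2.7B/9.4B models), beam search is combined with a minimum generation length so that replies are neither too short nor rambling:

```python
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("My dog just learned a new trick!", return_tensors="pt")
reply_ids = model.generate(
    **inputs,
    num_beams=10,     # beam search decoding
    min_length=20,    # discourage overly short, dull replies
    max_length=60,    # cap length so the bot does not ramble
)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```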
What’s the key achievement?
The introduced chatbot outperforms the previous best-performing open-domain chatbot, Meena. In pairwise match-ups, BlenderBot with 2.7B parameters wins:
• 75% of the time in terms of
engagingness;
• 65% of the time in terms of
humanness.
In an A/B comparison between
human-to-human and human-to-
BlenderBot conversations, the latter
were preferred 49% of the time as
more engaging.
What are future research areas?
Thanks for listening

More Related Content

What's hot

Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxDeep Learning Italia
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptxTuCaoMinh2
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understandinggohyunwoong
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithmszamakhan
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep LearningNatasha Latysheva
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 

What's hot (20)

BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 
BERT
BERTBERT
BERT
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
NLP
NLPNLP
NLP
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithms
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Sequence Modelling with Deep Learning
Sequence Modelling with Deep LearningSequence Modelling with Deep Learning
Sequence Modelling with Deep Learning
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 

Similar to Nlp research presentation

IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET Journal
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERTAbdurrahimDerric
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyNUPUR YADAV
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...IAESIJAI
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...IJCI JOURNAL
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdfAnime196637
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniquesiosrjce
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Fwdays
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONijaia
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONgerogepatton
 

Similar to Nlp research presentation (20)

LLM.pdf
LLM.pdfLLM.pdf
LLM.pdf
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
 
short_story.pptx
short_story.pptxshort_story.pptx
short_story.pptx
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERT
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
 
1808.10245v1 (1).pdf
1808.10245v1 (1).pdf1808.10245v1 (1).pdf
1808.10245v1 (1).pdf
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...Analysis of the evolution of advanced transformer-based language models: Expe...
Analysis of the evolution of advanced transformer-based language models: Expe...
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdf
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATIONAN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION
 

Recently uploaded

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Nlp research presentation

  • 1. NLP Research Papers -- Surya SG
  • 2. Today's Agenda • Trends of NLP Research Paper • Real Time Example of Transformer • Baseline and Overview of Transformers in NLP • Quick Code Tour at the Transformers library features. • Summary of the models • Summary of selective Research Papers
  • 4. Shifting away from huge labeled datasets • Unsupervised: • Yadav et al. propose a retrieval-based QA approach that iteratively refines the query to a KB to retrieve evidence for answering a certain question. Tamborrino et al. achieve impressive results on commonsense multiple choice tasks by computing a plausibility score for each answer candidate using a masked LM. • Data augmentation: • Fabbri et al. propose an approach to automatically generate (context, question, answer) triplets to train a QA model. They retrieve contexts that are similar to those in the original dataset, generate yes/no and templated WH questions for these contexts, and train the model on the synthetic triplets. Jacob Andreas proposes replacing rare phrases with a more frequent phrase that appears in similar contexts in order to improve compositional generalization in neural networks. Asai and Hajishirzi augment QA training data with synthetic examples that are logically derived from the original training data, to enforce symmetry and transitivity consistency. • Meta learning: • Yu et al. use meta learning to transfer knowledge for hypernymy detection from high-resource to low-resource languages. • Active learning: • Li et al. developed an efficient annotation framework for coreference resolution that selects the most valuable samples to annotate through active learning.
  • 5. Language models is not all you need — retrieval is back • Retrieval: • Two of the invited talks at the Repl4NLP workshop mentioned retrieval-augmented LMs. Kristina Toutanova talked about Google’s REALM, and about augmenting LMs with knowledge about entities (e.g. here, and here). Mike Lewis talked about the nearest neighbor LM that improves the prediction of factual knowledge, and Facebook’s RAG model that combines a generator with a retrieval component. • Using external KBs: • this has been commonly done for several years now. Guan et al. enhance GPT-2 with knowledge from commonsense KBs for commonsense tasks. Wu et al. used such KBs for dialogue generation. • Enhancing LMs with new abilities: • Zhou et al. trained a LM to capture temporal knowledge (e.g. on the frequency and duration of events) using training instances obtained through information extraction with patterns and SRL. Geva and Gupta inject numerical skills into BERT by fine-tuning it on numerical data generated using templates and textual data that requires reasoning over numbers.
  • 6. Explainable NLP • It seems that this year looking at attention weights has gone out of fashion and instead the focus is on generating textual rationales, preferably ones that are faithful — • i.e. reflect the discriminative model’s decision. Kumar and Talukdar predict faithful explanations for NLI by generating candidate explanations for each label, and using them to predict the label. Jain et al. develop a faithful explanation model that relies on post-hoc explanation methods (which are not necessarily faithful) and heuristics to generate training data. • To evaluate explanation models, Hase and Bansal propose to measure users’ ability to predict model behavior with and without a given explanation.
  • 7. Reflecting on current achievements, limitations, and thoughts about the future of NLP We are solving datasets, not tasks. There are inherent limitations in current models and data. We need to move away from classification tasks. We need to learn to handle ambiguity and uncertainty.
  • 8. Discussions about ethics (it’s complicated) • Who benefits from the system? • Who could be harmed by it? • Can users choose to opt out? • Does the system enforce or worsen systemic inequalities? • Is it generallybettering the world?
  • 9. 1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding • Original Abstract • We introduce a new language representationmodel called BERT, which standsfor Bidirectional Encoder Representationsfrom Transformers.Unlike recent language representationmodels, BERT is designed to pre-train deep bidirectional representationsby jointly conditioning on both left and right contextin all layers. As a result, the pre-trained BERT representationscan be fine-tuned with justone additional output layer to create state-of-the-artmodels for a wide range of tasks, such as question answering and language inference, without substantialtask-specific architecturemodifications. • BERT is conceptually simple and empirically powerful. It obtains new state-of-the-artresults on eleven naturallanguage processingtasks, including pushing the GLUEbenchmark to 80.4% (7.6% absolute improvement), MultiNLIaccuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2(1.5% absolute improvement), outperforming human performance by 2.0%.
  • 10. Summary • A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERTobtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.
  • 12. What’s the core idea of this paper? • Training a deep bidirectional model by randomly masking a percentage of input tokens – thus, avoiding cycles where words can indirectly “see themselves”. • Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERTto better understand relationships between sentences. • Training a very big model (24 Transformer blocks, 1024-hidden, 340Mparameters) with lots of data (3.3 billion word corpus).
  • 13. What’s the key achievement? • Advancing the state-of-the-art for 11 NLP tasks, including: • getting a GLUE score of 80.4%, which is 7.6% of absolute improvement from the previous best result; • achieving 93.2% accuracy on SQuAD 1.1 and outperforming human performance by 2%. • Suggesting a pre-trained model, which doesn’t require any substantial architecture modifications to be applied to specific NLP tasks.
  • 14. What does the AI community think? • BERT model marks a new era of NLP. • In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B comes after sentence A?” ) provide great results for many NLP tasks. • Pre-training of language models becomes a new standard. • What are future research areas? • Testing the method on a wider range of tasks. • Investigating the linguistic phenomena that may or may not be captured by BERT.
  • 15. What are possible business applications? • BERT may assist businesses with a wide range of NLP problems, including: • chatbots for better customerexperience; • analysis of customer reviews; • the search for relevant information, etc.
  • 16. 2. XLNet: Generalized Autoregressive Pretraining for Language Understanding • Original Abstract • With the capability of modeling bidirectional contexts,denoising autoencodingbased pretraininglike BERT achieves better performance than pretraining approachesbased on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. • In light of these pros and cons, we propose XLNet, a generalized autoregressive pretrainingmethod that (1) enables learning bidirectional contextsby maximizing the expected likelihood over all permutations of the factorizationorder and • (2) overcomes the limitations of BERT thanks to its autoregressive formulation.Furthermore, XLNet integratesideas from Transformer-XL,the state-of-the-artautoregressive model, into pretraining. • Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, naturallanguage inference, sentiment analysis, and document ranking.
  • 17. Summary • The researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. • XLNet is a generalized autoregressive pretraining method that leverages the best of both autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT) while avoiding their limitations. The experiments demonstrate that the new model outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on 18 NLP tasks.
  • 19. What’s the core idea of this paper? • XLNet combines the bidirectional capability of BERT with the autoregressive technologyof Transformer-XL: • Like BERT, XLNet uses a bidirectional context, which means it looks at the words before and after a given token to predict what it should be. To this end, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order. • As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the assumptionthat unmasked tokens are independent of each other. • To further improve architectural designs for pretraining, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL.
  • 20. What’s the key achievement? • XLnet outperforms BERT on 20 tasks,often by a large margin. • The new model achieves state-of-the-artperformance on 18 NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking. • What are future research areas? • Extending XLNet to new areas, such as computer vision and reinforcement learning.
  • 21. What does the AI community think? • The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence. • “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian Ruder, a research scientist at Deepmind. • “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
  • 22. What are possible business applications? XLNetmayassistbusinesses witha wide range of NLP problems,including: chatbotsfor first-line customersupportor answeringproductinquiries; sentimentanalysisfor gaugingbrandawarenessand perceptionbasedon customerreviewsandsocial media; the search forrelevant informationindocument basesor online,etc.
  • 23. 3. RoBERTa: A Robustly Optimized BERT Pretraining Approach • Original Abstract • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. • We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. • These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
  • 24. Summary • Natural languageprocessing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and fine-tuning parameters difficult. • In this study, Facebook AI and the University of Washingtonresearchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance. • Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sequence prediction training objective. The resulting optimized model, RoBERTa(Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
  • 25. What’s the core idea of this paper? • The Facebook AI research team found that BERT was significantly undertrained and suggested an improved recipe for its training, called RoBERTa: • More data: 160GB of text instead of the 16GB dataset originally used to train BERT. • Longer training: increasing the number of iterations from 100K to 300K and then further to 500K. • Larger batches: 8K instead of 256 in the original BERT base model. • Larger byte-level BPE vocabulary with 50K subword units instead of character-level BPE vocabulary of size 30K. • Removing the next sequence prediction objective from the training procedure. • Dynamically changing the masking pattern applied to the training data.
  • 26. What’s the key achievement? • RoBERTa outperforms BERT in all individual tasks on the General Language Understanding Evaluation (GLUE) benchmark. • The new model matches the recently introduced XLNet model on the GLUE benchmark and sets a new state of the art in four out of nine individual tasks. • What are future research areas? • Incorporating more sophisticated multi-taskfinetuning procedures.
  • 27. What are possible business applications? Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including dialogue systems, question answering, document classification, etc.
  • 28. 4. Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts • Original Abstract • Emotion cause extraction (ECE), the task aimed at extracting the potentialcauses behind certain emotionsin text, has gained much attentionin recent yearsdue to its wide applications.However, it suffers from two shortcomings: • 1) the emotion must be annotatedbefore cause extraction in ECE, which greatly limitsits applicationsin real-world scenarios; • 2) the way to first annotateemotion and then extract the cause ignores the fact that they are mutuallyindicative.In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potentialpairsof emotions and corresponding causes in a document. • We propose a 2-step approachto address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conduct emotion-cause pairing and filtering. • The experimentalresults on a benchmark emotion cause corpus prove the feasibilityof the ECPE task as well as the effectiveness of our approach.
  • 29. Summary • Emotion cause extraction (ECE) is an approach used in natural language processing to identify statements containing the causes behind vocabulary expressing emotion. However, ECE requires emotions to first be annotated and ignores mutual relationships between causes and emotional effects. The researchers sought to solve this problem by simultaneously identifying pairs of emotions and causes in a task they call emotion-cause pair extraction (ECPE). • ECPE uses a two-step approach: the first step uses two multi-task learning networks to identify emotion and cause clauses, while the second step pairs all causes and emotions, and uses a trained filter to eliminate pairings that do not contain a causal relationship. The resulting ECPE task is able to identify emotion-cause pairs at an accuracy on par with existing ECE methods but without requiring emotion annotation.
  • 31. What’s the core idea of this paper? • The paper introduces a new emotion-cause pair extraction (ECPE) task to overcome the limitations of the traditional ECE task, where emotion annotation is required prior to cause extraction and mutual indicativeness of emotion and cause is not taken into account. • The introduced approach consists of two steps: • In the first step, the two individual tasks of emotion extraction and cause extraction are performed via two kinds of multi-task learning networks: • Inter-EC that uses emotion extraction to improve cause extraction; • Inter-CE that leverages cause extraction to enhance emotion extraction. • In the second step, the model combines all elements of the two sets into pairs by applying a Cartesian product. Then, a logistic regression model is trained to eliminate pairs that do not contain a causal relationship.
  • 32. What’s the core idea of this paper?
  • 33. What’s the key achievement? • ECPE is able to achieve F1 scores of 0.83 for emotion extraction, 0.65 for cause extraction, and 0.61 for emotion-causepairing. • On the ECE benchmark dataset, ECPE performs on par with existing ECE methods that require emotion annotation before causal clauses can be identified. • What are future research areas? • Altering the ECPE approach from a two-stepto a one-step process that directly extracts emotion-cause pairs in an end-to-end fashion.
  • 34. What are possible business applications? Sentiment analysis for marketing campaigns. Opinion monitoring from social media.
  • 35. 5. CTRL: A Conditional Transformer Language Model For Controllable Generation • Original Abstract • Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. • Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. • This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://www.github.com/salesforce/ctrl.
• 36. Summary • Language models used for text generation are very powerful, but they are often “black boxes”, so users do not have much control over the output. • To address this problem, the Salesforce research team has introduced the Conditional Transformer Language (CTRL) model that conditions on a set of control codes. With these codes, the users can control domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. • Moreover, all control codes can be traced back to a specific subset of the training data, allowing CTRL to predict the subset of the training data most likely leveraged for a particular sequence. • This relationship between CTRL and its training data provides new possibilities for analyzing the correlations learned from each domain.
  • 37. What’s the core idea of this paper? • Text generation tools are very powerful, but they do not give users much control over the content, style or genre of the generated text. • The Salesforce research team has released CTRL, a 1.6 billion-parameter conditional transformer language model, that gives users more control over the generated content: • CTRL exposes keywords called control codes which allow users to specify a domain, style, topics, dates, entities, relationships between entities, plot points, and task- related behavior. • CTRL is trained on control codes derived from the structure that naturally co-occurs with the raw text. In particular, CTRL leverages the fact that training data is usually associated with a URL that contains information relevant to the text it represents.
  • 38. What’s the key achievement? • Introducing and open-sourcing a language model that: • enables more controllable text generation; • provides new opportunities for analyzing large amounts of text via model-based source attribution; • can be used to detect artificially generated text.
• 39. What are future research areas? • Introducing a greater variety of control codes to allow finer-grained control. • Extending to other areas of NLP including abstractive summarization and commonsense reasoning. • Analyzing the relationships between training data and language models. • Exploring the possibilities to make the interface between humans and language models more explicit and intuitive.
• 40. What are possible business applications? Improved and tailored text generation for question-answering systems and other human-computer interaction applications. Identifying artificially generated text, to detect malicious uses such as automatically generated essays or fake reviews.
  • 41. 6. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations • Original Abstract • Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. • To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. • We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
• 42. Summary • The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance. • Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. • In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence. • The experiments demonstrate that the best version of ALBERT sets new state-of-the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
  • 43. What’s the core idea of this paper? • It is not reasonable to further improve language models by making them larger because of memory limitations of available hardware, longer training times, and unexpected degradation of model performance with the increased number of parameters. • To address this problem, the researchers introduce the ALBERT architecture that incorporates two parameter-reduction techniques: • factorized embedding parameterization, where the size of the hidden layers is separated from the size of vocabulary embeddings by decomposing the large vocabulary-embedding matrix into two small matrices; • cross-layer parameter sharing to prevent the number of parameters from growing with the depth of the network. • The performance of ALBERT is further improved by introducing the self-supervised loss for sentence-order prediction to address BERT’s limitations with regard to inter-sentence coherence.
  • 44. What’s the key achievement? • With the introduced parameter-reduction techniques, the ALBERT configuration with 18× fewer parameters and 1.7× faster training compared to the original BERT-large model achieves only slightly worse performance. • The much larger ALBERT configuration, which still has fewer parameters than BERT-large, outperforms all of the current state-of-the-artlanguage modes by getting: • 89.4% accuracy on the RACE benchmark; • 89.4 score on the GLUE benchmark; and • An F1 score of 92.2 on the SQuAD 2.0 benchmark.
• 45. What are possible business applications? The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.
  • 46. 7. Explain Yourself! Leveraging Language Models for Commonsense Reasoning • Original Abstract • Deep learning models perform poorly on tasks that require commonsense reasoning, which often necessitates some form of world-knowledge or reasoning over information not immediately present in the input. • We collect human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations (CoS-E). We use CoS-E to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework. • CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We further study commonsense reasoning in DNNs using both human and auto-generated explanations including transfer to out-of-domain tasks. Empirical results indicate that we can effectively leverage language models for commonsense reasoning.
• 47. Summary • Natural language processing algorithms are limited to information contained in texts, and often these algorithms lack commonsense reasoning that allows them to make inferences as most humans do. • The Salesforce research team suggests addressing this problem by training the language model to automatically generate commonsense explanations. This task is accomplished by providing the model with human explanations alongside the question answering samples. • These autogenerated explanations are then used by a neural network to solve the CommonsenseQA (CQA) task. This two-step approach improved accuracy on the CommonsenseQA multiple-choice test by 10% compared to existing models.
  • 49. What’s the core idea of this paper? • Natural language processing struggles with inference based on common sense and real-world knowledge. • The paper suggests addressing this issue in two phases: • First, the researchers train the model to generate Common Sense Explanations (CoS-E) by providing human-generated explanations in the form of both open- ended sentences and highlighted span annotations, alongside Commonsense Question Answering (CQA) examples. • In the second phase, the authors use this trained language model to generate explanations for each sample in the training and validation sets. These Commonsense Auto-Generated Explanations (CAGE) are then leveraged to solve the CQA task.
  • 50. What’s the key achievement? • The explanation-generating model improves performance in a natural language reasoning test by 10% over the previous best model and improves understanding of how neural networks apply knowledge. • Moreover, the experiments demonstrate that the introduced approach can be successfullytransferred to out-of-domain datasets.
• 51. What are future research areas? Combining the explanation-generating model into an answer prediction model. Extending the dataset of explanations to other tasks to create a more general explanatory language model. Removing bias from training datasets to eliminate bias in generated explanations.
• 52. What are possible business applications? • The model with improved common-sense reasoning capabilities can be leveraged: • to provide better customer service via chatbots; • to improve the performance of information retrieval systems.
• 53. 8. Detecting Concealed Information in Text and Speech • Original Abstract • Motivated by infamous cheating scandals in various industries and political events, we address the problem of detecting concealed information in technical settings. • In this work, we explore acoustic-prosodic and linguistic indicators of information concealment by collecting a unique corpus of professionals practicing for oral exams while concealing information. • We reveal subtle signs of concealed information in speech and text, compare and contrast them with those in deception detection literature, thus uncovering the link between concealing information and deception. • We then present a series of experiments that automatically detect concealed information from text and speech. We compare the use of acoustic-prosodic, linguistic, and individual feature sets, using different machine learning models. Finally, we present a multi-task learning framework with acoustic, linguistic, and individual features, that outperforms human performance by over 15%.
• 54. Summary • When confidential information is leaked, it is often difficult to tell who originally obtained the leaked information and who it has been leaked to. Even though previous work has demonstrated that changes in voice tone, lexicon, and speech patterns can identify when someone is concealing information, research in this area is scarce, partly due to the lack of datasets that include ground-truth labels indicating information concealment. • To address this issue, the present study introduces a new dataset collected from a unique audio corpus of professional wine tasters practicing for oral exams while concealing information. By leveraging this dataset, the researcher was able to develop a new multi-task learning model for detecting concealed information that performs 11% better than baseline models and 15% better than humans.
  • 56. What’s the core idea of this paper? While there are machine learning-based methods for detecting when someone does not have information but pretends to, there are few comparable models for detecting when someone is concealing leaked information. In this study, Hu from Cornell University captured linguistic and acoustic-prosodic features from a controlled human experiment to create a dataset of speech patterns when people were speaking honestly and when they were concealing some information. The author leverages this dataset to develop a multi-task learning framework where, as well as identifying concealed information, the system is also predicting whether the speaker’s answer is correct and the identity of the wine.
  • 57. What’s the key achievement? A multi-task learning model outperformed baseline models by 11% and humans by 15% at detecting when someone is concealing information. Moreover, the introduced framework outperforms humans even in the case where some of the humans in the experiment knew one another and could read social cues (e.g. gestures) that are not available to the model.
• 58. What are future research areas? Studying individual differences in both detecting concealed information and concealing information. Exploring the predictive power of phonotactic variation features. Conducting domain adaptation with regards to detecting concealed information. Improving the scalability of the multi-task learning model.
• 59. What are possible business applications? Detecting insider trading in financial markets. Controlling data leaks within different testing procedures. Tracing and limiting the extent of information leaks around political campaigns.
• 60. 9. Improving Visual Question Answering by Referring to Generated Paragraph Captions • Original Abstract • Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. • Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). • Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). • Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.
• 61. Summary • Computer models struggle with answering questions about visual images, a task known as visual question answering (VQA). • In this study, the researchers sought to improve VQA performance by providing a VQA model with a text description of an image’s content produced by a paragraph captioning model. • The two models were fused over three stages to generate a consensus answer to questions posed about the image. • The resulting visual and textual question answering (VTQA) model was 1.92% more accurate than the standalone VQA model.
  • 63. What’s the core idea of this paper? • VQA models struggle with identifying all of the necessary informationin images, and particularly abstract concepts,required to answer questions. • The researchers suggest using a pre-trained paragraphcaptioningmodel to provide additional information to the VQA model. • The text and image input are fused at three levels: • in the early fuse stage, visual features are fused with paragraphcaptionand object property features by cross-attention; • in the late fuse stage,the inputs are fused again in the form of consensus,i.e. logits from each module are integratedinto one vector; • in the later fuse stage,the model accountsfor the fact that some regions of the image are more likely to draw people’s attention,and thus questions and answers are more likely to be related to those regions. So, the model gives an extra score to the answers related to the salient regions.
  • 64. What’s the key achievement? • Improving visual question answering performance by 1.92% compared to the baseline VQA model.
• 65. What are future research areas? Improving VTQA models to extract more information from textual captions, and enhancing paragraph captioning models to generate better captions. Training the VTQA model jointly with the paragraph captioning model.
• 66. What are possible business applications? • Improving image search and retrieval. • Image annotation and interactivity for blind people. • Creating “interactive” images for online education.
  • 67. 10. Thieves on Sesame Street! Model Extraction of BERT-based APIs • Original Abstract • We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that the adversary does not need any real training data to successfully mount the attack. • In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random sequences of words coupled with task-specific heuristics form effective queries for model extraction on a diverse set of NLP tasks, including natural language inference and question answering. • Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model. • Finally, we study two defense strategies against model extraction—membership classification and API watermarking—which while successful against naive adversaries, are ineffective against more sophisticated ones.
• 68. Summary • This paper highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model on SST2, SQuAD, MNLI, and BoolQ. On the SST2 task, the victim model had a 93.1% accuracy compared to their extracted model’s 90.1%. • They show that an adversary does not need any real training data to mount the attack successfully. The attacker does not even need to use grammatical or semantically meaningful queries. They used random sequences of words coupled with task-specific heuristics to form useful queries for model extraction on a diverse set of NLP tasks.
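A hedged sketch of the query-generation idea follows: random word sequences plus a simple task-specific heuristic for SQuAD-style QA. The word list, the templates, and the `query_victim` placeholder are hypothetical; the point is only that the attacker needs no real training data, just the victim API's outputs.

```python
# Hedged sketch of nonsense-query generation for model extraction (illustrative only).
import random

random.seed(0)
wordlist = ["apple", "quantum", "river", "although", "seven", "purple",
            "engine", "slowly", "market", "theory", "window", "ocean"]

def random_paragraph(n_words=40):
    """A random sequence of words standing in for a 'context' paragraph."""
    return " ".join(random.choices(wordlist, k=n_words))

def random_question(paragraph):
    """Task-specific heuristic for extractive QA: reuse words from the paragraph
    inside a WH-question template."""
    span = random.sample(paragraph.split(), k=3)
    return "What " + " ".join(span) + "?"

def query_victim(paragraph, question):
    # Hypothetical placeholder for the black-box API under attack.
    raise NotImplementedError("call the victim QA API here")

synthetic_data = []
for _ in range(5):
    p = random_paragraph()
    q = random_question(p)
    synthetic_data.append((p, q))   # each pair would be sent to query_victim(p, q)
    print(q)
```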
  • 69. Summary • Why It Matters: Outputs of modern NLP APIs on nonsensical text provide strong signals about model internals, allowing adversaries to train their own models and avoid paying for the API.
  • 70. What’s the core idea of this paper? • DEFENSES • MEMBERSHIPCLASSIFICATION • Our first defense uses membership inference, which is traditionally used to determinewhether a classifier was trained on a particular input point. • In our setting we use membership inference for “outlier detection”,where nonsensicaland ungrammaticalinputs (which are unlikely to be issued by a legitimate user) are identified • When such out-of-distributioninputs are detected, the API issues a random outputinstead of the model’s predicted output, which eliminates the extractionsignal. • WATERMARKING • in which a tiny fractionof queries are chosen at random and modified to return a wrong output. • These “watermarked queries” and their outputs are stored on the API side. Since deep neural networks have the ability to memorize arbitrary information,this defense anticipatesthat extractedmodels will memorize some of the watermarked queries, leaving them vulnerable to post-hoc detection if they are deployed publicly
  • 71. What’s the key achievement? • Our results show that fine-tuning large pretrained language models simplifies the process of extraction for an attacker. • Unfortunately, existing defenses against extraction, while effective in some scenarios, are generally inadequate, and further research is necessary to develop defenses robust in the face of adaptive adversaries who develop counter-attacksanticipating simple defenses.
  • 72. What are future research areas? • Other interesting future directions that follow from the results in this paper include • (1) leveraging nonsensical inputs to improve model distillation on tasks for which it is difficult to procure input data; • (2) diagnosing dataset complexity by using query efficiency as a proxy; and • (3) further investigation of the agreement between victim models as a method to identify proximity in input distribution and its incorporation into an active learning setup for model extraction.
• 73. What are possible business applications? • Protecting paid NLP APIs from model theft. • Informing decision analysis on API cost models for NLU and NLG services.
• 74. 11. WinoGrande: An Adversarial Winograd Schema Challenge at Scale • Original Abstract • The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. • To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. • Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
• 75. Summary The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande, a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm. The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
  • 77. What’s the core idea of this paper? • The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense reasoning. • They introduce WinoGrande, a new large-scale dataset for commonsense reasoning. Their approach has two key features: • A carefully designed crowdsourcing procedure: • Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers. • Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected questions, 53K were deemed valid. • A novel algorithm AfLite for systematic bias reduction: • It generalizes human-detectable biases based on word occurrences to machine-detectable biases based on embedding occurrences. • After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples.
  • 78. What’s the key achievement? • WinoGrande is easy for humans and challenging for machines: • Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%); • RoBERTa achieves 79.1%test-set accuracy; • whereas human performance achieves 94% accuracy. • WinoGrande is also an effective resource for transfer learning. The RoBERTa-based model fine-tuned on WinoGrande achieved a new state of the art on WSC and four other related datasets: • 90.1%on WSC; • 93.1%on DPR; • 90.6%on COPA; • 85.6%on KnowRef; and • 97.1%on Winogender.
• 79. What are future research areas? • Exploring new algorithmic approaches for systematic bias reduction. • Debiasing other NLP benchmarks.
• 80. 12. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer • Original Abstract • Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. • In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. • By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
• 81. Summary • The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. • Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
  • 83. What’s the core idea of this paper? • The paper has several important contributions: • Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques. • Introducing a new approach to transfer learning in NLP by suggesting to treat every NLP problem as a text-to- text task: • The mode understands which tasks should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”). • Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4). • Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
  • 84. What’s the key achievement? • The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out of 24 tasks considered, including: • the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks; • the Exact Match score of 90.06 on SQuAD dataset; • the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6)and very close to human performance (89.8); • the ROUGE-2-F score of 21.55 on CNN/Daily Mail abstractive summarizationtask.
• 85. What are future research areas? • Researching the methods to achieve stronger performance with cheaper models. • Exploring more efficient knowledge extraction techniques. • Further investigating the language-agnostic models.
• 86. What are possible business applications? • Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
  • 87. 13. Reformer: The Efficient Transformer • Original Abstract • Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. • For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. • Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
• 88. Summary • The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest • (1) using reversible layers to allow storing the activations only once instead of for each layer, and • (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
  • 89. Summary Locality-Sensitive Hashing Attention showing the hash-bucketing, sorting, and chunking steps, and the resulting causal attentions, together with the corresponding attention matrices (a–d)
  • 90. What’s the core idea of this paper? The leading Transformer models require huge computational resources because of the very high number of parameters and several other factors: • The activations of every layer need to be stored for back-propagation. • The intermediate feed-forward layers accountfor a large fractionof memory use since their depth is often much larger than the depth of attentionactivations. • The complexity of attentionon a sequence of length L is O(L^2). To address these problems, the research team introduces the Reformer model with the following improvements: • using reversiblelayersto store only a single copy of activations; • splittingactivations inside the feed-forward layers and processing them in chunks; • approximatingattentioncomputationbased on locality-sensitive hashing.
  • 91. What’s the key achievement? • By analyzing the introduced techniques one by one, the authors show that model accuracy is not sacrificed by: • switching to locality-sensitive hashing attention; • using reversible layers. • Reformer performs on par with the full Transformer model while demonstrating much higher speed and memory efficiency: • For example, on the newstest2014 taskfor machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani’s et al. (2017)BLEU score of 27.3.
• 92. What are possible business applications? • The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as: • text generation; • visual content generation; • music generation; • time-series forecasting.
  • 93. 14. Longformer: The Long-Document Transformer • Original Abstract • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. • In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.
  • 94. Summary • Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. • The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. • This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
  • 95. Summary Full self-attention pattern vs. Longformer’s configuration of attention patterns
  • 96. What’s the core idea of this paper? • The computational requirements of self-attention grow quadratically with sequence length, making it hard to process on current hardware. • To address this issue, the researchers present Longformer, a modified version of Transformer architecture that: • allows memory usage to scale linearly, and not quadratically, with the sequence length; • includes an attention mechanism that combines: • a windowed local-context self-attention to build contextual representations; • an end task motivated global attention to encode inductive bias about the task and build full sequence representation. • Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in the existing deep learning libraries like PyTorch and Tensorflow, the authors also introduce a custom CUDA kernel for implementing these attention operations.
  • 97. What’s the key achievement? • The Longformer model achieves a new state of the art on character-level language modeling tasks: • BPC of 1.10 on text8; (Bits Per Character) • BPC of 1.00 on enwik8. • After pre-training and fine-tuning for six tasks, including classification, question answering, and coreference resolution, the Longformer-base consistently outperformers the RoBERTa-base with: • accuracy of 75.0 vs. 72.4 on WikiHop; • F1 score of 75.2 vs. 74.2 on TriviaQA; • joint F1 score of 64.4 vs. 63.5 on HotpotQA; • average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task; • accuracy of 95.7 vs. 95.3 on the IMDB classification task; • F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task. • The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).
• 98. What are future research areas? Exploring other attention patterns that are more efficient due to dynamic adaptation to the input. Applying Longformer to other relevant long document tasks such as summarization.
  • 99. What are possible business applications? • The Longformer architecture can be very advantageous for the downstream NLP tasks that often require processing of long documents: • document classification; • question answering; • coreference resolution; • summarization; • semantic search.
• 100. 15. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators • Original Abstract • Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. • As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. • Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. • As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
• 101. Summary • The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%). • As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement. • As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
  • 103. What’s the core idea of this paper? • Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning. • Researchers propose a new pre-training task called replaced token detection, where: • some tokens are replaced by samples from a small generator network; • a model is pre-trained as a discriminator to distinguish between original and replaced tokens. • The introduced approach, called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately): • enables the model to learn from all input tokens instead of the small masked-out subset; • is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood.
  • 104. What’s the key achievement? • Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning. • Introducing a model that substantially outperforms state-of-the-art approaches while requiring less pre-training compute: • ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8. • An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute. • ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.
  • 105. What are possible business applications? Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
  • 107. Summary • The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. • Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. • They test their solution by training a 175B-parameter autoregressive language model, called GPT-3, and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
  • 109. What’s the core idea of this paper? Sparse Transformer
• 111. What does the AI community think? Public reactions came from Sam Altman (CEO and co-founder of OpenAI), Abubakar Abid (CEO and founder of Gradio), Gary Marcus (CEO and founder of Robust.ai), and Geoffrey Hinton (Turing Award winner).
  • 113. What are possible business applications?
• 114. 17. Beyond Accuracy: Behavioral Testing of NLP models with CheckList • Original Abstract • Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. • Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. • We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • 115. Summary • The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. • To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. • Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.
  • 117. What’s the core idea of this paper? Existing approaches to evaluation of NLP models have many significant shortcomings: • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs. • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness. To address this problem, the research team introduces CheckList,a new methodology for evaluating NLP models, inspired by the behavioral testing in software engineering: • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation. • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests in case of certain perturbations. • Potential tests are structured as a matrix, with capabilities as rows and test types as columns. The suggested implementation of CheckList also introducesa variety of abstractionsto help users generate large numbers of test cases easily.
  • 119. What does the AI community think? • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
• 120. What are possible business applications? • CheckList can be used to create more exhaustive testing for a variety of NLP tasks. • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
• 121. 18. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics • Original Abstract • Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. • We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. • Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. • Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
  • 122. Summary • The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when comparing only a few best systems. Even negative correlations were exhibited in some instances. • The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. • Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.
  • 124. What’s the core idea of this paper? • Automaticmetrics are used as a proxyforhuman translation evaluation,which is considerablymore expensiveand time- consuming. • However, evaluatinghowwell different automaticmetrics concur with human evaluationis not a straightforwardproblem: • For example, the recent findings show that if the correlation between leadingmetrics and human evaluations is computed usinga large set of translationsystems,it is typicallyvery high (i.e., 0.9). However, if onlya few best systems are considered, the correlation reduces markedlyand can even be negativein some cases. • The authors ofthis paper take a closer lookat this problem and discoverthat: • The identified problem with Pearson’s correlationis due to the small sample size and not specific to comparingstrongMT systems. • Outlier systems,whose qualityis much higher or lower than the rest of the systems,havea disproportionate effect on the computed correlationand shouldbe removed. • The same correlation coefficient can reflect different patterns oferrors.Thus,a better approach for gaininginsights into metric reliabilityis to visualize metricscores against human scores. • Small BLEU differences of 1-2 points correspondto true improvements in translationquality(as judged by humans)onlyin 50% of cases.
  • 125. What’s the key achievement? • Conducting a thorough analysis of automatic metrics performance metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems: • Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU and TER. • Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
• 126. 19. Towards a Human-like Open-Domain Chatbot
• 128. Summary • Example of Meena generating a response, “The Next Generation” (Google AI Blog)
  • 129. What’s the core idea of this paper? Evolved Transformer
• 131. What does the AI community think? Public reactions came from Elliot Turner (CEO and founder of Hyperia) and Graham Neubig (Associate Professor at Carnegie Mellon University).
  • 132. What are future research areas?
• 134. What are possible business applications? The authors suggest some interesting applications for open-domain chatbots such as Meena: further humanizing computer interactions; improving foreign language practice; making interactive movie and video game characters relatable.
• 135. 20. Recipes for Building an Open-Domain Chatbot • Original Abstract • Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. • Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. • We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. • Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
• 136. Summary • The Facebook AI Research team shows that with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. Thus, to build their state-of-the-art conversational agent, called BlenderBot, they leveraged a model with 9.4B parameters, trained it on a novel task called Blended Skill Talk, and deployed beam search with carefully selected hyperparameters as a generation strategy. • Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.
  • 138. What’s the core idea of this paper? • The introduced recipe for building a state-of-the-artopen-domain chatbotincludes three key ingredients: • Largescale. The largest model has 9.4 billion parametersand was trained on 1.5 billion training examples of extractedconversations. • Blendedskills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaginguse of knowledge, and display of empathy. • Beam search used for decoding. The researchers show that this generation strategy,deployed with carefully selected hyperparameters,gives strongresults. In particular,it was demonstratedthat the lengths of the agent’sutterancesis very important for chatbot performance (i.e, too short responses are often considered dull and too long responses make the chatbot appear to waffle and not listen).
  • 139. What’s the key achievement? The introduced chatbot outperforms the previous best-performing open- domain chatbot Meena. Thus, in pairwise match-ups,BlenderBot with 2.7B parameters wins: • 75% of the time in terms of engagingness; • 65% of the time in terms of humanness. In an A/B comparison between human-to-human and human-to- BlenderBot conversations, the latter were preferred 49% of the time as more engaging.