The presentation explains the ELECTRA model.
ELECTRA stands for 'Efficiently Learning an Encoder that Classifies Token Replacements Accurately'.
The paper proposes replaced token detection, a pre-training task that is more compute-efficient than masked language modeling.
(11th March 2021)
"Even bad code can function. But if code isn't clean, it can bring a development organization to its knees. Every year, countless hours and significant resources are lost because of poorly written code. But it doesn't have to be that way." This knolx session covers a few important topics for writing clean code: meaningful names, functions, comments, and classes.
Training language models to follow instructions with human feedback (Instruct... — Rama Irsheidat
Training language models to follow instructions with human feedback (InstructGPT).pptx
Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI)
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Kostiantyn Omelianchuk, Oleksandr Skurzhanskyi, "Building a state-of-the-art a..." — Fwdays
In this talk, we will look at the current state (post-BERT era) of GEC and share our experience of building the state-of-the-art system to perform this task. We will talk about the pros and cons of different architectures and compare inference times.
Testing is fundamental in software development. Quality gates demand high coverage levels, pull requests need sufficient tests, leading to teams spending considerable time writing and maintaining them. But are we using our tests to their full potential?
'If code is hard to test, the design can be improved'. Starting from this mantra, this deep-dive session unveils hints to simplify code, break-down complexity, and effectively use functional programming. We'll delve into topics like fixture creep, partial mocks, onion architecture, and pure functions, providing numerous best practices and practical tips for your testing.
Be warned: This session may significantly disrupt your work routine and will likely change how you see testing. Attend at your own risk.
These slides contain an introduction to Symbolic execution and an introduction to KLEE.
I made this for a small demo/intro for my research group's meeting.
Presentation about an Eclipse framework that allows generating Ecore model instances as input for tests and benchmarks. Held at the 3rd BigMDE workshop at STAF in L'Aquila, Italy, in July 2015.
Faculty of Science Department of Computing Final Examinati.docx — mydrynan
Faculty of Science
Department of Computing
Final Examination 2013
Unit: COMP229 Object Oriented Programming Practices
Release Date: 9:00am, November 15, 2013
Due Date: 11:45pm, November 19, 2013
Total Number of Questions: Six (6)
Total Marks: Sixty Four (68)
Instructions: Answer ALL questions.
All references to program code or behaviour refer to the Java language.
All answers to questions that ask for code must be written using Java.
Every attempt has been made to make questions unambiguous. However, if
you are not sure what a question is asking, make some reasonable assumption
and state it at the beginning of your answer.
COMP229 Object Oriented Programming Practices, November 2013
Question 1 (Design Patterns, 5 marks)
The template method pattern and the strategy pattern both abstract some computation in
the form of methods. What defining characteristic distinguishes the template method pattern
from the strategy pattern? Explain your answer. [5 marks]
Question 2 (Concurrency, 12 marks)
Consider the following class definition. This class is considered to be in an inconsistent state
if the isConsistent() method returns false.
public class Foo {
    long mValue;
    long mValueTimesTwo;

    /**
     * Sets the state of our object.
     *
     * Pauses briefly between setting the first and second
     * values in order to increase the probability that the
     * object will be interrogated while in an inconsistent
     * state.
     *
     * @param pValue the value to update the current state with
     */
    public synchronized void setValues(long pValue) {
        mValue = pValue;
        doPause(3);
        mValueTimesTwo = pValue * 2;
    }

    /**
     * Checks to see if the current state of our object is
     * consistent.
     *
     * @return true if it is.
     */
    public synchronized boolean isConsistent() {
        return (mValue * 2 == mValueTimesTwo);
    }

    /**
     * Utility routine - pauses our thread by calling
     * sleep and suppressing any InterruptedException.
     */
    private static void doPause(long pPause) {
        try {
            Thread.sleep(pPause);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
a. Imagine a hypothetical version of Java where the object lock is replaced by a method
lock. Under this system a call to a synchronised method would assign a lock for that
method to the calling thread. No other thread could then call this method because
the lock is already allocated. However, other methods of the same object could still
be called. Upon the method completing, the lock is released. Under this system, is it
possible to put an instance of the Foo class into an inconsistent state? If so, give a code
example which could create this situation and explain how it does so. If not, explain
how the method lock preven ...
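To make the scenario in part (a) concrete, here is a small sketch — in Python rather than the exam's Java, purely because Python's threading events make the problematic interleaving easy to force deterministically. The two separate `Lock` objects model the hypothetical per-method locks; all names here are illustrative, not part of the exam.

```python
import threading

class Foo:
    """Models the exam's Foo under hypothetical per-method locking:
    set_values and is_consistent each get their OWN lock."""
    def __init__(self):
        self.value = 0
        self.value_times_two = 0
        self.set_lock = threading.Lock()    # method lock for set_values
        self.check_lock = threading.Lock()  # method lock for is_consistent

def demo():
    foo = Foo()
    mid_update = threading.Event()  # writer has done only the first write
    resume = threading.Event()      # reader has finished its check
    seen = {}

    def writer():
        with foo.set_lock:          # holds only set_values' lock
            foo.value = 21
            mid_update.set()        # stand-in for doPause(3)
            resume.wait()
            foo.value_times_two = 42

    def reader():
        mid_update.wait()
        with foo.check_lock:        # a DIFFERENT lock, so not blocked
            seen["consistent"] = (foo.value * 2 == foo.value_times_two)
        resume.set()

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen["consistent"]

print(demo())  # False: the reader observed an inconsistent state
```

With Java's real object lock, isConsistent() would block until setValues() released the shared lock, so the read could never land between the two field updates.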
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015) — Benoit Combemale
Talk given at the 8th ACM SIGPLAN Int'l Conf. on Software Language Engineering (SLE 2015), Pittsburgh, PA, USA on October 27, 2015. Preprint available at https://hal.inria.fr/hal-01182517
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionSARADINDU SENGUPTA
With the recent explosion in development and interest in large language, vision and speech models, it has become apparent that running large models in production will be a key driver in enterprise adoption of ML. Traditional MLOps, i.e. running machine learning models in production, already has so many variabilities to address starting from data integrity, data drift and model optimization. Running a large model (language or vision) in production keeping in mind business requirements is different altogether. In this talk, I will try to explain the general framework for LLMOps and certain considerations while designing a system for inferencing a large model.
This talk will be covered in sub-topics:
1. Model Optimization
2. Model fine-tuning
3. Model Editing
4. Model Serving and deployment
5. Model metrics monitoring
6. Embedding and artifact management
In each sub-topic, a brief understanding of the current open-source tool sets will also be mentioned so that tool-chain selection is a bit easier.
CRF-based named entity recognition using a Korean lexical semantic network — Danbi Cho
They extracted features for the named entity recognition task.
They used UWordMap to learn the characteristics of Korean words.
(28th May, 2021)
I summarized the GPT models in this slide and compared GPT-1, GPT-2, and GPT-3.
GPT stands for Generative Pre-Training of a language model and is implemented based on the decoder structure of the Transformer model.
(24th May, 2021)
Attention boosted deep networks for video classification — Danbi Cho
The presentation explains integrating attention with CNN and LSTM.
This paper carried out a video classification task using attention with CNN-LSTM models.
(9th April 2021)
A survey on deep learning based approaches for action and gesture recognition... — Danbi Cho
The presentation surveys the methodologies for action and gesture recognition tasks with deep learning models and feature engineering methods.
(6th April 2021)
A survey on automatic detection of hate speech in text — Danbi Cho
The presentation surveys automatic detection of hate speech in text.
It explains the motivation of the research, the definition of hate speech, and literature reviews.
(8th February 2021)
Zero wall detecting zero-day web attacks through encoder-decoder recurrent ne... — Danbi Cho
The presentation describes zero-day attack detection using encoder-decoder recurrent neural networks, borrowing ideas from machine translation in natural language processing.
I presented this in a graduate class.
(Dec 2nd, 2020)
The presentation explains decision trees and ensembles in machine learning.
I presented this at the Big data club for college students.
(Jan 31st, 2019)
The presentation explains the paper on recurrent neural networks and time warping.
It considers invariance to time rescaling and invariance to time warpings, with pure warpings and paddings.
(Nov 18th, 2019)
Man is to computer programmer as woman is to homemaker: debiasing word embeddings — Danbi Cho
This presentation describes gender bias and explains the debiasing algorithms.
This paper uses an embedding method for debiasing.
I presented this paper in the natural language processing lab as an undergraduate research assistant.
(July 30th, 2019)
Situation recognition: visual semantic role labeling for image understanding — Danbi Cho
This presentation explains situation recognition with visual semantic role labeling for image understanding.
I presented this paper in the natural language processing lab as an undergraduate research assistant.
(July 16th, 2019)
Mitigating unwanted biases with adversarial learning — Danbi Cho
The presentation describes mitigating AI bias with adversarial learning.
It covers the AI Fairness 360 open-source toolkit by IBM.
I presented this paper in the natural language processing lab as an undergraduate research assistant.
(July 9th, 2019)
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR — Tier1 app
Although 'java.lang.OutOfMemoryError' appears on the surface to be a single error, there are underlying it nine types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Why React Native as a Strategic Advantage for Startup Innovation.pdf — ayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing, making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities among other benefits. This way, with React Native, developers can write code once and run it on both iOS and Android devices thus saving time and resources leading to shorter development cycles hence faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... — Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... — Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Strategies for Successful Data Migration Tools.pptx — varshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data Migration Tool like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Advanced Flow Concepts Every Developer Should Know — Peter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Enhancing Research Orchestration Capabilities at ORNL.pdf — Globus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll know how to organize and improve your code review process.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Globus Compute with IRI Workflows - GlobusWorld 2024 — Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Designing for Privacy in Amazon Web Services — KrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce? — XfilesPro
Worried about document security while sharing them in Salesforce? Fret no more! Here are the top-notch security standards XfilesPro upholds to ensure strong security for your Salesforce documents while sharing with internal or external people.
To learn more, read the blog: https://www.xfilespro.com/how-does-xfilespro-make-document-sharing-secure-and-seamless-in-salesforce/
Cyaniclab: Software Development Agency Portfolio.pdf — Cyanic lab
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Using IESVE for Room Loads Analysis - Australia & New Zealand
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
1. Natural Language Processing Lab
M2020064
Danbi Cho
Published in: The 8th International Conference on Learning Representations (ICLR 2020)
URL: https://arxiv.org/abs/2003.10555
2. Contents
1. Idea
2. Introduction
3. Method
4. Experiments and results
5. Summary
#Kookmin_University #Natural_Language_Processing_lab. 1
3. Idea
[BERT]
> Replaces some tokens with [MASK]
(masked language modeling)
> Trains a model that predicts the original identities
of the corrupted tokens
Problem)
requires large amounts of compute

[ELECTRA]
> Replaces some tokens with plausible alternatives
sampled from a small generator network
> Trains a discriminative model that predicts whether each token
in the corrupted input was replaced by a generator sample or not
Proposal)
a more sample-efficient pre-training task
: replaced token detection
5. Introduction
> SOTA representation learning = learning a DAE (denoising autoencoder)
> Proposed method: replaced token detection
> Goal: improve the efficiency of pre-training
Masked language modeling (BERT, XLNet):
- Input tokens are corrupted by masking or attention
- The model restores the original input tokens
- Substantial compute cost is incurred: the network only learns from 15% of the tokens per example

Replaced token detection (ELECTRA):
- Input tokens are corrupted by replacement; samples are generated by a small masked language model
- The model predicts whether each token is the original token or a replacement
- The model learns from all input tokens as a discriminator
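The contrast can be made concrete with a toy example (the sentence and the replacement are made up; no model is involved): a masked LM gets a training signal only at masked positions, while the discriminator gets a binary label at every position.

```python
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator replaced "cooked"

# Replaced-token-detection labels: 1 = original kept, 0 = replaced.
# Every position is labeled, so the discriminator learns from all 5 tokens;
# an MLM would learn only from the single masked position.
labels = [int(o == c) for o, c in zip(original, corrupted)]
print(labels)  # [1, 1, 0, 1, 1]
```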
8. Method
> Generator
(1) Input token sequence x = [x1, x2, ..., xn]
(2) Select a random set of positions (between 1 and n) for masking: m_i ~ unif(1, n) for i = 1 to k (k = 0.15n)
(3) Replace the tokens at the selected positions with [MASK]: x_masked = REPLACE(x, m, [MASK])
(4) Learn to predict the original identities of the masked tokens using a small MLM (the generator)
(5) Sample the predicted tokens from the generator's softmax output: x̂_i ~ P_G(x_i | x_masked) for i in m
(6) Replace the masked tokens with the generator's predicted tokens: x_corrupt = REPLACE(x, m, x̂)
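Steps (1)-(6) can be sketched in NumPy. The 'generator' here is faked with uniform sampling over a toy vocabulary; the token ids, the MASK id, and the vocabulary size are placeholders, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = 0                                     # placeholder [MASK] token id
x = np.array([17, 4, 9, 23, 5, 11, 8, 30])  # (1) input token ids, n = 8
n = len(x)

k = max(1, int(0.15 * n))                   # (2) k = 0.15n masked positions
m = rng.choice(n, size=k, replace=False)    #     m_i ~ unif(1, n)

x_masked = x.copy()                         # (3) x_masked = REPLACE(x, m, [MASK])
x_masked[m] = MASK

# (4)-(5) a real generator is a small MLM trained to predict the originals;
# here its softmax sampling is faked with a uniform draw over a 32-token vocab.
x_hat = rng.integers(1, 32, size=k)         # x̂_i ~ P_G(x_i | x_masked) for i in m

x_corrupt = x_masked.copy()                 # (6) x_corrupt = REPLACE(x, m, x̂)
x_corrupt[m] = x_hat

print("masked: ", x_masked)
print("corrupt:", x_corrupt)
```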
9. Method
> Discriminator
(1) Replace the masked tokens with the generator's predicted tokens: x_corrupt = REPLACE(x, m, x̂)
(2) Learn to distinguish original tokens from replaced tokens (the discriminator)
(3) Output the predicted type of each input token with a sigmoid

> Loss function
- Minimize the combined loss over the corpus X:
min_{θ_G, θ_D} Σ_{x in X} [ L_MLM(x, θ_G) + λ L_Disc(x, θ_D) ]
(L_MLM: loss of the generator, L_Disc: loss of the discriminator)
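Numerically, the combined objective looks like this (a NumPy sketch with made-up probabilities; L_MLM is cross-entropy at the masked positions, L_Disc is per-token binary cross-entropy over all positions, and the paper weights the discriminator loss with λ = 50):

```python
import numpy as np

def l_mlm(p_original_at_masked):
    """Generator loss: -log p of the original token, averaged over masked positions."""
    return float(-np.log(p_original_at_masked).mean())

def l_disc(p_original, is_original):
    """Discriminator loss: binary cross-entropy over ALL token positions."""
    p, y = np.asarray(p_original), np.asarray(is_original)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

# Made-up numbers for one 5-token sequence with one masked position:
gen_loss = l_mlm(np.array([0.2]))                  # generator put p=0.2 on the original
disc_loss = l_disc([0.9, 0.8, 0.3, 0.95, 0.85],    # predicted p(original) per token
                   [1,   1,   0,   1,    1])       # 0 marks the replaced token
lam = 50.0                                         # λ, as used in the paper
total = gen_loss + lam * disc_loss
print(f"L_MLM={gen_loss:.3f}  L_Disc={disc_loss:.3f}  combined={total:.3f}")
```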
10. Experiments and results
1) Experimental setup
2) Model Extensions
3) Small Models
4) Large Models
5) Efficiency Analysis
11. Experimental Setup
> Evaluation
- GLUE (General Language Understanding Evaluation): 9 tasks (average score)
- CoLA: Is the sentence grammatical or ungrammatical?
- SST: Is the movie review positive, negative or neutral?
- MRPC: Is the sentence B a paraphrase of sentence A?
- STS: How similar are sentences A and B?
- QQP: Are the two questions similar?
- MNLI: Does sentence A entail or contradict sentence B?
- QNLI: Does sentence B contain the answer to the question in sentence A?
- RTE: Does sentence A entail sentence B?
- WNLI: Sentence B replaces sentence A’s ambiguous pronoun with one of the nouns – Is this the correct noun?
- SQuAD (Stanford Question Answering Dataset)
https://rajpurkar.github.io/SQuAD-explorer/
https://gluebenchmark.com/
12. Model Extensions
> Weight sharing
- Sharing weights between the generator and discriminator
- Case 1: model size(generator) == model size(discriminator) (*model size = the number of hidden units)
#. Comparing the weight-tying strategies (GLUE score):
- no weight tying: 83.6
- tying token embeddings: 84.3 (*advantage: the MLM task is effective for learning the token embeddings)
- tying all weights: 84.4 (*disadvantage: requires the generator and discriminator to be the same size)
- Case 2: model size(generator) < model size(discriminator)
: share only the token and positional embedding weights (this is effective)
13. Model Extensions
> Smaller generators
- If the generator and discriminator are the same size, the compute cost is high
- Reduce the generator by decreasing the layer sizes while keeping the other hyperparameters constant
- The GLUE score is best when the generator is ¼ to ½ the size of the discriminator
(Figures: GLUE scores when the generator and discriminator are the same size vs. different sizes)
14. Model Extensions
> Training algorithms (alternatives tried)
1. Train only the generator with the MLM loss for n steps
2. Initialize the weights of the discriminator with the weights of the generator,
then train the discriminator with the discriminator loss for n steps, keeping the generator's weights frozen
- In addition, they explore training the generator adversarially, as in a GAN (58%)
- Problem 1: inefficiency of reinforcement learning when working in the large action space of generating text
- Problem 2: low entropy of the generator's output distribution under adversarial learning
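The two-stage schedule above can be sketched as a loop skeleton (the 'models' are plain dicts and the step functions just bump a number, standing in for real optimizer steps):

```python
def mlm_step(generator, batch):
    generator["w"] += 0.1       # stand-in for an optimizer step on L_MLM

def disc_step(discriminator, batch):
    discriminator["w"] += 0.05  # stand-in for an optimizer step on L_Disc

def two_stage_training(batches, n_steps):
    generator = {"w": 0.0}
    # Stage 1: train only the generator with the MLM loss for n steps.
    for batch in batches[:n_steps]:
        mlm_step(generator, batch)
    # Stage 2: initialize the discriminator with the generator's weights,
    # then train only the discriminator; the generator stays frozen.
    discriminator = {"w": generator["w"]}
    frozen = generator["w"]
    for batch in batches[:n_steps]:
        disc_step(discriminator, batch)
    assert generator["w"] == frozen  # frozen throughout stage 2
    return generator, discriminator

gen, disc = two_stage_training(batches=list(range(10)), n_steps=5)
```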
15. Small Models - GLUE
16. Large Models - GLUE
- ELECTRA-400K: trained with ¼ of RoBERTa's pre-training compute (400K steps)
- ELECTRA-1.75M: trained with the same compute as RoBERTa (1,750K steps)
- Results are reported on the GLUE dev set and test set
17. Small Models & Large Models - SQuAD
18. Efficiency Analysis
> Validating where ELECTRA's gains come from
1. ELECTRA 15%: the discriminator loss uses only the 15% of tokens that were masked
- tests the effect of computing the loss over all tokens
> ELECTRA (85.0) > ELECTRA 15% (82.4)
2. Replace MLM: replaces the [MASK] tokens with sample tokens from the generator
- tests the effect of replacing [MASK] tokens with generator samples
> Replace MLM (82.4) > BERT (82.2)
3. All-Tokens MLM: the model predicts all tokens, not only the masked ones
- tests the effect of a sigmoid layer deciding whether to copy the original token
> All-Tokens MLM (84.3) > Replace MLM (82.4)
19. Summary
> Proposal:
replaced token detection (a new self-supervised task for language representation learning)
> Key idea:
training a text encoder to distinguish input tokens from high-quality negative samples produced by a generator
> Performance:
ELECTRA is more compute-efficient than masked language models and achieves better performance