PART 3: BERT
3. BERT
The BERT model is built upon the Transformer architecture, which consists of a stack of identical encoder layers.
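A minimal sketch of such a stack (not the authors' code), using PyTorch's built-in encoder modules with the BERT-Base sizes reported in the paper (12 layers, hidden size 768, 12 attention heads, feed-forward size 3072):

```python
# Sketch only: a stack of identical Transformer encoder layers.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size H
    nhead=12,              # attention heads A
    dim_feedforward=3072,  # inner feed-forward size
    activation="gelu",     # BERT uses GELU activations
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # L identical layers

x = torch.randn(2, 128, 768)   # dummy batch: 2 sequences of 128 token embeddings
out = encoder(x)               # contextualized representations, same shape
print(out.shape)               # torch.Size([2, 128, 768])
```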
3. BERT
BERT’s pre-training pipeline consists of two main stages:
● pre-processing
● pre-training
3. BERT
In the pre-processing stage, the text data is tokenized
into sentences and then further divided into smaller
segments called "tokens." These tokens are then
encoded into numerical representations using a
vocabulary mapping.
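A toy sketch of this step (illustrative only; real BERT uses WordPiece subword tokenization with a 30k-token vocabulary):

```python
# Toy pre-processing: split a sentence into tokens and map them to IDs
# through a vocabulary. The whitespace split and tiny vocabulary are
# simplifications for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6, "on": 7, "mat": 8}

def encode(sentence):
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(encode("The cat sat on the mat"))
# [2, 4, 5, 6, 7, 4, 8, 3]
```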
3. BERT
In the pre-training stage, BERT is trained on the pre-processed data with two objectives:
● MLM task
● NSP task
3. BERT
Masked Language Modeling (MLM)
BERT randomly masks a certain percentage of the input tokens and predicts the masked tokens based on the surrounding context.
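The masking scheme in the paper selects 15% of the tokens; of those, 80% become [MASK], 10% are replaced by a random token, and 10% stay unchanged. A small sketch (token and vocabulary IDs are illustrative placeholders):

```python
# Sketch of MLM input corruption; the model is trained to recover the
# original token at every masked position.
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # placeholder IDs in the style of bert-base-uncased

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                                # target: the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([2, 4, 5, 6, 7, 4, 8, 3]))
```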
3. BERT
Next Sentence Prediction (NSP): BERT also
learns to predict whether two sentences appear
consecutively in the original text or not
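In the paper, training pairs are built so that half the time sentence B actually follows sentence A (label IsNext) and half the time B is a random sentence from the corpus (label NotNext). A toy sketch with an illustrative corpus:

```python
# Sketch of NSP pair construction from an ordered list of sentences.
import random

corpus = ["the cat sat on the mat", "it fell asleep in the sun",
          "stock prices rose sharply", "the meeting was postponed"]

def make_nsp_pair(idx):
    sent_a = corpus[idx]
    if random.random() < 0.5 and idx + 1 < len(corpus):
        return sent_a, corpus[idx + 1], "IsNext"          # true next sentence
    sent_b = random.choice([s for j, s in enumerate(corpus) if j != idx + 1])
    return sent_a, sent_b, "NotNext"                      # random sentence

print(make_nsp_pair(0))
```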
3. BERT
BERT is pre-trained on a large corpus of unlabeled text, such as Wikipedia and BooksCorpus, amounting to billions of words and millions of sentences. This diverse and extensive dataset helps BERT learn rich, contextual language representations.
3. BERT
Although BERT learns contextual representations during pre-training, it needs to be further fine-tuned on task-specific data to achieve optimal performance.
Illustration of the pre-training / fine-tuning approach: three different downstream NLP tasks (MNLI, NER, and SQuAD) are all solved with the same pre-trained language model by fine-tuning on the specific task. Image credit: Devlin et al., 2019.
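A minimal fine-tuning sketch using the Hugging Face `transformers` library (an implementation choice not made in the slides): a task-specific classification head is placed on top of the pre-trained encoder and all parameters are updated on labeled task data.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

texts = ["great movie", "terrible plot"]          # toy labeled task data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr in the paper's fine-tuning range
model.train()
outputs = model(**batch, labels=labels)           # forward pass through encoder + task head
outputs.loss.backward()                           # fine-tune all parameters end-to-end
optimizer.step()
print(float(outputs.loss))
```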
PART 4: EXPERIMENTS
4. EXPERIMENTS
4.1 GLUE
The General Language Understanding Evaluation
(GLUE) benchmark is a collection of resources for
training, evaluating, and analyzing natural language
understanding systems.
4. EXPERIMENTS
Table 1: GLUE test results
4. EXPERIMENTS
4.2 SQuAD 1.1
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs.
4. EXPERIMENTS
Table 2: SQuAD 1.1 results
4. EXPERIMENTS
4.3 SQuAD 2.0
The SQuAD 2.0 task extends the SQuAD 1.1
problem definition by allowing for the
possibility that no short answer exists in the
provided paragraph, making the problem
more realistic.
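For SQuAD 2.0, the paper compares the score of the null span (predicted at the [CLS] token) with the best non-null span and only answers when the span beats the null score by a threshold tuned on the dev set. A sketch of that decision rule (scores and answer text below are illustrative):

```python
# Sketch of the no-answer decision for SQuAD 2.0-style QA.
def predict_answer(best_span_score, null_score, best_span_text, tau=0.0):
    # Answer only if the best non-null span beats the null span by at least tau.
    if best_span_score > null_score + tau:
        return best_span_text
    return ""   # empty string means "no answer in the paragraph"

print(predict_answer(7.2, 5.1, "Denver Broncos", tau=1.0))  # -> "Denver Broncos"
print(predict_answer(4.0, 5.1, "Denver Broncos", tau=1.0))  # -> "" (abstain)
```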
4. EXPERIMENTS
Table 3: SQuAD 2.0 results
THANK YOU!
