백지윤 2021.04.24
Visualizing and Measuring the
Geometry of BERT
NLP team: 김은희, 백지윤, 신동진, 진명훈
Presenter: 백지윤
INDEX
• Introduction to BERT and to the paper
• Geometry of syntax
• Geometry of word senses
• Conclusion
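Introduction to BERT
• BERT keeps only the encoder of the Transformer
• GPT keeps only the decoder of the Transformer

Item                          GPT                      BERT
Transformer block             Decoder block            Encoder block
Attention direction           Unidirectional           Bidirectional
Use for sentence generation   Can generate sentences   Cannot generate sentences directly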
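Language model: a model that assigns probabilities to word sequences in order to model the phenomenon of language
=> a model that learns language
Input: I ate [MASK]. I am hungry.
Output ([MASK]): I ate tteokbokki.
Output ([CLS], for the second sentence "I am hungry."): 0
The cross-entropy loss is computed only on the [MASK] tokens!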
Let Bert learn languages !
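The point that the cross-entropy loss is computed only on the masked tokens can be sketched in a few lines. This is a minimal illustration, not BERT's actual training code: the four-word vocabulary, the logits, and the function name `masked_cross_entropy` are all made up for the example.

```python
import math

def masked_cross_entropy(logits, targets, is_masked):
    """Average cross-entropy over the masked positions only.

    logits:    one list of vocabulary scores per position
    targets:   the correct vocabulary id per position
    is_masked: True where the input token was replaced by [MASK]
    """
    total, count = 0.0, 0
    for scores, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked positions contribute nothing to the loss
        # Numerically stable log-sum-exp of the scores at this position.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0

# Hypothetical toy vocabulary: 0="I", 1="ate", 2="tteokbokki", 3="."
logits = [
    [5.0, 0.0, 0.0, 0.0],  # position of "I"
    [0.0, 5.0, 0.0, 0.0],  # position of "ate"
    [0.0, 0.0, 2.0, 0.0],  # position of [MASK], should predict "tteokbokki"
    [0.0, 0.0, 0.0, 5.0],  # position of "."
]
targets = [0, 1, 2, 3]
is_masked = [False, False, True, False]

loss = masked_cross_entropy(logits, targets, is_masked)
print(round(loss, 4))
```

Changing the scores at any unmasked position leaves `loss` unchanged, which is exactly what the slide states.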
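Paper intro
Let's take a close look at how BERT learns language!
• The geometry of syntax that BERT builds while carrying out its two pre-training tasks
• The geometry of word senses that BERT builds while carrying out its two pre-training tasks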
Geometry of syntax
1. Attention probes
About the Dataset : the Penn Treebank
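• Corpus annotation: adding special markup (tagging) to the text of a corpus in order to maximize its usefulness, e.g. part-of-speech annotation, syntactic annotation, semantic annotation
• In syntactic annotation, a variety of tags express the relation between two words in a sentence
• Penn Treebank (1990-92, USA): more than 3.3 million words, mostly sentences from the Wall Street Journal. It has been released, so it is easy to access.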
Adjectival modifier (example: "rainy weather")
Head: weather 👩 Child (dependent): rainy 👧 🧒
"rainy" modifies "weather" and narrows down its meaning
Dataset: the Penn Treebank
• Kept the 30 dependency relations that have more than 5,000 examples in the dataset
• 30% train/test split
attention probe
Goal: classify a given relation between two tokens (token i, token j)
• Input: a model-wide attention vector formed by concatenating the entries a[i,j] in every attention matrix from every attention head in every layer
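Head 1: the attention scores of token 1 over the other tokens, before they are multiplied by the values and summed
Head 1: the attention scores of token 2 over the other tokens, before they are multiplied by the values and summed
Goal: classify a given relation between two tokens, e.g. (token 1, token 3)
[Figure: the score between token 1 and token 3 is read from the attention matrix of layer 1, head 1, and likewise from every other layer and head]
=> dimension of the attention vector = number of layers × number of heads (12 × 12 = 144 for BERT-base)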
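The construction of the model-wide attention vector can be sketched as follows. The attention matrices here are random stand-ins rather than real BERT outputs, and the helper names `attention_vector` and `random_attention` are hypothetical; only the 12 layers × 12 heads shape comes from BERT-base.

```python
import random

def attention_vector(attentions, i, j):
    """Concatenate the entry a[i][j] from every head in every layer.

    attentions[layer][head] is a (tokens x tokens) attention matrix,
    stored here as a list of rows.
    """
    return [head[i][j] for layer in attentions for head in layer]

def random_attention(num_tokens, rng):
    """Stand-in attention matrix: every row is non-negative and sums to 1."""
    rows = []
    for _ in range(num_tokens):
        raw = [rng.random() for _ in range(num_tokens)]
        total = sum(raw)
        rows.append([value / total for value in raw])
    return rows

rng = random.Random(0)
num_layers, num_heads, num_tokens = 12, 12, 5  # BERT-base: 12 layers x 12 heads
attentions = [
    [random_attention(num_tokens, rng) for _ in range(num_heads)]
    for _ in range(num_layers)
]

# Token 1 and token 3 of the slide, written 0-indexed.
vec = attention_vector(attentions, i=0, j=2)
print(len(vec))  # 144
```

The vector has one entry per (layer, head) pair, so its length does not depend on the sentence length.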
attention probe
Goal: classify a given relation between two tokens (token i, token j)
• First probe: does a dependency relation exist between the two tokens? (binary classification)
• Second probe: if a dependency relation exists, which dependency relation is it? (multiclass classification)
• Tool used for the probing task: L2-regularized linear classifiers
[Figure: sentences from the Penn Treebank dataset -> BERT-base extracts the attention vectors for each sentence -> the (L2-regularized) linear classifiers are trained on them -> outputs of the first probe and the second probe]
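A rough sketch of the first (binary) probe, assuming a plain logistic-regression classifier with an L2 penalty trained by gradient descent. The features are synthetic stand-ins for attention vectors (3 entries instead of 144), and the function names, learning rate, and penalty strength are illustrative choices, not values from the slides.

```python
import math
import random

def train_l2_logistic(features, labels, l2=0.01, lr=1.0, epochs=300):
    """Binary linear classifier trained by gradient descent with an L2 penalty."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d] / n
            grad_b += (p - y) / n
        # The L2 term shrinks the weights toward zero; the bias is not penalized.
        weights = [w - lr * (g + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(weights, bias, x):
    return 1 if bias + sum(w * v for w, v in zip(weights, x)) > 0 else 0

# Synthetic data: entry 0 is high when the token pair is related,
# the remaining entries are noise.
rng = random.Random(0)
data = []
for _ in range(200):
    related = rng.random() < 0.5
    signal = rng.uniform(0.7, 1.0) if related else rng.uniform(0.0, 0.3)
    data.append(([signal, rng.random(), rng.random()], 1 if related else 0))

split = len(data) * 7 // 10  # hold out 30% of the examples for testing
train, test = data[:split], data[split:]
weights, bias = train_l2_logistic([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(weights, bias, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The second probe would swap the sigmoid for a softmax over the 30 relation labels; the input vector stays the same.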
attention probe results
Probe                        Accuracy
First probe (binary)         85.8%
Second probe (multiclass)    71.9%
Attention vectors contain syntactic information! => True
2. Geometry of parse tree
embeddings
"To see how BERT understands syntax, shouldn't we compare BERT's embeddings
with the parse tree that represents the syntax?"
STEP 1. Embed the tree so that it can be compared with BERT embeddings 🌳
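Background
The basic idea is to project word embeddings to a vector space where the squared L2 distance between a pair of words in a sentence
approximates the number of hops between them in the dependency tree
• Let's see how a language model takes in syntactic information!
- <A Structural Probe for Finding Syntax in Word Representations>
• $h_i^l$ : embedding of the i-th word of the l-th sentence (m x 1, where m is the embedding dimension); $h_j^l$ likewise for the j-th word
• $B$ : the linear transformation to be optimized (k x m), i.e. the parameter
• $B h$ : h expressed in the new space (k x 1)
• $A = B^T B$, so $h^T A h = \|B h\|^2 \ge 0$, i.e. A is positive semidefinite
There exists an inner product on the representation space whose squared distance encodes syntax tree distance:
$d_B(h_i^l, h_j^l)^2 = (B(h_i^l - h_j^l))^T (B(h_i^l - h_j^l)) = (h_i^l - h_j^l)^T A (h_i^l - h_j^l)$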
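Worked example with m = 2:
$h_i = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad h_j = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad A = \begin{pmatrix} a & c/2 \\ c/2 & b \end{pmatrix}$
$(h_i - h_j)^T A (h_i - h_j) = a (x_1 - y_1)^2 + b (x_2 - y_2)^2 + c (x_1 - y_1)(x_2 - y_2)$
With k = 1, B maps each embedding to a single number, $B h_i = u$ and $B h_j = v$, so the same quantity is $(u - v)^2$.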
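The identity behind the worked example, that the squared distance after applying B equals the quadratic form with A = B^T B, can be checked numerically. The numbers and helper names below are arbitrary choices for illustration; B has k = 1 row and m = 2 columns.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def gram(B):
    """Return A = B^T B for a k x m matrix B."""
    k, m = len(B), len(B[0])
    return [[sum(B[r][p] * B[r][q] for r in range(k)) for q in range(m)]
            for p in range(m)]

def probe_squared_distance(B, h_i, h_j):
    """||B (h_i - h_j)||^2, the squared distance in the transformed space."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(v * v for v in matvec(B, diff))

def quadratic_form(A, h_i, h_j):
    """(h_i - h_j)^T A (h_i - h_j)."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    return sum(d * v for d, v in zip(diff, matvec(A, diff)))

B = [[1.0, 2.0]]                       # k = 1, m = 2
h_i, h_j = [3.0, 1.0], [1.0, 0.5]      # two made-up word embeddings

A = gram(B)                            # [[1, 2], [2, 4]], so a = 1, b = 4, c = 4
via_B = probe_squared_distance(B, h_i, h_j)
via_A = quadratic_form(A, h_i, h_j)
print(A, via_B, via_A)
```

Both routes give the same number, which is why optimizing B and optimizing a positive semidefinite A describe the same probe.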
STEP 1. Embed the tree so that it can be compared with BERT embeddings 🌳
[Figure: example tree with the nodes Root 🧓, 👩, 👧, 🧒]
tree embedding distance seems to correspond specifically to the square of Euclidean distance !
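One way to see why the squared Euclidean distance is the natural match for tree distance is to give every edge its own orthogonal axis and embed each node as the sum of the unit vectors on its path from the root. The sketch below checks this on a small hypothetical tree; the node names and the tree shape are assumptions made for the example.

```python
from itertools import combinations

def edge_axis_embedding(parent):
    """Embed a tree with one orthogonal axis per edge.

    parent[node] is the parent of node, or None for the root.
    A node's vector has a 1 on the axis of every edge between it and the root.
    """
    edges = [node for node, p in parent.items() if p is not None]
    axis = {node: index for index, node in enumerate(edges)}
    vectors = {}
    for node in parent:
        vec = [0.0] * len(edges)
        current = node
        while parent[current] is not None:
            vec[axis[current]] = 1.0
            current = parent[current]
        vectors[node] = vec
    return vectors

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def tree_distance(parent, a, b):
    """Number of edges on the path between a and b."""
    def path_to_root(node):
        path = [node]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path
    path_a, path_b = path_to_root(a), path_to_root(b)
    shared = set(path_a) & set(path_b)
    # Steps from each node up to the lowest common ancestor.
    up_a = next(i for i, n in enumerate(path_a) if n in shared)
    up_b = next(i for i, n in enumerate(path_b) if n in shared)
    return up_a + up_b

# Hypothetical tree: root -> parent -> two children.
parent = {"root": None, "parent": "root", "child_a": "parent", "child_b": "parent"}
vectors = edge_axis_embedding(parent)

for a, b in combinations(parent, 2):
    print(a, b, tree_distance(parent, a, b), squared_distance(vectors[a], vectors[b]))
```

For every pair the two numbers agree: the path between two nodes uses each edge at most once, and each used edge adds exactly 1 to the squared distance.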
STEP 1. Embed the tree so that it can be compared with BERT embeddings 🌳
STEP 2. Compare the parse tree embedding with the BERT embedding after Hewitt and Manning's transformation
[Figure: PCA visualization of the parse tree embedding vs. the transformed BERT embedding]
3. Geometry of word senses
1. Visualization of word senses
Reminder
STEP 1. A user enters a word
STEP 2. The system retrieves 1,000 sentences containing that word
STEP 3. It sends these sentences to BERT-base as input
STEP 4. For each one, it retrieves the context embedding for the word from a layer of the user's choosing
[Figure: the example sentence "He died soon" is fed to BERT; the head outputs are concatenated back to the normal size to form the context embedding]
[Figure: visualization of the context embeddings of a user-chosen word ("die") across 1,000 sentences]
2. Measurement of word sense
disambiguation capability
• word sense: one of the meanings of a word, e.g. "I rolled a dice"
• Training data: SemCor (33,362 senses)
• Classifier: nearest neighbor, where each neighbor is the centroid of a given word sense's BERT-base embeddings in the training data
• Test data: the data from <Word sense disambiguation: A unified evaluation framework and empirical comparison> (3,669 senses)
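The centroid-based nearest-neighbor idea can be sketched on toy data. The 2-dimensional vectors and sense labels below are invented for the example, and the use of cosine similarity as the closeness measure is an assumption; real inputs would be BERT-base context embeddings from SemCor.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_sense(embedding, centroids):
    """Return the sense whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda sense: cosine(embedding, centroids[sense]))

# Hypothetical training embeddings for two senses of "die".
training = {
    "die (stop living)": [[1.0, 0.1], [0.9, 0.2]],
    "die (numbered cube)": [[0.1, 1.0], [0.2, 0.8]],
}
centroids = {sense: centroid(vectors) for sense, vectors in training.items()}

print(nearest_sense([0.8, 0.3], centroids))
print(nearest_sense([0.2, 0.9], centroids))
```

Only one vector per sense has to be stored at test time, which keeps the classifier simple enough that good accuracy can be credited to the embeddings rather than to the classifier.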
2.1 An embedding subspace for
word senses
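BERT with probe
As in the geometry of syntax part, the method of <A Structural Probe for Finding Syntax in Word Representations> is reused as is.
That is, we search for the parameter B, a linear transformation into a space where the embeddings satisfy the distance rule expressed by the loss function.
Loss function:
loss = (average cosine similarity between embeddings with different senses) - (average cosine similarity between embeddings with the same sense)
Minimizing the loss pushes the first term down (min) and the second term up (max).
Reference: https://www.youtube.com/watch?v=VAzpZh01g58&t=799s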
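As a small illustration of this loss, the sketch below evaluates it for two hand-picked matrices B instead of training B by gradient descent. The four 2-dimensional embeddings are made up so that the first coordinate carries no sense information and the second one does; all names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def transform(B, h):
    """Apply the linear map B (list of rows) to the embedding h."""
    return [sum(b * x for b, x in zip(row, h)) for row in B]

def sense_probe_loss(B, embeddings, senses):
    """Average cosine similarity across senses minus that within senses."""
    projected = [transform(B, h) for h in embeddings]
    same, different = [], []
    for (u, s), (v, t) in combinations(list(zip(projected, senses)), 2):
        (same if s == t else different).append(cosine(u, v))
    return sum(different) / len(different) - sum(same) / len(same)

embeddings = [[1.0, 1.0], [1.0, 0.8], [1.0, -1.0], [1.0, -0.8]]
senses = ["A", "A", "B", "B"]

identity = [[1.0, 0.0], [0.0, 1.0]]  # leaves the embeddings unchanged
probe = [[0.0, 1.0]]                 # keeps only the sense-bearing coordinate

loss_identity = sense_probe_loss(identity, embeddings, senses)
loss_probe = sense_probe_loss(probe, embeddings, senses)
print(round(loss_identity, 3), round(loss_probe, 3))
```

The projection onto the sense-bearing coordinate reaches the lowest value the loss can take (-2), which is the kind of subspace a trained B is meant to find.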
3. Embedding distance and
context : a concatenation
experiment
[Figure: embeddings of "went" in sense A and of "went" in sense B]
Individual similarity ratio:
(similarity between the keyword embedding and its matching sense centroid) / (similarity between the keyword embedding and its opposing sense centroid)
Concatenate:
"He thereupon went to London and spent the winter talking to men of wealth" and "He went prone on his stomach, the better to pursue his examination."
After concatenation, the context embedding of one "went" also reflects information from the other "went".
Concatenated similarity ratio:
(similarity between the keyword embedding in the concatenated sentence and its matching sense centroid) / (similarity between the keyword embedding in the concatenated sentence and its opposing sense centroid)
Failure mode for attention-based models : tokens indiscriminately absorb meaning from all neighbors !
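The two ratios can be computed with the same small function. The vectors below are invented purely to exercise the formula and are not results from the paper; cosine similarity is assumed as the similarity measure, and the names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_ratio(embedding, matching_centroid, opposing_centroid):
    """Similarity to the matching sense centroid over similarity to the opposing one."""
    return cosine(embedding, matching_centroid) / cosine(embedding, opposing_centroid)

# Made-up sense centroids and keyword embeddings for "went".
centroid_sense_a = [1.0, 0.0]
centroid_sense_b = [0.0, 1.0]
went_individual = [0.9, 0.3]      # "went" (sense A) embedded in its own sentence
went_concatenated = [0.7, 0.6]    # same "went" after appending a sense-B sentence

individual_ratio = similarity_ratio(went_individual, centroid_sense_a, centroid_sense_b)
concatenated_ratio = similarity_ratio(went_concatenated, centroid_sense_a, centroid_sense_b)
print(round(individual_ratio, 3), round(concatenated_ratio, 3))
```

A ratio that drops toward 1 after concatenation means the keyword embedding has drifted toward the opposing sense, which is the absorption effect the slide describes.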

More Related Content

What's hot

Role of unification and realization in natural language generation
Role of unification and realization in natural language generationRole of unification and realization in natural language generation
Role of unification and realization in natural language generationRishav Bhurtel
 
Meta back translation
Meta back translationMeta back translation
Meta back translationHyunKyu Jeon
 
Spell checker using Natural language processing
Spell checker using Natural language processing Spell checker using Natural language processing
Spell checker using Natural language processing Sandeep Wakchaure
 
Compound Noun Polysemy and Sense Enumeration in WordNet
Compound Noun Polysemy and Sense Enumeration in WordNet Compound Noun Polysemy and Sense Enumeration in WordNet
Compound Noun Polysemy and Sense Enumeration in WordNet Biswanath Dutta
 
Fast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksFast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksSDL
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMakerSuman Debnath
 
Simple effective decipherment via combinatorial optimization
Simple effective decipherment via combinatorial optimizationSimple effective decipherment via combinatorial optimization
Simple effective decipherment via combinatorial optimizationAttaporn Ninsuwan
 
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastTextGDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastTextrudolf eremyan
 
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015Codemotion
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportAkshit Arora
 
Sdd metalanguage
Sdd metalanguageSdd metalanguage
Sdd metalanguagemary_ramsay
 

What's hot (20)

Role of unification and realization in natural language generation
Role of unification and realization in natural language generationRole of unification and realization in natural language generation
Role of unification and realization in natural language generation
 
Meta back translation
Meta back translationMeta back translation
Meta back translation
 
Acm ihi-2010-pedersen-final
Acm ihi-2010-pedersen-finalAcm ihi-2010-pedersen-final
Acm ihi-2010-pedersen-final
 
AI Lesson 11
AI Lesson 11AI Lesson 11
AI Lesson 11
 
Spell checker using Natural language processing
Spell checker using Natural language processing Spell checker using Natural language processing
Spell checker using Natural language processing
 
Compound Noun Polysemy and Sense Enumeration in WordNet
Compound Noun Polysemy and Sense Enumeration in WordNet Compound Noun Polysemy and Sense Enumeration in WordNet
Compound Noun Polysemy and Sense Enumeration in WordNet
 
P-6
P-6P-6
P-6
 
Fast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksFast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural Networks
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Transformers and BERT with SageMaker
Transformers and BERT with SageMakerTransformers and BERT with SageMaker
Transformers and BERT with SageMaker
 
AI Lesson 09
AI Lesson 09AI Lesson 09
AI Lesson 09
 
Simple effective decipherment via combinatorial optimization
Simple effective decipherment via combinatorial optimizationSimple effective decipherment via combinatorial optimization
Simple effective decipherment via combinatorial optimization
 
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
 
Icml12
Icml12Icml12
Icml12
 
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastTextGDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText
 
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015
Functional Programming You Already Know - Kevlin Henney - Codemotion Rome 2015
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
 
Sdd metalanguage
Sdd metalanguageSdd metalanguage
Sdd metalanguage
 

Similar to visualizing and measuring the geometry of bert

NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to TransformersSuman Debnath
 
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...Kyuri Kim
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkSaurav Jha
 
Lecture 02 lexical analysis
Lecture 02 lexical analysisLecture 02 lexical analysis
Lecture 02 lexical analysisIffat Anjum
 
BERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarBERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarSenthil Kumar M
 
Transformer_tutorial.pdf
Transformer_tutorial.pdfTransformer_tutorial.pdf
Transformer_tutorial.pdffikki11
 

Similar to visualizing and measuring the geometry of bert (11)

NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Introduction to Transformers
Introduction to TransformersIntroduction to Transformers
Introduction to Transformers
 
BERT (v3).pptx
BERT (v3).pptxBERT (v3).pptx
BERT (v3).pptx
 
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...
BERT- Pre-training of Deep Bidirectional Transformers for Language Understand...
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural Network
 
Lecture 02 lexical analysis
Lecture 02 lexical analysisLecture 02 lexical analysis
Lecture 02 lexical analysis
 
BERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil KumarBERT - Part 1 Learning Notes of Senthil Kumar
BERT - Part 1 Learning Notes of Senthil Kumar
 
Transformer_tutorial.pdf
Transformer_tutorial.pdfTransformer_tutorial.pdf
Transformer_tutorial.pdf
 

More from taeseon ryu

OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splattingtaeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptxtaeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdftaeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extractiontaeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learningtaeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuningtaeseon ryu
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdftaeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdftaeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithmtaeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networkstaeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarizationtaeseon ryu
 

More from taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 

Recently uploaded

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 

Recently uploaded (20)

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 

Visualizing and Measuring the Geometry of BERT

  • 1. 백지윤, 2021.04.24. Visualizing and Measuring the Geometry of BERT. NLP team: 김은희, 백지윤, 신동진, 진명훈. Presenter: 백지윤
  • 2. INDEX • Introduction to BERT and the paper intro • Geometry of syntax • Geometry of word senses • Conclusion
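  • 8. • BERT takes only the encoder of the Transformer • GPT takes only the decoder of the Transformer • Comparison (item: GPT vs. BERT): Transformer block: decoder block vs. encoder block; attention direction: unidirectional vs. bidirectional; use for sentence generation: can generate sentences vs. cannot generate them directly
  • 9. Language model: a model that assigns probabilities to word sequences in order to model the phenomenon of language => a model that learns language. Input: "I ate MASK 1. I am hungry." Output: "I ate tteokbokki." CLS "I am hungry." 0. The cross-entropy loss is computed only on the MASK tokens! Let BERT learn languages!
  • 10. Paper intro: let's look closely at how BERT learns language! • The geometry of syntax that BERT builds while performing the two tasks • The geometry of word senses that BERT builds while performing the two tasks
  • 13. About the Dataset : the Penn Treebank • Corpus annotation: marking the corpus text with special labels (tagging) to maximize the usefulness of the corpus, e.g. part-of-speech annotation, syntactic annotation, semantic annotation • In syntactic annotation, various tags express the relation between two words in a sentence • Penn Treebank (1990-92, USA): more than 3.3 million words, mostly sentences from the Wall Street Journal; publicly released and easy to access. Adjectival modifier example, "rainy weather": Head 👩: weather; Child 👧 🧒: rainy. "rainy" modifies the meaning of "weather"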
  • 14. • Filtered 30 dependency relations with more than 5,000 examples in the dataset • 30% train/test split Dataset : the Penn Treebank
  • 15. attention probe ⛳ : classify a given relation between two tokens (token i, token j) • Input : a model-wide attention vector formed by concatenating the entries a[i,j] in every attention matrix from every attention head in every layer Input
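  • 16. Head 1: attention scores of token 1 over the other tokens, taken before they are multiplied by the values and summed. Head 1: attention scores of token 2 over the other tokens, taken before they are multiplied by the values and summed. ⛳ : classify a given relation between two tokens, e.g. (token 1, token 3). Example entry 0.0: the score between token 1 and token 3 in Layer 1, Head 1 … => number of layers * number of heads = dimension of the attention vector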
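The construction of the model-wide attention vector described on slides 15 and 16 can be sketched as follows. This is a minimal illustration, not the presenters' code: the attention weights are random stand-ins rather than real BERT output, the helper name `attention_vector` is hypothetical, and the 12 layers x 12 heads are the BERT-base sizes.

```python
import numpy as np

def attention_vector(attn, i, j):
    """Concatenate the entry a[i, j] from every head in every layer.

    attn has shape (num_layers, num_heads, seq_len, seq_len).
    """
    return attn[:, :, i, j].reshape(-1)

# Stand-in attention weights (random, NOT real model output).
rng = np.random.default_rng(0)
num_layers, num_heads, seq_len = 12, 12, 5
scores = rng.normal(size=(num_layers, num_heads, seq_len, seq_len))
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

# Token 1 and token 3 from the slide are indices 0 and 2 here (0-based).
vec = attention_vector(attn, 0, 2)
print(vec.shape)  # number of layers * number of heads entries
```

Each position in `vec` corresponds to one (layer, head) pair, so the vector length equals the number of layers times the number of heads.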
  • 17. attention probe ⛳ : classify a given relation between two tokens (token i, token j) • First probe: whether a dependency relation exists between the two tokens (binary classification) • Second probe: if a dependency relation exists, which dependency relation it is (multiclass classification) • Tool used for the probing task: L2-regularized linear classifiers. Pipeline: BERT-base extracts attention vectors, per sentence, from the Penn Treebank dataset; the attention vectors feed the (L2-regularized) linear classifiers, which produce the first-probe and second-probe outputs. Train!
  • 18. attention probe results. First probe: accuracy 85.8%. Second probe: accuracy 71.9%. The attention vectors contain syntactic information! => True
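A binary probe of the kind named on slide 17 can be sketched as an L2-regularized logistic regression trained by gradient descent. Everything concrete below is an assumption for illustration: the features are synthetic Gaussian stand-ins for 144-dimensional attention vectors, the labels come from a made-up linear rule, and the function names and hyperparameters (`l2`, `lr`, `steps`) are hypothetical, not the settings used in the paper.

```python
import numpy as np

def train_l2_logistic(X, y, l2=0.01, lr=0.5, steps=1000):
    """Fit a linear classifier with an L2 penalty on the weights."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of class 1
        grad_w = X.T @ (p - y) / n + l2 * w      # log-loss gradient plus L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Synthetic stand-in data (NOT real attention vectors or treebank labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 144))
true_w = rng.normal(size=144)
y = (X @ true_w > 0).astype(float)   # 1 = "relation exists", 0 = "no relation"

w, b = train_l2_logistic(X, y)
train_acc = float(np.mean(predict(X, w, b) == y))
print(train_acc)
```

The second probe on the slide would replace the sigmoid with a softmax over the relation types; the linear form and the L2 penalty stay the same.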
  • 19. 2. Geometry of parse tree embeddings
  • 20. "To see how BERT understands syntax, shouldn't we compare BERT's embeddings with the parse tree that represents the syntax?" STEP 1. Embed the tree so that it can be compared with the BERT embeddings 🌳
  • 21. Background • Let's see how a language model takes in syntactic information! - <A Structural Probe for Finding Syntax in Word Representations> • The basic idea is to project word embeddings to a vector space where the L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree • h: the vector (m x 1) holding the m embedding values of a token in a sentence • B: the linear transformation to be optimized, a (k x m) parameter • Bh: h mapped to a new dimension (k x 1) • A = B^T B, therefore h^T A h >= 0, i.e. A is positive semidefinite • There exists an inner product on the representation space whose squared distance encodes syntax tree distance
  • 22. Let $h_i = [h_{i,1}, h_{i,2}]^T$ be word i of sentence L, let $h_j = [h_{j,1}, h_{j,2}]^T$ be word j of sentence L, and let $A = \begin{bmatrix} a & c/2 \\ c/2 & b \end{bmatrix}$. Then $(h_i - h_j)^T A (h_i - h_j) = a(h_{i,1} - h_{j,1})^2 + b(h_{i,2} - h_{j,2})^2 + c(h_{i,1} - h_{j,1})(h_{i,2} - h_{j,2})$ = 💟 B =[ ] B = [☪ ] , ( 💟 - ☪ ) ^2
  • 23. STEP 1. Embed the tree so that it can be compared with the BERT embeddings 🌳
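The identity behind slides 21 and 22 can be checked numerically: with A = B^T B, the quadratic form in A equals the squared length of B applied to the difference of two embeddings, and the 2 x 2 case expands into the a, b, c terms. The sketch below uses random stand-in vectors and hypothetical sizes m = 6 and k = 3; none of the numbers come from the paper.

```python
import numpy as np

def probe_sq_distance(B, h_i, h_j):
    """Squared distance between two embeddings after the linear map B."""
    diff = B @ (h_i - h_j)
    return float(diff @ diff)

rng = np.random.default_rng(0)
m, k = 6, 3                      # hypothetical embedding and probe dimensions
B = rng.normal(size=(k, m))      # in the probe, B is the parameter being optimized
A = B.T @ B                      # positive semidefinite by construction
h_i = rng.normal(size=m)
h_j = rng.normal(size=m)

d_B = probe_sq_distance(B, h_i, h_j)
d_A = float((h_i - h_j) @ A @ (h_i - h_j))
eigvals = np.linalg.eigvalsh(A)  # all >= 0 up to rounding

# The 2 x 2 example from the slide, with made-up values for a, b, c.
a, b, c = 2.0, 3.0, 1.0
A2 = np.array([[a, c / 2], [c / 2, b]])
d = np.array([0.5, -1.5])        # stands for h_i - h_j
quad = float(d @ A2 @ d)
expanded = a * d[0] ** 2 + b * d[1] ** 2 + c * d[0] * d[1]
print(d_B, d_A, quad, expanded)
```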
  • 24. (Figure: a tree with a Root node 🧓 and descendant nodes 👩 👧 🧒.) Tree embedding distance seems to correspond specifically to the square of Euclidean distance!
  • 25. "To see how BERT understands syntax, shouldn't we compare BERT's embeddings with the parse tree that represents the syntax?" STEP 1. Embed the tree so that it can be compared with the BERT embeddings 🌳 STEP 2. Parse tree embedding vs. BERT embedding: transformation (Hewitt and Manning's) and visualization (PCA)
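One way to see why squared Euclidean distance fits a tree is the following sketch. It assumes a particular construction that the slide does not spell out: the root sits at the origin and every edge gets its own orthogonal unit vector, so a child is its parent's vector plus one new unit vector. The 5-node tree and the helper names are hypothetical.

```python
from itertools import combinations
import numpy as np

# Hypothetical 5-node tree given as child -> parent; node 0 is the root.
PARENT = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}

def embed_tree(parent):
    """Root at the origin; each child = parent's vector + a unit vector unique to its edge."""
    non_root = [n for n in sorted(parent) if parent[n] is not None]
    axis = {n: k for k, n in enumerate(non_root)}  # one coordinate axis per edge
    emb = {}

    def place(node):
        if node not in emb:
            if parent[node] is None:
                emb[node] = np.zeros(len(non_root))
            else:
                vec = place(parent[node]).copy()
                vec[axis[node]] = 1.0
                emb[node] = vec
        return emb[node]

    for node in parent:
        place(node)
    return emb

def tree_distance(parent, a, b):
    """Number of edges on the path between a and b."""
    def path_to_root(n):
        path = [n]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(n for n in pa if n in pb)  # lowest common ancestor
    return pa.index(lca) + pb.index(lca)

emb = embed_tree(PARENT)
for a, b in combinations(PARENT, 2):
    sq = float(np.sum((emb[a] - emb[b]) ** 2))
    print(a, b, tree_distance(PARENT, a, b), sq)
```

Under this construction the two vectors differ in exactly one coordinate per edge on the path between the nodes, so the squared distance counts the hops.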
  • 28. 3. Geometry of word senses
  • 29. 1. Visualization of word senses
  • 30. Recap. STEP 1. A user enters a word. STEP 2. The system retrieves 1,000 sentences containing that word. STEP 3. It sends these sentences to BERT-base as input. STEP 4. For each one, it retrieves the context embedding for the word from a layer of the user's choosing. (Figure: the example sentence "He died soon", labeled "concatenate back to normal size" and "Context embedding".) Result: a visualization of the 1,000 sentences' context embeddings for the word the user chose ('die')
  • 31. 2. Measurement of word sense disambiguation capability
  • 32. Measurement of word sense disambiguation capability • Word sense: one of the meanings of a word, e.g. "I rolled a dice" • Training data: SemCor (33,362 senses) • Each neighbor is the centroid of a given word sense's BERT-base embeddings in the training data • Test data: data from <Word sense disambiguation: A unified evaluation framework and empirical comparison> (3,669 senses)
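The centroid-based setup on slide 32 can be sketched as a nearest-centroid classifier. The 2-dimensional vectors and the sense labels `sense_A` and `sense_B` are made-up toy values, and Euclidean distance is an assumption here; the slide does not say which distance is used.

```python
import numpy as np

def sense_centroids(embeddings, senses):
    """Average the training embeddings of each word sense."""
    senses = np.asarray(senses)
    labels = sorted(set(senses.tolist()))
    centroids = np.stack([embeddings[senses == label].mean(axis=0) for label in labels])
    return labels, centroids

def predict_sense(embedding, labels, centroids):
    """Return the sense whose centroid is closest (Euclidean distance assumed)."""
    dists = np.linalg.norm(centroids - embedding, axis=1)
    return labels[int(np.argmin(dists))]

# Toy training embeddings (NOT real BERT-base embeddings).
train = np.array([[1.0, 0.1], [0.9, -0.1], [-1.0, 0.2], [-0.8, 0.0]])
train_senses = ["sense_A", "sense_A", "sense_B", "sense_B"]

labels, centroids = sense_centroids(train, train_senses)
pred_1 = predict_sense(np.array([0.7, 0.3]), labels, centroids)
pred_2 = predict_sense(np.array([-0.5, 0.0]), labels, centroids)
print(pred_1, pred_2)
```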
  • 34. 2.1 An embedding subspace for word senses
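  • 35. BERT with probe. As with the geometry of syntax, the method from <A Structural Probe for Finding Syntax in Word Representations> is used as is. That is, we search for the B (parameter) that linearly transforms the embeddings into a space satisfying the distance rule dictated by the loss function. Loss function: minimize (the average cosine similarity between different senses - the average cosine similarity within the same sense), i.e. minimize the different-sense similarity and maximize the same-sense similarity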
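The loss on slide 35 can be written down for a fixed candidate B as below. This is a sketch under assumptions: the four 3-dimensional embeddings are toy values in which only the first coordinate carries the sense, and the function name `sense_probe_loss` is hypothetical. It only evaluates the loss; it does not perform the optimization over B.

```python
import numpy as np

def sense_probe_loss(B, embeddings, senses):
    """Mean cosine similarity across different senses minus mean within the same sense."""
    Z = embeddings @ B.T                               # apply the linear map B
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit length, so Z @ Z.T is cosine
    cos = Z @ Z.T
    senses = np.asarray(senses)
    same = senses[:, None] == senses[None, :]
    off_diag = ~np.eye(len(senses), dtype=bool)        # ignore each vector paired with itself
    same_mean = cos[same & off_diag].mean()
    diff_mean = cos[~same].mean()
    return float(diff_mean - same_mean)

# Toy embeddings (NOT real BERT embeddings): coordinate 0 encodes the sense.
emb = np.array([
    [1.0, 2.0, 0.0],
    [1.0, -2.0, 0.5],
    [-1.0, 2.0, 0.3],
    [-1.0, -2.0, -0.2],
])
senses = ["A", "A", "B", "B"]

loss_identity = sense_probe_loss(np.eye(3), emb, senses)
loss_sense_axis = sense_probe_loss(np.array([[1.0, 0.0, 0.0]]), emb, senses)
print(loss_identity, loss_sense_axis)
```

Projecting onto the sense-bearing coordinate gives a lower loss than leaving the space unchanged, which is the behavior the probe training is meant to find.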
  • 37. 3. Embedding distance and context : a concatenation experiment
  • 38. "went" (sense A) vs. "went" (sense B). Individual similarity ratio = (similarity between the keyword embeddings and their matching sense centroids) / (similarity between the keyword embeddings and their opposing sense centroids)
  • 39. Concatenate: "He thereupon went to London and spent the winter talking to men of wealth and He went prone on his stomach, the better to pursue his examination." By concatenating, the context embedding of one "went" also reflects information from the other "went". Concatenated similarity ratio = (similarity between the keyword embeddings and their matching sense centroids) / (similarity between the keyword embeddings and their opposing sense centroids)
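The ratio defined on slides 38 and 39 can be sketched as follows. Cosine similarity is an assumption here, and all vectors are made-up 2-dimensional toy values chosen only to show the computation; they are not measurements from the experiment.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_ratio(keyword_emb, matching_centroid, opposing_centroid):
    """Similarity to the matching sense centroid divided by similarity to the opposing one."""
    return cosine(keyword_emb, matching_centroid) / cosine(keyword_emb, opposing_centroid)

# Toy centroids for the two senses of "went" (hypothetical values).
centroid_a = np.array([1.0, 0.0])
centroid_b = np.array([0.0, 1.0])

# Hypothetical embeddings of a sense-A "went": alone, and inside a concatenated sentence.
went_alone = np.array([0.9, 0.2])
went_concatenated = np.array([0.7, 0.6])

individual_ratio = similarity_ratio(went_alone, centroid_a, centroid_b)
concatenated_ratio = similarity_ratio(went_concatenated, centroid_a, centroid_b)
print(individual_ratio, concatenated_ratio)
```

A ratio above 1 means the embedding is closer to its own sense centroid; in this toy setup the ratio drops toward 1 when the embedding drifts toward the opposing sense.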
  • 40. Failure mode for attention-based models : tokens indiscriminately absorb meaning from all neighbors !