visualizing and measuring the geometry of bert

백지윤 2021.04.24
Visualizing and Measuring the Geometry of BERT
NLP team: 김은희, 백지윤, 신동진, 진명훈
Presenter: 백지윤

INDEX
• Introduction to BERT and to the paper

• Geometry of syntax
• Geometry of word senses

• Conclusion

• BERT uses only the encoder of the Transformer

• GPT uses only the decoder of the Transformer

[Figure: BERT and GPT architecture diagrams]

Item                  GPT                      BERT
Transformer block     Decoder block            Encoder block
Attention direction   Unidirectional           Bidirectional
Sentence generation   Can generate sentences   Cannot generate directly

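Language model: a model that assigns probabilities to word sequences in order to model the phenomenon of language
=> a model that learns language
Input: I ate [MASK]. I am hungry.
Output: I ate tteokbokki. I am hungry.
Output for [CLS]: 0
The cross-entropy loss is computed only on the [MASK] tokens!
Let BERT learn language!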
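To make the point about the loss concrete, here is a minimal sketch of a cross-entropy that is evaluated only at masked positions. The function name `masked_lm_loss`, the token list and every probability below are made up for illustration; they are not BERT outputs.

```python
import math

def masked_lm_loss(predicted_probs, original_tokens, masked_positions):
    """Mean cross-entropy over the masked positions only.

    predicted_probs[i] maps candidate words to the model's probability at position i.
    Positions that were not masked never enter the sum.
    """
    losses = [-math.log(predicted_probs[i][original_tokens[i]])
              for i in masked_positions]
    return sum(losses) / len(losses)

# Toy version of "I ate [MASK]": only position 2 was masked (hypothetical numbers).
original_tokens = ["I", "ate", "tteokbokki"]
predicted_probs = [
    {"I": 0.9},                        # not masked, ignored
    {"ate": 0.1},                      # not masked, ignored even though the guess is poor
    {"tteokbokki": 0.5, "rice": 0.5},  # masked, the only scored position
]
loss = masked_lm_loss(predicted_probs, original_tokens, masked_positions=[2])
print(round(loss, 4))  # 0.6931, i.e. -log(0.5)
```

Changing the probabilities at the unmasked positions leaves the loss untouched, which is the behaviour the slide describes.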

Let's take a close look at how BERT learns language!
• The geometry of syntax that BERT builds while performing its two tasks
• The geometry of word senses that BERT builds while performing its two tasks
Paper intro

About the Dataset : the Penn Treebank
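• Corpus annotation: adding special markup (tagging) to the text of a corpus to maximize its usefulness, e.g. part-of-speech annotation, syntactic annotation, semantic annotation

• In syntactic annotation, a variety of tags express the relation between two words in a sentence

• Penn Treebank (1990-92, USA): more than 3.3 million words, mostly sentences from the Wall Street Journal. It is publicly released and easy to access.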
Adjectival modifier
Head: weather 👩 Child: rainy 👧 🧒
“rainy” modifies “weather” and narrows its meaning

• Filtered to 30 dependency relations with more than 5,000 examples in the dataset

• 30% train/test split

attention probe
⛳ Goal: classify a given relation between two tokens (token i, token j)
• Input: a model-wide attention vector formed by concatenating the entries a[i,j] in every attention matrix from every attention head in every layer

Head 1: the attention scores of token 1 over the other tokens, before they are multiplied by the values and summed

Head 1: the attention scores of token 2 over the other tokens, before they are multiplied by the values and summed
⛳ Goal: classify a given relation between two tokens, e.g. (token 1, token 3)
0.0: the score between token 1 and token 3 in layer 1, head 1
…
=> attention vector dimension = number of layers * number of heads
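The construction of the model-wide attention vector can be sketched as follows. The nested-list layout `attentions[layer][head][i][j]` and the uniform toy weights are assumptions made for illustration; a real run would read the attention matrices out of BERT.

```python
def attention_vector(attentions, i, j):
    """Concatenate the entry a[i][j] from every head of every layer.

    attentions[layer][head] is a (tokens x tokens) matrix of attention weights.
    """
    return [head[i][j] for layer in attentions for head in layer]

# Toy model: 2 layers x 3 heads over 4 tokens, uniform attention in every head.
num_layers, num_heads, num_tokens = 2, 3, 4
uniform = [[1.0 / num_tokens] * num_tokens for _ in range(num_tokens)]
attentions = [[uniform for _ in range(num_heads)] for _ in range(num_layers)]

# Token 1 and token 3 from the slide, written with zero-based indices.
vec = attention_vector(attentions, 0, 2)
print(len(vec))  # 6 = num_layers * num_heads
```

For BERT-base, which has 12 layers with 12 heads each, the same construction yields a 144-dimensional vector per token pair.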

attention probe
• First probe: whether a dependency relation exists between the two tokens (binary classification)
• Second probe: if a dependency relation exists, which dependency relation it is (multiclass classification)
• Tool used for the probing tasks: L2-regularized linear classifiers
[Figure: pipeline. The Penn Treebank dataset is fed to BERT-base per sentence, BERT-base extracts the attention vectors, and these train the (L2-regularized) linear classifiers that output the first-probe and second-probe predictions]
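As an illustration of what an L2-regularized linear probe does, here is a small sketch of the binary (first) probe written as logistic regression with an L2 penalty, trained by plain gradient descent. The two-entry "attention vectors", the hyperparameters and the function names are hypothetical; the slides only state that L2-regularized linear classifiers were used, not how they were fitted.

```python
import math

def train_l2_logistic(X, y, l2=0.01, lr=0.5, epochs=200):
    """Fit a binary logistic regression with an L2 penalty by batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wk for wk in w]      # gradient of the penalty (l2 / 2) * ||w||^2
        grad_b = 0.0
        for x, label in zip(X, y):
            z = sum(wk * xk for wk, xk in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - label   # sigmoid(z) - label
            for k in range(dim):
                grad_w[k] += err * x[k] / n
            grad_b += err / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 (relation exists) when the linear score is positive, else 0."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b > 0 else 0

# Hypothetical attention vectors: entry 0 comes from a head that fires on
# dependent pairs, entry 1 from a head that carries no information.
X = [[1.0, 0.5], [1.0, 0.5], [0.0, 0.5], [0.0, 0.5]]
y = [1, 1, 0, 0]
w, b = train_l2_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

The second probe would swap the sigmoid for a softmax over the 30 relation labels; a library classifier would normally replace this hand-written loop.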

attention probe results
First probe: accuracy 85.8%
Second probe: accuracy 71.9%
Do attention vectors contain syntactic information? => True

2. Geometry of parse tree embeddings

“To see how BERT understands syntax, shouldn't we compare BERT's embeddings with the parse trees that represent syntax?”
STEP 1. Embed the tree so that it can be compared with the BERT embeddings 🌳

Background
The basic idea is to project word embeddings to a vector space where the squared L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree
• Let's see how a language model takes in syntactic information!
- <A Structural Probe for Finding Syntax in Word Representations>

• h: the vector of the m embedding values of a token in a sentence (m x 1)

• A = B^T B, so h^T A h = ||Bh||^2 >= 0; A is positive semidefinite

• B: the linear transformation to be optimized, a (k x m) parameter

• Bh: h mapped into the new (k x 1) space
There exists an inner product on the representation space whose squared distance encodes syntax tree distance

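Worked example for the i-th word and the j-th word of the l-th sentence, with m = 2:

$$h_i^l = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad h_j^l = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad A = \begin{bmatrix} a & c/2 \\ c/2 & b \end{bmatrix}$$

$$(h_i^l - h_j^l)^T A \, (h_i^l - h_j^l) = a (x_1 - y_1)^2 + b (x_2 - y_2)^2 + c (x_1 - y_1)(x_2 - y_2)$$

With k = 1, write $B h_i^l = u$ and $B h_j^l = v$. Since $A = B^T B$,

$$(h_i^l - h_j^l)^T A \, (h_i^l - h_j^l) = \| B (h_i^l - h_j^l) \|^2 = (u - v)^2$$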
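The identity behind A = B^T B can be checked numerically. The sketch below uses a made-up 1 x 2 matrix B and two made-up 2-dimensional embeddings; the helper names are illustrative only.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gram(B):
    """Return A = B^T B."""
    cols = range(len(B[0]))
    return [[sum(row[r] * row[c] for row in B) for c in cols] for r in cols]

def quadratic_form(A, h_i, h_j):
    """(h_i - h_j)^T A (h_i - h_j)."""
    d = [p - q for p, q in zip(h_i, h_j)]
    return sum(dk * ak for dk, ak in zip(d, matvec(A, d)))

def squared_probe_distance(B, h_i, h_j):
    """||B (h_i - h_j)||^2, the squared L2 distance after the linear map B."""
    d = [p - q for p, q in zip(h_i, h_j)]
    return sum(x * x for x in matvec(B, d))

B = [[1.0, 2.0]]                      # k = 1, m = 2 (hypothetical values)
h_i, h_j = [3.0, 1.0], [1.0, 0.0]     # hypothetical token embeddings
A = gram(B)                           # [[1.0, 2.0], [2.0, 4.0]]

print(quadratic_form(A, h_i, h_j))          # 16.0
print(squared_probe_distance(B, h_i, h_j))  # 16.0
```

Here a = 1, b = 4 and c = 4, so the expanded form gives 1·2² + 4·1² + 4·2·1 = 16, matching both printed values.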

[Figure: example tree with the nodes Root 🧓, 👩, 👧 and 🧒]
Tree embedding distance seems to correspond specifically to the square of the Euclidean distance!
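One way to see why the squared Euclidean distance is the natural match for tree distance is to give every edge of the tree its own orthogonal unit axis. The sketch below does this for a hypothetical four-node tree; the node names and the tree shape are assumptions chosen to echo the figure, and the function is an illustration rather than the presenters' code.

```python
def pythagorean_embedding(parent):
    """Embed tree nodes so that squared Euclidean distance equals tree distance.

    parent[v] is the parent of node v, or None for the root.
    Each non-root node owns one axis, standing for the edge to its parent.
    """
    axis = {v: k for k, v in enumerate(n for n in parent if parent[n] is not None)}
    dim = len(axis)
    emb = {}

    def embed(v):
        if v not in emb:
            if parent[v] is None:
                emb[v] = [0.0] * dim            # the root sits at the origin
            else:
                vec = list(embed(parent[v]))    # start from the parent's position
                vec[axis[v]] += 1.0             # take one unit step along this edge's axis
                emb[v] = vec
        return emb[v]

    for v in parent:
        embed(v)
    return emb

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical tree: root -> mother -> {daughter, son}
parent = {"root": None, "mother": "root", "daughter": "mother", "son": "mother"}
emb = pythagorean_embedding(parent)

print(squared_distance(emb["daughter"], emb["son"]))   # 2.0 (two hops via mother)
print(squared_distance(emb["root"], emb["daughter"]))  # 2.0 (two hops)
```

Because the path between two nodes uses each edge at most once and the edge axes are orthogonal, the squared distance simply counts the edges on the path.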

STEP 2. Compare the parse tree embedding with the BERT embedding after the transformation (Hewitt and Manning's)
[Figure: PCA visualization of the parse tree embedding vs. the transformed BERT embedding]

1. Visualization of word senses

Reminder
STEP 1. A user enters a word
STEP 2. The system retrieves 1,000 sentences containing that word
STEP 3. It sends these sentences to BERT-base as input
STEP 4. For each one, it retrieves the context embedding for the word from a layer of the user's choosing
[Figure: for the example sentence “He died soon”, the outputs are concatenated back to normal size to form the context embedding]
Visualization of the context embeddings of a specific word the user chose (“die”) across 1,000 sentences

2. Measurement of word sense disambiguation capability
Training data: SemCor (33,362 senses)
• Word sense: one of the meanings of a word
“I rolled a dice”
Nearest-neighbor classification: each neighbor is the centroid of a given word sense's BERT-base embeddings in the training data
Test data: data from <Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison> (3,669 senses)
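The centroid-based classification can be sketched in a few lines. The sense labels, the 2-dimensional embeddings and the use of cosine similarity as the nearness measure are all assumptions for illustration; the slides do not state which metric was used.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fit_centroids(examples):
    """Average the embeddings of each sense; examples is a list of (sense, embedding)."""
    grouped = {}
    for sense, emb in examples:
        grouped.setdefault(sense, []).append(emb)
    return {sense: [sum(col) / len(embs) for col in zip(*embs)]
            for sense, embs in grouped.items()}

def predict_sense(centroids, embedding):
    """Return the sense whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda sense: cosine(centroids[sense], embedding))

# Hypothetical training embeddings for two senses of "die".
train = [
    ("die_stop_living", [1.0, 0.1]),
    ("die_stop_living", [0.9, 0.3]),
    ("die_game_cube", [0.1, 1.0]),
    ("die_game_cube", [0.2, 0.8]),
]
centroids = fit_centroids(train)

print(predict_sense(centroids, [0.8, 0.2]))  # die_stop_living
print(predict_sense(centroids, [0.0, 1.0]))  # die_game_cube
```

In the real setting each embedding would be the BERT-base context embedding of the target word, and there would be one centroid per sense seen in SemCor.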

2.1 An embedding subspace for word senses

BERT with probe
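As in the geometry of syntax part, the method from <A Structural Probe for Finding Syntax in Word Representations> is used as is.

That is, we search for B (the parameter), a linear transformation into a space in which the embeddings satisfy the distance rule specified by the loss function.
Loss function:

min (the average cosine similarity between different senses - the average cosine similarity within the same sense)
i.e. minimize the similarity between different senses and maximize the similarity within the same sense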
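A small sketch of such a loss, under the assumption that it is the average cosine similarity between different-sense pairs minus the average over same-sense pairs, so that minimizing it pulls same-sense embeddings together and pushes different senses apart. The embeddings, sense labels and the two candidate matrices are made up for illustration.

```python
import math
from itertools import combinations

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sense_probe_loss(B, embeddings, senses):
    """Mean cosine similarity of different-sense pairs minus that of same-sense pairs,
    measured after projecting every embedding with B."""
    projected = [matvec(B, h) for h in embeddings]
    same, different = [], []
    for i, j in combinations(range(len(projected)), 2):
        bucket = same if senses[i] == senses[j] else different
        bucket.append(cosine(projected[i], projected[j]))
    return sum(different) / len(different) - sum(same) / len(same)

# Hypothetical embeddings: the third coordinate is noise unrelated to the sense.
embeddings = [[1.0, 0.0, 2.0], [1.0, 0.0, -2.0], [0.0, 1.0, 2.0], [0.0, 1.0, -2.0]]
senses = ["A", "A", "B", "B"]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
drop_noise = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # a candidate B that ignores the noise axis

loss_identity = sense_probe_loss(identity, embeddings, senses)
loss_probe = sense_probe_loss(drop_noise, embeddings, senses)
print(loss_probe < loss_identity)  # True
```

In this toy case the projection that discards the noisy coordinate reaches the lowest possible value of -1, which is the kind of subspace the trained B is meant to find.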

Reference: https://www.youtube.com/watch?v=VAzpZh01g58&t=799s

3. Embedding distance and context: a concatenation experiment

[Figure: “went” in sense A and “went” in sense B]
Individual similarity ratio = (similarity between the keyword embeddings and their matching sense centroids) / (similarity between the keyword embeddings and their opposing sense centroids)

Concatenate
He thereupon went to London and spent the winter talking to men of wealth

and

He went prone on his stomach, the better to pursue his examination.
Concatenating the two sentences makes the context embedding of each “went” also reflect information from the other “went”
Concatenated similarity ratio = (similarity between the keyword embeddings and their matching sense centroids) / (similarity between the keyword embeddings and their opposing sense centroids)
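The two ratios can be illustrated with a short sketch. Cosine similarity is assumed as the similarity measure, and the centroids and the two embeddings of “went” are invented numbers chosen only to show the direction of the effect; they are not measurements from the experiment.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_ratio(keyword_embedding, matching_centroid, opposing_centroid):
    """Similarity to the matching sense centroid divided by similarity to the opposing one."""
    return (cosine(keyword_embedding, matching_centroid)
            / cosine(keyword_embedding, opposing_centroid))

# Hypothetical centroids for sense A and sense B of "went".
centroid_a, centroid_b = [1.0, 0.0], [0.0, 1.0]

# Hypothetical embeddings of a sense-A "went": alone, and after the sentence
# was concatenated with a sense-B sentence.
went_individual = [1.0, 0.2]
went_concatenated = [0.8, 0.5]

individual_ratio = similarity_ratio(went_individual, centroid_a, centroid_b)
concatenated_ratio = similarity_ratio(went_concatenated, centroid_a, centroid_b)
print(round(individual_ratio, 2), round(concatenated_ratio, 2))  # 5.0 1.6
```

A ratio that falls after concatenation means the keyword embedding has drifted toward the opposing sense, which is how the experiment exposes context leaking across sentences.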

Failure mode for attention-based models : tokens indiscriminately absorb meaning from all neighbors !
