Incremental Sense Weight Training for In-depth
Interpretation of Contextualized Word Embeddings
Xinyi Jiang, Zhengzhe Yang, Jinho D. Choi
Computer Science @ Emory University
● Contextualized word embeddings
❖ Considerable performance boosts are obtained by simply
substituting them for distributional word embeddings.
❖ Different representations are generated for the same word type
under different topical senses.
❖ Contextualized embedding dimensionality has grown to over
1,000 without a real understanding of what the dimensions
mean.
● Word Embedding Interpretability
Introduction
● Datasets and Word Embeddings
❖ Trained on SemCor and evaluated on SensEval/SemEval (2001,
2004, 2007, 2013, 2015).
❖ ELMo: 256, 512 and 1,024 (all with 2-layer bi-LSTM).
❖ BERT: 768 (12 output layers) and 1,024 (24 layers).
❖ Flair: 2,048 and 4,096.
● KNN Methods
❖ Sense-based KNN.
❖ Word-based KNN.
● Baseline results (Table 1):
WF: word-based KNN with
fallback using WordNet. W: word-based.
SF: sense-based with fallback. S: sense-based.
● Results for the original embeddings and for embeddings with 5% of
dimensions masked (Table 2). In evaluation, each target word tries the
masks of all of its observed senses and selects the mask that produces
the smallest distance d, where d is the sum of the distances from the
masked word to its k nearest neighbors and k = min(# of occurrences, 5).
Table 2 shows that word-based KNN with fallback works better in
general; a sketch of this masked KNN sense prediction follows.
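As a concrete illustration, here is a minimal sketch of the masked k-NN sense prediction described above, assuming NumPy arrays; the names (predict_sense, sense_masks, sense_neighbors) are hypothetical and the WordNet fallback logic is omitted.

```python
import numpy as np

def predict_sense(target_vec, sense_masks, sense_neighbors, k=5):
    """Pick the sense whose mask yields the smallest summed distance d
    from the masked target to its k nearest neighbors."""
    best_sense, best_d = None, float("inf")
    for sense, mask in sense_masks.items():
        masked = target_vec * mask                    # zero out this sense's masked dimensions
        neighbors = sense_neighbors[sense] * mask     # training occurrences, masked the same way
        dists = np.linalg.norm(neighbors - masked, axis=1)
        k_eff = min(len(dists), k)                    # k = min(# of occurrences, 5)
        d = np.sort(dists)[:k_eff].sum()              # d = sum of distances to k nearest neighbors
        if d < best_d:
            best_sense, best_d = sense, d
    return best_sense
```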
● Results with individual embedding output layers, with 5% of
dimensions masked (Figures 2-5).
Half of the embeddings improve when masked at the 5% threshold,
especially the last 10 layer outputs. Surprisingly, the last layer's
output score is boosted by 3%.
❖ Worse original scores coincide with a more considerable
performance drop after masking.
❖ Embeddings from output layers closer to the input layer
contain fewer insignificant dimensions.
Experiments
● Sense Weight Training
Given a large embedding dimensionality, the hypothesis is that
not every embedding dimension plays a role in representing a
sense.
❖ The objective function: maximize the average pairwise cosine
similarity within every sense group.
❖ The weight matrix: a weight vector of the same size as the word
embedding is initialized for each sense; each of its entries
represents the importance of the corresponding dimension to
that sense.
❖ Exploration-exploitation: in the first phase, the dimension index
N is generated at random. After a certain number of epochs, the
exploration-exploitation policy is applied, where for exploitation,
the higher a dimension's weight, the less likely that dimension
index is generated.
❖ Furthermore, L1 regularization is applied for feature selection,
and AdaGrad is used to encourage convergence. A sketch of the
objective and the update loop follows this list.
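Under one reading, the regularized objective for a sense group S = {e_1, ..., e_n} with weight vector w is to maximize the average pairwise cosine similarity of the re-weighted embeddings minus an L1 penalty:

maximize_w (1 / (n(n-1))) Σ_{i≠j} cos(w ⊙ e_i, w ⊙ e_j) − λ‖w‖₁

Below is a minimal sketch of one possible incremental update loop under this reading; the sampling schedule, step size, and AdaGrad-style damping are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def train_sense_weights(E, epochs=1000, explore_epochs=200, lr=0.1, lam=1e-3):
    """Incrementally learn per-dimension weights w for one sense group E
    (shape: n x dim): sample a dimension, try shrinking its weight, and
    keep the change if the regularized objective improves."""
    n, dim = E.shape
    w = np.ones(dim)
    accum = np.zeros(dim)                        # AdaGrad-style accumulator

    def avg_pairwise_cos(w):
        X = E * w                                # re-weighted embeddings
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        sims = X @ X.T                           # pairwise cosine similarities
        return (sims.sum() - n) / (n * (n - 1))  # mean over distinct pairs

    for epoch in range(epochs):
        if epoch < explore_epochs:               # exploration: uniform sampling
            i = np.random.randint(dim)
        else:                                    # exploitation: high-weight dims sampled less often
            p = (w.max() - w) + 1e-6
            i = np.random.choice(dim, p=p / p.sum())
        step = lr / np.sqrt(accum[i] + 1.0)      # damp steps on frequently updated dims
        w_try = w.copy()
        w_try[i] = max(w_try[i] - step, 0.0)
        # improvement in the objective; the L1 term rewards shrinking the weight
        delta = avg_pairwise_cos(w_try) - avg_pairwise_cos(w) + lam * (w[i] - w_try[i])
        if delta > 0:
            accum[i] += delta ** 2
            w = w_try
    return w
```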
Figures 1a and 1b display a smaller within-group distance and a
greater separation of sense groups when the dimensions with weights
lower than 0.5 are masked to 0; a sketch of this thresholded masking
and projection follows.
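For the visualization in Figure 1, a minimal sketch of the thresholded masking followed by the LDA projection to two dimensions, using scikit-learn (the function name project_masked is hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_masked(E, labels, w, threshold=0.5):
    """Zero out dimensions whose learned weight is below the threshold,
    then project the embeddings to 2-D with LDA for plotting."""
    mask = (w >= threshold).astype(E.dtype)      # weights < 0.5 are masked to 0
    lda = LinearDiscriminantAnalysis(n_components=2)
    return lda.fit_transform(E * mask, labels)   # one 2-D point per embedding
```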
Sense Weight Training (SWT)
● Analysis
❖ ρ is the Spearman rank-order correlation coefficient between the
pairwise cosine similarities of sense vectors and the pairwise
path similarity scores between senses provided by WordNet
(a sketch of this test follows this list).
❖ The average cosine similarities within sense groups all
increase after dimensions are masked out for all models.
❖ The dimension weights are learned by our objective
function.
❖ The correlation test shows no significant performance
decrease (some scores even increase).
❖ The masked dimensions do not contribute to the sense
group relations.
❖ ELMo and Flair achieve better correlation scores.
❖ The number of dimensions that can be discarded increases
with the distance of the output layer from the input layer. This
result is consistent with ELMo's claim that embeddings from
output layers closer to the input layer are semantically
richer.
❖ Another pattern is that verb sense groups tend to have
fewer dimensions masked out, because verb sense groups
contain more possible token forms belonging to the same
sense group.
❖ We also attempted to mask out the embedding dimensions with
higher weights. The 100 most similar masked embeddings fail
to reveal any patterns.
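As a sketch of the correlation test, assuming sense vectors keyed by WordNet synset names (e.g. "ask.v.01"); the helper name correlation_test is illustrative:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr
from nltk.corpus import wordnet as wn

def correlation_test(sense_vectors):
    """Spearman's rho between pairwise cosine similarities of sense
    vectors and WordNet path similarities of the corresponding synsets."""
    cos_sims, path_sims = [], []
    for (a, va), (b, vb) in combinations(sense_vectors.items(), 2):
        path = wn.synset(a).path_similarity(wn.synset(b))
        if path is None:                         # no connecting path in WordNet
            continue
        cos_sims.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
        path_sims.append(path)
    rho, p_value = spearmanr(cos_sims, path_sims)
    return rho, p_value
```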
Experiments
● We propose an algorithm for learning the importance of embedding
dimensions in sense groups. After training the weights for the
dimensions, those with less importance are masked out and tested on
a word sense disambiguation task and two other evaluations.
● Limitations:
❖ For the evaluation, the path similarity provided by WordNet
may not be the best fit for human judgements.
❖ Limited by dataset size.
● Future Work:
❖ Extend the algorithm to sense vector extraction.
❖ The algorithm can theoretically be applied to any other
grouped embeddings.
Conclusion
Table 1: Baseline results
Table 2: Results for the original and masked embeddings.
Figure 2: BERT-Large embeddings with 24 hidden layers.
Figure 3: ELMo embeddings with 3 models and 3 layers
each. 1: 1024. 2: 512. 3: 256.
Figure 4: BERT-Base embeddings with 12 layers.
Figure 5: Flair results
Table 3: Correlation coefficient test results for original and masked word
embeddings with N dimensions masked.
Table 4: The embedding models with specific word embedding sense groups
(Sense) and numbers of dimensions masked out (N_masked): “ask.v.01” is a
verb sense, meaning “inquire about”; “three.s.01” is an adjective sense,
meaning “being one more than two”; “man.n.01” is a noun sense, meaning “an
adult person who is male (as opposed to a woman)”.
Figure 1: Graphs of 20 selected sense groups with 100 embeddings each
for ELMo with a dimension size of 512 (third output layer). The projection
of dimensions from 512 to 2 is done by Linear Discriminant Analysis.
(a) original. (b) masked (threshold of 0.5).
