Hierarchical Transformer
for Early Detection of
Alzheimer’s Disease
Renxuan Li
Advisor: Dr. Jinho Choi
Emory University, Department of Computer Science
Content
• Classification task: MCI detection
• Background and Previous Work
• Dataset: B-SHARP
• Models for Individual Tasks
• Hierarchical Transformer Model
• Attention Analysis
• Ensemble Model Analysis
Introduction and motivation
• In short, we aim to support the diagnosis of the early stage of
Alzheimer’s Disease (AD) with Natural Language Processing (NLP) methods.
• This study is essentially an MCI/normal classification task
• The early stage of AD is also known as Mild Cognitive Impairment (MCI)
• AD is neither reversible nor curable, but treatments can slow its
progression; early MCI diagnosis is therefore crucial.
• Our study focuses on predicting subjects’ brain-health condition by processing their
transcribed speech.
General structure for classification tasks
Classification layers
• All classification layers in this study are Multi-Layer Perceptrons (MLPs) with
different input sizes.
• MLP in formula: 𝑜𝑢𝑡 = 𝑖𝑛𝑝𝑢𝑡 ∗ 𝑊 + 𝑏 (see the sketch below)
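A minimal sketch of such a classification layer, assuming PyTorch (the input size and class count below are illustrative):

```python
import torch
import torch.nn as nn

class ClassificationLayer(nn.Module):
    """Single affine layer: out = input @ W + b."""
    def __init__(self, input_size: int, num_classes: int = 2):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)  # holds W and b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)  # logits for MCI vs. normal

# Usage: map a 768-dimensional feature vector to 2 class logits.
layer = ClassificationLayer(input_size=768)
logits = layer(torch.randn(1, 768))
```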
Related works: DementiaBank
• Tasks: Cookie Theft Picture Description
• 99 control subjects, 194 dementia subjects;
243 control recordings and 309 dementia
recordings; 552 in total.
• Yearly visits from subjects
• Includes grammatical features
• Transcripts are manually cleaned
Previous work: CNN
• Convolutional Neural Network (CNN)
• Captures local features well
• May use filters of different sizes
https://arxiv.org/pdf/1408.5882.pdf
Previous work: C-LSTM
• LSTM: Long Short-Term Memory
network, which carries sequential
information over time
• C-LSTM: adds an LSTM layer on top of the CNN
https://www.aclweb.org/anthology/N18-2110.pdf
B-SHARP dataset
• Brain Stress Hypertension Aging
Research Program
• Focus on MCI
• 185 normal subjects, 141 MCI
patients
• 650 transcripts in total: 385 Normal
and 265 MCI
• Transcribed with an online tool,
without manual correction
Speech Protocol
• In this study, we use only three out of the five tasks of the speech protocol for
classification.
• Question 1: Describe the events of the subject’s day
• Question 2: Describe the examination room
• Question 3: Describe the picture “Circus Procession”
Sample Transcript (excerpts for Q1, Q2, and Q3)
Language features
• We count 7 major average measures: # tokens, # sentences, # nouns, # verbs,
# conjunctions, # complex structures, # discourse markers.
• In all categories except # discourse markers, the control subjects have
higher values, indicating longer and more complex speech (a counting sketch follows).
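The slides do not show how these counts are computed; a plausible sketch with spaCy follows. The tagger, the cue-word list used for discourse markers, and the dependency labels used as a proxy for complex structures are all assumptions, not the actual pipeline of this study.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def language_features(transcript: str) -> dict:
    """Count the seven measures for one transcript (proxies assumed)."""
    doc = nlp(transcript)
    return {
        "tokens": sum(1 for t in doc if not t.is_space),
        "sentences": sum(1 for _ in doc.sents),
        "nouns": sum(t.pos_ == "NOUN" for t in doc),
        "verbs": sum(t.pos_ == "VERB" for t in doc),
        "conjunctions": sum(t.pos_ in ("CCONJ", "SCONJ") for t in doc),
        # Hypothetical proxy: clausal dependents stand in for "complex structures".
        "complex_structures": sum(t.dep_ in ("ccomp", "xcomp", "advcl", "acl") for t in doc),
        # Hypothetical proxy: a small cue-word list stands in for discourse markers.
        "discourse": sum(t.lower_ in {"well", "so", "anyway", "um", "uh"} for t in doc),
    }

print(language_features("Well, we came in this morning and they took some tests."))
```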
Challenge
• 1. Transcription errors: our transcripts are not manually corrected.
• 2. Long sequences: a control transcript has on average 578 tokens and an
MCI transcript 548 tokens, too long to fit into a transformer
model directly
Approach
• CNN Baseline
• On individual Tasks
• BERT
• RoBERTa
• ALBERT
• Pipeline Hierarchical Transformer Model
• Hierarchical Transformer Joint Learning
CNN with larger filters
• We add filters of sizes 3, 5, 7, 9, 11, 13, and 15, meaning we can capture
relations up to 15 words apart (sketch below).
• Static FastText Embedding with embedding size 300
• Channel size 600
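A sketch of this multi-filter CNN in the style of Kim (2014), assuming PyTorch; the static FastText vectors are assumed to be looked up before the module is called:

```python
import torch
import torch.nn as nn

class MultiFilterCNN(nn.Module):
    """Text CNN with one max-pooled convolution per filter size."""
    def __init__(self, embed_size=300, channels=600,
                 filter_sizes=(3, 5, 7, 9, 11, 13, 15), num_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_size, channels, k, padding=k // 2)
            for k in filter_sizes)
        self.out = nn.Linear(channels * len(filter_sizes), num_classes)

    def forward(self, emb):               # emb: (batch, seq_len, embed_size)
        x = emb.transpose(1, 2)           # Conv1d expects (batch, channels, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # class logits

# Usage with pre-looked-up (static) FastText embeddings:
logits = MultiFilterCNN()(torch.randn(4, 120, 300))
```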
Self-attention and
transformer encoder
• Transformer encoder:
• Based entirely on the self-
attention mechanism
• Multi-head attention with 12
heads (a single-head sketch follows)
https://www.kdnuggets.com/2019/03/deconstructing-bert-part-2-visualizing-inner-workings-attention.html
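For reference, the core self-attention computation the encoder stacks rely on, as a minimal single-head sketch (the real encoder adds learned projections, 12 heads, residual connections, and layer normalization):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # attention distribution
    return weights @ V, weights

# Self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(1, 10, 64)
out, attn = scaled_dot_product_attention(x, x, x)
```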
BERT for sequence classification
• Base model: 12 layers of transformer encoders, 12 attention heads each
• Pretraining tasks: Masked Language Model, Next Sentence Prediction
• Trained on BookCorpus (800M words) and English Wikipedia (2,500M words)
• Sequences over 512 tokens normally cause memory overflow (on our machine)
• Only trained on individual tasks in the speech protocol (fine-tuning sketch below)
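With the Hugging Face transformers library, fine-tuning BERT on a single question's answer might look like the sketch below; the thesis code may differ, and the example text and label convention are illustrative:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)     # MCI vs. normal

# One question's answer, truncated to the 512-token limit noted above.
enc = tokenizer("we came in this morning and took some tests ...",
                truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1])                 # hypothetical convention: 1 = MCI

out = model(**enc, labels=labels)          # out.loss for training, out.logits
out.loss.backward()
```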
BERT for sequence classification: architecture
https://arxiv.org/pdf/1810.04805.pdf
RoBERTa for sequence classification
• Same architecture as BERT
• Dynamic masked language modeling
• More careful hyperparameter choices
• Longer training and bigger batches
• Sequences over length 512 normally cause memory overflow (on our
machine)
• Only trained on individual tasks in the speech protocol
ALBERT for sequence classification
• Similar architecture to BERT and RoBERTa
• Factorized, compressed embedding size
• Shares parameters across layers
• About 9 times smaller than BERT-base
• We can fit three models onto our machine
Pipeline Hierarchical Transformer Model
• Each layer of the model is kept
independent during training.
• The [CLS] token from each
individual transformer model is
concatenated and fed into an MLP
(sketch below).
• The transformers are trained on their own.
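A sketch of the pipeline head, assuming PyTorch and 768-dimensional [CLS] vectors; the three encoders are trained separately and are not updated here:

```python
import torch
import torch.nn as nn

class PipelineHead(nn.Module):
    """Concatenate one [CLS] vector per question model, then an MLP."""
    def __init__(self, cls_size=768, num_questions=3, num_classes=2):
        super().__init__()
        self.mlp = nn.Linear(cls_size * num_questions, num_classes)

    def forward(self, cls_vectors):        # list of (batch, cls_size) tensors
        return self.mlp(torch.cat(cls_vectors, dim=1))

# cls_q1..cls_q3 would come from the three independently trained
# transformers; their parameters stay frozen in the pipeline setting.
cls_q1, cls_q2, cls_q3 = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
logits = PipelineHead()([cls_q1, cls_q2, cls_q3])
```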
Hierarchical Transformer Joint Learning
• The entire model gets trained end to end (sketch below)
• Training starts from the initially saved individual
models
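A sketch of the joint setup: the question-level encoders and the MLP head live in one module, so a single optimizer updates everything end to end. The tiny linear "encoders" below are stand-ins for the saved transformer models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointHierarchical(nn.Module):
    """Encoders + head in one module; the linear 'encoders' here are
    stand-ins for the saved question-level transformer models."""
    def __init__(self, in_size=300, cls_size=768, num_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(in_size, cls_size) for _ in range(3))
        self.head = nn.Linear(cls_size * 3, num_classes)

    def forward(self, qs):                 # qs: one input per question
        cls = [enc(q) for enc, q in zip(self.encoders, qs)]
        return self.head(torch.cat(cls, dim=1))

model = JointHierarchical()                # initialized from saved weights in practice
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # one optimizer, whole model
loss = F.cross_entropy(model([torch.randn(4, 300) for _ in range(3)]),
                       torch.randint(0, 2, (4,)))
loss.backward()
optimizer.step()
```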
Data split
• The dataset is too small for a conventional
8:1:1 train:dev:test split,
so we use five-fold cross-validation
(a subject-disjoint split sketch follows the table).
• Each subject appears in only one set
                     Set0   Set1   Set2   Set3   Set4
# MCI                  53     53     53     53     53
# Control              77     77     77     77     77
avg MoCA (MCI)       21.8   20.0   21.4   22.3   22.5
avg MoCA (Control)   26.1   26.5   25.7   26.9   26.7

MoCA: Montreal Cognitive Assessment.
Higher scores mean a better brain condition.
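A subject-disjoint five-fold split can be produced with scikit-learn's GroupKFold; the toy data below is illustrative:

```python
from sklearn.model_selection import GroupKFold

# Toy stand-ins: ten transcripts from five subjects (two each).
transcripts = [f"transcript {i}" for i in range(10)]
labels      = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]     # 0 = control, 1 = MCI
subject_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]     # groups keep subjects together

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(transcripts, labels, groups=subject_ids)):
    # No subject id ever appears in both the train and test indices.
    print(fold, sorted(test_idx))
```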
Transformer settings
• BERT: “bert-base-uncased”, 12 layers, 12 attention heads, 768 output
length, 768 embed size.
• RoBERTa: “roberta-base”, 12 layers, 12 attention heads, 768 output length,
768 embed size.
• ALBERT: “albert-base-v1”, 12 layers, 12 attention heads, 768 output length,
128 embed size.
Experiment setup
• Every model is trained three times; the average and standard deviation are recorded.
• Three metrics (computed below):
• Accuracy = (TruePositive + TrueNegative) / All
• Sensitivity = TruePositive / (TruePositive + FalseNegative)
• Specificity = TrueNegative / (TrueNegative + FalsePositive)
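The three metrics computed from raw confusion-matrix counts (a sketch; the example counts are made up):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity (MCI recall), specificity (control recall)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Made-up counts for one fold: 50 TP, 70 TN, 7 FP, 3 FN.
print(metrics(tp=50, tn=70, fp=7, fn=3))
```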
Performance on Individual tasks
• All three models perform well on Q2
• Overall, RoBERTa does best.
• BERT does better on Q3
Performance on entire transcript
• We see gradual and substantial improvement in classification
performance
• The RoBERTa ensemble does best among the single-model ensembles
• Sensitivity improves
• Specificity remains level
• Joint learning is not as good as the pipeline style
Analysis
• Attention analysis
  • RoBERTa model analysis by samples
  • Finding validation
• Ensemble analysis
  • RoBERTa ensemble analysis
  • B+R+A ensemble analysis
Attention analysis:
measures and methods
• We rank the tokens of a
transcript from the highest
Attention Change Ratio (ACR) to the lowest.
• If a token has a high ACR, it is
attended to more during
training and is more related to
how our models predict.
• Attention Change Ratio: (formula shown on the slide; a ranking sketch follows)
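The ACR formula itself is only on the slide image. Assuming ACR assigns each token a percentage score, producing the ranked lists on the following slides is a simple sort; the values below are copied from the Q1 results for illustration:

```python
# Hypothetical sketch: `acr` maps each token of a transcript to its
# Attention Change Ratio; the values are taken from the Q1 lists below.
acr = {"friend": 182.31, "arrived": 169.92, "Yeah": 142.50, "me": 60.03}

ranked = sorted(acr.items(), key=lambda kv: kv[1], reverse=True)
for rank, (token, ratio) in enumerate(ranked, start=1):
    print(f"{rank} {token} {ratio:.2f}%")
```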
On RoBERTa Q1 model
MCI
• 1 Yeah 142.50%
• 2 ina 129.14%
• 3 tests 124.17%
• 4 Okay 122.25%
• 5 And 112.76%
• 6 ud 80.30%
• 7 now 75.60%
• 8 no 70.89%
• 9 body 66.21%
• 10 me 60.03%
Normal
• 1 friend 182.31%
• 2 arrived 169.92%
• 3 From 145.92%
• 4 Tiffany 140.62%
• 5 visit 137.53%
• 6 taken 122.23%
• 7 tests 114.66%
• 8 downstairs 109.35%
• 9 snack 107.38%
• 10 how 93.17%
By POS tag category
On RoBERTa Q2 model
MCI
• 1 door 147.34%
• 2 picture 131.71%
• 3 nine 121.43%
• 4 dog 118.22%
• 5 elt 115.01%
• 6 OK 114.02%
• 7 is 108.72%
• 8 hole 88.07%
• 9 did 85.81%
• 10 wall 81.19%
Normal
• 1 holder 546.90%
• 2 ah 415.62%
• 3 screen 342.11%
• 4 stool 301.36%
• 5 sink 300.02%
• 6 counter 256.21%
• 7 area 244.19%
• 8 chart 237.63%
• 9 Tiffany 191.25%
• 10 cabinet 180.24%
By POS tag category
On RoBERTa Q3 model
MCI
• 1 ine 168.21%
• 2 anger 164.96%
• 3 stick 163.53%
• 4 glasses 137.93%
• 5 hat 118.84%
• 6 for 116.56%
• 7 view 103.13%
• 8 circle(circus) 94.01%
• 9 Ride 88.52%
• 10 outfit 86.98%
Normal
• 1 vest 184.92%
• 2 glasses 168.59%
• 3 white 167.55%
• 4 (tr)icycle 151.47%
• 5 collar 144.58%
• 6 uff 132.31%
• 7 handc 123.84%
• 8 tr(icycle) 121.67%
• 9 pocket 116.91%
• 10 cap 113.91%
By POS tag category
Q1 on all transcripts
Q2 on all transcripts
Q2 on all transcripts
Q3 on all transcripts
Ensemble analysis:
concept of vote
• Each question model casts one vote: a vote for Q1, a vote for Q2, and a vote for Q3
RoBERTa ensemble: vote distribution
Task Comparison: Measures
• Correct Vote Ratio for question i: among the 466 transcripts that the
ensemble model classifies correctly, the percentage for which
the vote of question i is correct.
• Normalized Correct Vote Ratio (NCV): for each transcript the ensemble
model predicts correctly, 1 “point” is distributed evenly among
the questions whose votes are correct; NCV is the average point a
question gets per transcript.
• % of Dominant Cases: the percentage of cases in which the vote for question i is the
only correct one (see the sketch below).
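A sketch of the three measures, assuming each transcript record holds the ensemble's final correctness plus one correctness flag per question vote:

```python
def vote_measures(records, n_questions=3):
    """records: list of (ensemble_correct, [q1_ok, q2_ok, q3_ok]) pairs."""
    cvr = [0.0] * n_questions   # Correct Vote Ratio numerators
    ncv = [0.0] * n_questions   # Normalized Correct Vote points
    dom = [0.0] * n_questions   # dominant-case counts
    correct = [votes for ok, votes in records if ok]
    for votes in correct:
        k = sum(votes)
        for i, v in enumerate(votes):
            if v:
                cvr[i] += 1
                ncv[i] += 1 / k          # split 1 point among correct votes
                if k == 1:
                    dom[i] += 1          # only question i voted correctly
    n = len(correct)
    return ([c / n for c in cvr], [p / n for p in ncv], [d / n for d in dom])

# Toy usage: three transcripts, the first misclassified by the ensemble.
print(vote_measures([(False, [1, 0, 0]), (True, [1, 1, 0]), (True, [0, 1, 0])]))
```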
RoBERTa ensemble: Task Comparison
B+R+A ensemble
• 86.6% are majority votes
• 6-correct situations: 34.50%
• 5-correct situations: 28.51%
• Rarely do all 9 inputs agree (0.21%)
• There are no situations in which fewer than 4 votes are correct.
B+R+A Ensemble: which input is more
important?
• The top three inputs form a set of transcripts
• Again, individual model accuracy (IMA) does not always match
importance in the ensemble model
Questions?
Bibliography
• [1] James T. Becker, Francois Boller, Oscar L. Lopez, Judith Saxton, and Karen L. McGonigle. The natural history of Alzheimer’s disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594, 1994.
• [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019. URL https://www.aclweb.org/anthology/N19-1423.
• [3] Serge Gauthier, Michel Panisset, Josephine Nalbantoglu, and Judes Poirier. Alzheimer’s disease: current knowledge, management and research. Canadian Medical Association Journal, 157(8):1047–1052, 1997.
• [4] George G. Glenner. Alzheimer’s Disease. Springer, 1990.
Bibliography
• [5] Sweta Karlekar, Tong Niu, and Mohit Bansal. Detecting linguistic characteristics of Alzheimer’s dementia by interpreting neural models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 701–707, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2110. URL https://www.aclweb.org/anthology/N18-2110.
• [6] Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. URL https://www.aclweb.org/anthology/D14-1181.
• [7] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018. URL https://www.aclweb.org/anthology/D18-2012.
• [8] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942, 2019.
Bibliography
• [9] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. In Proceedings of the International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SyxS0T4tvS.
• [10] Frank Rudzicz, Rosalie Wang, Momotaz Begum, and Alex Mihailidis. Speech recognition in Alzheimer’s disease with personal assistive robots. In Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies, pages 20–28, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-1904. URL https://www.aclweb.org/anthology/W14-1904.
• [11] Richard Suzman and John Beard. Global health and aging. 2011.
• [12] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.
