CS 886 Deep Learning for Biotechnology
Ming Li
Jan. 21, 2022
CONTENT
01. Word2Vec
02. Attention / Transformers
03. Pretraining: GPT / BERT
04. Deep learning applications in proteomics
05. Student presentations begin
03 LECTURE THREE
Pretraining: GPT-2 and BERT
Avoiding the information bottleneck
03 GPT-2, BERT
Last time we introduced the transformer
03 GPT-2, BERT
Transformers, GPT-2, and BERT
03
1. A transformer uses an Encoder stack to model the input and a Decoder stack
to model the output (using information from the encoder side).
2. But if there is no separate input and we only want to model the "next word",
we can drop the Encoder side of the transformer and emit the next word
one token at a time. This gives us GPT.
3. If we are only interested in training a language model of the input for
some other downstream task, then we do not need the Decoder of the transformer;
that gives us BERT.
GPT-2, BERT
GPT-2, BERT
03
(Figure: GPT-2 model sizes, 117M, 345M, 762M, and 1542M parameters.)
GPT released June 2018
GPT-2 released Nov. 2019 with 1.5B parameters
GPT-3: 175B parameters trained on 45TB texts
GPT-2 in action
GPT-2, BERT
03
(Figure: GPT-2 generating output one token at a time, e.g. "... not injure a human being".)
Byte Pair Encoding (BPE)
GPT-2, BERT
03
Word-level embeddings are sometimes too high level, while pure character-level
embeddings are too low level. For example, if we have learned
old older oldest
we might also wish the computer to infer
smart smarter smartest
But at the whole-word level this is not so direct. The idea, therefore, is to
break words up into pieces like er and est, and to embed frequent fragments of
words.
GPT adopts this BPE scheme.
Byte Pair Encoding (BPE)
GPT-2, BERT
03
GPT uses the BPE scheme. The subwords are computed as follows:
1. Split each word into a sequence of characters (appending an end-of-word symbol </w>).
2. Merge the most frequent adjacent pair of symbols.
3. Repeat step 2 until reaching the pre-defined maximum number of sub-words
or iterations.
Example (5, 2, 6, 3 are the numbers of occurrences; a code sketch follows the example)
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w e s t </w>’: 6, ‘w i d e s t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w es t </w>’: 6, ‘w i d es t </w>’: 3 }
{‘l o w </w>’: 5, ‘l o w e r </w>’: 2, ‘n e w est </w>’: 6, ‘w i d est </w>’: 3 } (est freq. 9)
{‘lo w </w>’: 5, ‘lo w e r </w>’: 2, ‘n e w est</w>’: 6, ‘w i d est</w>’: 3 } (lo freq 7)
…..
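The merge loop above can be prototyped in a few lines of Python. This is a minimal sketch of the BPE training step, not OpenAI's actual tokenizer code; the toy vocabulary and merge count mirror the example above.

```python
import re
from collections import Counter

# Toy corpus: word (pre-split into characters + </w>) -> frequency
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

def get_pair_counts(vocab):
    """Count each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

num_merges = 4  # pre-defined number of merge iterations
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair, e.g. ('e','s'), then ('es','t')
    vocab = merge_pair(best, vocab)
    print(best, vocab)
```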
Masked Self-Attention (to compute more efficiently)
03 GPT-2, BERT
Masked Self-Attention
03 GPT-2, BERT
Note: encoder-decoder attention block is gone
Masked Self-Attention Calculation
03 GPT-2, BERT
Note: encoder-decoder attention block is gone
Re-use previous computation results: at each step we only need the q, k, v
vectors of the new output token; the keys and values of earlier tokens do not
need to be recomputed. The additional computation per step is therefore
linear rather than quadratic.
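A minimal sketch of masked (causal) self-attention and of the incremental step that reuses cached keys and values. Names such as d_k and kv_cache are illustrative, not GPT-2's actual variable names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 64  # per-head dimension

def masked_self_attention(Q, K, V):
    """Full causal attention: position i may only attend to positions <= i."""
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T)
    mask = np.triu(np.ones_like(scores), k=1) == 1   # True above the diagonal
    scores[mask] = -1e9                              # block attention to future tokens
    return softmax(scores) @ V

def attend_one_step(q_new, k_new, v_new, kv_cache):
    """Incremental decoding: compute q, k, v only for the new token; reuse the
    cached keys/values of earlier tokens, so each step is O(T), not O(T^2)."""
    kv_cache['K'].append(k_new)
    kv_cache['V'].append(v_new)
    K = np.stack(kv_cache['K'])          # (T, d_k)
    V = np.stack(kv_cache['V'])
    scores = q_new @ K.T / np.sqrt(d_k)  # (T,); no mask needed: the cache holds only the past
    return softmax(scores) @ V
```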
The GPT-2 fully connected (feed-forward) network has two layers (example shown
for GPT-2 small)
03 GPT-2, BERT
768 is the model (hidden) dimension of the small model
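As a sketch, the position-wise feed-forward block of GPT-2 small could be written as follows (hidden size 768, inner size 4 x 768 = 3072, GELU activation); the class and attribute names are illustrative.

```python
import torch.nn as nn

class GPT2FeedForward(nn.Module):
    """Two-layer fully connected network applied at every position (GPT-2 small sizes)."""
    def __init__(self, d_model=768, d_inner=3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_inner)   # 768 -> 3072
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_inner, d_model)   # 3072 -> 768

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```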
GPT-2 has a top-k parameter: at each step the next word is sampled from the k
words with the highest softmax probability
03 GPT-2, BERT
With this top-k parameter set to k = 1 (greedy decoding), we would get output like:
03 GPT-2, BERT
The first time I saw the new version of the game, I was
so excited. I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
the game, I was so excited to see the new version of
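A minimal sketch of top-k sampling over the model's output logits; with k = 1 it reduces to the greedy decoding that produces the repetitive text above. The function name and defaults are illustrative.

```python
import numpy as np

def top_k_sample(logits, k=40, rng=None):
    """Sample the next token id from the k highest-probability words."""
    if rng is None:
        rng = np.random.default_rng()
    top_ids = np.argsort(logits)[-k:]                   # indices of the k largest logits
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()                                # renormalized softmax over the top k
    return rng.choice(top_ids, p=probs)

# k = 1 picks the argmax at every step: greedy decoding, giving repetitive output as above.
```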
GPT Training
03 GPT-2, BERT
GPT-2 uses an unsupervised learning approach to train the language model.
There is no task-specific training for GPT-2 and no separation of
pre-training and fine-tuning as in BERT.
A story generated by GPT-2
03 GPT-2, BERT
“The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-
horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally
solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions,
were exploring the Andes Mountains when they found a small valley, with no other animals or
humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded
by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. `By the time we reached the top of
one peak, the water looked blue, with some crystals on top,’ said Pérez.
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen
from the air without having to move too much to see them – they were so close they could
touch their horns."
Transformer / GPT prediction
03 GPT-2, BERT
GPT-2 Application: Translation
03 GPT-2, BERT
GPT-2 Application: Summarization
03 GPT-2, BERT
Using Wikipedia data
03 GPT-2, BERT
BERT (Bidirectional Encoder Representations from Transformers)
03 GPT-2, BERT
Model input length: 512 tokens
Input and output vector sizes
03 GPT-2, BERT
BERT pretraining
03 GPT-2, BERT
ULM-FiT (2018): pre-training ideas, transfer learning in NLP.
ELMo: bidirectional training (LSTM-based).
Transformer (as a left-to-right LM): uses context from the left, but still
misses context from the right.
GPT: uses the Transformer decoder half.
BERT: switches from the Decoder to the Encoder so that it can use context
from both sides during training, and introduces a corresponding training
task: the masked language model
BERT Pretraining Task 1: masked words
03 GPT-2, BERT
15% of the input tokens are selected for prediction. Out of this 15%:
80% are replaced with [MASK],
10% are replaced with random words,
10% are kept as the original words
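A minimal sketch of this 15% / 80-10-10 corruption rule, using a tiny made-up vocabulary; this is illustrative, not the original BERT preprocessing code.

```python
import random

MASK = '[MASK]'
vocab = ['cat', 'dog', 'sat', 'mat', 'the', 'on']   # toy vocabulary for random replacements

def mask_tokens(tokens, mask_prob=0.15):
    """Return corrupted tokens and prediction targets (None = position not predicted)."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:          # select ~15% of positions for prediction
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                   # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))   # 10%: replace with a random word
            else:
                corrupted.append(tok)                    # 10%: keep the original word
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets
```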
BERT Pretraining Task 2: two sentences
03 GPT-2, BERT
BERT Pretraining Task 2: two sentences
03 GPT-2, BERT
50% of the pairs use the true second sentence;
50% use a random second sentence
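A sketch of how the 50/50 sentence pairs could be constructed for next-sentence prediction; corpus here is a hypothetical list of documents, each a list of sentences.

```python
import random

def make_nsp_pair(corpus):
    """corpus: list of documents, each a list of sentences. Returns (A, B, is_next)."""
    doc = random.choice([d for d in corpus if len(d) >= 2])
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        return sentence_a, doc[i + 1], True            # 50%: the true next sentence
    sentence_b = random.choice(random.choice(corpus))  # 50%: a random sentence from the corpus
    return sentence_a, sentence_b, False
```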
Fine-tuning BERT for other specific tasks
03 GPT-2, BERT
SST (Stanford Sentiment Treebank): 215k phrases with fine-grained sentiment
labels in the parse trees of 11k sentences.
MNLI (multi-genre natural language inference)
QQP (Quora Question Pairs: semantic equivalence)
QNLI (question natural language inference)
STS-B (textual similarity)
MRPC (Microsoft Research Paraphrase Corpus)
RTE (textual entailment)
SWAG (commonsense inference)
SST-2 (sentiment)
CoLA (linguistic acceptability)
SQuAD (question answering)
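As an illustration of the fine-tuning setup, a classification head can be placed on BERT's [CLS] output. This sketch uses the Hugging Face transformers classes (assumed available) and a hypothetical two-label sentiment task; it is not the original BERT fine-tuning code.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

batch = tokenizer(["a gorgeous, witty film", "flat and uninspired"],
                  padding=True, return_tensors='pt')
labels = torch.tensor([1, 0])                # hypothetical sentiment labels

outputs = model(**batch, labels=labels)      # adds a linear head on the [CLS] vector
outputs.loss.backward()                      # fine-tune all weights end-to-end
```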
NLP Tasks: Multi-Genre Natural Lang. Inference
03 GPT-2, BERT
MNLI: 433k pairs of examples, labeled as entailment, neutral, or contradiction
NLP Tasks (SQuAD -- Stanford Question Answering Dataset):
03 GPT-2, BERT
Sample: Super Bowl 50 was an American football game to
determine the champion of the National Football League (NFL)
for the 2015 season. The American Football Conference (AFC)
champion Denver Broncos defeated the National Football
Conference (NFC) champion Carolina Panthers 24–10 to earn
their third Super Bowl title. The game was played on February 7,
2016, at Levi's Stadium in the San Francisco Bay Area at Santa
Clara, California. As this was the 50th Super Bowl, the league
emphasized the "golden anniversary" with various gold-themed
initiatives, as well as temporarily suspending the tradition of
naming each Super Bowl game with Roman numerals (under
which the game would have been known as "Super Bowl L"), so
that the logo could prominently feature the Arabic numerals 50.
Which NFL team represented the
AFC at Super Bowl 50?
Ground Truth Answers: Denver
Broncos
Which NFL team represented the
NFC at Super Bowl 50?
Ground Truth Answers: Carolina
Panthers
Add indices for sentences and paragraphs
SegaTron/SegaBERT
H. Bai, S. Peng, J. Lin, L. Tan, K. Xiong, W. Gao, M. Li: SegaTron: Segment-aware transformer for language modeling
and understanding. AAAI 2021
Convergence speed is much faster:
Testing on GLUE dataset
H. Bai, S. Peng, J. Lin, L. Tan, K. Xiong, W. Gao, M. Li: SegaTron: Segment-aware transformer for language modeling
and understanding. AAAI 2021
Reading comprehension – SQuAD tasks
F1 = 2PR / (P + R), where P is precision and R is recall (both in percent); EM = exact match
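As a sketch, token-level F1 and exact match for a SQuAD-style answer could be computed as follows (simplified: it skips the punctuation and article normalization that the official evaluation script performs).

```python
from collections import Counter

def exact_match(pred, truth):
    return int(pred.strip().lower() == truth.strip().lower())

def f1_score(pred, truth):
    pred_toks, truth_toks = pred.lower().split(), truth.lower().split()
    common = Counter(pred_toks) & Counter(truth_toks)   # overlapping tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_toks)    # precision
    r = overlap / len(truth_toks)   # recall
    return 2 * p * r / (p + r)

print(f1_score("Denver Broncos", "the Denver Broncos"),
      exact_match("Denver Broncos", "Denver Broncos"))
```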
Improving Transformer-XL
Looking at Attention
Looking at Attention
Feature Extraction
03 GPT-2, BERT
We start with independent (context-free) word embeddings at the first layer.
We end up with an embedding for each word that depends on the current input
context.
Feature Extraction, which embedding to use?
03 GPT-2, BERT
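For illustration, contextual embeddings can be pulled out of a pretrained BERT with the Hugging Face transformers library (assumed installed); which hidden layer, or combination of the last few layers, to use is exactly the design choice raised above.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

inputs = tokenizer("A robot may not injure a human being", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states         # tuple: embedding layer + 12 encoder layers
last_layer = hidden_states[-1]                # (1, seq_len, 768): one vector per input token
last_four_sum = torch.stack(hidden_states[-4:]).sum(0)   # a commonly used alternative feature
```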
What we have learned
03 GPT-2, BERT
1. Model size matters (345 million parameters is better than 110 million
parameters).
2. With enough training data, more training steps imply higher accuracy.
3. Key innovation: learning from unannotated data.
4. In biotechnology we also have a lot of such data (for example,
metagenomes).
Literature & Resources for Transformers
03
Resources:
OpenAI GPT-2 implementation: https://github.com/openai/gpt-2
BERT paper: J. Devlin et al., BERT: Pre-training of deep bidirectional
transformers for language understanding, Oct. 2018.
ELMo paper: M. Peters et al., Deep contextualized word representations, 2018.
ULM-FiT paper: J. Howard, S. Ruder, Universal language model fine-tuning for
text classification, 2018.
Jay Alammar, The Illustrated GPT-2, https://jalammar.github.io/illustrated-gpt2/
GPT-2, BERT

