CS 886 Deep Learning for Biotechnology
Ming Li
Jan. 2022
Prelude: our world
What do they have in common?
These are all forms of life, encoded by DNA / RNA ranging from 30k bases to 3B bases and more.
Genes encoded by DNA are translated into thousands of proteins, hence life.
Deep Learning
Since its invention, deep learning has changed many research fields: speech recognition, image processing, natural language processing, autonomous driving, industrial control, and especially biotechnology (for example, protein structure prediction). In this class, we will review applications of deep learning in biotechnology. The first few lectures cover the necessary background in deep learning.
01. Word2Vec
02. Attention / Transformer
03. Pretraining: GPT and BERT
04. Deep learning applications in proteomics
05. Student presentations begin
01
LECTURE ONE
Word2Vec: from discrete to
continuous space
Things you need to know:
Dot product:
a · b = ||a|| ||b|| cos(θ_ab)
      = a1b1 + a2b2 + … + anbn
From this one can derive cosine similarity:
cos θ_ab = (a · b) / (||a|| ||b||)
Softmax function:
If we take an input of [1,2,3,4,1,2,3], its softmax is
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
The softmax function highlights the largest values and suppresses the others, so that the outputs are all positive and sum to 1.
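To make the two definitions above concrete, here is a minimal NumPy sketch (our illustration; the function names are ours):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta_ab) = (a . b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(x):
    # Subtract the max for numerical stability; outputs are positive and sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)))
# ≈ [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```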
Transforming from discrete to “continuous” space
1. Calculus (for computing the space covered by a curve)
2. Word2Vec (for computing the “space” covered by the meaning of a
word, or a gene, or a protein)
Traditional representation of a word's meaning
1. Dictionary (or PDB): not too useful in computational linguistics research.
2. WordNet (protein networks): a graph of words, with relationships like “is-a” and synonym sets.
Problems: it depends on human labeling, hence misses a lot, and the process is hard to automate.
3. All of these use atomic symbols: hotel, motel, equivalent to 1-hot vectors:
Hotel: [0,0,0,0,0,0,0,0,1,0,0,0,0,0]
Motel: [0,0,0,0,1,0,0,0,0,0,0,0,0,0]
These are called one-hot representations. They are very long: 13M words (Google crawl).
Example: inverted index.
Problems.
There is no inherent notion of similarity with 1-hot vectors:
hotel · motelᵀ = 0
They are also very long: about 20 thousand words for daily speech, 50k for machine translation, 20 thousand proteins, millions of species, and 13M words in the Google web crawl.
How do we solve the problem? This is what Newton and Leibniz did for calculus:
when #rectangles → ∞
Let's do something similar:
1. Use a lot of “rectangles” (a vector of numbers) to approximate the meaning of a word.
2. What represents the meaning of a word?
“You shall know a word by the company it keeps” – J.R. Firth, 1957
Thus, our goal is to assign each word a vector such that similar words have similar vectors (by dot product).
We will believe J.R. Firth and use a neural network to train (low-dimensional) vectors such that each time two words appear together in a text, their vectors get slightly closer. This allows us to use a massive corpus without annotation! We scan through the training data looking at a window of 2d+1 words at a time: given a center word, we try to predict the d words on its left and the d words on its right.
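As an illustration (not from the original slides), here is a minimal sketch of scanning a tokenized corpus with window radius d and emitting (center, context) training pairs:

```python
def skipgram_pairs(tokens, d=2):
    """Yield (center, context) pairs from a token list using a window of 2d+1 words."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - d), min(len(tokens), i + d + 1)):
            if j != i:
                yield center, tokens[j]

pairs = list(skipgram_pairs("you shall know a word by the company it keeps".split(), d=2))
print(pairs[:5])
```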
To design a neural network for this:
More specifically, for a center word w_t and “context words” w_t', within a window of some fixed radius, say 5 (t' = t-1, …, t-5, t+1, …, t+5), we use a neural network to predict all w_t' so as to maximize:
p(w_t' | w_t) = …
This has a loss function
L = 1 – p(w_t' | w_t)
By looking at many positions in a big corpus and adjusting these vectors to minimize this loss, we arrive at a (low-dimensional) vector approximation of the meaning of each word, in the sense that two words that often occur in close proximity are considered similar.
To design a neural network for this:
Thus the objective function is: maximize the probability of any context word given the current center word:
L'(θ) = Π_{t=1..T} Π_{d=-5..5, d≠0} P(w_{t+d} | w_t, θ)
where θ is all the variables we optimize (i.e., the vectors) and T = |training text|.
Taking the negative logarithm (and averaging per word) so that we can minimize:
L(θ) = − (1/T) Σ_{t=1..T} Σ_{d=-5..5, d≠0} log P(w_{t+d} | w_t)
Then what is P(w_{t+d} | w_t)? We can approximate it by taking the dot product of the two word vectors and then a softmax. Letting v be the vector of word w:
L(θ) ≈ − (1/T) Σ_{t=1..T} Σ_{d=-5..5, d≠0} log Softmax(v_{t+d} · v_t)
To design a neural network for this:
From the last slide:
L(θ) ≈ − (1/T) Σ_{t=1..T} Σ_{d=-5..5, d≠0} log Softmax(v_{t+d} · v_t)
The softmax for a center word c and a context/outside word o is
Softmax(v_o · v_c) = exp(v_o · v_c) / Σ_{k=1..V} exp(v_k · v_c)
Note that the index runs over the dictionary of size V, not the whole text T.
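A minimal NumPy sketch of this normalization (our own illustration; the matrix of context vectors is an assumption of the example). Note that the denominator sums over the whole vocabulary, which is exactly the cost that negative sampling removes on the next slide:

```python
import numpy as np

def context_prob(U_ctx, v_c, o):
    """P(o | c) under the full softmax.

    U_ctx: (V, N) matrix of context ("outside") vectors, one row per vocabulary word.
    v_c:   (N,) center-word vector.
    o:     index of the outside word.
    """
    scores = U_ctx @ v_c                      # dot product with every word in the vocabulary
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```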
Negative Sampling in Word2Vec
• In our L(θ) ≈ − (1/T) Σ_{t=1..T} Σ_{d=-5..5, d≠0} log Softmax(v_{t+d} · v_t), where
Softmax(v_o · v_c) = exp(v_o · v_c) / Σ_{k=1..V} exp(v_k · v_c),
we have to compute Σ_{k=1..V} exp(v_k · v_c) every time; this is too expensive.
• To overcome this, use negative sampling. The objective function becomes L(θ) = − (1/T) Σ_{t=1..T} L_t(θ), with
L_t(θ) = log σ(u_oᵀ v_c) + Σ_{j=1..k} E_{j~P(w)} [log σ(−u_jᵀ v_c)]
• Here σ(x) = 1/(1 + e^(−x)) is the sigmoid function, treated as a probability. That is, we maximize the first term (the true context word) while taking k random samples (e.g. k = 10) in the second term.
• For sampling, we can use the unigram distribution U(w), or U(w)^(3/4) to give rare words more weight.
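A sketch of the per-position negative-sampling loss L_t(θ) above (our illustration; u and v are the context and center vectors, and the k negative context vectors are assumed to be drawn from U(w)^(3/4)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, u_negs):
    """Negative of L_t(theta) for one (center, context) pair with k negative samples.

    v_c:    (N,) center vector of the center word
    u_o:    (N,) context vector of the true outside word
    u_negs: (k, N) context vectors of k sampled negative words
    """
    pos = np.log(sigmoid(u_o @ v_c))            # reward the true pair
    neg = np.log(sigmoid(-u_negs @ v_c)).sum()  # penalize similarity to random words
    return -(pos + neg)
```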
The skip-gram model
• Vocabulary size: V
• Input layer: the center word in 1-hot form.
• The k-th row of W_{V×N} is the center vector of the k-th word.
• The k-th column of W'_{N×V} is the context vector of the k-th word in V. Note that each word has 2 vectors, both randomly initialized.
• Each output column y_ij, i = 1..C, is computed in 3 steps:
1) use the context word's 1-hot vector to choose its column in W'_{N×V};
2) take the dot product with h_i, the hidden representation of the center word;
3) compute the softmax.
C = context window size. A forward-pass sketch follows below.
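A minimal NumPy sketch of the forward pass just described (our illustration; the sizes are placeholders):

```python
import numpy as np

V, N = 10000, 200                       # vocabulary size, embedding dimension (placeholders)
W  = np.random.randn(V, N) * 0.01       # W_{VxN}: center vectors, one row per word
Wp = np.random.randn(N, V) * 0.01       # W'_{NxV}: context vectors, one column per word

def forward(center_idx):
    """Return P(context = j | center) for every word j in the vocabulary."""
    h = W[center_idx]                   # the 1-hot input simply selects row center_idx
    scores = h @ Wp                     # dot product of h with every context vector
    scores -= scores.max()              # numerical stability
    return np.exp(scores) / np.exp(scores).sum()
```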
The Training of θ
• We will train both W_{V×N} and W'_{N×V},
• i.e. compute all vector gradients.
• Thus θ lives in R^{2NV}, where N is the vector size and V is the number of words.
• We need ∂L(θ)/∂v for every vector in θ.
θ = [v_aardvark, v_a, …, v_zebra, u_aardvark, u_a, …, u_zebra] ∈ R^{2NV}
Gradient Descent
• θ_new = θ_old − α ∂L(θ_old)/∂θ_old
• Stochastic gradient descent (SGD): just do one position (one center word and its context words) at a time.
θ = [v_aardvark, v_a, …, v_zebra, u_aardvark, u_a, …, u_zebra] ∈ R^{2NV}
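A sketch of one SGD step at a single position under the negative-sampling loss above, with the gradients written out (our illustration; the learning rate value is a placeholder):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_c, u_o, u_negs, lr=0.025):
    """Update v_c, u_o and u_negs in place for one (center, context, negatives) position."""
    g_o = sigmoid(u_o @ v_c) - 1.0          # gradient factor for the true context word
    g_n = sigmoid(u_negs @ v_c)             # gradient factors for the k negatives
    grad_v = g_o * u_o + g_n @ u_negs       # d(loss)/d(v_c), using the old u's
    u_o    -= lr * g_o * v_c
    u_negs -= lr * np.outer(g_n, v_c)
    v_c    -= lr * grad_v
```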
CBOW
• What if we predict the center word given the context words, i.e. the opposite of the skip-gram model?
• Yes, this is called the Continuous Bag Of Words (CBOW) model in the original Word2Vec paper.
Results
More realistic data – not everything is perfect
An interesting application in materials science
• Nature, July 2019, V. Tshitoyan et al., “Unsupervised word embeddings capture latent knowledge from materials science literature”.
• Lawrence Berkeley Lab materials scientists applied word embedding to 3.3 million scientific abstracts published between 1922 and 2018. V = 500k, vector size 200 dimensions, skip-gram model.
• With no explicit insertion of chemical knowledge,
• it captured things like the periodic table and structure-property relationships in materials:
ferromagnetic − NiFe + IrMn ≈ antiferromagnetic
• It discovered new thermoelectric materials “years before their discovery”.
• Can you do something similar for proteins?
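For reference, such an analogy query looks roughly like the following with the gensim library (a sketch assuming `abstracts` is a list of tokenized abstracts; exact results depend on the corpus and training settings):

```python
from gensim.models import Word2Vec

# Train a skip-gram model with negative sampling on tokenized abstracts (assumed available).
model = Word2Vec(abstracts, vector_size=200, window=8, sg=1, negative=10, min_count=5)

# Analogy in embedding space: ferromagnetic - NiFe + IrMn ≈ ?
print(model.wv.most_similar(positive=["ferromagnetic", "IrMn"], negative=["NiFe"], topn=3))
```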
Beyond Word2Vec
• Co-occurrence matrix
Window-based co-occurrence
Document-based co-occurrence
Ignore frequent words such as “the”, “he”, “has”, …
Closer proximity weighs more …
• In word2vec, if a center word w appears again, we have to repeat the whole process; here all of its occurrences are processed together. Documents can also be considered.
• The matrix is symmetric.
• SVD decomposition of this matrix predates Word2Vec, but it is O(nm²), too slow for large data.
A sketch of building a window-based co-occurrence matrix follows below.
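A sketch of the window-based co-occurrence counts described above (our illustration):

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, d=2):
    """X[i, j] = number of times vocab[j] appears within d positions of vocab[i] (symmetric)."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        for j in range(max(0, i - d), min(len(tokens), i + d + 1)):
            if j != i and tokens[j] in index:
                X[index[w], index[tokens[j]]] += 1
    return X
```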
GloVe (Global Vectors model)
• Combines the Word2Vec and co-occurrence matrix approaches. Minimize
L(θ) = ½ Σ_{i,j=1..W} f(P_{i,j}) (u_iᵀ v_j − log P_{i,j})²
where the u, v vectors are the same as before and P_{i,j} is the co-occurrence count of words i and j.
Essentially this says: the more often words i and j co-occur, the larger the dot product u_iᵀ v_j should be. The weighting f downweights overly frequent co-occurrences.
• What about the two vectors per word? X = U + V works.
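A sketch of the GloVe objective above, using the weighting f(x) = min(1, (x/x_max)^α) from the GloVe paper (our illustration, not the reference implementation):

```python
import numpy as np

def glove_loss(U, V, P, x_max=100.0, alpha=0.75):
    """GloVe objective for a co-occurrence count matrix P (W x W); U, V are (W, N) vectors."""
    f = np.minimum(1.0, (P / x_max) ** alpha)   # f(P_ij): cap the weight of frequent pairs
    logP = np.log(np.where(P > 0, P, 1.0))      # log counts; zero entries are masked below
    err = (U @ V.T - logP) ** 2                 # (u_i . v_j - log P_ij)^2
    return 0.5 * np.sum(np.where(P > 0, f * err, 0.0))
```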
Summary
• We have learned that representation is important: when you represent the area under a curve by a lot of rectangles, you can approximate the curve, hence calculus; when you represent a word by a vector, words close in meaning can take nearby vectors, measured by cosine similarity.
• When a representation of an object (a short vector) admits a similarity measure, we can easily design a neural network (or other approaches) to place words that are close in meaning in close vicinity.
Literature & Resources for Word2Vec
Bengio et al., 2003. A neural probabilistic language model.
Collobert & Weston, 2008. NLP (almost) from scratch.
Mikolov et al., 2013. The word2vec paper.
Pennington, Socher & Manning, 2014. The GloVe paper.
Rohde et al., 2005. An improved model of semantic similarity based on lexical co-occurrence (the SVD paper).
Plus thousands more.
Resources:
https://mccormickml.com/2016/04/27/word2vec-resources/
https://github.com/clulab/nlp-reading-group/wiki/Word2Vec-Resources
Project Ideas
1. Similar to V. Tshitoyan et al.'s work, can you mine the biological literature and find interesting facts such as protein-protein interactions, or resolve biological name identities?
2. Can we use word embedding to embed genomes (for example, shatter genomes into pieces as “words”, but train one vector for each genome), hence cluster species to build a phylogeny and decide their evolutionary history, similar to [1,2]? You can start with mitochondrial genomes, virus genomes (such as different strains of SARS-CoV-2, for which you can add other factors such as geography and time), or bacterial genomes. A sketch of the genome-as-document idea appears after the references below.
1. M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An information-based sequence distance and its application to whole
mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149-154.
2. C.H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6(June 2003) (feature article), 76-81.
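A rough sketch of project idea 2, shattering each genome into k-mer “words” and training one vector per genome with gensim's Doc2Vec (our suggestion; the `genomes` dict is an assumption, and API details may vary with the gensim version):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def kmers(seq, k=6):
    """Shatter a genome sequence into overlapping k-mer "words"."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed_genomes(genomes, k=6, dim=100):
    """genomes: dict mapping a species/accession name to its DNA sequence string."""
    docs = [TaggedDocument(words=kmers(seq, k), tags=[name]) for name, seq in genomes.items()]
    model = Doc2Vec(docs, vector_size=dim, window=5, min_count=1, epochs=20)
    return {name: model.dv[name] for name in genomes}   # one embedding vector per genome
```

Pairwise cosine distances between these genome vectors could then be clustered to build a phylogeny.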
02
Attention and Transformers
LECTURE TWO