Executive Summary
Introduction
Protein engineering is a burgeoning field within the life sciences, promising targeted therapeutics, enhanced agricultural yields, and more efficient manufacturing. The models and analysis paradigms employed by scientists and engineers leverage statistics and cutting-edge machine learning to guide desirable functional changes. While notable advances have been made in modeling protein tertiary structure, as AlphaFold’s attention network has demonstrated (Senior, 2020), there is room for simpler graphical models with better feature extractability that can quickly inform scientists of key functional associations.
Biological Background
A protein is a polymer consisting of amino acids (of which there are 20) in a linear chain. An amino acid contains one nitrogen and two carbon atoms bound to various hydrogen and oxygen atoms, as shown in Figure 1. The central carbon Cα is linked to the variable group “R”, or side chain, which distinguishes each amino acid. Amino acids join through the loss of a water molecule, and the remaining parts of the amino acids are known as amino acid residues. Amino acids bind to form chains of hundreds to thousands of residues, forming the primary structure of a protein.
Figure 1: Amino acid structure. Retrieved from https://study.com/academy/lesson/what-is-amino-acid-residue.html
Amino acids in the chain can also interact with non-adjacent amino acids in the same chain. These interactions can fold the amino acid chain and produce varying three-dimensional structures (secondary and tertiary structures). The two most common forms of secondary structure are alpha helices and beta sheets. Proteins are essential in every cellular process. Many proteins are functional as monomers, while others form complexes (protein-protein interactions) to achieve specific functions; this is known as the quaternary structure of proteins. The four levels of protein structure are visually represented in Figure 2.
Figure 2: Protein structure: primary, secondary, tertiary, and quaternary. Retrieved from https://www.thoughtco.com/protein-structure-373563
Protein-protein and residue-residue interactions are at the heart of biological processes. They give the protein its structure, which brings us to a key idea of biology: “structure equals function”. Thus, it is crucial to identify these interaction sites, or interface residues, as they can indicate the functionality of proteins. In this setting, a protein can be modeled graphically, where nodes represent residues at their 3D positions and edges represent the spatial neighborhood of each residue.
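This graph construction can be sketched with standard-library Python; the five toy C-alpha coordinates and the 5 Å contact cutoff below are illustrative assumptions, not values from this project:

```python
import math
from itertools import combinations

# Hypothetical C-alpha coordinates for a toy five-residue chain (angstroms);
# residues 4 and 5 fold back toward residues 2 and 1.
coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (7.6, 0.0, 0.0),
    4: (3.8, 3.0, 0.0),
    5: (0.0, 3.0, 0.0),
}

def contact_edges(coords, cutoff=5.0):
    """Edges between residues whose coordinates lie within `cutoff` angstroms."""
    return [(i, j) for i, j in combinations(sorted(coords), 2)
            if math.dist(coords[i], coords[j]) <= cutoff]

edges = contact_edges(coords)
# Non-adjacent pairs such as (1, 5) and (2, 4) appear as edges, capturing
# spatial neighbors that sequence order alone would miss.
```

The cutoff controls how dense the spatial neighborhood is; folded-back residues become graph neighbors even though they are far apart in the sequence.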
Modeling
Many prediction tasks involve large numbers of output variables that depend on pairwise associations. Structured prediction, which combines graphical modeling with classification, is critical for such tasks (Athar, 2018). These pairwise associations make it possible to classify compact multivariate data and thus perform predictions that draw on a large set of individual features. Conditional random fields (CRFs), a popular probabilistic method for structured prediction, are one flavor of these graphical models. It is worth noting that CRFs have very wide application: they are used in computer vision, natural language processing, and bioinformatics. The methods available for inference and parameter estimation in CRFs determine the practical issues involved in large-scale CRF implementations. Briefly, we can define a CRF as a statistical modeling method applied in pattern recognition and machine learning for structured prediction. CRFs are part of the standard mathematical modeling toolkit in certain navigation software, where they are popular for identifying the direction and orientation of a device; the models additionally assist in estimating distance traveled while offline (Xu, 2015). Other computer vision applications have demonstrated that neural networks with CRF layers have predictive capabilities rivaling heavier graph neural networks, at lower computational cost, on the notoriously difficult Tanks and Temples dataset (Bao, 2019). CRFs are a type of undirected graphical model that encodes known relationships between observations and constructs consistent interpretations. They are often used to label or segment sequential data such as text or biological sequences; applications include named entity recognition, gene finding, and peptide analysis. In computer vision, CRFs are often used for object recognition and image segmentation. There are several types of conditional random fields, including higher-order, semi-Markov, and latent-dynamic conditional random fields (Suraksha, 2017).
CRF Types
Higher Order and Semi-Markov
CRFs can be extended to higher-order models by making each output variable depend on a fixed number k of previous variables. Training and inference are practical only for small values of k, since their computational cost increases exponentially with k. Mainstream models of structured prediction, such as the structured support vector machine, can be seen as providing an alternative training procedure for CRFs. Another variant is the semi-Markov conditional random field (semi-CRF), which models variable-length segments of the label sequence. This provides significant learning capability at a fraction of the compute time of GNNs.
More Types
Latent-Dynamic
Latent-dynamic conditional random fields (LDCRFs) are a CRF technique for sequence labeling tasks, and they are discriminative latent-variable models. As in any sequence tagging task, given a sequence of observations x = x₁ … xₙ, the main problem the model must solve is assigning a sequence of labels y = y₁ … yₙ from one finite set of labels Y. Instead of directly modeling P(y | x) as an ordinary linear-chain CRF would, a set of latent variables h is “inserted” between x and y using the chain rule of probability:

P(y | x) = Σₕ P(y | h, x) P(h | x)

This allows the capture of latent structure between the observations and labels. While LDCRFs can be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called the latent-variable perceptron has also been developed for them, based on the structured perceptron algorithm. These models find applications in computer vision, specifically gesture recognition from video streams, and in shallow parsing.
Fig 3: Simple representation of a network, with nodes representing components and edges representing interactions. Retrieved from https://www.sciencedirect.com/science/article/pii/S2001037014000233
Our approach is based on the conditional random fields (CRFs) proposed by Lafferty, which belong to the family of probabilistic methods. A CRF can use a fully connected graph, as opposed to other statistical models, such as hidden Markov models, which only have edges between adjacent nodes. This makes CRFs a better predictor of functionality, since many residues in proteins interact with residues beyond their immediate neighbors. Linear-chain CRFs, like hidden Markov models, only impose dependencies on the previous element and cannot represent the three-dimensional structure of a protein. Our project concentrates on graphical CRFs, where we can impose dependencies on arbitrary elements. The project goal is to take a family of proteins and create a graph CRF that can act as a scoring system for new sequences, assessing whether the new sequences have the same functionality as the family.
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030119
Pros and Cons
Advantages
Conditional random fields offer many advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax the strong independence assumptions made in those models. CRFs also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased toward states with few successor states. The parameters of conditional random fields can be estimated, and the effectiveness of the resulting models can be compared against hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs) on synthetic and natural-language data.
This section explores alternative techniques for estimating the parameters of conditional random fields, a recently introduced model for labeling and segmenting sequential data. The theoretical and practical merits of the training strategies in the current CRF literature are touched on. We expect modern optimization techniques to yield better performance than the original CRF training algorithms, and experiments on a set of popular benchmark datasets verify this to be true. This is a highly promising result, showing that such parameter estimation techniques make CRFs an efficient and effective choice for labeling sequential data, as well as a theoretically sound and principled framework.
Conditional random fields (CRFs) are an undirected graphical model, a special case of which corresponds to conditionally trained finite state machines. The biggest advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the input. Faced with this freedom, however, an important question remains: which features should be used? One approach is based on iteratively constructing feature conjunctions that would significantly increase the conditional likelihood if added to the model. Automated feature induction enables not only improved accuracy and a dramatic reduction in parameter count, but also the use of larger cliques and more freedom in the form of atomic input features, which would otherwise be challenging. This approach applies to linear-chain CRFs and other CRF structures, including relational Markov networks, where it corresponds to learning clique templates, and it can also be understood as a supervised form of structure learning. Experimental results have been reported on named entity extraction and word segmentation tasks.
Limitations
The most obvious disadvantage of CRFs is the high computational complexity of training the model. This makes it very difficult to retrain the model as new training data samples become available. In addition, CRFs cannot handle unknown words, meaning words that were not present in the training sample.
In graphical depictions of these models, circles and rectangles correspond to labels (Y) and observations (X), respectively. Features that couple a label to distant or overlapping observations are difficult to accommodate in an HMM, because they violate its generative factorization, but they can be addressed with the help of CRFs. A CRF can be represented as an undirected graph G = (V, E); the probability distribution over the undirected graph is computed as a product over the maximal cliques c ∈ C of the graph. These examples of graphical models in natural language processing, although popular, serve both to illustrate the definitions of the previous section and to introduce ideas that will also arise in our discussion of conditional random fields. Special attention is given to the hidden Markov model (HMM) because it is closely related to the linear-chain CRF.
PyStruct aims to provide a general-purpose implementation of standard structured prediction and learning methods, designed both for practitioners and as a basis for researchers. It is written in Python and adopts paradigms and types from the scientific Python community for seamless integration with other projects. Key phrases: structured prediction, conditional random fields, Python. PyStruct aims to be a well-organized structured prediction library (Cao, 2020). It presently implements only max-margin and perceptron strategies, but other algorithms may follow. The learning algorithms used in PyStruct have various names, frequently used loosely or differently in different communities. Common names are conditional random fields (CRFs), maximum-margin Markov networks (M3N), or structural support vector machines.
There are several preprocessing steps to consider before feeding in the actual data; these include:
Scaling: The data may contain attributes with mixtures of scales, such as dollars, pounds, and sales volume. Many machine learning methods prefer attributes to share a common scale, such as between 0 and 1 for the smallest and largest value of a given feature. Consider any feature scaling you may need to perform.
Decomposition: Some features represent a complex concept that may be more useful to the model when split into constituent parts. For example, a date may have day and time components that can be split out further; perhaps only the hour of day is relevant to the problem being solved. Consider what feature decompositions you can perform.
Aggregation: Some features can be aggregated into a single feature that is more meaningful to the problem you are trying to solve. For example, there may be a data instance for every time a customer logs into a system, which could be aggregated into a count of logins. Keep in mind what types of feature aggregation you may want to perform.
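The three steps above can be sketched as follows; the example values and the `minmax`, `decompose`, and `login_counts` helpers are hypothetical illustrations, not part of any library used in this project:

```python
from datetime import datetime

# Scaling: min-max rescale a feature to the range [0, 1].
def minmax(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Decomposition: split a timestamp into day and hour components.
def decompose(ts):
    dt = datetime.fromisoformat(ts)
    return {"day": dt.day, "hour": dt.hour}

# Aggregation: collapse per-event login records into a count per user.
def login_counts(events):
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
    return counts

print(minmax([10, 20, 30]))                 # [0.0, 0.5, 1.0]
print(decompose("2021-07-25T14:30:00"))     # {'day': 25, 'hour': 14}
print(login_counts(["ann", "bob", "ann"]))  # {'ann': 2, 'bob': 1}
```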
Statistical Models
In this project, we are focusing on conditional random fields, which are a class of statistical modeling methods. Statistical models use mathematical models and statistical assumptions to generate sample data and make predictions about populations. In simple language, a statistical model can be considered a pair (X, P), where X represents the set of observations and P is the set of possible probability distributions on X. The process of estimating the parameters of a statistical model is known as training. In order to estimate how the model is expected to perform, we divide the data into two sets: training data and testing data. The training data set is used to create the model, and the test or validation data set is used to assess the performance of the final model.
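As a minimal illustration of this split, the sketch below partitions toy observations into training and testing sets; the 80/20 ratio and the fixed seed are arbitrary choices for the example:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle and partition observations into training and testing sets."""
    rng = random.Random(seed)
    shuffled = data[:]            # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
# 80 observations train the model; 20 held-out observations validate it.
```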
Graphical Models
Graphical models are a class of statistical models represented via a graph and mathematically denoted by a pair G = (V, E), where V is the set of nodes and E is the set of edges. There are two types of graphical models: directed and undirected. In directed graphical models, the edges of the graph have directions (Bayesian networks), whereas in undirected graphical models the edges carry no directional information (Markov networks). A clique C of an undirected graph is a maximal complete subgraph. The figure below shows an undirected graph with three maximal cliques: {1, 2, 3, 4}, {4, 5}, and {5, 6}.
Figure ##: Example of an undirected graph with three maximal cliques.
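The clique structure of that example graph can be verified by brute force with standard-library code; the edge set below encodes the six-node graph from the figure:

```python
from itertools import combinations

# Undirected graph from the figure: maximal cliques {1,2,3,4}, {4,5}, {5,6}.
edges = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)}
nodes = {1, 2, 3, 4, 5, 6}

def is_clique(subset):
    """Every pair in the subset must be connected by an edge."""
    return all((min(a, b), max(a, b)) in edges for a, b in combinations(subset, 2))

def maximal_cliques():
    """Brute force: a clique is maximal if no further node can be added."""
    cliques = []
    for r in range(2, len(nodes) + 1):
        for subset in combinations(sorted(nodes), r):
            s = set(subset)
            if is_clique(s) and not any(is_clique(s | {v}) for v in nodes - s):
                cliques.append(s)
    return cliques

print(maximal_cliques())  # [{4, 5}, {5, 6}, {1, 2, 3, 4}]
```

Brute force is exponential in the node count and only suits tiny examples like this one, but it makes the "maximal complete subgraph" definition concrete.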
Directed graphical models describe how label vectors can probabilistically generate feature vectors; for this reason, they are known as generative models. In contrast, undirected graphical models describe how to assign label vectors to feature vectors; they are known as discriminative models. The figure below describes the analogy between different graphical models: naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative directed models, and general CRFs. The main difference between naive Bayes and logistic regression is that naive Bayes is a generative model, meaning it models the joint distribution p(y, x), whereas logistic regression is a discriminative model, meaning it models the conditional distribution p(y|x). This relationship mirrors the one between hidden Markov models (HMMs) and linear-chain conditional random fields.
Figure ##: The relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative directed models, and general CRFs. Retrieved from https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Hidden Markov Models
The hidden Markov model (HMM) is a stochastic model for sequential data. It contains a Markov chain with a finite number of hidden states and observed events. In an HMM, each hidden state Yi (except Y1) depends only on the previous state Yi-1, i = 2, 3, …, n, and each observed state Xi depends only on the current state Yi, i = 1, 2, …, n.
Figure x: Hidden Markov model with hidden states Yi and observed states Xi. Retrieved from https://www.alibabacloud.com/blog/hmm%2C-memm%2C-and-crf%3A-a-comparative-analysis-of-statistical-modeling-methods_592049
There are three sets of parameters in an HMM: starting probabilities P(y1), transition probabilities P(yi|yi-1), i = 2, 3, …, n, and emission probabilities P(xi|yi), i = 1, 2, …, n. The probability that an observed sequence x is labeled by a hidden state sequence y is given by:

P(x, y) = Π_{i=1}^{n} P(yi|yi-1) P(xi|yi), with P(y1|y0) = P(y1)
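This factorization can be checked with a toy HMM; the two states, two observation symbols, and all probability values below are illustrative, not parameters from this project:

```python
# Toy HMM with two hidden states (H, L) and two observation symbols (A, B).
start = {"H": 0.6, "L": 0.4}                 # P(y1)
trans = {"H": {"H": 0.7, "L": 0.3},          # P(yi | yi-1)
         "L": {"H": 0.4, "L": 0.6}}
emit  = {"H": {"A": 0.9, "B": 0.1},          # P(xi | yi)
         "L": {"A": 0.2, "B": 0.8}}

def joint_probability(x, y):
    """P(x, y) = P(y1) P(x1|y1) * prod_i P(yi|yi-1) P(xi|yi)."""
    p = start[y[0]] * emit[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[y[i - 1]][y[i]] * emit[y[i]][x[i]]
    return p

p = joint_probability(["A", "B"], ["H", "L"])
# 0.6 * 0.9 * 0.3 * 0.8 = 0.1296
```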
The limitation of this model is that each observed state xi depends only on the hidden state yi. When the model predicts the value of yi, it cannot directly incorporate knowledge from the other observed variables.
http://cs.tulane.edu/~aculotta/pubs/culotta05gene.pdf
Conditional Random Fields
A conditional random field (CRF) is an undirected graphical model. It can be considered a generalization of the hidden Markov model in which we model the conditional distribution p(y|x) derived from the joint distribution p(y, x): the difference between an HMM and a CRF is that the CRF calculates a conditional distribution while the HMM calculates a joint distribution. According to Lafferty, “Let x be an observation over data and y = (y1, y2, ..., yn) one of the possible label sequences. Moreover, F = {fk, k = 1, 2, . . . , K} denotes a set of real-valued feature functions with a weight vector Λ = {λk}, k = 1, …, K. Then a linear-chain conditional random field takes the form

p(y|x) = (1/Z(x)) exp( Σ_{i=1}^{n} Σ_{k=1}^{K} λk fk(yi-1, yi, x, i) )

where the normalization factor

Z(x) = Σ_y exp( Σ_{i=1}^{n} Σ_{k=1}^{K} λk fk(yi-1, yi, x, i) ).”
Here, the normalization factor Z(x) sums over all possible state
sequences, which is an exponentially large number of terms.
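A brute-force sketch makes this blow-up concrete; the label set, observations, and feature weights below are hypothetical toy values:

```python
import math
from itertools import product

labels = ["a", "b"]            # label set Y
n = 4                          # sequence length
x = [0.5, -1.0, 0.3, 0.9]      # hypothetical observations

# Hypothetical transition weights for the toy feature set.
w_trans = {("a", "a"): 0.2, ("a", "b"): -0.1,
           ("b", "a"): 0.0, ("b", "b"): 0.3}

def score(y, x):
    """Unnormalized linear-chain score: emission plus transition features."""
    s = 0.0
    for i in range(n):
        s += x[i] if y[i] == "a" else -x[i]    # toy emission feature
        if i > 0:
            s += w_trans[(y[i - 1], y[i])]     # toy transition feature
    return s

# Z(x) enumerates every possible label sequence: |Y|**n terms,
# which grows exponentially with the sequence length n.
n_terms = len(labels) ** n                     # 2**4 = 16 terms here
Z = sum(math.exp(score(y, x)) for y in product(labels, repeat=n))
```

For real sequence lengths this enumeration is infeasible, which is why linear-chain CRFs compute Z(x) with the forward algorithm instead.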
Real-world observations generally have multiple interacting features and long-range dependencies, making it difficult to model the distribution p(x). The independence assumptions made by HMMs are therefore not warranted, and discriminative models like linear-chain CRFs are preferred. Linear-chain CRFs, however, have only a linear structure, which is not sufficient for this project, so latent-node graphical CRF models were developed and their graphical relationships investigated.
The graphical CRF can be defined as

p(y|x) = (1/Z(x)) Π_{Ψc ∈ G} Ψc(xc, yc)

where each factor is parameterized as

Ψc(xc, yc) = exp( Σ_k λpk fpk(xc, yc) )

and the normalization function is

Z(x) = Σ_y Π_{Ψc ∈ G} Ψc(xc, yc).

In the graphical CRF, let G be the factor graph over Y. Then a conditional random field p(y|x) for any fixed x factorizes according to G. We partition the factors of G into clique templates C = {C1, C2, …, Cp}, where each clique template Cp is a set of factors that share an equivalent set of sufficient statistics {fpk(xp, yp)} and parameters θp = {λpk}.
https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Protein Multiple Sequence Alignment
Protein multiple sequence alignments (MSAs) are an essential tool for protein structure and function prediction. Distantly related protein sequences can be identified and aligned using multiple sequence alignment, and known sequence domains can be identified in new sequences. Multiple sequence alignment uses a position-specific scoring matrix (PSSM), allowing the degree of conservation at various positions to be determined.
Multiple sequence alignments work by assessing whether the residues in a given column are homologous or play a common functional role. A single residue mutation in one column of an MSA can drive a compensating mutation in a different column, indicating that the two residue sites have coevolved. Such coevolved residue sites are key to determining protein-protein interactions, so the first step is to identify the coevolved sites in an MSA. Neighboring residues in an amino acid sequence are connected by peptide bonds, forming the primary structure of the protein. Residues that are not neighbors may also connect through hydrogen bonds or disulfide bonds; these bonds allow the protein to form three-dimensional structures, and it is this structure that is critical to the stability and functionality of the protein.
To model these interactions, the latent-node graph CRF model is chosen. In the 3D structure of a protein, nodes represent the residues and edges represent the spatial neighborhood among the residues. The latent-node graphical CRF model involves interactions with variables that are not observed during training. Hidden causes of the data are often modeled this way, making it easier to learn about the actual observations.
Software
Pystruct
In this project, we are using the Pystruct software, as it fits the desired capabilities stated in the CRF section above. Pystruct is a Python library built around general conditional random field (CRF) models. It provides a general implementation of standard structured prediction methods, in which a prediction f(x) is made by maximizing a compatibility function between an input x and the possible labels y, as shown in the following equation:

f(x) = argmax_{y ∈ Y} θᵀ Ψ(x, y)

where y is a structured label, Ψ is a joint feature function of x and y, and θ contains the parameters of the model. Pystruct supports learning algorithms including structural support vector machines (SSVMs), subgradient methods for SSVMs, block-coordinate Frank-Wolfe (BCFW), the structured perceptron, and latent-variable SSVMs.
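A minimal sketch of this argmax, assuming a hypothetical joint feature function over a three-node graph; the features, edges, and parameter values below are illustrative, not pystruct internals:

```python
from itertools import product

# Hypothetical 3-node graph labeling problem with binary labels.
edges = [(0, 1), (1, 2)]

def psi(x, y):
    """Joint feature vector Psi(x, y): per-label node sums + edge agreement."""
    f0 = sum(xi for xi, yi in zip(x, y) if yi == 0)     # evidence for label 0
    f1 = sum(xi for xi, yi in zip(x, y) if yi == 1)     # evidence for label 1
    agree = sum(1.0 for i, j in edges if y[i] == y[j])  # smoothness feature
    return (f0, f1, agree)

def predict(x, theta):
    """f(x) = argmax_y theta . Psi(x, y), by exhaustive search over labelings."""
    return max(product([0, 1], repeat=len(x)),
               key=lambda y: sum(t, t2 := 0) if False else
                   sum(t * p for t, p in zip(theta, psi(x, y))))

theta = (-1.0, 1.0, 0.5)   # hypothetical learned parameters
y_hat = predict([0.2, 0.9, 0.8], theta)
# High node evidence plus the agreement bonus favors labeling all nodes 1.
```

Real models replace the exhaustive search with combinatorial inference, which is exactly the maximization that pystruct delegates to its inference backends.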
The joint feature function and the encoding of the problem structure are computed by model classes. The structure of the joint feature function determines the hardness of the maximization. Pystruct is capable of implementing a wide range of models, including CRFs. External libraries such as OpenGM and LibDAI are used to maximize over the possible labels; using external libraries allows for a wide range of optimization algorithms, including QPBO, MPBP, TRW, and LP.
https://jmlr.csail.mit.edu/papers/volume15/mueller14a/mueller14a.pdf
The training models:
· SSVMs
· Frank-Wolfe
· OneSlack
· SubgradientSSVM
Method
Protocol
Pystruct can be installed directly using pip when using an older version of Python (<3.2); otherwise, the library should be taken from the pystruct/pystruct GitHub page. After the library has been installed using the included setup.py file, it can be imported directly in a Python script or a Jupyter notebook. If this version does not work on Windows or macOS, we created an updated version that is stored under the GitHub user maxpwilson. After installing pystruct, users can then download our project from here.
Figure X. A screenshot of the front page of our GitHub repository.
The notebooks directory contains two notebooks and their supporting files. The first, control_pull.ipynb, can be used in conjunction with a GenBank file to find, download, and align control sequences. The built-in method employs mafft to speed up computation and is versatile for further development, with the program’s addfragments and keeplength options allowing new sequences to be added to the existing structure (Katoh, 2013).
Fig X. A test printout of the control retrieval script, showing the flags to input your email and ncbi_api_key as well as to limit the search space in the PullQCGenes class of the CRFSeqs program. The result of this script can be output to a CSV for later use in pystruct.
A demo for using LatentNodeCRFs in pystruct and preprocessing the data is included as LatentNodeCRF_demo within the notebooks directory. The start of the notebook contains a section describing how to format an MSA, generate a one-hot encoding for each amino acid within a sequence, and then generate a list of latent edges for every pairwise interaction.
After formatting the data, pystruct’s LatentNodeCRF model can be instantiated within a learner SSVM, such as NSlackSSVM, which can then be used as the base SSVM for the LatentSSVM learner to fit the properly formatted data.
Fig X. (A) The first steps in formatting the MSA as a character matrix array, and (B) turning that input into a one-hot encoded array with all associated edges and latent features.
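Those formatting steps can be sketched with standard-library Python; the 22-symbol alphabet ordering and the example sequence below are illustrative assumptions (the document specifies only that 22 symbols are used):

```python
from itertools import combinations

# 22 symbols: the 20 amino acids plus a gap '-' and an unknown 'X'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-X"

def one_hot(sequence):
    """Encode one gapped MSA row as an N x 22 binary matrix."""
    rows = []
    for aa in sequence:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(aa)] = 1
        rows.append(row)
    return rows

def pairwise_edges(n):
    """Every pairwise interaction between the n residue positions."""
    return list(combinations(range(n), 2))

seq = "MK-V"
nodes = one_hot(seq)               # 4 x 22 node-feature matrix
edges = pairwise_edges(len(seq))   # 6 pairs for 4 positions
```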
Data Format
Pystruct’s rigid data format required a specific structure for data entry. Unlike the base GraphCRF, the LatentNodeCRF and EdgeFeatureGraphCRF pystruct models require an array of three arrays and matrices. The first is a one-hot encoded matrix of the gapped amino acid residues, serving as the core data for the model. The second position requires the combinatorial space of all possible pairwise relationships. An odd caveat is that LatentNodeCRF and EdgeFeatureGraphCRF require these combinations to be transposed arrays of one another. The third array can be a single integer, a constant for a 1-D array Y setting for the LatentNodeCRF, or a 2x2 matrix for the EdgeFeatureGraphCRF, often used to denote the extra weight of neighboring pixels in an image. The format-dependent Y values can be a single point for the latent models but must be an amino-acid-length array for the GraphCRF and EdgeFeatureGraphCRF models.
Model | X Pos 1 (nodes) | X Pos 2 (edges) | X Pos 3 (misc) | Y
GraphCRF | N x 22 | 2 x E | - | Nodes-length array per replicate
EdgeFeatureGraphCRF | N x 22 | 2 x E | 2 x 2 | Nodes-length array per replicate
LatentNodeCRF | N x 22 | E x 2 | 1 x 1* | 1 label per replicate
Table X. Input requirements for pystruct, where Position 1 is a one-hot encoded array of amino acids (N is the length of the AA sequence and 22 is the number of possible symbols per slot), Position 2 is a list of all possible pairwise edges (E), which are unusually required to be transposed for the latent model, and the third position, if applicable, adds weights and additional features to the edges. In the latent model, the third position corresponds to the possible latent states and need not be a 1x1 array per replicate.
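A sketch of assembling one training example per the requirements above; the edge-array orientations and placeholder node features follow the table and its caption rather than a verified pystruct specification, so treat them as assumptions:

```python
from itertools import combinations

N, F = 4, 22                              # residues, one-hot width

# Placeholder N x 22 node matrix with dummy one-hot entries.
features = [[0] * F for _ in range(N)]
for i in range(N):
    features[i][i] = 1

pairs = list(combinations(range(N), 2))   # all pairwise relationships

edges_graph  = [[i for i, _ in pairs],    # 2 x E layout (GraphCRF-style)
                [j for _, j in pairs]]
edges_latent = [[i, j] for i, j in pairs] # E x 2 layout (transposed, latent)

x_graphcrf = (features, edges_graph)           # no third slot
x_latent   = (features, edges_latent, [1])     # third slot: latent-state count
```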
SSVM Parameters
The parameters for all models were fairly consistent. A maximum of 200 training iterations was allowed; preliminary tests showed that lowering this number eventually lowered the model’s predictive accuracy. The regularization parameter (C) was set to 100 so that a strict penalty would be assigned to avoid overfitting, based on a recommendation in the pystruct source code.
Results
Control Quality
Adk-lid, short for the adenylate kinase lid domain, is a conserved protein domain across many species, including Streptococcus, a more deeply characterized genus rife with multiple gene features for an outgroup comparison. We were able to curate an assortment of 18 gene features totaling 373 sequences that were greater in length than the adk-lid domain sequences. After retrieving the FASTA sequences, an alignment of each cluster was performed with mafft v7.487 (2021/Jul/25) using default parameters, and a position-specific scoring matrix (PSSM) was calculated and graphed to assess the quality of the data pulled. A position-specific scoring matrix shows the sequence position on the x-axis and the range of possible amino acids on the y-axis. A handful of chosen alignments showed a mix of conservation and diversity among the sequences.
The quality of the MSA can be visually observed by observing a
heatmap of the position-specific scoring matrix fig x.x
Fig: Snapshot of a portion of the control sequence genes
showing the number of sequences in the file and the
conservation per AA residue elucidated by the position-specific
scoring matrix.
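The conservation measure itself is straightforward to sketch: per-column residue frequencies over the aligned sequences. The toy alignment below stands in for real mafft output, and the helper name is an assumption.

```python
from collections import Counter

# Toy alignment; a real run would parse the mafft output FASTA instead.
alignment = ["MKV-A", "MKI-A", "MRV-A"]

def position_frequencies(aln):
    """One residue-frequency table per alignment column: the raw
    material for a position-specific scoring matrix heatmap."""
    n = len(aln)
    return [{aa: cnt / n for aa, cnt in Counter(col).items()}
            for col in zip(*aln)]

pssm = position_frequencies(alignment)
```

Fully conserved columns (like column 0 above) collapse to a single entry with frequency 1.0, which is what shows up as a solid band in the heatmap.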
Model Training
Using pystruct and the protocol shown above, we were able to
use the Latent Structured Support Vector Machine to train a
latent-node graphical CRF to a high degree of accuracy.
NSlackSSVM, OneSlackSSVM, and FrankWolfeSSVM were evaluated as
the base SSVM for the latent learner; both Slack methods trained
to 100% predictive accuracy, while the FrankWolfe method reached
91.1%. The fastest SSVM model was the NSlackSSVM, which was
3.9X faster than the second performer, OneSlack, and 160X faster
than the FrankWolfe learner. Due to the similar results between
the top two performers, the faster of the two models was chosen
for further examination.
Fig X. Results from the training show that the
FrankWolfeSSVM was significantly slower than the other
models and had lower performance.
Attempts to fish out the actual pairwise associations (Max)
· All the graphs generated
· The model
· CSV
Discussion
In SSVM, the joint feature function Ψ represents the relation
between x and y. Latent variable SSVMs are generalizations of
SSVMs in which the joint feature function Ψ(x, y) is extended
with an extra argument h, to Ψ(x, y, h), describing the relation
between input x, output y, and latent variable h.
A conditional random field is a discriminative model, i.e. it
models the conditional probability P(Y|X), which is best suited
to prediction tasks where the current position is affected by
contextual information or the state of its neighbors. Unlike
HMMs and MEMMs, which are directed graphs that directly model
transition probabilities and calculate the probability of co-
occurrence, a CRF is an undirected graph that calculates the
normalization probability in the global scope. We concentrate on
graphical CRFs, where dependencies can be imposed on arbitrary
elements. In this project we developed a graphical CRF based
on the latent node CRF that scores the chances of new
sequences having the same functionality as their family.
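The latent extension can be made concrete on a toy chain. Everything below is illustrative, not the project's code: a tiny binary labeling problem where Ψ(x, y, h) adds a latent bias h, and the latent structured hinge loss is evaluated by brute-force enumeration.

```python
from itertools import product

def psi(x, y, h):
    """Joint feature vector Psi(x, y, h): per-position indicator
    features for the (x_i, y_i) pair plus a bias tied to latent h."""
    feats = [0.0] * 5  # 4 emission features + 1 latent bias
    for xi, yi in zip(x, y):
        feats[2 * xi + yi] += 1.0
    feats[4] = float(h)
    return feats

def score(w, x, y, h):
    return sum(wi * fi for wi, fi in zip(w, psi(x, y, h)))

def hamming(y, y_true):
    return sum(a != b for a, b in zip(y, y_true))

def latent_hinge(w, x, y_true):
    """Latent structured hinge: max over (y, h) of loss-augmented
    score, minus the best latent completion of the true labeling."""
    best_true = max(score(w, x, y_true, h) for h in (0, 1))
    worst = max(hamming(y, y_true) + score(w, x, y, h)
                for y in product((0, 1), repeat=len(x))
                for h in (0, 1))
    return max(0.0, worst - best_true)
```

Real learners replace the brute-force maximizations with inference over the graph, but the objective being minimized has exactly this shape.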
Learner
Convex Optimizers
After performing the trainer comparisons it was clear that
the NSlack learner was faster and at least as precise as the
other models. The NSlack and OneSlack learners were equivalent
in performance, likely due to the underlying design shared by
both. The slack methods both employ cvxopt, a Python package
whose name is a portmanteau of convex and optimization
(Andersen, 2011). Cvxopt is likely the reason their performance
is far superior, as the Frank-Wolfe algorithm underlying the
third learner is a similar type of convex optimizer, commonly
referred to as the conditional gradient method (Kolter, 2019).
The difference lies in the fact that the Frank-Wolfe
implementation was written by the pystruct designer and has
neither the C-based speed nor the "smart" constraint-criteria
check that prematurely checks whether an optimization step
lowered the predictive capabilities of the code. The general
speed of the NSlack method is a strongly desirable trait as the
protein number and length become increasingly large.
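The conditional gradient method itself is simple to illustrate. The sketch below is a generic textbook Frank-Wolfe loop, not pystruct's implementation: it minimizes a quadratic over the probability simplex, with an illustrative target and the classic diminishing step size.

```python
# Toy objective: f(x) = sum_i (x_i - t_i)^2 over the probability simplex.
target = [0.2, 0.5, 0.3]

def grad(x):
    """Gradient of the quadratic objective at x."""
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

def frank_wolfe(n_iter=500):
    x = [1.0, 0.0, 0.0]  # start at a simplex vertex
    for k in range(n_iter):
        g = grad(x)
        # Linear subproblem over the simplex: the minimizing vertex is
        # the coordinate with the smallest gradient entry.
        s = min(range(len(g)), key=g.__getitem__)
        step = 2.0 / (k + 2)  # classic diminishing step size
        x = [(1.0 - step) * xi + (step if i == s else 0.0)
             for i, xi in enumerate(x)]
    return x
```

Each iteration moves toward a single vertex of the feasible set, which keeps steps cheap but converges more slowly than the interior-point solves cvxopt performs, consistent with the timing gap observed above.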
Control Alignments
One key limitation of the study design was that we did not have
access to versions of adk-lid that were non-functional. A learner
trained on such a model would have had strong differentiating
power, but non-functional adk-lid protein controls were
unavailable. In an attempt to learn some of the general common
pairwise associations, we made the optimistic assumption that
the learner would encounter just enough noise within the control
and target groups to learn meaningful pairwise associations
within the target group. We opted for an alignment within
singular genes to reduce the size of the alignment and avoid
hyper-gappy arrays, which likely would have had large gaps
between gene clusters, resulting in gap locations becoming the
primary learned differentiating feature. Training a model on an
alignment of the control and target groups together could
theoretically have been a fruitful endeavor, but we would have
lost the structure of the initial alignment. In future
experiments we would try this total alignment with severe
penalties for gap extension, forcing the Needleman-Wunsch
implementation present in mafft to create the most compact
alignment possible for training.
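The effect of a harsher gap penalty on alignment score can be sketched with a minimal Needleman-Wunsch aligner. The scoring values below are illustrative, not mafft's defaults.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b by dynamic
    programming; a more negative gap penalty discourages gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap  # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]
```

With a steeper penalty (e.g. gap=-5 instead of -2), every unavoidable gap costs more, which in a full aligner pushes the optimum toward the compact, minimally gapped alignments the paragraph above argues for.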
· Also discuss the possibilities if we had retrieved the pairwise
· Like easy to train with small data set
· Interpret each of the results
· Problems that we faced
· Future plans/recommendations
· Conclusion
· Illustrating the purpose of the CRF
· Summarizing the results and finding
· Future recommendations
References: (ordered)
· Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., &
Zico Kolter, J. (2019). Differentiable convex optimization
layers. Advances in Neural Information Processing Systems, 32
(NeurIPS).
· Katoh, K., & Standley, D. M. (2013). MAFFT multiple
sequence alignment software version 7: Improvements in
performance and usability. Molecular Biology and Evolution,
30(4), 772–780. https://doi.org/10.1093/molbev/mst010
· Luo, X., Li, H., Yu, Y., Zhou, C., & Cao (2020). Combining
in-depth features and activity context to improve recognition of
activities of workers in groups. Computer-Aided Civil and
Infrastructure Engineering, 35(9), 965-978.
· Meunier, J. L. (2017, November). PyStruct extension for typed
crf graphs. In 2017 14th IAPR International Conference on
Document Analysis and Recognition (ICDAR) (Vol. 4, pp. 5-
10). IEEE.
· Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L.,
Green, T., Qin, C., Žídek, A., Nelson, A. W. R., Bridgland, A.,
Penedones, H., Petersen, S., Simonyan, K., Crossan, S., Kohli,
P., Jones, D. T., Silver, D., Kavukcuoglu, K., & Hassabis, D.
(2020). Improved protein structure prediction using potentials
from deep learning. Nature, 577(7792), 706–710.
https://doi.org/10.1038/s41586-019-1923-7
· Suraksha, N. M., Reshma, K., & Kumar, K. S. (2017, June).
Part-of-speech tagging and parsing of Kannada text using
Conditional Random Fields (CRFs). In 2017 International
Conference on Intelligent Computing and Control (I2C2) (pp. 1-
5). IEEE.
· Xu, M., Du, Y., Wu, J., & Zhou, Y. (2015). Map Matching
Based on Conditional Random Fields and Route Preference
Mining for Uncertain Trajectories. Mathematical Problems in
Engineering, 2015. https://doi.org/10.1155/2015/717095
· Xue, Y., Chen, J., Wen, W., Huang, Y., Yu, C., Li, T., & Bao
(2019). Mvscrf: Learning multi-view stereo with conditional
random fields. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 4312-4321).
· Yu, B., & Fan, Z. (2020). A comprehensive review of
conditional random fields: variants, hybrids and applications.
Artificial Intelligence Review, 53(6), 4289-4333.
· Zia, H. B., Raza, A. A., & Athar (2018). Urdu word
segmentation using conditional random fields (CRFs). arXiv
preprint arXiv:1806.05432.
· Zhong, Z., Li, J., Clausi, D. A., & Wong, A. (2019).
Generative adversarial networks and conditional random fields
for hyperspectral image classification. IEEE transactions on
cybernetics, 50(7), 3318-3329.