Prediction of proteins for insecticidal activity using the Python toolkit iFeature
1. ABSTRACT
To improve crop yield, agriculture companies have successfully developed
insect-resistant crops by expressing insecticidal (insect-killing) proteins in
plants. As a leader in the agricultural biotechnology industry, Bayer tests hundreds of genes
every year for insecticidal activity in its proprietary pipeline to develop the next generation of
insect control solutions. Identifying and nominating insecticidal proteins with traditional
methods such as BLAST and structure similarity has drawbacks: more
than 90% of the nominated proteins end up displaying little or no activity against insects.
Testing these proteins consumes an enormous amount of time and resources, so we adopted
a machine learning (ML) approach to identify them. We generated numerous features
for more than 5000 amino acid sequences using iFeature, a Python toolkit developed by
Chen et al. (2018), and built ML models to identify proteins with insecticidal activity.
Proteins identified using this method are tested in the pipeline to check their efficacy against
insect pests. The challenges faced while building the model, and the methods used to overcome
them, are discussed in this presentation.
2. HOW WE BUILT AN ML MODEL TO PREDICT PROTEINS WITH
INSECTICIDAL ACTIVITY
Karnam Vasudeva Rao,
Senior Scientist, Data Science,
Monsanto (A Subsidiary of Bayer)
3. CONTENTS
▰ What are insecticidal proteins?
▰ Why machine learning for protein activity identification?
▰ Different approaches used by researchers
▰ Why not general methods?
▰ iFeature Python toolkit
▰ Why did we choose iFeature?
▰ What features does iFeature have?
▰ How did we adapt it to our needs?
▰ What were the challenges?
▰ How did we overcome those?
▰ Key learnings
4. IMPROVE CROP YIELD BY DEVELOPING PEST RESISTANT
CROPS BY EXPRESSING INSECTICIDAL PROTEINS IN THEM
5. WHY WE NEED ML FOR GENE NOMINATIONS?
Current state
What?
Predict protein activity against insect pests based on amino acid sequence
features, to enable quality nominations to the insect control pipeline at Bayer.
Why?
Hundreds of proteins are nominated and analyzed each year. Many
nominations have turned out to be inactive proteins / toxins. The goal is to
develop a model to predict the propensity for toxicity.
How?
Extract features from >5000 protein (amino acid) sequences and
develop a predictive model using historical data to predict inactive toxins.
Future state
Pipeline
6. THREE MAJOR APPROACHES ARE USED BY
RESEARCHERS TO PREDICT PROTEIN FUNCTIONS
1. Sequence similarity between AA sequences
2. Protein structure comparison
3. Sequence and structure derived features
Disadvantages with traditional methods:
▰ High BLAST similarity does not always imply homology.
▰ Proteins with the same function can have different structures.
▰ Proteins that have diverged from a common ancestral gene may have the same
function but different sequences.
▰ Sequence similarity-based approaches are often inadequate in the absence of
similar sequences or when the sequence similarity among known protein sequences
is statistically weak (called the "twilight zone" or "midnight zone")
(reference: Proteome Science 2009, 7:27).
▰ Biological experiments for protein identification are time consuming and
resource intensive.
7. iFeature - AN OPEN-SOURCE PYTHON TOOLKIT FOR
PREDICTION OF PROTEIN ACTIVITY
iFeature
▰ http://iFeature.erc.monash.edu/
▰ https://github.com/Superzchen/iFeature/
▰ Features:
▰ Protein length, molecular weight, number of atoms,
grand average of hydropathicity (GRAVY), amino
acid composition, periodicity, physicochemical
properties, predicted secondary structures,
subcellular location, sequence motifs or highly
conserved regions, classification of protein function,
hydrophobicity, solvent accessibility, secondary
structure, surface tension, charge, polarisability,
polarity, and normalized van der Waals volume and
annotations in protein databases.
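Two of the simplest features in the list above, protein length and GRAVY, can be computed directly from the sequence. The following is a minimal sketch in plain Python, not code from iFeature: it assumes GRAVY is the mean Kyte-Doolittle hydropathy value over the standard residues, and the example peptide is made up.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy value per residue."""
    # Non-standard symbols (e.g. X) are skipped; this is an assumption of the sketch.
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


example = "MKTAYIAKQR"  # made-up peptide, for illustration only
print("length:", len(example))
print("GRAVY:", round(gravy(example), 3))
```

A negative GRAVY value indicates a hydrophilic sequence on average, a positive value a hydrophobic one.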
• DPPI: Predicting protein–protein interactions through sequence-based deep
learning. Bioinformatics, 34, 2018, i802–i810
• DeepGO: Predicting protein functions from sequence and interactions using a
deep ontology-aware classifier. Bioinformatics, 34(4), 2018, 660–668
• Classifiers: Predicting protein function by machine learning on amino acid
sequences – a critical evaluation. BMC Genomics 2007, 8:78
8. iFeature - AN OPEN-SOURCE PYTHON TOOLKIT
GitHub repository with code, usage instructions and examples.
10. iFeature POSSESSES 37 FEATURE DESCRIPTORS
python iFeature.py --file examples/test-protein.txt --type CKSAAP
python iFeature.py --file examples/test-protein.txt --type DDE
• pcaAnalysis.py: dimensionality reduction (iFeature offers three
algorithms: PCA, LDA and t-SNE)
• feaSelector.py: program used to implement the feature selection algorithms
• cluster.py: program used for running the feature or sample clustering
algorithms
• iFeaturePseKRAAC.py: program used to extract the 16 types of pseudo K-tuple
reduced amino acid composition (PseKRAAC) feature descriptors
• CKSAAP: composition of k-spaced amino acid pairs
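To make the k-spaced amino acid pair idea concrete, here is a minimal re-implementation sketch in plain Python. It is not iFeature's own code: it assumes gaps of 0 to 5 residues (which gives 400 x 6 = 2400 values, matching the CKSAAP dimension iFeature reports) and normalizes each gap by the number of pairs counted. The function name and the `AA.gap0` style keys are illustrative choices.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, max_gap=5):
    """Frequencies of residue pairs separated by 0..max_gap other residues."""
    sequence = sequence.upper()
    features = {}
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair the residue at i with the residue gap + 1 positions later.
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # skip pairs containing non-standard symbols
                counts[pair] += 1
                total += 1
        for pair in PAIRS:
            features[f"{pair}.gap{gap}"] = counts[pair] / total if total else 0.0
    return features


vector = cksaap("MKTAYIAKQRQISFVKSHFSRQ")  # made-up sequence
print(len(vector))  # one value per (pair, gap) combination
```

With gap 0 the descriptor reduces to ordinary dipeptide frequencies; larger gaps capture residues that are near each other but not adjacent.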
11. LIST OF VARIOUS DESCRIPTORS CALCULATED BY iFeature
Descriptor group          Descriptor                                            Dimension
AA composition            Amino acid composition (AAC)                          20
                          Enhanced amino acid composition (EAAC)                -
                          Composition of k-spaced AA pairs (CKSAAP)             2400
                          Dipeptide composition (DPC)                           400
                          Dipeptide deviation from expected mean (DDE)          400
                          Tripeptide composition (TPC)                          8000
Grouped AA composition    Grouped amino acid composition (GAAC)                 5
                          Enhanced grouped AA composition (EGAAC)               -
                          Composition of k-spaced AA group pairs (CKSAAGP)      150
                          Grouped dipeptide composition (GDPC)                  25
                          Grouped tripeptide composition (GTPC)                 125
Binary                    Binary (BINARY)                                       -
Autocorrelation           Moran (Moran)                                         240
                          Geary (Geary)                                         240
                          Normalized Moreau-Broto (NMBroto)                     240
C/T/D                     Composition (CTDC)                                    39
                          Transition (CTDT)                                     39
                          Distribution (CTDD)                                   195
Conjoint triad            Conjoint triad (CTriad)                               343
                          Conjoint k-spaced triad (KSCTriad)                    343 x (k+1)
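The dimensions in the table follow from simple counting: 20 residues give 20, 400 and 8000 possible single residues, dipeptides and tripeptides, and 5 residue groups give 5, 25 and 125. The sketch below illustrates this in plain Python. It is not iFeature's implementation, and the five physicochemical groups used here are an assumption made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed grouping of the 20 residues into five physicochemical classes.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def composition(tokens, alphabet, k):
    """Frequency of every possible length-k tuple over `alphabet` in `tokens`."""
    keys = list(product(alphabet, repeat=k))
    counts = dict.fromkeys(keys, 0)
    windows = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    for window in windows:
        counts[window] += 1
    total = len(windows)
    return [counts[key] / total if total else 0.0 for key in keys]


def descriptors(sequence):
    # Non-standard symbols are dropped before counting (a simplification).
    residues = [aa for aa in sequence.upper() if aa in RESIDUE_TO_GROUP]
    groups = [RESIDUE_TO_GROUP[aa] for aa in residues]
    names = list(GROUPS)
    return {
        "AAC": composition(residues, AMINO_ACIDS, 1),
        "DPC": composition(residues, AMINO_ACIDS, 2),
        "TPC": composition(residues, AMINO_ACIDS, 3),
        "GAAC": composition(groups, names, 1),
        "GDPC": composition(groups, names, 2),
        "GTPC": composition(groups, names, 3),
    }


for name, values in descriptors("MKTAYIAKQRQISFVKSHFSRQ").items():  # made-up sequence
    print(name, len(values))
```

Even this small subset yields more than 8500 columns per protein, which is why feature selection becomes a concern once the descriptors are generated.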
12. CHALLENGES IN USING SEQUENCE BASED ML APPROACHES
1. Data preparation
▰ What to explore in data?
▰ Only 2 variables: sequences and assay values
2. Feature extraction (iFeature)
▰ No independent variables!
▰ Need to generate features using sequences.
3. Feature selection
▰ 1000s of features; which ones to select?
▰ What do these features explain?
▰ Meaningful features for protein function prediction
4. Model building
▰ Which model to choose?
5. Performance of the models
▰ Confusion matrix
▰ Does it make sense biologically?
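With thousands of generated features, a first filtering pass helps before any domain review. The sketch below shows one simple possibility in plain Python: drop near-constant columns, then rank the rest by absolute Pearson correlation with a binary activity label. This is a stand-in for illustration, not the selection method used in the talk; the function names, thresholds and toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_features(rows, labels, names, min_variance=1e-8, top_n=10):
    """Drop near-constant features, then rank the rest by |correlation| with the label."""
    ranked = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / len(column)
        if variance < min_variance:
            continue  # a constant column carries no information
        ranked.append((abs(pearson(column, labels)), name))
    ranked.sort(reverse=True)
    return [(name, score) for score, name in ranked[:top_n]]


# Toy data: four proteins, three hypothetical features, binary activity labels.
names = ["f_informative", "f_constant", "f_noise"]
rows = [[0.1, 0.5, 0.9], [0.2, 0.5, 0.1], [0.8, 0.5, 0.8], [0.9, 0.5, 0.2]]
labels = [0, 0, 1, 1]
for name, score in rank_features(rows, labels, names):
    print(name, round(score, 3))
```

A ranking like this only shortlists candidates; whether a feature makes biological sense still needs to be judged with domain knowledge.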
15. KEY LEARNINGS
FEATURES
▰ iFeature - ‘all in one package’
▰ Very few independent variables
before using iFeature and too
many after using iFeature.
▰ Use not only feature importance but
also domain knowledge to choose input
variables (e.g. k-spaced pairs, conjoint
triad).
DATA
▰ Data bias can be overcome
using domain knowledge: 0:
inactive; 1-5: active (multinomial
to binomial).
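The multinomial-to-binomial step can be sketched as follows. This assumes assay scores on a 0 to 5 scale where 0 means inactive and 1 to 5 mean increasing levels of activity; the scores below are made up to show how collapsing sparse classes changes the class balance.

```python
from collections import Counter


def binarize(score, threshold=1):
    """Return 1 (active) if the assay score reaches the threshold, else 0 (inactive)."""
    return 1 if score >= threshold else 0


scores = [0, 0, 3, 0, 1, 5, 0, 2, 0, 0]  # made-up assay scores on a 0-5 scale
labels = [binarize(s) for s in scores]

print("multinomial:", sorted(Counter(scores).items()))
print("binomial:   ", sorted(Counter(labels).items()))
```

Each active level appears only once in the toy data, but merged together the active class has four examples, which is easier for a classifier to learn from.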
MODEL BUILDING
▰ Build multiple models instead of
one or two and choose the best
based on business needs and
parameters.
▰ Where multiple models perform
equally, select a model based on
business needs / domain
knowledge (false positives vs.
false negatives), i.e. sensitivity and
specificity.
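The trade-off between false positives and false negatives can be made explicit by computing sensitivity and specificity from each model's confusion matrix. The sketch below uses plain Python; the two "models" and their predictions are hypothetical and exist only to show the comparison.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of actives found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # share of inactives rejected
    return sensitivity, specificity


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 1, 1, 0, 0, 0],  # finds every active, more false positives
    "model_b": [1, 0, 0, 0, 0, 0, 0, 0],  # no false positives, misses actives
}
for name, y_pred in predictions.items():
    sens, spec = sensitivity_specificity(y_true, y_pred)
    print(name, "sensitivity:", round(sens, 2), "specificity:", round(spec, 2))
```

If missing an active protein is the costlier error, a model like model_a would be preferred; if wasted testing effort on inactive proteins is costlier, model_b would be.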
OTHER APPLICATIONS
▰ iFeature and the above approach can be
used to identify disease-related proteins and
for protein-protein interaction studies.