ABSTRACT
To improve crop yield, agriculture companies have successfully developed
insect-resistant crops by expressing insecticidal (insect-killing) proteins in
plants. As a leader in the agricultural biotechnology industry, Bayer tests hundreds of genes
every year for insecticidal activity in its proprietary pipeline to develop the next generation of
insect control solutions. Identifying and nominating insecticidal proteins with traditional
methods such as BLAST and structure similarity has drawbacks: more
than 90% of the nominated proteins end up showing little or no activity against insects.
Testing these proteins consumes an enormous amount of time and resources, so we adopted
a machine learning (ML) approach to identify them. We generated numerous features
for more than 5000 amino acid sequences using iFeature, a Python toolkit developed by
Chen et al. in 2018, and built ML models to identify proteins with insecticidal activity.
Proteins identified using this method are tested in the pipeline to check their efficacy against
insect pests. This presentation discusses the challenges faced while building the model and
the methods used to overcome them.
HOW WE BUILT AN ML MODEL
TO PREDICT PROTEINS WITH
INSECTICIDAL ACTIVITY
Karnam Vasudeva Rao,
Senior Scientist, Data Science,
Monsanto (A Subsidiary of Bayer)
CONTENTS
▰ What are insecticidal proteins?
▰ Why machine learning for protein activity identification?
▰ Different approaches used by researchers
▰ Why not general methods?
▰ iFeature Python toolkit
▰ Why did we choose iFeature?
▰ What features does iFeature have?
▰ How did we adopt it for our needs?
▰ What were the challenges?
▰ How did we overcome those?
▰ Key learnings
IMPROVE CROP YIELD BY DEVELOPING PEST RESISTANT
CROPS BY EXPRESSING INSECTICIDAL PROTEINS IN THEM
WHY DO WE NEED ML FOR GENE NOMINATIONS?
Current state
What?
Predict protein activity
against insect pests based
on amino acid sequence
features to enable quality
nominations to the insect control
pipeline at Bayer.
Why?
Hundreds of proteins are
nominated and analyzed
each year. Many
nominations have turned out
to be inactive proteins /
toxins. The goal is to develop a
model to predict the
propensity for toxicity.
How?
Extract features from
>5000 Protein (amino
acid) sequences and
develop a predictive
model using historical
data to predict inactive
toxins.
Future state
Pipeline
THREE MAJOR APPROACHES ARE USED BY
RESEARCHERS TO PREDICT PROTEIN FUNCTIONS
1. Sequence similarity between AA sequences
2. Protein structure comparison
3. Sequence and structure derived features
Disadvantages with traditional methods:
▰ High-similarity BLAST does not always imply homology.
▰ Proteins with the same function can have different structures.
▰ Proteins that have diverged from a common ancestral gene may have the same
function but different sequences.
▰ Sequence similarity-based approaches are often inadequate in the absence of
similar sequences or when the sequence similarity among known protein sequences
is statistically weak (called the "twilight zone" or "midnight zone")
(reference: Proteome Science 2009, 7:27).
▰ Biological experiments for protein identification are time consuming and
resource intensive.
iFeature - AN OPEN-SOURCE PYTHON TOOLKIT FOR
PREDICTION OF PROTEIN ACTIVITY
iFeature
▰ http://iFeature.erc.monash.edu/
▰ https://github.com/Superzchen/iFeature/
▰ Features:
▰ Protein length, molecular weight, number of atoms,
grand average of hydropathicity (GRAVY), amino
acid composition, periodicity, physicochemical
properties, predicted secondary structures,
subcellular location, sequence motifs or highly
conserved regions, classification of protein function,
hydrophobicity, solvent accessibility, secondary
structure, surface tension, charge, polarisability,
polarity, and normalized van der Waals volume and
annotations in protein databases.
▰ DPPI: Predicting protein–protein interactions through sequence-based deep
learning. Bioinformatics, 34, 2018, i802–i810
▰ DeepGO: Predicting protein functions from sequence and interactions using a
deep ontology-aware classifier. Bioinformatics, 34(4), 2018, 660–668
▰ Classifiers: Predicting protein function by machine learning on amino acid
sequences – a critical evaluation. BMC Genomics 2007, 8:78
iFeature - AN OPEN-SOURCE PYTHON TOOLKIT
GitHub repository with code,
usage instructions and examples.
SAMPLE DATA
toxin sequence score
protein3345 MNSYQNQYEILESSSNNTNMPNRYPFANDPNIFPINLDACQGRPWQDTWKSVSDIVTIGTYLIQFLREPGIGGIPVILSIINKLIPSSG 0
protein10357 MSDLEVKIGVNPADVRYTANFKVAPNDGYVMYEKNTPIIPEIGVNITVINTGREEMEVHYEWAPPFGGWQCASTTIIPPDGKPVYIA 0
protein7062 MSINIDPSKEFVKVSNFAGYEIATSQDSEEEGANLIIYYTADPYLLFYLDEERNNGILVSRRTGFVIGVKSGSNKDGELIIQCEWDGEPYS 0
protein000023 MKICVVNILLGLLMIVGESAANIGYADLTTNVYFVATIKSSTCQMSLEGGTAGGGDSYTIPVGSNGKVGAIDIINGTENAMANFSLDI 0
protein3518 MKSISKKVMAGLLVGATSLSIWAPISEAAAPENNRYYNIALKSNTKKVWNVSQASNDNDRAIVLWQGGSADHERFAFFQLDGGA 0
protein10355 MGIKKTIKFILCLSISLCILNYPSISFAETLDTNSSSVKSKSDIDTGIANLNYNNREVLAVNGDRVDSFVPKEGLNSNDKFIVVERNKKSL 0
protein5481 MENSNYFEKNNFSQEDSALDSLLNTFLVIQNKKTNQVIGRPEHYIQKGIITYYFINLENEADIPEQQLILYKLDNKSYYIVSRNKSAYYSF 0
protein000025 MKRIFFFIPLILGLVACADDDSFSTSTGLRLDFPSDTIKLDTVFSRTASSTYTFWVNNRNDNGVKLQSVRLKRGNQTGFRVNVDGMY 0
protein3918 MNGGKNMNQNNQNEMQIIDSSSNDFSQSNRYPRYPLAKESNYKDWLASCDESNVDTLSTTSDVKGSVSRVLGIVNQILGFLGLGF 0
protein000021 MSNDIYGSSTELIANSIYETDYHVLLGIRNSNILFMTPHGGGVETGATELSIASGGTDHNYYCFEGWRTSNNGDMHVTSANFNEPVC 0
protein9439 MKKKVSMMLTCVLLAPLFLNGNAPVAHAGDPFLITSIDEPTIDREGLIGYYYREDQFKNLQLFTPTRNHTLVYDQGTARDLLADSQQQ 1
protein8184 MNQKKYIFMKPISILSIVCFCVSITPTSSLADMYRSRGNFTSKNENTKHTNEYYPRAIFNPYIEPAPEIITETRFASIKSTDTIAITTKNHPK 0
protein2126 MTKNHKKILSMTLVTSMLAGTYIPTAYTAFAETEQKEGSQENQTGLINKGSLPLDSYGLFENPYKGVTFDQFMNAFNNNTWNPLLV 0
protein9438 MKKKITKTLLCATMGISILTPLAVSAKTEDNNEQQLITQINQRENSFPNVGLGTQWLFQYYDKYLRANGLLRVAPVVTVEDLEVKNSY 0
• 5000+ amino acid sequences
and activity scores 0-5.
• *0-5: inactive to highly active
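A table in this toxin / sequence / score layout has to be turned into FASTA records before a sequence-feature toolkit can read it. The sketch below shows one way that conversion might look. The function name `rows_to_fasta` and the two short example rows are hypothetical and are not taken from the presentation.

```python
from io import StringIO


def rows_to_fasta(rows):
    """Render (name, sequence, score) rows as FASTA text.

    Scores are kept out of the FASTA records and returned separately,
    keyed by protein name, so they can be joined to the features later.
    """
    out = StringIO()
    labels = {}
    for name, seq, score in rows:
        out.write(f">{name}\n{seq}\n")
        labels[name] = int(score)
    return out.getvalue(), labels


# Hypothetical rows in the same toxin / sequence / score layout.
rows = [
    ("proteinA", "MNSYQNQYEILESS", 0),
    ("proteinB", "MKKKVSMMLTCVLL", 1),
]
fasta_text, labels = rows_to_fasta(rows)
print(fasta_text)
print(labels)
```

Keeping the labels in a separate mapping means the activity score cannot leak into the feature matrix by accident.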
iFeature.py: 37 FEATURE DESCRIPTORS
• python iFeature.py --file examples/test-protein.txt --type CKSAAP
(k-spaced Amino Acid Pairs)
• python iFeature.py --file examples/test-protein.txt --type DDE
• iFeaturePseKRAAC.py: program used to extract the 16 types
of pseudo K-tuple reduced amino acid
composition (PseKRAAC) feature
descriptors.
• feaSelector.py: program used to implement the
feature selection algorithms
• cluster.py: program used for running the feature or
sample clustering algorithms.
• pcaAnalysis.py: three dimensionality reduction
algorithms (PCA, LDA and t-SNE)
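To make "k-spaced amino acid pairs" concrete, here is a minimal pure-Python sketch of what such a descriptor counts. It is an illustration and not iFeature's own code. The gap range 0 to 5 and the per-gap normalization are assumptions, chosen so that the output has 6 x 400 = 2400 values.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(sequence, max_gap=5):
    """Frequencies of residue pairs separated by 0..max_gap positions.

    Assumption: each gap's 400 counts are divided by the number of
    valid pairs seen at that gap, so every gap block sums to 1.
    """
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # skip non-standard residues
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


vector = cksaap("MNSYQNQYEILESSSNNTNMPNRYPFANDP")
print(len(vector))  # 6 gaps x 400 pairs
```

Each block of 400 values describes one gap size, so the vector grows linearly with `max_gap`.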
LIST OF VARIOUS DESCRIPTORS
CALCULATED BY iFeature
Descriptor groups        Descriptor                                            Dimn.
AA composition           Amino acid composition (AAC)                          20
                         Enhanced amino acid composition (EAAC)                -
                         Composition of k-spaced AA pairs (CKSAAP)             2400
                         Dipeptide composition (DPC)                           400
                         Dipeptide deviation from expected mean (DDE)          400
                         Tripeptide composition (TPC)                          8000
Grouped AA composition   Grouped amino acid composition (GAAC)                 5
                         Enhanced grouped AA composition (EGAAC)               -
                         Composition of k-spaced AA group pairs (CKSAAGP)      150
                         Grouped dipeptide composition (GDPC)                  25
                         Grouped tripeptide composition (GTPC)                 125
Binary                   Binary (BINARY)                                       -
Autocorrelation          Moran (Moran)                                         240
                         Geary (Geary)                                         240
                         Normalized Moreau-Broto (NMBroto)                     240
C/T/D                    Composition (CTDC)                                    39
                         Transition (CTDT)                                     39
                         Distribution (CTDD)                                   195
Conjoint triad           Conjoint triad (CTriad)                               343
                         Conjoint k-spaced triad (KSCTriad)                    343 x (k+1)
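Several dimensions in the table follow directly from counting k-mers over an alphabet: 20, 400 and 8000 for single residues, dipeptides and tripeptides, and 5, 25 and 125 for the grouped versions. The sketch below illustrates that arithmetic. The function `kmer_composition` is hypothetical, and the five-group residue assignment is an assumption rather than something stated in the slides.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed five physicochemical groups covering all 20 residues.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
GROUP_ALPHABET = "01234"
RESIDUE_TO_GROUP = {
    aa: GROUP_ALPHABET[i]
    for i, members in enumerate(GROUPS.values())
    for aa in members
}


def kmer_composition(sequence, k, alphabet, mapping=None):
    """Relative frequency of every k-mer over `alphabet`.

    If `mapping` is given, residues are first translated to group
    symbols, which yields the grouped descriptors.
    """
    if mapping is not None:
        sequence = "".join(mapping[aa] for aa in sequence if aa in mapping)
    keys = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    windows = max(len(sequence) - k + 1, 0)
    for i in range(windows):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return {key: (counts[key] / windows if windows else 0.0) for key in keys}


seq = "MKKKVSMMLTCVLLAPLFLNGNAPVAHAGDPF"
for k in (1, 2, 3):
    plain = kmer_composition(seq, k, AMINO_ACIDS)
    grouped = kmer_composition(seq, k, GROUP_ALPHABET, RESIDUE_TO_GROUP)
    print(k, len(plain), len(grouped))
```

The tripeptide descriptor alone contributes 8000 columns, which is one reason feature selection becomes necessary.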
CHALLENGES IN USING SEQUENCE BASED
ML APPROACHES
1. Data preparation
What to explore in the data? Only 2 variables: sequences and assay values.
No independent variables!
2. Feature extraction
Need to generate features using sequences (iFeature).
3. Feature selection
1000s of features; which ones to select? What do these features explain?
Which features are meaningful for protein function prediction?
4. Model building
Which model to choose?
5. Performance of the models
Confusion matrix. Does it make sense biologically?
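With thousands of generated features, one common first pass at the "which ones to select?" question is to drop columns that barely vary across proteins. The sketch below shows that idea only. It is not the selection method used in this work, and the function name `drop_low_variance`, the threshold and the toy matrix are all hypothetical.

```python
from statistics import pvariance


def drop_low_variance(matrix, names, threshold=1e-6):
    """Keep only feature columns whose variance exceeds `threshold`.

    `matrix` is a list of rows (one per protein); `names` labels
    the columns in the same order.
    """
    columns = list(zip(*matrix))
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in matrix]
    return kept_rows, kept_names


# Toy feature matrix: three proteins, three dipeptide frequencies.
names = ["AA", "AC", "AD"]
matrix = [
    [0.1, 0.0, 0.5],
    [0.3, 0.0, 0.5],
    [0.2, 0.0, 0.4],
]
reduced, kept = drop_low_variance(matrix, names)
print(kept)
```

A filter like this removes only uninformative columns; choosing among the remaining features still calls for importance scores and domain knowledge.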
toxin sequence score
protein3345 MNSYQNQYEILESSSNNTNMPNRYPFANDPNIFPINLDACQGRPWQDTWKSVSDIVTIGTYLIQFLREPGIGGIPVILSIINK 0
protein10357 MSDLEVKIGVNPADVRYTANFKVAPNDGYVMYEKNTPIIPEIGVNITVINTGREEMEVHYEWAPPFGGWQCASTTIIPPDG 0
protein7062 MSINIDPSKEFVKVSNFAGYEIATSQDSEEEGANLIIYYTADPYLLFYLDEERNNGILVSRRTGFVIGVKSGSNKDGELIIQCEW 0
protein000023 MKICVVNILLGLLMIVGESAANIGYADLTTNVYFVATIKSSTCQMSLEGGTAGGGDSYTIPVGSNGKVGAIDIINGTENAMA 0
protein3518 MKSISKKVMAGLLVGATSLSIWAPISEAAAPENNRYYNIALKSNTKKVWNVSQASNDNDRAIVLWQGGSADHERFAFFQ 0
protein10355 MGIKKTIKFILCLSISLCILNYPSISFAETLDTNSSSVKSKSDIDTGIANLNYNNREVLAVNGDRVDSFVPKEGLNSNDKFIVVER 0
protein5481 MENSNYFEKNNFSQEDSALDSLLNTFLVIQNKKTNQVIGRPEHYIQKGIITYYFINLENEADIPEQQLILYKLDNKSYYIVSRNK 0
protein000025 MKRIFFFIPLILGLVACADDDSFSTSTGLRLDFPSDTIKLDTVFSRTASSTYTFWVNNRNDNGVKLQSVRLKRGNQTGFRVNV 0
protein3918 MNGGKNMNQNNQNEMQIIDSSSNDFSQSNRYPRYPLAKESNYKDWLASCDESNVDTLSTTSDVKGSVSRVLGIVNQILG 0
protein000021 MSNDIYGSSTELIANSIYETDYHVLLGIRNSNILFMTPHGGGVETGATELSIASGGTDHNYYCFEGWRTSNNGDMHVTSANF 0
protein9439 MKKKVSMMLTCVLLAPLFLNGNAPVAHAGDPFLITSIDEPTIDREGLIGYYYREDQFKNLQLFTPTRNHTLVYDQGTARDLLA 1
protein8184 MNQKKYIFMKPISILSIVCFCVSITPTSSLADMYRSRGNFTSKNENTKHTNEYYPRAIFNPYIEPAPEIITETRFASIKSTDTIAITT 0
protein2126 MTKNHKKILSMTLVTSMLAGTYIPTAYTAFAETEQKEGSQENQTGLINKGSLPLDSYGLFENPYKGVTFDQFMNAFNNNTW 0
protein9438 MKKKITKTLLCATMGISILTPLAVSAKTEDNNEQQLITQINQRENSFPNVGLGTQWLFQYYDKYLRANGLLRVAPVVTVEDL 0
NUMEROUS SEQUENCE FEATURES
WERE GENERATED USING iFeature
MODEL EVALUATION
RANDOM FOREST WAS THE FAVORITE
KEY LEARNINGS
FEATURES
▰iFeature - ‘all in one package’
▰Very few independent variables
before using iFeature and too
many after using iFeature.
▰Use not only feature importance but
also domain knowledge to choose input
variables (e.g. k-spaced pairs, conjoint
triad).
DATA
▰Data bias can be overcome
using domain knowledge – 0:
inactive; 1-5: active (multinomial
to binomial).
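The multinomial-to-binomial step can be pictured as collapsing the six assay scores into two classes. A minimal sketch follows, assuming that a score of 0 means inactive and any score from 1 to 5 counts as active. The score list is made up for illustration.

```python
from collections import Counter


def binarize(score):
    """Map an activity score in 0..5 to 0 (inactive) or 1 (active)."""
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return 0 if score == 0 else 1


# Hypothetical assay scores; most nominated proteins are inactive.
scores = [0, 0, 0, 1, 0, 3, 0, 5, 0, 2]
labels = [binarize(s) for s in scores]
print(Counter(scores))
print(Counter(labels))
```

Pooling the sparse classes 1 to 5 gives the model a single, better-populated positive class to learn from.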
MODEL BUILDING
▰Build multiple models instead of
one or two and choose the best
based on business needs and
parameters.
▰Where multiple models perform
equally, select the model based on
business needs / domain
knowledge (cost of false positives vs.
false negatives) – sensitivity and
specificity.
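The trade-off between false positives and false negatives can be read off a confusion matrix. The sketch below computes sensitivity and specificity for binary labels. The label vectors are invented for illustration and do not reproduce results from this work.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = active."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up labels: 3 active and 5 inactive proteins.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(round(sens, 3), round(spec, 3))
```

If missing an active protein costs more than testing an inactive one, a model with higher sensitivity may be preferable even at lower specificity.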
OTHER APPLICATIONS
▰iFeature and the approach above can also
be applied to identifying disease-related
proteins and to protein-protein interaction studies.
THANKS!
https://www.linkedin.com/in/karnam-vasudeva-rao-phd-9032759/
vasukarnam@gmail.com
vkarnam@monsanto.com; vasudevarao.karnam@bayer.com
Senior Scientist - Data Science,
Monsanto (Subsidiary of
Bayer), Bengaluru, India.

Prediction of proteins for insecticidal activity using python toolkit iFeature
