Ali Kishk
Research Assistant @ Nile University
in Bioinformatics
Choosing a product based on reviews?
Sentiment Analysis Model
Post    Company   Happy / Sad
Post1   X         Happy
Post2   Y         Sad
Post3   X         Sad
Company with best reviews
Outline
● AI (Types, Development)
● Deep Learning (Architecture)
● Bioinformatics Fields
● Input formats for AI
● AI Challenges in Biology
● Example: (Proteomics, Transcriptomics)
● Metagenomics: @ NU
● Taxonomic Classification
● Phenotype Classification
● How to begin in AI in Bioinformatics
AI Applications
Why Artificial Intelligence?
Artificial Intelligence Classification
AI
● Supervised Learning (Classification, Regression)
● Unsupervised Learning
● Reinforcement Learning
Artificial Intelligence: Development
AI
Deep Learning
Classical Machine Learning
Bioinformatics Fields
Genomics
Transcriptomics
Proteomics
Input formats for AI
Sequence: ACTCTCTCTGCTACTCGCA
Image: ACTCTCTCTGCTACTCGCRA
Features: GC%, k-mer frequency, TF-IDF
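As a rough illustration of the "Features" input format, the sketch below computes GC% and overlapping k-mer frequencies for the example sequence from this slide. The function names and the choice k = 3 are assumptions made for illustration; TF-IDF weighting is not covered here.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: n / total for kmer, n in counts.items()}


seq = "ACTCTCTCTGCTACTCGCA"  # example sequence from the slide
gc = gc_percent(seq)
freqs = kmer_frequencies(seq, 3)  # k = 3 is an arbitrary choice
print(round(gc, 2))
print(freqs)
```

Each sequence becomes a fixed-length numeric vector this way, which is the form a classical classifier expects, at the cost of losing the order of the k-mers.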
Deep Learning Architectures
                    Multi-layer perceptron   Convolutional neural network   Recurrent neural network
Complexity          Low                      Medium                         High
Examples            -                        ResNet                         LSTM, GRU
Main applications   Tabular data             Computer vision                Sequence classification, machine translation
AI Challenges in Biology
● High # features
● Low # samples
● High # classes
● Feature sparsity
● Class imbalance
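One common way to handle class imbalance is to weight each class inversely to its frequency during training. The sketch below is a minimal version of that heuristic, n_samples / (n_classes * class_count); the labels and counts are hypothetical and are not taken from the talk.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical imbalanced dataset: 90 healthy samples, 10 CRC samples.
labels = ["healthy"] * 90 + ["CRC"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives the larger weight, so misclassifying it costs more in a weighted loss.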
AI Challenges in Biology
                       Genomics                           Transcriptomics                         Metagenomics
Unique analysis step   Variant calling                    Differential gene expression analysis   Taxonomic assignment
Analysis output        Variant call file (SNPs or CNVs)   Differentially expressed genes (DEGs)   Taxonomy table / OTU table
Feature sparsity       Very sparse                        Dense                                   Sparse
Number of features     1M to 10M                          10K to 50K                              OTU table: 100s to 1K; k-mer content: 4 ^ k-mer size
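The "4 ^ k-mer size" entry can be made concrete with a few lines of code. This is a sketch under assumed values of k; the helper names are hypothetical.

```python
def kmer_feature_count(k):
    """Number of possible DNA k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def observed_fraction(seq, k):
    """Fraction of all possible k-mers that occur in one sequence."""
    distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(distinct) / kmer_feature_count(k)


for k in (3, 6, 8):
    print(k, kmer_feature_count(k))

fraction = observed_fraction("ACTCTCTCTGCTACTCGCA", 3)
print(fraction)
```

The feature space grows exponentially with k while a single read only ever contains a small part of it, which is where the sparsity of k-mer features comes from.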
AI in Proteomics
Subcellular Localization
Input: Molecular weight, polarity, ...
Output: Location (nucleus, cytoplasm, extracellular, membrane)
Format: Protein sequence features
Wei, Leyi, et al. "Prediction of human protein subcellular localization using deep learning." Journal of Parallel and Distributed Computing 117 (2018): 212-217.
AI in Transcriptomics
Differential Expression Prediction
Input: Expression of 1,000 genes
Output: Expression of 20,000 genes
Format: Features (differential expression)
Subramanian, Aravind, et al. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell 171.6 (2017): 1437-1452.
AI in Metagenomics
1- AmpliconNet: Taxonomic Assignment using Deep Learning (Published)
2- Phenotype Classification using Machine Learning (Published)
3- Metafy: Phenotype Classification using Deep Learning (Not published)
General 16S rRNA analysis steps
1- Quality Control
2- Merging
3- Trimming & Filtration
4- Taxonomic Assignment
5- Statistical analysis
AI in Metagenomics
Taxonomic Assignment
Input: 16S rRNA gene sequence
Output: Phylum, Class, Order, Family, Genus
Format: Sequence itself
Kishk, Ali, and Mohamed El-Hadidi. "AmpliconNet: Sequence Based Multi-layer Perceptron for Amplicon Read Classification Using Real-time Data Augmentation." 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018.
Taxonomic assignment programs
          Alignment-based                                                Prediction-based / ML-based
Pros      Highly accurate                                                Faster in prediction; no reference database needed
Cons      Computationally expensive, more so at the sub-species level    Less accurate; most tools use k-mer frequency
Example   BLAST                                                          RDP, 16S Classifier
Training data in ML-based tools
1- Full 16S rRNA gene
2- Hypervariable region (HVR) specific
Image source: https://www.lcsciences.com/discovery/applications/genomics/16s-rrna-gene-sequencing-landing/16s-gene/
AmpliconNet Goal
1- Model the sequence directly rather than sequence features.
2- Reduce taxonomic classification time from weeks to hours using DL (on GPU and TPU).
3- Use a simple neural network (a model trainable on an average PC).
AmpliconNet Approach
- HVR-specific model
- Direct sequence modeling
- Manual search of many DL architectures
- Data augmentation
Sequence Input Example
“I like this movie.”
“I don’t like this movie.”
Sorted Corpus
1. Don’t
2. Like
3. I
4. This
5. Movie
One Hot Encoding
3 2 4 5 >>> HAPPY
3 1 2 4 5 >>> SAD
Vocab_size
Zero Padding
3 2 4 5 0 >>> HAPPY
3 1 2 4 5 >>> SAD
DNA Words!
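The encoding steps above can be sketched in a few lines: first the movie-review example with the word indices from the slide, then the same idea applied to DNA, where overlapping k-mers play the role of words. The helper names, the toy reads, k = 3, and the use of overlapping k-mers are all assumptions made for illustration and are not taken from the AmpliconNet code. The code uses a plain apostrophe in "don't".

```python
def kmer_words(seq, k):
    """Split a DNA sequence into overlapping k-mer "words"."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(token_lists):
    """Map each distinct token to an integer ID; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(tokens, vocab):
    """Replace each token by its integer ID."""
    return [vocab[token] for token in tokens]


def zero_pad(ids, length):
    """Right-pad with zeros (or truncate) to a fixed length."""
    return ids[:length] + [0] * max(0, length - len(ids))


# The movie-review example, using the word indices shown on the slide.
word_vocab = {"don't": 1, "like": 2, "i": 3, "this": 4, "movie": 5}
happy = encode("i like this movie".split(), word_vocab)
sad = encode("i don't like this movie".split(), word_vocab)
max_len = max(len(happy), len(sad))
print(zero_pad(happy, max_len))  # [3, 2, 4, 5, 0]
print(zero_pad(sad, max_len))    # [3, 1, 2, 4, 5]

# The same steps applied to DNA: k-mers take the place of words.
reads = ["ACTCTG", "ACTC"]  # hypothetical toy reads
k = 3                       # assumed k-mer size
read_words = [kmer_words(read, k) for read in reads]
dna_vocab = build_vocab(read_words)
longest = max(len(words) for words in read_words)
padded_reads = [zero_pad(encode(w, dna_vocab), longest) for w in read_words]
print(padded_reads)  # [[1, 2, 3, 4], [1, 2, 0, 0]]
```

Unlike a k-mer frequency vector, this representation keeps the position of every k-mer in the read.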
AmpliconNet vs RDP
                         RDP*                         AmpliconNet
Input type               Sequence features            Sequence itself
Input                    k-mer frequency              Sequence of k-mers
Model                    Naive Bayes                  Multilayer perceptron
16S rRNA gene training   The whole gene (one model)   HVR specific (multiple models)
Advantage                Faster in training           Preserves k-mer position
* Wang, Qiong, et al. "Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy." Appl. Environ. Microbiol. 73.16 (2007): 5261-5267.
AmpliconNet vs RDP: F1 score on the V2 HVR
F1 score for AmpliconNet and RDP on the simulated test dataset
              Phylum   Class    Order    Family   Genus
# Classes     28       50       162      348      1853
AmpliconNet   99.85%   99.80%   99.49%   99.14%   95.55%
RDP           99.86%   99.81%   99.50%   99.13%   89.79%
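For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes a macro-averaged F1 over several classes. Whether the reported scores are macro- or weighted-averaged is not stated in the slides, so the averaging choice here is an assumption, and the labels are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical true and predicted taxa for five reads.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

With many classes, as at the genus level, macro averaging gives rare taxa the same influence on the score as common ones.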
AmpliconNet paper
Publication: BIBM 2018 conference paper
AmpliconNet code:
GitHub: https://github.com/ali-kishk/AmpliconNet
2- Phenotype Classification using ML:
Goal:
Use a classifier on the raw data to avoid alignment time.
Techniques:
k-mer frequency
SVM, Random Forest
Results:
78% test F1 on CRC data (277 samples)
2- Phenotype Classification of Colon Cancer
Publication: CIBEC 2018 conference paper
3- Phenotype Classification using Deep Learning:
Goal:
Reduce the number of features using deep learning
Results:
                 MicroPheno*       Metafy (our approach)
Input            k-mer frequency   k-mer frequency
Model            SVM / RF / MLP    DL feature extraction + SVM
Test F1 on CRC   84%               89%
* Asgari, Ehsaneddin, et al. "MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples." Bioinformatics 34.13 (2018): i32-i42.
How to begin in AI in Bioinformatics:
1- Choose an omics field (genomics, transcriptomics, ...)
2- Choose an unsolved problem, or one that needs optimization
3- Choose an NGS platform
4- Search for recent models / architectures for this problem
5- Keep yourself updated (Google Scholar Alerts)
6- Define the problem's challenges (e.g. class imbalance)
7- Search for recent models / architectures for these challenges.
Conclusion
Strengthen yourself in AI first on general problems (NLP, Vision)
Avoid the AI hype
Thank You
Contact:
ph.ali.kish@gmail.com

AI in Bioinformatics
