Data Con LA 2020
Description
In machine learning, supervised problems fall into one of two categories: classification or regression. Within classification, several of the metrics and graphs used to assess model performance only work for problems that compute the decision boundary between two classes (binary classification). With greater adoption of machine learning, organizations now find themselves determining decision boundaries between several classes (multi-class classification). The usual question that arises is: how can one set up a multi-class problem and assess its performance? Although extensions of binary performance metrics do exist for this situation, there are a number of challenges worth considering. Because they suffer from limitations such as insufficient data samples and class imbalance, multi-class experiments can be unreliable for several machine learning problems. As a work-around, we compare and contrast several approaches to redesigning a multi-class classification problem as binary classification. We further elucidate the best experimental design for assessing the final decisions of our model(s). The experiments in this case study determine the taxonomic levels of several COVID-19 viral genomes in order to identify the pathogenic strains, based on digital signal and chaos-inspired features.
Talk Main Points:
* What is multi-class classification?
* Compare and contrast the performance of multi-class and binary class problems
* Transforming a multi-class problem into a binary class problem
* Assessing limitations of each transformation approach in the process of COVID-19 viral taxonomy classification
Speaker
Rishov Chatterjee, City of Hope, Data Scientist
Usual Questions with Unusual Answers: Application of Multi-class Supervised Algorithms to Identify COVID-19 Viral Strains
1. Usual Questions with Unusual Answers: Application of Multi-class Supervised Algorithms to Identify COVID-19 Viral Strains
Rishov Chatterjee
Research Informatics Division, Center for Informatics, City of Hope, CA
3. Introduction
● Taxonomic classification: finding the identity of a certain virus [1]
● For unknown, potentially harmful pathogens, classification can help uncover patterns from the closest known pathogens
● 10 taxonomic levels for a viral genome, each with 1 or more sublevels
Taxonomic ranks, ordered from few taxa to many taxa:
Realm, Subrealm, Kingdom, Subkingdom, Phylum, Subphylum, Class, Subclass, Order, Suborder, Family, Subfamily, Genus, Subgenus, Species
Adapted From: https://talk.ictvonline.org/
[1] Richards R, Biological Classification: A Philosophical Introduction, Cambridge University Press, 2016
4. Objective: SARS-CoV-2 Sequence Classification
Taxonomic Level: Sublevels
Realm: Duplodnaviria, Monodnaviria, Riboviria, Varidnaviria
Kingdom: Orthornavirae, Pararnavirae
Phylum: Duplornaviricota, Kitrinoviricota, Lenarviricota, Negarnaviricota, Pisuviricota
Class: Duplopiviricetes, Pisoniviricetes, Stelpaviricetes
Order: Nidovirales, Picornavirales, Sobelivirales
Suborder: Arnidovirineae, Cornidovirineae, Mesnidovirineae, Monidovirineae, Nanidovirineae, Ronidovirineae, Tornidovirineae
Family*: Coronaviridae
Subfamily: Orthocoronavirinae, Torovirinae, Coronavirinae
Genus: Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus
Subgenus: Embecovirus, Merbecovirus, Nobecovirus, Sarbecovirus
● Simplify classification and prevent data leakage by creating a new feature to classify SARS-CoV-2 sequences into a sublevel at each of 9 out of the 10 taxonomic levels [2].
[2] https://talk.ictvonline.org/taxonomy/
5. Multi-class Classification
● Multiclass classification classifies instances into one of three or more classes.
● There are fewer multiclass classifiers than binary classifiers [3].
● Multi-class classification is usually more difficult to optimize than binary classification.
Visualization from Jason Brownlee, 4 Types of Classification Tasks in Machine Learning. https://machinelearningmastery.com/types-of-classification-in-machine-learning/
[3] Aly, Mohamed (2005). "Survey on multiclass classification methods". Technical Report, Caltech.
6. Classification Task Reduction
● Multiclass classification can be reduced to several binary classification problems.
● The most common reduction strategies are One vs Rest (OvR, also called One vs All, OvA) and One vs One (OvO).
One-vs-One:
Credits: Zhang et al., ScienceDirect
Credits: Jatin Nanda, Georgia Tech
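To make the two reductions concrete, the sketch below enumerates the binary tasks each strategy would create for a set of class labels. The function names are hypothetical and chosen for illustration; the four genus names are taken from the Genus row of the sublevel table, and no classifier is trained here.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others combined."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

genera = ["Alphacoronavirus", "Betacoronavirus",
          "Deltacoronavirus", "Gammacoronavirus"]

ovr = one_vs_rest_tasks(genera)
ovo = one_vs_one_tasks(genera)

# For K classes, OvR needs K binary models and OvO needs K * (K - 1) / 2.
print(len(ovr), "OvR tasks,", len(ovo), "OvO tasks")
```

With four classes this gives 4 OvR tasks and 6 OvO tasks. The gap widens quickly as the number of classes grows, which is one practical consideration when choosing between the two.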
7. Machine Learning Workflow
Classification Type: Multi-class problem, One vs Rest, One vs One
Optimized ML Metric: Accuracy
Sequences used for data: 50 genomic sequences chosen at random from each sublevel (200 total)
Accuracy illustration: 200 SARS-CoV-2 sequences, split 194 / 3 / 3 across predicted classes
Accuracy = 194 / 200 = 97.0%
8. Feature Engineering
Discrete Fourier Transform (DFT) [4]
● Finds the digital frequencies associated with numbers in a finite numeric sequence
● Prior study used the average magnitude of the Discrete Fourier transform for feature creation.
Shannon’s Entropy
● Finds the measure of the intrinsic uncertainty embedded within a sequence
● Based on the concept that all systems have a tendency towards disorder
[4] Randhawa G, Soltysiak M, Roz HE, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS One, 2020
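For reference, the standard textbook definitions of the two quantities are written out below. The symbols ($x_n$ for the numeric sequence of length $N$, $X_k$ for its DFT coefficients, $p_s$ for the relative frequency of symbol $s$) are notation introduced here for illustration. The slides do not state the logarithm base or any normalization, so none is fixed in these formulas.

```latex
% Discrete Fourier transform of a length-N numeric sequence x_0, ..., x_{N-1}
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \dots, N-1

% Average magnitude of the DFT coefficients
\overline{|X|} = \frac{1}{N} \sum_{k=0}^{N-1} |X_k|

% Shannon entropy of a sequence whose symbols s occur with relative frequency p_s
H = -\sum_{s} p_s \log p_s
```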
9. Conversion Rules for Genomic Digitization
Conversion Rule           A      T      C      G
Purine Pyrimidine (PP)    0      1      1      0
EIIP                      0.13   0.14   0.15   0.08
Just A                    1      0      0      0
Paired Numeric            1      1      -1     -1
Real                      1.5    -1.5   0.5    -0.5
Integer 1                 1      0      2      3
Integer 2                 2      1      3      4
Just C                    0      0      1      0
Just T                    0      1      0      0
Just G                    0      0      0      1
Illustration with the chosen conversion rule [5]: Purine Pyrimidine (PP)
CAGGTCAT…. = 10001101….
[5] Randhawa G, Soltysiak M, Roz HE, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS One, 2020
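A conversion rule is a lookup from nucleotide to number, so it can be sketched as a dictionary. The example below encodes the Purine Pyrimidine (PP) row of the table and applies it to the eight bases shown in the illustration. The names `PP_RULE` and `digitize` are hypothetical, and how the original study handled ambiguous bases (such as N) is not stated, so this sketch simply raises an error on them.

```python
# Purine Pyrimidine (PP) rule from the conversion table: A -> 0, T -> 1, C -> 1, G -> 0.
PP_RULE = {"A": 0, "T": 1, "C": 1, "G": 0}

def digitize(sequence, rule=PP_RULE):
    """Map each nucleotide to its numeric code; raises KeyError on unknown symbols."""
    return [rule[base] for base in sequence.upper()]

codes = digitize("CAGGTCAT")
print("".join(str(c) for c in codes))  # 10001101
```

Any other row of the table can be swapped in by passing a different dictionary as `rule`.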
10. Feature Engineering
Step 1: Convert genomic sequences to numbers with the PP conversion rule.
CGATAT → 100101
Step 2: Find the frequency distribution of each nucleotide using DFT.
[3, 0.5, 1.5, -1, 1.5, 0.5]
Step 3: Find the absolute value of the DFT (magnitude), then find the average.
Average Normalized Magnitude = 2.15
Step 4: Separately, find the entropy of the sequence.
Entropy = 1.33
Step 5: Divide the average magnitude by the entropy to find “magtropy”.
Magtropy = 1.62 (2.15 / 1.33)
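The five steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the talk's implementation. The entropy is computed over the nucleotide letters with the natural logarithm, a choice inferred because it gives about 1.33 for CGATAT. The real parts of the unnormalized DFT reproduce the list [3, 0.5, 1.5, -1, 1.5, 0.5]. The normalization behind the 2.15 average magnitude is not described in the slides, so the plain mean magnitude used here (about 1.58 for CGATAT) does not match it, and the resulting magtropy differs from 1.62 accordingly. All function names are hypothetical.

```python
import cmath
import math
from collections import Counter

PP_RULE = {"A": 0, "T": 1, "C": 1, "G": 0}  # Purine Pyrimidine conversion rule

def dft(values):
    """Unnormalized discrete Fourier transform of a numeric sequence."""
    n = len(values)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(values))
        for k in range(n)
    ]

def shannon_entropy(sequence):
    """Shannon entropy of the symbol frequencies (natural log: an assumption)."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def mean_magnitude(spectrum):
    """Plain average of |X_k|; the talk's normalization is unknown."""
    return sum(abs(x) for x in spectrum) / len(spectrum)

def magtropy(sequence):
    """Average DFT magnitude of the digitized sequence divided by its entropy."""
    codes = [PP_RULE[base] for base in sequence]
    return mean_magnitude(dft(codes)) / shannon_entropy(sequence)

seq = "CGATAT"
spectrum = dft([PP_RULE[base] for base in seq])
print([round(x.real, 1) for x in spectrum])   # real parts of the DFT
print(round(shannon_entropy(seq), 2))         # entropy in nats
print(round(magtropy(seq), 2))                # differs from the slide's 1.62
```

For full genomes, a library FFT would be the practical choice over this quadratic-time loop; the direct sum is used here only to keep the sketch dependency-free.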
11. Raw vs. Processed Data for Genus level
Raw data:
Sublevel_Seq#          Sequence
Betacoronavirus_1      ATCGCGAGA….
Betacoronavirus_2      ATCGGGTCG….
Alphacoronavirus_1     GATGCTGTA…...
Alphacoronavirus_2     GAGTCTCTA…..
Gammacoronavirus_1     AGGCCAAAT…...
Gammacoronavirus_2     AGGTCAAAT…...
Deltacoronavirus_1     CCGGTAATA...
Deltacoronavirus_2     CAGGTAAAC...
Processed data:
Sublevel               Magtropy Value
Betacoronavirus        111.59
Betacoronavirus        110.72
Alphacoronavirus       103.75
Alphacoronavirus       102.98
Gammacoronavirus       95.88
Gammacoronavirus       90.74
Deltacoronavirus       121.78
Deltacoronavirus       125.87
12. Machine Learning Workflow
Labels = Sublevels: Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus
Features = Magtropy Values (illustrated: 112, 110, 104, 103, 102, 95, 91, 122, 123, 124)
10-fold Cross Validation
70% = Training Set
30% = Testing Set
Example values shown in the split: 91, 104, 123
Model Generated
SARS-CoV-2 Magtropy values: 111, 112, 109
Classification Prediction: each of the 3 sequences predicted as Betacoronavirus
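The split and cross-validation steps in this workflow can be sketched with index bookkeeping alone. The slides do not say how the 70/30 split and the 10-fold cross validation relate, so this sketch assumes the folds are drawn from the training portion only. The sample count of 200 is used purely for illustration, the function names are hypothetical, and the splits are not stratified by sublevel.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.30, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists; fold sizes differ by at most one."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(200)
splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(splits))
```

Each sample in the training portion appears in exactly one validation fold, so averaging a metric over the ten folds uses every training sample once for validation while the held-out 30% stays untouched until the final check.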
13. SARS-CoV-2 Multi-class Results
● 87.3% mean classification accuracy
● 2.5% accuracy at the Phylum level with entropy alone, 100% with Magtropy
● Consistently best-performing classifiers: Extreme Gradient Boosting, Decision Tree
Figure labels: 100% Pisuviricota; 33% Riboviria; 67% Duplodnaviria; 97% Genus
14. OvR and OvO Results (Genus)
Metric                                     One vs Rest (LGBM)   One vs One (LGBM)
10-Fold CV Accuracy                        93.52%               88.46%
10-Fold CV Standard Deviation              5.01%                7.33%
Holdout Accuracy                           85.25%               81.97%
SARS-CoV-2 Sequence Prediction Accuracy    100%                 100%
15. Conclusions and Future Directions
• Though DFT and Shannon’s Entropy applied as distinct features in the ML model did not correctly classify SARS-CoV-2, combining them yielded a feature with substantially greater predictive power for all 3 classification designs.
• Magtropy can be applied to further genomic classification studies.
• One vs Rest performed better than One vs One at the Genus level.
• The methods developed are general enough to be applicable to genomic sequences from any organism.
16. Acknowledgements
● 2020 Research Informatics Interns
○ Anoushka Bhat
○ Esha Ananth
● Center for Informatics
○ Srisairam Achuthan, Ph.D.
○ Samir Courdy
○ Sorena Nadaf