Usual Questions with Unusual Answers:
Application of Multi-class Supervised
Algorithms to Identify COVID-19 Viral Strains
Rishov Chatterjee
Research Informatics Division, Center for Informatics, City of Hope, CA
Outline
● Introduction
● Multi-class Classification
● Classification Task Reduction
● Data Processing / Feature Engineering
● Results
● Conclusions / Future Directions
Introduction
● Taxonomic classification: finding the identity of a given virus [1]
● For unknown, potentially harmful pathogens, classification can help uncover patterns shared with the closest known pathogens
● 10 taxonomic levels for a viral genome, each with 1 or more sublevels
[Figure: taxonomic ranks ordered from few taxa to many taxa: Realm, Subrealm, Kingdom, Subkingdom, Phylum, Subphylum, Class, Subclass, Order, Suborder, Family, Subfamily, Genus, Subgenus, Species. Adapted from: https://talk.ictvonline.org/]
[1] Richards R, Biological Classification: A Philosophical
Introduction, Cambridge University Press, 2016
Objective: SARS-CoV-2 Sequence Classification
Taxonomic Level    Sublevels
Realm              Duplodnaviria, Monodnaviria, Riboviria, Varidnaviria
Kingdom            Orthornavirae, Pararnavirae
Phylum             Duplornaviricota, Kitrinoviricota, Lenarviricota, Negarnaviricota, Pisuviricota
Class              Duplopiviricetes, Pisoniviricetes, Stelpaviricetes
Order              Nidovirales, Picornavirales, Sobelivirales
Suborder           Arnidovirineae, Cornidovirineae, Mesnidovirineae, Monidovirineae, Nanidovirineae, Ronidovirineae, Tornidovirineae
Family*            Coronaviridae
Subfamily          Orthocoronavirinae, Torovirinae, Coronavirinae
Genus              Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus
Subgenus           Embecovirus, Merbecovirus, Nobecovirus, Sarbecovirus
● Simplify classification and prevent data leakage by creating a new feature used to classify SARS-CoV-2 sequences into a sublevel at 9 of the 10 taxonomic levels [2].
[2] https://talk.ictvonline.org/taxonomy/
Multi-class Classification
[3] Aly M, Survey on multiclass classification methods, Technical Report, Caltech, 2005
Visualization from Jason Brownlee, 4 Types of Classification Tasks in Machine Learning, https://machinelearningmastery.com/types-of-classification-in-machine-learning/
● Multiclass classification classifies instances into one of
three or more classes.
● There are fewer multiclass classifiers than binary classifiers [3].
● Multi-class classification is usually more difficult to
optimize than binary classification.
Classification Task Reduction
● Multiclass classification can be reduced to become several binary classifiers.
● Most common reduction strategies that currently exist are One vs All (OvA, OvR) and
One vs One (OvO).
[Figures: reduction strategies, including One-vs-One. Credits: Zhang et al., ScienceDirect; Jatin Nanda, Georgia Tech]
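The two reduction strategies differ mainly in how many binary problems they create. The sketch below only enumerates those binary subproblems for the four genus sublevels; it does not train anything, and the helper names `one_vs_rest_tasks` and `one_vs_one_tasks` are illustrative, not taken from the talk.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all remaining classes."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


genera = [
    "Alphacoronavirus",
    "Betacoronavirus",
    "Deltacoronavirus",
    "Gammacoronavirus",
]

ovr = one_vs_rest_tasks(genera)
ovo = one_vs_one_tasks(genera)

# For k classes, One vs Rest needs k binary classifiers and
# One vs One needs k * (k - 1) / 2 of them.
print(len(ovr), "One-vs-Rest tasks")
print(len(ovo), "One-vs-One tasks")
```

For k = 4 sublevels this gives 4 One-vs-Rest tasks and 6 One-vs-One tasks, so One vs One grows faster as the number of sublevels increases.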
Machine Learning Workflow
● Classification type: multi-class problem, One vs Rest, One vs One
● Optimized ML metric: accuracy
● Sequences used for data: 50 genomic sequences chosen at random from each sublevel (200 total)
[Figure: 200 SARS-CoV-2 sequences split 194 / 3 / 3 across predicted groups; Accuracy = 194 / 200 = 97.0%]
Feature Engineering
Discrete Fourier Transform (DFT) [4]
● Finds the digital frequencies associated with the numbers in a finite numeric sequence
● A prior study used the average magnitude of the DFT for feature creation
Shannon’s Entropy
● Measures the intrinsic uncertainty embedded within a sequence
● Based on the concept that all systems have a tendency towards disorder
[4] Randhawa G, Soltysiak M, El Roz H, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS ONE, 2020
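For reference, the two quantities can be written out as follows. This is the standard textbook form, not a formula quoted from the slides. The symbols are assumptions: x_n is the digitized sequence of length N, and p_b is the relative frequency of nucleotide b. The natural logarithm is also an assumption, chosen because it reproduces the entropy of 1.33 shown for CGATAT in the worked example; the slides do not state the base.

```latex
% DFT of a digitized sequence x_0, ..., x_{N-1}
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1

% Average magnitude of the spectrum
\bar{M} = \frac{1}{N} \sum_{k=0}^{N-1} \lvert X_k \rvert

% Shannon entropy over nucleotide frequencies (natural log assumed)
H = -\sum_{b \in \{A, T, C, G\}} p_b \ln p_b
```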
Conversion Rules for Genomic Digitization
Conversion Rule          A      T      C      G
Purine Pyrimidine (PP)   0      1      1      0
EIIP                     0.13   0.14   0.15   0.08
Just A                   1      0      0      0
Paired Numeric           1      1      -1     -1
Real                     1.5    -1.5   0.5    -0.5
Integer 1                1      0      2      3
Integer 2                2      1      3      4
Just C                   0      0      1      0
Just T                   0      1      0      0
Just G                   0      0      0      1
Illustration with the chosen conversion rule [5], Purine Pyrimidine (PP):
CAGGTCAT… = 10001101…
[5] Randhawa G, Soltysiak M, El Roz H, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS ONE, 2020
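A conversion rule is a lookup table from nucleotide to number. The sketch below encodes a few rows of the table as dictionaries and applies the PP rule to the illustrated fragment. The names `RULES` and `digitize` are illustrative, and only some of the rules are included to keep the example short.

```python
# Conversion rules copied from the table above (values for A, T, C, G).
RULES = {
    "PP": {"A": 0, "T": 1, "C": 1, "G": 0},
    "EIIP": {"A": 0.13, "T": 0.14, "C": 0.15, "G": 0.08},
    "Just A": {"A": 1, "T": 0, "C": 0, "G": 0},
    "Paired Numeric": {"A": 1, "T": 1, "C": -1, "G": -1},
    "Real": {"A": 1.5, "T": -1.5, "C": 0.5, "G": -0.5},
}


def digitize(sequence, rule="PP"):
    """Map each nucleotide to its number under the named rule.

    Characters outside A/T/C/G (for example N) raise KeyError here;
    how the original study handled them is not stated in the slides.
    """
    table = RULES[rule]
    return [table[base] for base in sequence.upper()]


pp_signal = digitize("CAGGTCAT", rule="PP")
print("".join(str(value) for value in pp_signal))  # 10001101
```

The printed string matches the slide's illustration for the first eight bases.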
Feature Engineering
Step 1: Convert genomic sequences to numbers with the PP conversion rule
CGATAT → 100101
Step 2: Find the frequency distribution of each nucleotide using the DFT
[3, 0.5, 1.5, -1, 1.5, 0.5]
Step 3: Find the absolute value of the DFT (magnitude), then find the average
Average normalized magnitude = 2.15
Step 4: Separately, find the entropy of the sequence
Entropy = 1.33
Step 5: Divide the average magnitude by the entropy to find “magtropy”
Magtropy = 2.15 / 1.33 = 1.62
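The five steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the author's implementation. Entropy is computed over nucleotide frequencies with the natural logarithm, which reproduces the 1.33 shown for CGATAT. The slides do not specify how the magnitude is normalized, so `mean_magnitude` is a plain unnormalized average and is not expected to reproduce the 2.15 on the slide. All function names are illustrative.

```python
import cmath
import math
from collections import Counter

# Purine Pyrimidine (PP) rule: A=0, T=1, C=1, G=0.
PP_RULE = {"A": 0, "T": 1, "C": 1, "G": 0}


def digitize(sequence):
    """Step 1: convert nucleotides to numbers with the PP rule."""
    return [PP_RULE[base] for base in sequence.upper()]


def dft(signal):
    """Step 2: direct O(N^2) discrete Fourier transform of a numeric sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def mean_magnitude(signal):
    """Step 3: average absolute value of the DFT (no normalization applied)."""
    spectrum = dft(signal)
    return sum(abs(c) for c in spectrum) / len(spectrum)


def shannon_entropy(sequence):
    """Step 4: entropy of the nucleotide frequencies (natural log assumed)."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def magtropy(sequence):
    """Step 5: average magnitude divided by entropy.

    Raises ZeroDivisionError for a sequence made of a single repeated base,
    because its entropy is zero.
    """
    return mean_magnitude(digitize(sequence)) / shannon_entropy(sequence)


signal = digitize("CGATAT")
print(signal)                                    # [1, 0, 0, 1, 0, 1]
print([round(c.real, 2) for c in dft(signal)])   # [3.0, 0.5, 1.5, -1.0, 1.5, 0.5]
print(round(shannon_entropy("CGATAT"), 2))       # 1.33
print(magtropy("CGATAT"))
```

The real parts of the DFT match the vector shown in Step 2, which suggests the slide lists real components rather than magnitudes.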
Raw vs. Processed Data for Genus level
Raw data:
Sublevel_Seq#         Sequence
Betacoronavirus_1     ATCGCGAGA….
Betacoronavirus_2     ATCGGGTCG….
Alphacoronavirus_1    GATGCTGTA…...
Alphacoronavirus_2    GAGTCTCTA…..
Gammacoronavirus_1    AGGCCAAAT…...
Gammacoronavirus_2    AGGTCAAAT…...
Deltacoronavirus_1    CCGGTAATA...
Deltacoronavirus_2    CAGGTAAAC...
Processed data:
Sublevel              Magtropy Value
Betacoronavirus       111.59
Betacoronavirus       110.72
Alphacoronavirus      103.75
Alphacoronavirus      102.98
Gammacoronavirus      95.88
Gammacoronavirus      90.74
Deltacoronavirus      121.78
Deltacoronavirus      125.87
Machine Learning Workflow
● Labels = sublevels (Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus)
● Features = magtropy values (e.g., 112, 110, 104, 103, 102, 95, 91, 122, 123, 124)
● 10-fold cross-validation; 70% = training set, 30% = testing set
● Model generated, then applied to the SARS-CoV-2 magtropy values (111, 112, 109)
● Classification prediction: each of the 3 sequences predicted as Betacoronavirus
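To make the labels-and-features setup concrete, the sketch below fits a nearest-centroid classifier to the eight genus-level magtropy values from the "Raw vs. Processed Data" slide and applies it to the three SARS-CoV-2 values. Nearest centroid is a deliberately simple stand-in chosen for this illustration. The talk reports gradient-boosted and tree-based models, which are not reproduced here.

```python
from collections import defaultdict

# Genus-level magtropy values copied from the processed-data table.
TRAINING_ROWS = [
    ("Betacoronavirus", 111.59),
    ("Betacoronavirus", 110.72),
    ("Alphacoronavirus", 103.75),
    ("Alphacoronavirus", 102.98),
    ("Gammacoronavirus", 95.88),
    ("Gammacoronavirus", 90.74),
    ("Deltacoronavirus", 121.78),
    ("Deltacoronavirus", 125.87),
]


def fit_centroids(rows):
    """Average the single feature (magtropy) per label."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def predict(centroids, value):
    """Return the label whose centroid is closest to the given magtropy value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


centroids = fit_centroids(TRAINING_ROWS)
sars_cov_2_magtropy = [111, 112, 109]  # values shown in the workflow figure
predictions = [predict(centroids, value) for value in sars_cov_2_magtropy]
print(predictions)  # all three are 'Betacoronavirus'
```

Even this one-feature baseline assigns all three values to Betacoronavirus, consistent with the prediction shown in the workflow.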
SARS-CoV-2 Multi-class Results
● 87.3% mean classification accuracy
● 2.5% accuracy at the Phylum level with entropy alone, 100% with magtropy
● Consistently best-performing classifiers: Extreme Gradient Boosting, Decision Tree
[Figure labels: 100% Pisuviricota; 33% Riboviria; 67% Duplodnaviria; 97% Genus]
OvR and OvO Results (Genus)
Metric                                     One vs Rest (LGBM)    One vs One (LGBM)
10-Fold CV Accuracy                        93.52%                88.46%
10-Fold CV Standard Deviation              5.01%                 7.33%
Holdout Accuracy                           85.25%                81.97%
SARS-CoV-2 Sequence Prediction Accuracy    100%                  100%
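The table reports a 10-fold cross-validation mean and standard deviation alongside a holdout accuracy. The sketch below shows one way such partitions and summaries could be produced with the standard library. It assumes 200 sequences, a 70/30 split with cross-validation run on the training portion, and a sample standard deviation. None of these details are confirmed by the slides, and the helper names are illustrative.

```python
import random
import statistics


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle indices and reserve a fraction as the holdout (testing) set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def summarize(fold_accuracies):
    """Mean and sample standard deviation of per-fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)


train_idx, test_idx = holdout_split(200)
folds = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx))                    # 140 60
print([len(validation) for _, validation in folds])     # ten folds of 14
```

A model would be fit on each `training` list and scored on the matching `validation` list; passing the ten scores to `summarize` gives the kind of mean and spread reported in the table.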
Conclusions and Future Directions
• Although DFT and Shannon’s entropy, applied as separate features in the ML model, did not correctly classify SARS-CoV-2, combining them yielded a feature with substantially greater predictive power for all 3 classification designs.
• Magtropy can be applied to further genomic classification studies.
• One vs Rest performed better than One vs One at the Genus level.
• The methods developed are general enough to be applicable to genomic sequences from
any organism.
Acknowledgements
● 2020 Research Informatics Interns
○ Anoushka Bhat
○ Esha Ananth
● Center for Informatics
○ Srisairam Achuthan, Ph.D.
○ Samir Courdy
○ Sorena Nadaf

Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

Usual Questions with Unusual Answers: Application of Multi-class Supervised Algorithms to Identify COVID-19 Viral Strains

  • 1. Usual Questions with Unusual Answers: Application of Multi-class Supervised Algorithms to Identify COVID-19 Viral Strains
      Rishov Chatterjee
      Research Informatics Division, Center for Informatics, City of Hope, CA
  • 2. Outline
      ● Introduction
      ● Multi-class Classification
      ● Classification Task Reduction
      ● Data Processing / Feature Engineering
      ● Results
      ● Conclusions / Future Directions
  • 3. Introduction
      ● Taxonomic classification: finding the identity of a certain virus [1]
      ● For unknown, potentially harmful pathogens, classification can help uncover patterns from the closest known pathogens
      ● 10 taxonomic levels for a viral genome, each with 1 or more sublevels
      Figure: rank hierarchy, ordered from few taxa to many: Realm, Subrealm, Kingdom, Subkingdom, Phylum, Subphylum, Class, Subclass, Order, Suborder, Family, Subfamily, Genus, Subgenus, Species. Adapted from: https://talk.ictvonline.org/
      [1] Richards R, Biological Classification: A Philosophical Introduction, Cambridge University Press, 2016
  • 4. Objective: SARS-CoV-2 Sequence Classification
      Taxonomic Level   Sublevels
      Realm             Duplodnaviria, Monodnaviria, Riboviria, Varidnaviria
      Kingdom           Orthornavirae, Pararnavirae
      Phylum            Duplornaviricota, Kitrinoviricota, Lenarviricota, Negarnaviricota, Pisuviricota
      Class             Duplopiviricetes, Pisoniviricetes, Stelpaviricetes
      Order             Nidovirales, Picornavirales, Sobelivirales
      Suborder          Arnidovirineae, Cornidovirineae, Mesnidovirineae, Monidovirineae, Nanidovirineae, Ronidovirineae, Tornidovirineae
      Family*           Coronaviridae
      Subfamily         Orthocoronavirinae, Torovirinae, Coronavirinae
      Genus             Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus
      Subgenus          Embecovirus, Merbecovirus, Nobecovirus, Sarbecovirus
      ● Simplify classification and prevent data leakage by creating a new feature to classify SARS-CoV-2 sequences into a sublevel at 9 of the 10 taxonomic levels [2].
      [2] https://talk.ictvonline.org/taxonomy/
  • 5. Multi-class Classification
      ● Multiclass classification classifies instances into one of three or more classes.
      ● There are fewer multiclass classifiers than binary classifiers [3].
      ● Multi-class classification is usually more difficult to optimize than binary classification.
      Visualization from Jason Brownlee, 4 Types of Classification Tasks in Machine Learning. https://machinelearningmastery.com/types-of-classification-in-machine-learning/
      [3] Mohamed, Aly (2005). "Survey on multiclass classification methods". Technical Report, Caltech.
  • 6. Classification Task Reduction
      ● A multiclass classification problem can be reduced to several binary classification problems.
      ● The most common reduction strategies are One vs Rest (OvR, also called One vs All, OvA) and One vs One (OvO).
      Figures: reduction strategy illustrations, including One-vs-One. Credits: Zhang et al., ScienceDirect; Jatin Nanda, Georgia Tech
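The two reduction strategies can be sketched by listing the binary tasks each one produces. This is a minimal illustration using only the four genus names from the talk; the helper names `one_vs_rest_tasks` and `one_vs_one_tasks` are hypothetical, and no classifier is trained here.

```python
from itertools import combinations

# Genus-level sublevels listed in the talk.
GENERA = ["Alphacoronavirus", "Betacoronavirus",
          "Deltacoronavirus", "Gammacoronavirus"]


def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


ovr = one_vs_rest_tasks(GENERA)
ovo = one_vs_one_tasks(GENERA)
print(len(ovr), "OvR tasks;", len(ovo), "OvO tasks")
```

For K classes, One vs Rest needs K binary classifiers and One vs One needs K(K-1)/2, so the four genera give 4 and 6 tasks respectively.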
  • 7. Machine Learning Workflow
      Classification type: Multi-class problem, One vs Rest, One vs One
      Optimized ML metric: Accuracy
      Sequences used for data: 50 genomic sequences chosen at random from each sublevel, 200 total
      Example: 200 SARS-CoV-2 sequences, 194: Accuracy = 194 / 200 = 97.0%
  • 8. Feature Engineering
      Discrete Fourier Transform (DFT) [4]
      ● Finds the digital frequencies associated with numbers in a finite numeric sequence
      ● A prior study used the average magnitude of the Discrete Fourier Transform for feature creation.
      Shannon's Entropy
      ● Measures the intrinsic uncertainty embedded within a sequence
      ● Based on the concept that all systems have a tendency towards disorder
      [4] Randhawa G, Soltysiak M, Roz HE, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS One, 2020
  • 9. Conversion Rules for Genomic Digitization
      Conversion Rule          A      T      C      G
      Purine Pyrimidine (PP)   0      1      1      0
      EIIP                     0.13   0.14   0.15   0.08
      Just A                   1      0      0      0
      Paired Numeric           1      1      -1     -1
      Real                     1.5    -1.5   0.5    -0.5
      Integer 1                1      0      2      3
      Integer 2                2      1      3      4
      Just C                   0      0      1      0
      Just T                   0      1      0      0
      Just G                   0      0      0      1
      Illustration with the chosen conversion rule [5], Purine Pyrimidine (PP): CAGGTCAT…. = 10001101….
      [5] Randhawa G, Soltysiak M, Roz HE, de Souza CPE, Hill KA, Kari L, Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study, PLOS One, 2020
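A conversion rule is a lookup from nucleotide to number, so it can be sketched as a dictionary. The snippet below encodes three of the rules from the table and applies the PP rule to the slide's example; the names `RULES` and `digitize` are hypothetical, and ambiguous bases such as N are not handled.

```python
# Three of the conversion rules from the table, as nucleotide -> number maps.
RULES = {
    "PP": {"A": 0, "T": 1, "C": 1, "G": 0},
    "Just A": {"A": 1, "T": 0, "C": 0, "G": 0},
    "Paired Numeric": {"A": 1, "T": 1, "C": -1, "G": -1},
}


def digitize(sequence, rule="PP"):
    """Map each nucleotide to its number under the named rule.

    Raises KeyError for any symbol outside A, T, C, G.
    """
    table = RULES[rule]
    return [table[base] for base in sequence.upper()]


pp = digitize("CAGGTCAT")
print("".join(str(v) for v in pp))  # prints 10001101, as on the slide
```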
  • 10. Feature Engineering
      Step 1: Convert genomic sequences to numbers with the PP conversion rule (CGATAT → 100101)
      Step 2: Find the frequency distribution of each nucleotide using DFT ([3, 0.5, 1.5, -1, 1.5, 0.5])
      Step 3: Find the absolute value of the DFT (magnitude), then find the average (Average Normalized Magnitude = 2.15)
      Step 4: Separately, find the entropy of the sequence (Entropy = 1.33)
      Step 5: Divide the average magnitude by the entropy to find "magtropy" (Magtropy = 2.15 / 1.33 = 1.62)
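The five steps can be traced on the slide's CGATAT example with the standard library alone. This is a sketch under stated assumptions, not the authors' code: the entropy is taken over the nucleotide symbols with the natural logarithm, a choice inferred because it reproduces the slide's 1.33, and the listed DFT values are read as the real parts of the transform. The slides do not spell out how the magnitude is normalized, so the plain mean magnitude printed here is not expected to reproduce 2.15; the final division uses the slide's own numbers. All function names are hypothetical.

```python
import cmath
import math
from collections import Counter

PP = {"A": 0, "T": 1, "C": 1, "G": 0}  # Purine Pyrimidine rule


def dft(values):
    """Naive O(n^2) discrete Fourier transform of a numeric sequence."""
    n = len(values)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(values))
        for k in range(n)
    ]


def shannon_entropy(sequence):
    """Shannon entropy of the symbol frequencies, in nats (assumed base)."""
    total = len(sequence)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(sequence).values())


def magtropy(average_magnitude, entropy):
    """Step 5: average DFT magnitude divided by entropy."""
    return average_magnitude / entropy


seq = "CGATAT"
digits = [PP[b] for b in seq]                      # Step 1
spectrum = dft(digits)                             # Step 2
real_parts = [round(z.real, 2) for z in spectrum]
mean_magnitude = sum(abs(z) for z in spectrum) / len(spectrum)  # Step 3, unnormalized
entropy = shannon_entropy(seq)                     # Step 4

print(digits, real_parts)
print("unnormalized mean magnitude:", mean_magnitude)
print("entropy:", round(entropy, 2))
print("magtropy from slide values:", round(magtropy(2.15, 1.33), 2))
```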
  • 11. Raw vs. Processed Data for Genus level
      Raw data:
      Sublevel_Seq#        Sequence
      Betacoronavirus_1    ATCGCGAGA….
      Betacoronavirus_2    ATCGGGTCG….
      Alphacoronavirus_1   GATGCTGTA…...
      Alphacoronavirus_2   GAGTCTCTA…..
      Gammacoronavirus_1   AGGCCAAAT…...
      Gammacoronavirus_2   AGGTCAAAT…...
      Deltacoronavirus_1   CCGGTAATA...
      Deltacoronavirus_2   CAGGTAAAC...
      Processed data:
      Sublevel             Magtropy Value
      Betacoronavirus      111.59
      Betacoronavirus      110.72
      Alphacoronavirus     103.75
      Alphacoronavirus     102.98
      Gammacoronavirus     95.88
      Gammacoronavirus     90.74
      Deltacoronavirus     121.78
      Deltacoronavirus     125.87
  • 12. Machine Learning Workflow
      Labels = Sublevels: Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus
      Features = Magtropy Values (values shown: 112, 110, 104, 103, 102, 95, 91, 122, 123, 124)
      10-fold Cross Validation; 70% = Training Set, 30% = Testing Set (values shown at the split: 91, 104, 123)
      Model Generated
      Classification Prediction: SARS-CoV-2 magtropy values 111, 112, 109; each of the 3 sequences predicted as Betacoronavirus
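To make the prediction step concrete, the sketch below fits a nearest-centroid rule on the eight genus-level magtropy values from the processed-data table and applies it to the three SARS-CoV-2 values shown in the workflow. Nearest-centroid is a stand-in chosen for brevity; it is not one of the classifiers used in the talk, and the split and cross-validation steps are omitted. The names `fit_centroids` and `predict` are hypothetical.

```python
from collections import defaultdict

# Genus-level magtropy values from the processed-data table.
TRAINING = [
    ("Betacoronavirus", 111.59), ("Betacoronavirus", 110.72),
    ("Alphacoronavirus", 103.75), ("Alphacoronavirus", 102.98),
    ("Gammacoronavirus", 95.88), ("Gammacoronavirus", 90.74),
    ("Deltacoronavirus", 121.78), ("Deltacoronavirus", 125.87),
]


def fit_centroids(samples):
    """Mean magtropy value per label."""
    groups = defaultdict(list)
    for label, value in samples:
        groups[label].append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def predict(centroids, value):
    """Label whose centroid is closest to the given magtropy value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


centroids = fit_centroids(TRAINING)
predictions = [predict(centroids, v) for v in (111, 112, 109)]
print(predictions)
```

With these values, all three SARS-CoV-2 inputs land nearest the Betacoronavirus centroid, which agrees with the prediction reported on the slide.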
  • 13. SARS-CoV-2 Multi-class Results
      ● 87.3% mean classification accuracy
      ● 2.5% accuracy at the Phylum level with entropy alone, 100% with magtropy
      ● Consistently best performing classifiers: Extreme Gradient Boosting, Decision Tree
      Figure labels: 100% Pisuviricota; 33% Riboviria; 67% Duplodnaviria; 97% Genus
  • 14. OvR and OvO Results (Genus)
      Metric                                     One vs Rest (LGBM)   One vs One (LGBM)
      10-Fold CV Accuracy                        93.52%               88.46%
      10-Fold CV Standard Deviation              5.01%                7.33%
      Holdout Accuracy                           85.25%               81.97%
      SARS-CoV-2 Sequence Prediction Accuracy    100%                 100%
  • 15. Conclusions and Future Directions
      ● Applied as separate features in the ML model, DFT and Shannon's entropy did not correctly classify SARS-CoV-2; combining them into magtropy yielded a feature with substantially greater predictive power for all 3 classification designs.
      ● Magtropy can be applied to further genomic classification studies.
      ● One vs Rest performed better than One vs One at the Genus level.
      ● The methods developed are general enough to be applicable to genomic sequences from any organism.
  • 16. Acknowledgements
      ● 2020 Research Informatics Interns
          ○ Anoushka Bhat
          ○ Esha Ananth
      ● Center for Informatics
          ○ Srisairam Achuthan, Ph.D.
          ○ Samir Courdy
          ○ Sorena Nadaf