Introduction
The flavivirus genus consists of 73
mosquito- and tick-borne viruses
that pose a considerable threat to
public health. A deeper
understanding of the molecular
evolution of flaviviruses is needed
to guide public health decisions and
to prevent flavivirus epidemics,
such as the 2015-2016 Zika virus
epidemic in the Americas.
Comparative genomic analysis can
lead to many discoveries regarding
molecular evolution and
epidemiology, but computational
power remains a limiting factor due
to the size of the data and the
complexity of the algorithms. Here
we present the Infectious Disease
Epidemiology Appliance (IDEA),
a tool developed by FedCentric
Technologies for the rapid analysis
and visualization of over 6,200
genomic and 93,000 proteomic
sequences.
Results
Phylogenetic Trees:
● Trees were constructed using the
maximum likelihood method
● Maximum likelihood uses a
statistical model of sequence
evolution to assign a probability to
the observed data
● The likelihood of a tree is the
probability of the observed
sequences given that tree and the
substitution model
● The Generalised Time Reversible
(GTR) model was used as the
substitution model
● The final trees were validated
using bootstrapping, which is a
resampling method
● Trees are an intuitive way to
visualize the evolution of an
organism over time
● Two genomic sequences close
together in the tree are expected to
share a more recent common
ancestor
Figure 6: Dendrogram of Zika Virus
genomes
Figure 7: Circular tree view
of Zika Virus genomes
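As a rough illustration of the resampling step in bootstrapping, the sketch below draws alignment columns with replacement to build pseudo-replicate alignments. Each replicate would then be passed to the tree builder, and the fraction of replicate trees that contain a given clade serves as that clade's support value. The function names and the toy alignment are hypothetical and are not taken from IDEA.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same column choice to every sequence so sites stay aligned.
    return ["".join(row[c] for c in cols) for row in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Build several replicates from one seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment (made-up data, three sequences of eight sites).
toy = ["ACGTACGT", "ACGTTCGT", "ACGAACGA"]
replicates = bootstrap_replicates(toy, n_replicates=3)
```

Resampling whole columns, rather than individual characters, keeps each site's pattern across sequences intact, which is what the tree likelihood is computed from.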
Methods
Hardware
● FedCentric Labs’ SGI UV2000
system: x86/Linux OS, scales up
to 4092 cores & 64TB of RAM
● Data in memory, very low
latency, high performance
Figure 1: Scale-Up vs. Scale-Out
Architecture
Software
● Apache Spark, a big data
processing framework
● Driver node manages parallel
operations carried out by
executor nodes
● Graphs are implemented with the
GraphFrame data structure from the
GraphFrames package for Spark
● Shiny, an R package from RStudio
for web application development
Figure 2: Spark executor framework
Acknowledgements
The results published here are in
whole or part based upon data
generated by the Virus Pathogen
Database and Analysis Resource
maintained by the National
Institute of Allergy and Infectious
Diseases.
Conclusions
IDEA will be an invaluable tool to
virologists, epidemiologists, and
public health officials. The ability
to accurately decipher the origin of
a pathogenic strain will allow for
preventative interventions and
quarantines when they are needed
most. This technology can be
rapidly extended to other
pathogens and diseases as well, such
as the Ebola virus, HIV, malaria,
influenza, and even bacterial
infections. An application like this,
paired with a sequencer and an
assembler, will allow researchers to
quickly identify and track strains
of pathogens, providing better
intelligence for interventions and
quarantines to stop and contain
outbreaks.
High Performance Data Analytics with Flavivirus Data in a Scale-Up Appliance
Rui Ponte¹, Lucia Fernandez¹, Kyle Milligan¹, Meena Sengottuvelu¹, Terry Antony¹, Tianwen Chu², Michael S. Atkins²
¹University of Maryland, College Park, MD
²FedCentric Technologies, LLC, College Park, MD
Results
Sequence Comparison:
● All-to-All comparison of protein and
nucleotide sequences
● Distances are calculated as edit
distances with a Jukes-Cantor correction
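The distance calculation can be sketched as follows. This is a minimal illustration, not the IDEA implementation: it assumes the edit distance is the Levenshtein distance, that it is normalized by the longer sequence length to give a proportion of differences p, and that the Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p) is applied to nucleotides. The poster does not state how normalization is done or how proteins are handled; the `n_states=20` generalization shown here is an assumption.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jukes_cantor(p, n_states=4):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if p == 0:
        return 0.0
    b = (n_states - 1) / n_states  # 3/4 for nucleotides, 19/20 for amino acids
    if p >= b:
        return math.inf  # saturated: the correction is undefined
    return -b * math.log(1 - p / b)


def corrected_distance(a, b, n_states=4):
    """Edit distance, normalized by the longer length, then JC-corrected."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    p = edit_distance(a, b) / longest
    return jukes_cantor(p, n_states)


d_nuc = corrected_distance("ACGT", "ACGA")  # one difference in four sites
```

The correction always gives a value at least as large as the raw proportion p, because it accounts for sites that may have changed more than once.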
Figure 3: Searchable database allows
easy strain lookup and querying
Figure 4: Researchers can search data with
multiple filters at the same time
Choropleth Map View
● Displays the density of samples in
each country
● Provides the user with a timeline
slider to view the spread over time
Figure 5: Map visualization tool allows
users to track the spread of a pathogen
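The data behind such a map can be sketched as a per-country count filtered by the slider position. This is a hypothetical illustration: the record layout, the function name, and the choice to show cumulative counts up to the selected year are assumptions, and the sample records are made up.

```python
from collections import Counter


def samples_per_country(records, up_to_year):
    """Count samples per country collected in or before the given year."""
    return Counter(country for country, year in records if year <= up_to_year)


# Made-up (country, collection year) records for illustration only.
records = [
    ("Uganda", 1947),
    ("Brazil", 2015),
    ("Brazil", 2016),
    ("Colombia", 2016),
]

counts_2015 = samples_per_country(records, up_to_year=2015)
counts_2016 = samples_per_country(records, up_to_year=2016)
```

Moving the slider then amounts to recomputing the counts for a new year and recoloring the countries.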
Future Work
FedCentric Technologies will
continue to add to this application
by incorporating the accession
numbers and associated sequence
data into the trees, as well as the
ability to zoom in on certain
portions of the tree and highlight
nodes. We also intend to expand
this work to other organisms and
improve on our map visualization
to allow more detailed geographic
data. Most importantly, we intend
to optimize the algorithms for the
maximum likelihood calculations
to run efficiently on SGI scale-up
hardware.
Methods and Materials
Data
● Whole genome and protein
sequences from the NIAID Virus
Pathogen Database and Analysis
Resource (ViPR)
Graph Model
● The graph consists of virus sample
nodes linked together with distance
calculation edges
● Each sample has 14 protein
sequences, 1 polyprotein sequence,
and 1 nucleotide sequence
● 6,201 nodes and 307,569,600
edges
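The node and edge counts can be cross-checked with a little arithmetic. One reading that reproduces the stated figures exactly is that every unordered pair of samples is linked by one edge per sequence type (14 proteins, 1 polyprotein, 1 nucleotide sequence, so 16 edges per pair). The poster does not spell this out, so treat it as an inference from the numbers.

```python
from math import comb

N_SAMPLES = 6201
PROTEINS_PER_SAMPLE = 14 + 1                     # 14 proteins + 1 polyprotein
SEQUENCES_PER_SAMPLE = PROTEINS_PER_SAMPLE + 1   # plus 1 nucleotide sequence

# All-to-all comparison: every unordered pair of samples.
sample_pairs = comb(N_SAMPLES, 2)

# Assumed: one distance edge per sequence type for each pair.
edges = sample_pairs * SEQUENCES_PER_SAMPLE

# Protein and polyprotein sequences across all samples.
protein_sequences = N_SAMPLES * PROTEINS_PER_SAMPLE

print(sample_pairs, edges, protein_sequences)
```

Under this reading the edge total matches the 307,569,600 stated above, and the protein count of 93,015 agrees with the "93,000 proteomic sequences" quoted in the Introduction.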
User Interface
● Edit-distance comparison of
protein and nucleotide sequences
● Phylogenetic tree builder with
maximum likelihood
● Multiple Sequence Alignment
using MAFFT (Multiple
Alignment using Fast Fourier
Transform)
● Choropleth map view
