SlideShare a Scribd company logo
1 of 1
Download to read offline
Introduction
Traditional relational databases are not
optimized to perform specific, complex
queries that Precision Medicine researchers
are interested in. Structured query language
(SQL) relational databases are not ideal to
handle the volume or diversity of
biomedical data that is available. Resource
description framework (RDF) triples are a
type of graph structure that comprise of
subject and object nodes, and a predicate
edge. While more flexible than relational
databases, RDF triples tend to still be rigid
and lead to database size increases.
The “Property Graph Database” is a flexible
model that can be used to store complex and
variable data from different genomic fields.
Graph databases can store unlimited
relationships (edges) between data points
(nodes), as well as various attributes of
each. Querying the database merely
accesses the list of relationship records,
eliminating the need for a lengthy search
and match computation. Graph databases do
not require a pre-set schema and any
amount of new data can be arbitrarily
incorporated with ease.
To showcase the compatibility of a property
graph database with biomedical data, we
stored and analyzed nearly all the publicly
available from The Cancer Genome Atlas
(TCGA) data.
Results
Ingestion
● Spark provides multithreaded parse and
write functionality
● Ingestion and write to parquet takes 50
minutes using 192 cores
● Total size on disk is 3.4 GB
compressed
● Total size in memory is 171 GB
Methods and Materials
Hardware
Fig 1: Scale Up vs. Scale Out
Acknowledgements
The authors would like to acknowledge the International Cancer
Genome Consortium for providing data found at https://icgc.org/.
The results published here are in whole or part based upon data
generated by The Cancer Genome Atlas managed by the NCI and
NHGRI. Information about TCGA can be found at
http://cancergenome.nih.gov.
References
Grossman, R. L., Heath, A. P., Ferretti, V., Varmus, H. E., Lowy, D. R., Kibbe,
W. A., Staudt, L. M. (2016) Toward a Shared Vision for Cancer Genomic Data.
New England Journal of Medicine 375:12, 1109-1112.
Goldman, M., Craft, B., Swatloski, T., Ellrott, K., Cline, M., Diekhans, M., Ma,
S., Wilks, C., Stuart, J., Haussler, D., Zhu, J. The UCSC Cancer Genomics
Browser: update 2013. Nucleic Acids Research 2012; doi: 10.1093/nar/gks1008.
Liu, Y., Ji, Y., & Qiu, P. (2013). Identification of thresholds for dichotomizing
DNA methylation data. EURASIP Journal on Bioinformatics and Systems
Biology, 2013(1), 8. http://doi.org/10.1186/1687-4153-2013-8
Conclusions
In conclusion, FedCentric was able to
successfully build a property graph
database containing a variety of TCGA
data, while storing the entire dataset in
memory on a scale-up architecture to
reduce latency and computational time.
We were able to run simple and complex
queries to find relationships, extract
information, and cluster data types
quickly.
This graph database gives oncology
researchers the ability to make novel
discoveries and find unique relationships
between different types of data both
quickly and seamlessly. These discoveries
can lead to substantial advancements in
Precision Medicine.
Results
Simple Queries
Examples (Fig. 3)
● All information for one gene
(IDH1)
● Percentage of patients that died due
to GBM
● Most common gain/loss copy
numbers for chromosomes
Performance Speeds (Fig. 3)
● Ability to derive subgraphs in
seconds
Fig 5: Simple Queries
Complex Queries
Examples (Fig. 4)
● Top 5 most commonly mutated
genes & diseases
● All information for one gene for one
patient
● Most commonly overexpressed
genes
Performance Speeds (Fig. 4)
● Ability to scale up to 256 parallel
processes
● Query times in seconds
Fig 6: Complex Queries
StepMiner2D
● Bivariate algorithm determines
thresholds for two data types at once
● Hypergeometric test to find association
in data types
● Ran using methylation and gene
expression data for 200 breast cancer
patients (Fig. 5)
● Took ca. 16 minutes to run
● Very low p-value (high correlation)
Fig 7: StepMiner2D Results
Clustering
● K-means clustering using Euclidean
distance
● Clustered 2,275 patients by disease
using 500 genes (Fig. 6)
● Over 99% accurately clustered
● Took ca. 1 minute to run
High Performance Data Analytics with Heterogeneous Cancer Data in a Scale-Up Property Graph
Terry Antony1
, Sarah Bullard1
, Nicole Sabatelli1
, Shawn Ulmer1
, Doron Levy1
, Tianwen Chu2
, Frank D’Ippolito2
, Michael S. Atkins2
1
University of Maryland, College Park, MD, 2
FedCentric Technologies, LLC., College Park, MD
Background
With advancements in high-throughput
sequencing, genomic data from individuals
around the world is rapidly being uploaded
to a variety of databases, stored on cloud
servers, and kept on localized computers.
Genomic data for millions of individuals
can be petabytes and even exabytes in size.
Data this size is difficult to manage and
genomic researchers need methods to easily
and rapidly store, upload, query, and
analyze it. Discoveries of patterns and
relationships in data is crucial in
understanding genetic diseases, such as
cancer, but it is especially relevant in
precision medicine, which utilizes the
knowledge and insights contained in
biomedical information to customize
individual therapies and treatment options.
Methods and Materials
Software Architecture - Apache Spark
● Big data processing framework
● Driver node manages parallel
operations carried out by executor
nodes
● Implements graphs through its
GraphFrame data structure
Fig 2: Apache Spark
Data
● Over 500GB of The Cancer Genome
Atlas (TCGA) data from 33 Cancer
Project for over 11,000 patients (Fig. 1)
● Clinical Information & VCF files*
from NIH GDC
● Methylation, miRNA expression, &
Copy Number Variation from ICGC
● Gene Expression from UCSC
● Gene names and additional information
from Ensembl
● Probe ID from Illumina
Fig 3: File Information
The Graph
● All data types mapped to gene name
and/or chromosome (Fig. 2)
● Includes over 20,000 genes and all
chromosomes
● Over 57 million nodes and 3.5 billion
edges
Fig 4: The Graph Model
*
NIH Granted Access Data
Future Work
With graph model’s flexibility and
capacity for intricacy, the team will use
graph methodology for fusing different
types of biomedical data. This
methodology can be used to study cancer
diagnosis, treatment, and survivorship
(such as breast cancer, prostate cancer,
and blood cancers).
Many researchers and doctors have a
plethora of data generated daily that has
the capacity to be stored and analyzed
using graph technology. FedCentric can
team with clinicians and experimentalists
to study the applicability of graph
technology in a variety of medical
problems to find scientifically accurate
relationships and connections deemed
most useful in real-time.
Results
Fig 8: Clustering Results
Scale Up
● One large server
● Scales up by adding
memory and CPUs to
the same system
● Blades are linked with
high speed internal bus
Scale Out
● Cluster of small servers
● Scales out by adding more
physical servers
● Servers are linked through
external network
connection

More Related Content

What's hot

is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.Chris Evelo
 
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...marcosmartinezromero
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
 
A Semantic Web based Framework for Linking Healthcare Information with Comput...
A Semantic Web based Framework for Linking Healthcare Information with Comput...A Semantic Web based Framework for Linking Healthcare Information with Comput...
A Semantic Web based Framework for Linking Healthcare Information with Comput...Koray Atalag
 
Highly dimensional data_20160926
Highly dimensional data_20160926Highly dimensional data_20160926
Highly dimensional data_20160926Laura Clarke
 
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...ReVeaLD: A user-driven domain-specific interactive search platform for biomed...
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...Maulik Kamdar
 
Introduction to data integration in bioinformatics
Introduction to data integration in bioinformaticsIntroduction to data integration in bioinformatics
Introduction to data integration in bioinformaticsYan Xu
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitData Con LA
 
Integration of heterogeneous data
Integration of heterogeneous dataIntegration of heterogeneous data
Integration of heterogeneous dataLars Juhl Jensen
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseNeo4j
 
Rethinking data intensive science using scalable analytics systems
 Rethinking data intensive science using scalable analytics systems Rethinking data intensive science using scalable analytics systems
Rethinking data intensive science using scalable analytics systemsnewmooxx
 

What's hot (20)

Implementation of Semantic Network Dictionary System for Global Observation ...
Implementation of Semantic Network Dictionary System for Global Observation ...Implementation of Semantic Network Dictionary System for Global Observation ...
Implementation of Semantic Network Dictionary System for Global Observation ...
 
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
 
is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.
 
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
ICBO2017 - Supporting Ontology-Based Standardization of Biomedical Metadata i...
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
 
Big data
Big dataBig data
Big data
 
An Open Repository Model for Acquiring Knowledge About Scientific Experiments
An Open Repository Model for Acquiring Knowledge About Scientific ExperimentsAn Open Repository Model for Acquiring Knowledge About Scientific Experiments
An Open Repository Model for Acquiring Knowledge About Scientific Experiments
 
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedHail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed
 
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
 
Metadata in the BioSample Online Repository are Impaired by Numerous Anomalie...
Metadata in the BioSample Online Repository are Impaired by Numerous Anomalie...Metadata in the BioSample Online Repository are Impaired by Numerous Anomalie...
Metadata in the BioSample Online Repository are Impaired by Numerous Anomalie...
 
V5I3_IJERTV5IS031157
V5I3_IJERTV5IS031157V5I3_IJERTV5IS031157
V5I3_IJERTV5IS031157
 
A Semantic Web based Framework for Linking Healthcare Information with Comput...
A Semantic Web based Framework for Linking Healthcare Information with Comput...A Semantic Web based Framework for Linking Healthcare Information with Comput...
A Semantic Web based Framework for Linking Healthcare Information with Comput...
 
Highly dimensional data_20160926
Highly dimensional data_20160926Highly dimensional data_20160926
Highly dimensional data_20160926
 
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...ReVeaLD: A user-driven domain-specific interactive search platform for biomed...
ReVeaLD: A user-driven domain-specific interactive search platform for biomed...
 
BioNLPSADI
BioNLPSADIBioNLPSADI
BioNLPSADI
 
Introduction to data integration in bioinformatics
Introduction to data integration in bioinformaticsIntroduction to data integration in bioinformatics
Introduction to data integration in bioinformatics
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn Langit
 
Integration of heterogeneous data
Integration of heterogeneous dataIntegration of heterogeneous data
Integration of heterogeneous data
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
 
Rethinking data intensive science using scalable analytics systems
 Rethinking data intensive science using scalable analytics systems Rethinking data intensive science using scalable analytics systems
Rethinking data intensive science using scalable analytics systems
 

Viewers also liked

Frameworks2 go dancing with gorillas
Frameworks2 go dancing with gorillasFrameworks2 go dancing with gorillas
Frameworks2 go dancing with gorillasframeworks2go.com
 
การใช้เทคโนโลยีสารสนเทศนำเสนองาน
การใช้เทคโนโลยีสารสนเทศนำเสนองานการใช้เทคโนโลยีสารสนเทศนำเสนองาน
การใช้เทคโนโลยีสารสนเทศนำเสนองานtanachot1898
 
Yassir Certificate
Yassir CertificateYassir Certificate
Yassir CertificateYassir Ahmed
 
HQNOP75 - 1601 - Ship Manager
HQNOP75 - 1601 - Ship ManagerHQNOP75 - 1601 - Ship Manager
HQNOP75 - 1601 - Ship ManagerTamer Mohamed
 
2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision MedicineMichael Atkins
 
öT tibeti jógagyakorlat
öT tibeti jógagyakorlatöT tibeti jógagyakorlat
öT tibeti jógagyakorlatlelekportal
 
La perdida de biodiversidad
La perdida de biodiversidadLa perdida de biodiversidad
La perdida de biodiversidadPelodytes
 
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016María Rubio
 

Viewers also liked (13)

Bio-IT Award
Bio-IT AwardBio-IT Award
Bio-IT Award
 
Madarski konnik
Madarski konnikMadarski konnik
Madarski konnik
 
Frameworks2 go dancing with gorillas
Frameworks2 go dancing with gorillasFrameworks2 go dancing with gorillas
Frameworks2 go dancing with gorillas
 
การใช้เทคโนโลยีสารสนเทศนำเสนองาน
การใช้เทคโนโลยีสารสนเทศนำเสนองานการใช้เทคโนโลยีสารสนเทศนำเสนองาน
การใช้เทคโนโลยีสารสนเทศนำเสนองาน
 
Yassir Certificate
Yassir CertificateYassir Certificate
Yassir Certificate
 
GYM A&L
GYM A&L GYM A&L
GYM A&L
 
HQNOP75 - 1601 - Ship Manager
HQNOP75 - 1601 - Ship ManagerHQNOP75 - 1601 - Ship Manager
HQNOP75 - 1601 - Ship Manager
 
Belogradchishki skali
Belogradchishki skaliBelogradchishki skali
Belogradchishki skali
 
2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine
 
öT tibeti jógagyakorlat
öT tibeti jógagyakorlatöT tibeti jógagyakorlat
öT tibeti jógagyakorlat
 
La perdida de biodiversidad
La perdida de biodiversidadLa perdida de biodiversidad
La perdida de biodiversidad
 
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016
Informe del Observatorio de la Gestión Empresarial de la Biodiversidad 2016
 
principios de APS
principios de APSprincipios de APS
principios de APS
 

Similar to Cancer Analytics Poster

Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Ian Foster
 
NCI Cancer Genomic Data Commons for NCAB September 2016
NCI Cancer Genomic Data Commons for NCAB September 2016NCI Cancer Genomic Data Commons for NCAB September 2016
NCI Cancer Genomic Data Commons for NCAB September 2016Warren Kibbe
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceCarole Goble
 
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...Syed Ahmad Chan Bukhari, PhD
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationNils Gehlenborg
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
Cancer Moonshot, Data sharing and the Genomic Data Commons
Cancer Moonshot, Data sharing and the Genomic Data CommonsCancer Moonshot, Data sharing and the Genomic Data Commons
Cancer Moonshot, Data sharing and the Genomic Data CommonsWarren Kibbe
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuAlexander Pico
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016Anita de Waard
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례mothersafe
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataIJECEIAES
 
Genomics2 Phenomics Complete
Genomics2 Phenomics CompleteGenomics2 Phenomics Complete
Genomics2 Phenomics CompleteInterpretOmics
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...IJDKP
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...IJDKP
 
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016Radiomics Data Management, Computation, and Analysis for QIN F2F 2016
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016Ashish Sharma
 
Virus Analytics Poster
Virus Analytics PosterVirus Analytics Poster
Virus Analytics PosterMichael Atkins
 
Pathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillancePathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillanceJoel Saltz
 

Similar to Cancer Analytics Poster (20)

Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
 
NCI Cancer Genomic Data Commons for NCAB September 2016
NCI Cancer Genomic Data Commons for NCAB September 2016NCI Cancer Genomic Data Commons for NCAB September 2016
NCI Cancer Genomic Data Commons for NCAB September 2016
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data Science
 
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient Stratification
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Cancer Moonshot, Data sharing and the Genomic Data Commons
Cancer Moonshot, Data sharing and the Genomic Data CommonsCancer Moonshot, Data sharing and the Genomic Data Commons
Cancer Moonshot, Data sharing and the Genomic Data Commons
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
 
Charleston Conference 2016
Charleston Conference 2016Charleston Conference 2016
Charleston Conference 2016
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer data
 
Genomics2 Phenomics Complete
Genomics2 Phenomics CompleteGenomics2 Phenomics Complete
Genomics2 Phenomics Complete
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
UNMSymposium2014
UNMSymposium2014UNMSymposium2014
UNMSymposium2014
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER...
 
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016Radiomics Data Management, Computation, and Analysis for QIN F2F 2016
Radiomics Data Management, Computation, and Analysis for QIN F2F 2016
 
Virus Analytics Poster
Virus Analytics PosterVirus Analytics Poster
Virus Analytics Poster
 
Madhavi
MadhaviMadhavi
Madhavi
 
Pathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer SurveillancePathomics, Clinical Studies, and Cancer Surveillance
Pathomics, Clinical Studies, and Cancer Surveillance
 

More from Michael Atkins

More from Michael Atkins (7)

FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
ASLO 2001 Poster (Black)
ASLO 2001 Poster (Black)ASLO 2001 Poster (Black)
ASLO 2001 Poster (Black)
 
Astrobiology Poster
Astrobiology PosterAstrobiology Poster
Astrobiology Poster
 
2000 WHOI Currents
2000 WHOI Currents2000 WHOI Currents
2000 WHOI Currents
 
Final Draft
Final DraftFinal Draft
Final Draft
 
OTEC
OTECOTEC
OTEC
 
2000 JME (51)278-285
2000 JME (51)278-2852000 JME (51)278-285
2000 JME (51)278-285
 

Cancer Analytics Poster

  • 1. Introduction Traditional relational databases are not optimized to perform specific, complex queries that Precision Medicine researchers are interested in. Structured query language (SQL) relational databases are not ideal to handle the volume or diversity of biomedical data that is available. Resource description framework (RDF) triples are a type of graph structure that comprise of subject and object nodes, and a predicate edge. While more flexible than relational databases, RDF triples tend to still be rigid and lead to database size increases. The “Property Graph Database” is a flexible model that can be used to store complex and variable data from different genomic fields. Graph databases can store unlimited relationships (edges) between data points (nodes), as well as various attributes of each. Querying the database merely accesses the list of relationship records, eliminating the need for a lengthy search and match computation. Graph databases do not require a pre-set schema and any amount of new data can be arbitrarily incorporated with ease. To showcase the compatibility of a property graph database with biomedical data, we stored and analyzed nearly all the publicly available from The Cancer Genome Atlas (TCGA) data. Results Ingestion ● Spark provides multithreaded parse and write functionality ● Ingestion and write to parquet takes 50 minutes using 192 cores ● Total size on disk is 3.4 GB compressed ● Total size in memory is 171 GB Methods and Materials Hardware Fig 1: Scale Up vs. Scale Out Acknowledgements The authors would like to acknowledge the International Cancer Genome Consortium for providing data found at https://icgc.org/. The results published here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. Information about TCGA can be found at http://cancergenome.nih.gov. References Grossman, R. L., Heath, A. P., Ferretti, V., Varmus, H. E., Lowy, D. R., Kibbe, W. A., Staudt, L. M. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375:12, 1109-1112. Goldman, M., Craft, B., Swatloski, T., Ellrott, K., Cline, M., Diekhans, M., Ma, S., Wilks, C., Stuart, J., Haussler, D., Zhu, J. The UCSC Cancer Genomics Browser: update 2013. Nucleic Acids Research 2012; doi: 10.1093/nar/gks1008. Liu, Y., Ji, Y., & Qiu, P. (2013). Identification of thresholds for dichotomizing DNA methylation data. EURASIP Journal on Bioinformatics and Systems Biology, 2013(1), 8. http://doi.org/10.1186/1687-4153-2013-8 Conclusions In conclusion, FedCentric was able to successfully build a property graph database containing a variety of TCGA data, while storing the entire dataset in memory on a scale-up architecture to reduce latency and computational time. We were able to run simple and complex queries to find relationships, extract information, and cluster data types quickly. This graph database gives oncology researchers the ability to make novel discoveries and find unique relationships between different types of data both quickly and seamlessly. These discoveries can lead to substantial advancements in Precision Medicine. Results Simple Queries Examples (Fig. 3) ● All information for one gene (IDH1) ● Percentage of patients that died due to GBM ● Most common gain/loss copy numbers for chromosomes Performance Speeds (Fig. 3) ● Ability to derive subgraphs in seconds Fig 5: Simple Queries Complex Queries Examples (Fig. 4) ● Top 5 most commonly mutated genes & diseases ● All information for one gene for one patient ● Most commonly overexpressed genes Performance Speeds (Fig. 4) ● Ability to scale up to 256 parallel processes ● Query times in seconds Fig 6: Complex Queries StepMiner2D ● Bivariate algorithm determines thresholds for two data types at once ● Hypergeometric test to find association in data types ● Ran using methylation and gene expression data for 200 breast cancer patients (Fig. 5) ● Took ca. 16 minutes to run ● Very low p-value (high correlation) Fig 7: StepMiner2D Results Clustering ● K-means clustering using Euclidean distance ● Clustered 2,275 patients by disease using 500 genes (Fig. 6) ● Over 99% accurately clustered ● Took ca. 1 minute to run High Performance Data Analytics with Heterogeneous Cancer Data in a Scale-Up Property Graph Terry Antony1 , Sarah Bullard1 , Nicole Sabatelli1 , Shawn Ulmer1 , Doron Levy1 , Tianwen Chu2 , Frank D’Ippolito2 , Michael S. Atkins2 1 University of Maryland, College Park, MD, 2 FedCentric Technologies, LLC., College Park, MD Background With advancements in high-throughput sequencing, genomic data from individuals around the world is rapidly being uploaded to a variety of databases, stored on cloud servers, and kept on localized computers. Genomic data for millions of individuals can be petabytes and even exabytes in size. Data this size is difficult to manage and genomic researchers need methods to easily and rapidly store, upload, query, and analyze it. Discoveries of patterns and relationships in data is crucial in understanding genetic diseases, such as cancer, but it is especially relevant in precision medicine, which utilizes the knowledge and insights contained in biomedical information to customize individual therapies and treatment options. Methods and Materials Software Architecture - Apache Spark ● Big data processing framework ● Driver node manages parallel operations carried out by executor nodes ● Implements graphs through its GraphFrame data structure Fig 2: Apache Spark Data ● Over 500GB of The Cancer Genome Atlas (TCGA) data from 33 Cancer Project for over 11,000 patients (Fig. 1) ● Clinical Information & VCF files* from NIH GDC ● Methylation, miRNA expression, & Copy Number Variation from ICGC ● Gene Expression from UCSC ● Gene names and additional information from Ensembl ● Probe ID from Illumina Fig 3: File Information The Graph ● All data types mapped to gene name and/or chromosome (Fig. 2) ● Includes over 20,000 genes and all chromosomes ● Over 57 million nodes and 3.5 billion edges Fig 4: The Graph Model * NIH Granted Access Data Future Work With graph model’s flexibility and capacity for intricacy, the team will use graph methodology for fusing different types of biomedical data. This methodology can be used to study cancer diagnosis, treatment, and survivorship (such as breast cancer, prostate cancer, and blood cancers). Many researchers and doctors have a plethora of data generated daily that has the capacity to be stored and analyzed using graph technology. FedCentric can team with clinicians and experimentalists to study the applicability of graph technology in a variety of medical problems to find scientifically accurate relationships and connections deemed most useful in real-time. Results Fig 8: Clustering Results Scale Up ● One large server ● Scales up by adding memory and CPUs to the same system ● Blades are linked with high speed internal bus Scale Out ● Cluster of small servers ● Scales out by adding more physical servers ● Servers are linked through external network connection