Data Café — A Platform For Creating Biomedical Data Lakes
Pradeeban Kathiravelu1,2, Ameen Kazerouni2, Ashish Sharma2
1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA
www.sharmalab.info
Data Landscape for Precision Medicine
DATA CHARACTERISTICS
• Large number of small datasets
• Structured, semi-structured, unstructured, and ill-formed
• Noisy, fuzzy, and uncertain
• Spatial and temporal relationships
DATA MANAGEMENT
• Variety in storage and messaging protocols
• No shared interface
Illustrative Use Case
Execute a radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more.
Requires data spanning PACS + EMR + AIM + RT + Molecular sources.
Motivation
• Most current solutions require a DBA to initiate the migration of data into a data warehousing environment
• to query and explore all the data at once.
• Setting up such warehouses is costly.
• A unified warehouse gives access to query and explore the data.
• Limitations
• Scalability and extensibility to incorporate new data sources.
• A priori knowledge of the data models of the different data sources is required.
Biomedical Data Lakes
• Cohort discovery and creation — assembled per-study.
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrate with data exploration/visualization via REST APIs.
• Problem- or hypothesis-specific virtual dataset.
• Powered by Drill + HDFS; data sources accessed via APIs.
Data Café
• An agile approach to creating and extending the concept of a star schema
• to model a problem/hypothesis-specific dataset,
• leveraging Apache Drill to easily query the data.
• Tackles the limitations of the existing approaches.
• Gives researchers the ability to add new data models and sources.
Core Concepts
Step 1. Given a set of data sources, create a graphical representation of the join attributes.
This graph represents how data is connected across the various data sources (a minimal sketch follows).
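As a hedged sketch of what such a graph could look like in code: the deck shows no implementation, so the source names (A–E), the join attributes (id1, id2), and the use of networkx are illustrative assumptions.

```python
# A minimal sketch of Step 1, assuming hypothetical sources A-E and
# join attributes id1/id2 (the deck does not show its implementation).
import networkx as nx

# Undirected graph: nodes are data sources, edges carry the shared
# attribute on which the two sources can be joined.
graph = nx.Graph()
graph.add_edge("A", "B", join_attribute="id1")
graph.add_edge("B", "C", join_attribute="id1")
graph.add_edge("C", "D", join_attribute="id2")
graph.add_edge("D", "E", join_attribute="id2")

# The graph now records which sources must be consulted, and on which
# keys, to assemble a problem-specific cohort.
for u, v, data in graph.edges(data=True):
    print(f"{u} -- {v} on {data['join_attribute']}")
```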
Core Concepts
Step 2. Run a set of parallel queries on the data sources, covering the attributes that are present in the query graph.
In the top figure, our query is of the form: {id1: A1 > x and B2 == y}
We run similar queries across C, D, and E and retrieve the set of relevant ids (join attributes); a sketch follows.
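A minimal, runnable sketch of this fan-out, assuming toy in-memory sources in place of real stores such as MongoDB or Hive (all names and predicates are hypothetical):

```python
# An illustrative sketch of Step 2 (the deck shows no code). Each
# "source" is an in-memory stand-in for a real store; attribute names
# and predicates are hypothetical.
from concurrent.futures import ThreadPoolExecutor

# Toy records, all keyed by the shared join attribute id1.
SOURCES = {
    "A": [{"id1": i, "A1": i * 2} for i in range(10)],
    "B": [{"id1": i, "B2": "y" if i % 2 == 0 else "n"} for i in range(10)],
    "C": [{"id1": i} for i in range(3, 10)],
}

# One predicate per source, mirroring {id1: A1 > x and B2 == y}.
PREDICATES = {
    "A": lambda r: r["A1"] > 8,     # A1 > x
    "B": lambda r: r["B2"] == "y",  # B2 == y
    "C": lambda r: True,            # no extra filter on C
}

def matching_ids(name):
    """Return the id1 values in one source that satisfy its predicate."""
    return {r["id1"] for r in SOURCES[name] if PREDICATES[name](r)}

# Fan the per-source queries out in parallel.
with ThreadPoolExecutor() as pool:
    id_sets = list(pool.map(matching_ids, SOURCES))

print(id_sets)  # one set of candidate join keys per source
```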
Core Concepts
Step 3. Compute the intersection across the various ids (join attributes). The data of interest can now be obtained using the ids in this intersection.
A subsequent query then lets us stream, in parallel, data from the individual sources for the relevant ids (join attributes); see the sketch below.
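Continuing the sketch above (it reuses SOURCES and id_sets from the Step 2 block), the intersection and the parallel fetch might look as follows; again illustrative, not Data Café's actual code.

```python
# Continuing the Step 2 sketch: intersect the per-source id sets, then
# pull only the cohort's records from each source in parallel.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Step 3a: the cohort is the intersection of all candidate id sets.
cohort = reduce(set.intersection, id_sets)

def fetch_rows(name):
    """Stream one source's rows restricted to the cohort ids."""
    return [r for r in SOURCES[name] if r["id1"] in cohort]

# Step 3b: fetch each source's slice of the cohort in parallel; in
# Data Café this data would then be landed in HDFS for Drill to query.
with ThreadPoolExecutor() as pool:
    slices = dict(zip(SOURCES, pool.map(fetch_rows, SOURCES)))

print(slices)
```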
Data Café Architecture
Apache Drill
• Variety – query a range of non-relational data sources (example query below).
• Flexibility.
• Agility – faster insights.
• Scalability.
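As an example of the kind of query this enables, here is a hedged sketch of submitting SQL to Drill over its REST API; the host and the HDFS path are assumptions, not details from the deck.

```python
# Drill accepts POST /query.json with {"queryType": "SQL", "query": ...}
# on its web port (8047 by default). The drillbit host and the HDFS
# path queried here are placeholders.
import json
import urllib.request

DRILL_URL = "http://localhost:8047/query.json"  # assumed local drillbit

payload = {
    "queryType": "SQL",
    # Hypothetical query over cohort data landed in HDFS by Data Café.
    "query": "SELECT * FROM dfs.`/datacafe/cohort` LIMIT 10",
}

request = urllib.request.Request(
    DRILL_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)
    print(result["rows"])
```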
Evaluation Environment
• Data Café was deployed along with the data sources and Drill on Amazon EC2 (a connection sketch follows the list).
• MongoDB instantiated on EC2 instances.
• Hive on Amazon EMR (Elastic MapReduce).
• EMR HDFS was configured with 3 nodes.
• Various datasets for evaluation:
• Two synthetic datasets.
• Clinical data from the TCGA BRCA collection.
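For concreteness, a hedged sketch of how a client might connect to the stores named above; every hostname, port, database, and table name is a placeholder, since the deck gives no connection details.

```python
# Placeholder connections to the evaluation stores: MongoDB on EC2 and
# Hive on EMR (via HiveServer2). Hosts and names are assumptions.
from pymongo import MongoClient  # MongoDB driver
from pyhive import hive          # HiveServer2 client

# MongoDB instance on an EC2 node (placeholder host).
mongo = MongoClient("mongodb://ec2-host.example.com:27017")
clinical = mongo["datacafe"]["tcga_brca_clinical"]
print(clinical.estimated_document_count())

# HiveServer2 on the EMR master node (placeholder host, default port).
conn = hive.connect(host="emr-master.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
```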
Results
• Quick creation of data lakes
• without prior knowledge of the data schema.
• Very fast execution of large queries
• with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data
source.
• Integrated data source construction may be time consuming.
• However, it lies off the critical path.
• It is done less frequently than the data queries from HDFS/Hive using Drill.
Conclusion
• A novel platform for integrating multiple data sources
• without a priori knowledge of the data models of the sources being integrated.
• Uses indices (the join attributes) to perform the actual integration.
• Enables parallelizing the push of the actual data into HDFS.
• Apache Drill as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.
Current State and Future Plans
• Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).
Acknowledgements
Google Summer of Code 2015
NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma, PI
Emory/WUSTL/Stony Brook NCI U01 [1U01CA187013-01], Resources for Development and Validation of Radiomic Analyses & Adaptive Therapy; Fred Prior, Ashish Sharma (UAMS, Emory)
The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/
For more information, including recent updates, please visit: www.sharmalab.info
ashish.sharma@emory.edu
