Data Café — A Platform For Creating Biomedical Data Lakes


A podium abstract presented at the AMIA 2016 Joint Summits on Translational Science, discussing Data Café, a platform for creating biomedical data lakes.


  1. Data Café — A Platform For Creating Biomedical Data Lakes. Pradeeban Kathiravelu1,2, Ameen Kazerouni2, Ashish Sharma2. 1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal; 2 Department of Biomedical Informatics, Emory University, Atlanta, USA
  2. Data Landscape for Precision Medicine. DATA CHARACTERISTICS • Large number of small datasets • Structured, semi-structured, unstructured, and ill-formed • Noisy and fuzzy/uncertain • Spatial and temporal relationships. DATA MANAGEMENT • Variety in storage and messaging protocols • No shared interface
  3. Illustrative Use Case. Execute a radiogenomics workflow on the diffusion images of GBM patients who received a TMZ + experimental regimen with an overall survival of 18 months or more. Requires PACS + EMR + AIM + RT + Molecular data.
  4. Motivation • Most current solutions require a DBA to migrate the data into a data warehousing environment in order to query and explore all the data at once • Setting up such warehouses is costly • The result is a unified warehouse with access to query and explore the data • Limitations: scalability and extensibility to incorporate new data sources, and the need for a priori knowledge of the data models of the different data sources
  5. Biomedical Data Lakes • Cohort discovery and creation — assembled per study • Heterogeneous data collected in a loosely structured fashion • Agile and easy to create • Integrates with data exploration/visualization via REST APIs • A problem- or hypothesis-specific virtual data set • Powered by Drill + HDFS, with data sources accessed via APIs
  6. Data Café • An agile approach that creates and extends the concept of a star schema to model a problem- or hypothesis-specific dataset, leveraging Apache Drill to easily query the data • Tackles the limitations of the existing approaches • Gives researchers the ability to add new data models and sources
  7. Core Concepts, Step 1. Given a set of data sources, create a graphical representation of the join attributes. This graph represents how data is connected across the various data sources.
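The join-attribute graph of Step 1 can be sketched as a simple adjacency structure. This is a minimal illustration, not code from Data Café; the source names (pacs, emr, molecular) and the join attribute (patient_id) are hypothetical.

```python
# Hypothetical join-attribute graph: nodes are data sources, and each edge
# records the attribute on which the two sources can be joined.

def build_join_graph(edges):
    """Build an undirected adjacency map from (source_a, source_b, attribute) triples."""
    graph = {}
    for a, b, attr in edges:
        graph.setdefault(a, {})[b] = attr
        graph.setdefault(b, {})[a] = attr
    return graph

# Illustrative sources and join attribute, not the actual Data Cafe schema.
join_graph = build_join_graph([
    ("pacs", "emr", "patient_id"),
    ("emr", "molecular", "patient_id"),
])
```

Walking this map tells the platform, for any pair of connected sources, which attribute to query for in each of them.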
  8. Core Concepts, Step 2. Run a set of parallel queries on the data sources that include the attributes present in the query graph. In the top figure, our query is of the type {id1: A1 > x and B2 == y}. We run similar queries across C, D, and E and retrieve the sets of relevant IDs (join attributes).
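Step 2 — splitting the query {id1: A1 > x and B2 == y} into per-source predicates and running them in parallel — might look like the following sketch. The in-memory dictionaries stand in for real data stores, and all names and values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the real data sources; each maps a record id (the join
# attribute) to that source's attributes. Values are illustrative only.
SOURCES = {
    "A": {"p1": {"A1": 10}, "p2": {"A1": 3}, "p3": {"A1": 12}},
    "B": {"p1": {"B2": "y"}, "p2": {"B2": "y"}, "p3": {"B2": "n"}},
}

def query_source(name, predicate):
    """Return the set of ids in one source whose record satisfies the predicate."""
    return {rid for rid, rec in SOURCES[name].items() if predicate(rec)}

# The query {id1: A1 > x and B2 == y}, split into one predicate per source
# (here with x = 5) and executed in parallel, one query per source.
predicates = {
    "A": lambda rec: rec["A1"] > 5,      # A1 > x
    "B": lambda rec: rec["B2"] == "y",   # B2 == y
}
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(query_source, name, pred)
               for name, pred in predicates.items()}
    id_sets = {name: f.result() for name, f in futures.items()}
```

Each source returns only its matching join attributes, so no bulk data moves at this stage.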
  9. Core Concepts, Step 3. Compute the intersection across the various sets of IDs (join attributes). The data of interest can now be obtained using the IDs in this intersection: a subsequent query streams, in parallel, the data from the individual sources, given the relevant IDs.
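Step 3 reduces to a set intersection over the per-source ID sets, after which each source is read only for those IDs. A minimal sketch, with hypothetical ID sets:

```python
from functools import reduce

# Per-source sets of matching join attributes, as produced by Step 2
# (hypothetical values).
id_sets = {
    "A": {"p1", "p3", "p7"},
    "B": {"p1", "p2", "p3"},
    "C": {"p1", "p3"},
}

# The ids of interest are those present in every source.
common_ids = reduce(set.intersection, id_sets.values())

def fetch(source_records, ids):
    """Pull from one source only the records whose join attribute is in the intersection."""
    return {rid: rec for rid, rec in source_records.items() if rid in ids}
```

Because fetch touches each source independently, the final pulls can run in parallel, one per source, exactly as the slide describes.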
  10. Data Café Architecture
  11. Apache Drill • Variety – query a range of non-relational data sources • Flexibility • Agility – faster insights • Scalability
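Drill accepts ANSI SQL over heterogeneous stores, typically addressed by storage-plugin prefixes, and can be driven over its REST API. The sketch below follows Drill's documented REST endpoint (POST /query.json with a {"queryType": "SQL", "query": ...} body); the plugin and table names (hive.clinical, mongo.emr.patients) and the column names are hypothetical, not the actual TCGA schema.

```python
import json
from urllib import request

# A cross-source ANSI SQL query of the kind Drill executes directly over
# heterogeneous stores. Plugin, table, and column names are illustrative.
SQL = """
SELECT c.patient_id, c.overall_survival_months
FROM hive.clinical c
JOIN mongo.emr.patients p ON c.patient_id = p.patient_id
WHERE c.overall_survival_months >= 18
"""

def drill_payload(sql):
    """Build the JSON body for Drill's REST query endpoint (POST /query.json)."""
    return json.dumps({"queryType": "SQL", "query": sql.strip()})

def run_query(drillbit_url, sql):
    """Submit the query to a running Drillbit; requires a live Drill cluster."""
    req = request.Request(drillbit_url + "/query.json",
                         data=drill_payload(sql).encode("utf-8"),
                         headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A single such query spans Hive and MongoDB without any prior ETL, which is what makes Drill a fit for the loosely structured data lake.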
  12. Evaluation Environment • Data Café was deployed along with the data sources and Drill in Amazon EC2 • MongoDB instantiated on EC2 instances • Hive on Amazon EMR (Elastic MapReduce) • EMR HDFS configured with 3 nodes • Datasets for evaluation: two synthetic datasets, and clinical data from the TCGA BRCA collection
  13. Results • Quick creation of data lakes without prior knowledge of the data schema • Very fast execution of large queries with Apache Drill • Data Café can be an efficient platform for exploring an integrated data source • Constructing the integrated data source may be time consuming, but it lies off the critical path and is done far less frequently than the data queries from HDFS/Hive using Drill
  14. Conclusion • A novel platform for integrating multiple data sources, without a priori knowledge of the data models of the sources being integrated • Uses indices (join attributes) to perform the actual integration, which enables parallelizing the push of the actual data into HDFS • Uses Apache Drill as a fast query execution engine that supports SQL • Currently ingesting data from TCGA
  15. Current State and Future Plans • Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources • Expanding to a larger multi-node distributed cluster • Integration with DataScope • Multiple data stores and larger data sets • Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA)
  16. Acknowledgements • Google Summer of Code 2015 • NCIP/Leidos 14X138, caMicroscope — A Digital Pathology Integrative Query System; Ashish Sharma, PI • Emory/WUSTL/Stony Brook NCI U01 [1U01CA187013-01], Resources for Development and Validation of Radiomic Analyses & Adaptive Therapy; Fred Prior, Ashish Sharma (UAMS, Emory) • The results published here are in part based upon data generated by the TCGA Research Network:
  17. For more information, including recent updates, please visit: