1
Data Café — A Platform For Creating
Biomedical Data Lakes
Pradeeban Kathiravelu1,2, Ameen Kazerouni2, Ashish Sharma2
1 Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Department of Biomedical Informatics, Emory University, Atlanta, USA
www.sharmalab.info
2
Data Landscape
for Precision Medicine
DATA
CHARACTERISTICS
• Large number of small datasets
• Structured…Semi-structured
…Unstructured…Ill formed
• Noisy and Fuzzy/Uncertain
• Spatial, Temporal relationships
DATA MANAGEMENT
• Variety in storage and messaging
protocols
• No shared interface
3
Illustrative Use Case
Execute a Radiogenomics workflow on the diffusion images of GBM
patients who received a TMZ + experimental regimen with an overall
survival of 18 months or more.
PACS + EMR + AIM + RT + Molecular
4
Motivation
• Most current solutions require a DBA to initiate the migration of data into
a data warehousing environment before all the data can be queried and
explored at once.
• Such warehouses are costly to set up.
• The result is a unified warehouse with access to query and explore the data.
• Limitations
• Limited scalability and extensibility for incorporating new data sources.
• A priori knowledge of the data models of the different data sources is required.
BIOMEDICAL DATA LAKES
• Cohort Discovery and Creation — Assembled per-study
• Heterogeneous data collected in a loosely structured fashion.
• Agile and easy to create.
• Integrate with data exploration/visualization via REST APIs.
• Problem or hypothesis specific virtual data set.
• Powered by Drill + HDFS, Data Sources via APIs.
6
Data Café
• An agile approach that creates and extends the concept of a star schema
to model a problem/hypothesis-specific dataset, leveraging Apache Drill
to query the data easily.
• Tackles the limitations of the existing approaches.
• Gives researchers the ability to add new data models and sources.
7
Core Concepts
Step 1. Given a set of data sources,
create a graphical representation of
the join attributes.
This graph represents how data is
connected across the various data
sources
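A minimal sketch of Step 1, assuming that a join attribute is simply an attribute name shared by two sources. The source names, the attribute names, and the helper `build_join_graph` are hypothetical and only illustrate the idea; they are not taken from the Data Café code.

```python
from itertools import combinations

# Hypothetical data sources and their attribute names (illustration only).
SOURCES = {
    "clinical": {"patient_id", "age", "survival_months"},
    "imaging": {"patient_id", "series_id", "modality"},
    "molecular": {"patient_id", "sample_id", "gene"},
    "pathology": {"sample_id", "slide_id"},
}


def build_join_graph(sources):
    """Return {(source_a, source_b): shared attributes} for each pair that shares any."""
    graph = {}
    for a, b in combinations(sorted(sources), 2):
        shared = sources[a] & sources[b]
        if shared:
            # An edge exists only where the two sources can be joined.
            graph[(a, b)] = shared
    return graph


join_graph = build_join_graph(SOURCES)
for (a, b), attrs in join_graph.items():
    print(f"{a} -- {b} via {sorted(attrs)}")
```

Each edge records which attributes connect two sources, so a later query can tell which ids it needs to collect from each source.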
8
Core Concepts
Step 2. Run a set of parallel queries on
the data sources that include the
attributes that are present in the
query graph.
In the top figure, our query is of type:
{id1: A1 > x and B2 == y}
We run similar queries across C, D and
E and retrieve the set of relevant ids
(join attributes).
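Step 2 can be sketched as follows, assuming two in-memory lists stand in for sources A and B and a thread pool stands in for the parallel query execution. The rows, the threshold `x = 5`, and the helpers `matching_ids` and `run_parallel` are hypothetical; a real deployment would send the predicates to the actual data stores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two data sources, keyed by the join attribute "id".
SOURCE_A = [{"id": 1, "A1": 10}, {"id": 2, "A1": 3}, {"id": 3, "A1": 8}]
SOURCE_B = [{"id": 1, "B2": "y"}, {"id": 2, "B2": "y"}, {"id": 3, "B2": "n"}]


def matching_ids(rows, predicate):
    """Return the join-attribute values of the rows that satisfy the predicate."""
    return {row["id"] for row in rows if predicate(row)}


def run_parallel(queries):
    """Run every (rows, predicate) query in its own thread; return ids per source."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {
            name: pool.submit(matching_ids, rows, predicate)
            for name, (rows, predicate) in queries.items()
        }
        return {name: future.result() for name, future in futures.items()}


x = 5  # assumed threshold for the A1 > x condition
ids_by_source = run_parallel({
    "A": (SOURCE_A, lambda row: row["A1"] > x),
    "B": (SOURCE_B, lambda row: row["B2"] == "y"),
})
print(ids_by_source)
```

Only the ids come back from each source at this stage; the full records stay where they are until the intersection is known.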
9
Core Concepts
Step 3. Compute the intersection across
the various ids (join attributes). The
data of interest can now be obtained
using the ids in this intersection.
A subsequent query will allow us to
stream, in parallel, data from
individual sources, given the relevant
ids (join attributes).
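A minimal sketch of Step 3, assuming the per-source id sets from the previous step are already available. The id sets, the `DATA` rows, and the helpers `common_ids` and `fetch` are hypothetical, and the thread pool only approximates streaming from real sources in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical id sets returned by the per-source queries of Step 2.
IDS_BY_SOURCE = {"A": {1, 3, 4}, "B": {1, 2, 4}, "C": {1, 4, 5}}

# Hypothetical source contents, keyed by the join attribute "id".
DATA = {
    "A": [{"id": 1, "A1": 10}, {"id": 3, "A1": 8}, {"id": 4, "A1": 7}],
    "B": [{"id": 1, "B2": "y"}, {"id": 2, "B2": "y"}, {"id": 4, "B2": "y"}],
    "C": [{"id": 1, "C3": 0.2}, {"id": 4, "C3": 0.9}, {"id": 5, "C3": 0.4}],
}


def common_ids(ids_by_source):
    """Intersect the id sets of all sources; an empty input yields an empty set."""
    sets = list(ids_by_source.values())
    return set.intersection(*sets) if sets else set()


def fetch(source_name, ids):
    """Return (source name, rows whose id is in the intersection)."""
    return source_name, [row for row in DATA[source_name] if row["id"] in ids]


ids = common_ids(IDS_BY_SOURCE)

# Pull the selected rows from every source concurrently.
with ThreadPoolExecutor() as pool:
    selected = dict(pool.map(lambda name: fetch(name, ids), DATA))

print(sorted(ids))
print(selected)
```

Because the intersection is computed on ids alone, the heavier transfer of full records is limited to rows that are already known to belong to the cohort.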
10
Data Café Architecture
11
Apache Drill
• Variety – Query a range of non-relational data sources.
• Flexibility.
• Agility – Faster Insights.
• Scalability.
12
Evaluation Environment
• Data Café was deployed along with the data sources and Drill in Amazon
EC2.
• MongoDB instantiated in EC2 instances.
• Hive on Amazon EMR (Elastic MapReduce).
• EMR HDFS was configured with 3 nodes.
• Various datasets for evaluation
• Two synthetic datasets.
• Clinical Data from the TCGA BRCA collection
13
Results
• Quick creation of data lakes without prior knowledge of the data schema.
• Fast execution of large queries with Apache Drill.
• Data Café can be an efficient platform for exploring an integrated data
source.
• Constructing the integrated data source may be time consuming, but it is
off the critical path: it is done less frequently than the data queries
from HDFS/Hive using Drill.
14
Conclusion
• A novel platform for integrating multiple data sources without a priori
knowledge of the data models of the sources that are being integrated.
• Indices are used to do the actual integration, which enables
parallelizing the push of the actual data into HDFS.
• Apache Drill serves as a fast query execution engine that supports SQL.
• Currently ingesting data from TCGA.
15
Current State and Future Plans
• Ongoing efforts to evaluate the platform with diverse and heterogeneous data
sources.
• Expanding to a larger multi-node distributed cluster.
• Integration with DataScope.
• Multiple data stores and larger data sets.
• Integration with imaging clients such as caMicroscope, as well as archives such
as The Cancer Imaging Archive (TCIA).
Acknowledgements
Google Summer of Code 2015
NCIP/Leidos 14X138, caMicroscope
— A Digital Pathology Integrative
Query System; Ashish Sharma PI
Emory/WUSTL/Stony Brook
NCI U01 [1U01CA187013-01],
Resources for development and
validation of Radiomic Analyses &
Adaptive Therapy, Fred Prior, Ashish
Sharma (UAMS, Emory)
The results published here are in part
based upon data generated by the
TCGA Research Network:
http://cancergenome.nih.gov/
For more information
including recent updates
please visit:
www.sharmalab.info
ashish.sharma@emory.edu

More Related Content

What's hot

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
Databricks
 
Data mining
Data miningData mining
Data mining
Birju Tank
 
Data cloud lab version v.001.2020
Data cloud lab version v.001.2020Data cloud lab version v.001.2020
Data cloud lab version v.001.2020
mdcdwh
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning
Matteo Manca
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
OpenAIRE
 
EDI Training Module 12: An Introduction to Metadata and Data Repositories
EDI Training Module 12:  An Introduction to Metadata and Data RepositoriesEDI Training Module 12:  An Introduction to Metadata and Data Repositories
EDI Training Module 12: An Introduction to Metadata and Data Repositories
Environmental Data Initiative
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly works
OpenAIRE
 
New PID developments
New PID developmentsNew PID developments
New PID developments
OpenAIRE
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databases
tusharjadhav2611
 
9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution
Statice
 
Data mining
Data miningData mining
Data mining
Ritesh Tiwari
 
data warehousing and data mining
data warehousing and data mining data warehousing and data mining
data warehousing and data mining
E2MATRIX
 
CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.
Vishwas Sankhe
 
The Big Metadata
The Big MetadataThe Big Metadata
The Big Metadata
Daniela Tomova
 
Lambda Architecture The Hive
Lambda Architecture The HiveLambda Architecture The Hive
Lambda Architecture The HiveAltan Khendup
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
DataminingTools Inc
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got Semantics
Ontotext
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2
Mahmoud Alfarra
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data Warehouse
Ramakant Soni
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE Indonesia
Crossref
 

What's hot (20)

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
 
Data mining
Data miningData mining
Data mining
 
Data cloud lab version v.001.2020
Data cloud lab version v.001.2020Data cloud lab version v.001.2020
Data cloud lab version v.001.2020
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
 
EDI Training Module 12: An Introduction to Metadata and Data Repositories
EDI Training Module 12:  An Introduction to Metadata and Data RepositoriesEDI Training Module 12:  An Introduction to Metadata and Data Repositories
EDI Training Module 12: An Introduction to Metadata and Data Repositories
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly worksRole of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly works
 
New PID developments
New PID developmentsNew PID developments
New PID developments
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databases
 
9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution9 facts about statice's data anonymization solution
9 facts about statice's data anonymization solution
 
Data mining
Data miningData mining
Data mining
 
data warehousing and data mining
data warehousing and data mining data warehousing and data mining
data warehousing and data mining
 
CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.CRM - Data Collection, Storage and Acces.
CRM - Data Collection, Storage and Acces.
 
The Big Metadata
The Big MetadataThe Big Metadata
The Big Metadata
 
Lambda Architecture The Hive
Lambda Architecture The HiveLambda Architecture The Hive
Lambda Architecture The Hive
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got Semantics
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2
 
Role of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data WarehouseRole of Data Cleaning in Data Warehouse
Role of Data Cleaning in Data Warehouse
 
ORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE IndonesiaORCID at Crossref LIVE Indonesia
ORCID at Crossref LIVE Indonesia
 

Viewers also liked

EMBL Australian Bioinformatics Resource AHM - Data Commons
EMBL Australian Bioinformatics Resource AHM   - Data CommonsEMBL Australian Bioinformatics Resource AHM   - Data Commons
EMBL Australian Bioinformatics Resource AHM - Data Commons
Vivien Bonazzi
 
Entrance test for teacher 2013...
Entrance test for teacher 2013...Entrance test for teacher 2013...
Entrance test for teacher 2013...Ashish Sharma
 
From protein interaction networks to human phenotypes
From protein  interaction networks to human phenotypesFrom protein  interaction networks to human phenotypes
From protein interaction networks to human phenotypes
biocs
 
Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...
Neil Saunders
 
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and RecommendationsLeveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Nitish Aggarwal
 
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of CognitionTowards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
Amrapali Zaveri, PhD
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalism
Bahareh Heravi
 
Linked data in the digital humanities skills workshop for realising the oppo...
Linked data in the digital humanities  skills workshop for realising the oppo...Linked data in the digital humanities  skills workshop for realising the oppo...
Linked data in the digital humanities skills workshop for realising the oppo...
jodischneider
 
Beyond Journalism Chicago
Beyond Journalism ChicagoBeyond Journalism Chicago
Beyond Journalism Chicago
Mark Deuze
 
Harrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social mediaHarrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social media
dri_ireland
 
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction NetworksSpecificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
pedrobeltrao
 
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Ronak Shah
 
Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...
jodischneider
 
Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...
Lars Juhl Jensen
 
PhD viva - 11th November 2015
PhD viva - 11th November 2015PhD viva - 11th November 2015
PhD viva - 11th November 2015
Kevin Keraudren
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
Aidan Hogan
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
Pradeeban Kathiravelu, Ph.D.
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis
Sean Davis
 
Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane
 
Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013
Scribe Software Corp.
 

Viewers also liked (20)

EMBL Australian Bioinformatics Resource AHM - Data Commons
EMBL Australian Bioinformatics Resource AHM   - Data CommonsEMBL Australian Bioinformatics Resource AHM   - Data Commons
EMBL Australian Bioinformatics Resource AHM - Data Commons
 
Entrance test for teacher 2013...
Entrance test for teacher 2013...Entrance test for teacher 2013...
Entrance test for teacher 2013...
 
From protein interaction networks to human phenotypes
From protein  interaction networks to human phenotypesFrom protein  interaction networks to human phenotypes
From protein interaction networks to human phenotypes
 
Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...Using structural information to predict protein-protein interaction and enyzm...
Using structural information to predict protein-protein interaction and enyzm...
 
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and RecommendationsLeveraging Wikipedia-based Features for Entity Relatedness and Recommendations
Leveraging Wikipedia-based Features for Entity Relatedness and Recommendations
 
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of CognitionTowards Biomedical Data Integration for Analyzing the Evolution of Cognition
Towards Biomedical Data Integration for Analyzing the Evolution of Cognition
 
Towards Social semantic journalism
Towards Social semantic journalismTowards Social semantic journalism
Towards Social semantic journalism
 
Linked data in the digital humanities skills workshop for realising the oppo...
Linked data in the digital humanities  skills workshop for realising the oppo...Linked data in the digital humanities  skills workshop for realising the oppo...
Linked data in the digital humanities skills workshop for realising the oppo...
 
Beyond Journalism Chicago
Beyond Journalism ChicagoBeyond Journalism Chicago
Beyond Journalism Chicago
 
Harrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social mediaHarrower Heravi RDA P4 Social media
Harrower Heravi RDA P4 Social media
 
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction NetworksSpecificity and Evolvability in Eukaryotic Protein Interaction Networks
Specificity and Evolvability in Eukaryotic Protein Interaction Networks
 
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...
 
Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...Identifying, annotating, and filtering arguments and opinions on the social w...
Identifying, annotating, and filtering arguments and opinions on the social w...
 
Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...Combining sequence motifs and protein interactions to unravel complex phospho...
Combining sequence motifs and protein interactions to unravel complex phospho...
 
PhD viva - 11th November 2015
PhD viva - 11th November 2015PhD viva - 11th November 2015
PhD viva - 11th November 2015
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis
 
Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation Sabrina Kirrane INSIGHT Viva Presentation
Sabrina Kirrane INSIGHT Viva Presentation
 
Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013Industry Report: The State of Customer Data Integration in 2013
Industry Report: The State of Customer Data Integration in 2013
 

Similar to Data Café — A Platform For Creating Biomedical Data Lakes

Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
Uri Laserson
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
Geoffrey Fox
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
datastack
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
DataWorks Summit
 
Dataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTagsDataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTags
Merce Crosas
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
SoftServe
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College London
Torsten Reimer
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
Simon Ambridge
 
data analytics lecture3.ppt
data analytics lecture3.pptdata analytics lecture3.ppt
data analytics lecture3.ppt
NamrataBhatt8
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
Jordan Open Source Association
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
larsgeorge
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
AKSHAY BHAGAT
 
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare InnovationUsing The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Dan Wellisch
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
Globus
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Geoffrey Fox
 
Big Data
Big Data Big Data
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
Uri Laserson
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
Geoffrey Fox
 

Similar to Data Café — A Platform For Creating Biomedical Data Lakes (20)

Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Dataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTagsDataverse, Cloud Dataverse, and DataTags
Dataverse, Cloud Dataverse, and DataTags
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College London
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
data analytics lecture3.ppt
data analytics lecture3.pptdata analytics lecture3.ppt
data analytics lecture3.ppt
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Using The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare InnovationUsing The Hadoop Ecosystem to Drive Healthcare Innovation
Using The Hadoop Ecosystem to Drive Healthcare Innovation
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Big Data
Big Data Big Data
Big Data
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 

Data Café — A Platform For Creating Biomedical Data Lakes

  • 7. Core Concepts. Step 1: Given a set of data sources, create a graphical representation of the join attributes. This graph represents how data is connected across the various data sources.
  • 8. Core Concepts. Step 2: Run a set of parallel queries on the data sources that include the attributes present in the query graph. In the top figure, the query is of the type {id1: A1 > x and B2 == y}. Similar queries are run across C, D, and E to retrieve the set of relevant IDs (join attributes).
  • 9. Core Concepts. Step 3: Compute the intersection across the various IDs (join attributes). The data of interest can now be obtained using the IDs in this intersection. A subsequent query then streams the data from the individual sources, in parallel, given the relevant IDs (join attributes).
  • 11. Apache Drill
    • Variety: query a range of non-relational data sources.
    • Flexibility.
    • Agility: faster insights.
    • Scalability.
  • 12. Evaluation Environment
    • Data Café was deployed along with the data sources and Drill in Amazon EC2.
    • MongoDB instantiated in EC2 instances.
    • Hive on Amazon EMR (Elastic MapReduce).
    • EMR HDFS was configured with 3 nodes.
    • Various datasets for evaluation:
    • Two synthetic datasets.
    • Clinical data from the TCGA BRCA collection.
  • 13. Results
    • Quick creation of data lakes without prior knowledge of the data schema.
    • Very fast execution of large queries with Apache Drill.
    • Data Café can be an efficient platform for exploring an integrated data source.
    • Constructing the integrated data source may be time-consuming, but it is less on the critical path: it is done less frequently than the data queries from HDFS/Hive using Drill.
  • 14. Conclusion
    • A novel platform for integrating multiple data sources without a priori knowledge of the data models of the sources being integrated.
    • Uses indices to do the actual integration.
    • Enables parallelizing the push of the actual data into HDFS.
    • Apache Drill serves as a fast query execution engine that supports SQL.
    • Currently ingesting data from TCGA.
  • 15. Current State and Future Plans
    • Ongoing efforts to evaluate the platform with diverse and heterogeneous data sources.
    • Expanding to a larger multi-node distributed cluster.
    • Integration with DataScope.
    • Multiple data stores and larger data sets.
    • Integration with imaging clients such as caMicroscope, as well as archives such as The Cancer Imaging Archive (TCIA).
  • 16. Acknowledgements
    • Google Summer of Code 2015
    • NCIP/Leidos 14X138, caMicroscope: A Digital Pathology Integrative Query System; Ashish Sharma PI Emory/WUSTL/Stony Brook
    • NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory)
    • The results published here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/
  • 17. For more information, including recent updates, please visit www.sharmalab.info or contact ashish.sharma@emory.edu
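
The three Core Concepts steps (join graph, parallel per-source queries, intersection of join attributes) can be illustrated with a minimal Python sketch. This is not the Data Café implementation: the in-memory `SOURCES` dictionary, the function names, the sample records, and the threshold values standing in for `x` and `y` are all hypothetical, and a thread pool stands in for whatever parallel execution the platform actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical in-memory stand-ins for remote data sources. Every record
# carries the join attribute "id1" plus source-specific fields.
SOURCES = {
    "A": [{"id1": 1, "A1": 5}, {"id1": 2, "A1": 9}, {"id1": 3, "A1": 12}],
    "B": [{"id1": 1, "B2": "y"}, {"id1": 2, "B2": "y"}, {"id1": 3, "B2": "n"}],
    "C": [{"id1": 2, "C1": "scan-2"}, {"id1": 3, "C1": "scan-3"}],
}


def join_graph(sources):
    """Step 1: connect each pair of sources by the attributes they share."""
    attrs = {
        name: set().union(*(row.keys() for row in rows))
        for name, rows in sources.items()
    }
    return {
        (a, b): attrs[a] & attrs[b]
        for a, b in combinations(sorted(attrs), 2)
        if attrs[a] & attrs[b]
    }


def matching_ids(rows, predicate, key="id1"):
    """Return the join-attribute values of the rows that satisfy the predicate."""
    return {row[key] for row in rows if predicate(row)}


def run_parallel_queries(sources, predicates, key="id1"):
    """Step 2: query every source concurrently, collecting only the IDs."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(matching_ids, sources[name], predicate, key)
            for name, predicate in predicates.items()
        }
        return {name: future.result() for name, future in futures.items()}


def rows_for_ids(rows, ids, key="id1"):
    """Return the full rows whose join attribute is in the given ID set."""
    return [row for row in rows if row[key] in ids]


def fetch_intersection(sources, id_sets, key="id1"):
    """Step 3: intersect the ID sets, then pull matching rows in parallel."""
    ids = set.intersection(*id_sets.values())
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(rows_for_ids, rows, ids, key)
            for name, rows in sources.items()
        }
        return ids, {name: future.result() for name, future in futures.items()}


# Query of the form {id1: A1 > x and B2 == y}, with x = 6 and y = "y" assumed.
predicates = {
    "A": lambda row: row["A1"] > 6,
    "B": lambda row: row["B2"] == "y",
    "C": lambda row: True,  # no filter on C; it only contributes its IDs
}

graph = join_graph(SOURCES)
id_sets = run_parallel_queries(SOURCES, predicates)
ids, data = fetch_intersection(SOURCES, id_sets)
```

Only the join-attribute values travel between steps 2 and 3, so the full records are read once, and only for the IDs that survive the intersection.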