Eagle Bioinformatics Symposium: 1. Martin Strahm: Understanding Disease Informatics System: A Foundation to Integrate Exploratory Translational Data

UDIS is the solution we built in-house to integrate -omics data in one place. The system brings together clinical (human cohort) and pre-clinical (animal, cell line) data from internal and public sources. It serves as a data source for simple visualizations suitable for many scientists and, via batch download, for expert analysts.

The first challenge in building such a system is the huge amount of data that has to be stored and made accessible with reasonable performance. The second is the variety of data sources, which store their data in different formats and use different vocabularies.

We were able to overcome the first challenge quite easily by bringing the right talent together. The second is currently addressed by a reasonably large investment in resources for data loading and curation, but would benefit tremendously from one industry-wide standard.
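
To make the vocabulary problem concrete, here is a minimal sketch of the kind of mapping that curation involves, using the smoking-status terms listed on slides 13 and 14 of the transcript. The lookup tables, the function name `harmonize`, and every individual term pairing are illustrative assumptions, not the mappings actually used in UDIS.

```python
# Hypothetical lookup tables; the pairings are assumptions for illustration.

# (source, external property name) -> internal property name
PROPERTY_NAMES = {
    ("TCGA", "Tabacco_smoking_history"): "Smoking_status",  # spelled as on the slide
    ("CDISC", "SMKCLAS"): "Smoking_status",
}

# (source, external property value) -> internal controlled-vocabulary value
PROPERTY_VALUES = {
    ("TCGA", "Lifelong Non-smoker"): "Non-smoker",
    ("TCGA", "Current smoker"): "Current smoker",
    ("TCGA", "Current reformed smoker for > 15 years"): "Former smoker",
    ("TCGA", "Current reformed smoker for < or = 15 years"): "Former smoker",
    ("TCGA", "Current Reformed Smoker, Duration Not Specified"): "Former smoker",
    ("CDISC", "NEVER SMOKED"): "Non-smoker",
    ("CDISC", "SMOKER"): "Current smoker",
    ("CDISC", "EX SMOKER"): "Former smoker",
}


def harmonize(source, name, value):
    """Translate one external (name, value) pair into internal terms.

    Returns (internal_name, internal_value), or None when either part has
    no mapping yet and therefore needs a human curator.
    """
    internal_name = PROPERTY_NAMES.get((source, name))
    internal_value = PROPERTY_VALUES.get((source, value))
    if internal_name is None or internal_value is None:
        return None
    return internal_name, internal_value
```

For example, `harmonize("CDISC", "SMKCLAS", "EX SMOKER")` yields `("Smoking_status", "Former smoker")`, while a value with no entry, such as "Passive smoker", comes back as `None` and is left for manual review. Note that collapsing the three TCGA "reformed smoker" values into one internal term discards the duration information, which is exactly the granularity decision slide 15 asks for up front.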

Transcript

  • 1. Understanding Disease Informatics System (UDIS): A Foundation to Integrate Exploratory Translational Data. Martin Strahm, F. Hoffmann-La Roche
  • 2. UDIS - A Foundation to Integrate Exploratory Translational Data • Design criteria – Data sources – Data types – Meta data – User community • Implementation and challenges – High dimensional data – Meta data • Conclusion
  • 3. DESIGN CRITERIA Some considerations we had when starting to build UDIS
  • 4. Design Criteria: Data Sources Experimental Medicine Studies Genetic variations of ~1000 (~2500) human individuals from ~30 ethnicities. ~20 cancer types ~100-1000 samples per type ~10 data types per sample 600 cell lines Animal models, Xenografts, Tissue Profiles Pre-clinical Clinical/Human Cohorts In-house Public
  • 5. Design Criteria: Data Types. DNA: Sequence variation; Copy number variation
  • 6. Design Criteria: Data Types. mRNA: Expression RNA Chip; Expression RNA Seq
  • 7. Design Criteria: Meta Data. Meta Data / Low Dimensional Data: Gender, Disease (MeSH), Disease (ICD10), Tissue, Organ, Treatment, … Clinical chemistry: Albumin, Bilirubin, Aspartate transaminase, … EQ-5D: Mobility, Self-Care, Usual Activities, Pain/Discomfort, Anxiety/Depression, …
  • 8. Design Criteria: User Community. Bench Scientist / Biologist: simple view on simple data set. Expert Analyst / Statistician: batch download / API; interface to R, Spotfire, GenePattern
  • 9. Design Criteria: User Community • Some simple questions for a broad user community – In which tissue is my gene of interest expressed? – What is the frequency of my SNP of interest in a population? – How frequent is a somatic mutation in a cancer type? • Most questions are complicated and need expert analysis – Correlation between disease status and SNPs in a heterogeneous population. – Finding biomarkers for disease progression.
  • 10. IMPLEMENTATION AND CHALLENGES
  • 11. Implementation: High Dimensional Data. Table SEQ_VARIATION: SEQ_VAR_ID (PK), GENOME_ID, DBSNP_ID, CHROMOSOME, SEQ_START, SEQ_END, SEQ_REF, SEQ_ALT. Table SEQ_VARIATION_RESULT: SEQ_VARIATION_RESULT_ID (PK), EXPERIMENT_ID, SEQ_VAR_ID (FK1)
  • 12. Implementation: High Dimensional Data [Bar chart of row counts, axis from 0 to 30'000'000'000 rows; series: BDS (RESULTS), 1KG (VARIATIONS); labeled values: 558'000'000 and 30'000'000'000]
  • 13. Implementation: Meta Data. Source: UDIS CV | Property name: Smoking_status | Property values: Current smoker; Former smoker; Non-smoker. Source: TCGA | Property name: Tabacco_smoking_history | Property values: Lifelong Non-smoker; Current smoker; Current reformed smoker for > 15 years; Current reformed smoker for < or = 15 years; Current Reformed Smoker, Duration Not Specified
  • 14. Implementation: Meta Data. Source: CDISC | Property name: SMKCLAS | Property values: NEVER SMOKED; SMOKER; EX SMOKER. Source: Other | Property name: Smoking status | Property values: Neversmoker; Non-smoker; Passive smoker; Current smoker; Former smoker; Smoker (not further specified)
  • 15. Implementation: Meta Data. Initial work: Decide on the terms and granularity used in your system. There are multiple standards out there which may be used for common concepts. For each data source: Map property names used externally to internal property names. Map property values used externally to internal property values. Convert external data format (XML, CSV, XLS, RDB, …) to input format
  • 16. Implementation: Meta Data. Source: FF | Property name (property values): TOTCOM_1 (INTEGER); TOTSOC_1 (INTEGER); TOTCOMSOC_1 (INTEGER); TOTC_1 (INTEGER); TOTD_1 (INTEGER); TOTE_1 (INTEGER); TOT_CS_M1 (INTEGER); DIAG_CS_M1 (YES|NO); TOT_TSA_M1 (INTEGER); DIAG_TSA_M1 (YES|NO)
  • 17. Implementation: Meta Data. Source: FF | Property name (property values): MOBILITY (INTEGER); SELFCARE (INTEGER); ACTIVITY (INTEGER); PAIN (INTEGER); ANXIETY (INTEGER); EQ_VAS (INTEGER); EQ5D_SCORE (INTEGER)
  • 18. CONCLUSION
  • 19. Conclusion • Data volume is challenging, but can be managed with current technology • The variety of annotation standards is extremely challenging and requires a lot of human work – Huge number of property names in individual studies (thousands) – Deep biomedical knowledge required to understand annotations – Standardized exchange format and vocabulary would help • We do not have the perfect solution
  • 20. Doing now what patients need next
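
As a concrete reading of slide 11, the two tables can be sketched with Python's built-in `sqlite3` module. The table and column names come from the slide; the column types, the use of SQLite, and the helper names `create_schema` and `count_results` are assumptions made for illustration, and the talk does not say which database engine UDIS uses.

```python
import sqlite3


def create_schema(conn):
    """Create the two slide-11 tables. Column types are assumed."""
    conn.executescript(
        """
        CREATE TABLE SEQ_VARIATION (
            SEQ_VAR_ID  INTEGER PRIMARY KEY,
            GENOME_ID   INTEGER,
            DBSNP_ID    TEXT,
            CHROMOSOME  TEXT,
            SEQ_START   INTEGER,
            SEQ_END     INTEGER,
            SEQ_REF     TEXT,
            SEQ_ALT     TEXT
        );
        CREATE TABLE SEQ_VARIATION_RESULT (
            SEQ_VARIATION_RESULT_ID INTEGER PRIMARY KEY,
            EXPERIMENT_ID           INTEGER,
            SEQ_VAR_ID              INTEGER REFERENCES SEQ_VARIATION (SEQ_VAR_ID)
        );
        """
    )


def count_results(conn, dbsnp_id):
    """Count result rows that reference the variation with this dbSNP ID."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM SEQ_VARIATION_RESULT AS r
        JOIN SEQ_VARIATION AS v ON v.SEQ_VAR_ID = r.SEQ_VAR_ID
        WHERE v.DBSNP_ID = ?
        """,
        (dbsnp_id,),
    ).fetchone()
    return row[0]
```

In this layout each variation is described once in SEQ_VARIATION, and every observation of it adds only a short row to SEQ_VARIATION_RESULT, which is the table that grows with the number of experiments. Calling `create_schema(sqlite3.connect(":memory:"))` gives an empty in-memory database to experiment with.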