Automated Hypothesis Testing with Large Scale Scientific Workflows

  1. Automated Hypothesis Testing with Large Scale Scientific Workflows. Yolanda Gil, Daniel Garijo, Rajiv Mayani, Varun Ratnakar (Information Sciences Institute & Department of Computer Science, University of Southern California, http://www.isi.edu). Parag Mallick, Ravali Adusumilli, Hunter Boyce (Stanford School of Medicine, Canary Center for Early Cancer Detection, Stanford University, http://mallicklab.stanford.edu). http://www.disk-project.org
  2. Talk Outline ๏ Motivation ๏ Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  3. Scientific Data Analysis Today: Inefficient, Incomplete, Irreproducible ๏ Data analysis is time consuming ๏ Not systematic ๏ Not updated when new data/methods become available ๏ Hard/impractical to reproduce prior work ๏ Overall process is manually done: inefficient and error-prone ๏ Analytic knowledge is compartmentalised. Diagram steps: New hypothesis; Formulate line of inquiry (data + method); Retrieve data; Run workflows (methods); Meta-analysis of results
  4. Our Focus: Cancer Multi-Omics ๏ Data Availability and Complexity: • The multi-omic domain is filled with multiple levels of heterogeneous data that are regularly expanding in volume and complexity through projects like The Cancer Genome Atlas (TCGA) and the associated Clinical Proteomic Tumor Analysis Consortium (CPTAC)
  5. Our Focus: Cancer Multi-Omics ๏ Analytic Complexity: • Multi-omic analysis requires the use of dozens of interconnected tools, each of which may require substantial domain knowledge. Tools shown: MAQ, BWA, BWA-SW (SE only), PERM, SOAPv2, MOSAIK, NOVOALIGN, SAMTOOLS, PICARD, GATK, PICARD, SAMTOOLS, IGVtools. Domain knowledge is isolated
  6. Our Focus: Cancer Multi-Omics ๏ Multiple types and complexities of hypotheses: • Hypotheses span the range from single-gene/single-dataset to multi-gene/multi-ome/multi-dataset • Is this protein found in this sample? • Is this gene found in this sample? • Is this protein associated with a certain cancer? • Which proteins are associated with a certain cancer? • ...
  7. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  8. Our Approach: Hypotheses-Driven Discovery ๏ Represent scientist hypotheses ๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued by data analysis workflows ๏ Design a meta-analysis that examines the results of lines of inquiry and either validates or revises the original hypotheses ๏ Develop an intelligent agent that can report and explain new findings to the scientist Hypothesis Lines of Inquiry Specify relevant analytic methods (workflows), type of data needed, and how to combine results Query to retrieve Data Data Analysis Workflows Workflow Bindings Meta-Workflows Confidence Estimation Benchmarking Revised hypothesis & interesting findings
  9. Representing Hypotheses Hypothesis Lines of Inquiry Specify relevant analytic methods (workflows), type of data needed, and how to combine results Query to retrieve Data Data Analysis Workflows Workflow Bindings Meta-Workflows Confidence Estimation Benchmarking Revised hypothesis & interesting findings Representing Hypotheses
  10. Requirements from Omics ๏ Graph-based hypothesis representation • Entities are nodes • Relationships are links ๏ Annotations on graphs • Represent qualifications of hypotheses: confidence and evidence ๏ Representing hypothesis evolution • Graph versioning. Graph representation in RDF: ๏ Standard semantic web language ๏ Scalable reasoners available ๏ Qualifications and provenance through triple reification ๏ Versioning through multiple named graphs
  11. Representing Hypotheses Biology ontology Hypothesis ontology hyp:expressedIn user:TCGA-AA-3561-01A-22 User data definitions hyp:associatedWith bio:ColonCancer Graph Hy1 Graph Hy2 bio:PRKCDBP bio:PRKCDBP
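
A minimal sketch of how the two hypothesis graphs on this slide could be held as named graphs of triples, written in plain Python rather than with an RDF library. The prefixed identifiers (`hyp:expressedIn`, `hyp:associatedWith`, `bio:PRKCDBP`, `bio:ColonCancer`, `user:TCGA-AA-3561-01A-22`) and the graph names `Hy1` and `Hy2` come from the slide. The dictionary layout and the helper `statements_about` are assumptions for illustration, and the assignment of the `associatedWith` triple to `Hy2` is read off the flattened slide layout.

```python
# Named graphs: graph name -> set of (subject, predicate, object) triples.
# Identifiers are taken from the slide; the container layout is illustrative.
hypothesis_graphs = {
    "Hy1": {("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22")},
    "Hy2": {("bio:PRKCDBP", "hyp:associatedWith", "bio:ColonCancer")},
}


def statements_about(entity, graphs):
    """Return (graph name, triple) pairs whose subject or object is `entity`."""
    found = []
    for name in sorted(graphs):
        for triple in sorted(graphs[name]):
            if entity in (triple[0], triple[2]):
                found.append((name, triple))
    return found


for name, triple in statements_about("bio:PRKCDBP", hypothesis_graphs):
    print(name, *triple)
```

Keeping each hypothesis in its own named graph is what lets a later revision be stored as a separate graph without overwriting the original.
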
  12. Lifecycle of a hypothesis Biology ontology Hypothesis ontology hyp:expressedIn user:TCGA-AA-3561-01A-22 User data definitions hyp:associatedWith bio:ColonCancer Graph Hy1 Graph Hy2 bio:PRKCDBP bio:PRKCDBP
  13. 1. Initial Hypothesis, Data & Workflows Data Available Workflows Available Proteomics Proteogenomics XX_3561Proteome_VU.zip (MassSpecData) producedData TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX2 (Experiment) experimentedOn Hypothesis Statement Hy1 PRKCDBP expressedIn TCGA-AA-3561-01A-22
  14. 2. Running workflows on Data Data Available Workflows Available Proteomics Proteogenomics XX_3561Proteome_VU.zip (MassSpecData) producedData TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX2 (Experiment) experimentedOn Workflow Execution W1 hasWorkflowTemplate used Hypothesis Statement Hy1 PRKCDBP expressedIn TCGA-AA-3561-01A-22
  15. 3. Meta reasoning about workflow results Qualifications of Hy1' Provenance of Hy1' Hypothesis Statement Hy1 PRKCDBP expressedIn TCGA-AA-3561-01A-22 Data Available Workflows Available Proteomics Proteogenomics XX_3561Proteome_VU.zip (MassSpecData) producedData TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX2 (Experiment) experimentedOn Workflow Execution W1 hasWorkflowTemplate used Meta-Workflow Execution MW1 used Revised Hypothesis Statement Hy1' PRKCDBP expressedIn TCGA-AA-3561-01A-22 hasConfidenceValue 0 Statement Hy1'-S1 hasProvenance produced used produced revisionOf
  16. 4. New Data becomes available Workflows Available Proteomics Proteogenomics Hypothesis Statement Ha1 PRKCDBP expressedIn TCGA-AA-3561-01A-22 Data Available XX_3561Proteome_VU.zip (MassSpecData) producedData producedData experimentedOn experimentedOn TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX1 (Experiment) AA_3561_EX2 (Experiment) XX_3561_DD.zip (RNASeqData)
  17. 5. New Multi-Workflows are also run Workflows Available Proteomics Proteogenomics used Data Available XX_3561Proteome_VU.zip (MassSpecData) producedData producedData experimentedOn experimentedOn TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX1 (Experiment) AA_3561_EX2 (Experiment) Workflow Execution W2 XX_3561_DD.zip (RNASeqData) Workflow Execution W1 used Hypothesis Statement Ha1 PRKCDBP expressedIn TCGA-AA-3561-01A-22
  18. 6. Hypothesis Revision Qualifications of Ha1' hasProvenance Provenance of Ha1' Workflows Available Proteomics Proteogenomics used used Revised Hypothesis Statement Ha1' PRKCDBP Mutated expressedIn TCGA-AA-3561-01A-22 hasConfidenceValue 0.98 Statement Ha1'-S1 produced used Data Available XX_3561Proteome_VU.zip (MassSpecData) producedData producedData experimentedOn experimentedOn TCGA-AA-3561 (Patient) collectedFrom TCGA-AA-3561-01A-22 (Sample) AA_3561_EX1 (Experiment) AA_3561_EX2 (Experiment) Workflow Execution W2 XX_3561_DD.zip (RNASeqData) Workflow Execution W1 used used produced Meta-Workflow Execution MW2 Hypothesis Statement Ha1 PRKCDBP expressedIn TCGA-AA-3561-01A-22 revisionOf
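
The revision step in the lifecycle can be sketched as a chain of hypothesis records, where each revision keeps a link back to the statement it replaces. The property names in the comments (`hasConfidenceValue`, `hasProvenance`, `revisionOf`), the statement contents, the value 0.98 and the execution names `W1`, `W2`, `MW2` come from the slides. The `Statement` and `Hypothesis` classes and the helpers `revise` and `history` are hypothetical and are not part of DISK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str


@dataclass
class Hypothesis:
    name: str
    statement: Statement
    confidence: Optional[float] = None                    # hasConfidenceValue
    provenance: List[str] = field(default_factory=list)   # hasProvenance
    revision_of: Optional["Hypothesis"] = None            # revisionOf


def revise(original, new_statement, confidence, executions):
    """Create a revised hypothesis that points back to the original."""
    return Hypothesis(
        name=original.name + "'",
        statement=new_statement,
        confidence=confidence,
        provenance=list(executions),
        revision_of=original,
    )


def history(hypothesis):
    """Follow revisionOf links from the newest version back to the first."""
    names = []
    while hypothesis is not None:
        names.append(hypothesis.name)
        hypothesis = hypothesis.revision_of
    return names


sample = "TCGA-AA-3561-01A-22"
ha1 = Hypothesis("Ha1", Statement("PRKCDBP", "expressedIn", sample))
ha1_revised = revise(
    ha1,
    Statement("PRKCDBP Mutated", "expressedIn", sample),
    confidence=0.98,
    executions=["W1", "W2", "MW2"],
)
print(history(ha1_revised))
```

The original hypothesis is never mutated, so earlier versions and their evidence remain available for inspection.
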
  19. Representing Lines of Inquiry & Data analysis workflows Hypothesis Lines of Inquiry Specify relevant analytic methods (workflows), type of data needed, and how to combine results Query to retrieve Data Data Analysis Workflows Workflow Bindings Meta-Workflows Confidence Estimation Benchmarking Revised hypothesis & interesting findings
  20. Lines of Inquiry ๏ Capture how to set up potential analyses that can be pursued to test a certain type of hypothesis. Hypothesis Pattern: bio:Protein ?p hyp:expressedIn bio:Sample ?s. Data Query Pattern (node and edge labels): DataFile ?d; producedData; Patient ?p; collectedFrom; Sample ?s; Experiment ?e; experimentedOn. Data Analytic Workflows: Proteomics, Proteogenomics (input: DataFile ?d). Meta-workflows: Comparison, Confidence estimation, Benchmarking
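
A rough sketch of how a line of inquiry could bind a hypothesis pattern and then reuse those bindings in a data query. The variables (`?p`, `?s`, `?e`, `?d`), the property names and the data identifiers are taken from the slides. The edge directions in `data` are inferred from the flattened diagrams, and the functions `match` and `query` are a hypothetical toy matcher, not the query mechanism DISK uses.

```python
def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples, bindings=None):
    """Yield every set of bindings that satisfies all patterns together."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for triple in triples:
        extended = match(patterns[0], triple, bindings)
        if extended is not None:
            yield from query(patterns[1:], triples, extended)


# Edge directions below are an assumption read off the slide diagrams.
data = [
    ("AA_3561_EX1", "experimentedOn", "TCGA-AA-3561-01A-22"),
    ("AA_3561_EX1", "producedData", "XX_3561_DD.zip"),
    ("AA_3561_EX2", "experimentedOn", "TCGA-AA-3561-01A-22"),
    ("AA_3561_EX2", "producedData", "XX_3561Proteome_VU.zip"),
    ("TCGA-AA-3561-01A-22", "collectedFrom", "TCGA-AA-3561"),
]

hypothesis_pattern = ("?p", "hyp:expressedIn", "?s")
hypothesis = ("bio:PRKCDBP", "hyp:expressedIn", "TCGA-AA-3561-01A-22")
bound = match(hypothesis_pattern, hypothesis, {})

data_query = [("?e", "experimentedOn", "?s"), ("?e", "producedData", "?d")]
files = sorted(b["?d"] for b in query(data_query, data, bound))
print(files)
```

Because `?s` is already bound by the hypothesis, the data query only returns files from experiments on that sample.
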
  21. Example Multi-omics Workflow (Zhang et al. replication)
  22. Automated Workflow Generation in WINGS by Reasoning about Semantic Constraints. Example: all input data must be from human species, i.e. must have HS in metadata. The workflow system uses this constraint to select only datasets that have HS in their metadata, so that the selected inputs are valid
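
The constraint in this example can be illustrated with a small metadata filter. This is not the WINGS API: the dataset records, the `species` key, the `MM` value and the function `satisfies` are all hypothetical, and only the idea of requiring `HS` in the metadata comes from the slide.

```python
# Hypothetical dataset catalogue; only the "HS" value comes from the slide.
datasets = [
    {"id": "dataset_a", "metadata": {"species": "HS"}},
    {"id": "dataset_b", "metadata": {"species": "MM"}},
    {"id": "dataset_c", "metadata": {}},
]


def satisfies(dataset, constraints):
    """True when every required metadata key has the required value."""
    metadata = dataset["metadata"]
    return all(metadata.get(key) == value for key, value in constraints.items())


workflow_constraints = {"species": "HS"}
valid = [d["id"] for d in datasets if satisfies(d, workflow_constraints)]
print(valid)
```

Datasets with missing metadata are rejected here rather than assumed valid, which is one possible design choice for a constraint of this kind.
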
  23. Representing Hypotheses Hypothesis Lines of Inquiry Specify relevant analytic methods (workflows), type of data needed, and how to combine results Query to retrieve Data Data Analysis Workflows Workflow Bindings Meta-Workflows Confidence Estimation Benchmarking Revised hypothesis & interesting findings
  24. Meta-workflows: 1) Comparison Meta-Workflows Variant Detection Custom Protein DB Protein Identification Protein Identification Custom DB Reference DB Protein IDs Protein IDs Similarity Score Data Dependent: • Peptide Level • Protein Level • Scan Level Comparison Meta-Workflow ๏ Goals: • Compare results amongst multiple workflows • Measure the global similarity amongst multiple workflows • Provide users with explanation of workflow-dependent differences in results
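
One way to picture a comparison at the protein level is a set similarity between the identifications returned by two workflows. The slide does not say which similarity score is used, so the Jaccard index below is an assumption, and the protein IDs and the function `compare` are placeholders.

```python
def compare(ids_a, ids_b):
    """Jaccard similarity plus the IDs unique to each workflow's result."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    similarity = len(a & b) / len(union) if union else 1.0
    return {
        "similarity": similarity,
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Placeholder protein IDs, not real results.
reference_db_ids = ["P1", "P2", "P3"]
custom_db_ids = ["P2", "P3", "P4", "P5"]

report = compare(reference_db_ids, custom_db_ids)
print(report)
```

The `only_a` and `only_b` lists are the starting point for explaining workflow-dependent differences, since they name exactly what one workflow found and the other did not.
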
  25. Meta-workflows: 2) Benchmark Meta-Workflows ๏ Goals: • Evaluation of workflow performance • Training of confidence estimation models (probabilistic) Probabilistic Models Benchmark Meta-Workflow ROC, True/False Positive Rate
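
The true and false positive rates behind a ROC curve can be computed from scored identifications and known ground truth, as in the sketch below. The scores, the labels and the function `roc_points` are illustrative assumptions. The code also assumes that the benchmark contains at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is called positive.
        called = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(called)
        false_pos = len(called) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Illustrative workflow scores and ground-truth labels (1 = truly present).
scores = [0.9, 0.8, 0.6, 0.3]
labels = [1, 1, 0, 1]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```
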
  26. Meta-workflows: 3) Confidence estimation Meta-Workflows ๏ Goals: • Composite results from multiple workflows • Estimate confidence of the workflow result • Use estimated confidence to update hypothesis Protein Identification Protein Identification Custom DB Reference DB Protein IDs Protein IDs Probabilistic Model Estimate Confidence Update Hypothesis Benchmark Meta-Workflow
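
The slide does not specify the probabilistic model, so the following is only one plausible sketch of combining evidence from several workflows: a naive Bayes update that treats the workflows as independent and uses per-workflow true and false positive rates of the kind a benchmark meta-workflow could supply. The prior, the rates and the function `posterior` are all assumptions, and every rate must lie strictly between 0 and 1.

```python
def posterior(prior, evidence):
    """Naive Bayes update of the probability that the hypothesis holds.

    `evidence` is a list of (detected, tpr, fpr) tuples, one per workflow.
    """
    odds = prior / (1.0 - prior)
    for detected, tpr, fpr in evidence:
        if detected:
            odds *= tpr / fpr                   # likelihood ratio of a hit
        else:
            odds *= (1.0 - tpr) / (1.0 - fpr)   # likelihood ratio of a miss
    return odds / (1.0 + odds)


# Assumed benchmark rates for two workflows that both detected the protein.
confidence = posterior(0.5, [(True, 0.9, 0.1), (True, 0.8, 0.2)])
print(round(confidence, 4))
```

The resulting value plays the role of the confidence attached to the revised hypothesis. A workflow that fails to detect the protein lowers the confidence instead of raising it.
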
  27. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  28. DISK Walkthrough: Initial Hypothesis ๏ Initial hypothesis is provided by the user • PRKCDBP protein is expressed in a patient sample
  29. DISK Walkthrough: Lines of Inquiry ๏ The line of inquiry suggests finding data from different experiments done with the patient’s sample, then running multi-omic workflows, and then combining the evidence into a confidence score. General hypothesis pattern. Data query pattern: search for different experiments that produced omics data (e.g. of type RNASeq and MassSpecData). Data analysis workflows to run on genomics and proteomics data (more omics in the future). Meta-workflows to assess confidence in the hypothesis based on workflow results
  30. DISK Walkthrough: Data & Workflows To test a hypothesis that a protein is present in a patient’s sample: ๏ Retrieve mass spec and RNASeq data ๏ Use workflows • Wf1: Proteome only • Wf2: ProteoGenomic
  31. DISK Walkthrough: Meta-Workflows ๏ After running the workflows, meta-workflows analyse the results and generate a confidence value
  32. DISK Walkthrough: Revised Hypothesis ๏ The hypothesis is revised and given a confidence value: • A mutation of the protein PRKCDBP has been expressed in the patient’s sample TCGA-AA-3561-01A-22 with a confidence of 0.9887
  33. DISK Walkthrough: Provenance Details ๏ Hypothesis provenance stores information about the workflows run and the data used • Workflow execution provenance is published by WINGS in the PROV standard
  34. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  35. DISK: Automated DIscovery of Scientific Knowledge Workflow Constraints Workflow Reasoning Open Publication of Results as Linked Data Workflow Provenance WINGS Intelligent Workflow System Lines of Inquiry Interactive Discovery Agent Hypothesis Evaluation Hypotheses Revised hypotheses & interesting findings Analytic Workflows Data Retrieval Workflow Binding Meta-Workflows Confidence Estimation Benchmarking Formulate Lines of Inquiry Meta-Analysis of Results Data Repository
  36. Our Initial Focus: Reproduce Seminal Omics Analysis [Zhang et al 2014]
  37. Impact on Cancer Multi-Omics ๏ Replicated the [Zhang et al 2014] proteogenomic analysis of colorectal cancer ๏ Successfully reproduced paper findings, comparing results at multiple levels (final figure, supplementary tables, etc.) ๏ Took months and direct conversations with authors to replicate paper figures and supplemental figures ๏ Application of analysis approach to new cancer type now takes minutes • Useful when TCGA is integrated ๏ Expanded analysis to • compare how sensitive findings were to workflow details [Figure: two density plots of Spearman correlation. Panels: "Correlation between mRNA-protein abundance (within samples)" and "Correlation between mRNA-protein variation (across samples)"]
  38. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  39. Related Work 1) Discovery Systems ๏ [Lenat 1976] ๏ [Lindsay et al 1980] ๏ [Langley 1981] ๏ [Falkenhainer 1985] ๏ [Kulkarni and Simon 1988] ๏ [Cheeseman et al 1989] ๏ [Zytkow et al 1990] ๏ [Simon 1996] ๏ [Valdes-Perez 1997] ๏ [Todorovski et al 2000] ๏ [Schmidt and Lipson 2009]
  40. Related Work: 2) Hypothesis Representation as Graphs ๏ Existing vocabularies are related but need to be extended to represent hypotheses in DISK • SWAN [Gao et al 2006] • EXPO [Soldatova and King 2006] • Nanopublications [Groth et al 2010] • Ovopublications [Callahan and Dumontier 2013] • Micropublications [Clark et al 2014] • LSC • BEL
  41. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  42. Contributions ๏ Represent scientist hypotheses • Hypothesis ontology includes revisions & provenance ๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued with a data analysis workflow • Lines of inquiry outline what type of data and workflows to use, and customize them to the hypotheses at hand ๏ Design a meta-analysis to assess the results of lines of inquiry and revise the original hypotheses • Meta-analysis workflows assess diverse evidence
  43. Ongoing & Future Work ๏ Ongoing work: • Interactive Discovery Agent that explains interesting findings • Continuous analysis of data (TCGA/CPTAC) as it grows • Extending and generalizing meta-workflows • Using DISK in geosciences: Subsurface water resource modeling ๏ Future challenges: • More complex hypotheses about several entities • Incorporate evidence over time • Designing domain-independent meta-workflows • Resource-bound hypothesis exploration
  44. Thank you