Automated Hypothesis Testing with Large Scale Scientific Workflows

(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
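
The cycle described in this abstract (a hypothesis triggers a line of inquiry, which retrieves data, runs workflows, and then runs a meta-workflow to produce a confidence level) can be sketched in a few lines of Python. This is only an illustrative sketch, not the DISK implementation: the class names, the `run_discovery_cycle` function, the toy repository, the toy scores, and the averaging meta-workflow are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Hypothesis:
    subject: str
    predicate: str
    obj: str
    confidence: Optional[float] = None  # None until a meta-workflow assigns a value


@dataclass
class LineOfInquiry:
    predicate: str                                  # type of hypothesis this line handles
    query_data: Callable[[Hypothesis], List[str]]   # retrieves relevant datasets
    workflows: List[Callable[[str], float]]         # each returns a score in [0, 1]
    meta_workflow: Callable[[List[float]], float]   # aggregates workflow results


def run_discovery_cycle(hypothesis: Hypothesis,
                        lines: List[LineOfInquiry]) -> List[Hypothesis]:
    """Trigger every matching line of inquiry and return revised hypotheses."""
    revised = []
    for line in lines:
        if line.predicate != hypothesis.predicate:
            continue  # this line of inquiry does not apply to this hypothesis type
        datasets = line.query_data(hypothesis)
        results = [wf(d) for wf in line.workflows for d in datasets]
        if not results:
            continue  # no data found, so nothing to evaluate
        confidence = line.meta_workflow(results)
        revised.append(Hypothesis(hypothesis.subject, hypothesis.predicate,
                                  hypothesis.obj, confidence))
    return revised


# Hypothetical stand-ins for a data repository, a workflow, and a meta-workflow.
REPOSITORY = {"sample-1": ["massspec.zip", "rnaseq.zip"]}


def query_by_sample(h: Hypothesis) -> List[str]:
    return REPOSITORY.get(h.obj, [])


def toy_workflow(dataset: str) -> float:
    # Made-up scores, standing in for a real analysis result.
    return 0.5 if dataset.startswith("massspec") else 1.0


def mean(scores: List[float]) -> float:
    return sum(scores) / len(scores)


line = LineOfInquiry("expressedIn", query_by_sample, [toy_workflow], mean)
result = run_discovery_cycle(Hypothesis("PRKCDBP", "expressedIn", "sample-1"), [line])
print(result[0].confidence)
```

A real meta-workflow would use a trained probabilistic model rather than a plain average; the average is used here only to keep the example short.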

Published in: Education

Automated Hypothesis Testing with Large Scale Scientific Workflows

  1. Automated Hypothesis Testing with Large Scale Scientific Workflows. Yolanda Gil, Daniel Garijo, Rajiv Mayani, Varun Ratnakar (Information Sciences Institute & Department of Computer Science, University of Southern California, http://www.isi.edu); Parag Mallick, Ravali Adusumilli, Hunter Boyce (Stanford School of Medicine, Canary Center for Early Cancer Detection, Stanford University, http://mallicklab.stanford.edu). http://www.disk-project.org
  2. Talk Outline ๏ Motivation ๏ Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  3. Scientific Data Analysis Today: Inefficient, Incomplete, Irreproducible ๏ Data analysis is time consuming ๏ Not systematic ๏ Not updated when new data/methods become available ๏ Hard/impractical to reproduce prior work ๏ Overall process is done manually: inefficient and error-prone ๏ Analytic knowledge is compartmentalised. Diagram labels: New hypothesis; Formulate line of inquiry (data + method); Retrieve data; Run workflows (methods); Meta-analysis of results
  4. Our Focus: Cancer Multi-Omics ๏ Data Availability and Complexity: • The multi-omic domain is filled with multiple levels of heterogeneous data that are regularly expanding in volume and complexity through projects like The Cancer Genome Atlas (TCGA) and the associated Clinical Proteomic Tumor Analysis Consortium (CPTAC)
  5. Our Focus: Cancer Multi-Omics ๏ Analytic Complexity: • Multi-omic analysis requires the use of dozens of interconnected tools, each of which may require substantial domain knowledge. Tools shown: MAQ, BWA, BWA-SW (SE only), PERM, SOAPv2, MOSAIK, NOVOALIGN, SAMTOOLS, PICARD, GATK, PICARD, SAMTOOLS, IGVtools. Domain knowledge is isolated
  6. Our Focus: Cancer Multi-Omics ๏ Multiple types and complexities of hypotheses: • Hypotheses span the range from single-gene/single-dataset to multi-gene/multi-ome/multi-dataset • Is this protein found in this sample? • Is this gene found in this sample? • Is this protein associated with a certain cancer? • Which proteins are associated with a certain cancer? • .. • ..
  7. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  8. Our Approach: Hypotheses-Driven Discovery ๏ Represent scientist hypotheses ๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued by data analysis workflows ๏ Design a meta-analysis that examines the results of lines of inquiry and either validates or revises the original hypotheses ๏ Develop an intelligent agent that can report and explain new findings to the scientist. Diagram labels: Hypothesis; Lines of Inquiry: specify relevant analytic methods (workflows), type of data needed, and how to combine results; Query to retrieve Data; Data Analysis Workflows; Workflow Bindings; Meta-Workflows; Confidence Estimation; Benchmarking; Revised hypothesis & interesting findings
  9. Representing Hypotheses. Diagram labels: Hypothesis; Lines of Inquiry: specify relevant analytic methods (workflows), type of data needed, and how to combine results; Query to retrieve Data; Data Analysis Workflows; Workflow Bindings; Meta-Workflows; Confidence Estimation; Benchmarking; Revised hypothesis & interesting findings
  10. Representing Hypotheses: Requirements from Omics ๏ Graph-based hypothesis representation • Entities are nodes • Relationships are links ๏ Annotations on graphs • Represent qualifications of hypotheses: confidence and evidence ๏ Representing hypothesis evolution • Graph versioning. Graph representation in RDF ๏ Standard semantic web language ๏ Scalable reasoners available ๏ Qualifications and provenance through triple reification ๏ Versioning through multiple named graphs
  11. Representing Hypotheses. Diagram: Graph Hy1: bio:PRKCDBP hyp:expressedIn user:TCGA-AA-3561-01A-22; Graph Hy2: bio:PRKCDBP hyp:associatedWith bio:ColonCancer. Vocabulary sources: Biology ontology; Hypothesis ontology; User data definitions
  12. Lifecycle of a hypothesis. Diagram: Graph Hy1: bio:PRKCDBP hyp:expressedIn user:TCGA-AA-3561-01A-22; Graph Hy2: bio:PRKCDBP hyp:associatedWith bio:ColonCancer. Vocabulary sources: Biology ontology; Hypothesis ontology; User data definitions
  13. 1. Initial Hypothesis, Data & Workflows. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
  14. 2. Running workflows on Data. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Workflow Execution W1; relations: hasWorkflowTemplate, used. Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
  15. 3. Meta reasoning about workflow results. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Executions: Workflow Execution W1, Meta-Workflow Execution MW1; relations: hasWorkflowTemplate, used, produced. Hypothesis Statement Hy1: PRKCDBP expressedIn TCGA-AA-3561-01A-22. Revised Hypothesis Statement Hy1' (revisionOf): PRKCDBP expressedIn TCGA-AA-3561-01A-22. Qualifications of Hy1': Statement Hy1'-S1, hasConfidenceValue 0. Provenance of Hy1' (linked by hasProvenance)
  16. 4. New Data becomes available. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
  17. 5. New Multi-Workflows are also run. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Executions: Workflow Execution W1, Workflow Execution W2; relation: used. Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22
  18. 6. Hypothesis Revision. Workflows Available: Proteomics, Proteogenomics. Data Available: XX_3561Proteome_VU.zip (MassSpecData), XX_3561_DD.zip (RNASeqData), AA_3561_EX1 (Experiment), AA_3561_EX2 (Experiment), TCGA-AA-3561-01A-22 (Sample), TCGA-AA-3561 (Patient); relations: producedData, experimentedOn, collectedFrom. Executions: Workflow Execution W1, Workflow Execution W2, Meta-Workflow Execution MW2; relations: used, produced. Hypothesis Statement Ha1: PRKCDBP expressedIn TCGA-AA-3561-01A-22. Revised Hypothesis Statement Ha1' (revisionOf): PRKCDBP Mutated expressedIn TCGA-AA-3561-01A-22. Qualifications of Ha1': Statement Ha1'-S1, hasConfidenceValue 0.98. Provenance of Ha1' (linked by hasProvenance)
  19. Representing Lines of Inquiry & Data analysis workflows. Diagram labels: Hypothesis; Lines of Inquiry: specify relevant analytic methods (workflows), type of data needed, and how to combine results; Query to retrieve Data; Data Analysis Workflows; Workflow Bindings; Meta-Workflows; Confidence Estimation; Benchmarking; Revised hypothesis & interesting findings
  20. Lines of Inquiry ๏ Capture how to set up potential analyses that can be pursued to test a certain type of hypothesis. Hypothesis Pattern: bio:Protein ?p hyp:expressedIn bio:Sample ?s. Data Query Pattern: DataFile ?d, Experiment ?e, Sample ?s, Patient ?p; relations: producedData, experimentedOn, collectedFrom. Data Analytic Workflows (input DataFile ?d): Proteomics, Proteogenomics. Meta-workflows: Comparison, Confidence estimation, Benchmarking
  21. Example Multi-omics Workflow (Zhang et al. replication)
  22. Automated Workflow Generation in WINGS by Reasoning about Semantic Constraints. Example: all input data must be from the human species, i.e. must have HS in its metadata. The workflow system uses this constraint to select only datasets that have HS in their metadata, so that the inputs are valid
  23. Representing Hypotheses. Diagram labels: Hypothesis; Lines of Inquiry: specify relevant analytic methods (workflows), type of data needed, and how to combine results; Query to retrieve Data; Data Analysis Workflows; Workflow Bindings; Meta-Workflows; Confidence Estimation; Benchmarking; Revised hypothesis & interesting findings
  24. Meta-workflows: 1) Comparison Meta-Workflows ๏ Goals: • Compare results amongst multiple workflows • Measure the global similarity amongst multiple workflows • Provide users with an explanation of workflow-dependent differences in results. Diagram labels: Variant Detection; Custom Protein DB; Protein Identification (×2); Custom DB; Reference DB; Protein IDs (×2); Comparison Meta-Workflow; Similarity Score. Data Dependent: • Peptide Level • Protein Level • Scan Level
  25. Meta-workflows: 2) Benchmark Meta-Workflows ๏ Goals: • Evaluation of workflow performance • Training of confidence estimation models (probabilistic). Diagram labels: Benchmark Meta-Workflow; Probabilistic Models; ROC, True/False Positive Rate
  26. Meta-workflows: 3) Confidence estimation Meta-Workflows ๏ Goals: • Combine results from multiple workflows • Estimate confidence of the workflow result • Use estimated confidence to update hypothesis. Diagram labels: Protein Identification (×2); Custom DB; Reference DB; Protein IDs (×2); Benchmark Meta-Workflow; Probabilistic Model; Estimate Confidence; Update Hypothesis
  27. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  28. DISK Walkthrough: Initial Hypothesis ๏ Initial hypothesis is provided by the user • PRKCDBP protein is expressed in a patient sample
  29. DISK Walkthrough: Lines of Inquiry ๏ The line of inquiry suggests finding data from different experiments done with the patient’s sample, then running multi-omic workflows, and then combining the evidence into a confidence score. Labels: General hypothesis pattern; Data query pattern: search for different experiments that produced omics data (e.g. of type RNASeq and MassSpecData); Data analysis workflows to run on genomics and proteomics data (more omics in the future); Meta-workflows to assess confidence in the hypothesis based on workflow results
  30. DISK Walkthrough: Data & Workflows. To test a hypothesis that a protein is present in a patient’s sample: ๏ Retrieve mass spec and RNASeq data ๏ Use workflows • Wf1: Proteome only • Wf2: ProteoGenomic
  31. DISK Walkthrough: Meta-Workflows ๏ After the workflows are run, meta-workflows analyse the results and generate a confidence value
  32. DISK Walkthrough: Revised Hypothesis ๏ The hypothesis is revised and given a confidence value: • A mutation of the protein PRKCDBP has been expressed in the patient’s sample TCGA-AA-3561-01A-22 with a confidence of 0.9887
  33. DISK Walkthrough: Provenance Details ๏ Hypothesis provenance stores information about the workflows run and the data used • Workflow execution provenance is published by WINGS in the PROV standard.
  34. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  35. DISK: Automated DIscovery of Scientific Knowledge. Diagram labels: Hypotheses; Formulate Lines of Inquiry; Lines of Inquiry; Data Retrieval; Data Repository; Workflow Binding; Analytic Workflows; WINGS Intelligent Workflow System; Workflow Constraints; Workflow Reasoning; Workflow Provenance; Open Publication of Results as Linked Data; Meta-Analysis of Results; Meta-Workflows; Confidence Estimation; Benchmarking; Hypothesis Evaluation; Interactive Discovery Agent; Revised hypotheses & interesting findings
  36. Our Initial Focus: Reproduce Seminal Omics Analysis [Zhang et al 2014]
  37. Impact on Cancer Multi-Omics ๏ Replicated the [Zhang et al 2014] proteogenomic analysis of colorectal cancer ๏ Successfully reproduced the paper's findings, comparing results at multiple levels (final figure, supplementary tables, etc.) ๏ Took months and direct conversations with the authors to replicate paper figures and supplemental figures ๏ Application of the analysis approach to a new cancer type now takes minutes • Useful when TCGA is integrated ๏ Expanded analysis to • compare how sensitive findings were to workflow details. [Plots: density of Spearman correlation for "Correlation between mRNA-protein abundance (within samples)" and "Correlation between mRNA-protein variation (across samples)"]
  38. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  39. Related Work 1) Discovery Systems ๏ [Lenat 1976] ๏ [Lindsay et al 1980] ๏ [Langley 1981] ๏ [Falkenhainer 1985] ๏ [Kulkarni and Simon 1988] ๏ [Cheeseman et al 1989] ๏ [Zytkow et al 1990] ๏ [Simon 1996] ๏ [Valdes-Perez 1997] ๏ [Todorovski et al 2000] ๏ [Schmidt and Lipson 2009]
  40. Related Work: 2) Hypothesis Representation as Graphs ๏ Existing vocabularies are related but need to be extended to represent hypotheses in DISK • SWAN [Gao et al 2006] • EXPO [Soldatova and King 2006] • Nanopublications [Groth et al 2010] • Ovopublications [Callahan and Dumontier 2013] • Micropublications [Clark et al 2014] • LSC • BEL
  41. Talk Outline ๏ Motivation ๏ Our Approach & Research Challenges 1. Representing Hypotheses 2. Representing Lines of Inquiry 3. Meta-analysis to review workflow results ๏ DISK Scenario walkthrough ๏ Results in cancer multi-omics ๏ Related work ๏ Contributions and Future Work
  42. Contributions ๏ Represent scientist hypotheses • Hypothesis ontology includes revisions & provenance ๏ Formulate lines of inquiry that express how a type of hypothesis can be pursued with a data analysis workflow • Lines of inquiry outline what type of data and workflows to use, and customize them to the hypotheses at hand ๏ Design a meta-analysis to assess the results of lines of inquiry and revise the original hypotheses • Meta-analysis workflows assess diverse evidence
  43. Ongoing & Future Work ๏ Ongoing work: • Interactive Discovery Agent that explains interesting findings • Continuous analysis of data (TCGA/CPTAC) as it grows • Extending and generalizing meta-workflows • Using DISK in geosciences: Subsurface water resource modeling ๏ Future challenges: • More complex hypotheses about several entities • Incorporate evidence over time • Designing domain-independent meta-workflows • Resource-bound hypothesis exploration
  44. Thank you
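
The hypothesis lifecycle in the slides (a hypothesis stored as a named graph, with confidence and provenance attached as qualifications, and revisions linked back to the statement they revise) can be illustrated with a small sketch. It uses plain Python sets rather than an RDF library, so it only mimics named graphs and reification. The `HypothesisStore` class and its methods are hypothetical; the property names `revisionOf`, `hasConfidenceValue` and `hasProvenance` are taken from the slide diagrams.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


class HypothesisStore:
    """Toy stand-in for a set of named graphs plus statements about them."""

    def __init__(self) -> None:
        self.graphs: Dict[str, Set[Triple]] = {}   # graph name -> hypothesis triples
        self.annotations: Set[Triple] = set()      # qualifications and revision links

    def add_hypothesis(self, name: str, triples: Iterable[Triple]) -> None:
        self.graphs[name] = set(triples)

    def revise(self, old: str, new: str, triples: Iterable[Triple],
               confidence: float, provenance: str) -> None:
        """Store a revised hypothesis as a new graph and qualify it."""
        self.add_hypothesis(new, triples)
        self.annotations.add((new, "revisionOf", old))
        self.annotations.add((new, "hasConfidenceValue", str(confidence)))
        self.annotations.add((new, "hasProvenance", provenance))

    def history(self, name: str) -> List[str]:
        """Follow revisionOf links from a hypothesis back to its origin."""
        parents = {s: o for s, p, o in self.annotations if p == "revisionOf"}
        chain = [name]
        # Stop at the origin, or if a link would revisit a graph already seen.
        while chain[-1] in parents and parents[chain[-1]] not in chain:
            chain.append(parents[chain[-1]])
        return chain


store = HypothesisStore()
statement = ("bio:PRKCDBP", "hyp:expressedIn", "user:TCGA-AA-3561-01A-22")
store.add_hypothesis("Hy1", [statement])
# Step 3 of the lifecycle: meta-workflow execution MW1 yields Hy1' with confidence 0.
store.revise("Hy1", "Hy1'", [statement], 0.0, "MW1")
print(store.history("Hy1'"))
```

Keeping each revision as its own graph, rather than overwriting the original, is what lets the earlier statement and its provenance remain queryable after new data arrives.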
