provenance of microarray experiments

  • A revolution in biology and biomedicine has taken place in the past 10 years with the invention of microarray technologies. A microarray is a small chip (point to figure on the left) where fragments of every known gene have been mapped to precise positions on the chip. Microarrays make it possible to measure how much of each of those genes is being expressed in cells. This technique is very often used, for example, to determine which genes may be the cause of specific diseases (point to figure on the right).
    A simplistic view of performing a microarray experiment involves collecting sample cells from both the experimental case (for example, a disease sample) and a control. Inside the cell, DNA is transcribed into mRNA, the active molecule of the cell that will be translated into whatever biological product is causing or preventing the disease. mRNA molecules are complementary to the DNA that is bound to the array. To measure gene expression, the mRNA molecules are first labelled with fluorescent dyes and deposited either on a single array (represented in the figure) or on separate arrays for control and test samples. The mRNA will naturally bind to the DNA on the arrays and, because the dyes are fluorescent, each spot will emit light in proportion to the amount of mRNA present in the cell. As a result, it is possible to compare how much each gene is expressed in the disease sample relative to the normal sample, simply based on the brightness of each spot on the array.
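The spot-brightness comparison described above boils down to a ratio of intensities. Below is a minimal Python sketch of that calculation; the intensity values and the two-channel setup are made up for illustration and are not taken from any real experiment:

```python
import math

def log2_fold_change(test_intensity: float, control_intensity: float) -> float:
    """Log2 ratio of test vs. control spot intensity: positive values mean
    the gene is more highly expressed in the test (e.g. disease) sample."""
    return math.log2(test_intensity / control_intensity)

# Hypothetical intensities for one spot in a two-channel experiment.
disease_signal, control_signal = 1200.0, 300.0
ratio = log2_fold_change(disease_signal, control_signal)  # log2(4) = 2.0
```

A log2 ratio of 2.0 corresponds to 4-fold overexpression in the disease sample; real pipelines apply background correction and normalization before this step.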
  • Microarray technologies have been so successful in predicting disease that they have become standard practice in most biomedical studies. This has resulted in over 5,000 results being published each year. Many of these datasets are made publicly available in repositories such as EBI’s ArrayExpress or NCBI’s Gene Expression Omnibus. Beyond studying a single disease, the data deposited in these repositories is often used, for example, for translational studies where individual genes are studied across multiple experiments in an attempt to understand the underlying molecular biology.
    Although some standards, such as MIAME, exist for reporting the results of microarray experiments, there is still a lack of interoperable data access.
  • This work has resulted from ongoing efforts at the W3C HCLS BioRDF task force to understand neurodegenerative diseases. For our pilot study, we have taken a bottom-up approach to understand which genes affect Alzheimer’s patients and to what extent those results agree across many experiments. For example, one characterization of AD is the formation of intracellular neurofibrillary tangles that affect neurons in brain regions involved in memory. It is therefore important to have metadata such as the cell type(s), cell histopathology, and brain region(s) for comparing and integrating the results across different AD microarray experiments. It is also important to consider the (raw) data source and the types of analysis performed on the data to arrive at meaningful interpretations. Finally, gene expression data may be combined with other types of data, including genomic functions, pathways, and associated diseases, to broaden the spectrum of integrative data analysis.
  • Separate concerns of interest
    Understanding the need for provenance vs. the steep learning curve of existing ontologies
    Local, stable terminologies vs. less stable, external vocabularies
    Identify the minimum set of terms required for users’ queries
  • In order to explicitly represent microarray experiments in Semantic Web formats, we first needed to understand the microarray workflow and how an experiment translates into results. A microarray experiment usually starts with a question. For example, are there genes that affect the incidence of Alzheimer’s disease? Why are certain areas of the brain more affected by neurofibrillary tangles, and could there be a list of genes responsible for that? According to the question, the experiment is designed to collect samples from specific areas of the brain from diseased and normal subjects. These samples are then used in the microarray experiment, which is only the first step of data processing. The microarrays are scanned, resulting in an image which has to be analysed to extract the intensity signals. Before they can be used, the raw results need to be normalized in order to reduce the noise. Finally, a variety of statistical techniques and methods are used to trim the list of tens of thousands of genes to only a few highly significant ones that may be responsible for the disease.
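The normalization and gene-trimming steps at the end of this workflow can be sketched as follows. This is a deliberately simplified stand-in (median-centering plus a fold-change cutoff) for the statistical methods a real pipeline would use, such as t-tests with multiple-testing correction; the gene names and threshold are illustrative:

```python
import statistics

def median_normalize(intensities):
    """Divide each spot intensity by the array median, a simple form of
    the normalization step that reduces array-to-array noise."""
    med = statistics.median(intensities)
    return [x / med for x in intensities]

def select_differential(genes, test, control, min_fold=2.0):
    """Keep genes whose normalized test/control ratio is at least min_fold
    (or at most 1/min_fold), i.e. candidate over/under-expressed genes."""
    selected = []
    for gene, t, c in zip(genes, median_normalize(test), median_normalize(control)):
        ratio = t / c
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            selected.append((gene, ratio))
    return selected
```

With three hypothetical genes, `select_differential(["A", "B", "C"], [10, 1, 5], [5, 1, 5])` keeps only gene A, whose normalized ratio reaches the 2-fold cutoff.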
  • The result of a microarray experiment is a list of differentially expressed genes. The standard methodology for reporting these results is a heatmap with the samples as columns and the genes as rows. These results are normally reported as a table in a PDF or, if we are lucky, the experimentalists provide the list of differentially expressed genes as a text file in the supplementary material. Very seldom is the workflow for data collection, data processing and data analysis recorded in any format other than the methodology section of an article.
  • Different studies also tend to report their results using different data structures, making it harder to compare the experimental results. For example, in the three experiments that we selected, although the software package used to analyse the data was the same and the microarray workflow was very similar, the results were reported in very different ways. This, in part, reflects the necessity to report not only the sample-specific gene expression values, indicated in dark blue, but also the annotations available for each of the genes discovered, indicated in light blue, as well as the aggregated gene expression values, indicated in green. In some cases, the average signal was used to derive the final list of genes (point to genelist 2) and in others, only the fold change and its associated p-value were used (genelist 3).
    For our bottom-up approach we have identified and created a namespace for the terminologies necessary to describe, in each of these genelists, the portions that are most informative about the procedure used to preprocess and analyse the data, without creating an extensive list of terms that would multiply the amount of data to be represented.
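To make the term-set idea concrete, the sketch below serializes one hypothetical genelist entry as N-Triples under the biordf namespace shown later in the slides (http://purl.org/net/biordfmicroarray/ns#). Only derives_from_region is a property named in this work; fold_change, p_value, the gene URI and the region URI are illustrative assumptions:

```python
BIORDF = "http://purl.org/net/biordfmicroarray/ns#"

def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple as an N-Triples line; objects that are not
    already bracketed URIs are emitted as plain literals."""
    o = obj if obj.startswith("<") else '"%s"' % obj
    return "<%s> <%s> %s ." % (subject, predicate, o)

# Hypothetical genelist entry for the gene ALG2 (a gene mentioned in the slides).
gene = "http://example.org/genelist1/ALG2"
lines = [
    triple(gene, BIORDF + "derives_from_region",
           "<http://example.org/region/entorhinal_cortex>"),
    triple(gene, BIORDF + "fold_change", "2.0"),   # illustrative property
    triple(gene, BIORDF + "p_value", "0.001"),     # illustrative property
]
```

Keeping the vocabulary this small is the point: each genelist row becomes a handful of triples rather than an exhaustive re-description of the experiment.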
  • These multiple metadata types or layers of information will allow us to answer specific biomedical questions. For example, a biologist or physician looking into comparing several experiments about the same disease might use information about the samples that were analysed to answer questions such as “What microarray experiments….”
  • Information about the statistical fold change and p-value, on the other hand, enables a very different class of questions that are much more oriented by a need to understand the reliability of the results, such as “What genes are overexpressed…”. A statistician might need to look at these values before cross-experiment comparison is possible.
  • Finally, for a researcher interested in a deeper understanding of the molecular biology behind diseases, the gene annotation information is necessary to answer questions like “What other diseases may be…”.
    It is in this type of question that the advantage of using RDF representations of genelists is most obvious.
  • To represent the genelists as RDF we chose to follow a bottom-up approach; that is, our model is derived from the data but can still link to upper ontologies. The reason for this was twofold. First, although there are several provenance ontologies, none of them was granular enough to represent our data. Second, our model arose from the experimental results but, most importantly, we wanted to be able to answer biological questions. Because the process of knowledge discovery often leads back to the raw data, it is not uncommon for data models to be incremented with new data derived from the collection of new experimental results. Furthermore, the bottom-up approach shields our model from rapidly evolving ontologies while at the same time enabling linking to other community-accepted ontologies.
  • To answer our varied classes of questions, we have identified four types of provenance for our data model. Starting from the center, the provenance levels are the institutional level, the experimental context level, the data analysis and significance level, and the dataset description level.
  • The institutional level contains metadata about the genelists, such as the laboratory where they were produced, or the link to the publication where the results were published. These help determine the trustworthiness of the data. The experimental context level includes data about the brain region where samples were collected or the histology of the cells.
  • The data analysis and significance level is concerned with information about the statistical procedures that were used to convert the raw data into a list of differentially expressed genes. This information helps determine the statistical significance of the results. Finally, the dataset description level is the description of the RDF dataset itself, such as the version, the publisher or the URL where it is made available. The dataset description level makes use of the Vocabulary of Interlinked Datasets (VoID) and Dublin Core terms.
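As a sketch of what such a dataset description could look like in Turtle, assuming a placeholder dataset URI and literal values (only the VoID and Dublin Core terms themselves are real vocabulary):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Placeholder URI and values; not the actual BioRDF dataset record.
<http://example.org/ad-genelists>
    a void:Dataset ;
    dct:title "AD microarray genelists" ;
    dct:publisher "W3C HCLS BioRDF task force" ;
    dct:hasVersion "1.0" ;
    void:dataDump <http://example.org/ad-genelists.rdf> .
```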
  • Each of the provenance levels described is useful for answering a specific class of questions, depending on the perspective of who is querying the data. Someone interested in knowing the laboratory where the experiment was performed or where the raw data is published, for example, would need institutional-level provenance information. Other types of questions, such as finding all the experiments that used samples from the entorhinal cortex of Alzheimer’s disease patients, would use experimental context level data. To answer questions such as the p-value associated with a measurement of gene fold change, data from the data analysis and significance level is necessary. Finally, the dataset description level can be used to find more RDF datasets that contain information about genes and diseases.
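The experimental-context question, for example, could be phrased as a SPARQL query along these lines; biordf:derives_from_region appears in this work's model, while the region URI is a placeholder:

```sparql
PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

# Which genes derive from entorhinal cortex samples? (region URI is a placeholder)
SELECT DISTINCT ?gene
WHERE {
  ?gene biordf:derives_from_region <http://example.org/region/entorhinal_cortex> .
}
```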
  • As part of this work, we have created federated queries that link our AD-related genes to genes in the Diseasome. The Diseasome is a community effort to link known human diseases to the genes known to be related to those diseases. By federating our list of AD-related genes with the Diseasome, we were able to find other diseases that are affected by the same genes. This type of discovery is greatly facilitated by the RDF representation of the genelists.
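A federated query of this kind can be sketched with a SPARQL SERVICE clause. The endpoint URL is a placeholder and the diseasome prefix and property reflect the published Linked Data Diseasome dataset as best understood here; treat both as assumptions rather than a tested query:

```sparql
PREFIX biordf:    <http://purl.org/net/biordfmicroarray/ns#>
PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>

# For each AD-related gene, ask the (placeholder) Diseasome endpoint
# which other diseases are associated with the same gene.
SELECT ?gene ?otherDisease
WHERE {
  ?gene biordf:derives_from_region ?region .
  SERVICE <http://example.org/diseasome/sparql> {
    ?otherDisease diseasome:associatedGene ?gene .
  }
}
```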

    1. Provenance of Microarray Experiments for a Better Understanding of Experiment Results. Helena F. Deus (University of Texas), Jun Zhao (University of Oxford), Satya Sahoo (Wright State University), Matthias Samwald (DERI, Galway), Eric Prud’hommeaux (W3C), Michael Miller (Tantric Designs), M. Scott Marshall (Leiden University Medical Center), Kei-Hoi Cheung (Yale University)
    2. Outline  Background: microarrays, gene expression and why provenance is important for experimental biomedical data  Objectives  Data: microarray workflow and gene results  The provenance model  Demo  Future work  Summary
    3. Introduction  High throughput experiments, such as microarray technologies, have revolutionized the way we study disease and basic biology.  Microarray experiments allow scientists to quantify thousands of genomic features in a single experiment. Source: http://www.scq.ubc.ca/ Affymetrix microarray gene chips. Genes can be used as biomarkers for disease.
    4. Introduction  Since 1997, the number of published results based on an analysis of gene expression microarray data has grown from 30 to over 5,000 publications per year  Existing microarray data repositories and standards, but lack of provenance and interoperable data access. Source: YJBM (2007) 80(4):165-78
    5. Introduction (cont.)  A pilot study of the W3C HCLS BioRDF task force  Bottom-up approach  Use microarray experiments for Alzheimer’s disease as the test-bed  Aggregate results across microarray experiments  Combine different types of data
    6. Objectives  To facilitate a better understanding of microarray gene results  Efficiently query gene results  Efficiently combine existing life science datasets  To transform microarray gene results into Semantic Web format  To encode provenance information about these gene results in the same format as the data itself
    7. Microarray Workflow: biological question → experiment design → sample gathering → microarray experiment → data extraction (image analysis, normalization) → data analysis and modeling (estimation, clustering, discrimination, t-test, …) → differentially expressed genes
    8. An Example of differentially expressed genes
    9. An Example of gene list from different studies
    10. What microarray experiments analyze samples taken from the entorhinal cortex region of Alzheimer's patients?
    11. What genes are overexpressed in the entorhinal cortex region and what is their expression fold change and associated p-value?
    12. What other diseases may be associated with the same genes found to be linked to AD?
    13. A Bottom-up Approach  Separate concerns/perspectives  Too many existing vocabularies to choose from  Lack of standardization among existing provenance vocabularies  Lack of a clear understanding of what needs to be captured  Process  Identify user query  Define terms  Test the query using test data
    14. A Bottom-up Approach Raw Data Results
    15. A Bottom-up Approach Raw Data Results Questions Which genes are markers for neurodegenerative diseases? Was gene ALG2 differentially expressed in multiple experiments? What software was used to analyse the data? How can the experiment be replicated?
    16. A Bottom-up Approach Raw Data Results Questions Which genes are markers for neurodegenerative diseases? Was gene ALG2 differentially expressed in multiple experiments? Provenance of Microarray experiment What software was used to analyse the data? How can the experiment be replicated?
    17. A Bottom-up Approach Provenance models Workflow, experimental design Domain ontologies (DO, GO…) Community models Raw Data Results Questions Which genes are markers for neurodegenerative diseases? Was gene ALG2 differentially expressed in multiple experiments? Provenance of Microarray experiment What software was used to analyse the data? How can the experiment be replicated?
    18. The Provenance Data Model: Four Types of Provenance http://purl.org/net/biordfmicroarray/ns#
    19. RDF genelist representation  Institutional level: metadata associated with each genelist such as the laboratory where the experiments were performed or the reference to the genelist.  Experimental context level: experimental protocols such as the region of the brain and the disease (terms were partially mapped to MGED, DO and NIF).
    20. RDF genelist representation  Data analysis and significance: statistical analysis methodology for selecting the relevant genes  Dataset descriptions: version of a source dataset, who published the dataset. The Vocabulary of Interlinked Datasets (VoID) and Dublin Core terms (dct) were used.
    21. Provenance types are perspectives on the data
    22. Provenance types are perspectives on the data
    23. Provenance types are perspectives on the data
    24. Provenance types are perspectives on the data
    25. Query federation with Diseasome. Is there a gene network for AD? Source: PNAS 104:21, 8685 (2007)
    26. Demo  Go to http://purl.org/net/biordfmicroarray/demo
    27. Conclusions  Levels of provenance: 1) institutional; 2) experimental context; 3) statistical analysis and significance; 4) dataset description  Provenance as RDF: SPARQL queries to express constraints both about the origins and context of the data  Data model is driven by the biological question: a bottom-up approach shields the model from rapidly evolving ontologies while enabling linking to widely used ontologies  Mapping is facilitated: mapping to existing provenance vocabularies, like OPM, PML, Provenir, is facilitated by:  biordf:has_input_value, which can be made a sub-property of the inverse of OPM property used  biordf:derives_from_region, which can become a sub-property of OPM property wasDerivedFrom.
    28. Summary and Future Work  Provenance modeling in a semantic web application  Query genes gathered from specific samples, in a given condition or from given organizations  Query genes produced through a particular statistical analysis process  Query for information about genes from the most recent dataset  The bottom-up approach  Separate concerns of interest  Create a minimum set of terms required for motivating queries  Future work  To integrate our model with provenance information generated in a scientific workflow workbench  To integrate provenance information as part of the Excel spreadsheet where most biologists report their results
    29. Acknowledgement  W3C BioRDF group  Kei Cheung, Michael Miller, M. Scott Marshall, Eric Prud’hommeaux, Satya Sahoo, Matthias Samwald  The HCLS IG as well as Helen Parkinson, James Malone, Misha Kapushesky and Jonas Almeida.
