In modern systems biology we have three main data domains.
1) Experimental data from genomics-type experiments, such as the microarrays in the example (bottom right). Note that this type of data requires intensive preprocessing (quality control, filtering, clustering, annotation), but that is not enough to really understand the data. You see patterns in the data, but you do not really know what they mean. Large-scale genomics data has been available for the past 15 years or so, and although the technologies used are now being replaced, that doesn't really change this field.
2) Existing knowledge (see next slide), which can be used to better understand the two other types of data.
3) Genetics (sequence-based) data, which is rapidly becoming more important as sequencing costs decrease. The addition of the leftmost corner to the triangle is relatively new, and I will only discuss it in the last few slides.
Huge amounts of existing knowledge can be found hidden in the literature or in the heads of people. The hard task is to collect it from there and to make it available for analysis. (The people on the slide are Ben van Ommen, NuGO director; Hannelore Daniel, nutrigenomics chair from Munich; and a Thai princess and institute director.)
Note that a lot of information is also available in curated databases, but that was left out of the talk for brevity. You could say that the other knowledge needs to be structured to build these databases, which can then be used for analysis.
A historical example of a microarray result. Again, note the intensive preprocessing done (clustering to the left, annotation to the right). Nevertheless the data is very hard to understand, especially if you take into account that there are about 20,000 genes on a typical array: about as many as there are words in a dictionary.
But if you are willing to make the effort, you can actually see meaningful groups of genes within specific coexpression clusters, like the fatty acid degradation genes shown here. But it is hard to find (or easy to miss) all relevant pathways.
Probably not an iPad; those microarrays were at least 10 years old.
The problem is not only the long list of resulting genes, but also the oversampling that occurs. In genomics experiments you typically get large numbers of false positives at useful levels of significance. Of course false discovery rate corrections exist, but they will usually also lose information. Pathway or function group (ontology) analysis helps, since it is not likely that a larger set of genes occurs as false positives within a smaller functional group.
On the other hand, the meaning of pathway statistics should not be overestimated. There are many aspects in real biology and in the way the groups are built that influence the statistical outcome.
For instance, suppose you have two metabolic reactions where one is catalyzed by a single enzyme and the other by four. Are all enzymes of the same importance? Or are the four together as important as the single one? Or are three of the four not important in reality and the other one is? All these situations can occur and the statistics just don't know.
Also, suppose you add 10 non-regulated genes to a pathway. That will change the significance of your result, but it doesn't change the biology behind it.
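The last point can be made concrete with a small over-representation calculation. This is a minimal sketch, assuming a one-sided hypergeometric test (one common choice for pathway statistics, not necessarily the one used in the work described here). The function name and all the numbers except the roughly 20,000 genes on the array are made up for illustration.

```python
from math import comb


def enrichment_p_value(n_measured, n_regulated, pathway_size, pathway_regulated):
    """One-sided hypergeometric p-value: the chance of seeing at least
    `pathway_regulated` regulated genes in a pathway of `pathway_size` genes,
    if `n_regulated` of the `n_measured` genes were regulated at random."""
    total = comb(n_measured, n_regulated)
    upper = min(pathway_size, n_regulated)
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_regulated - k)
        for k in range(pathway_regulated, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 20,000 genes measured, 1,000 called regulated.
N_MEASURED = 20_000
N_REGULATED = 1_000

# Hypothetical pathway: 30 genes, 8 of them regulated.
p_original = enrichment_p_value(N_MEASURED, N_REGULATED, 30, 8)

# Same pathway after a curator adds 10 non-regulated genes: 40 genes, still 8 regulated.
p_padded = enrichment_p_value(N_MEASURED, N_REGULATED, 40, 8)

print(f"30-gene pathway: p = {p_original:.2e}")
print(f"40-gene pathway: p = {p_padded:.2e}")
```

The same eight regulated genes give a weaker p-value in the padded pathway, even though nothing about the biology has changed.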
Example of a pathway that can be used for the purposes described.
A closer look at the same pathway.
Note that this uses MIM notation from the MIM PathVisio plugin.
In general, the connections between different genes and metabolites describe the network underlying the pathway. Note that this is already quite complex, since there are different ways to show what interacts with what.
Graphical methods to capture this, like MIM and SBGN, definitely help. The result can be captured in descriptive relationships in BioPAX.
PathVisio can do a combined visualization of different omics results. Here proteomics and transcriptomics are both shown on the same gene product boxes. It can also show effects from metabolomics.
Examples of pathways as we have them on wikipathways.org.
This talk is not really about WikiPathways. Check out the information in the paper or the information on the wiki itself (www.wikipathways.org). Developer information is mainly on the www.pathvisio.org website.
You obtain microarray data (e.g. Affymetrix).
You can visualize microarray data.
Each color corresponds to a measured data point.
For example, green is up, red is down, grey is constant.
And now? How do you make sure the Affymetrix probeset IDs related to the measurements can be mapped to the gene products in the pathway?
On WikiPathways (or in PathVisio) you can attach identifiers to each gene. A click opens up the corresponding page on (in this specific case) the worm database.
You can download the corresponding transcript sequence in two clicks.
This makes it, for instance, really easy to design primers.
As soon as you have entered one (and only one) identifier to describe what gene product or metabolite you really mean, this information is linked to many other identifiers from other databases. Links to the respective pages are shown in the so-called “backpage” (actually one of the pages under the tabs at the right-hand side of the pathway).
BridgeDB (see www.bridgedb.org and the paper mentioned on the slide) provides the mechanism needed for that identifier mapping.
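The idea behind identifier mapping can be sketched in a few lines. This is only an illustration of the principle and does not use the BridgeDB API. The mapping table, the identifiers and the function name are all hypothetical stand-ins, and averaging several probesets per gene is an assumption made here for simplicity.

```python
from collections import defaultdict
from statistics import mean


def map_measurements_to_genes(measurements, probe_to_genes, pathway_genes):
    """Translate probeset-level values to the gene identifiers used in a pathway.

    Probesets without a mapping, or mapping to genes outside the pathway, are
    skipped. Several probesets for one gene are averaged (an assumption)."""
    per_gene = defaultdict(list)
    for probe_id, value in measurements.items():
        for gene_id in probe_to_genes.get(probe_id, ()):
            if gene_id in pathway_genes:
                per_gene[gene_id].append(value)
    return {gene_id: mean(values) for gene_id, values in per_gene.items()}


# Hypothetical log ratios per probeset (placeholder identifiers, not real ones).
measurements = {"probe_1": 1.5, "probe_2": 0.5, "probe_3": -2.0, "probe_4": 0.25}

# Hypothetical cross-reference table, the kind of information a mapping service supplies.
probe_to_genes = {
    "probe_1": ["gene_A"],
    "probe_2": ["gene_A"],
    "probe_3": ["gene_B"],
    "probe_4": ["gene_C"],
}

# Identifiers attached to the gene product boxes of one pathway.
pathway_genes = {"gene_A", "gene_B"}

gene_values = map_measurements_to_genes(measurements, probe_to_genes, pathway_genes)
print(gene_values)
```

Because each pathway element carries only one identifier, the cross-reference table is what makes data annotated in any other identifier system usable on the same pathway.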
Pathways can be downloaded to be used in different tools.
There is also a WikiPathways web service. See: http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice
Thomas Kelder, Alexander R Pico, Kristina Hanspers, Chris Evelo & Bruce R Conklin. Mining biological pathways using WikiPathways web services. PLoS One (2009) 4(7): e6447. http://dx.doi.org/10.1371/journal.pone.0006447
We also have semantic output in RDF, which can be queried through a SPARQL endpoint described at semantics.bigcat.unimaas.nl.
And a solution that isn’t really a solution. There are just too many things you could add.
The PathVisio Regulatory Interaction plugin (author Stefan van Helden) takes a new approach, where information is not really added to a pathway but shown in a separate page upon request.
The plugin can be found here: http://chianti.ucsd.edu/cyto_web/plugins/displayplugininfo.php?name=GPML-Plugin
It can be used in Cytoscape to read and write the GPML pathway files used by WikiPathways and PathVisio.
Example showing some more advanced usage of the GPML plugin.
Data from the NuGO proof of principle study with dietary challenged mice. Three tissues were sampled. In two of them relatively many genes showed expression changes on Affymetrix arrays, but not many pathways were found.
For liver the number of genes affected was lower, but the number of pathways found to be affected was higher (how come?).
The pathway-based network analysis showed that there was a set of more strongly affected pathways (more regulated genes, large blue circles) that share regulated genes (the red diamonds). When looking at the highlighted group of pathways, it became clear that these all belong to the same superset of biologically relevant pathways (fatty acid metabolism and inflammation).
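The pathway-gene network described above can be sketched as follows. This is a minimal illustration with plain Python sets, not the Cytoscape workflow itself. The pathway contents, gene names and function name are hypothetical.

```python
from itertools import combinations


def shared_gene_network(pathways, regulated_genes):
    """Return the regulated genes per pathway, plus pathway-pathway links
    labelled with the regulated genes that both pathways contain."""
    hits = {name: genes & regulated_genes for name, genes in pathways.items()}
    links = {}
    for first, second in combinations(sorted(hits), 2):
        shared = hits[first] & hits[second]
        if shared:
            links[(first, second)] = shared
    return hits, links


# Hypothetical toy pathways (the circles in the figure).
pathways = {
    "Fatty acid beta-oxidation": {"g1", "g2", "g3"},
    "Inflammatory response": {"g3", "g4"},
    "Glycolysis": {"g5", "g6"},
}

# Hypothetical set of regulated genes (the diamonds in the figure).
regulated_genes = {"g1", "g3", "g4", "g5"}

hits, links = shared_gene_network(pathways, regulated_genes)

for name, genes in hits.items():
    print(name, "regulated genes:", len(genes))
for pair, shared in links.items():
    print(pair, "share", sorted(shared))
```

In this representation the node size corresponds to the number of regulated genes per pathway, and groups of pathways that are linked through shared regulated genes are the ones worth inspecting together.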
A paper that we published with a more extensive pathway relationship approach. It takes into account relations between pathways through affected genes that do not necessarily show up in either pathway.
The approach takes into account all data used (pathways, interactions and experimentally determined weight). Check out the original paper for details.
Example result. Pathways with stronger interaction based on genes not present in them.
And you can do the same for relatively large sets of pathways “driving” a process like apoptosis.
CyTargetLinker is a Cytoscape plugin that can be used to extend a network with information about things that target entities in that network. This information comes from databases that are themselves created as networks. It already provides a number of target relation databases, as mentioned on the slide.
Example of a target network. (You will normally see this; it contains the information that is used to extend your source network.)
You can drive it from a gene set that isn’t even a network at the start. But when miRNAs are found to target more than one gene in the group, the network is created on the fly.
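The gene-set case can be sketched like this. It illustrates only the selection rule mentioned above (keep regulators that target more than one gene in the group) and is not CyTargetLinker's actual implementation. The regulator names, gene names, function name and the `min_targets` parameter are hypothetical.

```python
def extend_gene_set(gene_set, regulator_targets, min_targets=2):
    """Build regulator-gene edges for every regulator that targets at least
    `min_targets` genes of the input set. The edges form the new network."""
    edges = []
    for regulator, targets in sorted(regulator_targets.items()):
        hits = sorted(set(targets) & gene_set)
        if len(hits) >= min_targets:
            edges.extend((regulator, gene) for gene in hits)
    return edges


# Hypothetical input gene set: no interactions between the genes are known yet.
gene_set = {"gene_A", "gene_B", "gene_C"}

# Hypothetical target relation database: regulator -> genes it targets.
regulator_targets = {
    "miR_x": {"gene_A", "gene_B", "gene_Z"},
    "miR_y": {"gene_C"},
    "miR_z": {"gene_B", "gene_C"},
}

edges = extend_gene_set(gene_set, regulator_targets)
for regulator, gene in edges:
    print(regulator, "->", gene)
```

Here `miR_y` is dropped because it targets only one gene of the set, while the other two regulators connect the genes into a network.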
Or you can bootstrap the approach from an existing network, which can be a pathway-based one imported with the GPML plugin, as shown here.
An overview of the Open Phacts project, which pulls lots of information into a semantic web triple store (including information from WikiPathways RDF) and then provides that for use in other tools. In WikiPathways we use that to suggest possible pathway extensions to curators.
This shows the PathVisio Loom plugin in action. A gene or metabolite in a pathway under development (left side) is right-clicked and the Loom is activated to pull related genes or metabolites from another resource (database, text mining result or Open Phacts API). The suggested interactions are shown in the window on the right and the entities are added to the pathway (two already shown on the left).
The talk so far focused on the genomics-knowledge relationship shown on the right. So what about genetics?
This is the image that was shown to us by Jim Kaput (at that time NTCR, now Nestle). “Look, people grouped those SNPs in gene groups, made sense of the directions and showed them in a pathway. Can you do something like that?”
There are just too many SNPs for any given gene.
So it would really look like a bunch of jellies if we showed all of these on the genes in a pathway, and you would not know what they mean.
There are loads of bioinformatics tools out there (like Sift and Polyphen) that allow us to estimate the functional effects of SNPs on the coded protein (activity or protein-protein interactions), on binding sites for transcription factors in the DNA, or on miRNA binding in RNA. Doing that, we can decide which edges SNPs would affect (and how much, in what direction). As soon as you do that, you can use the result to strengthen SNP statistics (i.e. create groups that can be used for supervised types of group-based GWAS analysis) or to build predictive models to estimate what specific (personal or tissue/tumor-based) sets of variations would do. That creates a need to use the pathways to link experimental (genomics) data not only to the genetic variations occurring in there, but also to modeling results.
Showing the concept: integrating flux predictions from modelling (of course that could also be real fluxomics data).
And showing “real” results from the new flux data representation plugin.
The plugin is functional, but we still need better mapping databases for reaction identifiers.
Many people are involved in this work. (Really many if you count associated groups like the plugin developers, pathway curators etc.)
Most important:
SF group (Kristina Hanspers, Bruce Conklin and Alex Pico), collaborating on many things but primarily WikiPathways.
Martijn van Iersel, top left (PathVisio, BridgeDB).
Thomas Kelder, top middle (WikiPathways including web services, pathway integration networks for nutrigenomics).
Martina Kutmon, top right (CyTargetLinker, PathVisio further development).
Andra Waagmeester, second row, right (WikiPathways RDF).
Anwesha Dutta, bottom, 2nd from the left (flux visualization).
Stefan van Helden, not in the picture (RI PathVisio plugin).