With increasing data volumes -- and issues specific to biological data findings -- the pharma industry needs to find new data mining and analysis techniques for the drug discovery process, including integrative analysis.
Better Target Identification and Validation Through an Integrative Analysis of Biological Data
• Cognizant 20-20 InsightsBetter Target Identification andValidation Through an IntegrativeAnalysis of Biological Data Executive Summary poses a peculiar challenge. In these fields, rela- tionships among data sets are rarely simple and Pharmaceutical target identification and valida- often are not apparent without deeper investi- tion today is an exercise in complex data mining. gation. So geologists study aerial photography, The amount, breadth and depth of biological data satellite images, rock analysis and seismographic available for such mining is increasing exponen- data to attempt to locate oil basins; meteorolo- tially, signaling both opportunities and challenges gists examine ocean currents and surface tem- for the biopharma industry. peratures, barometric pressure, polar ice cover More data should lead to more insights and better and more to predict climate conditions. decisions. However, the sheer volume of available The drug discovery process similarly requires data is overwhelming. Further, biological data assimilation and analysis of seemingly discon- findings must be considered in the context in which nected data sources with the goal of gaining they were discovered and in light of their inter- insights for forming a hypothesis to validate actions and/or dependencies on other data sets through experimentation. and conditions. Integrating a wide variety of data sets with such understanding of their contexts is In the initial steps of drug discovery, comprised a major logistical hurdle for biologists. of target identification and validation, knowledge of disease biology is crucial for picking the right To fully capitalize on the rich biological data targets. Arguably the biggest breakthrough sources available today, scientists require a tech- to that end was the completion of the human nological platform that eases data integration and genome sequence in 2000. comparison across diverse types and sets of data regardless of their sources. The complexity of the However, the genome sequence is static; it does technology supporting this platform should be not reveal the dynamic role of the targets in a hidden while it offers an easy, streamlined means variety of cellular circumstances. Today, this of interpreting results. data is available through genomics technologies like microarrays, which can be used to measure Dynamic Context and Meaningful thousands of mRNAs or DNA or proteins at Biological Insights the same time. Now scientists have data at a In predictive fields such as oil exploration, compu- molecular level along multiple cellular/molecular tational finance, or climatology, data abundance dimensions that can help them understand the cognizant 20-20 insights | august 2011
dynamic roles of targets in normal and diseased millions of data points with varied meanings. biological processes. Each of these data types is distinct in terms of its format, analysis and interpretation. For The data volumes in public “omics”* data reposi- example, the normalization of CGH data needs tories, like the Gene Expression Omnibus (GEO), to be handled very differently from that of have grown exponentially in the decade since gene expression data. the human genome was sequenced. A snapshot of the type of data sets in GEO reveals that • Context as critical as data. The context of omics data sets can be of varying types and that the experimental data is extremely important microarray technology can be used to generate in the biological world. To generate quality functionally diverse data sets. inferences and hypotheses, scientists need a thorough understanding of how the experi- Omics technological advances ments were performed. This context must be Locked within like gene expression microar- preserved while integrating data for analysis.these huge volumes rays and aCGH1 offering deeper • Multiple public and proprietary data sources. of data about a revelations about siRNAs2 and A significant number of public repositories hold miRNAs3 have enabled us to cell’s DNA, RNA produce diverse and mammoth valuable information about the experimental findings of various facets of cell biology. In and proteins and volumes of data. In fact millions addition, every biopharma organization has the relationships of data points can be produced its own internal data sources and maintains for a whole genome single run among them are and aggregated to even larger specific research findings. The types of data and the sheer divergence of available datathe information and quantities in a very short time. sources and the variations in data formats insights required Concomitant with the explosion among these entities make it a formidable to successfully of data is the need for spe- task for scientists to maneuver through the data maze. Each of the data sources captured identify and cialized data skills. Target contains different levels of detail, so compari- identification has become validate high-value a data mining exercise. The sons can be challenging.therapeutic targets. current challenge is how one • Lack of universally accepted standards. can understand the dynamic The lack of common standards for data sets context of the data, and analyze and interpret it results in the uncontrolled proliferation of data to gain meaningful biological insights. It’s chal- as each research group describes the context lenging for the lead biologist working on a target and the results of experiments in its own way. to interpret and integrate the vast amount of Though standards such as Minimum Informa- numeric data available in his decision-making tion About Microarray Experiments (MIAME) process. This leads to underuse or improper use have been developed, multiple variants of of high throughput data sets. MIAME have evolved to describe the complex biological microarray experiments. These dis- The Challenges of Mining Today’s parities mean scientists must manually look Trove of Biological Data into multiple databases and correlate data Locked within these huge volumes of data about among them to validate the observations — a a cell’s DNA, RNA and proteins and the rela- time-consuming, tiresome task. tionships among them are the information and insights required to successfully identify and • Need for complex visualizations. Visualiza- tion plays a key role in finding the patterns in validate high-value therapeutic targets. Yet the the underlying data. The visualization tools very nature of the data and how it is uncovered for omics data have evolved from simple create further challenges to using it effectively. heatmaps to networks overlaying omics data These challenges include: and are continuing to evolve into three and • Multiple data types from multiple high- four dimensions incorporating time series throughput technologies. High-through- information. Researchers use various tools put technologies have evolved to measure like Spotfire, R packages, Matlab, Excel and different molecular entities in a cell, such as Cytoscape for visualizations. Typically, no DNA, RNA, protein, methylation status, etc. single tool provides all the types of visualiza- Each high-throughput experiment results in tions required by an omics researcher. Often, cognizant 20-20 insights 2
researchers must develop custom visual- hypothesis generation even as it makes those izations to address their specific research hypotheses more relevant. questions. We believe that an effective data integration• Fast-evolving domain technologies. The solution enables productivity gains of at least microarray technology that revolutionized 20%. Some of the key solution components that biological data generation is now common, will drive such productivity gains are: and several other high-throughput technolo- gies, like next-generation sequencing, are • Faster insight generation. evolving that further push the boundaries of The platform should inte- We believe that data generation. As they do so, the challenge grate various public data an effective data of effectively mining this data to establish sources and have the ability useful biological insights becomes increasingly to combine those with com- integration solution critical to address. mercial and private data. It enables productivity should help propagate bio- gains of at least 20%.• Lack of a common platform. In many modern pharmaceutical research organizations, logical insights using any integrated access to all the data needed to of the common identifiers and be easy to use, advance a discovery project is often impossible enabling scientists to more easily connect the to achieve because it is spread across a variety data to reach insights. of databases and visualization and analysis • More revelation of disease mechanisms. tools. Thus, the scientists maintain their own Integrating different types of data should personal copies of the data, typically in the help researchers investigate and understand form of Excel spreadsheets, an inefficient disease mechanisms more thoroughly, solution. which assists in building the target/disease association.• Annotated integration. Integration in the usual sense is assimilating the parts into the • Maximize utilization of Oncology has long whole. However, this traditional notion of inte- high throughput data gration is not possible with biological data. It is sets. By providing easy been a challenging much harder to numerically integrate experi- access to diverse high- domain because of its mental values in such a way that the resulting throughput data sets, complex biology. Our number represents a biologically meaningful including those avail- phenomenon. able within a company’s goal with Inventus is A more pragmatic approach is to integrate research groups and to help analyze the the results from each experiment after systems, an integration data and answer key platform would maximize analyzing them separately. This approach the return on invest- questions around can be extended to any molecular data type. For example, tissue microarray data can be ment on generating such annotation for analyzed independently by the pathologist, data sets. additional identifiers, and the resulting inference of over-expression • Platform-independent gene ontology, or under-expression of the protein of interest and extensible. The solu- can be easily compared or integrated with the mutation, expression tion should be domain over-expression or under-expression inference platform-independent and and pathway within obtained from an RNA microarray experiment. be extended to open plat- the solution through forms like caBIG or anyMaking Information and Insights data propagation. proprietary platform.Obvious: An Integrated SolutionResearchers and scientists clearly need an intel- The Cognizant Approach: Inventusligent, intuitive solution to data collation and cor- We are creating a prototype for a biological datarelation so they can spend the majority of their integration platform using oncology as a thera-time interpreting complex biological relationships peutic area. Oncology has long been a challeng-and their impact on target identification and ing domain because of its complex biology. Ourvalidation. A single technology platform enabling goal with Inventus is to help analyze the dataa workflow that lets scientists look at different and answer key questions around annotation forbiological data types in context and to quickly additional identifiers, gene ontology, mutation,analyze their relationships would streamline expression and pathway within the solution cognizant 20-20 insights 3
Inventus Analysis Workflow Validate by Wetlab Experiments Design & Conduct Study Identify Gene Pathway of Interest Commons NCBI Study Mutation & Expression COSMIC BioGPS Figure 1 through data propagation. The typical analysis To demonstrate the utility of Inventus, we queried workflow is depicted in Figure 1. for gene TP53, a well-studied tumor suppressor gene. The default EntrezGene plug-in displays As the next step, we investigated existing open the basic functional information about TP53. The integration platforms like Life Science Grid (LSG) mutation plug-in displays mutation data for TP53 and cancer Biomedical Informatics Grid (caBIG). from COSMIC summarizing the mutations studied/ We built Inventus using LSG because it provided identified in various cancer types (see Figure 2). the necessary minimal information technology The scientist is quickly able to understand that framework. We developed the following plug-ins TP53 is frequently mutated in a wide variety of for Inventus: cancers. The BioGPS plug-in displays the gene expression data from the Novartis gene expression • Web services-based pathway data from atlas and shows TP53 to be under-expressed (see Pathway Commons. Figure 3). The above data quickly highlights to the • Access to mutation data fromept: Investigation of Gene TP53 COSMIC. scientist that TP53 could be involved in cancer. • Web services-based gene annotation informa- tion from NCBI EntrezGene. The pathway plug-in from Pathway Commonstation displays all the pathways that TP53 participatesws the general known • Portal-based data access to BioGPS.org for (see Figure 4). One can observe that TP53 isP53 like aliases, functional gene expression. involved in cell cycle pathways. Launching thesomal location etc • Cytoscape for visualizing pathways. pathway in Cytoscape allows the scientist to further understand the interacting partnersderstands that TP53 is a Mutation Gene Expression Proof of Concept: Investigation of Gene TP53 Step 3: Gene Expression • Actionws the mutations studied in • Scientist views the gene expression dataata sources like COSMIC. from public data sources like Novartis Gene Atlas.tices that TP53 is frequently • Inferencecancer types. • Scientist notices that TP53 could be lowly expressed in cancer tissues. • Step 4: Pathways Figure 2 • Action Figure 3s Confidential • Scientist views the pathways that TP53 is involved from public pathway data sources like Pathway Commons. cognizant 20-20 insights 4 • Inference • Scientist notices that TP53 is involved in various cell cycle pathways.