Bio Scope

Advanced search grammar tool to locate cluster motif sequences within a genome, with display of annotations.

Published in: Technology
BioScope: Advanced Search Grammar Tool for Identification of Functional Noncoding Elements

Principal Investigator - Hariharane Ramasamy
Sanjeev Mishra
Tulasi Ravuri

Summary

The completion of several genomic sequences has motivated the development of a tool that can aid in locating and analyzing the transcription factor binding sites (TFBS) responsible for regulating gene transcription. TFBS are short sequences, 4-20 bases in length, and are often located near the genes they regulate. These sequences occur in groups or modules, also called enhancers or cis-regulatory modules (CRM). A CRM contains one or more TFBS and interacts with a specific combination of transcription factors to regulate gene expression. Such sequences are often abundant near the genes they regulate. The goal of developmental biologists is to understand how these CRM are organized in a genome and how they regulate the gene. Laboratory methods for locating CRM are often laborious and time consuming, so computational methods have become an invaluable tool. The success of computational methods depends on how well they can be used in a lab environment.

Several computational tools exist to locate motifs in a genomic sequence. These tools fall into two categories. Tools in the first category employ statistical and probabilistic methods using known motifs and the frequencies of codons in a genomic sequence. Although some motifs have been discovered using these tools, they often yield many false positives. Tools in the second category employ fundamental principles of the combinatorial logic underlying the occurrence of enhancers / cis-regulatory modules (CRM). It is believed that genes with similar temporal and spatial expression patterns are controlled by similar CRM. Experimental biologists who are knowledgeable about CRM occurrences need an efficient tool to locate them by applying combinatorial knowledge such as counts of binding site occurrences within a specified width, logical combinations of one or more binding sites, orientation, and more. The tools should be efficient, scalable, and fast. The aim of this proposal is to build such tools.

1 Introduction

Several genomes, including the human and mouse genomes, have been sequenced close to completion. In this post-genomic era, it is imperative that researchers are equipped with novel methodologies that will enable them to rapidly and
accurately identify, annotate, and functionally characterize genes. Mining genomics and proteomics data with computational approaches therefore appears to be the best way to extract information from these resources in a short time frame. The transcriptional regulation of a gene depends on the concerted action of multiple transcription factors that bind to cis-regulatory modules located in the vicinity of the gene. Cis-regulatory modules are regulatory elements that occur close to each other and control the spatial and temporal expression of genes. The regulatory language that the genome uses to dictate transcriptional dynamics can be revealed by identifying these cis-regulatory elements. Often these elements are passed across organisms during evolution with few mutations and without losing their functional value. Knowledge of these motifs may help drive the discovery of similar genes in other closely related organisms. The availability of accurate models, along with useful search methods with enhanced sensitivity and specificity, will be the first step toward detecting putative regulatory elements genome-wide.

2 Background

Identifying regulatory sequences and their location in a genome is an important step in understanding gene expression. Genes that have similar expression are believed to have similar regulatory logic. Such genes are governed by unique combinatorial transcriptional codes known as cis-acting regulatory modules (CRMs) or enhancers. CRMs are oligonucleotide sequences that act together to activate or suppress the gene. Several past studies have examined the behavior of enhancers and their role in developmental biology. The experiments performed to study the expression of a gene at a developmental stage are often time consuming and laborious. Biologists therefore often turn to computational tools to scan the whole genome and select better candidate regulatory regions.

Several computational methods exist to predict regulatory motif sequences. The motifs are over-represented near the genes whose transcription they regulate. Using prior knowledge and position-based probabilities, several tools have been built to predict new regulatory motifs. CisAnalyst, developed by Berman et al., has been successfully applied to the fruit fly to find new clusters using a purely computational approach. BioProspector uses a Gibbs sampler to predict regulatory sequences. The main problem with these tools is the presence of background noise and the inability to differentiate between a true regulatory motif
versus a false positive. Besides, the variations in genomic sequence across species further increase the noise. Although computational methods have served well for finding genes and even individual exons in genomic data, regulatory element prediction has proven difficult.

Markstein et al. [3] developed a tool that lets biologists search using prior knowledge of enhancers. The tool allows biologists to input the desired regular expressions over {A,T,G,C}, a gene name, and width and proximity constraints. However, the tool is genome-specific and lacks some important constraints, such as distance to the next binding site, orientation and order of the motifs, low-affinity sequences, variable-length regular expressions, and user-defined overlap constraints.

A brief survey of the computational identification of regulatory DNA is given by Dmitri Papatsenko and Michael Levine [2]. The paper elucidates the need for computational tools and compares the available tools without going into the specific details of the algorithms. The article, however, emphasizes the need for fast and efficient computational tools.

3 Project Proposal

The project aims to provide the following:

1. restrictive search capabilities such as distance to the next motif, orientation of the motif, low-affinity motifs, and order of motif occurrence [5],
2. limited integrated information such as nearby genes/exons, gene expression data, and annotation details around the target once it is located [5],
3. interactive chain search, where a search for a target in one organism can be linked to an intra-species or cross-species search,
4. scalability and efficiency.

More importantly, our proposed module will be highly flexible, allowing constant integration of newer genomes while remaining a powerful tool that allows the researcher to search for complex gene clusters.

To that end we have developed a software program that locates the regulatory region more precisely, and with far more ease for the researcher, than currently available programs. More importantly, control over the results of the program will be given to the developmental biologist. The tool is well suited to a lab environment.
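The restrictive constraints listed above (orientation, order, and distance to the next motif) can be illustrated with a short Python sketch. This is not the BioScope implementation; the function names `reverse_complement`, `find_sites`, and `ordered_pairs`, and the reading of "distance" as the gap between the end of one site and the start of the next, are assumptions made for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(pattern, genome):
    """Yield (start, end, strand) for each match of `pattern` on either strand.

    A lookahead is used so that overlapping occurrences are reported too.
    Coordinates always refer to the forward strand.
    """
    n = len(genome)
    lookahead = re.compile(f"(?=({pattern}))")
    for m in lookahead.finditer(genome):
        yield m.start(1), m.end(1), "+"
    for m in lookahead.finditer(reverse_complement(genome)):
        # Map reverse-strand coordinates back onto the forward strand.
        yield n - m.end(1), n - m.start(1), "-"


def ordered_pairs(genome, first, second, max_gap, strand=None):
    """Return (a, b) pairs where a site of `first` precedes a site of `second`
    on the forward coordinate axis with 0..max_gap bases between them.

    If `strand` is "+" or "-", both sites must lie on that strand
    (hypothetical orientation constraint).
    """
    a_sites = [s for s in find_sites(first, genome) if strand in (None, s[2])]
    b_sites = [s for s in find_sites(second, genome) if strand in (None, s[2])]
    pairs = []
    for a in a_sites:
        for b in b_sites:
            gap = b[0] - a[1]  # bases between the end of a and the start of b
            if 0 <= gap <= max_gap:
                pairs.append((a, b))
    return pairs
```

For example, `ordered_pairs(genome, "GGGAAA", "TGACTC", 5)` would report only those places where a `GGGAAA` site is followed, within five bases, by a `TGACTC` site. Swapping the two motifs enforces the opposite order.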
3.1 Phase I Specific Aims

1. To develop a web-based module that allows the researcher to search for cis-regulatory elements. The tool will take motifs and search constraints as input, as shown in Figure 1, and will display results as shown in Figures 2 and 3. The search feature of the program will provide
   ◦ the ability to enter 10 regular expressions using A, T, G, C and the letters given in the table below
   ◦ an option to allow self-overlap
   ◦ a field to enter a name for each motif
   ◦ a box to specify a width constraint
   ◦ the flexibility to enter a logical combination of the motifs entered in the first item, such as (2A and 2B) or (A or B or C)
   ◦ the ability to disallow overlap across the motifs entered in the first item
   ◦ a field to enter the name of a gene that must lie within a specified distance of a cluster found using the above rules
   ◦ a name under which to save the results; the name can be used later in SuperCluster

   Letter   Bases
   B        C,G,T
   D        A,G,T
   H        A,C,T
   K        G,T
   M        A,C
   N        A,C,G,T
   R        A,G
   S        C,G
   V        A,C,G
   W        A,T
   Y        C,T

4 Summary: Significance of proposed work

The tool will also provide integration and maintenance features, which include:

1. Updating to new versions of genomic sequences when they become available from the public site.
2. Rerunning the program on old results and automatically notifying users of new results via email.
3. Integrating with Gene Ontology information and other useful databases as advised by biologists.
4. Providing a workflow-like tool that takes a query run on one organism and applies it to another organism with a single key press.
5. Storage and maintenance of results.

5 Commercialization Strategy

After the Phase I launch, every visitor to the site will be asked to fill in a profile, including the purpose of the visit, before being given access to the program. Visitors will also be asked for feedback, which will be collected and used as leads to prepare the BioRegulatory Appliance in Phase II.

6 Key Personnel

1) Hariharane Ramasamy is pursuing his PhD in Computer Science at Illinois Institute of Technology, IL, and has more than 15 years of experience in developing applied computational tools for biomedical engineering. A few relevant tools:

• Implemented a motif search system for genomic sequences that displays the results graphically on screen along with the sequence annotation.
• Developed a surveillance system to detect novel sequences.
• Developed a program that calculates the digest of peptides for user-supplied proteins and also performs differential combination of post-translational modifications along with pI/Mw calculations.
• Pattern-induced multiple alignment using properties of amino acids.
• New extended genetic algorithm for 3D lattice simulation of protein folding using conflicting criteria.
• Simulation of human stand-sit movement using a 3-link stick figure model.

Sanjeev Mishra

Sanjeev Mishra is a seasoned professional with about 20 years of industry experience. He has spent half of his career at startups in the fields of business activity management, business intelligence, and mobile application and management platforms, and the other half in research and development. He has been awarded one US patent. Sanjeev is passionate about biking, hiking, running, meditation, and gardening. Sanjeev holds a master's degree in Physics from DBS College Dehradun, India.

Tulasi Ravuri

Tulasi Ravuri is an experienced software engineering manager with 23 years of experience at several Silicon Valley companies such as Unisys, Novell, McAfee, DoCoMo Labs, and others. Through
his broad career he has helped bring several products to market. His most recent work is on a Life Sciences Regulatory Compliance and Administration software suite used by universities like Stanford, Berkeley, and Harvard; pharma companies such as GSK; hospitals such as Palo Alto Medical Foundation; and government. He advises several software companies and is an advocate of open source software. He has an MSCS from University of Louisiana and a BS (Chemical Engg.) from Andhra University, India.

7 Consultants

In Phase I, the following help will be used to guide the program to Phase II:

1. Two student interns for refining the search and gathering data on the abilities of the program.
2. A consultant for designing the user interface and graphics display.

8 Prior Support

The proposal has no prior or current support.

References cited

[1] Marc S. Halfon, Yonatan Grad, George M. Church, Alan M. Michelson. Computation-Based Discovery of Related Transcriptional Regulatory Modules and Motifs Using an Experimentally Validated Combinatorial Model. Howard Hughes Medical Institute and Department of Medicine, Brigham and Women's Hospital; Linköping University, Sweden.
[2] Dmitri Papatsenko, Michael Levine. Computational identification of regulatory DNAs underlying animal development. Nature Methods, Vol. 2, No. 7, 529-534, 2005.
[3] Markstein, M., Markstein, P., Markstein, V., Levine, M.S. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, Vol. 99, 763-768, 2002.
[4] Benjamin P. Berman, Barret D. Pfeiffer, Todd R. Laverty, Steven L. Salzberg, Gerald M. Rubin, Michael B. Eisen, and Susan E. Celniker. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biology, Vol. 5, R81, 2004.
[5] Alan M. Michelson. Deciphering genetic regulatory codes: A challenge for functional genomics. PNAS, Vol. 99, No. 2, 546-548, 2002.
[6] Matthias Harbers, Piero Carninci. Tag-based approaches for transcriptome research and genome annotation. Nature Methods, Vol. 2, No. 7, 499-502, 2005.
[7] Yueyi Liu, Liping Wei, Serafim Batzoglou, Douglas L. Brutlag, Jun S. Liu, and X. Shirley Liu. A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Research, Vol. 32, Web Server Issue, 2004.
[8] Mike P. Liang, Olga G. Troyanskaya, Alain Laederach, Douglas L. Brutlag, and Russ B. Altman. Computational Functional Genomics. IEEE Signal Processing Magazine, 2004.

Budget

Description                            Expense amount for 6 months
Salary for Principal Investigator      $36,000
Salary for software engineer           $30,000
Salary for 2 student interns           $24,000
Salary for biology consultant          $24,000
Hardware and software cost (4)         $24,000
Internet & cloud hosting services      $12,000
Miscellaneous expenses                 $6,000
Office rent & expenses                 $15,000
Travel                                 $5,000
Total cost                             $176,000
Figure 1: Input web form to search the genomic sequence using user-defined constraints

Figure 2: Results summary

Figure 3: Detailed results display for

Figure 4: Flow chart describing the flow of the algorithm

Figure 5: Diagram describing the Phase I flow
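The figures themselves are not reproduced in this transcript. As a rough illustration of the Phase I search inputs described in Section 3.1 (motifs written with ambiguity letters, a self-overlap option, a width constraint, and a logical combination of motif counts), the following Python sketch may help. It is an assumption-laden sketch rather than the algorithm of Figure 4: the names `iupac_to_regex`, `motif_starts`, and `find_clusters` are hypothetical, the letter table is the standard IUPAC nucleotide code, and "2A" is read as "at least two occurrences of motif A".

```python
import re

# Standard IUPAC nucleotide ambiguity codes, one regex fragment per letter.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "K": "[GT]",
    "M": "[AC]", "N": "[ACGT]", "R": "[AG]", "S": "[CG]",
    "V": "[ACG]", "W": "[AT]", "Y": "[CT]",
}


def iupac_to_regex(motif):
    """Translate a fixed-length IUPAC motif into a regular expression."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def motif_starts(motif, genome, self_overlap=True):
    """Return start positions of `motif`; optionally count overlapping hits."""
    regex = iupac_to_regex(motif)
    if self_overlap:
        regex = f"(?=({regex}))"  # zero-width lookahead keeps overlapping hits
    return [m.start() for m in re.finditer(regex, genome)]


def find_clusters(genome, motifs, width, rule):
    """Return the start of every `width`-base window whose motif counts
    satisfy `rule`.

    motifs: dict mapping a motif name to its IUPAC string.
    rule:   callable taking {name: count} and returning True or False.
    Only sites lying entirely inside the window are counted.
    """
    hits = {name: motif_starts(m, genome) for name, m in motifs.items()}
    windows = []
    for start in range(max(1, len(genome) - width + 1)):
        end = start + width
        counts = {
            name: sum(1 for p in hits[name]
                      if start <= p and p + len(motifs[name]) <= end)
            for name in motifs
        }
        if rule(counts):
            windows.append(start)
    return windows
```

A rule such as (2A and 2B) would then be passed as `lambda c: c["A"] >= 2 and c["B"] >= 2`, and (A or B) as `lambda c: c["A"] >= 1 or c["B"] >= 1`. The window scan is deliberately naive. A production search would merge adjacent windows and avoid recounting hits at every position.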
Appendix

The ultimate goal is to build a self-contained BioRegulatory appliance that supports automatic updates of the genomic sequences, reruns old queries on the new sequences, and informs users of new results, thereby saving an enormous amount of time for developmental biologists who depend on computers to locate the target.

Phase II Plan

Specific Aims - To enhance the available module, Biocis, so that it is user friendly and easy for a researcher to navigate. Phase II will also aim to create a workflow module that allows easy storage and retrieval of data from disparate sources and integrates with useful information. The Phase II features will include:

1. An advanced regular expression search tool for genomic sequences that uses prebuilt index positions for 4-length bases (AAAA, AAAG, ..., GCGC, ..., TTTT) to locate the motifs.
2. An advanced multithreaded server tool to perform fast parallel search of the motif sequences.
3. Advanced caching in memory/disk and database to avoid repeated searches of previous sequences.
4. An automated daemon process to fetch new releases, rerun the saved searches, and inform scientists of new results via email.
5. A link to the Gene Ontology database, which provides gene function information.
6. Cross-species ortholog results from existing public annotated databases.
7. Simple statistical tools to look at the motif occurrences across the whole genome from the interesting results.
8. Creation of the BioRegulatory software package and a plan for designing a spec for the BioRegulatory Appliance.
9. A SuperCluster tool that will perform a search similar to that in Aim I.
10. The inputs A-J are the names of the searches performed in Aim I. The tool will help support the theory that clusters of enhancers act together in regulating the gene. A sample input form is shown in Figure 6.

3.1.2 Phase III

Phase III will address the following:

• Creating a sound computing infrastructure. The infrastructure requires writing a separate server to perform the search and caching. The search module will not be run via a web server, as some of the existing tools are. Every request to perform a search on a web server means that the whole genome sequence is read into memory. Genomic sequences range from 1 megabyte to 200 megabytes in size. If the number of users on the system grows, the system will run out of memory, which imposes a limit on the number of users. Using a web server to preload the data during startup is not advisable. Hence a separate server that performs the search for any generic genome sequence is needed. The caching in Phase I is achieved at two levels: memory and disk.
• Adding more features to the query and creating continuity in search. For example, once a search is performed, the result will display genes along with the orthologs in other species. The same enhancer search can then immediately be run against the species that has the closest orthologs. Phase III will also look at improving the performance of the BioRegulatory appliance.
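The prebuilt index of 4-base words mentioned in the Phase II plan can be sketched as follows. This is only a minimal illustration in Python under stated assumptions: the names `build_index` and `find_exact` are hypothetical, the index is held in memory, and the lookup handles exact motifs only. A regular expression search would additionally need to expand ambiguous seeds into all matching 4-base words.

```python
from collections import defaultdict

K = 4  # word length, matching the 4-base index entries (AAAA ... TTTT)


def build_index(genome, k=K):
    """Map every k-base word to the ascending list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_exact(motif, genome, index, k=K):
    """Locate exact occurrences of `motif` (length >= k) using the index.

    The first k bases act as a seed: only positions listed for the seed
    are checked against the full motif, so the genome is never rescanned.
    """
    if len(motif) < k:
        raise ValueError("motif is shorter than the index word length")
    return [p for p in index.get(motif[:k], []) if genome.startswith(motif, p)]
```

Building the index once, in a long-running search server, fits the argument above for loading each genome a single time rather than on every web request.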
Figure 6: SuperCluster - Web form for user input
