• Like


Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

REDUCE: an online tool for inferring cis-regulatory elements ...

Uploaded on


  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads


Total Views
On Slideshare
From Embeds
Number of Embeds



Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

    No notes for slide


  • 1. # 2003 Oxford University Press Nucleic Acids Research, 2003, Vol. 31, No. 13 3487–3490 DOI: 10.1093/nar/gkg630REDUCE: an online tool for inferring cis-regulatoryelements and transcriptional module activities frommicroarray dataCrispin Roven1 and Harmen J. Bussemaker1,2,*1 Department of Biological Sciences and 2Center for Computational Biology and Bioinformatics,Columbia University, New York, NY, USAReceived February 18, 2003; Revised and Accepted April 28, 2003ABSTRACT requires distinct microarray experiments to be combined in an ad hoc fashion. Another problem is that it groups genes intoREDUCE is a motif-based regression method for disjoint subsets, while a promoter region usually has multiplemicroarray analysis. The only required inputs are (i) transcription factor binding sites and receives input from aa single genome-wide set of absolute or relative number of different signaling cascades (11).mRNA abundances and (ii) the DNA sequence of the Our laboratory has pioneered the motif-based regressionregulatory region associated with each gene that is analysis of a single transcriptome. Implemented as theprobed. Currently supported organisms are yeast, algorithm REDUCE—an acronym that stands for regulatoryworm and fly; it is an open question whether in its element detection using correlation with expression—ourcurrent incarnation our approach can be used for method naturally takes into account the combinatorial naturemouse or human. REDUCE uses unbiased statistics of gene expression regulation and provides context-specificto identify oligonucleotide motifs whose occurrence information about transcription factor activities (12). REDUCEin the regulatory region of a gene correlates with the works by fitting a multivariate predictive model to a single genome-wide expression pattern. The expression level of alevel of mRNA expression. Regression analysis is gene is modeled as a sum of independent contributions fromused to infer the activity of the transcriptional all transcription factors for which binding sites occur in themodule associated with each motif. REDUCE is promoter region. A forward parameter selection strategy isavailable online at http://bussemaker.bio.columbia. used to select motifs from a large set of candidate motifs (e.g.edu/reduce/. This web site provides functionality for all oligonucleotides up to a given length).the upload and management of microarray data. We have used REDUCE to analyze microarray data forREDUCE analysis results can be viewed and down- mRNA expression in Saccharomyces cerevisiae (12,13) andloaded, and optionally be shared with other users or Drosophila melanogaster (14). We have also successfullymade publicly accessible. applied REDUCE to analyze genome-wide DNA–protein interaction data in Drosophila (14,15). Our motif-centered regression approach has been adopted and extended by other groups (16–18).BACKGROUNDIn recent years, DNA microarrays have become a popular USERS AND DATA MANAGEMENTexperimental tool for monitoring the mRNA transcriptabundance for all genes in a cell simultaneously (1,2). The Users must first register for an account via a fully automatedDNA sequence of the non-coding part of the genome contains procedure. The login system and tree-based data managementa variety of regulatory signals that together are responsible for structure allow for the safe and convenient storage ofthe differential regulation of gene expression. Through the expression data and REDUCE analysis results. The uploadedcombined analysis of transcriptome and genome, significant data sets are private unless the user explicitly allows them to beadvances can therefore be made in understanding the shared within a group or made publicly accessible through ourmolecular mechanisms underlying transcription control (3,4). permissions management interface.A widely used strategy first clusters genes based on their After registration and login, users are taken to theexpression profile across multiple conditions (5) and then ‘Experiment View’ where they can upload and organize datasearches for over-represented DNA motifs in the upstream (Fig. 1). The upload mechanism subjects incoming data to aregions of each gene cluster (6–10). One problem with the series of filters and illegal expression values are discarded.clustering approach to motif discovery, however, is that it Expression values specified as ratios or fold-changes are*To whom correspondence should be addressed. Tel: þ1 212 854 9932; Fax: þ1 212 865 8246; Email: hjb2004@columbia.eduNucleic Acids Research, Vol. 31, No. 13 # Oxford University Press 2003; all rights reserved
  • 2. 3488 Nucleic Acids Research, 2003, Vol. 31, No. 13 deleted (19) is it possible to infer the character of the transcription factor from the sign of Fm. Motifs can be ranked in terms of how much they improve the fit of the model to the data (as described in detail in 12). This allows for an unbiased search of relevant motifs from a large set of candidates, e.g. all oligonucleotides up to length 7. From this set of 21 844 motifs, typically 100 motifs correlate strongly enough with expression to be significant. However, these motifs fall into 10 classes of closely related, partially overlapping motifs. The forward selection procedure as described (12) selects one representative from each such class. The set M will therefore typically contain 10 motifs. Once a motif has been found to be of interest based on the analysis of a single microarray experiment, scoring it against a large set of other (published) experiments provides a useful way of ‘annotating’ it, since the conditions under which the corresponding (and often unknown) transcription factor changes activity can be identified. It the case of mutant versus wild-type data, factors that are upstream of the transcription factor in a signal transduction pathway will be identified this way. Our web site allows one to specify an IUPAC consensusFigure 1. ‘Experiment View’ of the REDUCE analysis server. Uploaded motif, select any number of experiments and compute the valueexpression data can be organized in folders and shared with other users or of Fm for each of them. We also provide functionality to plotmade publicly available. such inferred transcription factor activity profiles for a number of different motifs. The usefulness of this approach has already been demonstrated (12,13).converted to log-ratios base two. The data can be uploaded oneexperiment at a time, as a two-column data file with each linecontaining an ORF and an expression value separated by a tab. AN EXAMPLEIt is also possible to upload multiple experiments as a singlemulti-column, tab-delimited file. Figures 1–3 show the most important screens users will encounter when uploading and analyzing their own expression data. Using the ‘expression view’ shown in Figure 1, users firstMULTIVARIATE REGRESSION ANALYSIS select the directory to which they want to save their experiment. We will use the 15 min time point of the heatREDUCE is based on the following multivariate linear model shock experiment performed by Gasch et al. (20) as anfor transcription initiation control: example. Figure 2 shows how the two-column file containing X the log-ratio for each gene is uploaded to the server. Once this Ag ¼ C þ Fm Nmg step has been completed, the experiment can be submitted for m2M analysis by checking the box in front of it and pressing the ‘REDUCE’ button. The user will be prompted for a choice ofThe power of multiple-motif regression analysis is that the sequence data (in yeast, a typical choice is 600 bp upstreamgenome-wide transcriptional response measured in a micro- from the translation start site) and the maximum length of thearray experiment can be decomposed in terms of distinct candidate motifs. As soon as the analysis is complete—regulatory modules. This greatly facilitates the biological typically after 1–2 min—the user is notified that the results caninterpretation of transcriptome data. The regulatory modules be viewed. Figure 3 shows the resulting model. The motifs arethat are significantly affected in a particular experiment will colored red or green according to the sign of the regressioneach be represented by one motif in the set M. The model tries coefficient Fm. The number of genes with a match to the motifto explain the log-ratio Ag for each gene g in terms of the is listed as well, and by clicking on it, the user will see a list ofnumber of occurrences Nmg of the motif m in its promoter. all matched genes with a one-line description of their function, The regression coefficients or slopes Fm are uniquely and sorted by their log-ratio.determined by the fit to the log-ratios. They have a directinterpretation as the change in the concentration of the (activeform of the) transcription factor in the nucleus. Note that it is SYSTEM DESIGNnot possible to determine whether a transcription factor is anactivator or a repressor based on the sign of Fm. For instance, The computation server uses a modular architecture andboth increased activation and decreased repression by a employs only open source components. To plan for growingtranscription factor will manifest itself as an increase in the performance demands we have incorporated the CONDORexpression level of its target genes. Only when an experiment batch processing system so that a higher throughput can beis analyzed in which a wild-type strain is compared to a strain achieved by just adding more servers (http://www.cs.wisc.edu/in which a transcription factor is over-expressed or has been condor). Caching in the database makes CONDOR even more
  • 3. Nucleic Acids Research, 2003, Vol. 31, No. 13 3489Figure 2. The form for uploading a microarray experiment. Figure 3. Typical result of a REDUCE analysis.efficient, as we attempt to never perform the same operationtwice. Algorithms that take more than a few seconds to run are if so, we will add the corresponding upstream sequences to ourdisconnected from the application layer so that a user never server. Finally, the use of weight matrices was already shownwaits. to have the potential of greatly increased performance of Our application server is built on Apache and mod_perl REDUCE (12) and we are working on putting this approach on(http://www.apache.org) using HTML::Mason (http://www. a more solid basis. In general, the infrastructure provided bymasonhq.com) for presentation logic. We use Postgres our upload server and the expandable way in which it was(http://www.postgresql.org/) as our database for session man- designed will allow us to quickly add new functionality as newagement caching and data storage. We use Apache::Session tools are developed.from a centralized database so that we can add more appli-cation servers without having to implement complicated loadbalancing. ACKNOWLEDGEMENTS The application itself is implemented as Perl (http://www. We thank Andre Boorsma, Barrett Foat, Feng Gao and Marcelperl.com) objects. We store all the data necessary to instantiate van Batenburg for stimulating discussions and feedback, andthese objects in the database. For example, a REDUCE batch Laura Rogers for her help with the filtering and conversionjob is entered into the queue as a list of motif, sequence and modules. H.J.B. was partially supported by NIH grantexpression instance identifiers necessary to build the job. The LM007276.computation servers know how to rebuild the job and then runthe necessary functions. REFERENCES 1. Brown,P.O. and Botstein,D. (1999) Exploring the new world of theFUTURE DIRECTIONS genome with DNA microarrays. Nature Genet., 21, 33–37. 2. Lockhart,D.J., Dong,H., Byrne,M.C., Follettie,M.T., Gallo,M.V .,One of our priorities is to bring our system into compliance Chee,M.S., Mittmann,M., Wang,C., Kobayashi,M., Horton,H. andwith the MIAME and MAGE-ML standards (21–24). This will Brown,E.L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680.enable us to integrate with other databases and data curation 3. Banerjee,N. and Zhang,M.Q. (2002) Functional genomics as applied totools. We will also expand the list of organisms and sequence mapping transcription regulatory networks. Curr. Opin. Microbiol.,types for which REDUCE analysis can be performed and make 5, 313–317.it possible for users to upload their own sequence data in 4. Fickett,J.W. and Wasserman,W.W. (2000) Discovery and modeling ofFASTA format. Currently, users can select upstream and transcriptional regulatory regions. Curr. Opin. Biotechnol., 11, 19–24. 5. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Clusterdownstream sequences for S.cerevisiae, Caenorhabditis analysis and display of genome-wide expression patterns. Proc. Natl Acad.elegans and D.melanogaster. The analysis of transcription Sci. USA, 95, 14863–14868.regulation in human and mouse is inherently more difficult 6. Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. andsince enhancer regions are spread over distances of 10 kb or Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.more from the coding region. Ongoing research in our 7. Neuwald,A.F., Liu,J.S. and Lawrence,C.E. (1995) Gibbs motif sampling:laboratory is trying to answer the questions whether in its detection of bacterial outer membrane protein repeats. Protein Sci., 4,current incarnation REDUCE can be used for there organisms; 1618–1632.
  • 4. 3490 Nucleic Acids Research, 2003, Vol. 31, No. 13 8. van Helden,J., Andre,B. and Collado-Vides,J. (1998) Extracting regulatory 17. Wang,W., Cherry,J.M., Botstein,D. and Li,H. (2002) A systematic approach sites from the upstream region of yeast genes by computational analysis of to reconstructing transcription networks in Saccharomyces cerevisiae. oligonucleotide frequencies. J. Mol. Biol., 281, 827–842. Proc. Natl Acad. Sci. USA, 99, 16893–16898. 9. Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation 18. Chiang,D.Y., Brown,P.O. and Eisen,M.B. (2001) Visualizing associations maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. between genome sequences and gene expression data using genome-mean Syst. Mol. Biol., 2, 28–36. expression profiles. Bioinformatics, 17 (Suppl. 1), S49–S55.10. Hertz,G.Z., Hartzell,G.W.,III and Stormo,G.D. (1990) Identification of 19. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., consensus patterns in unaligned DNA sequences known to be functionally Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. (2000) related. Comput. Appl. Biosci., 6, 81–92. Functional discovery via a compendium of expression profiles. Cell, 102,11. Yuh,C.H., Bolouri,H. and Davidson,E.H. (1998) Genomic cis-regulatory 109–126. logic: experimental and computational analysis of a sea urchin gene. 20. Gasch,A.P., Spellman,P.T., Kao,C.M., Carmel-Harel,O., Eisen,M.B., Science, 279, 1896–1902. Storz,G., Botstein,D. and Brown,P.O. (2000) Genomic expression12. Bussemaker,H.J., Li,H. and Siggia,E.D. (2001) Regulatory programs in the response of yeast cells to environmental changes. Mol. element detection using correlation with expression. Nature Genet., Biol. Cell., 11, 4241–4257. 27, 167–171. 21. Brazma,A., Hingamp,P., Quackenbush,J., Sherlock,G., Spellman,P.,13. Koerkamp,M.G., Rep,M., Bussemaker,H.J., Hardy,G.P., Mul,A., Stoeckert,C., Aach,J., Ansorge,W., Ball,C.A., Causton,H.C. et al. (2001) Piekarska,K., Szigyarto,C.A., De Mattos,J.M. and Tabak,H.F. (2002) Minimum information about a microarray experiment (MIAME)-toward Dissection of transient oxidative stress response in Saccharomyces standards for microarray data. (2001) Nature Genet., 29, 365–371. cerevisiae by using DNA microarrays. Mol. Biol. Cell., 13, 2783–2794. 22. Ball,C.A., Sherlock,G., Parkinson,H., Rocca-Sera,P., Brooksbank,C.,14. Orian,A., Van Steensel,B., Delrow,J., Bussemaker,H.J., Li,L., Sawado,T., Causton,H.C., Cavalieri,D., Gaasterland,T., Hingamp,P., Holstege,F. et al. Williams,E., Loo,L.W., Cowley,S.M., Yost,C. et al. (2003) Genomic (2002) Standards for microarray data. Science, 298, 539. binding by the Drosophila Myc, Max, Mad/Mnt transcription factor 23. Spellman,P.T., Miller,M., Stewart,J., Troup,C., Sarkans,U., Chervitz,S., network. Genes Dev., 17, 1101–1114. Bernhart,D., Sherlock,G., Ball,C., Lepage,M. et al. (2002) Design15. van Steensel,B.D., and J.Bussemaker,H.J. (2003) Genome-wide analysis of and implementation of microarray gene expression markup language Drosophila GAGA factor target genes reveals context-dependent DNA (MAGE-ML). Genome Biol., 3, RESEARCH0046. binding. Proc. Natl Acad. Sci. USA, 100, 2580–2585. 24. Brazma,A., Parkinson,H., Sarkans,U., Shojatalab,M., Vilo,J.,16. Keles,S., van der Laan,M. and Eisen,M.B. (2002) Identification of Abeygunawardena,N., Holloway,E., Kapushesky,M., Kemmeren,P., regulatory elements using a feature selection method. Bioinformatics, Lara,G.G. et al. (2003) ArrayExpress-a public repository for microarray 18, 1167–1175. gene expression data at the EBI. Nucleic Acids Res., 31, 68–71.