Plant Physiol.
A correction to this article has been published.

Plant Physiol, June 2002, Vol. 129, pp. 451-454

SCIENTIFIC CORRESPONDENCE

Using Cauliflower to Find Conserved Non-Coding Regions in Arabidopsis1


Juliette Colinas, Kenneth Birnbaum, and Philip N. Benfey*

Department of Biology, 1009 Main Building, New York University, 100 Washington Square East, New York, New York 10003



A bioinformatics approach is used to analyze the degree of conservation between upstream non-coding regions of cauliflower (Brassica oleracea) and Arabidopsis. The level of homology suggests that comparison of these two species could reveal functional cis-regulatory elements.

There is growing interest in comparing genome sequences to identify regulatory regions (Stojanovic et al., 1999). This arises in part from the failure of de novo computational methods to consistently recognize functional promoter elements from single genomes (Loots et al., 2000; Pennacchio and Rubin, 2001). Because genomic regions that have a biological function are often conserved through evolution, non-coding regions conserved between species are more likely to contain regulatory sequences (Stojanovic et al., 1999). Numerous computer programs have been written to extract conserved regions or motifs from orthologous sequences (reviewed elsewhere; Fickett and Wasserman, 2000; Stormo, 2000; Ohler and Niemann, 2001). In addition, several studies have shown that the conserved non-coding sequences (CNS) found using such comparisons often have biological meaning (Hardison, 2000; Kent et al., 2000; Loots et al., 2000) and are enriched in transcription factor binding sites (Levy et al., 2001).

The genomes to be used in these comparisons must be carefully selected if useful results are to be obtained: comparing genomes that are too closely related identifies nonfunctional conservation, whereas genomes that are too distantly related lack sufficient conservation for a meaningful comparison. Evidence from studies in animals and bacteria suggests that more closely related species are more likely to be useful for identifying regulatory regions, because regulatory regions appear to change more rapidly than coding regions (Huynen and Bork, 1998; Cargill et al., 1999).

Among plants, extensive genomic sequence is at present only available for Arabidopsis. As a consequence, the choice of additional plant species to sequence is important to provide maximal information from sequence comparisons. This choice could be made if sequence data were available from a number of related plant species, but at present only limited sequence data are available, and only for cauliflower. The genus Brassica includes many species and cultivars (O'Neill and Bancroft, 2000) for which there are economic incentives for genome sequencing. This genus is closely related phylogenetically to Arabidopsis, their divergence time being estimated at 14.5 to 20.4 million years based on mitochondrial DNA data (Quiros et al., 2001). However, it is still unclear how much conservation can be found in the non-coding genomic regions of these two genera. In an analysis of the promoter of APETALA3 orthologs in Arabidopsis and cauliflower, Hill et al. (1998) found 62% identity in the 440 bases upstream of the transcription start site. However, another study comparing a genomic region between cauliflower and Arabidopsis found less identity in several promoter comparisons, except for one region of 59 bp with 78% identity in one promoter and a 340-bp region with 54% identity in another promoter (Quiros et al., 2001). Thus, it is important to expand upon such analyses to establish whether a comparison of Arabidopsis with cauliflower is likely to provide useful regulatory site information. The study described here is a first step toward answering that question. Using more extensive data now available for cauliflower from a shotgun-sequencing project along with the completed Arabidopsis sequence, we conducted a preliminary comparison of cauliflower and Arabidopsis putative regulatory regions.

Cauliflower shotgun sequences (8,864 total) of about 400 to 700 bp in length (covering about one-hundredth of the estimated 600-Mb genome; O'Neill and Bancroft, 2000) were obtained from Washington University and Cold Spring Harbor Laboratory (ftp://cshl.org/pub/sequences/brassica_shotgun/, submitted on 2001/05/04) and were subjected to a BLAST analysis (http://www.ncbi.nlm.nih.gov/BLAST/; Altschul et al., 1997) against the entire National Center for Biotechnology Information nucleotide database, including expressed sequence tags. To identify the best candidate sequences for comparative analysis, a program was written to select the cauliflower sequences that were homologous to the 5' end of an Arabidopsis gene and also contained part of the 5' non-coding region of that gene. This was done by screening the BLAST output to select for cauliflower sequences that hit at least one non-plastid and non-ribosomal complete Arabidopsis cDNA with an alignment of at least 50 bp and an overhang at the 5' end of the cDNA of at least 100 bp. To ensure that true orthologs were compared, the alignments of the 60 cauliflower and Arabidopsis sequences retrieved from this first selection were then manually inspected. Only the cauliflower shotgun sequences that aligned with a BLAST score above 80 to a single Arabidopsis genomic fragment that also aligned almost perfectly with the originally identified Arabidopsis cDNA were kept for further analysis. Twenty-six sequences were rejected based on these criteria. In addition, 13 sequences were discarded due to inconsistent annotation in Arabidopsis (that is, the cDNA annotation contradicted the genomic annotation). Finally, eight sequences aligned to Arabidopsis ribosomal or plastid cDNAs that were not annotated as such and were also discarded. Thirteen of the initial 60 cauliflower sequences were, thus, kept and analyzed further. Because we retained only promoters with the best evidence for orthology between cauliflower and Arabidopsis, more cauliflower sequences could probably have been analyzed, but we chose to use conservative criteria.
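The automated screen described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (alignment of at least 50 bp, 5' overhang of at least 100 bp, no plastid or ribosomal cDNAs), not the program actually used in the study. The `Hit` record, its field names, the keyword list used to recognize plastid and ribosomal cDNAs, and the restriction to plus-strand alignments are all assumptions made for the example; the check that a cDNA is complete is omitted.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment of a cauliflower read against an Arabidopsis cDNA.

    Hypothetical record: field names are illustrative, coordinates are
    1-based, and the alignment is assumed to be on the plus strand.
    """
    query_id: str      # cauliflower shotgun read
    subject_desc: str  # description line of the Arabidopsis cDNA
    q_start: int       # first aligned base on the read
    q_end: int         # last aligned base on the read
    s_start: int       # first aligned base on the cDNA


# Assumed keywords for spotting plastid and ribosomal cDNAs by description.
EXCLUDED = ("plastid", "chloroplast", "ribosomal")


def five_prime_overhang(hit: Hit) -> int:
    """Read bases lying upstream of the first base of the cDNA."""
    return (hit.q_start - 1) - (hit.s_start - 1)


def passes_screen(hit: Hit, min_align: int = 50, min_overhang: int = 100) -> bool:
    """Apply the alignment-length, overhang, and annotation criteria."""
    desc = hit.subject_desc.lower()
    if any(word in desc for word in EXCLUDED):
        return False
    align_len = hit.q_end - hit.q_start + 1
    return align_len >= min_align and five_prime_overhang(hit) >= min_overhang


def select_reads(hits: list[Hit]) -> list[str]:
    """Reads with at least one hit that passes the screen, sorted by name."""
    return sorted({h.query_id for h in hits if passes_screen(h)})
```

A read whose alignment starts at base 150 of the read and base 1 of the cDNA has a 149-bp overhang and would pass, provided the alignment itself spans at least 50 bp. The counts from the manual inspection are also consistent: 60 - 26 - 13 - 8 = 13 sequences retained.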

The 13 cauliflower shotgun sequences retained from the manual selection were aligned with Arabidopsis using VISTA (http://www-gsd.lbl.gov/vista/; Mayor et al., 2000; Dubchak et al., 2000), which can find windows of high identity in an alignment that shows generally poor conservation. Because some of the 5' non-coding regions compared were around 100 bp and we wanted to identify small regions of conservation, a window size of 25 bp was chosen. Eight negative controls using random pairs of cauliflower and Arabidopsis non-coding sequences revealed that windows of 25 bp with at least 75% identity were unlikely to occur by chance alone (none was found in the eight random comparisons). A CNS was, thus, defined here as having 75% or more identity over a window of at least 25 bp.
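The CNS definition above (at least 75% identity over a window of at least 25 bp) can be illustrated with a simple sliding-window scan over a pairwise alignment. The sketch below is not VISTA and does not reproduce its alignment or plotting steps; it assumes the two sequences are already aligned as equal-length strings with `-` for gaps, counts a gap column as a mismatch, and uses hypothetical function names.

```python
def window_identity(aln_a: str, aln_b: str, window: int = 25) -> list[float]:
    """Percent identity for every window of `window` alignment columns.

    Both inputs are rows of one pairwise alignment (same length, '-' for
    gaps). A column counts as a match only if both bases are equal and
    neither is a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [int(a == b and a != "-") for a, b in zip(aln_a.upper(), aln_b.upper())]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    identities = [100.0 * total / window]
    for i in range(window, len(matches)):
        # Slide the window one column to the right.
        total += matches[i] - matches[i - window]
        identities.append(100.0 * total / window)
    return identities


def find_cns(aln_a: str, aln_b: str, window: int = 25,
             min_identity: float = 75.0) -> list[tuple[int, int]]:
    """Merge qualifying windows into half-open (start, end) column ranges."""
    regions: list[tuple[int, int]] = []
    for start, ident in enumerate(window_identity(aln_a, aln_b, window)):
        if ident < min_identity:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

Running the same scan on random pairs of non-coding sequences, as in the eight negative controls, gives a rough sense of how often a 25-bp window reaches 75% identity by chance. Note that a merged region can include a few mismatched columns at its edges, because any window at or above the threshold is kept whole.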

The alignments obtained for the 13 sequences are summarized in Table I and two representative alignments are illustrated in Figure 1. The results show that 10 of the 13 genes contain at least one conserved region between cauliflower and Arabidopsis in their 5' non-coding sequence. The size of these regions varies between 25 and 118 bp and averages 48 bp, and 37% of the non-coding bases belong to a conserved region. Most genes (8/10) contain one to two CNSs, which are always separated either from the site of translation initiation (as in the RSH3 gene of Fig. 1) or transcription initiation (as in the unknown gene of Fig. 1) by a region of low conservation of at least 30 bp. As seen in Table I (column "distance from translation start site"), CNSs can be found at distances from the translation start ranging from 46 to 434 bp, whereas the sizes of the non-coding regions available for comparison range from 130 to 540 bp. Thus, CNSs can be found throughout the non-coding sequences. Although it is possible that some of these CNSs represent cryptic exons, this is unlikely to be the case for all the genes compared. We also note that for one of the three genes for which no CNS is found (unknown protein 3), the coding sequence conservation is poor (150 bp of 500), indicating that the two sequences might not be true orthologs because all the other genes show almost complete conservation in the coding region available for comparison. Finally, no CNSs were found in the introns that were available for comparison (three sequences).


                              
 
Table I.   Summary of the VISTA alignment results between 13 cauliflower shotgun sequences and their Arabidopsis homologs

Gene abbreviations: RSH3, RelA/SpoT homolog 3; PLC, phospholipase C; SIMIP, salt-stress-inducible major intrinsic protein; Syp42, syntaxin of plants 42; and DRT112, recombination and DNA damage resistance 112. The "unknown proteins" are unidentified sequenced cDNA clones.



 
Figure 1.   Examples of VISTA alignments of cauliflower shotgun sequences with their Arabidopsis homologs. The alignments for the RSH3 gene (top) and unknown protein 1 (bottom) are shown. The horizontal and vertical axes represent the position in the sequences (in base pairs), and the percent identity of the two sequences in a 25-bp window around that position, respectively. Regions in which the identity is greater than or equal to 75% are colored in pink (for non-coding regions), turquoise (5'-untranslated region [UTR]), or blue (coding region). The level of conservation observed in the coding region and the short, relatively well-defined region of conservation in the non-coding region are representative of most of the other genes examined.

Because most sequence comparison analyses have been carried out between much more distantly related animal or bacterial species, e.g. mouse and human, which are separated by about 80 million years (Hardison et al., 1997), one question was whether there would be too much conservation between Arabidopsis and cauliflower for most of these CNSs to be functionally meaningful. However, the degree of conservation of non-coding sequences does not seem to be greater than between mouse and human. Levy et al. (2001) found that 20% of the bases in the upstream 500 bp of 502 disease genes from human and mouse are aligned by BLAST (parameters: match = +1 and mismatch = -1). Performing a similar analysis with our sequences, we also find an average of 20% conservation (data not shown). This number might be an overestimate because we are comparing shorter sequences and most of the conservation might be expected to lie proximal to the 5' end of the genes, but it shows that the level of conservation between Arabidopsis and cauliflower does not seem to be dramatically higher than between mouse and human. Nevertheless, the functional significance of these CNSs remains to be experimentally tested.

Overall, even though the comparison set is small, this study indicates that there is likely to be significant conservation of promoter regions between Arabidopsis and cauliflower. This suggests that sequence comparisons across these two species may prove useful for the identification of regulatory regions. Coupled with experimental studies, conducting similar pilot studies with other plant species would allow the identification of the most informative plant species for sequence comparison with Arabidopsis.


    ACKNOWLEDGMENTS

We thank Richard McCombie from Cold Spring Harbor Laboratory for providing the cauliflower shotgun sequences, Dennis Shasha for discussion, and Mike Chou and Borislav Iordanov for help with the programming.

    FOOTNOTES

Received January 9, 2002; accepted March 17, 2002.

1 This work was supported by the National Institutes of Health (grant no. GMR01-43788 to P.N.B.).

* Corresponding author; e-mail philip.benfey{at}nyu.edu; fax 212-995-4204.

www.plantphysiol.org/cgi/doi/10.1104/pp.002501.



© 2002 American Society of Plant Physiologists




