Plant Physiology 138:47-54 (2005)
© 2005 American Society of Plant Biologists

Genome Cluster Database. A Sequence Family Analysis Platform for Arabidopsis and Rice¹

Center for Plant Cell Biology, Department of Botany and Plant Sciences, University of California, Riverside, California 92521
The genome-wide protein sequences from Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) ssp. japonica were clustered into families using sequence similarity and domain-based clustering. The two fundamentally different methods resulted in separate cluster sets with complementary properties that compensate for each other's limitations in accurate family analysis. Functional names for the identified families were assigned with an efficient computational approach that uses the description of the most common molecular function gene ontology node within each cluster. Subsequently, multiple alignments and phylogenetic trees were calculated for the assembled families. All clustering results and their underlying sequences were organized in the Web-accessible Genome Cluster Database (http://bioinfo.ucr.edu/projects/GCD) with rich interactive and user-friendly sequence family mining tools to facilitate the analysis of any given family of interest for the plant science community. An automated clustering pipeline keeps the information current as the annotations of the two genomes are updated and the clustering is improved. The analysis allowed the first systematic identification of family and singlet proteins present in both organisms as well as those restricted to one of them. In addition, the established Web resources for mining these data provide a road map for future studies of the composition and structure of protein families between the two species.
Sequence similarity comparisons play an essential role in analyzing the phylogenetic and structure-function relationships of genes and proteins. They are critical for dissecting complex functional differences between the members of protein families to ultimately understand their full activity spectrum. Efficient tools for analyzing complex families are of particular importance to plant biology since the majority of the genome-encoded proteins from the model organisms Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) are members of large sequence families. The number and sizes of these families frequently exceed the complexity in animal and other kingdoms (Riechmann and Ratcliffe, 2000
Although computational approaches will not overcome these difficulties, the development of efficient bioinformatics tools for analyzing differences within and across families is critical for guiding future research in many plant science areas. Most in silico family analysis strategies are based on the simple but effective concept that sequences with a higher degree of similar residues share more structural and functional properties than those with weaker similarities. This guideline allows the assignment of putative functions to unidentified proteins sharing significant sequence similarities and, hence, functional properties with characterized candidates. Although weak similarities are less reliable indicators for predicting function, they commonly provide essential information for discovering proteins with novel properties. To perform the required comparisons, all related sequences of interest need to be identified by similarity searches, the retrieved candidates organized in multiple alignments, their conserved domains localized, and distance trees calculated. Those basic family analysis steps have guided the functional identification of countless uncharacterized genes in the past. Various software tools are available for grouping unclustered protein sets into families, largely by employing distance matrices derived from all-against-all comparisons (Enright and Ouzounis, 2000
In this study, we have clustered all protein sequences from Arabidopsis and rice into similarity groups, calculated their corresponding alignments, localized their conserved domains, and generated distance trees. The resulting data sets provide comprehensive information about the similarities and dissimilarities between a monocotyledon and a dicotyledon representative with regard to the size, quantity, and composition of their family and singlet proteins. The provided data sets represent a foundation for future studies of the ortholog and paralog sequences of the two species.
The user-friendly Genome Cluster Database (GCD; http://bioinfo.ucr.edu/projects/GCD) was designed to provide the public with an efficient cluster mining tool for Arabidopsis and rice that supports various intraspecies and interspecies comparisons as well as the retrieval of related sequences from other organism groups.
Protein Similarity Clustering
To identify and compare all family and singlet proteins from Arabidopsis and rice ssp. japonica, their protein sequences from The Institute for Genomic Research (TIGR) were clustered into similarity groups. Two profoundly different approaches were chosen for this purpose to minimize the limitations inherent in most available methods for clustering large and diverse sequence sets with high sensitivity and low false-positive rates. To guide the reader through the following text, a summary of the two methods with regard to their relative performance for high-sensitivity clustering of remotely related sequences is provided in Table I. It is important to point out that the reported differences greatly depend on the parameter settings (see below) of the two methods.
The first approach (BCL) used the BLASTCLUST software from the National Center for Biotechnology Information (ftp://ftp.ncbi.nlm.nih.gov/blast/executables) to automatically group the proteins based on BLASTP similarity scores and single-linkage clustering. Low-complexity regions had been masked in the sequences to avoid overclustering due to biased amino acid distributions in certain proteins (Promponas et al., 2000 8 sphingolipid desaturase form separate families in this clustering despite the fact that they all share a cytochrome b5 domain with sequence identities above the similarity threshold (Table II). They are not joined into a hybrid cluster due to the relatively short length of the shared domain. Nevertheless, very large gene families with extremely complex domain architectures can be contaminated with false-positive proteins. An example of this is the kinase superfamily, which contains unrelated sequences in this clustering. The following domain-based approach generates far more reliable results for subgroups of this extremely complex family with more than 2,000 members. Clustering all kinases into one superfamily with accurate separation of subgroups requires several manual curation steps and specialized clustering techniques as described by Wang et al. (2003)
The second clustering approach (hidden Markov model [HMM] domain-based clustering [HCL]) used the serial arrangement of Pfam domains in each protein to form families with the same order of known protein domains. The domains were identified in the two protein sets by HMMPFAM (http://hmmer.wustl.edu/) searches against the Pfam domain model database. A custom Perl script was developed to group the proteins according to their identified domain architecture. As above, the composition of known families was used as a benchmark for parameter optimization. Our experience with more than 30 curated plant families from Girke et al. (2004) 0.1 as cutoff allows clustering of nearly complete families with false-positive rates close to zero. In addition, no constraints were set in this clustering regarding the domain coverage relative to the entire protein length. Those restrictions were avoided to favor the formation of complete families, even though limited coverage can result in false positives in which sequences share only short similarities. To further evaluate the cluster qualities by manual inspection of selected cases, multiple alignments and distance trees for all identified families of the two methods were calculated with the programs MultAlin and PHYLIP, respectively (Corpet, 1988
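The domain-based grouping described above can be illustrated with a minimal Python sketch. It is not the authors' Perl script: the input layout (one list of `(domain, start, evalue)` tuples per protein), the function names, the inclusive treatment of the 0.1 E-value cutoff, and the decision to leave proteins without any qualifying domain unclustered are all assumptions made for illustration.

```python
from collections import defaultdict


def domain_architecture(hits, max_evalue=0.1):
    """Return the ordered tuple of domain names for one protein.

    hits: list of (domain_name, start_position, evalue) tuples
          (hypothetical layout, not the HMMPFAM output format).
    Hits above the E-value cutoff are discarded (cutoff assumed inclusive).
    """
    kept = [h for h in hits if h[2] <= max_evalue]
    kept.sort(key=lambda h: h[1])  # order domains from N to C terminus
    return tuple(name for name, _, _ in kept)


def cluster_by_architecture(protein_hits, max_evalue=0.1):
    """Group proteins that share the same serial arrangement of domains."""
    clusters = defaultdict(list)
    for protein, hits in protein_hits.items():
        arch = domain_architecture(hits, max_evalue)
        if arch:  # assumption: proteins without domains stay unclustered
            clusters[arch].append(protein)
    return {arch: sorted(members) for arch, members in clusters.items()}


# Toy input with made-up protein identifiers and illustrative domain labels.
protein_hits = {
    "At1": [("Cyt-b5", 10, 1e-20), ("FA_desaturase", 150, 1e-40)],
    "Os1": [("FA_desaturase", 140, 1e-35), ("Cyt-b5", 5, 1e-18)],
    "At2": [("Cyt-b5", 12, 1e-15)],
    "Os2": [("Cyt-b5", 8, 1e-10), ("FA_desaturase", 120, 0.5)],
    "At3": [],
}
families = cluster_by_architecture(protein_hits)
```

In this toy input, `Os2` falls into the single-domain group because its second hit does not pass the cutoff, which shows how strongly the E-value threshold shapes the resulting families.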
The outcome of the two clustering methods is summarized in Table III. The HCL data for Arabidopsis agree in large parts with the domain signature clustering results from Wortman et al. (2003)
The comparative proteome-wide clustering of this study allowed the systematic identification of most singlet and family proteins that are present only in one of the two organisms. Those organism-restricted clusters are a rich resource for studying the molecular and functional diversities between the two organisms. Table IV provides a summary of the statistics of these complex differences, and Table V shows several examples that are organism restricted according to both clustering methods and contain only one Pfam domain. The complete family information for all cluster intervals of Table IV can be retrieved through predefined queries from GCD's Advanced Search page. Overall, the relative abundance of organism-restricted singlet and family proteins is much higher in rice (45% and 57%) than in Arabidopsis (29% and 25%). This finding is in agreement with published search results between the two organisms (Kikuchi et al., 2003
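As a rough illustration of how organism-restricted singlets and families can be separated from shared clusters, the following Python sketch counts the proteins that sit in clusters containing members of only one organism. The data layout, the function name, and the toy identifiers are hypothetical, and the sketch does not attempt to reproduce the percentages reported in the text.

```python
from collections import Counter


def restricted_counts(clusters, organism):
    """Count proteins in clusters whose members all come from one organism.

    clusters: list of lists of protein identifiers
    organism: dict mapping protein identifier -> organism name
    Returns a Counter keyed by (organism, "singlet" or "family").
    """
    counts = Counter()
    for members in clusters:
        organisms = {organism[m] for m in members}
        if len(organisms) != 1:
            continue  # cluster is shared between organisms
        kind = "singlet" if len(members) == 1 else "family"
        counts[(next(iter(organisms)), kind)] += len(members)
    return counts


# Toy data with made-up identifiers.
organism = {
    "a1": "Arabidopsis", "a2": "Arabidopsis", "a3": "Arabidopsis",
    "a4": "Arabidopsis", "a5": "Arabidopsis",
    "r1": "rice", "r2": "rice", "r3": "rice", "r4": "rice",
}
clusters = [["a1", "a2", "r1"], ["a3"], ["r2", "r3"], ["r4"], ["a4", "a5"]]
summary = restricted_counts(clusters, organism)
```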
As outlined above, comprehensive clustering of entire proteomes is a very complex process. Although automated computational approaches provide very efficient solutions to this problem, it is currently not possible to generate perfect cluster information for all families, even with two different approaches. This is largely due to the often very specialized requirements for extremely diverse candidates and incomplete proteome knowledge. Therefore, the provided family information has its limitations, and users are asked to critically assess the quality of a family of interest with their expert knowledge before they base critical research decisions on the results.
The generated information of this study was organized in the public GCD that is equipped with many powerful query, visualization, and download features for flexible interspecies analyses of gene and protein families. A multifunctional entry page allows users to search the database in single or batch mode by querying with gene/protein IDs, functional descriptions, cluster IDs, cluster names, or gene ontology keys. Combinatorial queries of scalable complexity can be generated through a separate Advanced Query page. Alternatively, a searchable and sortable cluster table enables navigation by cluster sizes, family names, and other criteria. All of the above query options return a result list with rich information on the specified gene/protein entries from Arabidopsis and rice. This includes the statistics of a query containing the number of its returned loci, gene models, and clusters. The corresponding protein, gene model, untranslated region, intergenic, and putative promoter sequences for any cluster or query can be displayed on the same page. This versatile sequence batch retrieval system allows efficient download of almost all types of Arabidopsis and rice sequences in a single step. The provided annotations for individual members contain detailed cluster information, including cluster names, total cluster sizes, organism distribution within clusters, and many links to external resources. Subclusters of higher similarity can be easily identified through BCL results with more stringent thresholds of 50% and 70% sequence identity. To quickly retrieve all members of a family on the result page, users can activate a hyperlinked subquery system for any given cluster in the database. This action will send the correct query syntax back to the main page and return all of the members of a family of interest. A sortable list of related sequences from all other organisms represented in the UniProt database (Leinonen et al., 2004
Functional names for all clusters are available. Those were assigned by a computational method that is based on the GO annotations for the two organisms (Ashburner et al., 2000
An automated download and reclustering pipeline has been implemented for the database backend to ensure up-to-date clustering and sequence information upon major changes in the genome annotations of the two organisms. GCD will be further maintained and improved by including additional well annotated plant genomes in the future and adding new features to enhance its functionality for the community. We will also continue to work on data interoperability and sharing of data with various protein family resources, TIGR, The Arabidopsis Information Resource (TAIR; Rhee et al., 2003
The comprehensive protein family information of this project and the associated GCD Web service both provide many new and unique opportunities for efficient comparative studies between Arabidopsis and rice. Those resources are expected to be of broad interest to researchers exploring the molecular and structural diversities within and across the two plant species.
Sequence Clustering
The plant proteome and genome sequences used for this project were downloaded from TIGR's ftp site (ftp://ftp.tigr.org). The latest genome annotation versions 5.0 and 2.0 were retrieved for Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) ssp. japonica, respectively. The nucleotide sequences and annotation data that are provided through the GCD interface were extracted with internal Bioperl parsers from the corresponding pseudochromosome files in XML format. Orthologs in other species were identified through local BLASTP (Altschul et al., 1990
Protein sequence similarity clustering was performed with the BLASTCLUST program (ftp://ftp.ncbi.nlm.nih.gov/blast/executables) using 50% overlap and 35% identity as cutoff values for family assembly. Two additional cluster sets with 50% and 70% identity were generated for the Web page. Prior to this clustering, low-complexity regions of the proteins were masked with the freely available CAST program from Promponas et al. (2000)
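The single-linkage principle behind this clustering step can be sketched in a few lines of Python. This is an illustration of the technique, not BLASTCLUST itself: the hit tuples, the function name, and the reduction of the overlap criterion to a single fraction per hit are assumptions. Two proteins end up in the same cluster whenever a chain of qualifying pairwise hits connects them.

```python
from collections import defaultdict


def single_linkage(proteins, hits, min_identity=35.0, min_overlap=0.5):
    """Cluster proteins by single linkage over qualifying pairwise hits.

    hits: iterable of (protein_a, protein_b, percent_identity, overlap_fraction)
          (hypothetical layout; overlap simplified to one fraction per hit).
    Returns a list of sorted member lists, largest cluster first.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, overlap in hits:
        if identity >= min_identity and overlap >= min_overlap:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(list)
    for p in proteins:
        groups[find(p)].append(p)
    return sorted((sorted(m) for m in groups.values()),
                  key=lambda m: (-len(m), m))


# Toy data with made-up identifiers and scores.
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [
    ("P1", "P2", 60.0, 0.9),
    ("P2", "P3", 40.0, 0.6),
    ("P3", "P4", 80.0, 0.3),  # high identity, but overlap too short
    ("P4", "P5", 30.0, 0.9),  # long overlap, but identity too low
]
clusters = single_linkage(proteins, hits)
strict = single_linkage(proteins, hits, min_identity=70.0)
```

Here `P1` and `P3` are joined only through `P2`, which is the behavior that makes single linkage sensitive but also prone to overclustering in large multidomain families. Raising `min_identity` mimics the more stringent cluster sets generated for the Web page.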
Domain composition clustering was performed in two steps. First, the Pfam domains were identified in the proteins with HMMPFAM searches against the latest Pfam HMM library (Pfam_ls). Second, the proteins were clustered with a custom Perl script based on their order of identified domains using an HMM E value of
Multiple alignments for clusters were calculated using the MultAlin program from Corpet (1988)
The GO annotations, used for protein family naming, were retrieved from TIGR's pseudochromosome files. Based on consistency considerations of GO categories between the two species, the TIGR annotations were used for both organisms. The more comprehensive Arabidopsis GO annotations from TAIR are not included at this point. For consensus mapping of terms, the current GO tree was downloaded from the GO Consortium page (http://www.geneontology.org/). The developed naming strategy is divided into two steps. First, a single molecular function GO identifier is assigned to each protein by using the deepest one in the network. If several GO identifiers share the same depth, only the first one is used. Second, the GO term appearing most often in a cluster is chosen to be the cluster name. If two or more GO terms tie for the highest count, the first one is used.
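The two-step naming strategy can be sketched as follows. This Python example is an illustration under stated assumptions, not the authors' implementation: the GO identifiers are placeholders, the depth table is made up, and "first" is interpreted as the order in which annotations are listed for a protein (step one) and the order in which members appear in the cluster (step two).

```python
from collections import Counter


def deepest_go(go_ids, depth):
    """Step 1: pick the deepest GO identifier; the first one wins on ties."""
    best = None
    for go in go_ids:
        if best is None or depth[go] > depth[best]:
            best = go
    return best


def name_cluster(cluster, protein_gos, depth, term_names):
    """Step 2: name the cluster after its most frequent selected GO term."""
    picks = [deepest_go(protein_gos.get(p, []), depth) for p in cluster]
    picks = [g for g in picks if g is not None]  # skip unannotated members
    if not picks:
        return None
    counts = Counter(picks)
    top = max(counts.values())
    for go in picks:  # first term reaching the top count wins on ties
        if counts[go] == top:
            return term_names[go]


# Placeholder identifiers and made-up depths, for illustration only.
depth = {"GO:A": 1, "GO:B": 2, "GO:C": 3, "GO:D": 3}
term_names = {
    "GO:A": "catalytic activity",
    "GO:B": "kinase activity",
    "GO:C": "protein kinase activity",
    "GO:D": "ATP binding",
}
protein_gos = {
    "p1": ["GO:A", "GO:C"],
    "p2": ["GO:C", "GO:D"],  # equal depth: the first one (GO:C) is kept
    "p3": ["GO:D"],
    "p4": ["GO:B"],
    "p5": [],
}
family_name = name_cluster(["p1", "p2", "p3", "p4", "p5"], protein_gos,
                           depth, term_names)
tie_name = name_cluster(["p3", "p4"], protein_gos, depth, term_names)
```

Because both tie-breaking rules depend on input order, the resulting names are only reproducible when annotations and cluster members are processed in a fixed order.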
The generated protein family cluster information was uploaded into a relational PostgreSQL database (http://www.postgresql.org/). To provide unlimited data access to the public through the Internet, a user-friendly Java-based Web interface was developed that integrates several open-source applications and Bioperl modules (Stajich et al., 2002
We thank Brian Haas and Shu Ouyang from TIGR for keeping us informed about new updates in the annotations of the two plant genomes.
Received December 27, 2004; returned for revision March 15, 2005; accepted March 21, 2005.
¹ This work was supported by the Center for Plant Cell Biology at the University of California, Riverside, and by the National Science Foundation (grant no. IOB-0420152 to J.B.-S. and T.G., and grant no. MCB-0296080 to N.R.).
www.plantphysiol.org/cgi/doi/10.1104/pp.104.059048.
* Corresponding author; e-mail thomas.girke@ucr.edu; fax 951-827-4437.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403-410
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al (2004) The Pfam protein families database. Nucleic Acids Res (Database issue) 32: D138-D141
Beisson F, Koo AJ, Ruuska S, Schwender J, Pollard M, Thelen JJ, Paddock T, Salas JJ, Savage L, Milcamps A, et al (2003) Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol 132: 681-697
Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135: 745-755
Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R (2001) Clustering protein sequences - structure prediction by transitive homology. Bioinformatics 17: 935-941
Briggs WR, Christie JM (2002) Phototropins 1 and 2: versatile plant blue-light receptors. Trends Plant Sci 7: 204-210
Brown NP, Leroy C, Sander C (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics 14: 380-381
Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16: 10881-10890
Eddy SR (1996) Hidden Markov models. Curr Opin Struct Biol 6: 361-365
Enright AJ, Kunin V, Ouzounis CA (2003) Protein families and TRIBES in genome sequence space. Nucleic Acids Res 31: 4632-4638
Enright AJ, Ouzounis CA (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16: 451-457
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575-1584
Faik A, Price NJ, Raikhel NV, Keegstra K (2002) An Arabidopsis gene encoding an alpha-xylosyltransferase involved in xyloglucan biosynthesis. Proc Natl Acad Sci USA 99: 7797-7802
Felsenstein J (2004) PHYLIP (Phylogeny Inference Package) Version 3.6. Distributed by the author. Department of Genetics, University of Washington, Seattle
Fowler TJ, Bernhardt C, Tierney ML (1999) Characterization and expression of four proline-rich cell wall protein genes in Arabidopsis encoding two distinct subsets of multiple domain proteins. Plant Physiol 121: 1081-1092
Girke T, Lauricha J, Tran H, Keegstra K, Raikhel N (2004) The Cell Wall Navigator database. A systems-based approach to organism-unrestricted mining of protein families involved in cell wall metabolism. Plant Physiol 136: 3003-3008
Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, et al (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376-379
Koonin EV, Wolf YI, Karev GP (2002) The structure of the protein universe and genome evolution. Nature 420: 218-223
Krause A, Haas SA, Coward E, Vingron M (2002) SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucleic Acids Res 30: 299-300
Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R (2001) CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 29: 33-36
Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R (2004) UniProt archive. Bioinformatics 20: 3236-3237
Mohseni-Zadeh S, Louis A, Brezellec P, Risler JL (2004) PHYTOPROT: a database of clusters of plant proteins. Nucleic Acids Res (Database issue) 32: D351-D353
Nelson DR, Schuler MA, Paquette SM, Werck-Reichhart D, Bak S (2004) Comparative genomics of rice and Arabidopsis. Analysis of 727 cytochrome P450 genes and pseudogenes from a monocot and a dicot. Plant Physiol 135: 756-772
Okushima Y, Overvoorde PJ, Arima K, Alonso JM, Chan A, Chang C, Ecker JR, Hughes B, Lui A, Nguyen D, et al (2005) Functional genomic analysis of the AUXIN RESPONSE FACTOR gene family members in Arabidopsis thaliana: unique and overlapping functions of ARF7 and ARF19. Plant Cell 17: 444-463
Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R (2002) ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics (Suppl 2) 18: S182-S191
Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 16: 915-922
Qin C, Wang X (2002) The Arabidopsis phospholipase D family. Characterization of a calcium-independent and phosphatidylcholine-selective PLD zeta 1 with distinct regulatory domains. Plant Physiol 128: 1057-1068
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224-228
Richmond TA, Bleecker AB (1999) A defect in beta-oxidation causes abnormal inflorescence development in Arabidopsis. Plant Cell 11: 1911-1924
Riechmann JL, Ratcliffe OJ (2000) A genomic perspective on plant transcription factors. Curr Opin Plant Biol 3: 423-434
Sarria R, Wagner TA, O'Neill MA, Faik A, Wilkerson CG, Keegstra K, Raikhel NV (2001) Characterization of a family of Arabidopsis genes related to xyloglucan fucosyltransferase1. Plant Physiol 127: 1595-1606
Sperling P, Zahringer U, Heinz E (1998) A sphingolipid desaturase from higher plants. Identification of a new cytochrome b5 fusion protein. J Biol Chem 273: 28590-28596
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12: 1611-1618
Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28: 33-36
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278: 631-637
Tchieu JH, Fana F, Fink JL, Harper J, Nair TM, Niedner RH, Smith DW, Steube K, Tam TM, Veretnik S, et al (2003) The PlantsP and PlantsT functional genomics databases. Nucleic Acids Res 31: 342-344
Wang D, Harper JF, Gribskov M (2003) Systematic trans-genomic comparison of protein kinases between Arabidopsis and Saccharomyces cerevisiae. Plant Physiol 132: 2152-2165
Wang R, Tischner R, Gutierrez RA, Hoffman M, Xing X, Chen M, Coruzzi G, Crawford NM (2004) Genomic analysis of the nitrate response using a nitrate reductase-null mutant of Arabidopsis. Plant Physiol 136: 2512-2522
Wootton JC, Federhen S (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem 17: 149-163
Wortman JR, Haas BJ, Hannick LI, Smith RK Jr, Maiti R, Ronning CM, Chan AP, Yu C, Ayele M, Whitelaw CA, et al (2003) Annotation of the Arabidopsis genome. Plant Physiol 132: 461-468