Plant Physiology 138:27-37 (2005)
© 2005 American Society of Plant Biologists
MetaCyc and AraCyc. Metabolic Pathway Databases for Plant Research1,[w]
The Arabidopsis Information Resource, Department of Plant Biology, Carnegie Institution of Washington, Stanford, California 94305 (P.Z., H.F., C.P.T., S.Y.R.); Cornell University, Ithaca, New York 14853 (L.M.); and SRI International, Menlo Park, California 94025 (S.P., P.D.K.)
MetaCyc (http://metacyc.org) contains experimentally determined biochemical pathways to be used as a reference database for metabolism. In conjunction with the Pathway Tools software, MetaCyc can be used to computationally predict the metabolic pathway complement of an annotated genome. To increase the breadth of pathways and enzymes, more than 60 plant-specific pathways have been added or updated in MetaCyc recently. In contrast to MetaCyc, which contains metabolic data for a wide range of organisms, AraCyc is a species-specific database containing only enzymes and pathways found in the model plant Arabidopsis (Arabidopsis thaliana). AraCyc (http://arabidopsis.org/tools/aracyc/) was the first computationally predicted plant metabolism database derived from MetaCyc. Since its initial computational build, AraCyc has been under continued curation to enhance data quality and to increase breadth of pathway coverage. Twenty-eight pathways have been manually curated from the literature recently. Pathway predictions in AraCyc have also been recently updated with the latest functional annotations of Arabidopsis genes that use controlled vocabulary and literature evidence. AraCyc currently features 1,418 unique genes mapped onto 204 pathways with 1,156 literature citations. The Omics Viewer, a user data visualization and analysis tool, allows a list of genes, enzymes, or metabolites with experimental values to be painted on a diagram of the full pathway map of AraCyc. Other recent enhancements to both MetaCyc and AraCyc include implementation of an evidence ontology, which has been used to provide information on data quality, expansion of the secondary metabolism node of the pathway ontology to accommodate curation of secondary metabolic pathways, and enhancement of the cellular component ontology for storing and displaying enzyme and pathway locations within subcellular compartments.
The goal of the MetaCyc database is to catalog every experimentally determined biochemical pathway for small molecule metabolism (Krieger et al., 2004
With the release of the fully sequenced plant genomes of Arabidopsis (Arabidopsis Genome Initiative, 2000 In this article, we describe how the MetaCyc and AraCyc databases are updated, including manual curation of new plant pathways and revision of predicted AraCyc pathways with information from the literature, updating the AraCyc pathway predictions using the latest genome annotations, recording evidence to pathways and enzymes, and enhancing data ontologies. We also describe general applications of the two databases to other plant genomes. Finally, we discuss the limitations and issues of these databases and future directions.
Manual Curation of Plant Pathways in MetaCyc and AraCyc
The number of manually curated plant pathways in MetaCyc and AraCyc has been expanded significantly in the last few years. To ensure that the newly added pathways benefit a broad user base, primary metabolic pathways universal to plants were given the highest priority. Pathways shared among a few species or those involving secondary metabolism were given a lower priority. Pathways are curated from literature following standard curation procedures developed by the curators at SRI International and the Carnegie Institution (http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf). In total, 63 plant pathways have been added to or updated in MetaCyc between release versions 6.5 (August 2002) and 8.6 (November 2004), and 28 pathways have been curated in AraCyc since last described (Mueller et al., 2003
Prediction of Pathways Using PathoLogic
The starting point for building an updated version of AraCyc was making use of the increased quantity and quality of Arabidopsis genome annotation. The annotation file used in the original build of AraCyc was generated by manually extracting free-text gene descriptions of The Institute for Genomic Research (TIGR) genome annotation (Mueller et al., 2003
Validation of Pathway Predictions
The 219 pathways of the AraCyc 2.0 initial build were validated manually by consulting the primary literature, review articles, and textbooks. A valid pathway is defined as a pathway whose existence in Arabidopsis is supported by experimental evidence described in the literature. If a pathway is not well known and a curator could not find it referenced in the literature, the following two criteria were considered for validation. First, do the reactions that are found only in this pathway (i.e. unique to this pathway) have any Arabidopsis genes assigned to them? When a gene could be associated to only one pathway, this pathway was kept. This criterion also applied even when the end product of the biosynthetic pathway had not been reported in Arabidopsis. For example, lipid A is a membrane component specifically found in gram-negative bacteria. It has not been described in plants. However, the Arabidopsis genome is predicted to include genes that catalyze at least three unique reactions of the lipid-A precursor biosynthesis pathway, suggesting plants may be able to synthesize the metabolite (http://www.biochem.duke.edu/Raetz/raetznew.html), or other similar compounds. The second criterion uses information about the existence of the metabolites. If none of the unique reactions of the pathway were associated with an Arabidopsis gene, we then looked for evidence of the existence of the metabolites (Robinson, 1983
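The validation order described above can be sketched as a small decision function. This is only an illustration, not the curators' actual tooling: the class and field names are hypothetical, and because the text describing the outcome of the second criterion is cut off here, the sketch assumes that a pathway is kept when evidence for its metabolites is found.

```python
from dataclasses import dataclass

@dataclass
class PredictedPathway:
    # All field names are hypothetical, chosen for this illustration.
    name: str
    literature_support: bool          # experimental evidence in Arabidopsis found in the literature
    unique_reactions_with_genes: int  # reactions unique to this pathway that have an assigned gene
    metabolites_reported: bool        # evidence that the pathway's metabolites exist

def validate(pathway):
    """Return (keep, reason), applying the checks in the order given in the text."""
    if pathway.literature_support:
        return True, "literature"
    # First criterion: a gene is assigned to at least one reaction unique to the pathway.
    if pathway.unique_reactions_with_genes > 0:
        return True, "unique reaction with assigned gene"
    # Second criterion (assumed outcome): metabolite evidence keeps the pathway.
    if pathway.metabolites_reported:
        return True, "metabolite evidence"
    return False, "no support"

# The lipid-A precursor example: no plant literature, but three unique reactions with genes.
lipid_a = PredictedPathway("lipid-A precursor biosynthesis", False, 3, False)
lipid_a_result = validate(lipid_a)

unsupported = PredictedPathway("hypothetical pathway", False, 0, False)
unsupported_result = validate(unsupported)
```

Returning the reason alongside the decision keeps a record of which criterion justified each retained pathway, which is useful when a curator revisits the call later.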
Summary of AraCyc 2.0 Pathways
On the other hand, 53 pathways that were previously predicted in AraCyc 1.0 were no longer called by the PathoLogic software in this run (Table III). Of these, 32 pathways either no longer exist in the reference database MetaCyc due to a lack of literature evidence, or were replaced by different pathway variants in AraCyc 2.0. Twelve pathways were predicted in AraCyc 1.0 but not in AraCyc 2.0 because changes in gene annotations removed the supporting evidence for the pathways. One notable exception is the photosynthesis (light reaction) pathway. The two enzymes of the pathway in MetaCyc, PSI and PSII, are enzyme complexes. However, within the GO, which has been used in functional annotation of the Arabidopsis genes, PSI and PSII are categorized as child terms of cellular component, not catalytic activity. Therefore, the genes encoding their individual subunits were not included in our input file because the input file was restricted to genes annotated to the catalytic activity term and its child terms. The pathway was later manually added to AraCyc 2.0. The remaining nine pathways were no longer predicted because of a bug in the PathoLogic algorithm. The bug was subsequently fixed in PathoLogic and the nine pathways were added to AraCyc 2.0.
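The photosynthesis case can be illustrated with a toy version of the input-file filter. The sketch below is an assumption-laden simplification: the hierarchy is a single-parent dictionary (real GO terms can have several parents and more intermediate levels), and the gene identifiers and function names are hypothetical.

```python
# Toy is-a hierarchy: child term -> parent term.
PARENT = {
    "catalytic activity": "molecular function",
    "oxidoreductase activity": "catalytic activity",
    "photosystem I": "cellular component",
    "photosystem II": "cellular component",
}

def is_a(term, ancestor):
    """True if `term` equals `ancestor` or descends from it in PARENT."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False

def select_input_genes(annotations, root="catalytic activity"):
    """Keep genes that have at least one annotation at or below `root`."""
    return sorted(
        gene for gene, terms in annotations.items()
        if any(is_a(term, root) for term in terms)
    )

# Hypothetical gene identifiers, for illustration only.
annotations = {
    "geneX": ["oxidoreductase activity"],
    "psI_subunit": ["photosystem I"],
    "psII_subunit": ["photosystem II"],
}
selected = select_input_genes(annotations)
print(selected)
```

In this toy data the photosystem subunits are annotated only under cellular component, so they never reach the prediction step, which mirrors why the light-reaction pathway had to be added by hand.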
In addition, 19 plant pathways in MetaCyc were not predicted in AraCyc 2.0. There are several possible reasons for the missed predictions. First, slight name variations between the annotations in the input file and the enzymes in MetaCyc and poorly annotated names for the P450 cytochromes in MetaCyc made it difficult to match Arabidopsis genes to MetaCyc enzymes. The poorly annotated names were later fixed in MetaCyc. Second, in many cases lack of enough evidence from the Arabidopsis annotation input file made it impossible for the pathways to be predicted. For example, even though enzymes of the homogalacturonan biosynthesis pathway have been characterized from other plants (http://biocyc.org/META/NEW-IMAGE?type=PATHWAY&object=PWY-1061), no Arabidopsis genes could be assigned to the pathway based on the current genome annotation. Third, the same PathoLogic bug mentioned above failed to predict seven pathways for which supporting evidence exists. For example, within the lipoxygenase pathway, Arabidopsis genes can be assigned to at least one unique reaction of the pathway, which is one of the criteria for inclusion. Nevertheless, since all of the 19 pathways are either specific to Arabidopsis (e.g. glucosinolate biosynthesis) or universal to plants (e.g. homogalacturonan biosynthesis), they were imported into AraCyc 2.0 using the pathway import utility of Pathway Tools (Karp et al., 2002
Overall, updates to the gene annotations resulted in prediction of 16 new pathways (or 7.8% of the 204 total pathways in AraCyc 2.0) since last described (Mueller et al., 2003 After removal of the nonvalid pathways and addition of the missing pathways, AraCyc 2.0 contains 204 pathways with 1,418 unique genes assigned (Table III). The evidence and citations supporting the functional annotations of these genes were imported from TAIR and were associated to the corresponding enzymes (Fig. 1). The distribution of the 204 pathways according to the pathway ontology is summarized in Table IV. The three top categories, "Biosynthesis," "Degradation/Utilization/Assimilation," and "Generation of Precursor Metabolites and Energy," contain 115, 74, and 26 pathway instances, respectively. Biosynthetic pathways for all 20 protein amino acids, all DNA/RNA purine and pyrimidine nucleosides and nucleotides, commonly occurring sugars and polysaccharides, major fatty acid and lipid classes including triacylglycerol and phospho- and glyco-lipids, 15 cofactors, prosthetic groups and electron carriers, and all seven known major plant hormones are represented in AraCyc 2.0. In addition, biosynthesis of the major molecules found in plant primary and secondary cell wall and plant epidermal structure, including cellulose, homogalacturonan (a pectin), lignin, suberin, wax, and cutin, is included. Pathways for central energy metabolism are well represented. It is not easy to assess the comprehensiveness of pathways under "Degradation/Utilization/Assimilation." There is much less information available for catabolism than for biosynthesis in plants. Nonetheless, two known degradation pathways for odd chain and unusual fatty acids need to be added. Supplemental Table I to this manuscript provides a comprehensive list of all the pathways in AraCyc 2.0.
For each pathway, it lists the number of reactions, the number of pathway holes, and the number of genes assigned to the pathway along with known genetic symbols. Pathways that have been curated are identified in the list.
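The summary figures quoted above can be checked with a few lines of arithmetic. The sketch uses only numbers stated in the text. The reading that the category counts exceed the pathway total because some pathways are classified under more than one top category is an inference from those numbers, not a statement from the article.

```python
total_pathways = 204
new_from_updated_annotations = 16

# Share of AraCyc 2.0 pathways that are new predictions from updated annotations.
share_percent = 100 * new_from_updated_annotations / total_pathways
print(f"{share_percent:.1f}%")

# Pathway instances in the three top categories of the pathway ontology.
top_categories = {
    "Biosynthesis": 115,
    "Degradation/Utilization/Assimilation": 74,
    "Generation of Precursor Metabolites and Energy": 26,
}
instances = sum(top_categories.values())

# If each pathway belonged to exactly one category, instances could not exceed the total.
surplus = instances - total_pathways
print(instances, surplus)
```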
Enhancement to the Database Ontology
Data objects in MetaCyc and AraCyc, including pathways, compounds, subcellular compartments, and evidence types, are structured in a hierarchical ontology (Gruber, 1993
The secondary metabolism class of the pathway ontology has been significantly enhanced. Secondary (sometimes referred to as specialized) metabolites are widespread in higher plants. They contribute substantially to the reproduction, fitness, and adaptations that plants acquired throughout the course of evolution (Pichersky and Gang, 2000
To represent pathways in a cellular context, MetaCyc and AraCyc also store protein subcellular location information. We have expanded the cellular component terms from 35 terms (Mueller et al., 2003
There is an increasing need for attaching evidence to data objects to distinguish data that have been curated with experimental evidence from those that are computationally predicted (Gene Ontology Consortium, 2001
Although MetaCyc and AraCyc share a significant overlap in data and data access/analysis software, the two databases have different purposes and, thus, different applications. The goal of MetaCyc is to represent all experimentally verified metabolic pathway data from all organisms. On the other hand, the goal of AraCyc is to represent the complement of metabolism of one organism, including both experimentally determined and computationally predicted data. In general, MetaCyc is suitable for use as a reference database to predict the metabolism complement for any newly sequenced and annotated genome because it is designed to maximize sensitivity by including pathways from many organisms. On the other hand, AraCyc may be at least as good as or better than MetaCyc for predicting metabolic pathways for genomes that are evolutionarily closer to Arabidopsis (such as other flowering plants) because pathways that are predicted to exist in plants but have not been experimentally validated can exist in AraCyc but not in MetaCyc. Also, MetaCyc contains a number of pathway variants that are specific to different organisms, whereas AraCyc contains only the plant variants. Therefore, it may take more time to curate the results of the prediction program from MetaCyc. To maximize sensitivity and specificity of the program's results, it may be worthwhile to generate a new metabolism database using more than one reference database (e.g. using both MetaCyc and AraCyc). Regardless of which database is used as reference, the critical importance of validating and curating the outputs of the prediction cannot be overemphasized. Computational pathway prediction is meant to serve as a starting point for building the metabolic content of an organism. After the newly created database has undergone curation, comparison to another organism may shed light on the growth/development and physiology of each organism. 
For individual species such as Arabidopsis, the Omics Viewer of Pathway Tools (the AraCyc version is at http://arabidopsis.org:1555/expression.html) can be used in data analysis. The tool paints data from gene expression, protein expression, gene family analysis, or metabolite profiling experiments onto a diagram of the full metabolic network of Arabidopsis. Each reaction (represented as a line connecting the compounds) can be color-coded according to the expression level of the gene or protein that catalyzes the reaction. Metabolite levels can also be depicted by color-coding the symbols for compounds (represented as squares or triangles connected by the reaction lines). Note that only those genes and compounds that are included in the pathways of AraCyc can be displayed on the metabolic map. However, it is possible to extrapolate from the Omics Viewer to identify additional components of a pathway. For example, if a set of genes from a gene expression microarray experiment appears to be involved in the same pathway and shows similar changes in expression values, one could cluster the original dataset to identify other genes having a similar expression profile. These genes in turn may represent components of the pathway missing from AraCyc.
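The extrapolation step can be sketched outside the Omics Viewer as a simple co-expression search. This is not part of Pathway Tools: it swaps in a plain Pearson correlation against the mean profile of the known pathway genes in place of a full clustering, and the gene names, function names, and the 0.9 threshold are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def candidate_genes(profiles, pathway_genes, min_r=0.9):
    """Rank genes outside the pathway by correlation with the pathway's mean profile."""
    members = [profiles[g] for g in pathway_genes if g in profiles]
    if not members:
        return []
    centroid = [sum(column) / len(column) for column in zip(*members)]
    hits = [
        (gene, pearson(profile, centroid))
        for gene, profile in profiles.items()
        if gene not in pathway_genes
    ]
    return sorted((h for h in hits if h[1] >= min_r), key=lambda h: -h[1])

# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1.1, 2.0, 3.2, 3.9],
    "geneE": [5, 5, 5, 5],
}
candidates = candidate_genes(profiles, {"geneA", "geneB"})
candidate_names = [gene for gene, _ in candidates]
```

Genes returned this way are only candidates for missing pathway components. They would still need the literature-based validation described earlier before being added to a pathway.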
MetaCyc and AraCyc can be accessed in a number of different ways: They can be queried and browsed using the Pathway Tools software through the Web (http://metacyc.org and http://arabidopsis.org/tools/aracyc). Datasets for AraCyc can be obtained as text files (ftp://ftp.arabidopsis.org/home/tair/Pathways/). The complete databases can also be downloaded (http://biocyc.org/download.shtml and http://arabidopsis.org/tools/aracyc). The first two options are freely accessible to anyone without a license, whereas the last option is available with a license (free to academic users). In addition, both databases can be queried and browsed using the desktop version of Pathway Tools, which provides more functionality than the Web version and is available for Windows/PC, Linux/PC, and SUN workstations (http://biocyc.org/download.shtml).
We have described recent updates of MetaCyc and AraCyc, which aimed to increase the breadth of data coverage and the accuracy of plant metabolism data. The remaining issues regarding quality of existing data, breadth of data curation, and limitations of data display are discussed below. Currently, about 25% of the AraCyc pathways have been manually curated, meaning that we have verified and corrected the pathway diagrams according to literature information, and added pathway comments and literature citations. The noncurated pathway diagrams, which were curated from microorganisms in the reference database MetaCyc and were predicted to exist in Arabidopsis by PathoLogic, represent what is experimentally verified in other organisms. These pathways need to be further curated from literature to represent what is experimentally verified in Arabidopsis and other plants. For example, additional intermediate reactions may be required for plants to synthesize the same compound. Or, plants may use different cosubstrates in converting compound A to compound B. The PathoLogic assignments of genes (and their encoded enzymes) to reactions and pathways are solely based on gene annotations. There are inevitably false-positive associations of genes with pathways. For example, isoenzymes localized in different subcellular compartments, though catalyzing the same reaction, may be involved in different pathways. These isoenzymes are not distinguished by PathoLogic and thus may be assigned to pathways in which they are not involved. An isoenzyme that catalyzes reaction X and is localized in the cytoplasm, for instance, may be incorrectly assigned to a pathway that contains reaction X but is located in the chloroplast. This kind of false-positive assignment needs to be removed during curation. False-positive assignments to reactions may also arise because of low-quality gene annotations.
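One way a curator might surface the compartment-related false positives described above is to compare enzyme and pathway locations where both are known. The sketch below is hypothetical: neither the function nor the identifiers come from Pathway Tools, and it assumes that each enzyme and each pathway has at most one recorded location.

```python
def flag_compartment_mismatches(assignments, enzyme_location, pathway_location):
    """Return assignments whose enzyme and pathway have different known locations.

    assignments: iterable of (enzyme, pathway) pairs.
    Pairs with an unknown location on either side are not flagged.
    """
    flagged = []
    for enzyme, pathway in assignments:
        e_loc = enzyme_location.get(enzyme)
        p_loc = pathway_location.get(pathway)
        if e_loc is not None and p_loc is not None and e_loc != p_loc:
            flagged.append((enzyme, pathway, e_loc, p_loc))
    return flagged

# Hypothetical isoenzymes that both catalyze reaction X.
assignments = [
    ("isoenzyme_cyt", "pathway_in_chloroplast"),
    ("isoenzyme_chl", "pathway_in_chloroplast"),
    ("isoenzyme_unknown", "pathway_in_chloroplast"),
]
enzyme_location = {"isoenzyme_cyt": "cytoplasm", "isoenzyme_chl": "chloroplast"}
pathway_location = {"pathway_in_chloroplast": "chloroplast"}

to_review = flag_compartment_mismatches(assignments, enzyme_location, pathway_location)
```

The output is a review list, not an automatic deletion list, since a recorded location may itself be a prediction.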
At present, functional annotations for the majority of the Arabidopsis genes lack experimental evidence. AraCyc users are advised to be cautious when using any of the noncurated data or data without experimental support. Our immediate goal for enhancement of the database content for MetaCyc and AraCyc is to expand the breadth of existing coverage of plant secondary metabolism, i.e. curation of representative pathways for each of the main compound classes, followed by increasing the depth, i.e. curation of additional pathways representing each of the major subclasses. In addition, we plan to curate and integrate transporters into their relevant pathways in AraCyc and MetaCyc. Transporters will also be added to the Metabolic Overview Map. In addition to data curation, many enhancements to the data visualization capabilities of the Pathway Tools software are planned, such as the ability to overlay expression values of individual isoenzymes onto reactions (currently the highest value is overlaid), to zoom in from the Metabolic Overview Map overlaid with expression data to pathway detail pages color-coded in the same way, and to display pathways in the context of subcellular location information. Pathway curation is a time-consuming process. One way to expedite the rate of curation and increase the quality of data is to encourage data submission by users, especially experts in a particular metabolism field. Submissions could include updates or corrections to an existing pathway or a new pathway to be added. An easy data submission form will be developed and available in the near future. Currently, users are encouraged to contact AraCyc curators if they notice errors or omissions in the data (curator{at}arabidopsis.org).
PathoLogic Prediction of AraCyc 2.0
AraCyc 2.0 was built by running PathoLogic (Pathway Tools 8.5) as described previously (Mueller et al., 2003
The amino acid sequence of the Arabidopsis genome annotation ATH1_pep_cm_20040228 (ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/) was used as the input sequence file for the PathoLogic hole-filler program. EcoCyc was chosen as the training data for the hole filler. The probability cutoff was set to 0.9.
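The role of the probability cutoff can be illustrated with a short filter. This is not the hole filler's code or interface: the candidate records and names below are hypothetical, and treating a probability equal to the cutoff as accepted is an assumption.

```python
PROBABILITY_CUTOFF = 0.9  # value stated in the text

def accept_candidates(candidates, cutoff=PROBABILITY_CUTOFF):
    """Keep (reaction, gene, probability) candidates at or above the cutoff."""
    return [c for c in candidates if c[2] >= cutoff]

# Hypothetical candidate genes for two pathway holes.
candidates = [
    ("reaction_1", "gene_a", 0.97),
    ("reaction_1", "gene_b", 0.55),
    ("reaction_2", "gene_c", 0.90),
]
accepted = accept_candidates(candidates)
```

A high cutoff such as 0.9 trades coverage for precision: fewer holes are filled, but the filled ones need less manual correction.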
Curators follow standard procedures and guides to collect and enter information into the databases (http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf). Information is collected from major textbooks describing general plant biochemistry or specific areas of plant biochemistry, and from primary literature searched at databases including PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) and Scirus (http://www.scirus.com/srsapp/). Curated pathway diagrams are entered with evidence codes and literature citations. Curators also write a summary describing the pathway's role and significance. Reactions are curated with EC numbers or EC classes and subclasses. Chemical structures are entered for compounds. Enzymes are curated with physical and catalytic properties and their coding genes. Evidence codes along with literature citations are assigned to enzyme activities.
To enhance the secondary metabolic pathway ontology and develop the subcellular component ontology, existing terms are collected from textbooks or other resources, including GO (http://www.geneontology.org). Additional terms are created when necessary to meet the database needs. Each term is defined and classified according to the "is-a" relationship. Synonyms and additional relationships to other terms such as "surrounded-by" and "component-of" are added if they exist.
We thank Aleksey Kleytman and Shijun Li for technical assistance, and Tanya Berardini, Leonore Reiser, and Wolf Frommer for reviewing the cellular component ontology of AraCyc and MetaCyc. We are grateful to Leonore Reiser and Eva Huala for critical reading of the manuscript.
Received January 28, 2005; returned for revision March 1, 2005; accepted March 21, 2005.
1 This work was supported by the National Science Foundation (grant no. DBI9978564) and by the National Institutes of Health (National Institute of General Medical Sciences; grant no. 1R01GM6546601).
[w] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.105.060376.
* Corresponding author; e-mail rhee{at}acoma.stanford.edu; fax 650-325-6857.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Berardini TZ, Mundodi S, Reiser R, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135: 745-755
Dixon RA (2001) Natural products and plant disease resistance. Nature 411: 843-847
Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res 11: 1425-1433
Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-D261
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100
Green ML, Karp PD (2004) A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5: 76
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5: 199-220
Hadacek F (2002) Secondary metabolites as plant traits: current assessment and future perspectives. Crit Rev Plant Sci 21: 273-322
Harborne JP, Baxter H (1993) Phytochemical Dictionary: A Handbook of Bioactive Compounds from Plants. Taylor & Francis, London
Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27-30
Karp PD (2000) An ontology for biological function based on molecular interactions. Bioinformatics 16: 269-285
Karp PD, Paley S, Krieger CJ, Zhang P (2004) An evidence ontology for use in pathway/genome databases. Pac Symp Biocomput 9: 190-201
Karp PD, Paley S, Romero P (2002) The Pathway Tools software. Bioinformatics 18: S225-S232
Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD (2005) EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 33: D334-D337
Krieger CJ, Zhang P, Mueller LA, Wang A, Paley S, Arnaud M, Pick J, Rhee SY, Karp PD (2004) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 32: D438-D442
Mueller LA, Zhang P, Rhee SY (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol 132: 453-460
Paley S, Karp PD (2002) Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics 18: 715-724
Pichersky E, Gang DR (2000) Genetics and biochemistry of secondary metabolites in plants: an evolutionary perspective. Trends Plant Sci 5: 439-445
Robinson T (1983) The Organic Constituents of Higher Plants. Cordus Press, North Amherst, MA
Singer AC, Crowley DE, Thompson IP (2003) Secondary plant metabolites in phytoremediation and biotransformation. Trends Biotechnol 21: 123-130
Verpoorte R, Memelink J (2002) Engineering secondary metabolite production in plants. Curr Opin Biotechnol 13: 181-187
Wink M (1988) Plant breeding: importance of plant secondary metabolites for protection against pathogens and herbivores. Theor Appl Genet 75: 225-233
Wink M (2003) Evolution of secondary metabolites from an ecological and molecular phylogenetic perspective. Phytochemistry 64: 3-19
Wortman JR, Haas BJ, Hannick LI, Smith RK Jr, Maiti R, Ronning CM, Chan AP, Yu C, Ayele M, Whitelaw CA, et al (2003) Annotation of the Arabidopsis genome. Plant Physiol 132: 461-468
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92