Plant Physiology 138:573-577 (2005)
© 2005 American Society of Plant Biologists
ChemMine. A Compound Mining Database for Chemical Genomics¹
Center for Plant Cell Biology, Department of Botany and Plant Sciences, University of California, Riverside, California 92521
Chemical genomics is a promising new technology for studying gene functions in the context of living organisms or cell systems. It complements existing molecular and genetics tools (e.g. mutagenesis, RNAi) by allowing finely tunable in vivo modulation of protein functions and cellular processes (Blackwell and Zhao, 2003).
Chemical genomics has several outstanding advantages over classical genetics and molecular techniques for studying gene functions. Standard genetics approaches target one gene at a time and provide limited opportunity to control the extent of the downstream cellular effects. By contrast, chemicals can be targeted with spatiotemporal precision against a selected spectrum of proteins. They can be applied in defined dosages to distinct cells, organs, or developmental stages, often with rapid response times and reversible effects. Since chemical switches can act in a similar manner across a range of model or nonmodel organisms, their identification is of great interest to researchers working with different model systems. Finally, chemicals can be used to inactivate a family of proteins with related sequences or structures in a single step. In the future, these "chemical family knock-downs" may be the method of choice for the functional characterization of paralogous genes with redundant functions.
In spite of the broad spectrum of new opportunities, chemical genomics has not yet evolved into a widely used strategy for biological systems analysis in academic research. This is due to several factors. One is the paucity of public-domain information resources and of compound search and analysis services for annotated drugs and agrochemicals. Another is the high cost of compound libraries and high-throughput equipment. This Update provides a short outline of the existing open-access informatics resources that are relevant for chemical genomics-based research and describes how ChemMine fills some of the missing links.
The critical software and database resources for bioactive chemical discovery projects are: tools for structure similarity comparisons, database searching, structure-activity comparisons, evaluations of the chemical descriptor (property) space, design of customized libraries (subsetting), lead optimization steps, and compound and screening databases. Despite the importance of these very basic enabling tools, most are not yet freely available. Recently, the first online services were established that give the public access to basic bioactivity information of drug-like compounds and virtual screening tools. The late start of such obvious and overdue information resources is particularly surprising since very similar resources are required in drug discovery, which is an established and well-funded research discipline (Strausberg and Schreiber, 2003).
The open National Cancer Institute (NCI) database was one of the first consolidated public efforts to change this situation by disseminating screening and bioactivity information for a larger compound set in a searchable database format for the cancer and HIV research community (Voigt et al., 2001).
To further facilitate the incorporation of chemical genomics-based approaches in the discovery process of novel protein functions and gene networks, we have developed the ChemMine database (http://bioinfo.ucr.edu/projects/PlantChemBase/search.php). The first release of this public service provides access to an integrated suite of analysis and information retrieval tools for compound searching, structure-based clustering, descriptor generation (chemical properties), and retrieval of published bioactivity and target protein information (Fig. 1).
At the current stage of this project, ChemMine centralizes compound structure and activity information from a growing number of public providers and vendors of chemical screening libraries. The incorporation of commercially available compounds provides access to their purchase information. This knowledge can be critical for follow-up studies and assembly of focused libraries in secondary screens when the resources for resynthesis of novel chemicals in larger quantities are limited or do not exist at all. It is expected that the current set of commercial compound collections in ChemMine (over 1 million) will quickly grow when more businesses realize the benefits of a public presence and express interest in participating in this project. In addition to commercial compounds, most collections from public initiatives are included in ChemMine. These highly annotated compound sets maximize access to bioactivity information, known target proteins, literature, and other useful annotation information, enabling the user to correlate screening results with available biological knowledge. Additional information will be included as it becomes available. Searches for analogs of metabolic compounds are available through the incorporation of the KEGG ligand database. Information about bioactive chemicals (e.g. known drugs, herbicides) and their functional characterization is provided through the data sets from ChEBI, ChemBank, NCI, PubChem, and other providers. The annotations from ChEBI illustrate the growing utility of these services (Fig. 1). This initiative was started to provide systematic target associations of small compounds that interfere with processes of living organisms. Via this linkage, ChemMine users can retrieve the target protein sequences, three-dimensional structures, and literature for annotated drugs or metabolic molecules that are available or hyperlinked in the UniProt database. 
Similar drug-to-target associations are available in the data sets from ChemBank and PubChem.
With regard to the specific needs of scientists working with proprietary or customized compound libraries, general purpose compound databases will remain incomplete no matter how many structures they contain. An additional reason for this limitation is that thousands of new compounds can be synthesized every day or their structures designed in silico. To counterbalance this inevitable incompleteness, the ChemMine project has a strong focus on online services. These features allow users to utilize most of ChemMine's analysis tools for external compound sets without being restricted to the compound coverage in the database. Since downstream analyses of compounds and their target proteins require the usage of various molecular modeling and computational chemistry programs, ChemMine supports interconversions of the most common structure formats (SDF, SMILES, PDB, etc.) for file exchange with other tools. The libraries from Open Babel (http://openbabel.sourceforge.net) are used for these reformatting steps.
The ChemMine interface allows queries in single or batch mode using one or many compound identifiers, compound names, or external annotations. The initial query results are displayed in a flexible table format that can be expanded and sorted by the chemical properties of the retrieved compounds. Annotations and structure images for each compound can be viewed on the next level for single or many entries. These pages contain links to additional information, such as available literature, target proteins, external annotation pages from different compound providers, and download options in different structure file formats. The structure images are generated with the batch rendering tool from the CACTVS package (Ihlenfeldt et al., 2002).
Structure-based clustering and descriptor space analyses are very useful strategies for both basic quantitative structure-activity relationship studies and lead optimization steps in compound screens. Structure-based clustering can be performed through the ChemMine interface using external or internal compounds or a combination of both. The similarity scores, generated by the fragment-based similarity search tool, are used for calculating the distance values required for clustering. The present set of clustering techniques consists of hierarchical clustering and a binning approach with variable similarity cutoffs. The open-source program Cluster 3.0 is used for the hierarchical clustering step (Eisen et al., 1998).
To identify clusters of structural similarity within entire libraries, ChemMine contains precomputed cluster tables for most of its compound sets. These tables summarize the number of similarity clusters using incremental similarity values as stringency cutoffs. The composition of each identified cluster is stored in the database and its members can be retrieved through the corresponding hyperlinks in each table. Since this data representation is particularly useful for evaluating the structural redundancy in customized compound sets (e.g. interlibrary comparisons), additional analyses will be uploaded to this site upon user request. Commercial libraries can only be included here after approval by their providers.
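The binning approach and the cluster tables described above can be illustrated with a small sketch. It is not ChemMine's implementation: the Tanimoto coefficient, the single-linkage rule for merging bins, the conversion of similarity to distance as 1 minus similarity, and all compound names and fragment sets are assumptions made only for this example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fragment sets (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def distance(a, b):
    """Distance value derived from the similarity score (assumed: 1 - similarity)."""
    return 1.0 - tanimoto(a, b)


def bin_clusters(fragments, cutoff):
    """Place compounds in the same bin when their similarity is >= cutoff.

    Bins are merged transitively (single linkage) with a small union-find.
    """
    parent = {name: name for name in fragments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(fragments, 2):
        if tanimoto(fragments[a], fragments[b]) >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in fragments:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def cluster_table(fragments, cutoffs):
    """Number of clusters found at each similarity cutoff."""
    return {cutoff: len(bin_clusters(fragments, cutoff)) for cutoff in cutoffs}


# Hypothetical compounds described by hypothetical fragment sets.
compounds = {
    "cmp1": {"C-C", "C-O", "C=O", "C-N"},
    "cmp2": {"C-C", "C-O", "C=O"},
    "cmp3": {"C-C", "C-N", "C-Cl"},
    "cmp4": {"S-S", "S-O"},
}

if __name__ == "__main__":
    for cutoff, count in cluster_table(compounds, [0.3, 0.5, 0.8]).items():
        print(f"cutoff {cutoff}: {count} clusters")
```

Raising the cutoff splits the library into more, tighter bins, which is why a table of cluster counts over incremental cutoffs gives a quick picture of structural redundancy in a compound set.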
More than 40 different descriptors can be created in ChemMine for any set of externally provided compounds or for those represented in the database. They are generated with the open-source JOELib computational chemistry package. They include molecular properties, such as molecular weight, octanol/water partition coefficient, counts of hydrogen-bond donors/acceptors, rotatable bonds, types of atoms, and reactive groups per molecule. The descriptors of the popular Lipinski's "rule of five" for drug-likeness prediction are included in this list (Lipinski et al., 1997).
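As an illustration of how such descriptors can be used for subsetting, the following sketch applies the rule-of-five thresholds (molecular weight over 500, partition coefficient over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors) to precomputed descriptor values. The class, the function names, the one-violation tolerance, and the compound values are hypothetical; in ChemMine the descriptors themselves come from JOELib, which is not called here.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptor values for one compound (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol/water partition coefficient
    h_donors: int       # hydrogen-bond donor count
    h_acceptors: int    # hydrogen-bond acceptor count


def rule_of_five_violations(d):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def filter_drug_like(library, max_violations=1):
    """Names of compounds with at most max_violations violations (assumed tolerance)."""
    return [
        name
        for name, d in library.items()
        if rule_of_five_violations(d) <= max_violations
    ]


# Hypothetical library with made-up descriptor values.
library = {
    "cmpA": Descriptors(320.4, 2.1, 2, 5),
    "cmpB": Descriptors(712.9, 6.3, 7, 12),
    "cmpC": Descriptors(540.0, 4.2, 3, 8),
}

if __name__ == "__main__":
    print(filter_drug_like(library))
```

Setting `max_violations=0` gives a stricter filter that keeps only compounds satisfying all four thresholds.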
The ChemMine project is unique in providing several new online tools (e.g. clustering, descriptor generation) and integrating them with a wide variety of bioactive, natural, and screening compounds from public and commercial providers. Based on the experience from chemical genomics studies in plants and numerous discussions with colleagues (Zhao et al., 2003) ...
In the future, we will further develop ChemMine as an open-source project by implementing several new features. First, the database will be augmented with bioactivity information from internal and external screening programs using standardized and interchangeable formats that support screens from an unlimited number of organisms, in addition to those from plants. Second, an upload functionality for compound structures and screening data from external researchers will be integrated. Third, additional structure search tools will be implemented to increase the speed and functionality of the similarity searches. Fourth, complex query functions will be added to enable filtering on various descriptor fields and other criteria. Fifth, automated upload routines will be developed to easily expand compound collections in the database and to update their annotation and provider information in a timely manner. Sixth, the developed software tools will be released to the public via download options. Seventh, multicomponent clustering using variable sets of molecular descriptors and structural similarities will be implemented. Finally, we will work on the integration and interoperability of ChemMine with the ChemBank, NCI, PubChem, ChEBI, and other projects in this area. This effort will strongly support the vision that public activities in this area should have the common goal of developing an ultimate "meta-database" as a central depository and mining service for compound and screening data.
We thank Eric Brauner, Caroline Shamu, and Stephen Haggarty from the Institute for Chemistry and Cell Biology for their continuous support of this project.
Received March 10, 2005; returned for revision March 31, 2005; accepted April 1, 2005.
1 This work was supported by the Center for Plant Cell Biology at the University of California, Riverside, and by the Office of Biological and Physical Research of the National Aeronautics and Space Administration (grant no. NNA04CC73C).
2 These authors contributed equally to the paper.
* Corresponding author; e-mail thomas.girke@ucr.edu; fax 951-827-4437.
www.plantphysiol.org/cgi/doi/10.1104/pp.105.062687.
Armstrong JI, Yuan S, Dale JM, Tanner VN, Theologis A (2004) Identification of inhibitors of auxin transcriptional activation by means of chemical genetics in Arabidopsis. Proc Natl Acad Sci USA 101: 14978-14983
Austin CP, Brady LS, Insel TR, Collins FS (2004) NIH Molecular Libraries Initiative. Science 306: 1138-1139
Baurin N, Baker R, Richardson C, Chen I, Foloppe N, Potter A, Jordan A, Roughley S, Parratt M, Greaney P, et al (2004) Drug-like annotation and duplicate analysis of a 23-supplier chemical database totalling 2.7 million compounds. J Chem Inf Comput Sci 44: 643-651
Blackwell HE, Zhao Y (2003) Chemical genetic approaches to plant biology. Plant Physiol 133: 448-455
Carhart RE, Smith DH, Venkataraghavan R (1985) Atom pairs as molecular-features in structure activity studies: definition and applications. J Chem Inf Comput Sci 25: 64-73
Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42: 1407-1414
Couzin J (2003) NIH dives into drug discovery. Science 302: 218-221
de Hoon MJ, Imoto S, Nolan J, Miyano S (2004) Open source clustering software. Bioinformatics 20: 1453-1454
Drews J (2000) Drug discovery: a historical perspective. Science 287: 1960-1964
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863-14868
Girke T, Lauricha J, Tran H, Keegstra K, Raikhel N (2004) The Cell Wall Navigator database. A systems-based approach to organism-unrestricted mining of protein families involved in cell wall metabolism. Plant Physiol 136: 3003-3008
Haggarty SJ, Koeller KM, Wong JC, Grozinger CM, Schreiber SL (2003) Domain-selective small-molecule inhibitor of histone deacetylase 6 (HDAC6)-mediated tubulin deacetylation. Proc Natl Acad Sci USA 100: 4389-4394
Ihlenfeldt WD, Voigt JH, Bienfait B, Oellien F, Nicklaus MC (2002) Enhanced CACTVS browser of the Open NCI Database. J Chem Inf Comput Sci 42: 46-57
Irwin JJ, Shoichet BK (2005) ZINC: a free database of commercially available compounds for virtual screening. J Chem Inf Comput Sci 45: 177-182
Lipinski C, Hopkins A (2004) Navigating chemical space for biology and medicine. Nature 432: 855-861
Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 23: 3-25
Oprea TI (2002) Chemical space navigation in lead discovery. Curr Opin Chem Biol 6: 384-389
Oprea TI, Gottfries J (2001) Chemography: the art of navigating in chemical space. J Comb Chem 3: 157-166
Savchuk NP, Balakin KV, Tkachenko SE (2004) Exploring the chemogenomic knowledge space with annotated chemical libraries. Curr Opin Chem Biol 8: 412-417
Schreiber SL (1998) Chemical genetics resulting from a passion for synthetic organic chemistry. Bioorg Med Chem 6: 1127-1152
Stockwell BR (2004) Exploring biology with small organic molecules. Nature 432: 846-854
Strausberg RL, Schreiber SL (2003) From knowing to controlling: a path from genomics to drugs using small molecule probes. Science 300: 294-295
Surpin M, Rojas-Pierce M, Carter C, Hicks G, Vasquez J, Raikhel N (2005) The power of chemical genomics to study the link between endomembrane system components and the gravitropic response. Proc Natl Acad Sci USA 102: 4902-4907
Voigt JH, Bienfait B, Wang S, Nicklaus MC (2001) Comparison of the NCI open database with seven large chemical structural databases. J Chem Inf Comput Sci 41: 702-712
von Grotthuss M, Koczyk G, Pas J, Wyrwicz LS, Rychlewski L (2004) Ligand.Info small-molecule Meta-Database. Comb Chem High Throughput Screen 7: 757-761
Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J Chem Inf Comput Sci 38: 983-996
Zhao Y, Dai X, Blackwell HE, Schreiber SL, Chory J (2003) SIR1, an upstream component in auxin signaling identified by chemical genetics. Science 301: 1107-1110
Zouhar J, Hicks GR, Raikhel NV (2004) Sorting inhibitors (Sortins): chemical compounds to study vacuolar sorting in Arabidopsis. Proc Natl Acad Sci USA 101: 9497-9501