First published online August 1, 2002; 10.1104/pp.003459
Plant Physiol, August 2002, Vol. 129, pp. 1448-1463

Using Genomic Resources to Guide Research Directions. The Arabinogalactan Protein Gene Family as a Test Case

Department of Plant Science, Waite Agricultural Research Institute, The University of Adelaide, RMB1, Glen Osmond, South Australia 5064, Australia (C.J.S.); 13 Riverway, Fulham Gardens, South Australia 5024, Australia (M.P.R.); and Plant Cell Biology Research Centre, School of Botany, University of Melbourne, Victoria 3010, Australia (K.L.J., B.J.J., Y.M.G., A.B.)

Arabinogalactan proteins (AGPs) are extracellular hydroxyproline-rich proteoglycans implicated in plant growth and development. The protein backbones of AGPs are rich in proline/hydroxyproline, serine, alanine, and threonine. Most family members have less than 40% similarity; therefore, finding family members using Basic Local Alignment Search Tool searches is difficult. As part of our systematic analysis of AGP function in Arabidopsis, we wanted to make sure that we had identified most of the members of the gene family. We used the biased amino acid composition of AGPs to identify AGPs and arabinogalactan (AG) peptides in the Arabidopsis genome. Different criteria were used to identify the fasciclin-like AGPs. In total, we have identified 13 classical AGPs, 10 AG-peptides, three basic AGPs that include a short lysine-rich region, and 21 fasciclin-like AGPs. To streamline the analysis of genomic resources to assist in the planning of targeted experimental approaches, we have adopted a flow chart to maximize the information that can be obtained about each gene. One of the key steps is the reformatting of the Arabidopsis Functional Genomics Consortium microarray data. This customized software program makes it possible to view the ratio data for all Arabidopsis Functional Genomics Consortium experiments and as many genes as desired in a single spreadsheet. The results for reciprocal experiments are grouped to simplify analysis and candidate AGPs involved in development or biotic and abiotic stress responses are readily identified. The microarray data support the suggestion that different AGPs have different functions.
With the completion of the Arabidopsis genome (Arabidopsis Genome Initiative [AGI], 2000) [...]

The protein backbones of many proteoglycans and
glycoproteins of the extracellular matrix are encoded by large
multigene families. These wall "proteins" are believed to be
involved in many aspects of plant growth and development, but their
roles are poorly defined. A large number of these proteins are rich in
Pro and/or Hyp. The Pro-/Hyp-rich glycoproteins (P/HRGPs) were
originally classified into three separate classes: the Pro-rich proteins (PRPs), the extensins, and the AGPs (Showalter, 1993).

We have proposed the following nomenclature to provide some consistency
to the naming of the P/HRGPs (Johnson et al., 2002) [...]

For the extensins, agreement needs to be reached within the cell wall
community about whether the extensin "glycomodule," i.e.
Ser-Pro3-4, is sufficient to define an extensin, or
whether a larger repeat motif containing Tyr, Lys, and Val defines an extensin. Throughout this manuscript, we have adopted the latter view
that a "true" extensin contains the extensin glycomodule (Kieliszewski and Shpak, 2001) [...]

Individual members of each P/HRGP subclass are difficult to separate
biochemically. It has also been difficult to clone these genes because
of the repetitive nature of their protein backbones and, in the case of
AGPs, their overall low sequence similarity (Du et al., 1996).

Putative orthologs of the classical AGPs originally identified in pear
(Pyrus communis; Chen et al., 1994) [...]

There are no obvious orthologs of the Asn-rich chimeric
(nonclassical) AGPs in Arabidopsis. There are three proteins in the
Arabidopsis annotated database (encoded by At1g03820, At1g28400, and
At2g33850) that contain Asn-rich C termini that are similar to NaAGP2
and PcAGP2 (Gaspar et al., 2001).

AGPs are implicated in diverse roles in plant growth and development.
One of the most influential findings in AGP research in the last 5 years was the biochemical evidence that certain AGPs are anchored to
the plasma membrane by glycosylphosphatidylinositol (GPI) anchors
(Youl et al., 1998).

One of the attractions of using Arabidopsis to study AGP function is
the availability of DNA insertion mutants. Because the proposed
function(s) of AGPs are diverse, it is not possible to devise a mutant
screen that would specifically identify mutants in AGP genes. The
publicly available pools of DNA from T-DNA-tagged lines have made it
possible to identify several AGP mutants (Gaspar et al., 2001).

Now that the genome of Arabidopsis is sequenced, it is theoretically possible to identify all of the AGP protein backbone genes in a single species. We have used novel software approaches based on biased amino acid composition and, in the case of the AG-peptides, their short length, to identify AGPs in Arabidopsis. We have adopted a systematic approach to obtaining and evaluating the publicly available Arabidopsis resources to help us select specific subsets of genes for targeted experimental approaches. This approach can be used to help guide research direction in all gene families and should accelerate outcomes by enabling the choice of the most appropriate experiments for each family member.
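
The biased-composition search can be illustrated with a short sketch. It is written in Python rather than Perl and is not the authors' program: the function names `past_percent` and `select_biased`, the strict greater-than comparison, and the toy sequences are assumptions made for illustration. The sketch computes the percentage of Pro, Ala, Ser, and Thr (PAST) residues in each protein and keeps those above a chosen cutoff, optionally within a length range such as would be needed for the short AG-peptides.

```python
from typing import Dict, Iterator, Optional, Tuple

PAST = frozenset("PAST")  # one-letter codes for Pro, Ala, Ser, Thr


def past_percent(seq: str) -> float:
    """Return the percentage of residues in seq that are P, A, S, or T."""
    seq = seq.upper().rstrip("*")  # drop a trailing stop symbol, if present
    if not seq:
        return 0.0
    return 100.0 * sum(res in PAST for res in seq) / len(seq)


def select_biased(
    proteins: Dict[str, str],
    min_past: float,
    min_len: Optional[int] = None,
    max_len: Optional[int] = None,
) -> Iterator[Tuple[str, int, float]]:
    """Yield (identifier, length, PAST %) for proteins that pass the filters."""
    for ident, seq in proteins.items():
        length = len(seq.rstrip("*"))
        if min_len is not None and length < min_len:
            continue
        if max_len is not None and length > max_len:
            continue
        percent = past_percent(seq)
        if percent > min_past:  # assumption: the cutoff is exclusive (">50% PAST")
            yield ident, length, percent


if __name__ == "__main__":
    # Hypothetical toy sequences, not Arabidopsis proteins.
    toy = {"toy1": "APSPTPASAPTSAPAG", "toy2": "MKGLLVDEQRNWFHIY"}
    for ident, length, percent in select_biased(toy, min_past=50.0):
        print(f"{ident}\t{length}\t{percent:.1f}")
```

For a search restricted to short proteins, the same function could be called with, for example, `min_past=35.0, min_len=50, max_len=75`. Candidates returned this way would still need to be checked for a secretion signal and inspected individually, because composition alone does not distinguish AGPs from other PAST-rich proteins.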

Finding AGPs Based on Biased Amino Acid Composition

With the completion of the Arabidopsis genome, we wanted to do a
final check for AGP protein backbone genes. Finding all AGP family
members using BLAST is difficult due to their low degree of amino acid
similarity (Du et al., 1996).

Not all of the known AGPs were identified using this high PAST
percentage, so the program was run with a lower cutoff (50% PAST).
Using these criteria, 62 proteins were identified, of which 49 were
predicted to be secreted. Only one of the AGPs that we had previously
identified (Schultz et al., 2000) [...]

Finding AG-Peptides Based on Size and Biased Amino Acid Composition

To identify AG-peptides, the same set of 25,617 predicted proteins
from the Arabidopsis genome was searched to identify proteins between
50 and 75 amino acid residues in length. A total of 308 proteins were
found (Table I). This number was reduced to a manageable level by first
selecting all proteins with a PAST composition of >35% and then
selecting the proteins that are predicted to be secreted. As with the
classical AGPs, the ones that were not predicted to be secreted did not
look like AGPs and tended to have most of their PAST residues in the
N-terminal secretion signal or C terminus, where the GPI anchor signal
would be for an AG-peptide. Of the 19 proteins identified, two new
AG-peptides were identified (At3g57690 and At5g40730), as well as the
known AG-peptides (Schultz et al., 2000).

Finding FLAs Using Hidden Markov Models

FLAs, along with other classes of chimeric AGPs, are not
identified using the biased amino acid search at the 50% PAST
threshold because the fasciclin domain(s) are large
compared with the regions containing the AGP glycomodule(s). For
example, the entire FLA7 protein is only 39% PAST. However, if the
single fasciclin domain is ignored, the remaining protein is 52% PAST. When the program is run at 39% PAST, FLA7 is identified along with 344 other proteins, including a histone H1 (At2g30620, 46% PAST), a [...]

We were interested in determining whether fasciclin domains were
associated with protein domains other than AGP domains, so we adopted a
strategy to identify all Arabidopsis proteins containing fasciclin
domains. Fasciclin domains are approximately 100 amino acids long and
are not well conserved (Kawamoto et al., 1998).

A hidden Markov model for the 88 fasciclin domains in the Pfam database
was generated using HMMbuild (Durbin et al., 1998).

In total, we identified 13 classical AGPs, 10 AG-peptides, three
"basic" AGPs containing a Lys-rich domain, and 21 FLAs. All of these genes are listed in Table II. A
flow chart of data to collect for each gene was adopted to help
prioritize and guide our research (Fig. 1). A list of the Web sites used at each
step is included in Figure 1 and details of the preferred options at each site are provided in "Materials and Methods."

All of the annotated proteins were checked for the presence of an N-terminal secretion signal and for the C-terminal signal required for the addition of a GPI anchor. PSORT calculates a probability for the subcellular localization of each protein, e.g. cytoplasmic, organellar (membrane or lumen), extracellular, or GPI anchored. Most of the AGP
protein backbone genes are predicted to be GPI anchored (Table II). For
AGPs, PSORT (Nakai and Horton, 1999) [...]

Profilescan was used to confirm the presence of fasciclin domains in the FLAs identified using hidden Markov models. Profilescan identifies many motifs, including common motifs such as potential N-linked glycosylation sites. Most of the fasciclin domains contained one or more N-linked glycosylation sites.

After identifying the proteins, we were interested in determining what is known about the expression profile of each of the AGP protein backbone genes.

EST Contigs and Expression Summaries

TIGR maintains a database, the Arabidopsis Gene Index (AtGI), that combines the DNA sequence of the annotated proteins from the Arabidopsis genome and ESTs into contigs of overlapping sequence. This information is useful because it identifies expressed genes and provides an "expression summary" of each gene based on the library source of each EST. In addition, the alignment of clones can be used to check the annotation of the genomic sequences. For example, if the annotated protein is significantly shorter than the region spanned by all of the ESTs, it is possible that the protein is not correctly annotated. To access the TIGR contigs, it is necessary to first identify a single
EST by doing a translated BLAST search (tBLASTn) against the EST
database (Altschul et al., 1997).

Many AGPs Are on the AFGC Microarray

Thirty-five of the 47 AGP genes (including FLAs) are represented
on the current AFGC microarray (Wu et al., 2001b).

Reformatting AFGC Microarray Data to Compare Multiple Genes in All the Experiments

To make the AFGC microarray data more accessible, a computer program (Perl script) was written that makes it possible to view the ratio data for all AFGC experiments and as many genes as desired in a single spreadsheet. A "snapshot" of the output from this program is in Table III. By reformatting the data, it is easier to check the consistency of results where there are multiple ESTs for the same gene and also to identify experiments where there is no data.
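
The kind of reformatting described above can be sketched as a simple pivot. This is an illustration in Python, not the Perl script itself, and it assumes the ratio data have already been parsed into `(gene, EST, experiment, ratio)` records. The record layout, the function names `pivot_ratios` and `to_tsv`, and the `ND` marker for missing values are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Record = Tuple[str, str, str, float]  # (gene, EST, experiment ID, ratio)
Row = Tuple[str, str, List[Optional[float]]]


def pivot_ratios(
    records: Iterable[Record],
    column_order: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[Row]]:
    """Arrange ratios as one row per (gene, EST) and one column per experiment."""
    table: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)
    seen = set()
    for gene, est, experiment, ratio in records:
        table[(gene, est)][experiment] = ratio
        seen.add(experiment)
    # column_order lets a caller place reciprocal experiments side by side.
    columns = list(column_order) if column_order is not None else sorted(seen)
    rows: List[Row] = []
    for gene, est in sorted(table):  # ESTs of the same gene end up adjacent
        cells = [table[(gene, est)].get(exp) for exp in columns]
        rows.append((gene, est, cells))
    return columns, rows


def to_tsv(columns: List[str], rows: List[Row], missing: str = "ND") -> str:
    """Write the pivoted table as tab-delimited text; None becomes `missing`."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "EST"] + list(columns))
    for gene, est, cells in rows:
        formatted = [missing if c is None else f"{c:.2f}" for c in cells]
        writer.writerow([gene, est] + formatted)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    demo = [
        ("geneA", "est1", "exp1", 2.5),
        ("geneA", "est2", "exp1", 2.1),
        ("geneA", "est1", "exp2", 0.4),
        ("geneB", "est3", "exp2", 1.0),
    ]
    print(to_tsv(*pivot_ratios(demo)), end="")
```

Keeping a missing value as an explicit marker rather than as 1.0 or an empty ratio preserves the distinction between an experiment with no data and one with a nonsignificant ratio.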
The experiments that produced significant results for the AGPs
are summarized in Table IV. The
microarray results support the suggestion that different AGPs have
different functions (Knox, 1995).

Unexpected Complexity in the Response of AGP2 to Acid and Al

RNA gel-blot analysis was used to experimentally verify the
results of the microarray experiment for Al stress for AGP2
because data verification is a key to interpreting DNA microarray
results (Wu et al., 2001a).

Affymetrix Array

The Affymetrix microarray GeneChips
(Affymetrix, Sunnyvale, CA) are increasingly being used because
of their in-built controls. The group of Julian I. Schroeder (University
of California, San Diego) has made it easier to determine if a
gene of interest is on the Affymetrix chip (Ghassemian et al., 2001).

Checking for Insertion Mutants in Each Gene

Initially, we looked for tagged mutants using PCR from the pools
of DNA from the Feldman lines that are distributed by the Arabidopsis
Biological Resource Center (ABRC; Ohio State University, Columbus),
using the method described by McKinney et al. (1995).

The sequencing of the border regions of large populations of insertion
mutants makes it possible to identify mutants more efficiently (Liu et
al., 1995).

Finding Genes Using Biased Amino Acid Composition and Size

With the completion of the Arabidopsis genome, we wanted to be certain that most of the AGP protein backbone gene family members were identified because we are attempting a systematic analysis of AGP function. Finding AGP protein backbone genes with BLAST searches is very time consuming because, for each gene, it is necessary to analyze every promising "hit" by obtaining the full-length sequence for the similar protein and determining whether the sequence matches the criteria for an AGP. By designing our own software, we evaluated each of the 25,617 predicted proteins in the Arabidopsis genome for the most prominent feature of an AGP, namely the high PAST. This analysis (>50% PAST) identified 62 candidate genes, fewer than the number of sequences that can be returned for a single BLAST search (Table I). Using tailored approaches for each type of AGP, we have identified three new classical AGPs and two new AG-peptides (plus three AG-peptides that we had previously identified but not published). Only one of the AGPs (AGP3) that we had previously identified was not
found using this approach (Schultz et al., 2000).

We are confident that we have identified most of the AGPs that have been sequenced because most of the classical AGPs and AG-peptides do not have introns and therefore their annotation is relatively straightforward. Those that do have introns have a single intron between the exon coding for the mature protein backbone and the exon coding for the C-terminal signal for addition of the GPI anchor. Therefore, even if the AGPs with introns are incorrectly annotated because the second exon is missing, they would still be detected using the biased amino acid approach.

We have not identified all of the proteins containing AGP glycomodules.
Proteins with AGP glycomodules include the FLAs, the chimeric
(nonclassical) AGPs (Mau et al., 1995) [...]

Finding all of the protein backbones containing AGP glycomodules could
be achieved by calculating the PAST percentage in overlapping "windows" of 15 to 25 amino acid residues. A "windows"
approach should pick up fewer false positives than further reducing
the PAST percentage. When the PAST percentage is reduced to 39%, the level needed to identify FLA7, other proteins such as a histone H1
(At2g30620, 46% PAST), a [...]

To identify all of the FLAs, we decided to use the fasciclin domains as
our searching criteria. Twenty-one proteins in Arabidopsis contain one
or two fasciclin domains and all of these proteins have at least one
region containing AGP glycomodules. The size of the AGP region is
variable (data not shown). The FLAs can be separated into three classes
(Gaspar et al., 2001).

In plants, all of the proteins containing fasciclin domains also
contain AGP glycomodules, suggesting that regions of the protein
backbone will be glycosylated with large O-linked
polysaccharide chains. This is generally not the case with animal
proteins containing fasciclin domains. However, the fasciclin domains
in the protein AlgalCAM, involved in cell adhesion in Volvox
carteri, are flanked by SPn and
TPn motifs (Huber and Sumper, 1994).

Other P/HRGPs in Arabidopsis

Our search for AGPs also identified 19 genes encoding extensins.
The details of the extensins are described elsewhere (Johnson et al.,
2002).

Twelve of the 19 extensins identified here contain a Ser-Pro-Ser-Pro
motif in the middle of every second Tyr-, Lys-, and Val-rich spacer
[e.g. S(P)4YVYSS(P)4YYSPSPKV(D/Y)YK].
This suggests that these extensins will have both large AG-containing
polysaccharides (attached to non-contiguous Hyp residues) and short
arabinosyl chains (attached to contiguous Hyp residues; Goodrum et al.,
2000).

Extensins, unlike the AGPs, contain conserved repetitive motifs, so it may have been possible to identify many of the extensins using BLAST searches. One advantage of the biased amino acid approach is that the entire protein is returned in the output, rather than a small region of the protein. Therefore, it is possible to classify the protein immediately.

The software we have developed can be modified for many different
classes of proteins with biased amino acid compositions, e.g. the
Gly-rich proteins and the PRPs (Johnson et al., 2002).

Identifying and Evaluating Genomic Resources

DNA insertion mutants are essential in our quest for understanding
the function of AGPs. Finding tagged mutants is now considerably easier
than 5 years ago due to the widespread sequencing of the genomic DNA
flanking DNA insertions (Liu et al., 1995).

The first DNA insertion mutants we identified for AGP protein backbone
genes came from PCR screening of DNA pools from the Feldman lines
(McKinney et al., 1995).

Finding phenotypes for the other AGP mutants is a major challenge. Two
different approaches have been used by other researchers to uncover
phenotypes. One is to apply a wide range of environmental conditions
and/or stresses (Meissner et al., 1999) [...]

Large tissue-specific libraries from the Kazusa Research Institute
(Chiba, Japan) provide an electronic RNA gel-blot analysis (Asamizu et al., 2000).

AFGC Microarray Experiments Are a Valuable Resource to the Arabidopsis Community

The AFGC microarray results are particularly attractive to researchers of genes with unknown function. Until now, it has been very difficult to analyze all of the members of a multigene family in all of the AFGC experiments. By reformatting the data, it is possible to view the ratio data for all AFGC experiments and as many genes as desired in a single spreadsheet. One of the other difficulties in interpreting the AFGC data is that it is very time consuming to determine what experiments were actually performed. This is necessary to identify the appropriate control (reciprocal) experiments. By grouping reciprocal experiments and providing more details about each experiment, we have made the AFGC data more accessible. We will collaborate with the AFGC to develop a Web-based version of this software for inclusion at the Stanford Microarray Database. This will make it even easier for researchers to: (a) analyze their favorite genes, (b) check the consistency of results where there are multiple ESTs for a single gene, and (c) identify experiments where there is no data. This last point is important because a nonsignificant ratio has a very different meaning from no data.

Specific AGPs Respond to Biotic and Abiotic Stress

One hypothesis is that all AGPs of the same subclass would have
the same function. The microarray results suggest that this is not the
case. Rather, specific AGPs from each subclass respond to similar
stimuli. Classical AGPs (AGP1 and AGP2), Lys-rich
AGP18, AG-peptide AGP12, and FLAs
(FLA2 and FLA18) are up-regulated by antimycin A
treatment (experiments 5197 and 5198, Table IV). This treatment leads
to stress by inhibiting electron transport in the mitochondria and
induces the expression of alternative oxidase. In the published analysis
of this microarray experiment, cluster analysis was used to identify
other experiments where a significant number of genes were regulated in
the same manner (Yu et al., 2001).

To validate the results of the Al microarray experimentally, RNA gel-blot analysis was used. Our results show that at 3 h after treatment with Al, there is minimal difference between the treated and untreated samples (Fig. 2). However, at 8 and 24 h, the expression of AGP2 is much higher in the plants grown in the presence of Al. Unfortunately, there was no control for the zero time point in this experiment, when the media was changed to a low-pH media for both the Al-treated and untreated samples. The simplest scenario is that AGP2 expression is relatively abundant before changing the media and that minimal changes in expression levels occur in the first 3 h after the media change. This is supported by the fact that five of the 14 ESTs for AGP2 were obtained from a library made from liquid-cultured seedlings (AtGI library no. 5338). The dramatic differences seen at 8 and 24 h after the addition of the stress suggest that AGP2 is not involved in the recognition of stress, but rather, it is an integral part of the plant response to Al stress. Precisely how AGP2 helps the plant respond to Al stress is an important
question for the future. It is interesting that only 255 of 8,000 nonredundant genes on the AFGC array were specifically up-regulated by
Al. AGP2 showed the 42nd highest ratio between treated and
untreated samples on the array. As with all microarray experiments, it
is important to validate the result before proceeding with further
analysis (Wu et al., 2001b).

Spatial bias occurs when the expression ratios are influenced by the
physical position of the genes on the array (Finkelstein et al., 2002).

AGP2 is also up-regulated by biotic stress. When wild-type
plants are infected with the bacterium X. campestris,
AGP2 expression is higher than in mock-treated plants.
However, in etr1 plants, the levels of AGP2
expression were not affected by bacterial infection. etr1
mutants lack one of the ethylene receptors (Hall et al., 1999).

A Different Subset of AGPs Respond to Developmental Changes

AGPs are known to be abundant components of the transmitting tract
of styles (Sommer-Knudsen et al., 1997).

At least one AGP from each subclass is
differentially expressed in roots compared with undifferentiated or
redifferentiating tissue (Table IV). These microarray experiments were
performed in tissue culture by comparing roots with root explants that
had been placed on callus-inducing media and subsequently on
shoot-inducing media. These findings support the theory that certain
AGPs are markers of cell identity (Dolan et al., 1995 Our microarray reformatting program is complementary to a new program released by AFGC called Expression Viewer. Expression Viewer is designed to identify groups of genes that have similar expression patterns over several different experiments in the same category (i.e. abiotic stress or development). This option is accessed by selecting the second (right) icon that appears on selected ESTs (i.e. those on the array) after searching the clone list (see "Materials and Methods"). This program is particularly useful for identifying unrelated genes that are regulated in the same manner as the gene of interest. The Expression Viewer is not as "sensitive" as our program at picking up genes with similar expression patterns over only a few experiments. For example, if you choose the EST for AGP18 (205N2T7), the results in the hormone dataset of the expression viewer only show two ESTs (both corresponding to a non-AGP gene, At1g22530). The hormone dataset of experiments includes the 30-min antimycin treatment where AGP18, FLA2, and FLA18 are all significant in the both the test and the reciprocal experiments (Table IV). In the development dataset of experiments that includes the flowers to leaves comparison, two of the FLAs, FLA1 and FLA8, are identified as being regulated in a similar manner as AGP18, as are seven other non-AGP genes (e.g. an endo-1,4-beta glucanase [122H24T7/At4g02290]). However, many other AGP genes were identified as significant in our analysis (Table IV). Our approach has the limitation that it only looks at a user-defined subset of genes. By combining these two approaches, and performing whole-gene family analysis on the different classes of genes identified by Expression Viewer, it should be possible to identify other gene families that are important for AGP function.
Evaluating the available genomic resources, as outlined in Figure
1, should help all researchers determine which family members to
concentrate on for specific categories of experiments. In some cases,
it will be desirable to concentrate on genes with similar expression/response profiles, or it may be preferable to choose ones
with different profiles. By checking just a few Web sites, researchers can determine whether an insertion mutant already exists
for the gene of interest, and failing this, they can choose to make
knockouts using RNA interference (Wesley et al., 2001 When experiments are performed that require the sampling of plant tissue, the EST expression summary and AFGC microarray data will help select the most appropriate tissue for analysis to maximize the levels of RNA for all genes. This information will also help in the choice of genes for analysis. Obviously, all of the genes with significant microarray results in test and reciprocal experiments should be used. In many cases, only a few genes need to be tested. The inclusion of genes that are borderline (i.e. one significant value and the other value either less than 0.66 or greater than 1.5 as appropriate) is easily done and may prove informative. If the results are not significant, this gene can be used as a control. In cases where mutants do not have a phenotype, the information accumulated will help researchers make more informative choices about the treatments to perform to uncover phenotypes and which mutants to cross. For example, AGP1 and AGP2 are regulated in a similar fashion; therefore, they are more likely to be functionally redundant than say AGP1 and AGP6. Therefore, a cross between agp1 and agp2 mutants may be more informative than a cross between agp1 and agp6 mutants. The new data on AGPs presented in this manuscript support the suggestion that different AGPs have different functions. We are initially focusing our research effort on those candidate AGP genes, based on EST and microarray data, which are specifically implicated in plant development and in biotic and abiotic stress responses and for which we have access to putative loss-of-function mutants.
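
The selection rule described above (significant in both the test and the reciprocal experiment, or borderline when only one of the two is significant) can be written down as a small classifier. The sketch below is an interpretation rather than the authors' procedure. It assumes that a reciprocal experiment swaps the two channels, so that a gene up-regulated in the test appears down-regulated in the reciprocal, and it assumes significance cut-offs of 2.0 and 0.5, which this article mentions only as the limits for "outlying" histogram bars. The borderline limits of 1.5 and 0.66 are taken from the text. The function name `classify` and the category labels are hypothetical.

```python
from typing import Optional


def classify(
    test: Optional[float],
    reciprocal: Optional[float],
    sig_hi: float = 2.0,    # assumed significance cut-off (up)
    sig_lo: float = 0.5,    # assumed significance cut-off (down)
    near_hi: float = 1.5,   # borderline limit from the text
    near_lo: float = 0.66,  # borderline limit from the text
) -> str:
    """Label a gene from its ratios in a test and a reciprocal experiment."""
    if test is None or reciprocal is None:
        return "no data"  # kept distinct from a nonsignificant ratio
    if test >= sig_hi and reciprocal <= sig_lo:
        return "up"
    if test <= sig_lo and reciprocal >= sig_hi:
        return "down"
    # Borderline: one value significant, the other past the weaker limit
    # in the direction expected for a channel swap.
    if (test >= sig_hi and reciprocal < near_lo) or (
        test > near_hi and reciprocal <= sig_lo
    ):
        return "borderline up"
    if (test <= sig_lo and reciprocal > near_hi) or (
        test < near_lo and reciprocal >= sig_hi
    ):
        return "borderline down"
    return "not significant"


if __name__ == "__main__":
    # Hypothetical ratios for illustration only.
    for pair in [(3.0, 0.3), (2.2, 0.6), (1.1, 0.9), (None, 1.0)]:
        print(pair, classify(*pair))
```

A gene labeled "not significant" here is a candidate control, whereas one labeled "no data" cannot be interpreted either way for that experiment.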
Plant Material

Wild-type Arabidopsis (Columbia-0 strain CS1092, ABRC) plants were used. For the Al stress experiment, seeds were sterilized and grown in 50 mL of media in 250-mL conical flasks, approximately 40 seeds per flask. Flasks were rotated slowly at room temperature under continuous fluorescent light. Plants were germinated and grown for 11 d in an unbuffered media containing 250 µM NH4SO4, 250 µM Ca(NO3)2.4H2O, 200 µM KH2PO4, 1 mM CaCl2.2H2O, 1 mM MgSO4.7H2O, 1 mM K2SO4, 1 µM MnSO4.4H2O, 5 µM H3BO3, 0.05 µM CuSO4.5H2O, 0.2 µM ZnSO4.7H2O, 0.02 µM Na2MoO4.2H2O, 0.001 µM CoCl2.6H2O, and 1% (w/v) Suc (measured pH was 5.4). After 11 d, the media was poured off and 50 mL of fresh low-pH media, either with or without 50 µM AlCl3.6H2O, was added. The low-pH media was the same as the germination media, except that KH2PO4 was omitted and replaced by 3 mM homopiperazine-1,4-bis(2-ethanesulfonic acid), to give a pH of 4.5. Three separate flasks contained low-pH media with 50 µM AlCl3 and three control flasks contained the low-pH media without Al. Two flasks were harvested for each time point (3, 8, and 24 h), one with, and one without, AlCl3.

Finding AGPs and AG-Peptides Using Biased Amino Acid Composition and Length

A Perl script (protein bias) was written to calculate the PAST
for all of the proteins in the Arabidopsis annotated database. The
program can be downloaded from our Web site
(http://planta.waite.adelaide.edu.au/people/cs/index.htm). The
annotated database (ATH1.pep; August 10, 2001) was downloaded from the
TIGR Web site
(ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/). For
information on the Perl language, consult Wall et al. (2000).

Proteins were classified as AGPs if they did not contain repeats
associated with extensins or PRPs (e.g. Ser-Pro4 or
Pro-Pro-Xaa-Yaa-Lys), but contained predominantly Ala-Pro, Ser-Pro, or
Thr-Pro throughout the protein with no more than 11 amino acid residues
between consecutive Pro residues. The exception here is the Lys-rich
AGPs that are a subclass of GPI-anchored AGPs. These AGPs have a
Lys-rich domain of approximately 16 amino acid residues that is flanked
on both sides by AGP glycomodules. AGPs were defined as chimeric if
they contained a region with Ala-Pro, Ser-Pro, and/or Thr-Pro motifs and other regions with 20 or more amino acid residues between (A/S/T)P
motifs (excluding the N- and C-terminal signals). These proteins
included an ENOD20-like protein (At4g27520), a nonspecific lipid
transfer protein (At1g36150), and two blue copper-binding proteins
(At3g60280 and At5g53870). In most cases, the AGPs and the AGP chimeric
protein backbones were predicted to be GPI anchored by PSORT (Nakai and
Horton, 1999).

To identify AG-peptides, the "short" output from the protein bias
program (again searching the 25,617 annotated proteins in the
Arabidopsis database) was analyzed. Proteins were classified as
AG-peptides if the encoded protein backbone was between 55 and 75 residues in length and also contained an N-terminal signal sequence and
at least two consecutive Ala-Pro or Ser-Pro motifs in the mature
protein backbone. AGP23 contained only two consecutive Ala-Pro or
Ser-Pro motifs and all of the other AG-peptides (AGP12-AGP16 and
AGP20-AGP24) contained three consecutive motifs. All of the AG-peptides, except AGP16 and AGP20, are predicted to be GPI anchored by PSORT (Nakai and Horton, 1999 Proteins were classified as extensins if they contained repeats of Ser-Pro3 and/or Ser-Pro4 and these repeats were mostly separated by Tyr, Lys, and Val residues. One of the extensins (At4g08370) contained Ser-Pro5 repeats and an S(V/A) PR(I/V)(P/T) FIY spacer. The single "hybrid" P/HRGP (At1g62763) contained many Ser-Pro and Ser-Pro2 motifs, two SSPPPSLSLPSSPPPPPP motifs in the N-terminal domain of the mature protein (116 residues), and a C-terminal domain (171 residues) with similarity to citrus pectin methyl esterase (At1g62763). The N-terminal domain is not rich in Tyr, Lys, and Val; therefore, this protein is not considered an extensin. There are no ESTs for this "hypothetical" protein. The proteins classified as others included several PRPs/hybrid PRPs (At2g27380, At3g22120, At4g22470, and At5g14920), a possible En/Spm transposon (At2g28440), and proteins containing no known or consistent motifs (At5g11990, At2g22510, At1g31250, and At3g22070) as determined by Profilescan (http://hits.isb-sib.ch/cgi-bin/PFSCAN). The two proteins classified as short extensins may not be annotated correctly based on TIGR contigs (data not shown). To find FLAs, we obtained the full Pfam alignment of 88 fasciclin
domains from the Pfam Web site
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF02469). This alignment
was saved as a text file and used to build a hidden Markov model for
fasciclin domains using the HMMbuild program in HMMER2.2 (Durbin et
al., 1998).

A list of the CloneIDs for many of the ESTs that we sequenced is in
Schultz et al. (2000).

Key Web Sites and Options Needed to Access Information

We have developed a flow chart to streamline the information available to us about the AGP genes (Fig. 1). The links to follow and/or the options to select at each Web site are detailed below using the numbering system in Figure 1.

Link 1 is genomic information. By typing the GenBank accession or
GenInfo Identifier number and selecting search, you get a page
containing the following information: AGI locus (e.g. At1g35230); bacterial artificial chromosome locus (e.g. T9I1.2); and a
graphical display of the annotation of the gene using the three
commonly used gene prediction programs, GlimmerA (Mihaela Pertea,
http://www.tigr.org/tdb/glimmerm/glmr_form.html), Genscan+ (Chris
Burge, http://genes.mit.edu/GENSCAN.html), and GenMarkHMM (Mark
Borodovsky, http://opal.biology.gatech.edu/GeneMark). These and
other annotation programs have been reviewed recently (Aubourg and
Rouzé, 2001).

In link 2, SignalP is used to determine if the protein contains an
N-terminal signal sequence for targeting of the protein (Nielsen et
al., 1997 In link 3, PSORT (Nakai and Horton, 1999 In link 4, Profilescan allows you to search Prosite and Pfam databases for protein motifs. You should select all four databases. In link 5, to find the "first" EST for each gene, it is best to do a tBLASTn search against the EST database. It is necessary to choose database EST_others (the default is nonredundant [nr]). To increase the speed, limit the search to Arabidopsis ESTs (select from the pull-down menu). To maximize the chance of finding an EST, select the following options: turn filtering off (especially important for repetitive proteins), set the word size to 2, and increase the expectation to 100 or 1,000. Once an EST is found with high similarity (>95%), the GenBank accession number (GB#) can be pasted into the fourth box on the TIGR AtGI Reports page (see 6 below) to find the contig (TC#) for the specific gene. In link 6, the GenBank accession number of any EST can be used to find the contig. To view the contig, select the link for contig (e.g. TC109594). The contig shows the overlapping ESTs and indicates the source of each EST. To determine the library source of each EST, select the expression profile button to view the summary of hits from each library. In link 7, the best way to determine if any of the ESTs are on the AFGC array is to enter the AGI locus (e.g. At5g10430). This will provide a list of all ESTs for the gene. The ESTs that are on the array will come up with two colored icons. The left (array) icon gives you the histogram showing the ratio data of all experiments. If nothing appears when you click an icon, different Web browser should be tried (e.g. Internet Explorer or Netscape 6). To save all of the data for this EST onto a local computer, click on the "download all data" button. By saving this file as a tab-delimited file with an appropriately formatted filename (see below), this data can be reformatted using our custom software to make it easier to interpret the AFGC microarray data. 
The right (Expression Viewer) icon on the EST clone list is explained in "Discussion." First-time users of the AFGC microarray facility may want to click one or more of the "outlying" green bars on the histogram (i.e. those with a ratio of less than 0.5 or greater than 2.0). The experiments with the selected ratio will appear on the right (e.g. Shoot Development Affy Scan 4). Click on the "display data" button to obtain more information about the experiment(s) of interest. Select the "option" icons on this "search results" page to obtain details about each experiment. Click on the "view" icon to see what RNA samples were used in channel 2 (red) and channel 1 (green). Note that the stated ratio should be red (R) to green (G). The rightmost of the option icons on the search results page provides a scatter plot, which gives an indication of how many genes have significant ratios. To see the genes with the 50 highest ratios in any experiment, click on the data icon, then highlight the CloneID, Gene model, Accession, and Description options, then click on the submit button (do not change any of the other preselected values the first time). If you want to see the histogram data for one of the genes in the top 50, select the "history" link. To obtain more information about each experiment (i.e. abstracts and RNA information), go to http://afgc.stanford.edu/afgc_html/site2Cycle1.htm. An alternate way to access all of the experiments is from http://genome-www5.stanford.edu/MicroArray/SMD. From this page, select public search to get to the advanced results search page. From here, select: (a) Arabidopsis, (b) All (experiments), and (c) display data.
In link 8, the Affymetrix GeneChip array has approximately 8,200 genes
on it. It is an oligonucleotide-based array with 16 probe pairs per
gene. For each probe pair, there is an exact-match 25-mer oligomer and
a corresponding oligomer with a single nucleotide mismatch. To
determine if a gene is on the GeneChip, the AGI locus identifier is used to search the file downloaded from
http://www-biology.ucsd.edu/labs/schroeder/downloads.html (Ghassemian
et al., 2001).
In link 9, FSTs from the Versailles collection of Genoplante are identified by selecting the "requests to FLAGdb/FST" link. Only partial DNA sequence is required. The first sequence should have an E value of 0.0 and represent the exact match. The last column will indicate the number of FSTs (if any) that are found in a 20-kb window surrounding the gene. This page displays the exon-intron borders of all the annotated proteins around your gene and a "flag" icon is used to indicate the position of the FSTs. A red flag indicates that this is the best match in the genome for the FST. A green flag indicates that there are better matches in the genome.
In link 10, it is necessary to submit sequences one at a time to search the TMRI Arabidopsis T-DNA collection. The searches are not confidential and these lines have a more restrictive material transfer agreement (MTA) than the other lines. The MTA can be downloaded from the site. The BLAST results are e-mailed to the address provided.
In link 11, single or multiple sequences can be submitted to Insertwatch. For those wishing to submit multiple sequences, this site provides a description of the required "FastA" format. If you do not want to be contacted automatically when a new flanking sequence matches your gene, use Insertblast. Only 2,000 sequences are currently available, but more sequences are expected to be released.
In link 12, the IMA lines were generated by Professor Venkatesan
Sundaresan (Institute of Molecular Agrobiology, Singapore). He
is now at the University of California (Davis). Approximately 500 border sequences were obtained from Ds insertion lines (Parinov et al.,
1999).
Reformatting AFGC Microarray Information
A Perl script, ma_analysis.pl, was written to reformat the AFGC microarray data so that the ratio data for all of the AFGC experiments and as many genes as desired can be viewed in a single spreadsheet. Input 1 for this program is a tab-delimited spreadsheet ("arraystart.txt") that contains the category of each experiment as defined by AFGC (column 1), the experiment ID (column 2), and a description of each experiment rewritten to make it clearer what is being compared (column 3). "Test" and "reciprocal" experiments (where present) are on consecutive rows, with unrelated experiments separated by a blank row. In all cases, the experimental details are based on the Channel 2 (red) to Channel 1 (green) description found in the "view" option from the "advanced search page" on the AFGC Web site (January, 2002). In some cases, e.g. leaves to flowers (experiment ID 2,370), this information is not consistent with the experiment name "flowers leaves." The microarray data for each gene is downloaded from the AFGC Web site (see 7 above) and saved onto the local machine as a tab-delimited file (.txt). The filename given to the data for each gene must be in a precise format because this information provides the column heading information in the final spreadsheet containing all the genes. The format is GeneName_CloneorGenBankID_Date, with an underscore separating the components; for example, AGP4_T41664_28.11.01. The underscore character cannot be used as part of "GeneName," "CloneorGenBankID," or "Date". The filenames of the files containing the microarray data are placed in another file, one filename per line and in the order that the columns should appear in the final spreadsheet. This list file is saved as a text-only file (e.g. datafile.txt). The program and arraystart.txt file can be downloaded from our Web site (http://planta.waite.adelaide.edu.au/people/cs/index.htm).
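The filename convention described above can be checked before running the reformatting step. The following Python sketch is not the authors' Perl script; it only illustrates the stated rule that a filename has exactly three components separated by underscores. The function names, the optional ".txt" suffix handling, and the "GeneName (ID)" heading style are assumptions made for this example.

```python
def parse_data_filename(filename):
    """Split 'GeneName_CloneorGenBankID_Date' into its three components.

    Assumption: an optional '.txt' suffix may be present and is ignored.
    Raises ValueError if the name does not have exactly three non-empty
    underscore-separated parts.
    """
    stem = filename[:-4] if filename.lower().endswith(".txt") else filename
    parts = stem.split("_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected GeneName_CloneorGenBankID_Date, got {filename!r}"
        )
    gene_name, clone_id, date = parts
    return gene_name, clone_id, date


def column_headings(list_file_text):
    """Build one heading per filename, in the order listed (one per line).

    The 'GeneName (ID)' heading style is hypothetical.
    """
    headings = []
    for line in list_file_text.splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines in the list file
        gene_name, clone_id, _date = parse_data_filename(name)
        headings.append(f"{gene_name} ({clone_id})")
    return headings


if __name__ == "__main__":
    print(parse_data_filename("AGP4_T41664_28.11.01"))
```

A name such as AGP_4_T41664_28.11.01 is rejected because the extra underscore yields four components, which is the situation the naming rule is meant to prevent.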
To run the program on either a Unix- or Windows-based computer, go to the command line and type perl ma_analysis.pl arraystart.txt datafile.txt >arrayout.txt. This command places the output in the file arrayout.txt in tab-delimited format, which is suitable for import into various computer applications, including Excel (Microsoft, Redmond, WA). To make it easier to identify all of the "significant" ratios, conditional formatting can be used to make all values greater than 2.00 or less t |