|
|
||||||||
|
First published online August 6, 2004; 10.1104/pp.104.041640 Plant Physiology 135:2040-2045 (2004) © 2004 American Society of Plant Biologists Types and Frequencies of Sequencing Errors in Methyl-Filtered and High C0t Maize Genome Survey Sequences1,[w]Department of Genetics, Development, and Cell Biology (Y.F., L.G., P.S.S.), Department of Agronomy (A.-P.H., P.S.S), Interdepartmental Graduate Programs in Genetics (Y.F., P.S.S.) and Bioinformatics and Computational Biology (L.G., P.S.S.), and Center for Plant Genomics (P.S.S.), Iowa State University, Ames, Iowa 500113650
The Maize Genome Sequencing Consortium has deposited into GenBank more than 850,000 maize (Zea mays) genome survey sequences (GSSs) generated via two gene enrichment strategies, methylation filtration and high-C0t (HC) fractionation. These GSSs are a valuable resource for generating genome assemblies and the discovery of single nucleotide polymorphisms and nearly identical paralogs. Based on the rate of mismatches between 183 GSSs (105 methylation filtration + 78 HC) and 10 control genes, the rate of sequencing errors in these GSSs is 2.3 x 103. As expected many of these errors were derived from insufficient vector trimming and base-calling errors. Surprisingly, however, some errors were due to cloning artifacts. These G C to A T transitions are restricted to HC clones; over 40% of HC clones contain at least one such artifact. Because it is not possible to distinguish the cloning artifacts from biologically relevant polymorphisms, HC sequences should be used with caution for the discovery of single nucleotide polymorphisms or paramorphisms. The average rate of sequencing errors was reduced 6-fold (to 3.6 x 104) by applying more stringent trimming parameters. This trimming resulted in the loss of only 11% of the bases (15,469/144,968). Due to redundancy among GSSs this more stringent trimming reduced coverage of promoters, exons, and introns by only 0%, 1%, and 4%, respectively. Hence, at the cost of a very modest loss of gene coverage, the quality of these maize GSSs can approach Bermuda standards, even prior to assembly.
The National Science Foundation is funding a project to compare two gene enrichment strategies for sequencing the gene space of the maize (Zea mays) genome. The first strategy uses methyl filtration (MF) to enrich for the gene-rich fraction of the genome (Rabinowicz et al., 1999
As of September 2003, 450,166 MF and 445,541 HC genome survey sequences (GSSs) from the maize inbred line B73 had been deposited in GenBank (http://www.tigr.org/tdb/tgi/maize/progress_graph.shtml). Ultimately, it will be important to assemble these GSSs into contigs. We have recently reported the development of innovative parallel algorithms that can assemble more than 730,000 GSS fragments in 4 h using 64 Pentium III 1.26 GHZ processors of a commodity cluster (Emrich et al., 2004
Sequencing errors in the GSSs also affect their utility for the detection of single nucleotide polymorphisms (SNPs) and nearly identical paralogs (NIPs). Comparisons between MF and HC GSSs generated from the inbred line B73 and sequences from other maize lines can reveal the presence of SNPs that have value both as molecular markers and as characters for phylogenetic analyses. Sequencing errors will, however, complicate SNP discovery. Recent segmental duplications of the human genome have generated large numbers of NIPs that have complicated the assembly, annotation, and analyses of that genome (Bailey et al., 2001
We previously estimated the rates of sequencing errors in the MF and HC GSSs as being 3.5 x 103 and 2.5 x 103, respectively (Emrich et al., 2004
Identification of Errors from Maize MF and HC GSSs The rate of sequencing errors in MF and HC GSSs was estimated by comparing GSSs to a set of 10 genes we cloned and sequenced (Table I; H. Yao, L. Guo, Y. Fu, T.-J. Wen, L. Borsuk, D. Skibbe, X.Q. Cui, B.E. Scheffler, J. Cao, F. Liu, D.A. Ashlock, and P.S. Schnable, unpublished data). Because both strands of each of these genes were sequenced and the resulting trace files were manually checked for base-calling errors, the error rate in this data set was expected to be low. This feature makes this data set of approximately 74 kb of finished sequence a suitable control for estimating the error rate in MF and HC data. MF and HC GSSs derived from these 10 control genes were identified using BLAST (see "Materials and Methods"). This process identified 105 MF and 78 HC GSSs (total length approximately 145 kb, Table I) for further analyses. Figure 1 depicts the GSSs derived from the rf2a gene (for similar data on the other control genes, see Supplemental Fig. 1, AI, which may be viewed at www.plantphysiol.org).
The fact that at least four MF and three HC reads matched each of the control genes provides encouraging evidence as to the success of the two gene enrichment approaches. Among the control genes, the ratios of recovered MF:HC GSSs range from 4:24 (rth1) to 14:3 (rf2e1; Table I), presumably reflecting gene-specific differences in the genomic sampling achieved by the two gene enrichment strategies. For four genes, the GSSs covered all of the exonic regions; the coverage of exonic regions among the remaining six genes ranged from 61% to 97% (Table II). For three genes, all of the intronic regions were covered by GSSs; the coverage of intronic regions among the six remaining genes that contain introns ranged from 54% to 90%. Promoter regions (defined for the purposes of this study as being the 500 bp upstream of the 5' end of the apparently full-length cDNA sequence) were less well represented among the GSSs. Of the seven promoters that could be analyzed, three were fully represented by GSSs, two were partially represented (45% to 57%), and two were not represented by any GSSs. As compared with HC GSSs, MF GSSs had a higher coverage of both promoters (53% versus 19%) and exons (71% versus 56%), but lower coverage of introns (37% versus 51%). Hence, these analyses demonstrate that the two gene enrichment strategies are complementary at identifying not only transcribed regions as previously demonstrated by Whitelaw et al. (2003) Each mismatch in an alignment between a GSS and a control gene triggered a second round of manual checking of the trace files associated with both the GSS and the control gene. These analyses identified a few errors in the approximately 74 kb of control sequences. After correcting these errors, the remaining 339 mismatches were deemed to be errors in the MF and HC reads, resulting in an average error rate of 2.3 x 103 (339/144,968; Table III) in the GSSs. The distributions of errors in each gene are diagramed in Figure 1 and supplemental data (Fig. 1, AI). The average error rates were lower in MF versus HC GSSs (2.1 x 103 versus 2.6 x 103; Table III). This was also usually true at the level of individual genes; in all but two genes (rf2b and rf2e1) the rates of sequencing errors in the MF GSSs were lower than in the corresponding HC GSSs.
Three Classes of Errors The sequencing errors detected in the GSSs can be grouped into three classes (Table III). The 21 Class I errors were located in the first eight bases of GSS reads and were associated with regions of the sequence traces that had high-quality Phred scores (>30, indicating the error probability was less than 103). Because these errors exhibited 100% identity to the sequences of the vectors used in the production of the MF and HC libraries (data not shown), we conclude that they are the result of insufficient vector trimming of GSS reads prior to deposition in GenBank. The 281 Class II errors were all associated with regions of the GSS reads that had low-quality scores (i.e. they are true base-calling errors). Ninety percent of these errors have quality scores <20 and all have quality scores <30. Class II errors consist of deletions (125), insertions (90), and base substitutions (66; data not shown). The average rate of Class II errors is 1.9 x 103 (281/144,968). Approximately 18% of the Class II errors detected in this study were located in the first 50 bp at the 5' ends of sequence reads; another 50% were located in the terminal 50 bp at the 3' ends of these sequence reads. All of the base substitution errors at the 5' ends of GSSs were due to aberrant T-traces (see Supplemental Fig. 2). The rates of sequencing errors in the first 50 bp (5.2 x 103) and last 50 bp (1.6 x 102) were approximately 3-fold and 8-fold higher than the overall rate of Class II errors, respectively. This suggests that the frequency of errors could be reduced substantially by more stringently trimming the ends of sequence reads.
To test this hypothesis, Lucy software (Chou and Holmes, 2001
The 37 remaining errors (Class III, Table III) were all G The 16 Class III errors located in overlapping regions of clone pairs were detected by both the forward and reverse sequencing reads. Manual inspection of the sequence traces and the results of the clone pair analysis demonstrate that the Class III errors are not base-calling errors. Instead, we assert that the affected HC clones contain SNPs relative to the control genes. We considered the possibility that the affected HC clones are derived from NIPs of the control genes. This possibility is not, however, consistent with the fact that Class III errors are detected only among the HC clones. In addition, existing DNA gel-blot hybridization data do not support the existence of NIPs for any of the control genes (data not shown). We instead conclude that the Class III errors are cloning artifacts. These cloning artifacts are particularly problematic because they cannot be distinguished from biologically relevant SNPs or paramorphisms. Although they occur at a relatively low rate within HC sequences (6 x 104), more than 40% (20/46) of HC clones contain at least one Type III error (Supplemental Table II).
The origin of these cloning artifacts is not known. The original protocol for generating HC libraries (Yuan et al., 2003
By aligning GSSs with previously sequenced genes, it was possible to detect two classes of sequencing errors (Types I and III) that were not detected via the analysis of GSS clone pairs (Emrich et al., 2004
The quality of the maize MF and HC GSSs released to GenBank by the Maize Genome Sequencing Consortium is quite high. For certain applications (e.g. genome assembly and the detection of SNPs and NIPs) it would, however, be desirable to have available sequences with even lower rates of errors. The quality of the maize MF and HC GSSs can be improved greatly by more stringently trimming vector and low quality sequences from the 5' and 3' ends of the sequence reads (viz., using Lucy parameters of Size 9, Bracket 20 0.003, Window 10 0.01, and Error 0.005 0.002). Because more than 40% of HC clones contain at least one Type III error, HC sequences should be used with caution in analyses in which errors must be minimized. It would be desirable that future HC libraries from maize or other species be prepared only after identifying reaction conditions that have reduced rates of cloning artifacts.
BLAST Search and Output Parsing
BLASTN searches (http://www.ncbi.nlm.nih.gov/blast/) without filtering low-complexity sequences were conducted using 10 control genes as query sequences and maize (Zea mays) GSSs as the object database (November 11, 2003). Default BLASTN settings for scoring alignments were used. Perl scripts were then used to parse the GenBank accession numbers of MF and HC GSSs from those alignments that met the following criteria: similarity
The trace files, quality score files, and clip site information of all qualified MF and HC GSSs were downloaded from the National Center for Biotechnology Information (NCBI) Trace database (http://www.ncbi.nlm.nih.gov/Traces). Only GSSs for which sequencing quality scores were available were analyzed further. Trace files were loaded into Sequencher software (version 4.1; Gene Codes, Ann Arbor, MI) for error verification.
Lucy (http://www.tigr.org/software/) was used to retrim GSSs. The sequences of the pBC SK- (Stratagene) and pCR4-TOPO (Invitrogen, Carlsbad, CA) vectors that had been used to construct the MF and HC libraries, respectively (http://www.tigr.org/tdb/tgi/maize/library_info.shtml), were used for vector trimming. Because Lucy indicates only the locations at which trimming should occur, an AWK script was written to trim GSSs and store trimmed data in FASTA format. The length of each trimmed GSS was calculated using a Perl script.
GSS coverage was calculated as the percentage (%) of each region of a control gene that was covered by at least one GSS. For the purposes of this study, the promoter region was defined as the 500 bp 5' of the 5' end of the apparently full-length cDNA sequence.
Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor. Sequence data from this article have been deposited with the EMBL/GenBank data libraries under the following accession numbers AF302098, AF215823, AF348412, AF348414, AF348418, AF370004, AF370006, AY265854, AY265855, and AY374447.
We thank Jun Cao, Xiangqin Cui, Chuck Dietrich, Feng Liu, Dave Skibbe, and Tsui-Jung Wen of the Schnable lab (Iowa State University) and Brian Scheffler (U.S. Department of Agriculture-Agricultural Research Service) for sharing sequencing data prior to publication. We thank Scott Emrich (Iowa State University) and Jeff Bennetzen (University of Georgia) for stimulating discussions. Received February 25, 2004; returned for revision April 2, 2004; accepted May 31, 2004.
1 This work was supported by competitive grants from the National Science Foundation Plant Genome Program (award nos. DBI9975868, DBI0121417, and DBI0321711). Support was also provided by the Hatch Act and State of Iowa funds.
[w] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.104.041640. * Corresponding author; e-mail schnable{at}iastate.edu; fax 5152945256.
Bailey J, Gu Z, Clark R, Reinert K, Samonte R, Schwartz S, Adams M, Myers E, Li P, Eichler E (2002) Recent segmental duplications in the human genome. Science 297: 10031007
Bailey J, Yavor A, Massa H, Trask B, Eichler E (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11: 10051017 Brown KR, Weatherdon KL, Galligan CL, Skalski V (2002) A nuclear 3'-5' exonuclease proofreads for the exonuclease-deficient DNA polymerase alpha. DNA Repair (Amst) 1: 795810
Chou HH, Holmes MH (2001) DNA sequence quality trimming and vector removal. Bioinformatics 17: 10931104
Emrich SJ, Aluru S, Fu Y, Wen TJ, Narayanan M, Guo L, Ashlock D, Schnable PS (2004) A strategy for assembling the maize (Zea mays L.) genome. Bioinformatics 20: 140147 Hurles M (2002) Are 100,000 "SNPs" useless? Science 298: 1509 Kunkel TA, Bebenek K (2000) DNA replication fidelity. Annu Rev Biochem 69: 497529[CrossRef][ISI][Medline]
Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR (2003) Maize genome sequencing by methylation filtration. Science 302: 21152117 Rabinowicz PD, Schutz K, Dedhia N, Yordan C, Parnell LD, Stein L, McCombie WR, Martienssen RA (1999) Differential methylation of genes and retrotransposons facilitates shotgun sequencing of the maize genome. Nat Genet 23: 305308[CrossRef][ISI][Medline]
Schaaper RM (1993) Base selection, proofreading and mismatch repair during DNA replication in Escherichia coli. J Biol Chem 268: 2376223765
Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, et al (2003) Enrichment of gene-coding sequences in maize by genome filtration. Science 302: 21182120 Yuan Y, SanMiguel PJ, Bennetzen JL (2003) High-Cot sequence analysis of the maize genome. Plant J 49: 249255 This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| ASPB Publications | PLANT PHYSIOLOGY | THE PLANT CELL | |
|---|---|---|---|