A quality control algorithm for DNA sequencing projects
- PMID: 8367301
- PMCID: PMC309901
- DOI: 10.1093/nar/21.16.3829
Abstract
Heterologous DNA sequences from rearrangements with the genomes of host cells, genomic fragments from hybrid cells, or impure tissue sources can threaten the purity of libraries that are derived from RNA or DNA. Hybridization methods can only detect contaminants from known or suspected heterologous sources, and whole library screening is technically very difficult. Detection of contaminating heterologous clones by sequence alignment is only possible when related sequences are present in a known database. We have developed a statistical test to identify heterologous sequences that is based on the differences in hexamer composition of DNA from different organisms. This test does not require that sequences similar to potential heterologous contaminants are present in the database, and can in principle detect contamination by previously unknown organisms. We have applied this test to the major public expressed sequence tag (EST) data sets to evaluate its utility as a quality control measure and a peer evaluation tool. There is detectable heterogeneity in most human and C. elegans EST data sets, but it is not apparently associated with cross-species contamination. However, there is direct evidence for both yeast and bacterial sequence contamination in some public database sequences annotated as human. Results obtained with the hexamer test have been confirmed with similarity searches using sequences from the relevant data sets.
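The abstract does not give the test statistic, so the following is only a minimal sketch of how a hexamer-composition comparison could be set up, not the authors' method. It assumes a simple mean log-likelihood ratio between two smoothed hexamer frequency models; the function names (`hexamer_counts`, `hexamer_model`, `log_odds`), the pseudocount, and the two-model framing are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product
from math import log

K = 6  # hexamer length, as named in the abstract


def hexamer_counts(seq):
    """Count overlapping hexamers that consist only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - K + 1):
        word = seq[i:i + K]
        if set(word) <= set("ACGT"):  # skip windows with N or other symbols
            counts[word] += 1
    return counts


def hexamer_model(seqs, pseudocount=1.0):
    """Estimate smoothed hexamer frequencies from training sequences.

    The additive pseudocount is an assumption of this sketch; it keeps
    every one of the 4**6 hexamers at a non-zero probability.
    """
    total = Counter()
    for s in seqs:
        total.update(hexamer_counts(s))
    n = sum(total.values()) + pseudocount * 4 ** K
    words = ("".join(w) for w in product("ACGT", repeat=K))
    return {w: (total[w] + pseudocount) / n for w in words}


def log_odds(seq, host_model, other_model):
    """Mean per-hexamer log-likelihood ratio (assumed statistic).

    Positive values mean the sequence resembles host_model more than
    other_model; negative values mean the opposite.
    """
    counts = hexamer_counts(seq)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    score = sum(c * log(host_model[w] / other_model[w])
                for w, c in counts.items())
    return score / n
```

In use, one model would be trained on sequences trusted to come from the library's source organism and the other on a reference set for a suspected contaminant; a clone whose score falls well on the wrong side would be flagged for follow-up, for example by a similarity search. Detecting previously unknown organisms, as the abstract describes, would need a test against the host model alone, which this two-model sketch does not cover.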
Similar articles
- Ancient conserved regions in new gene sequences and the protein databases. Science. 1993 Mar 19;259(5102):1711-6. doi: 10.1126/science.8456298. Science. 1993. PMID: 8456298
- A fast algorithm for genome-wide analysis of proteins with repeated sequences. Proteins. 1999 Jun 1;35(4):440-6. Proteins. 1999. PMID: 10382671
- Species-specific patterns of DNA bending and sequence. Nucleic Acids Res. 1991 Oct 11;19(19):5253-61. doi: 10.1093/nar/19.19.5253. Nucleic Acids Res. 1991. PMID: 1923808 Free PMC article.
- acdc - Automated Contamination Detection and Confidence estimation for single-cell genome data. BMC Bioinformatics. 2016 Dec 20;17(1):543. doi: 10.1186/s12859-016-1397-7. BMC Bioinformatics. 2016. PMID: 27998267 Free PMC article.
- Gene identification through large-scale EST sequence processing. Appl Bioinformatics. 2003;2(3):123-9. Appl Bioinformatics. 2003. PMID: 15130797 Review.
Cited by
- Contamination of cDNA- libraries and expressed-sequence-tags databases. Am J Hum Genet. 1995 Nov;57(5):1254-5. Am J Hum Genet. 1995. PMID: 7485181 Free PMC article. No abstract available.
- Mobilomics in Saccharomyces cerevisiae strains. BMC Bioinformatics. 2013 Mar 20;14:102. doi: 10.1186/1471-2105-14-102. BMC Bioinformatics. 2013. PMID: 23514613 Free PMC article.
- Comparative analysis of environmental sequences: potential and challenges. Philos Trans R Soc Lond B Biol Sci. 2006 Mar 29;361(1467):519-23. doi: 10.1098/rstb.2005.1809. Philos Trans R Soc Lond B Biol Sci. 2006. PMID: 16524840 Free PMC article. Review.
- On the species of origin: diagnosing the source of symbiotic transcripts. Genome Biol. 2001;2(9):RESEARCH0037. doi: 10.1186/gb-2001-2-9-research0037. Epub 2001 Aug 23. Genome Biol. 2001. PMID: 11574056 Free PMC article.
- Detecting and analyzing DNA sequencing errors: toward a higher quality of the Bacillus subtilis genome sequence. Genome Res. 1999 Nov;9(11):1116-27. doi: 10.1101/gr.9.11.1116. Genome Res. 1999. PMID: 10568751 Free PMC article.