Biopolym. Cell. 1990; 6(6):7-13.
Filtration efficiency in rapid homology search statistical algorithms
1Pevzner P. A.
  1. All-Union Research Institute of Genetics and Breeding of Industrial Microorganism, Glavmicrobioprom at Council of Ministers of the USSR
    Moscow, USSR

Abstract

Upon searching local homologies in long sequences (homology search in nucleotide and amino acid sequences banks, selection of optimal oligonucleotide probes etc.) the necessity of a «rapid» homology search becomes acute. Quadratic complexity of (he dymanic programming algorithms (Needleman–Wunsch and Sellers type) forces the employment of filtration methods, that permits one to reject the sequences with a low homology level (among the filtration methods the 1–tuple analysis and the statistical method of Mironov–Alexandrov were used). But theoretical substantiations of such algorithms have not been made yet. The present work introduces the notion of filtration efficiency and the efficiency of several filters is given. It was shown that in the 1–tuple analysis the filtration efficiency is associated with the potential extension of the original four– letter alphabet. The formulas that allow choosing the filtration parameters are presented.

References

[1] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443-53.
[2] Sellers PH. On the Theory and Computation of Evolutionary Distances SIAM J Appl Math. 1974; 26(4):787-93.
[3] Smith TF, Waterman MS, Burks C. The statistical distribution of nucleic acid similarities. Nucleic Acids Res. 1985;13(2):645-56.
[4] Lyall A, Hill C, Collins JF, Coulson AFW. Parallel computing'85. A. Eds M. Felmeier et al. North-Holland, 1986:235-258.
[5] Gotoh O, Tagashira Y. Sequence search on a supercomputer. Nucleic Acids Res. 1986;14(1):57-64.
[6] Collins JF, Coulson AF. Applications of parallel processing algorithms for DNA sequence analysis. Nucleic Acids Res. 1984;12(1 Pt 1):181-92.
[7] Goad WB, Kanehisa MI. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Res. 1982;10(1):247-63.
[8] Masck W, Paterson MS. How to compute string-edit distance quickly. Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, 1983:337-349.
[9] Ukkonen E. On approximate string matching. Lecture Notes in Computer Science. 1983;487–95.
[10] Roytberg MA. Algorithm for determining the homology of the primary structures. Pushchino, 1984; 24 p. (Preprint. USSR. NIVTS AN SSSR).
[11] Hsu WJ, Du MW. New algorithms for the LCS problem. Journal of Computer and System Sciences. 1984;29(2):133–52.
[12] Kumar SK, Rangan CP. A linear space algorithm for the LCS problem. Acta Informatica. 1987;24(3):353-362.
[13] Gusev VD, Kulichkov VA, Titkova TN. Analysis of genetic texts. 1. L-programmatic profile. Vychislitel'nyye sistemy. 1980; 83(1):11-33.
[14] Fondrat C, Dessen P, Le Beux P. Principle of codification for quick comparisons with the entire biomolecule databanks and associated programs in FORTRAN 77. Nucleic Acids Res. 1986;14(1):197-204.
[15] Kolchanov NA, Solovyev VV, Zharkikh AA. Context theoretical analysis techniques of genetic macromolecules (DNA, RNA and proteins). Itogi nauki i tekhniki. Moscow, VINITI, (Ser. Molekulyarnaya biologiya, Vol. 210). 1985; 6-34.
[16] Mironov AA, Alexandrov NN. Statistical method for rapid homology search. Nucleic Acids Res. 1988;16(11):5169-73.
[17] Solovyev VV, Rogozin IB. Quick search on banks homology of nucleotide and amino acid sequences. Theoretical studies and databases of molecular biology and genetics. Novosibirsk, 1988; 40.
[18] Zharkikh AA, Rzhetskiy AYu, Rodin SN. Evaluation sequence homology over the frequencies of the oligonucleotides. I. Restrictions on the composition of nucleotides in the DNA. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988. 25 p.
[19] Zharkikh AA, Rzhetskiy AYu. Evaluation sequence homology over the frequencies of the oligonucleotides. II. Analysis of the distribution of nucleotide substitutions in the sequences. U-statistics. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988; 26 p.
[20] Zharkikh AA, Rzhetskiy AYu. Determination of sequence homology to the frequencies of oligonucleotides. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 40 p.
[21] Strelets VB, Shindyalov IYa. Quick search of similarity between the amino acid sequences. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 53 p.
[22] Strelets VB, Shindyalov IYa. Quick search methods of similarity between the amino acid sequences. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988; 25 p.
[23] Wilbur WJ, Lipman DJ. Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci U S A. 1983;80(3):726-30.
[24] Kanehisa M. Use of statistical criteria for screening potential homologies in nucleic acid sequences. Nucleic Acids Res. 1984;12(1 Pt 1):203-13.
[25] Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435-41.
[26] Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986;83(14):5155-9.
[27] Lawrence CB, Goldman DA, Hood RT. Optimized homology searches of the gene and protein sequence data banks. Bull Math Biol. 1986;48(5-6):569-83.
[28] Zhurkin VB. Periodicity in DNA primary structure is defined by secondary structure of the coded protein. Nucleic Acids Res. 1981;9(8):1963-71.
[29] Pevzner PA. Filtration efficiency in the way of rapid homology search. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 63-4.
[30] Guibas L., Odlyzko A. String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A. 1981;30(2):183–208.
[31] Pevzner PA, Borodovsky MYu, Mironov AA. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989;6(5):1013-26.