Ефективність фільтрації у статистичних алгоритмах швидкого пошуку гомології

Певзнер, П. А.

doi:10.7124/bc.000299

Biopolym. Cell. 1990; 6(6):7-13.

http://dx.doi.org/10.7124/bc.000299

Ефективність фільтрації у статистичних алгоритмах швидкого пошуку гомології

1Певзнер П. А.

ВНДІ генетики та селекції промислових мікроорганізмів Главмікробіопрома при РМ СРСР
Москва, СРСР

Abstract

При пошуку локальних гомологій (пошук гомологій у генетичних банках, вибір оптимальних олігонуклеотидних зондів і т. п.) виникає проблема його «пришвидчення». Квадратична трудомісткість алгоритмів динамічного програмування змушує вдаватися до методів фільтрації, що дозволяє швидко «відбракувати» послідовності з низьким рівнем гомології. У роботі вводиться поняття ефективності фільтрації і дається оцінка ефективності деяких фільтрів, при цьому показано, що в l-грамному аналізі ефективність фільтрації пов’язана з потенційним розширенням вихідного 4-літерного алфавіту.

Повний текст: (PDF, російською)

References

[1] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443-53.

[2] Sellers PH. On the Theory and Computation of Evolutionary Distances SIAM J Appl Math. 1974; 26(4):787-93.

[3] Smith TF, Waterman MS, Burks C. The statistical distribution of nucleic acid similarities. Nucleic Acids Res. 1985;13(2):645-56.

[4] Lyall A, Hill C, Collins JF, Coulson AFW. Parallel computing'85. A. Eds M. Felmeier et al. North-Holland, 1986:235-258.

[5] Gotoh O, Tagashira Y. Sequence search on a supercomputer. Nucleic Acids Res. 1986;14(1):57-64.

[6] Collins JF, Coulson AF. Applications of parallel processing algorithms for DNA sequence analysis. Nucleic Acids Res. 1984;12(1 Pt 1):181-92.

[7] Goad WB, Kanehisa MI. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Res. 1982;10(1):247-63.

[8] Masck W, Paterson MS. How to compute string-edit distance quickly. Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, 1983:337-349.

[9] Ukkonen E. On approximate string matching. Lecture Notes in Computer Science. 1983;487–95.

[10] Roytberg MA. Algorithm for determining the homology of the primary structures. Pushchino, 1984; 24 p. (Preprint. USSR. NIVTS AN SSSR).

[11] Hsu WJ, Du MW. New algorithms for the LCS problem. Journal of Computer and System Sciences. 1984;29(2):133–52.

[12] Kumar SK, Rangan CP. A linear space algorithm for the LCS problem. Acta Informatica. 1987;24(3):353-362.

[13] Gusev VD, Kulichkov VA, Titkova TN. Analysis of genetic texts. 1. L-programmatic profile. Vychislitel'nyye sistemy. 1980; 83(1):11-33.

[14] Fondrat C, Dessen P, Le Beux P. Principle of codification for quick comparisons with the entire biomolecule databanks and associated programs in FORTRAN 77. Nucleic Acids Res. 1986;14(1):197-204.

[15] Kolchanov NA, Solovyev VV, Zharkikh AA. Context theoretical analysis techniques of genetic macromolecules (DNA, RNA and proteins). Itogi nauki i tekhniki. Moscow, VINITI, (Ser. Molekulyarnaya biologiya, Vol. 210). 1985; 6-34.

[16] Mironov AA, Alexandrov NN. Statistical method for rapid homology search. Nucleic Acids Res. 1988;16(11):5169-73.

[17] Solovyev VV, Rogozin IB. Quick search on banks homology of nucleotide and amino acid sequences. Theoretical studies and databases of molecular biology and genetics. Novosibirsk, 1988; 40.

[18] Zharkikh AA, Rzhetskiy AYu, Rodin SN. Evaluation sequence homology over the frequencies of the oligonucleotides. I. Restrictions on the composition of nucleotides in the DNA. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988. 25 p.

[19] Zharkikh AA, Rzhetskiy AYu. Evaluation sequence homology over the frequencies of the oligonucleotides. II. Analysis of the distribution of nucleotide substitutions in the sequences. U-statistics. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988; 26 p.

[20] Zharkikh AA, Rzhetskiy AYu. Determination of sequence homology to the frequencies of oligonucleotides. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 40 p.

[21] Strelets VB, Shindyalov IYa. Quick search of similarity between the amino acid sequences. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 53 p.

[22] Strelets VB, Shindyalov IYa. Quick search methods of similarity between the amino acid sequences. Novosibirsk, (Preprint. USSR. Inst of Cytology and Genetics). 1988; 25 p.

[23] Wilbur WJ, Lipman DJ. Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci U S A. 1983;80(3):726-30.

[24] Kanehisa M. Use of statistical criteria for screening potential homologies in nucleic acid sequences. Nucleic Acids Res. 1984;12(1 Pt 1):203-13.

[25] Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435-41.

[26] Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986;83(14):5155-9.

[27] Lawrence CB, Goldman DA, Hood RT. Optimized homology searches of the gene and protein sequence data banks. Bull Math Biol. 1986;48(5-6):569-83.

[28] Zhurkin VB. Periodicity in DNA primary structure is defined by secondary structure of the coded protein. Nucleic Acids Res. 1981;9(8):1963-71.

[29] Pevzner PA. Filtration efficiency in the way of rapid homology search. Theoretical studies and databases on molecular biology and genetics. Novosibirsk, 1988; 63-4.

[30] Guibas L., Odlyzko A. String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A. 1981;30(2):183–208.

[31] Pevzner PA, Borodovsky MYu, Mironov AA. Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989;6(5):1013-26.