Sequence analysis has become such an important field because the sequence
of a biomolecule ---DNA or protein---carries a lot of information
about the biomolecule's function and history. In particular, we usually
assume that similar sequences hold a similar function---structure---, and
they have a similar evolutive history. There is very strong evidence
supporting such an assumption. It is not surprising, thus, that among the
first tools developed in the field of sequence analysis were those aimed
to determine the degree of similarity between two sequences.
In this lesson, we will address the issue of sequence similarity. In the first part of
the lesson, we will start introducing the concept of sequence alignment,
on which the concept of sequence similarity (distance) depends. An alignment
is simply a correspondence between the sequences, in which each symbol
in a sequences is assigned no more than one (maybe none) of the symbols
in the other sequence, and in which the order of the symbols in the sequence
is maintained. Given a matrix of similarities (distances) between all possible
pairs of symbols on which the sequences are built, the similarity (distance)
of the alignment is the sum of the similarities (distances) between the
aligned symbols. The problem is to find the alignment between two sequences
having the maximum similarity (minimum distance). We will see that exhautive
exploration of all possible alignments to determine the optimal one is
computationally impractical, and we will introduce a dynamic programming
algorithm much more efficient---the first version of it developed in the
early 1970s by Needelman and Wunsch. With the dynamic programming algorithm
only the optimal alignments between all initial segments of the sequences
to compare need to be determined in order to find the optimal global alignment
between such two sequences.
Since the aligment depends strongly on the
similarity (distance) matrix used, we will next describe the way in
which the two most commonly used sets of matrices---PAM and BLOSUM---are
derived. We will finish the first part of this lesson by addressing briefly
the issue of the statistical significance of the similarity between aligned
sequences. The second part of the lesson is devoted to the problem of finding
local alignments between sequences. Often, a global alignment between two
sequences has little biological meaning---for instance, when aligning two
genomic regions containing a gene. In such cases, it may be more relevant
to align those regions in the sequence showing high similarity than attempting
to align the whole sequences. We will describe the modification of the
global alignment dynamic programming algorithm that allows to find such
local alignments ---the modification is due to Smith and Waterman in the
early 1980s---. Finally, in the third part of the lesson we will address
the problem of building alignments between multiple sequences.
- SEQUENCE ALIGNMENT
- DYNAMIC PROGRAMMING GLOBAL ALIGNMENT
- S.B. Needleman and C.D. Wunsch.
"A general method applicable to the search for similarities in the amino acid sequence of of two proteins."
J.Mol.Biol. 48:443-453 (1970)
- SUBSTITUTION MATRICES
M.O. Dayhoff, R.M. Schwartz and B.C. Orcutt.
"A model of evolutionary change in proteins."
In M.O. Dayhoff, ed. Atlas of Protein Sequence and Structure." Vol.5, Suppl.3, pp:345-352 (1978)
Natonal Biomedical Research Foundation, Washington D.C.
S. Henikoff and J.G. Henikoff.
"Amino acid substitution matrices from protein blocks."
PNAS 89:10915-10919 (1992)
- STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENTS
- M. Waterman.
"Estimating statistical of sequence alignments."
Phil.Trans.R.Soc.Lond B 344:383-390 (1994)
- DYNAMIC PROGRAMMING LOCAL ALIGNMENT
- T.F. Smith and M. Waterman.
"Identification of common molecular subsequences."
J.Mol.Biol. 147:195-197 (1981)
- MULTIPLE ALIGNMENT
Handouts on Sequence Similarity (PostScript file)
Recommended Internet Documents
Click here to access to the Sequence Similarity Practical.