Summary - Identifying distantly related proteins
Protein sequence comparison is the most powerful tool available today
for inferring structure and function from sequence because of the
constraints of protein evolution: a protein must fold into a functional
structure. Protein sequence similarity can routinely be used to infer
relationships between proteins that last shared a common ancestor
1--2.5 billion years ago. Our ability to identify distantly
related proteins has improved over the past five years with the
development of accurate statistical estimates, which have provided better
normalization methods, and with the use of optimized scoring
parameters. In using sequence similarity to infer homology, one
should remember:
- Always compare protein sequences if the genes encode
proteins. Protein sequence comparison will typically double the
look-back time over DNA sequence comparison.
- While most sequences that share statistically significant
similarity are homologous, many distantly related homologous sequences
do not share significant similarity. (Conversely, low-complexity regions
can display significant similarity in the absence of homology.) Homologous
sequences are usually similar over an entire sequence or domain.
Matches that are more than 50% identical in a 20--40 amino acid
region occur frequently by chance.
- Homologous sequences share a common ancestor, and thus a common
protein fold. Depending on the evolutionary distance and divergence
path, two or more homologous sequences may have very few absolutely
conserved residues. However, if homology has been inferred between
A and B, between B and C, and between C
and D, then A and D must be homologous, even if they
share no significant similarity.
- Similarity searching techniques can be improved either by
increasing the ability of a method to recognize distantly related
sequences---increased sensitivity---or by lowering scores for
unrelated sequences---increased selectivity. Since there are
generally 1000 times more unrelated than related sequences in a
sequence database, improvements that reduce the scores of unrelated
sequences can have dramatic effects. The most dramatic improvements in
comparison methods recently have used this approach.
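
The transitivity argument in the third point above (A with B, B with C, and C with D implies A with D) can be sketched as grouping sequences into connected components. The following is a minimal illustration in Python, not a method taken from the text: the function names `homology_families` and `are_homologous` and the sequence labels are hypothetical, and the sketch assumes every pairwise inference is itself correct and refers to the same domain.

```python
from collections import defaultdict


def homology_families(pairs):
    """Group sequence names into families by transitive closure.

    `pairs` is an iterable of (name_a, name_b) tuples, each recording one
    pairwise homology inference (hypothetical input format).
    """
    parent = {}

    def find(x):
        # Return the representative of x's family, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for name in list(parent):
        families[find(name)].add(name)
    return list(families.values())


def are_homologous(pairs, x, y):
    """True if x and y fall in the same family under the inferences given."""
    return any(x in fam and y in fam for fam in homology_families(pairs))


# Assumed example data: three chained inferences plus one unrelated pair.
inferred = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
print(are_homologous(inferred, "A", "D"))  # True
print(are_homologous(inferred, "A", "X"))  # False
```

A and D end up in the same family although no direct A-D comparison was supplied, which mirrors the case where the two sequences share no significant similarity. For multi-domain proteins the pairs would need to be tracked per domain rather than per whole sequence.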
wrp@virginia.edu