
Various score matrices have been developed from biological data that assign a similarity score to every possible pair of amino acids. The PAM Matrices are based on evolutionary data and the more recent BLOSUM Matrices are based on statistical data [4].
Both matrices assign positive scores to matches and negative scores to mismatches. So the similarity of two strings is the sum of their scores for each pair.
Example: (Remember that proteins strings are represented from a larger alphabet)
String 1: GHALIQ
String 2: GMVLIP
To determine the similarity we must now look up a score for each of the 6 pairs in one of our score matrices. For this example we will use the BLOSUM45 Matrix.
Our 6 Pairs are: (G,G),
(H,M),
(A,V),
(L,L),
(I,I),
(Q,P)
The scores for these pairs according to BLOSUM45 are:
(G,G) = 7
(H,M) = 0
(A,V) = 0
(L,L) = 5
(I,I) = 5
(Q,P) = -1
Note that all matches have a positive score, and all mismatches have a zero or negative score. That is to say that matches contribute to the similarity of a string, while mismatches either do not contribute or deter from the similarity of a string.
The sum of these scores is 16. An overall positive score would indicate some level of similarity while a negative score would mean that two string are very much dissimilar. We can characterize a search by a threshold, or a minimum positive score. We then say that all pairs of strings that score above our threshold are significantly similar.
Keep in mind that because we are simply summing these values, larger strings will give way to a wider variety of scores. For this reason it is probably wise to scale our summed scores by dividing by the string length.
It should be clear at this point that taking similarity scores into consideration complicates the problem.