
In string comparison problems, we search for similar strings rather than exact matches. A simple way to measure the similarity of two strings of equal length is to count the positions at which they differ; this count is called the Hamming distance.
Example:
String 1: AGGCATCA
String 2: AGCCAGCA
These two strings differ at the 3rd and 6th positions, so their Hamming distance is 2. Since the strings are 8 characters long, we can calculate the similarity as (8 - 2) / 8 = 75%. Similarity and distance are therefore two views of the same measurement: as one goes up, the other goes down.
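The calculation above can be sketched in a few lines of Python. The function names hamming_distance and similarity are our own choices for illustration, not part of any project starter code, and treating two empty strings as fully similar is an assumption.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    # zip pairs up the characters position by position;
    # each mismatch contributes 1 to the sum.
    return sum(x != y for x, y in zip(a, b))


def similarity(a: str, b: str) -> float:
    """Fraction of positions that match, between 0.0 and 1.0."""
    if not a:
        # Assumption: two empty strings count as identical.
        return 1.0
    return (len(a) - hamming_distance(a, b)) / len(a)


if __name__ == "__main__":
    s1 = "AGGCATCA"
    s2 = "AGCCAGCA"
    print(hamming_distance(s1, s2))  # 2
    print(similarity(s1, s2))        # 0.75
```

Raising an error on unequal lengths, rather than silently comparing only the overlapping prefix, keeps the function faithful to the definition, which only applies to strings of equal length.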
Unfortunately, this simplified view does not model the reality of the problem, for several reasons. One issue is gaps: an insertion or deletion in one sequence shifts every later character, so a position-by-position comparison no longer lines up. Gaps are not covered in this project, but many modern genomics algorithms take them into account.
Our focus in this project is the fact that some amino acids in protein chains substitute for one another more easily than others. In other words, we cannot simply count how many characters differ, because some differences are more significant than others.
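One way to picture this is a distance in which each mismatch carries its own cost. The sketch below is only an illustration: the costs (1 and 3), the two "similar" pairs, and the name weighted_distance are all hypothetical, chosen to show the idea rather than taken from any published scoring matrix or from this project's data.

```python
# Hypothetical example values, for illustration only.
# Pairs of one-letter amino acid codes that we treat as "easy" swaps:
# I/L (isoleucine, leucine) and D/E (aspartate, glutamate).
SIMILAR_PAIRS = {frozenset("IL"), frozenset("DE")}
SIMILAR_COST = 1   # assumed cost of swapping within a similar pair
DEFAULT_COST = 3   # assumed cost of any other mismatch


def weighted_distance(a: str, b: str) -> int:
    """Sum a per-position cost instead of counting every mismatch as 1."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == y:
            continue  # matching positions cost nothing
        if frozenset((x, y)) in SIMILAR_PAIRS:
            total += SIMILAR_COST
        else:
            total += DEFAULT_COST
    return total


if __name__ == "__main__":
    # L vs I is a "similar" swap (cost 1); V vs D is not (cost 3).
    print(weighted_distance("LKV", "IKD"))  # 4
```

Under plain Hamming distance these two strings would score 2, with both mismatches counted equally; the weighted version separates the mild substitution from the more significant one.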