Previous: Score Based String Comparison | Top: Table of Contents | Next: Problem of Distortion

I-D. Mapping Proteins to N-Length Binary Strings

Our main goal is to create a mapping that takes a real protein string and encodes it into a binary string (technically, since we are solving this problem on the computer, this has to be done anyway). To be useful, our mapping must accurately reflect the scoring matrix we wish to follow.

So our goal is to encode each amino acid in the protein alphabet to a binary string such that the hamming distance between the encoded strings is relative to the similarity of the two amino acids.

In other words, for any pair of acids, the more negative their similarity score, the more distant the encoded strings should be.

Example:

The score for the pair (D,G) = -1

The score for the pair (D,I) = -4

So a good encoding scheme for these three characters would be:
D -> 010101
G -> 000101
I  -> 011010

As a result, the hamming distance between the encoding for D and the encoding for G is 1. The hamming distance between the encoding for D and the encoding for I is 4. So if a pair is more similar, the hamming distance between its encodings is lower.

A new problem arises when we take into consideration the score of the pair (G,I) which is also -4. However, the hamming distance between our encodings for G and I is 5. Now our system has two pairs with the same similarity score, (D,I) = -4 = (G,I) but different hamming distances between the encodings. This is called distortion, and preventing or minimizing it becomes a very delicate optimization problem.


Previous: Score Based String Comparison | Top: Table of Contents | Next: Problem of Distortion