
We present the following Problem Model. We will use the n = 4 character gene alphabet as a simple example, but the same model applies to the n = 20 character protein alphabet.
Assume that each of our n symbols maps to a binary string of length N under some arbitrary encoding scheme that we do not yet know.
So we have something like:
A = 01101001010...
G = 10101010101...
C = 11010100101...
T = 00101010010...
Each of these binary strings has length N.
Note: n and N are two different values. n = 4 because our real gene strings use 4 characters, while N is an unknown and arbitrary number.
We then place these unknown encodings as the columns of an N x n matrix (N rows, one per bit position; n columns, one per symbol), so our matrix would look something like:
| A | G | C | T |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| ... | ... | ... | ... |
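As a minimal sketch of the construction above, the matrix can be built by transposing the per-symbol bit strings. The true encodings and their length N are unknown, so this uses only the 11 visible bits of each example string as stand-ins; the names `encodings` and `build_matrix` are made up for illustration.

```python
# Visible prefixes of the example encodings. The real encodings (and N)
# are unknown; these 11-bit strings are placeholders for illustration only.
encodings = {
    "A": "01101001010",
    "G": "10101010101",
    "C": "11010100101",
    "T": "00101010010",
}


def build_matrix(encodings):
    """Return (symbols, rows), where rows[i] holds bit i of every symbol."""
    symbols = list(encodings)
    lengths = {len(bits) for bits in encodings.values()}
    if len(lengths) != 1:
        raise ValueError("all encodings must have the same length N")
    # One column of ints per symbol, in the order A, G, C, T.
    columns = [[int(b) for b in encodings[s]] for s in symbols]
    # zip(*columns) transposes the n columns into N rows.
    rows = [list(row) for row in zip(*columns)]
    return symbols, rows


symbols, matrix = build_matrix(encodings)
print("| " + " | ".join(symbols) + " |")
for row in matrix:
    print("| " + " | ".join(str(bit) for bit in row) + " |")
```

The first six printed rows match the six rows written out in the table above.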
We can now say that any particular row is some combination of n = 4 binary values, so there are a total of 2^4 = 16 possible distinct rows. This is fine for now, but the count grows as 2^n (for the n = 20 protein alphabet it is 2^20 = 1,048,576), so in the larger problem we want to cut corners wherever we can.
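To make the row count concrete, here is a small sketch that enumerates every possible row for a given n. It is only a counting illustration, not part of the model itself, and the function name `possible_rows` is an assumption of this example.

```python
from itertools import product


def possible_rows(n):
    """Return every length-n tuple of bits, i.e. all 2**n candidate rows."""
    return list(product((0, 1), repeat=n))


rows = possible_rows(4)
print(len(rows))   # number of distinct rows for the gene alphabet
print(rows[:3])    # the first few rows in lexicographic order
print(2 ** 20)     # row count for the protein alphabet, computed not enumerated
```

For n = 4 the full list is small enough to enumerate directly. For n = 20 the sketch only computes the count, since listing all rows is exactly the kind of cost the larger problem wants to avoid.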