
If we assign a y value to each column we create a system of linear expressions. Each y is the frequency in which that partition is used as part of our original encoding scheme.
Example:
If y1 = 5, then we use the partition "0001"
five times, so in our encoding scheme there should be five positions at which the aligning bit of
A, G,
and C are 0,
and the same aligning bit in T is 1.
Note: The sum of each linear expression is the total hamming distance between a particular pair.
Example:
The first row creates the linear expression:
0y1 +
0y2 +
0y3 +
1y4 +
1y5 +
1y6 +
1y7
which simplifies to:
y4 +
y5 +
y6 +
y7
So the sum of these 4 y values is the hamming distance between the encodings for
A and G.
Our goal is to find y values that will make these hamming distances (the sums of these linear expressions) be relative to the real distances between the pairs.
An important question that still remains is: What are the real distances? We have biological data for similarity, but we need to modify these values to represent distance instead. We have followed the method discussed in [5], which is explained in detail below.
Let M be a biologically scored similarity matrix such that each Mij is the similarity between some characters i and j.
The distance between any characters i and j (Dij) is calculated as Mii + Mjj - 2Mij
Since the diagonal values are positive, and the rest are negative or zero, distance is always positive.
Example:
One way to score similarities in genes is as follows [5]:
Our resulting scoring matrix is:
| A | G | C | T | |
| A | 1 | -1 | -2 | -2 |
| G | -1 | 1 | -2 | -2 |
| C | -2 | -2 | 1 | -1 |
| T | -2 | -2 | -1 | 1 |
Note: All score matrices that ignore pair order are symmetrical
So DAG (which is the same as DGA) is calculated as MAA + MGG - 2MAG
= 1 + 1 - 2(-1) = 4
In this way we can generate a column vector D, indexed by pairs, like this:
| Pair | D |
| (A,G) | 4 |
| (A,C) | 6 |
| (A,T) | 6 |
| (G,C) | 6 |
| (G,T) | 6 |
| (C,T) | 4 |
Remember that our hamming distances are represented by the product of our partition matrix (which we will call A) and our unknown y-values (the vector y). Our goal is for this product to be relative to our true distances in our D vector, with room for some distortion error.
So the vector of all our pairwise hamming distances is Ay. We have been saying relative, rather than exact, because in truth there may be some scaling factor l.
Example:
If l=2, then our hamming distances might be 8, 12, 12, 12, 12, 8.
These would still be perfectly relative to the true distances 4, 6, 6, 6, 6, 4.
If there is some distortion e, then our hamming distances should be bounded by D(1 - e) and D(1 + e), so we get the inequality:
D(1 - e) <= lAy <= D(1 + e)
Note that e, y, and l are all unknown.
We can multiply l into y to get a vector of scaled values x and solve for that. In the end we won't even need the original y values, so this simplification is okay.
Now we have:
D(1 - e) <= Ax <= D(1 + e)
We represent this as two sets of linear inequalities:
Ax - De <= D
-Ax - De <= -D
We wish to minimize e (same as maximizing -e)
with these linear inequalities as constraints, as well as the constraints that all x values and the
distortion e must remain greater than or equal to 0.
We have just created an optimization problem for linear programming. We can solve this problem using known methods. In our work we used the CPLEX software to solve our optimization problems.