Student: Shiri Azenkot
School: Pomona College
Email: shiri.azenkot@pomona.edu
shiriaz@dimax.rutgers.edu
Project Name: String Edit Distance with Moves
Research Advisor: Dr. Graham Cormode, DIMACS Postdoc
Rutgers University




Algorithm Overview

I explored an algorithm that approximates the string edit distance with moves in near linear time. My DIMACS research advisor, Dr. Graham Cormode, introduced the algorithm in a paper he coauthored, The String Edit Distance Matching Problem with Moves (2002). The problem is NP-hard, and the algorithm yields a deterministic approximation in time O(n log n log*n). In theory, the algorithm’s output is within a factor of O(log n log*n) of the actual edit distance with moves.

The edit distance approximation is obtained in two main steps. First, each input string is parsed in a 2-3 tree using Edit Sensitive Parsing. The two parse trees are then compared to obtain a solution.

Parsing a string

Each string is parsed into a 2-3 tree using Edit Sensitive Parsing. In this parsing method, edit operations on a string should have a minimal effect on its parse tree. Parsing the strings is the bulk of the algorithm. In Edit Sensitive Parsing, the original string represents the bottom level of the tree, such that each character is a leaf node. The nodes are then divided into groups of 2 and 3, and a parent node is created for each group. Thus, the tree is constructed from the leaves to the root.

The first step in grouping the nodes is to isolate repeating sequences in the string. These sequences are divided into groups of 2 and 3 in a straight-forward way, such that groups of 3 are given preference.

Example:

aaaaaaa is grouped as (aaa)(aa)(aa)

In order to group the remaining nodes, the alphabet must first be reduced. This is done iteratively, until the alphabet consists of only three characters. The reduction method ensures that no two adjacent characters are equal. Landmarks are then set according to the locations of local maxima and local minima. Each landmark is grouped along with its adjacent nodes to form groups of 2 and 3.

Example:

sandwich is grouped as (san)(dw)(ich)
Underlined characters are landmarks.

Each node in the resulting tree represents a substring, according to its descendant leaves.

Example:

Could not load picture

Comparing the parse trees

Now that each string is represented by a parse tree, finding the solution is rather simple. The characteristic vector of each tree is found, containing substrings represented by all the nodes in the tree and their frequencies.

Example:

This is a complete parse tree, including the top and bottom levels.

Could not load picture

Finally, the characteristic vectors of both trees are subtracted. The difference is the approximation for the edit distance with moves.

For further details about how the algorithm works, please see The String Edit Distance Matching Problem with Moves.